NSF0614652

CSR—SMA: Reliability Modeling and Evaluation of Fault-Tolerant Hierarchical Computer Systems


  • As system complexity grows, reliability modeling and analysis are becoming essential to the design and tuning of fault-tolerant hierarchical computer systems. Considerable research effort has been devoted to this area, but two practical issues have generally been missed or only partially considered in existing computer system reliability models: modular imperfect coverage (MIPC), which results from imperfect recovery mechanisms, and common-cause failures (CCF), which arise from a shared root cause. Failing to model either of them accurately leads to overstated or understated system reliability, which makes reliability analysis less effective in the design and tuning of computer systems. The primary goals of this project are to develop novel reliability models that fully describe MIPC and CCF, and to explore efficient model evaluation methods that lead to more accurate analysis of hierarchical computer system reliability. The project involves three phases: model development, model evaluation, and the development of a reliability analysis software tool that applies the concepts and methods developed through this research. The new models and evaluation methods are fundamental contributions to the body of knowledge on computer system reliability. Research results from this project will support the design of reliable computer systems subject to MIPC and CCF. The PI will disseminate information and knowledge to the academic community and industry through seminars, classroom materials, conference and journal publications, and a project website.

    • Members

      • Dr. Liudong Xing (PI)
      • Akhilesh Shrestha (Ph.D. student): common-cause failure analysis for dynamic systems, performance analysis of methods based on multi-valued decision diagrams
      • Boddu Prashanthi (M.S. student): modular imperfect coverage analysis and formal specification, probabilistic CCF analysis
    • Collaborators

      • Dr. Leila Meshkat, Senior Engineer of Systems and Software Engineering Section, Jet Propulsion Laboratory (JPL)
      • Dr. Wendai Wang, Senior Manager of Reliability and Senior Member of Technical Staff, Applied Materials
      • Dr. Suprasad V. Amari, Senior Reliability Engineer at Relex Software Corporation
    • Related Publications

      • L. Xing, P. Boddu, Y. Sun, and W. Wang, “Reliability Analysis of Static and Dynamic Fault-Tolerant Systems subject to Probabilistic Common-Cause Failures,” Proc IMechE, Part O: Journal of Risk and Reliability, Vol. 224, No. 1, pp. 43-53, 2010.
      • L. Xing and Y. Dai, “A New Decision Diagram Based Method for Efficient Analysis on Multi-State Systems,” IEEE Trans. Dependable and Secure Computing, Vol. 6, No. 3, pp. 161-174, July-Sept. 2009.
      • L. Xing, A. Shrestha, L. Meshkat, and W. Wang, “Incorporating Common-Cause Failures into the Modular Hierarchical Systems Analysis,” IEEE Transactions on Reliability, Vol. 58, No. 1, pp. 10-19, March 2009.
      • L. Xing, P. Boddu, and Y. Sun, “System Reliability Analysis Considering Fatal and Non-Fatal Shocks,” Proc. of The 55th Annual Reliability & Maintainability Symposium, Fort Worth, TX, January 2009.
      • L. Xing and W. Wang, “Probabilistic Common-Cause Failures Analysis,” Proc. of The 54th Annual Reliability & Maintainability Symposium, Las Vegas, Nevada, January 2008.
      • L. Xing and S. V. Amari, “Effective Component Importance Analysis for the Maintenance of Systems with Common-Cause Failures,” International Journal of Reliability, Quality and Safety Engineering, Vol. 14, No. 5, pp. 459-478, October 2007.
      • A. Shrestha, L. Xing, and Y. Dai, “MBDD versus MMDD for Multistate Systems Analysis,” Proc. of The 3rd IEEE International Symposium on Dependable, Autonomic and Secure Computing, pp. 172-180, Columbia, MD, USA, September 2007.
      • P. Boddu and L. Xing, “Incorporating Modular Imperfect Coverage into Dynamic Hierarchical Systems Analysis,” Proc. of The 3rd IEEE International Symposium on Dependable, Autonomic and Secure Computing, pp. 21-28, Columbia, MD, USA, September 2007.
      • A. Shrestha and L. Xing, “Common-Cause Failure Analysis for Dynamic Hierarchical Computer Systems,” Proc. of The IEEE 21st International Conference on Advanced Information Networking and Applications Workshops/Symposia (Frontiers in Networking with Applications), Niagara Falls, Canada, May 21-23, 2007, pp. 166-171.
      • L. Xing, “Efficient Analysis of Systems with Multiple States,” Proc. of The IEEE 21st International Conference on Advanced Information Networking and Applications, Niagara Falls, Canada, May 21-23, 2007, pp. 666-672 (acceptance rate: 29%).
      • L. Xing, L. Meshkat, and S. K. Donohue, “Reliability Analysis of Hierarchical Computer-Based Systems Subject to Common-Cause Failures,” Reliability Engineering and System Safety, Vol. 92, No. 3, pp. 351-359, March 2007 (available online in October 2006; #16 among the journal's Top 25 Hottest Articles (most read) from October to December 2006).
    • Software Tool

Dependable Network Analyzer (DNA): a software tool for reliability analysis of networked computer systems with one-level modular imperfect fault coverage and common-cause failures. You may download the free version of the software and its tutorial for non-commercial use.

Acknowledgment: This site is based upon work supported by the National Science Foundation under Grant No. 0614652. Any opinions, findings, and conclusions or recommendations expressed in this site are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.