Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)


Electrical Engineering and Computer Science


Jae C. Oh


Distributed Fault Mitigation, Distributed Real-time System, Fault-tolerance, Multi-agent System, On-line Machine Learning, Rare Event Prediction

Subject Categories

Computer Sciences


In a large-scale real-time distributed system, the large number of components and the time criticality of tasks can contribute to complex situations. Providing predictable and reliable service is of paramount interest in such a system. For example, a single-point failure in an electric grid system may lead to a widespread power outage like the Northeast Blackout of 2003. System design and implementation address fault avoidance and mitigation. However, not all faults and failures can be removed during these phases, so run-time fault avoidance and mitigation are needed during operation. Timing constraints and predictability of system behavior are also important concerns in a large-scale system. This dissertation proposes several distributed fault-tolerance mechanisms that use multi-agent technologies to predict and mitigate faults of various frequencies and severities. Some faults are observed frequently over time and some are not. In general, frequent fault types cause relatively less severe consequences. Rare faults, however, are extremely difficult to predict, yet their consequences can be catastrophic. A rare fault, often indicated by repeated occurrences of common faults, causes severe harm. In our preliminary study, we design distributed rational agents that use a probabilistic prediction mechanism to discover faults in the CMS experiments at CERN. All fault-mitigating activities of the agents and application tasks are guaranteed by an urgency-based priority scheduling policy with multiple steps of feasibility tests. The experiment shows that the distributed approach provides 15% more system availability than centralized approaches. This dissertation also explores the problem of predicting rare events. Many adaptive fault-tolerant mechanisms attempt to predict faults by learning from data. However, training such a system requires a significant amount of training data, which is not easily available for rare fault events.
We use the PNNL (Pacific Northwest National Laboratory) system failure data collected from about 1,000 nodes over 4 years. We find that the severity of observed fault events is power-law distributed and that there are certain associations among these events. Based on the power-law observation, we generate training data for the machine learning algorithm developed in this dissertation. The algorithm incorporates the power-law distribution principle, Bayesian inference, and logistic regression to predict rare events as well as common ones. Logistic regression is used to predict the probability of each type of event, and Bayesian inference is used to find associations among events. The new learning algorithm is deployed with fully distributed agents using a rational decision model. The simulation study based on the PNNL data shows that the new prediction algorithm provides 15% better system availability than prediction using the simple update method from our preliminary study, and that it reduces the system loss caused by rare faults by more than a factor of 10. Finally, we developed a comprehensive simulation library, named SWARM-eTOSSIM, for cyber-physical systems research. The library provides a framework suitable for simulating power-aware real-time distributed networked systems, with powerful simulation controls and a graphical interface. We downsized the new fault-mitigation mechanism so that it can be ported to devices with limited resources, such as sensor network elements.
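To illustrate how training data might be generated from a power-law observation, the following is a minimal sketch and not the dissertation's actual procedure. It draws synthetic fault severities from a continuous power law by inverse-transform sampling, then recovers the exponent with the standard maximum-likelihood estimator. The exponent 2.5, the minimum severity 1.0, the sample size, and the function names are all assumptions chosen for illustration; the abstract does not report these values.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n values from p(x) proportional to x^(-alpha), x >= x_min.

    Uses inverse-transform sampling on the CDF
    F(x) = 1 - (x / x_min)^(-(alpha - 1)).
    """
    # 1 - rng.random() lies in (0, 1], so the base is never zero.
    return [
        x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
        for _ in range(n)
    ]


def estimate_alpha(samples, x_min):
    """Maximum-likelihood estimate of the exponent of a continuous power law."""
    log_sum = sum(math.log(x / x_min) for x in samples)
    return 1.0 + len(samples) / log_sum


# Hypothetical parameters: not taken from the PNNL data.
rng = random.Random(0)
severities = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = estimate_alpha(severities, x_min=1.0)

# Most synthetic events are mild; a small tail is severe (the "rare" events).
rare_fraction = sum(1 for s in severities if s > 100.0) / len(severities)
print(f"estimated exponent: {alpha_hat:.3f}")
print(f"fraction with severity > 100: {rare_fraction:.4f}")
```

Under these assumptions the heavy tail produces a small number of very severe events alongside many mild ones, which is the kind of imbalance that makes rare faults hard to learn from observed data alone.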