Fault-tolerant Load Balancing, Multicomputer Systems, Network Partitioning, Distance-Transitive Graphs, Performance Evaluation, Task Scheduling
This paper presents a fault-tolerant scheme applicable to any decentralized load balancing algorithms used in soft real-time distributed systems. Using the theory of distance-transitive graphs for representing topologies of these systems, the proposed strategy partitions these systems into independent symmetric regions (spheres) centered at some control points. These central points, called fault-control points, provide a two-level task redundancy and efficiently re-distribute the load of failed nodes within their spheres. Using the algebraic characteristics of these topologies, it is shown that the identification of spheres and fault-control points is, in general, is an NP-complete problem. An efficient solution for this problem is presented by making an exclusive use of a combinatorial structure known as the Hadamard matrix. Assuming a realistic failure-repair system environment, the performance of the proposed strategy has been evaluated and compared with no fault environment, through an extensive and detailed simulation. For our fault-tolerant strategy, we propose two measures of goodness, namely, the percentage of re-scheduled tasks which meet their deadlines and the overhead incurred for fault management. It is shown that using the proposed strategy, up to 80% of the tasks can still meet their deadlines. The proposed strategy is general enough to be applicable to many networks, belonging to a number of families of distance transitive graphs. Through simulation, we have analyzed the sensitivity of this strategy to various system parameters and have shown that the performance degradation due to failures does not depend on these parameter. Also, the probability of a task being lost altogether due to multiple failures has been shown to be extremely low.
Ghafoor, Arif and Ahmad, Ishfaq, "Fault-Tolerant Load Management for Real-Time Distributed Computer Systems" (1991). Electrical Engineering and Computer Science - Technical Reports. 108.