"Fault-Tolerant Load Management for Real-Time Distributed Computer Systems" by Arif Ghafoor and Ishfaq Ahmad

Electrical Engineering and Computer Science - Technical Reports

Title

Fault-Tolerant Load Management for Real-Time Distributed Computer Systems

Author(s)/Creator(s)

Arif Ghafoor
Ishfaq Ahmad, Syracuse University, School of Computer and Information Science

Document Type

Report

Date

7-1991

Keywords

Fault-tolerant Load Balancing, Multicomputer Systems, Network Partitioning, Distance-Transitive Graphs, Performance Evaluation, Task Scheduling

Language

English

Disciplines

Computer Sciences

Description/Abstract

This paper presents a fault-tolerant scheme applicable to any decentralized load-balancing algorithm used in soft real-time distributed systems. Using the theory of distance-transitive graphs to represent the topologies of these systems, the proposed strategy partitions each system into independent symmetric regions (spheres) centered at control points. These central points, called fault-control points, provide two-level task redundancy and efficiently redistribute the load of failed nodes within their spheres. Using the algebraic characteristics of these topologies, it is shown that identifying the spheres and fault-control points is, in general, an NP-complete problem. An efficient solution to this problem is presented that relies exclusively on a combinatorial structure known as the Hadamard matrix. Assuming a realistic failure-repair system environment, the performance of the proposed strategy has been evaluated through extensive and detailed simulation and compared with that of a fault-free environment. For our fault-tolerant strategy, we propose two measures of goodness: the percentage of rescheduled tasks that meet their deadlines, and the overhead incurred for fault management. It is shown that, using the proposed strategy, up to 80% of the tasks can still meet their deadlines. The proposed strategy is general enough to be applicable to many networks belonging to a number of families of distance-transitive graphs. Through simulation, we have analyzed the sensitivity of this strategy to various system parameters and have shown that the performance degradation due to failures does not depend on these parameters. Also, the probability of a task being lost altogether due to multiple failures has been shown to be extremely low.
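
The abstract does not spell out how the Hadamard matrix is used, so the following is only a rough illustration, not the report's construction. It assumes a binary hypercube (one family of distance-transitive graphs) and the Sylvester construction of a Hadamard matrix; both choices, and the function names, are hypothetical. The sketch shows the property that makes Hadamard rows attractive as candidate sphere centers: read as node labels, any two rows of an order-n matrix differ in exactly n/2 positions, so the candidate centers are evenly spread.

```python
from itertools import combinations


def sylvester_hadamard(k):
    """Return the 2**k x 2**k Sylvester Hadamard matrix as lists of +1/-1."""
    h = [[1]]
    for _ in range(k):
        # Doubling step: H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rows_to_labels(h):
    """Map +1 -> 0 and -1 -> 1 so each row becomes a hypercube node label."""
    return [tuple(0 if x == 1 else 1 for x in row) for row in h]


def hamming(a, b):
    """Number of positions in which two labels differ."""
    return sum(x != y for x, y in zip(a, b))


H = sylvester_hadamard(3)        # order-8 Hadamard matrix
centers = rows_to_labels(H)      # 8 candidate centers in the 8-dimensional cube
distances = {hamming(a, b) for a, b in combinations(centers, 2)}
print(distances)  # {4}
```

Because the rows are pairwise orthogonal, every pair of candidate centers sits at Hamming distance 4 in this example. Whether the report selects its fault-control points in this way is not stated in the abstract.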

Additional Information

School of Computer and Information Science, Syracuse University, SU-CIS-91-

Recommended Citation

Ghafoor, Arif and Ahmad, Ishfaq, "Fault-Tolerant Load Management for Real-Time Distributed Computer Systems" (1991). Electrical Engineering and Computer Science - Technical Reports. 108.
https://surface.syr.edu/eecs_techreports/108

Source

local

Included in

Computer Sciences Commons
