




fault-tolerant open distributed systems, component failure, fault tolerance




Computer Engineering


A distributed system consists of autonomous computing modules that interact with each other using messages. Designing distributed systems is more difficult than designing centralized systems for several reasons. Physical separation and the use of heterogeneous computers complicate interprocessor communication, management of resources, synchronization of cooperating activities, and maintenance of consistency among multiple copies of information. The main advantages of distributed systems include increased fault-tolerance capabilities through the inherent redundancy of resources, improved performance by concurrently executing a single task on several computing modules, resource sharing, and the ability to adapt to a changing environment (extensibility). Distributed systems cover a wide range of applications, and recent advances in VLSI devices and networks will further increase their use. As the complexity of these systems increases, so does the probability of component failure, which can adversely affect their performance and usefulness. Thus, reliability, availability, and fault tolerance become important design issues in distributed systems. Fault tolerance is the system's ability to continue executing despite the occurrence of failures. Increasing the reliability and fault tolerance of a system involves a trade-off between the cost of failure (for example, costs incurred by incomplete or incorrect computations) and the cost of incorporating redundancy and recovery mechanisms. Because of their inherent redundancy, distributed systems provide a cost-effective way to apply fault-tolerance techniques. Open distributed systems provide universal connectivity among their components because their designs are based on the standard protocols adopted by the International Organization for Standardization (ISO). In this computing environment, interacting processes communicate through messages that traverse a stack of software layers. Consequently, applying fault-tolerance techniques to execute critical tasks can be costly in terms of execution time.

In this article, we first provide an overview of the main techniques for designing fault-tolerant software and hardware systems. We identify the important features of the building blocks (computers, memories, buses, etc.) that can support an efficient implementation of fault-tolerant open distributed systems (FTODS). Taking into account the features of these building blocks, we propose an organization for FTODS. In FTODS, the algorithms needed for transferring files, for synchronizing the concurrent activities of the computing modules, and for recovery are ISO standard protocols. We propose the use of low-level voting and recovery algorithms that can run as a layer of software above the operating system to make the open distributed system an attractive environment for applying fault-tolerance techniques.
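
The abstract does not specify the voting and recovery algorithms, so the following is only a minimal sketch of the general idea, assuming a simple strict-majority voter over replicated results and a retry-based recovery step. The names `majority_vote` and `run_with_voting`, the use of `None` to mark a failed module, and the retry policy are all illustrative assumptions, not the article's design.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(replies: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of modules, else None.

    Assumption: a reply of None stands for a module that crashed or timed out.
    """
    total = len(replies)
    received = [r for r in replies if r is not None]
    if not received:
        return None
    value, count = Counter(received).most_common(1)[0]
    # Require a strict majority of ALL modules, so that missing replies
    # count against agreement rather than being ignored.
    return value if count > total // 2 else None


def run_with_voting(
    task: Callable[[int], Hashable],
    module_ids: Sequence[int],
    max_rounds: int = 2,
) -> Hashable:
    """Run `task` on every module, vote on the results, and retry on disagreement.

    Hypothetical recovery policy: re-execute the whole round up to
    `max_rounds` times, then give up.
    """
    for _ in range(max_rounds):
        replies = []
        for module_id in module_ids:
            try:
                replies.append(task(module_id))
            except Exception:
                # Treat any exception as a failed module for this round.
                replies.append(None)
        result = majority_vote(replies)
        if result is not None:
            return result
    raise RuntimeError("no majority reached after retries")
```

With three modules, this sketch masks one faulty or silent module per round; a real implementation would also need timeouts and message transport, which are omitted here.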