A multithreaded message passing environment for ATM LAN/WAN

Large scale High Performance Computing and Communication (HPCC) applications (e.g., Video-on-Demand and HPDC) require storage and processing capabilities beyond those of existing single computer systems. Current advances in networking technology (e.g., ATM) have made high performance network computing an attractive environment for such applications. However, a high speed network alone is not sufficient to achieve a high performance distributed computing environment; several hardware and software problems must first be resolved. These problems include the limited communication bandwidth available to the application, the high overhead associated with context switching, redundant data copying during protocol processing, and the lack of support for overlapping computation and communication at the application level. In this paper, we propose a multithreaded message passing system for parallel/distributed processing that we refer to as the NYNET Communication System (NCS). NCS, being developed for NYNET (an ATM wide area network testbed), is built on top of an ATM application programmer interface (API). The multithreaded environment allows applications to overlap computation and communication and provides a modular approach to efficiently support HPDC applications with different quality of service (QOS) requirements.


Introduction
Large scale High Performance Computing and Communication (HPCC) applications (e.g., Video-on-Demand and HPDC) require storage and processing capabilities beyond those of existing single high performance computer systems. Current advances in processor technology and the advent of high speed networking technology such as ATM have made high performance network computing an attractive environment for such applications. The aggregate power of workstation clusters (an Alpha cluster, for example) is comparable to that of a supercomputer [10]. Compared to massively parallel computers, network computing is typically less expensive and more flexible.
Another reason for the recent rapid growth of network computing is the availability of parallel/distributed tools that simplify process management, inter-processor communication, and program debugging in a distributed computing environment. However, these advances cannot be fully exploited unless some hardware and software problems are resolved. These problems can be attributed to the high cost of operating system calls, context switching, and the use of inefficient communication protocols (e.g., TCP/IP). The complexity and inefficiency of these protocols could be justified in the 1970s, when networks were slow (operating in the Kbps range) and not very reliable, while processing power was roughly three orders of magnitude ahead of network speed.
With the advent of high speed networks operating at Mbps and Gbps rates, new methods are needed to process protocols efficiently. Reducing communication latency has been an active research area in the parallel processing field. Most of the proposed techniques are based on using active messages [2], reducing operating system overhead [1], or overlapping computation with communication using multithreading [7]. In distributed computing, most of the research has focused on developing new communication protocols (XTP, FLIP) [11], streamlining existing ones (by merging several layers into one), or building intelligent hardware interfaces to off-load protocol processing from the host.
The main objective of the research presented in this paper is to implement the NYNET Communication System (NCS) based on techniques (e.g., multithreading, reduced data copying and operating system overhead, and parallel data transfer) that have proven successful in both parallel and distributed computing. NCS uses multithreading to provide efficient techniques for overlapping computation and communication. Furthermore, multithreaded message passing represents an interesting distributed programming paradigm for efficiently supporting a wide range of Quality-of-Service (QOS) requirements. NCS uses multiple input/output buffers to allow parallel data transfer and thus reduces transmission and receiving time. NCS is implemented on top of an ATM API and uses its own read/write trap routines to reduce latency and avoid inefficient communication protocols (e.g., TCP/IP).
The rest of the paper is organized as follows. Section 2 describes the experimental environment used in our benchmarking and outlines the specifications of the hardware and software tools used. Sections 3 and 4 describe the design approach and implementation issues of the NCS system, respectively. Section 5 presents performance results from using NCS to implement several distributed computing applications. Section 6 summarizes and concludes the paper.

Experimental Environment
NCS is being developed as part of a larger project involving the development of high performance computing and communication applications for the NYNET (ATM Wide Area Network) testbed shown in Figure 1. NYNET is a high-speed fiber-optic communications network linking multiple computing, communications, and research facilities in New York State. The NYNET ATM testbed uses high speed ATM switches interconnected by fiber-optic SONET (Synchronous Optical Network) links to integrate the parallel computers and supercomputers available at NYNET sites into one virtual computing environment. Most of the wide area portion of NYNET operates at OC-48 (2.4 Gbps), while each site is connected with two OC-3 links (155 Mbps). The upstate-to-downstate connection is through a DS-3 (45 Mbps) link.
In this paper, we report on the performance gain that can be achieved when NCS is used to develop several HPDC applications. These applications include JPEG compression/decompression, the Fast Fourier Transform (FFT), and parallel matrix multiplication. They have been benchmarked on several high performance distributed systems, which are briefly described below.
SUN/ATM LAN/WAN: This configuration consists of SUN SPARCstation IPXs interconnected by an ATM LAN using a FORE ATM switch. The SUN IPX nodes operate at approximately 40 MHz. Host computers are connected to the ATM switch through Fore's SBA-200 SBus adaptors. The SBA-200 has a dedicated Intel i960 processor (running at 25 MHz) to support segmentation and reassembly functions and to manage data transfer between the adaptor and the host computer. The SBA-200 also has special hardware for AAL CRC and special-purpose DMA hardware. A 140 Mbps TAXI interface is provided between the workstations and the ATM switch. The SUN/ATM WAN has similar characteristics to the SUN/ATM LAN except that the IPXs are interconnected through the NYNET testbed.

NCS Design Approach
Our approach to implementing the NYNET Communication System is based on the following strategies.
Simplicity: Most message-passing systems have been built on top of traditional communication protocols (e.g., TCP/IP and UDP/IP), which were developed to run under adverse conditions and over unreliable networks. The advent of reliable, high speed networks makes most of the functions provided by these protocols unnecessary. NCS avoids using complex communication protocols and instead uses the ATM API to implement the required communication services.
Parallel Data Transfer: To reduce the data transfer time for send and receive operations, NCS uses multiple input/output buffers as shown in Figure 2. In this scheme, NCS copies the data to be sent into the first output buffer and then signals the network interface. The network interface starts transferring the data in the first buffer while NCS is filling the second output buffer. A similar technique with multiple input buffers reduces the receiving time. In the traditional data path, the application writes data into the application buffer and then invokes a system call to send the data. The socket layer copies the application buffer into a socket buffer in kernel space. The transport layer (TCP) reads the data in the kernel buffer and modifies it according to the TCP protocol functions. Then, the data is copied out to the network interface. Consequently, the memory bus is accessed five times for each word of transmitted data. Figure 3(b) shows the data path using NCS. The application writes the data into the application buffer, and NCS copies the data from the application buffer into kernel space. System calls are not required because the kernel buffers are made accessible to NCS by mapping them into the NCS address space. The kernel then transfers data from these buffers to the network interface. In this scheme, the system bus is accessed only three times, which reduces the data transfer time. A similar approach is used to receive data from the network.

Overlap communication and computation: Overlapping computation and communication is an important feature in network-based computing. In wide area network (WAN) based distributed computing, the propagation delay (limited by the speed of light) is several orders of magnitude greater than the time it takes to actually transmit the data [5]. For example, transmitting a one Kbyte file across the U.S. at a 1 Gbps transmission rate takes only 8 microseconds. However, the time it takes for the first bit to arrive at its destination (the propagation delay) is 15 milliseconds. Consequently, the transmission time of this file is insignificant compared to the propagation delay, which cannot be avoided. The only viable approach to reduce the impact of propagation delay is to restructure computations so that they overlap with communications. NCS adopts a multithreaded (multiple threads per node) programming paradigm to achieve the desired overlap and thus reduce the impact of propagation delay on HPDC applications. For example, Figure 4 shows how the multithreaded message passing approach can reduce the overall execution time of matrix multiplication. Assume two processors (with one process on each) are involved in the computation. Process 1 is responsible for calculating C0 and C1 (see Figure 4) and process 2 is responsible for calculating C2 and C3. If there are no threads, process 2 has to wait until both A2 and A3 are received before it can start its computation. But if there are two threads per process, then thread 0 of process 2 can start computing C2 as soon as it receives A2, while thread 1 is waiting to receive A3. As can be seen from the figure, this overlap reduces the overall execution time.

Modular: Since the multithreaded approach is modular, it allows us to support a wide range of HPCC Quality of Service (QOS) requirements. Figure 5 shows two applications, a Video-On-Demand (VOD) application and a parallel and distributed application (PDA), each with multiple compute threads. Their QOS requirements (e.g., flow control) differ. As shown in the figure, NCS provides different flow control mechanisms so that the one that best suits a given application can be invoked dynamically at runtime. We believe it is not possible to build a communication system that can efficiently support all applications. There are applications where the interoperability requirement is more important than performance; on the other hand, in real-time parallel and distributed applications performance is essential. NCS provides a framework to address these two conflicting requirements by supporting two classes of applications: Normal Speed Mode (NSM) and High Speed Mode (HSM). Figure 6 shows the two-tier architecture of NCS. The NSM emphasizes interoperability and uses traditional communication systems (e.g., TCP/IP), whereas the HSM uses NCS or other message passing tools ported to NCS, which in turn is built on the ATM API. The message passing filters shown in the figure allow p4, PVM, and other message passing tools' primitives to be mapped to NCS primitives.
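The thread-level overlap described above can be sketched in Python (the actual NCS is a C-level system; `worker`, the queues, and the timings here are purely illustrative stand-ins for NCS threads and message arrival). Each thread blocks on its own receive, so the thread whose block has already arrived computes while the other waits, exactly as in the Figure 4 scenario:

```python
import threading
import queue
import time

def worker(inbox, results, idx):
    """Each thread blocks on its own receive, then computes its block.
    While one thread waits for its data, the other can already run."""
    block = inbox.get()          # blocking receive (stand-in for NCS_recv)
    time.sleep(0.05)             # simulated computation on the block
    results[idx] = [x * 2 for x in block]

def run_node(blocks):
    inboxes = [queue.Queue() for _ in blocks]
    results = [None] * len(blocks)
    threads = [threading.Thread(target=worker, args=(q, results, i))
               for i, q in enumerate(inboxes)]
    for t in threads:
        t.start()
    # Messages arrive one at a time; thread i starts computing as soon
    # as its block arrives, overlapping with the later arrivals.
    for q, block in zip(inboxes, blocks):
        time.sleep(0.05)         # simulated network delay per block
        q.put(block)
    for t in threads:
        t.join()
    return results

print(run_node([[1, 2], [3, 4]]))  # → [[2, 4], [6, 8]]
```

With a single thread per node, the computation could not begin until the last block arrived; here the total time approaches max(communication, computation) rather than their sum.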

NCS services
Parallel and distributed computing tools can be broadly characterized in terms of the following classes [6].
Point-to-Point Communication: send and receive are the basic message passing primitives that provide this type of interprocess communication.
Group Communication: This involves communication among multiple senders and receivers. These operations are further divided into three classes based on the number of senders and receivers: 1-to-many, many-to-1, and many-to-many. For example, the broadcast primitive has the form NCS_bcast(int from_thread, int from_process, identifier *list, char *data, int size), where from_thread and from_process are the sending thread's and sending process's identifiers (the point-to-point primitives similarly use to_thread and to_process for the receiving thread's and receiving process's identifiers). list in NCS_bcast gives the list of thread and process identifiers to which the data is to be sent. data is a pointer to the data to be sent or received. size is the size of the data to be sent or received.
Exception Handling: Exception handling is more difficult for distributed applications, and only a few software tools provide functions that handle it. NCS supports the above classes of functions. Figure 7 gives the syntax of some of the primitives supported by NCS.
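The NCS_bcast semantics can be sketched in-process (hypothetical stand-in only: the real primitive crosses the network; here `mailboxes`, `register`, and `ncs_bcast` are illustrative Python names, with a per-(process, thread) queue standing in for each receiver):

```python
import queue

# Each (process, thread) pair owns a mailbox; broadcast copies the
# message to every mailbox named in the identifier list.
mailboxes = {}

def register(proc, thread):
    mailboxes[(proc, thread)] = queue.Queue()

def ncs_bcast(from_proc, from_thread, id_list, data):
    """1-to-many: deliver (sender id, data) to every listed receiver."""
    for proc, thread in id_list:
        mailboxes[(proc, thread)].put((from_proc, from_thread, data))

register(0, 0); register(1, 0); register(1, 1)
ncs_bcast(0, 0, [(1, 0), (1, 1)], b"hello")
print(mailboxes[(1, 0)].get())  # → (0, 0, b'hello')
```

Many-to-1 and many-to-many patterns reduce to the same mailbox model with multiple senders calling into the same, or overlapping, identifier lists.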

Implementation Issues
Figure 8 shows the main components of the NYNET Communication System and how they are interconnected. The programmer develops an HPDC application using multiple compute threads together with the NCS send, receive, and flow control threads. The compute threads do the actual computation and use the NCS calls NCS_send() and NCS_recv() for communication. These functions wake up the send and receive threads, respectively, and block the calling thread. The send and receive threads do the actual data transfer and, when they are done, wake up the corresponding compute threads. We also use multiple input and output buffers to overlap the data transfer between the kernel buffers and the network interface. These buffers are mapped into the address space of the NCS system so that NCS has direct access to them. The NYNET Communication System can be viewed as two main subsystems.
1. NCS MultiThread Subsystem (NCS MTS), which provides all thread related services.

2. NCS Message Passing Subsystem (NCS MPS), which provides the communication services.

In what follows we describe our approach to implementing each of these subsystems.
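The wake-and-block handshake between a compute thread and the dedicated send thread can be sketched as follows (a minimal Python sketch; `SendChannel`, `ncs_send`, and the event names are illustrative, not NCS's real C interface):

```python
import threading

class SendChannel:
    """Sketch of the NCS pattern: NCS_send wakes a dedicated send thread
    and blocks only the calling compute thread; other threads keep running."""
    def __init__(self):
        self.request = threading.Event()
        self.done = threading.Event()
        self.payload = None
        self.sent = []
        threading.Thread(target=self._send_loop, daemon=True).start()

    def _send_loop(self):
        while True:
            self.request.wait()              # sleep until a compute thread calls
            self.request.clear()
            self.sent.append(self.payload)   # stand-in for the real data transfer
            self.done.set()                  # wake the blocked compute thread

    def ncs_send(self, data):
        self.payload = data
        self.done.clear()
        self.request.set()                   # wake the send thread
        self.done.wait()                     # blocks this thread only

ch = SendChannel()
ch.ncs_send(b"block-0")
print(ch.sent)  # → [b'block-0']
```

Because only the calling thread blocks on `done.wait()`, any sibling compute thread in the same process continues to run during the transfer.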

NCS MTS Implementation
NCS MTS is built on top of the QuickThreads toolkit developed at the University of Washington [4]. QuickThreads is a user-space thread toolkit; that is, the host operating system has no information about the number of threads running or their states. Threads are realized within a conventional process, and thread management is done by the run-time system. The data structures for threads are maintained in shared memory. QuickThreads is not a stand-alone threads package; rather, it is used to build user-level thread packages, and it provides only the capability for thread initialization and context switching. We have added the scheduling and synchronization capabilities to QuickThreads that are needed to implement NCS MTS, and NCS MTS can support several scheduling and synchronization techniques. In NCS MTS, there are N priority levels (the current implementation has N = 16), and within each priority level a round robin scheduling scheme is used. We implemented this scheduling mechanism using doubly linked lists (see Figure 9). In NCS MTS a thread can be in one of three states: blocked, runnable, or running. Blocking can be viewed as the mechanism that synchronizes a thread with some event (e.g., waiting to receive a message). We implemented the blocked queue as a doubly linked list to speed up the search operation when unblocking threads. A thread is unblocked when the event it is waiting for completes, and it is then placed into the runnable queue according to its priority level.
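The scheduling policy just described can be sketched compactly (an illustrative Python model, not the NCS C implementation; `collections.deque` stands in for the doubly linked lists of Figure 9):

```python
from collections import deque

N_LEVELS = 16  # matches the N = 16 priority levels in the text

class Scheduler:
    """Round robin within each priority level, highest level served first.
    A runnable thread joins its level's queue; pick() rotates that queue."""
    def __init__(self):
        self.runnable = [deque() for _ in range(N_LEVELS)]

    def make_runnable(self, thread_id, priority):
        self.runnable[priority].append(thread_id)

    def pick(self):
        for level in range(N_LEVELS - 1, -1, -1):  # highest priority first
            if self.runnable[level]:
                t = self.runnable[level].popleft()
                self.runnable[level].append(t)     # round robin: requeue at tail
                return t
        return None                                # nothing runnable

s = Scheduler()
s.make_runnable("t1", 5)
s.make_runnable("t2", 5)
print([s.pick() for _ in range(3)])  # → ['t1', 't2', 't1']
```

A blocked thread would simply be absent from every runnable queue and re-inserted by `make_runnable` (at its original priority) when its event completes.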
In our implementation we have two classes of threads: system threads and user threads. System threads include the send, receive, flow control, and error control threads. These threads are created during thread environment initialization (NCS_init(flow, error)). NCS_init takes two arguments which allow the programmer to select the flow and error control threads for his/her application. If no argument is provided, the default flow and error control threads are used. User threads include the computation threads and are created by the application itself (using NCS_t_create()). Figure 10 shows a general model of how NCS can be used to develop parallel/distributed applications.
In our current implementation of NCS MTS, we have implemented only the send and receive system threads and use the flow and error control provided by p4 [8].

NCS MPS Implementation
We have considered two approaches for the implementation of NCS MPS. One approach is based on integrating an existing message passing software tool such as p4 [8] with the NCS MTS subsystem. The main objective of this approach is to demonstrate that the multithreaded message passing programming paradigm is a viable approach to developing HPDC applications. Figure 11 shows the integration of NCS MTS with p4. In this case, application programs are written using NCS MTS and the underlying message passing layer (e.g., p4) is hidden from the programmer. We have developed the basic NCS non-blocking communication primitives NCS_send() and NCS_recv() using calls provided by p4, namely p4_messages_available(), p4_send(), and p4_recv(). These primitives are non-blocking in the sense that they block only the thread which calls them, not the whole process. This allows other threads to run and do useful work, thus overlapping computation and communication. NCS_recv is called when it is required to receive a message from another thread, either remote or local. This function wakes up the receive thread and blocks the calling thread; the blocked thread is unblocked by the receive thread when it receives the required message. Meanwhile, other threads can continue their computations. We have seen a significant performance gain from using the multithreaded message passing programming paradigm to develop high performance distributed computing applications, as will be discussed in Section 5. We are currently investigating the performance of implementing NCS using other message passing tools such as PVM [12] and MPI [13]. Our second approach avoids using traditional communication protocols (e.g., TCP/IP) and uses a modified ATM API. Figure 12 shows this implementation approach for NCS MPS. The second approach does not change the implementation of NCS applications that have been developed based on the NCS MTS/p4 implementation.
In this second implementation, we reduce the overhead associated with data-copying operations by allowing NCS to access the kernel buffers without context switching. This is done by using the UNIX system call mmap() to map the kernel buffers into user space. Furthermore, we use traps to transfer control between NCS and the UNIX kernel, and to read from and write to the network interface. The use of traps has been shown to be more efficient than using UNIX read/write system calls [1]. We will also develop message passing filters for the commonly used message passing tools (e.g., p4, PVM, MPI) so that any parallel/distributed application written using these tools can be ported to NCS without change.
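The copy-avoidance idea — write into a buffer that is already mapped into your address space, with no per-message system call — can be sketched with Python's `mmap` module (an anonymous mapping stands in for the kernel network buffer; `ncs_write`/`ncs_read` are illustrative names, not the real NCS trap routines):

```python
import mmap

BUF_SIZE = 4096
# Stand-in for a kernel network buffer that mmap() has made visible to
# user space: reads and writes touch it directly, with no syscall per word.
kernel_buf = mmap.mmap(-1, BUF_SIZE)   # anonymous mapping

def ncs_write(data, offset=0):
    """Copy application data straight into the mapped buffer."""
    kernel_buf.seek(offset)
    kernel_buf.write(data)

def ncs_read(size, offset=0):
    kernel_buf.seek(offset)
    return kernel_buf.read(size)

ncs_write(b"cell payload")
print(ncs_read(12))  # → b'cell payload'
```

In the real system the mapping would be established once at initialization (via mmap() on the network device's buffers), after which per-message traffic bypasses the read/write system-call path entirely.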

Benchmark Results
In this section we evaluate the performance of the first implementation of NCS using three applications. The second implementation was not fully operational when this paper was written. We also compare the performance of the NCS implementation of these applications against that of using only the message passing tool p4 [8].

Matrix Multiplication
We have used a simple distributed matrix multiplication algorithm since our intent is to compare the performance of the NCS implementation with p4. Given matrices A and B, the problem is to compute C = A*B. We have used the host-node programming model. The host process sends the whole B matrix to all the node processes and distributes the rows of the A matrix equally among the nodes. Each node process then calculates its portion of the C matrix and sends the result to the host process. Figure 13 shows an implementation of this algorithm in p4. Figure 14 gives an implementation using NCS, where we assume there are two threads per process. In this implementation thread 0 of the host process communicates with thread 0 of the node processes and calculates one half of the C matrix. Similarly, thread 1 of the host process calculates the other half of the C matrix. Notice that the B matrix is sent to a particular node only once, since all the threads share the same address space on that node.
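The per-node computation can be sketched as follows (an illustrative Python sketch of the row-partitioned scheme; `node_multiply` and `matmul_rows` are hypothetical names). B is shared by all threads, matching the observation that it is sent to each node only once:

```python
import threading

def matmul_rows(A_rows, B):
    # Plain row-by-column multiply for one thread's share of A's rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A_rows]

def node_multiply(A, B, n_threads=2):
    """Each thread computes a contiguous slice of C's rows; B is a single
    shared copy because all threads live in one address space."""
    chunk = len(A) // n_threads
    results = [None] * n_threads

    def run(i):
        results[i] = matmul_rows(A[i * chunk:(i + 1) * chunk], B)

    threads = [threading.Thread(target=run, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [row for part in results for row in part]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(node_multiply(A, B))  # → [[19, 22], [43, 50]]
```

In the real system each thread would first block in NCS_recv for its rows of A, so a node starts multiplying as soon as the first slice arrives.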
Table 1 shows the performance gained by using NCS MTS/p4. The execution time for one node is approximately equal for both p4 and NCS MTS/p4, which is expected; the small difference is because the NCS MTS/p4 implementation has the overhead of maintaining threads. In general, the NCS MTS/p4 implementation outperforms the p4 implementation because of the overlap of computation and communication. The execution times on the NYNET testbed are better for two reasons: the computers connected to the ATM network are faster machines, and the ATM network operates at a higher speed than Ethernet.

JPEG Compression/Decompression
JPEG (Joint Photographic Experts Group) is emerging as a standard for image compression. The JPEG standard aims to be generic and can support a wide variety of applications for continuous-tone images. We have used the data parallel programming model to implement a distributed JPEG algorithm on a cluster of workstations. In this implementation half of the computers participate in compressing an image file while the other half reconstruct the compressed image. The image to be compressed is divided into N/2 equal parts (where N denotes the number of processors) by the master process and then shipped to one half of the processors. Each processor performs the sequential JPEG compression algorithm on its portion of the image. After compression, the processors send the compressed image to the other set of N/2 processors, which perform the decompression. Once decompression is done, the results are sent back to the master process, which combines them into one image. Consequently, this algorithm involves five stages: distribution of the uncompressed image, compression of the image, transmission of the compressed image, decompression of the image, and finally combining the decompressed images.
In the multithreaded environment we have two computation threads running on each processor, so if one thread is blocked for communication, the other thread can run and perform useful work. The general communication pattern between threads on different processors is shown in Figure 15. Here, the N/2 left processors compress the image whereas the N/2 right processors decompress it. Figure 16 shows the state (computation, communication, or idle) of each processor during the application execution for two cases:
1. One thread per node (this represents a pure message passing implementation).
2. A multithreaded implementation with two threads per node.
Pseudo code implementations of the host and node programs using NCS are shown in Figures 17 and 18, respectively. In this implementation thread 1 works on the first half of the image while thread 2 works on the other half. Each thread sends its compressed image to the corresponding thread on the remote processor for decompression.
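The compress-ship-decompress flow can be sketched with zlib standing in for the JPEG codec (an illustrative assumption: zlib is lossless while JPEG is lossy, and `pipeline` is a hypothetical name; the point is only the N/2-plus-N/2 split and the per-slice worker threads):

```python
import threading
import zlib

def pipeline(image: bytes, n: int):
    """Five-stage sketch: split into n/2 slices, compress each on a 'left'
    worker, ship to a paired 'right' worker for decompression, reassemble."""
    k = n // 2
    size = len(image) // k
    parts = [image[i * size:(i + 1) * size] for i in range(k - 1)]
    parts.append(image[(k - 1) * size:])        # last slice takes the remainder
    out = [None] * k

    def compress_then_ship(i):
        compressed = zlib.compress(parts[i])    # "left" processor compresses
        out[i] = zlib.decompress(compressed)    # paired "right" processor decompresses

    threads = [threading.Thread(target=compress_then_ship, args=(i,))
               for i in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(out)                        # master recombines the image

img = bytes(range(256)) * 4
print(pipeline(img, 4) == img)  # → True
```

With two such worker threads per node, a node blocked shipping one compressed slice can keep compressing the other, which is exactly the overlap measured in Table 2.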
Table 2 compares the performance of the NCS MTS/p4 implementation against the p4 implementation for a 600 Kbyte image on two distributed computing systems. It also shows the percentage improvement of the NCS MTS/p4 implementation over the p4 implementation, calculated as the difference in execution times over the p4 execution time. The improvement in performance is consistent with that obtained from the previous application. For example, for 4 nodes working on JPEG compression/decompression, the performance gain of the NCS MTS/p4 implementation over the p4 implementation is around 42% on Ethernet and 60% on the NYNET testbed.

FFT on N workstations using p4
Suppose we have N workstations on the network and M = N * 2^n (n >= 1); then the DIF (decimation in frequency) algorithm for the FFT can be mapped onto the network of workstations. A case of M = 8 and N = 2 is shown in Figure 19. Each of the small circles shown in the figure represents a computation which takes two sample inputs and gives two outputs. If the sample size is M, then there are M/2 rows of computation, and each node takes M/(2N) rows. The lines crossing the bold lines represent interprocessor communication. There are log2(M) computation steps and log2(N) communication steps. The algorithm has been implemented using the host-node programming model. The host process distributes the sample inputs to the node processes and collects the results from them. All the work is done by the node processes, which communicate among themselves during the computation.
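The butterfly computation performed at each circle of Figure 19 can be sketched as a sequential radix-2 DIF FFT (an illustrative Python sketch of the numerical kernel only; the distribution of rows across workstations is omitted, and `dif_fft`/`dft` are hypothetical names):

```python
import cmath

def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT (len(x) must be a power of two).
    Butterflies run largest span first; DIF leaves the output in
    bit-reversed order, so a final permutation restores natural order."""
    x = list(x)
    n = len(x)
    span = n
    while span > 1:
        half = span // 2
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)   # twiddle factor
                a, b = x[start + k], x[start + k + half]
                x[start + k] = a + b                       # top butterfly output
                x[start + k + half] = (a - b) * w          # bottom output
        span //= 2
    bits = n.bit_length() - 1
    return [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]

def dft(x):
    """Direct O(n^2) DFT, used as a correctness reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

a = [1, 2, 3, 4, 0, 0, 0, 0]
print(all(abs(u - v) < 1e-9 for u, v in zip(dif_fft(a), dft(a))))  # → True
```

In the distributed version, the outer `while` loop's first log2(N) passes are the ones whose butterflies pair samples held on different workstations, which is where the interprocessor communication of Figure 19 occurs.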

FFT on N workstations using p4 and NCS MPS
In this case, there are multiple threads per process. We assume that there are two threads per node process and that the host process has only one thread. Thus if we have N workstations, then there are 2N threads working on the problem. Figure 20 shows how the computation proceeds using this approach for a case of M = 8 and N = 2. Each thread gets M/(4N) rows of computation. The host process does the same job as before: it distributes the input points among the different threads equally and collects the results from these threads. The algorithm for the two threads of a node process is given in Figure 21. There are log2(M) computation steps and log2(2N) communication steps. Note that the last communication step is local among threads and does not involve remote communication.
The advantage of this algorithm is that when thread 0 of a node is waiting to receive data, thread 1 can continue its computation. This overlap of communication and computation improves performance, as shown in Table 3. In this implementation, we chose the number of sample points M = 512 and used 8 sample sets. As expected, for one node the execution times for both p4 and NCS MTS/p4 are approximately equal. The NCS MTS/p4 implementation performs better than the p4 implementation as the number of nodes increases. For example, for 4 nodes the performance gain of the NCS MTS/p4 implementation over the p4 implementation is 5.7% on Ethernet and 10.66% on the NYNET testbed.

Conclusion
In this paper we presented a multithreaded message passing environment for parallel/distributed computing over ATM LAN/WAN, built on top of an ATM Application Programmer Interface (API). We implemented the multithreaded subsystem and integrated it with the p4 parallel/distributed tool. From the benchmark results presented, it is clear that multithreaded message passing is a powerful distributed computing paradigm. This programming paradigm allows the user to reduce the impact of propagation delay on application performance and to efficiently support a wide range of HPDC applications with different QOS requirements.
We are also investigating the performance of the NCS MTS/p4 implementation when p4 is replaced by PVM or MPI, and we are currently implementing the second approach to the NCS implementation. Once this implementation is complete, we believe that NCS applications will run at much higher speed than can be obtained using existing parallel/distributed software tools.

Figure 4: Overlap of Computation and Communication

Figure 10: Generic model for application programs

Figure 12: NCS MPS implementation using ATM API

Figure 14: Matrix Multiplication in Multithread Message Passing environment

Figure 20: DIF FFT on NCS MTS/p4 environment

SUN/Ethernet LAN: This configuration consists of SUN SPARCstation ELCs interconnected by a traditional Ethernet LAN. The ELCs operate at a clock rate of approximately 33 MHz.

Table 1: Execution times of Matrix Multiplication (seconds)

Table 3: Execution times of FFT (seconds)