Scheduling regular and irregular communication patterns on the CM-5

The authors study the communication characteristics of the CM-5 (Connection Machine 5) and the performance effects of scheduling regular and irregular communication patterns on the CM-5. They consider the scheduling of regular communication patterns such as complete exchange and broadcast. They have implemented four algorithms for complete exchange and studied their performances on a 2-D FFT (fast Fourier transform) algorithm. They have also implemented four algorithms for scheduling irregular communication patterns and studied their performance on the communication patterns of several synthetic as well as real problems such as the conjugate gradient solver and the Euler solver.


Introduction
The performance of a distributed memory computer depends to a large extent on how fast interprocessor communication can be performed. Despite significant improvements in design, scalability and the underlying technology of parallel computers, the improvements in communication time have lagged far behind those in the computation power of each node. It is still two or more orders of magnitude more expensive to access a remote datum than to access a local datum.
This paper presents an experimental study of the communication capabilities of the Connection Machine 5 (CM-5) and the problem of scheduling regular and irregular communication patterns on the CM-5. Similar studies have been performed for other parallel machines such as the Intel iPSC/2 [3], Symult 2010 [5], Intel iPSC/860 [1, 2] and CM-2 [15]. We have done this study taking into account the fact that the current version of the CM-5 software supports only synchronous communication.
Section 2 presents some details of the CM-5 architecture. The problem of scheduling regular communication patterns is considered in Section 3. We have implemented four algorithms for complete exchange and two for broadcast and studied their performance for various message and machine sizes. Section 4 presents four algorithms for scheduling irregular communication patterns. We have studied the performance of these algorithms on several synthetic as well as real problems such as the conjugate gradient solver and the Euler solver [12]. Finally, conclusions are presented in Section 5.

The CM-5 Architecture
The CM-5 is a scalable distributed memory multiprocessor system [6]. It can be scaled up to 16K processors. It supports both SIMD and MIMD programming models. Each node on the CM-5 is a SPARC processor which can operate at a peak speed of 32 MIPS and has four optional vector processors. Thus, each node is capable of a peak of 128 MFLOPS. The nodes can be organized into a single partition or multiple partitions. Each partition has a manager which governs the allocation of parallel resources.
The CM-5 has two internal networks that support interprocessor communication: (1) a control network and (2) a data network. The control network supports operations that require global communication such as global reduction operations, parallel prefix operations and processor synchronization. It has a latency of about 2 to 5 microseconds. The data network supports point-to-point communication. The data network topology is a fat tree as shown in Figure 1.
Both the data and control networks have a peak bandwidth of 20 MBytes/sec. The maximum bandwidth is obtained when communication takes place among nodes in the same cluster of four processors. A data message is broken into a collection of packets. The packet size is 20 bytes, of which 16 bytes are for user data and the remaining 4 bytes contain control information such as destination and size. The CM-5 router employs a random routing scheme, and therefore the packets may be received in random order. The data network guarantees a system-wide minimum bandwidth of 5 MBytes/sec no matter where the data is being sent in the system. The data network has a communication latency (the time to send a 0-byte message) of 88 microseconds. We used CMMD library functions to do all our experiments. A detailed discussion of interprocessor communication overhead on the CM-5 can be found in [14, 4]. Further details of the CM-5 architecture can be found in [6].
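The packet figures above imply a fixed framing overhead per message. The following is a minimal sketch of that arithmetic, using only the 20-byte packet and 16-byte payload sizes quoted in the text. The function names are hypothetical, and the treatment of a 0-byte message (which may still occupy a packet on the real machine) is an assumption.

```python
import math

PACKET_BYTES = 20   # total packet size quoted in the text
PAYLOAD_BYTES = 16  # user data carried by each packet, quoted in the text


def packets_needed(message_bytes: int) -> int:
    """Number of packets needed to carry message_bytes of user data.

    Assumption: a 0-byte message is counted as 0 packets here; the text
    does not say whether the hardware still sends a packet in that case.
    """
    return math.ceil(message_bytes / PAYLOAD_BYTES)


def bytes_on_wire(message_bytes: int) -> int:
    """Total bytes injected into the data network, including control bytes."""
    return packets_needed(message_bytes) * PACKET_BYTES


if __name__ == "__main__":
    for size in (16, 17, 256, 2048):
        print(size, packets_needed(size), bytes_on_wire(size))
```

Under these figures a 256-byte message occupies 16 packets and 320 bytes on the wire, so at most 80% of the transmitted bytes are user data.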

Scheduling Regular Communication Patterns
A regular communication pattern is one in which the pattern of data access is regular and can be detected at compile time; examples include shift, complete exchange and broadcast.
The complete exchange (all-to-all personalized) communication pattern is commonly encountered in computations such as matrix transpose and two-dimensional FFT [2, 10]. Scheduling regular communication patterns on hypercubes can be done using the CrOS III communication system described in [7]. In this section we study the behavior of four algorithms for complete exchange on the CM-5.

Linear Exchange (LEX)
This is the simplest of the four algorithms. For an N processor system, there are N steps in the algorithm. In step i, 0 ≤ i < N, processor i receives messages from every processor except itself. The entry i ← j in Table 1 indicates that processor i receives a message from processor j. The current version of the CM-5 supports only synchronous communication. Since at each step all processors send messages to a particular processor i, synchronous communication will adversely affect the performance. If asynchronous (or non-blocking) communication is allowed, processors need not wait for their messages to be received in step i in order to proceed to step i + 1.
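The linear exchange schedule described above can be generated mechanically. The sketch below is illustrative only: the function name and the (receiver, sender) tuple layout are assumptions, not part of the original implementation.

```python
def linear_exchange_schedule(nprocs):
    """Return one list per step; step i holds (receiver, sender) pairs.

    In step i, processor i receives from every other processor, so all
    senders target the same node in that step.
    """
    return [[(i, j) for j in range(nprocs) if j != i] for i in range(nprocs)]


if __name__ == "__main__":
    for step, pairs in enumerate(linear_exchange_schedule(4)):
        print("step", step, pairs)
```

Every pair in a given step shares the same receiver, which is why a synchronous (blocking) send serializes the senders.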

Pairwise Exchange (PEX)
The Pairwise Exchange algorithm is shown in Figure 2. There are N - 1 steps in an N processor system. The communication schedule for this algorithm is as follows. At step i, 1 ≤ i ≤ N - 1, each processor exchanges a message with another processor determined by taking the exclusive-or of its processor number with i. Therefore, this algorithm has the property that the entire communication pattern is decomposed into a sequence of pairwise exchanges. The communication schedule of the pairwise exchange algorithm for 8 processors is given in Table 2. The entry i ↔ j in the table indicates that processors i and j exchange messages.
The PEX algorithm is better than LEX in terms of utilizing the bandwidth of the network and reducing ...
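The exclusive-or rule for pairwise exchange can be sketched as follows. This is a schedule generator only, not the authors' CMMD code; the function name is hypothetical and the number of processors is assumed to be a power of two.

```python
def pairwise_exchange_schedule(nprocs):
    """Return N - 1 steps; each step is a list of (p, q) exchange pairs.

    At step i, processor p is paired with p XOR i. Each pair is listed
    once, with the smaller processor number first.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    steps = []
    for i in range(1, nprocs):
        pairs = [(p, p ^ i) for p in range(nprocs) if p < (p ^ i)]
        steps.append(pairs)
    return steps


if __name__ == "__main__":
    for i, pairs in enumerate(pairwise_exchange_schedule(8), start=1):
        print("step", i, pairs)
```

Because XOR with a fixed nonzero i is its own inverse, every processor has exactly one partner per step, and every pair of processors meets in exactly one step.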

Recursive Exchange (REX)
The Recursive Exchange algorithm is a lg N step algorithm for a system with N processors. Each message is of size nN/2 for an exchange involving n bytes per processor. The algorithm is shown in Figure 3. The communication schedule of the REX algorithm for 8 processors is given in Table 3.
Although this algorithm takes fewer steps than the other two algorithms, the amount of data transmitted in each step is much higher. Since it is a store-and-forward algorithm, each step incurs the additional overhead of reshuffling data [10].
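The store-and-forward behaviour can be illustrated with a small simulation. The sketch below uses the standard recursive-halving scheme on processor-number bits, which is an assumption about the details of Figure 3; blocks are represented as (source, destination) tags rather than real data, and all names are hypothetical.

```python
def recursive_exchange(nprocs):
    """Simulate a lg N step complete exchange on a power-of-two machine.

    Returns (held, sent): held[p] is the list of (source, destination)
    blocks on processor p at the end; sent[k][p] is the number of blocks
    processor p forwarded in step k.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    lg = nprocs.bit_length() - 1
    held = [[(p, d) for d in range(nprocs)] for p in range(nprocs)]
    sent = []
    for k in range(lg):
        bit = 1 << k
        outgoing = []
        for p in range(nprocs):
            # Forward every block whose destination differs from p in bit k;
            # this selection is the "reshuffling" done before each send.
            send = [b for b in held[p] if (b[1] & bit) != (p & bit)]
            held[p] = [b for b in held[p] if (b[1] & bit) == (p & bit)]
            outgoing.append(send)
        for p in range(nprocs):
            held[p ^ bit].extend(outgoing[p])
        sent.append([len(s) for s in outgoing])
    return held, sent


if __name__ == "__main__":
    held, sent = recursive_exchange(8)
    print(sent)              # N/2 blocks per processor in every step
    print(sorted(held[3]))   # all blocks destined for processor 3
```

In this scheme each processor forwards N/2 blocks per step, which matches the nN/2 message size stated above when each block holds n bytes.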

Balanced Exchange (BEX)
In the pairwise exchange algorithm, the communication schedule is such that in the first four steps, all processors in a cluster of four processors communicate with each other. That is, in the first four steps, all the communication is between nearest neighbors. In the next four steps, all processors in a cluster of four communicate with processors in a neighboring cluster and so on. Since all four processors try to do so simultaneously, there is contention. Instead of having a communication schedule in which all processors first communicate within a cluster and then all communicate with some remote cluster, one can have a more balanced schedule in which at every step two processors in a cluster communicate with each other and two communicate with processors in a remote cluster. This balances the amount of local and remote communication, so that all processors do not try to simultaneously communicate over a long distance. We call this the balanced exchange algorithm (BEX). The BEX algorithm is particularly suitable for the CM-5 fat-tree architecture because contention at the root of the tree is reduced. Unlike the pairwise exchange algorithm, in this algorithm messages passing through the root of the fat-tree are optimally distributed across the steps of the algorithm.
Figure 3: Recursive Exchange Algorithm
Such a balanced exchange algorithm (BEX) can be obtained by a simple modification of the pairwise exchange algorithm as shown in Figure 4. For the purpose of determining the communicating pairs of processors, we define a mapping between the physical number of a processor and its virtual number as
virtual = (physical + 1) mod N,
where N is the total number of processors in the system.
With this mapping, if we apply the pairwise exchange algorithm using the virtual processor numbers, we get the communication schedule shown in Table 4, which is balanced with respect to local and remote communications. In an N processor system (N mod 16 = 0), 3N/4 × N/2 exchange pairs (global exchanges) use the root of the tree to perform complete exchange. The PEX algorithm schedules complete exchange in ...
Figure 4 (fragment): virtual = (mynumber + 1) MOD nprocs
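The effect of the virtual numbering can be explored with a short sketch. It assumes the mapping virtual = (physical + 1) mod N taken from the surviving Figure 4 fragment, a power-of-two N, and clusters formed by four consecutive physical processor numbers; the function names are hypothetical.

```python
CLUSTER = 4  # assumed: a cluster is four consecutive physical processor numbers


def balanced_exchange_schedule(nprocs):
    """Pairwise exchange applied to virtual numbers, reported in physical numbers."""
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    steps = []
    for j in range(1, nprocs):
        pairs = []
        for p in range(nprocs):
            v = (p + 1) % nprocs          # physical -> virtual
            q = ((v ^ j) - 1) % nprocs    # partner's virtual -> physical
            if p < q:
                pairs.append((p, q))
        steps.append(pairs)
    return steps


def local_pairs(pairs):
    """Count the pairs whose two processors lie in the same cluster."""
    return sum(1 for p, q in pairs if p // CLUSTER == q // CLUSTER)


if __name__ == "__main__":
    for j, pairs in enumerate(balanced_exchange_schedule(16), start=1):
        print("step", j, "local:", local_pairs(pairs), "remote:", len(pairs) - local_pairs(pairs))
```

Printing the local and remote counts per step for this schedule and for plain pairwise exchange gives a quick way to compare how the two spread remote traffic.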

Performance of the Complete Exchange Algorithms
Figure 5 compares the communication time of the four exchange algorithms on a 32 node CM-5. The message size was varied between 0 and 2048 bytes. Due to the synchronous communication constraint, the LEX algorithm performs much worse than the other algorithms. Therefore we did not consider it for any further analysis. For small message sizes, the performance of PEX, REX and BEX is virtually indistinguishable on this scale. However, for large message sizes, PEX performs much better than REX, and BEX performs better than PEX. This is because of the following two reasons. First, even though the number of steps in REX is only lg N, as compared to N - 1 steps in PEX, the message size in REX remains constant at nN/2, whereas the size of each message in PEX is n. Second, each node needs to buffer and reshuffle data in REX so that appropriate data can be sent to the appropriate node. These two overheads outweigh the savings in the number of communication steps. BEX performs the best because it balances local and remote communication at each step.
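The trade-off between step count and message size can be made concrete with a little arithmetic. The sketch below only counts steps and bytes from the formulas in the text (N - 1 steps of n bytes for PEX, lg N steps of nN/2 bytes for REX); it is not a timing model, and the function names are hypothetical.

```python
def pex_volume(nprocs, n):
    """Steps, bytes per message and total bytes sent per processor for PEX."""
    steps = nprocs - 1
    return {"steps": steps, "bytes_per_message": n, "bytes_sent": steps * n}


def rex_volume(nprocs, n):
    """Same quantities for REX, assuming nprocs is a power of two."""
    steps = nprocs.bit_length() - 1        # lg N
    per_message = n * nprocs // 2          # nN/2
    return {"steps": steps, "bytes_per_message": per_message,
            "bytes_sent": steps * per_message}


if __name__ == "__main__":
    print(pex_volume(32, 2048))
    print(rex_volume(32, 2048))
```

For 32 processors and n = 2048 bytes, PEX sends 31 messages of 2048 bytes (63488 bytes in total) per processor, while REX sends 5 messages of 32768 bytes (163840 bytes in total).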
We selected a few message sizes in different ranges, and collected the communication times for several machine sizes. Figures 6, 7 and 8 show the communication times on up to 256 processors for algorithms REX, PEX and BEX. Figure 6 shows times for messages of size 0 bytes and 256 bytes, Figure 7 shows times for messages of size 512 bytes and Figure 8 shows times for messages of size 1920 bytes.
Clearly, for messages of size 0 bytes, REX performs better than PEX and BEX for all multiprocessor sizes because there is no data shuffling involved and it has only lg N exchanges compared to N - 1 exchanges in PEX and BEX. For messages of size 256 bytes, PEX performs better than REX for small multiprocessor sizes because the overheads of message size and number of steps dominate for REX. As the number of processors increases, the overhead of the larger number of messages dominates the overhead of larger message size and reshuffling in REX, and therefore REX performs better. BEX performs the best for messages of size 256 bytes. For message sizes of 512 and 1920 bytes and small multiprocessor sizes, BEX and PEX perform better than REX. But for large multiprocessor sizes, REX performs the best.
We implemented a 2D FFT algorithm using these complete exchange algorithms. The 2D array is distributed along rows among processors. Each processor initially performs a 1D FFT on its local data and then performs a complete exchange using any one of the algorithms described. Each processor then performs a 1D FFT on the new data. The performance of this 2D FFT on various sizes of data is shown in Table 5.
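The row-transform, exchange, row-transform structure can be checked on a single process. In the sketch below a naive DFT stands in for the 1D FFT and an in-memory transpose stands in for the complete exchange; both substitutions, and all function names, are assumptions made for illustration.

```python
import cmath


def dft(row):
    """Naive 1-D discrete Fourier transform (stand-in for a 1-D FFT)."""
    n = len(row)
    return [sum(row[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def transpose(m):
    """Stand-in for the complete exchange between row-distributed processors."""
    return [list(col) for col in zip(*m)]


def fft2_by_transpose(a):
    """Row transforms, transpose, row transforms. The result is the
    transposed 2-D spectrum, since no second exchange is performed."""
    rows = [dft(r) for r in a]
    return [dft(r) for r in transpose(rows)]


def dft2_direct(a):
    """Direct 2-D DFT from the definition, used only as a reference."""
    r, c = len(a), len(a[0])
    return [[sum(a[n1][n2] * cmath.exp(-2j * cmath.pi * (n1 * k1 / r + n2 * k2 / c))
                 for n1 in range(r) for n2 in range(c))
             for k2 in range(c)] for k1 in range(r)]


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [0, 1, 0, 1], [2, 0, 2, 0]]
    z, x = fft2_by_transpose(a), dft2_direct(a)
    print(max(abs(z[k2][k1] - x[k1][k2]) for k1 in range(4) for k2 in range(4)))
```

The printed maximum difference should be at rounding-error level, confirming that one exchange between the two passes of 1D transforms is enough to produce the 2D transform (in transposed layout).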

Broadcast
Broadcast is a very common communication primitive encountered in many applications. We consider one-to-all broadcast (also known as single source broadcast) [11]. This section presents the performance of two broadcast algorithms, namely Linear Broadcast (LIB) and Recursive Broadcast (REB). We compare these algorithms with the system broadcast function.
The LIB is the simplest broadcast algorithm. It has N - 1 steps. The processor broadcasting a message simply sends the message one by one to all the processors. In the REB algorithm, there are lg N steps. Without loss of generality, consider processor 0 to be the broadcasting source. In the first step, it sends the message to processor N/2; in the second step, processor 0 sends the message to processor N/4 and processor N/2 sends the message to processor 3N/4, and so on. The REB algorithm is given in Figure 9.
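The recursive broadcast pattern can be written as a short schedule generator. The sketch assumes processor 0 is the source and a power-of-two number of processors; the function name is hypothetical and the code does not reproduce Figure 9.

```python
def recursive_broadcast_schedule(nprocs):
    """Return lg N steps; each step is a list of (sender, receiver) pairs.

    The set of processors holding the message doubles in every step:
    each current holder sends to the processor half a block away.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    steps = []
    stride = nprocs // 2
    while stride >= 1:
        steps.append([(src, src + stride) for src in range(0, nprocs, 2 * stride)])
        stride //= 2
    return steps


if __name__ == "__main__":
    for i, sends in enumerate(recursive_broadcast_schedule(8), start=1):
        print("step", i, sends)
```

For 8 processors the steps are 0→4, then 0→2 and 4→6, then 0→1, 2→3, 4→5 and 6→7, so every processor has the message after lg 8 = 3 steps.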
Figure 10 shows the performance of the two algorithms and the broadcast function provided by the system [4] as a function of message size for a 32 node machine partition. Clearly, the LIB algorithm performs much worse than the REB algorithm. Therefore, we did not consider the LIB algorithm any further. The REB performs better than the system broadcast when the message size is more than 1K bytes. The REB can selectively broadcast to a particular group of processors in a partition, whereas the current version of the system broadcast function requires all processors in the partition to participate in the process. Selective broadcasting is sometimes necessary, for instance when processors are configured as a mesh and broadcast along a row or a column is required.
Figure 11 shows the performance of the REB algorithm and the system broadcast as a function of multiprocessor size for various message sizes. The performance of the built-in broadcast was almost the same irrespective of the number of processors in the system. So, we have shown only one curve for it in the figure. For small messages, the system broadcast function performs better than the REB. However, as the message size increases, the REB becomes better than the system broadcast. For instance, the REB is better than the system broadcast when the message size is more than 2K bytes and the number of processors is 256.
Figure 10: Broadcast Algorithms on 32 nodes

Scheduling Irregular Communication Patterns
An irregular problem is one in which the pattern of data access is input-dependent [13, 7]. Hence, when an irregular problem is implemented on message passing machines, the communication between the processors will also be irregular and will not be known beforehand. Such irregular communication patterns occur in a large number of computationally intensive problems such as unstructured mesh methods used to solve problems in computational fluid dynamics. To optimize communication between processors, the communication patterns in these problems can be captured and scheduled at runtime. Such dynamic scheduling of messages on a hypercube can be done by using the crystal router described in [7]. The performance effects of irregular communication patterns on the CM-2 have been studied in [15]. In this section we study their performance effects on the CM-5.
We have implemented four different algorithms for scheduling irregular communication patterns, namely Linear Scheduling (LS), Pairwise Scheduling (PS), Balanced Scheduling (BS) and Greedy Scheduling (GS). We have studied the performance of these algorithms for communication patterns of synthetic as well as real problems such as the conjugate gradient solver and the Euler solver for several data sets. A communication pattern is represented as a two-dimensional array called 'Pattern'. The element Pattern[i][j] indicates the number of bytes to be sent from processor i to processor j.
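The pattern array lends itself to a simple density measure. The sketch below treats density as the fraction of nonzero off-diagonal entries relative to a complete exchange, which is an assumed reading of the "percentage of complete exchange" used in the experiments; the function names are hypothetical.

```python
def communication_density(pattern):
    """Fraction of ordered processor pairs (i, j), i != j, that communicate."""
    n = len(pattern)
    used = sum(1 for i in range(n) for j in range(n)
               if i != j and pattern[i][j] > 0)
    return used / (n * (n - 1))


def average_message_bytes(pattern):
    """Mean size of the nonzero off-diagonal messages (0.0 if there are none)."""
    sizes = [pattern[i][j] for i in range(len(pattern))
             for j in range(len(pattern)) if i != j and pattern[i][j] > 0]
    return sum(sizes) / len(sizes) if sizes else 0.0


if __name__ == "__main__":
    pattern = [
        [0, 256, 0, 0],
        [128, 0, 0, 0],
        [64, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(communication_density(pattern), average_message_bytes(pattern))
```

In the example, 3 of the 12 possible messages are present, giving a density of 0.25.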

Pairwise Scheduling (PS)
The Pairwise Scheduling algorithm is a modification of the Pairwise Exchange algorithm of Figure 2 to take into account the irregular communication. The communicating pairs are determined in the same way as in pairwise exchange. But, in addition, each processor checks the communication matrix to see whether the operation to be performed is an exchange, a send, a receive or no communication at all. If the matrix indicates no communication, the processor remains idle in that step. The communication schedule of the pairwise scheduling algorithm for 8 processors with communication pattern 'P' is given in Table 8. The entire communication is done in 6 steps.
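The per-step check against the communication matrix can be sketched as follows. The operation labels and function name are hypothetical, a power-of-two processor count is assumed, and dropping steps in which no pair communicates is an assumption made here so that the step count reflects only useful steps.

```python
def pairwise_scheduling(pattern):
    """Pairwise (XOR) pairing filtered by an irregular pattern.

    pattern[i][j] is the number of bytes processor i sends to processor j.
    Returns a list of steps; each step lists ("exchange", p, q) or
    ("send", sender, receiver) operations. Pairs with nothing to
    communicate stay idle and are not listed.
    """
    n = len(pattern)
    if n < 1 or n & (n - 1):
        raise ValueError("number of processors must be a power of two")
    schedule = []
    for step in range(1, n):
        ops = []
        for p in range(n):
            q = p ^ step
            if p > q:
                continue  # handle each pair once
            p_to_q = pattern[p][q] > 0
            q_to_p = pattern[q][p] > 0
            if p_to_q and q_to_p:
                ops.append(("exchange", p, q))
            elif p_to_q:
                ops.append(("send", p, q))
            elif q_to_p:
                ops.append(("send", q, p))
        if ops:  # assumption: steps with no communication are skipped
            schedule.append(ops)
    return schedule


if __name__ == "__main__":
    pattern = [[0, 10, 0, 0], [5, 0, 0, 0], [7, 0, 0, 0], [0, 0, 0, 0]]
    print(pairwise_scheduling(pattern))
```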

Balanced Scheduling (BS)
The Balanced Scheduling algorithm is a modification of the balanced exchange algorithm given in Figure 4. The communicating pairs are determined in the same way as in balanced exchange. But, in addition, each processor checks the communication matrix to see whether the operation to be performed is an exchange, a send, a receive or no communication at all. If the matrix indicates no communication, the processor remains idle in that step. The communication schedule of the balanced scheduling algorithm for 8 processors with communication pattern 'P' is given in Table 9. The entire communication is done in 7 steps.

Greedy Scheduling (GS)
In this algorithm each processor first uses a greedy strategy to determine the processors it has to communicate with at every step, and then uses this schedule to perform the communication. For a complete exchange operation this algorithm creates the same communication schedule as pairwise exchange. But when the communication is irregular, the greedy algorithm creates a different communication schedule from that of the pairwise scheduling algorithm. This is because in the greedy algorithm, if processor i does not have to communicate with processor j, it will communicate with the next available processor with which it needs to communicate. In the pairwise scheduling algorithm, if a pair of processors [i, j] determined by the algorithm do not have to communicate, they remain idle in that step. The communication schedule of the greedy scheduling algorithm for 8 processors with communication pattern 'P' is given in Table 10. The entire communication is done in 6 steps.
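A greedy scheduler of this kind can be sketched in a few lines. The version below follows the strategy described above, with processors considered in increasing order and each taking the lowest-numbered available partner; that tie-breaking rule, the operation labels and the function name are assumptions rather than details confirmed by the source.

```python
def greedy_schedule(pattern):
    """Greedy scheduling of an irregular pattern.

    pattern[i][j] > 0 means processor i has a message for processor j.
    Returns a list of steps; each step lists ("exchange", i, j) or
    ("send", i, j) operations, and no processor appears twice in a step.
    """
    n = len(pattern)
    # pending[i] holds the destinations processor i still has to send to.
    pending = [[j for j in range(n) if j != i and pattern[i][j] > 0]
               for i in range(n)]
    schedule = []
    while any(pending):
        available = [True] * n
        ops = []
        for i in range(n):
            if not available[i]:
                continue
            # Next available processor among those i still has to send to.
            j = next((c for c in pending[i] if available[c]), None)
            if j is None:
                continue  # nothing i can do in this step
            pending[i].remove(j)
            if i in pending[j]:
                pending[j].remove(i)   # j also sends to i: do an exchange
                ops.append(("exchange", i, j))
            else:
                ops.append(("send", i, j))
            available[i] = available[j] = False
        schedule.append(ops)
    return schedule


if __name__ == "__main__":
    pattern = [[0, 10, 0, 0], [5, 0, 0, 0], [7, 0, 0, 0], [0, 0, 0, 0]]
    for step, ops in enumerate(greedy_schedule(pattern), start=1):
        print("step", step, ops)
```

With this tie-breaking rule, a complete exchange on 8 processors is scheduled in 7 steps of pure exchanges, the same number of steps as pairwise exchange.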

Performance Comparison
The communication schedule needs to be created only once and can be used thereafter to perform the communication for as many iterations as required. Hence the time to compute the schedule can be amortized over all the iterations. We have created synthetic communication patterns with different communication densities of 10%, 25%, 50% and 75% of complete exchange and studied the performance of the above algorithms on these patterns for message sizes of 256 and 512 bytes on a 32 processor system. The results are given in Table 11. We see that the linear scheduling algorithm performs the worst in all cases because of the synchronous communication constraint. The performance of the pairwise and balanced scheduling algorithms is comparable. The greedy algorithm performs the best for communication densities of less than 50%, because the number of steps involved in the communication is the minimum of all the algorithms. But when the communication density is higher than 50%, the greedy algorithm may require more steps than the pairwise and balanced algorithms, which degrades the performance. In this case, balanced scheduling performs the best.
The performance of these algorithms on real problems such as the conjugate gradient solver and the Euler solver for unstructured meshes of different sizes is given in Table 12. The table shows the communication time for each algorithm as well as the average number of bytes transferred in each problem and the percentage of communication operations with respect to complete exchange. The communication percentage varies from 9% in the conjugate gradient solver to 44% in the Euler solver for meshes with 2K and 9K vertices. The average number of bytes transferred per communication operation varies from 85 bytes for the Euler solver for a mesh with 545 vertices to 643 bytes for the conjugate gradient solver. The performance of the algorithms on the real problems is consistent with that on the synthetic patterns. Since the communication density is less than 50% in the real problems, the greedy algorithm performs the best.

while (msgs_to_send != 0) do
    iteration = iteration + 1
    for i = 1 to nprocs do
        P_i selects the next available P_j among the processors it has to send to
        If P_j also sends to P_i then do an exchange
        Mark P_i and P_j as unavailable for this iter
        Decrement msgs_to_send appropriately
    end for
end while
Figure 12: Greedy Scheduling Algorithm

Conclusions
This paper presented experimental results for communication overhead on the CM-5 and the performance effects of scheduling regular and irregular communication patterns on the CM-5. We studied the communication overhead of four complete exchange algorithms. For a large number of processors, the Recursive Exchange algorithm performs the best. Balanced exchange performs the best for small message sizes. For large message sizes in a small multiprocessor system, pairwise exchange performs better than the other algorithms.
We implemented two algorithms for one-to-all selective broadcast, namely Linear Broadcast and Recursive Broadcast. The recursive broadcast algorithm performs better than linear broadcast, and it is also better than the system broadcast function when the message size is large.
For irregular communication patterns, the greedy algorithm performs the best when the communication density is less than 50%. The balanced scheduling algorithm performs the best when the communication density is higher than 50%. The linear scheduling algorithm suffers because of the synchronous communication constraint.

Figure 11: Recursive Broadcast Algorithm on Varying Sizes of Nodes

Table 1: 8 Processor Communication Schedule for Linear Exchange

Table 5: Performance of Scheduling Algorithms on 2D FFT (Time in Secs.)

Table 7: Communication Schedule for Pattern 'P' using Linear Scheduling
The communication schedule of the linear scheduling algorithm for 8 processors with communication pattern 'P' is given in Table 7. The entire communication schedule is completed in 8 steps.

Table 8: Communication Schedule for Pattern 'P' using Pairwise Scheduling

Table 9 .
The entire communication is done in 7 steps.

Table 10: Communication Schedule for Pattern 'P'