Static and Runtime Algorithms for All-to-Many Personalized Communication on Permutation Networks

With the advent of new routing methods, the distance over which a message travels is becoming relatively less important. Thus, assuming no link contention, permutation appears to be an efficient collective communication primitive. In this paper we present several algorithms for decomposing all-to-many personalized communication into a set of disjoint partial permutations. We discuss several algorithms and study their effectiveness from the viewpoint of static scheduling as well as runtime scheduling. An approximate analysis shows that, with n processors and assuming that every processor sends and receives d messages to random destinations, our algorithm can perform the scheduling in O(dn ln d) time on average, and uses an expected number of d + log d partial permutations to carry out the communication. We present experimental results of our algorithms on the CM-5.

In parallel computing, it is important to map the program such that the total execution time is minimized. Experience with parallel computing has shown that a "good" mapping is a critical part of executing a program on such computers. This mapping can typically be performed statically or dynamically. For most regular and synchronous problems [10], this mapping can be performed at compile time by giving directives in the language to decompose the data and its corresponding computations (based on the owner-computes rule, where each processor computes only values of data it owns [5,17,21]). This ordinarily results in regular collective communication between processors. Many such primitives have been developed in [1,16]. Load balancing and reduction of communication are two important issues for achieving a good mapping. The directives of Fortran D [6] can be used to provide such a mapping for a large class of regular and synchronous problems.
For some other classes of problems [3,19,20] that are irregular in nature, achieving a good mapping is considerably more difficult [7]. Further, the nature of this irregularity may not be known at compile time and can be ascertained only at runtime. The handling of irregular problems requires the use of runtime information to optimize communication and load balancing [9,13,14]. These packages derive the necessary communication information based on the data required for performing local computations and the data partitioning. Typically, the same schedule is used a large number of times. Communication optimization is therefore very important and affects the performance of applications on a parallel machine.
In this paper we develop and analyze several simple methods of scheduling communication. These methods are efficient enough that they can be used statically as well as at runtime. Assuming a system with n processors, our algorithms take as input an n × n communication matrix COM. COM(i, j) is equal to 1 if processor P_i needs to send a message to P_j, 0 ≤ i, j ≤ n − 1. Our algorithms decompose the communication matrix COM into a set of disjoint partial permutations, pm_1, pm_2, …, pm_l, such that if COM(i, j) = 1, then there exists a unique k, 1 ≤ k ≤ l, such that pm_k(i) = j.
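As a concrete illustration of this contract, the following Python sketch (the helper name is ours, not from the paper) checks that a proposed schedule is a valid decomposition: each round is a partial permutation, and every required message is covered exactly once.

```python
def is_valid_schedule(com, schedule):
    """Check that `schedule` (a list of partial permutations, each a dict
    mapping sender -> receiver) is a disjoint decomposition of `com`."""
    n = len(com)
    covered = set()
    for pm in schedule:
        # In a partial permutation no receiver appears twice
        # (dict keys already guarantee unique senders).
        if len(set(pm.values())) != len(pm):
            return False
        for i, j in pm.items():
            # Every scheduled message must be required and not yet covered.
            if com[i][j] != 1 or (i, j) in covered:
                return False
            covered.add((i, j))
    # Every required message must appear in exactly one permutation.
    required = {(i, j) for i in range(n) for j in range(n) if com[i][j] == 1}
    return covered == required
```

The scheduling algorithms in the rest of the paper all produce schedules that satisfy this predicate; they differ in how many permutations they use and how much time the scheduling itself takes.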
With the advent of new routing methods [8,15,18], the distance over which a message travels is becoming relatively less important [2]. Thus, assuming no link contention, permutation is an efficient collective communication primitive. Permutation also has the useful property that every processor both sends and receives at most one message. For an architecture like the CM-5, the data transfer rate seems to be bounded by the speed at which data can be sent or received by any processor [4]. Thus, if a particular processor receives more than one message or has to send out more than one message in one phase, then the time will be lower bounded by the time required to remove messages from the network by the processor receiving the maximum amount of data.
Assuming that each of the n processors sends out at most d messages and receives at most d messages, we perform an approximate probabilistic analysis and show that the complexity of the algorithm is O(nd ln d) on average. Assuming that the cost of completing one permutation is O(τ + φM), where τ is the communication set-up time and φ is the transmission time per byte, the minimum time required for communication is O(d(τ + φM)). Thus the cost of the scheduling algorithm as compared to the cost of communication is negligible if φM ≫ n ln d. If the number of times the same communication schedule is used is large (which happens for a large class of problems [6]), the fractional cost of the scheduling algorithm is quite small. Further, the average number of permutations generated is approximately d + log d. Thus, on average, the fraction of extra permutations generated is not very high. Compared to a naive algorithm for communication of messages for a sparse communication matrix that takes time proportional to n permutations, this algorithm achieves significant speedup. On a 32-node CM-5, our experimental results show that the cost of scheduling is no more than the cost of communication for small messages (16 bytes). For large messages (4K bytes or larger), the cost is less than one-quarter of the total time for communication. For many applications, the same schedule is utilized repeatedly [6]; thus our algorithms would also be useful for many applications for which the communication structure can be derived only at runtime.
The rest of this paper is organized as follows. Notations and assumptions are given in Section 2. Section 3 presents scheduling algorithms and their time complexity analysis. Section 4 provides an improved version of our algorithm and its time complexity analysis. Section 5 presents the experimental results. Finally, conclusions are given in Section 6.

Preliminaries
The communication matrix COM is an n × n matrix, where n is the number of processors. COM(i, j) is equal to 1 if processor P_i needs to send a message to P_j; otherwise COM(i, j) = 0, 0 ≤ i, j < n. Thus, row i of COM represents the sending vector, sendl_i, of processor P_i, which contains information about the destination processors of outgoing messages. Column i of COM represents the receiving vector, recvl_i, of processor P_i, which contains information about the source processors of incoming messages. The entry sendl_i(j) (recvl_i(j)) represents the j-th entry in the vector sendl_i (recvl_i). Assuming COM(i, j) = 1, then sendl_i(j) = recvl_j(i) = 1. We will use sendl and recvl to represent each processor's sending vector and receiving vector when there is no ambiguity.
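In code, the sending and receiving vectors are simply the rows and columns of COM. This small Python sketch (the function name is ours, for illustration only) extracts them and exercises the identity sendl_i(j) = recvl_j(i):

```python
def send_recv_vectors(com):
    """Return (sendl, recvl): sendl[i] is row i of COM (destinations of P_i),
    recvl[i] is column i of COM (sources of P_i)."""
    n = len(com)
    sendl = [list(com[i]) for i in range(n)]
    recvl = [[com[i][j] for i in range(n)] for j in range(n)]
    return sendl, recvl
```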

Notations and Assumptions
We categorize the routing algorithms into several different categories:
1. Uniformity of messages: Uniform messages mean all messages are of equal size. In this paper we assume that all messages are approximately of the same size.
2. Density of the communication matrix: If the communication matrix is dense, then all processors send data to all other processors. If the communication matrix is sparse, then every processor sends to only a few processors.
3. Static or runtime scheduling: Communication scheduling may be performed statically or dynamically.
We make the following assumptions for the complexity analysis.
1. All permutations can be completed in (τ + φM) time, where τ is the communication set-up time, M is the maximum size of any message sent, and φ represents the transmission time per byte (i.e., 1/φ is the bandwidth of the communication channels).
2. Each processor can send only one message and receive only one message at a time.
3. In case communication is sparse, all nodes send and receive an approximately equal number of messages; if the density of sparseness is d, then at least d permutations are required to send all the messages.

Cost of Random Permutations on CM-5
The algorithms described in this paper do not take link contention into account, principally because routing on the CM-5 is randomized and it is not possible to statically schedule messages in such a fashion that link contention is avoided; the randomization, however, alleviates that problem to a large extent. On a 32-node CM-5, we generated 5000 random permutations in which each processor sends and receives a message of 1K bytes. Over 99.5% (4979 out of 5000) of the permutations were within 5% of the average cost (the average communication cost over these 5000 random permutations is 0.543 milliseconds) (Figure 1). Thus, the variation in the time required for different random permutations (in which each node sends data to a random, but different, node) is very small on a 32-node CM-5. These observations reveal that the performance of our algorithms, which use permutation as the underlying communication scheme, is not significantly affected by the particular sequence of permutation instances. The bandwidth achieved for these permutations is approximately 4 Mbytes/sec, which is close to the peak bandwidth of 5 Mbytes/sec provided by the underlying hardware for long-distance messages.

Scheduling Algorithms
In this paper we assume that each processor has an identical communication matrix COM. The communication matrix COM is a sparse matrix, i.e., each processor will send and receive d messages (in a system with n processors, d ≪ n). In case only the vector sendl is available at every node, the communication matrix COM can be generated by using a concatenate operation. For architectures like the CM-5, performing a concatenate operation is efficient and can be completed in O(dn) time [4]. These operations have efficient implementations on other architectures such as hypercubes and meshes.
The communication patterns considered in this paper are all-to-many personalized communication (all-to-all personalized communication is a special case of all-to-many personalized communication). In personalized communication, one processor sends a unique message to other processors [12]. We also assume that COM is a uniform communication pattern, i.e., all messages are of equal size. We are currently developing methods for the case when messages are non-uniform.

Asynchronous_Send_Receive()
For all processors P_i, 0 ≤ i ≤ n − 1, in parallel do
  1. Allocate buffers and post requests for incoming messages;
  2. Send out all outgoing messages to other processors;
  3. Check and confirm incoming messages from other processors.

Figure 2: Asynchronous communication algorithm.

We propose several scheduling algorithms, and analyze their time complexity, in the following subsections. All the algorithms proposed in this paper are executed in SPMD (single-program multiple-data) mode, i.e., every processor has the same copy of a program, but each processor runs its program in an asynchronous fashion.

Asynchronous Communication (AC)
The most straightforward approach is to use asynchronous communication. The algorithm is divided into three phases:
1. Each processor first posts requests for expected incoming messages (this operation pre-allocates buffers for those messages).
2. Each processor sends all of its outgoing messages to other processors.
3. Each processor checks and confirms incoming messages (some of which may already have arrived in their receiving buffer(s)) from other processors.

During the send-receive process the sending processor need not wait for a completion signal from the receiving processor, but can keep sending outgoing messages until they are all done. This naive approach is expected to perform well when the density d is small. The asynchronous algorithm is given in Figure 2. Similar schemes were proposed in several parallel compiler projects [11,13].
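The three phases can be mimicked in a single-process Python simulation (the mailbox data structure below is our own illustration, not the CM-5 message-passing interface): buffers are pre-posted, all sends proceed without waiting for acknowledgements, and a final pass confirms arrival.

```python
def asynchronous_send_receive(com, payload):
    n = len(com)
    # Phase 1: pre-allocate a buffer slot for each expected incoming message.
    inbox = {j: {i: None for i in range(n) if com[i][j]} for j in range(n)}
    # Phase 2: every processor sends all outgoing messages without waiting
    # for completion signals from the receivers.
    for i in range(n):
        for j in range(n):
            if com[i][j]:
                inbox[j][i] = payload(i, j)
    # Phase 3: check and confirm that all expected messages have arrived.
    for j in range(n):
        assert all(m is not None for m in inbox[j].values())
    return inbox
```

On a real machine the three phases overlap in time, which is precisely what makes the buffer-overflow behavior discussed next hard to analyze.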
In the worst case, the time complexity of this algorithm is difficult to analyze, as it depends on the congestion and contention of the network on which it is performed. Further, each processor may have only a limited amount of message buffer space. When the buffer is fully occupied by unconsumed messages, further messages will be blocked at the sending processors' side. The overflow will block processors from doing further processing (including receiving messages) because processors are waiting for other processors to consume and empty their buffers before they can receive new incoming messages. This situation may never resolve, and a deadlock may occur among the processors. In order to avoid deadlock, one needs to monitor the production/consumption rate very carefully to guarantee the completion of communication. In case the system buffer is too small to hold all messages at one time, one needs to introduce a strip-mining scheme [11] to perform sends and receives alternately so that fewer unreceived messages accumulate in the buffer and an overflow does not occur.

Linear_Permutation()
For all processors P_i, 0 ≤ i ≤ n − 1, in parallel do
  for k = 1 to n − 1 do
    j = i ⊕ k;
    if COM(i, j) > 0 then P_i sends a message to P_j;
    if COM(j, i) > 0 then P_i receives a message from P_j;
  endfor

Figure 3: Linear permutation algorithm.

Linear Permutation (LP)
In this algorithm (Figure 3), each processor P_i sends a message to processor P_{i⊕k} and receives a message from P_{i⊕k}, where 0 < k < n. When COM(i, j) = 0, processor P_i will not send a message to processor P_j (but will receive a message from P_j if COM(j, i) > 0).
The entire communication uses pairwise exchange (j = i ⊕ k if and only if i = j ⊕ k).
The overhead of this algorithm is O(n), regardless of the number of messages each processor actually sends/receives. This scheme is typically useful when each processor needs to send a message to a large subset of all the processors involved in the communication. The algorithm in Figure 3 assumes that the number of processors, n, is a power of 2; it can easily be extended to the case where n is not a power of 2.
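A sequential Python sketch of the schedule generated by Figure 3 (for n a power of two; the function name is ours): round k pairs P_i with P_{i XOR k}, so each round is automatically a partial permutation.

```python
def linear_permutation_schedule(com):
    n = len(com)  # n must be a power of two for XOR pairing
    rounds = []
    for k in range(1, n):
        pm = {}
        for i in range(n):
            j = i ^ k  # pairwise exchange: j = i XOR k iff i = j XOR k
            if com[i][j]:
                pm[i] = j  # P_i sends to its partner only if COM requires it
        rounds.append(pm)
    return rounds
```

Note that n − 1 rounds are always generated, regardless of how many of them actually carry messages, which is exactly the O(n) overhead discussed above.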

Global Masking (GM)
A high-level description of this algorithm is given in Figure 4. At each iteration we first set all entries of the vectors sendl and recvl to −1. Then, within each row x of COM, 0 ≤ x ≤ n − 1, we try to find a column y, 0 ≤ y ≤ n − 1, with COM(x, y) = 1 and recvl(y) = −1; if such a y exists, we set sendl(x) = y and recvl(y) = x. Processors then send/receive messages according to the vectors sendl and recvl. This procedure is repeated until all messages are sent/received. As mentioned in the previous section, we assume the communication matrix COM is a sparse matrix and each processor sends out d messages to d different processors. Further, we assume that each processor receives approximately d messages. Clearly, the number of permutations is lower bounded by the maximum number of messages received by any processor. In this algorithm, the number of iterations, σ, needed to complete the message routing is lower bounded by U, where U = max{the number of messages received by any processor} ≥ d. Because each iteration takes O(n² + τ + φM) time to complete, the total time complexity of this algorithm is O(σ(n² + τ + φM)).
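A sequential Python sketch of one reading of the GM iteration (the helper names are ours): each pass greedily builds sendl/recvl by scanning rows in order, and passes repeat until the matrix is exhausted.

```python
def global_masking_schedule(com):
    n = len(com)
    remaining = [list(row) for row in com]
    schedule = []
    while any(1 in row for row in remaining):
        sendl = [-1] * n
        recvl = [-1] * n
        for x in range(n):  # each row tries to claim one free receiver
            for y in range(n):
                if remaining[x][y] == 1 and recvl[y] == -1:
                    sendl[x], recvl[y] = y, x
                    remaining[x][y] = 0
                    break
        schedule.append({x: sendl[x] for x in range(n) if sendl[x] != -1})
    return schedule
```

Each pass scans up to n² entries of COM, which is the O(n²) per-iteration scheduling term in the complexity above.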
Compared to the linear permutation algorithm presented in the previous subsection, the global masking algorithm takes fewer iterations to complete the message routing, but it takes extra time to schedule the communication. If n² ≪ τ + φM, i.e., the message size is large compared to the number of processors, the global masking algorithm may outperform the linear permutation algorithm.

Enhanced Scheduling Algorithm
In the global masking algorithm described in the previous section, when looking for an entry with COM(i, j) = 1 along row i, we may first visit several entries with COM(i, k) = 0, where 0 ≤ k < j, before reaching column j. These visits to useless entries should be avoided to minimize unnecessary computation overhead. With this in mind, we present an enhanced version of the global masking algorithm, the compact global masking (CGM) algorithm. The scheme eliminates undesired computations by copying all useful COM entries to an n × d matrix CCOM (Figure 5).
The vector prt is used as a pointer whose elements give the number of active (positive) columns in each row of CCOM. The reason for performing Random_Swap(CCOM) is to perturb the sorted order in each row so that the expected number of collisions (i.e., within one iteration, the entries along a column k are repeatedly chosen and tested, but eventually only one entry is selected and the other tests are fruitless) can be reduced. If we perform this compression statically, the time complexity is O(n(n + d)) = O(n²). Alternatively, this operation can be performed at runtime: each processor compacts one row, and then all processors participate in a concatenate operation that combines all rows into an n × d matrix. The cost of this parallel scheme is O(n + d + dn) = O(dn), assuming that the concatenate operation can be completed in O(dn) time, which was shown to be true for the CM-5 [4].
We assume that CCOM(i, j) = −1 if this entry does not contain active information.
After the copy procedure, the first d columns of each row contain the active entries. When searching for an available entry along row i, the first column j with CCOM(i, j) = k and recvl(k) = −1 is chosen. We then set sendl(i) = k and recvl(k) = i. In order to avoid any unnecessary traversal through useless holes (entries), we move entry CCOM(i, l) to CCOM(i, j) and reset CCOM(i, l) = −1, where l = prt(i). With this "compact" approach, the first several columns in each row contain no useless entries, and unnecessary visits to inactive entries in the following iterations are eliminated. The worst-case time complexity to form a routing schedule with this algorithm is O(dn), compared to O(n²) for the GM algorithm. The compact global masking algorithm is described in Figure 6.
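A Python sketch of the compaction step and the hole-filling rule (the function names are ours): active destinations are packed into the first prt(i) columns of row i, and the hole left by a consumed entry is refilled from position prt(i) so the prefix stays compact.

```python
def compact(com, d):
    """Pack each row's destinations into the first columns of an n x d matrix.
    prt[i] counts the active entries in row i; columns >= prt[i] hold -1."""
    n = len(com)
    ccom = [[-1] * d for _ in range(n)]
    prt = [0] * n
    for i in range(n):
        for j in range(n):
            if com[i][j]:
                ccom[i][prt[i]] = j  # store the destination index itself
                prt[i] += 1
    return ccom, prt

def consume(ccom, prt, i, z):
    """Remove entry z of row i, keeping the active prefix compact."""
    prt[i] -= 1
    ccom[i][z] = ccom[i][prt[i]]
    ccom[i][prt[i]] = -1
```

Because every row keeps its active entries in a contiguous prefix, later iterations never scan a hole, which is what brings the scheduling cost down to O(dn).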
Step 1 takes O(n²) time to complete in a sequential program, but we can parallelize this step: each processor creates one row of CCOM, then all processors participate in concatenating the results together. We make the following assumptions to gain insight into the average complexity of the CGM algorithm. Wherever possible, we support these assumptions by simulation results.
1. At the beginning of each outer loop (Step 2 of Figure 6), the number of active entries, d, in each row of CCOM is approximately equal, and the destinations to which each node will send data are random (between P_0 and P_{n−1}).

2. Different stages are assumed to act independently of each other. Each stage starts with the number of messages in each node equal to the average number of messages left in each node by the previous stage.

Compact_Global_Masking()
1. Use the n × n matrix COM to create an n × d matrix CCOM; also generate a vector prt;
2. For all processors P_i, 0 ≤ i ≤ n − 1, in parallel do
   Repeat
   (a) Set all entries of the vectors sendl and recvl to −1;
   (b) x = random(0..n − 1);
   (c) for k = 1 to n do
       i. Along row x of CCOM, try to find an entry CCOM(x, z) = y that satisfies y > −1 and recvl(y) = −1;
       ii. If such a z exists, then set sendl(x) = y and recvl(y) = x. Also set CCOM(x, z) = CCOM(x, prt(x)), CCOM(x, prt(x)) = −1, and prt(x) = prt(x) − 1;
       iii. x = (x + 1) mod n;
       endfor
   (d) if sendl(i) ≥ 0 then P_i sends a message to P_sendl(i);
       if recvl(i) ≥ 0 then P_i receives a message from P_recvl(i);
   Until all messages are sent/received.

Figure 6: Compact global masking algorithm.

At Step 2c, the probability, Prob_k, of finding an available entry in row k starts at Prob_0 = n/n = 1. (1)

Thus the expected computation cost of one iteration is O(n ln d + n). We are also interested in the number of entries CCOM(i, j) consumed in one iteration, i.e., the number of entries CCOM(i, j) reset to −1 in one iteration. In the case when each row has d active entries, the first d rows always find an available entry; the probability of success in finding an available entry in the (d + 1)-th row is 1 − (d/n)^d (there are d active entries in each row). The expected number of rows that succeed in finding an available entry is S = 1 + ⋯ (2)

It is difficult to analyze the number of messages in each row at the next step; we use the average value as the new value of d at the next step, and make this assumption for all future steps. Let Y_i be the number of useful entries remaining in each row after one iteration. Then the number of iterations used to reduce Y_m from d to d/2 is upper bounded by d/2 + 1, and the number of iterations needed to complete the entire message routing follows.

Generate_COM()
for i = 0 to d − 1 do
  k = i;
  for j = 0 to n − 1 do
    COM(j, k) = 1;
    k = (k + 1) mod n;
  endfor
endfor
for i = 0 to ManyTimes do
  loc1 = random() mod n;
  loc2 = random() mod n;
  switch row loc1 with row loc2;
  (and/or switch column loc1 with column loc2);
endfor

Figure 7: Algorithm for generating a random communication matrix COM.
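Putting the pieces together, a sequential Python simulation of CGM (seeded randomness and function names are ours; Random_Swap is approximated by shuffling each row's active prefix) lets one check the correctness of the generated schedule.

```python
import random

def cgm_schedule(com, d, seed=0):
    rng = random.Random(seed)
    n = len(com)
    # Step 1: compact COM into n x d rows of destination indices.
    ccom, prt = [], []
    for i in range(n):
        row = [j for j in range(n) if com[i][j]]
        rng.shuffle(row)  # perturb the order to reduce collisions
        prt.append(len(row))
        ccom.append(row + [-1] * (d - len(row)))
    schedule = []
    while any(p > 0 for p in prt):
        sendl, recvl = [-1] * n, [-1] * n
        x = rng.randrange(n)  # Step 2b: random starting row
        for _ in range(n):
            for z in range(prt[x]):  # scan only the compact active prefix
                y = ccom[x][z]
                if recvl[y] == -1:
                    sendl[x], recvl[y] = y, x
                    prt[x] -= 1  # fill the hole from the end of the prefix
                    ccom[x][z] = ccom[x][prt[x]]
                    ccom[x][prt[x]] = -1
                    break
            x = (x + 1) % n
        schedule.append({i: sendl[i] for i in range(n) if sendl[i] != -1})
    return schedule
```

Running this simulation over random sparse matrices is one way to check the d + log d estimate for the expected number of iterations empirically.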

Experimental Results
We have implemented our algorithms on the CM-5. The experiments focus on evaluating three parameters: (1) the number of permutations needed to complete the communication; (2) the cost of executing the communication scheduling algorithms; and (3) the cost of carrying out the communication. The first two parts have been implemented in a machine-independent fashion, so that the experiments are not restricted by the actual number of processors available. The third part is executed on a 32-node CM-5.
Most of the algorithms we present in this paper are executed in a loosely synchronous fashion. We did not explicitly use global synchronization to enforce synchronization between communication phases in any of the algorithms proposed in this paper.
In our experiments the number of processors, n, ranges from 32 to 1024, and every processor sends and receives d different messages, where 1 ≤ d < n. For each (d, n) combination, we sample 300 different communication matrices COM and record each category's maximum, minimum, and average values. In order to guarantee that in COM every row and every column has approximately d active entries, COM is generated by the algorithm given in Figure 7.
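The generation scheme of Figure 7 can be sketched in Python as follows (our reconstruction, under the assumption that the first phase lays down d circulant diagonals; row and column swaps then scramble the matrix while preserving the row and column sums):

```python
import random

def generate_com(n, d, many_times=1000, seed=0):
    rng = random.Random(seed)
    com = [[0] * n for _ in range(n)]
    for i in range(d):  # d circulant diagonals: every row/column sum is d
        for j in range(n):
            # offset by 1 so the unswapped matrix has no self-messages
            com[j][(j + i + 1) % n] = 1
    for _ in range(many_times):  # random swaps keep the sums invariant
        a, b = rng.randrange(n), rng.randrange(n)
        com[a], com[b] = com[b], com[a]  # switch two rows
        c, e = rng.randrange(n), rng.randrange(n)
        for row in com:  # switch two columns
            row[c], row[e] = row[e], row[c]
    return com
```

Because swapping whole rows or columns permutes, rather than changes, the row and column sums, every generated COM has exactly d active entries per row and per column.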
In order to show that the communication cost on the CM-5 is not sensitive to different permutations, we randomly generated 1000 different permutations and recorded their communication cost; the costs are within approximately 10% of the average value in most cases. Thus the performance of our algorithms is not significantly affected by a given permutation instance (i.e., the CM-5 can complete all permutations in nearly the same amount of time). Tables 2 and 3 give the performance of our algorithms. The results reveal that the GM and CGM algorithms have superior performance compared to the other schemes (but GM incurs a much higher scheduling cost). The comparisons in Figure 8 do not include the cost of scheduling, which is negligible compared to the total cost if the sizes of messages are large or the same schedule is used many times. The tables also show the number of permutations generated by each algorithm and their corresponding cost, and they reveal that the CGM algorithm generates the smallest number of permutations in most cases. Figure 9 shows the fraction of scheduling overhead, scheduling cost/communication cost, of the LP, GM, and CGM algorithms. These observations reveal that the LP algorithm has a very small scheduling overhead (but its overall performance is not good enough, especially when d is small). The GM algorithm has a communication cost similar to that of the CGM algorithm, but it has a relatively high scheduling overhead. The CGM algorithm shows a moderate scheduling overhead, and the fraction decreases as the message size increases (assuming the same communication schedule is utilized only once). The cost of scheduling is thus at most equal to the cost of communication for small messages (16 bytes) and negligible for large messages (less than 0.25 for messages of size 4K). In most applications the same schedule will be utilized many times, hence the fractional cost would be considerably lower (inversely proportional to the number of times the same schedule is used).
Thus, our algorithm is also suitable for runtime scheduling. Table 4 shows the performance of the CGM algorithm. The standard deviations of these results are small (in fact, the maximum and minimum values are within 10% of the average value in most cases), which indicates that this algorithm is very stable for a large class of communication patterns. Figure 10 shows the scheduling time/n versus d ln d (for d ln d less than 150). The experimental results confirm our theoretical analysis of the scheduling time complexity (i.e., O(dn ln d)).

Discussion
From the previous section it is clear that CGM is a better choice than GM. Thus, for the rest of this section, we compare only the performance of LP and CGM, and discuss their use for different ranges of d and n. In Section 3 we showed that the time complexity of the LP algorithm is O(n(τ + φM)), but in this algorithm many permutations in fact send no message. Based on our experimental results, a better model of its cost on the CM-5 is nτ + C₁dφM, where C₁ is a constant. Also, the time complexity of CGM can be written as C₂dn ln d + C₃d(τ + φM), where C₂ and C₃ are constants. We are interested in finding the break-even points for different message sizes where CGM can outperform LP:

C₂dn ln d + C₃d(τ + φM) ≤ nτ + C₁dφM
C₂dn ln d ≤ (n − C₃d)τ + (C₁ − C₃)dφM
d ln d ≤ (n − C₃d)τ / (C₂n) + (C₁ − C₃)dφM / (C₂n).

We first investigate the case where the message size M is small. When M is small, the second term on the right-hand side can be eliminated. Also, the first term on the right-hand side reduces to τ/C₂ when d is small. Thus CGM will outperform the LP algorithm when

d ln d ≤ τ/C₂. (5)

When the message size M is large, the effect of τ becomes less significant than that of φM, so the break-even condition C₂dn ln d + C₃d(τ + φM) ≤ nτ + C₁dφM is dominated by the φM terms. The above discussion is based on the assumption that the same schedule is used only once. When the number of times the same schedule is utilized increases, the CGM algorithm is better over a larger range of d.
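The break-even comparison can be explored numerically. The sketch below encodes the two cost models with illustrative default constants C1 = C2 = C3 = 1 (the actual constants are machine-dependent and are not taken from the paper):

```python
import math

def lp_cost(n, d, M, tau, phi, C1=1.0):
    # Empirical LP model: n*tau + C1*d*M*phi
    return n * tau + C1 * d * M * phi

def cgm_cost(n, d, M, tau, phi, C2=1.0, C3=1.0):
    # CGM model: C2*d*n*ln(d) + C3*d*(tau + M*phi), for d >= 2
    return C2 * d * n * math.log(d) + C3 * d * (tau + M * phi)
```

For example, with n = 1024, small messages, and a set-up time dominating the per-byte cost, these models favor CGM for small d (few permutations needed) and LP once d grows large enough that the d ln d scheduling term dominates, matching inequality (5).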

Conclusions
In this paper we have developed algorithms to perform message routing for all-to-many personalized communication. The linear permutation algorithm is very straightforward and introduces very small computation overhead. The worst-case complexity of this algorithm is O(n(τ + φM)) (the experimental results for a 32-node CM-5 show a cost of O(nτ + C₁dφM), where every node sends d messages). The second algorithm, GM, eliminates unnecessary communication at the cost of significant computation overhead. The complexity of this algorithm is O(σ(n² + τ + φM)), where σ is the number of iterations. When M is relatively large and n and d are small, this algorithm outperforms LP.
The performance of the asynchronous communication algorithm depends on the congestion and contention of the network on which it is performed. This algorithm is machine-dependent, and its complexity may vary from machine to machine.
We have also presented an enhanced version of the GM algorithm, the CGM algorithm. In this algorithm we use the information in COM(i, j) to create an n × d matrix CCOM such that all useful entries appear in the first several columns, and useless entries (CCOM(i, j) = −1) are moved to the end of each row. We show that with this approach, the time complexity to complete one iteration is O((n ln d + n) + (τ + φM)), and we need an expected number of only d + log d iterations to complete the whole message routing.
Another advantage of our algorithm over the other algorithms is that once the schedule is completed, communication can potentially be overlapped with computation, i.e., computation on a packet received in a previous phase can be carried out while the communication of the current phase is being carried out. It is also worth noting that, due to the compaction, nearly all processors receive data packets in each phase, and the load is nearly balanced on every node. Clearly, the number of computation phases would increase by log d (from d to d + log d). Thus, overlapping communication and computation is useful only if the savings from the overlap exceed the extra computation overhead.
This paper assumes that each node sends d messages and receives d messages. These algorithms can be extended to the case when the number of messages to be sent by each processor is not equal. Clearly, if d is the maximum number of messages to be sent, our CGM algorithm should produce no more than an expected number of d + log d permutations.
In such a case, we believe that our algorithm would, on average, produce fewer than d + log d permutations. Since the number of permutations cannot be lower than d, our algorithm produces a near-optimal number of permutations.
Our paper also assumes that all messages are approximately of the same size. For many applications, this is not the case. We are currently investigating methods that are useful when the message sizes are not equal.