Efficient Algorithms for Array Redistribution Efficient Algorithms for Array Redistribution

Dynamic redistribution of arrays is required very often in programs on distributed memory parallel computers. This paper presents e(cid:14)cient algorithms for redistribution between di(cid:11)erent cyclic(k) distributions, as de(cid:12)ned in High Performance Fortran. We (cid:12)rst propose special optimized algorithms for a cyclic(x) to cyclic(y) redistribution when x is a multiple of y, or y is a multiple of x. We then propose two algorithms, called the GCD method and the LCM method, for the general cyclic(x) to cyclic(y) redistribution when there is no particular relation between x and y. We have implemented these algorithms on the Intel Touchstone Delta, and (cid:12)nd that they perform well for di(cid:11)erent array sizes and number of processors.


Introduction
In distributed-memory parallel computers, arrays have to be distributed among processors in some fashion. The distribution can either be regular i.e. block, cyclic, or block-cyclic, as in Fortran D 2] and High Performance Fortran (HPF) 4,9], or irregular in which there is no simple arithmetic function specifying the mapping of arrays to processors. The distribution of an array does not need to remain xed throughout the program. In fact, it is very often necessary to change the distribution of the array at run-time, which is called array redistribution. This requires each processor to calculate what portions of its local array to send to other processors, what portions of its local array to receive from other processors, and perform the necessary communication. It is essential to use e cient algorithms for redistribution, otherwise the performance of the program may degrade considerably.
This paper describes e cient and practical algorithms for redistributing arrays between di erent cyclic(k) distributions, as de ned in HPF. The cyclic(k) distribution is the most general regular distribution in which blocks of size k of the array are distributed among processors in a round-robin fashion. It is also commonly known as a block-cyclic distribution. Redistribution from a cyclic(x) to a cyclic(y) distribution, for any general x and y, is interesting because there is no direct algebraic formula to calculate the set of elements to send to a destination processor and the local addresses of these elements at the destination.
We rst propose e cient algorithms for two special cases of the cyclic(x) to a cyclic(y) redistribution|when x is a multiple of y, or y is a multiple of x. We then propose two methods called the GCD Method and the LCM Method for the general case when there is no particular relation between x and y. The GCD and LCM methods make use of the optimized algorithms developed for the above special cases. The proposed algorithms have low runtime overhead, and are simple and practical enough to be used in the runtime library of a compiler, or directly in application programs requiring redistribution. The rest of this paper is organized as follows. The notations, assumptions and de nitions used in this paper are given in Section 2. Section 3 describes the algorithm for the special case of a cyclic(x) to cyclic(y) redistribution where x is a multiple of y. Section 4 describes the algorithm for the special case where y is a multiple of x. The GCD and LCM methods for the general case are proposed in Section 5. Section 6 discusses related work in this area, followed by conclusions in Section 7. N global array size P number of processors p i logical processor i p logical number of the processor executing the program p s source processor p d destination processor L local array size m block size The notations used in this paper are given in Figure 1. We assume that all arrays are indexed starting from 1, while processors are numbered starting from 0. We also assume that the number of processors on which the array is distributed remains the same before and after the redistribution. In HPF, an array can be distributed as block(m) or cyclic(m), which are de ned as follows. Consider an array of size N distributed over P processors. Let us de ne the ceiling division function CD(j; k) = (j + k ? 1)=k, and the ceiling remainder function CR(j; k) = j ? k CD(j; k). Then, block(m) distribution means that index j of the array is mapped to logical processor number CD(j; m) ? 1. Note that for a valid block(m) distribution, m P N must be true. Block by de nition means the same as block(CD(N; P)). In a cyclic(m) distribution, index j of the global array is mapped to logical processor number mod(CD(j; m) ?1;P) 1 . Cyclic by de nition means the same as cyclic (1).
In other words, in a block distribution, contiguous blocks of the array are distributed among processors. In a cyclic distribution, array elements are distributed among processors in a roundrobin fashion. In a cyclic(m) distribution, blocks of size m are distributed cyclically. Block and cyclic distributions are special cases of the general cyclic(m) distribution. A cyclic(m) distribution with m = dN=Pe is a block distribution, and a cyclic(m) distribution with m = 1 is a cyclic distribution. The formulae for conversion between local and global indices for the di erent distributions in HPF are given in Table 1.
The redistribution algorithms proposed in this paper are intended to be portable. Hence, we do not specify how data communication should be performed because the best communication algorithms are often machine dependent. Redistribution requires all-to-many personalized communication in general, and in many cases it requires all-to-all personalized communication. Algorithms Table 1: Data Distribution and Index Conversion Note: This assumes that arrays are indexed starting from 1 and processors are numbered starting from 0 CD(j; k) = (j + k ? 1)=k and CR(j; k) = j ? k CD(j; k) global index (g) to p = CD(g; m) ? 1 p = mod(g ? 1; P) p = mod(CD(g; m) ? 1; P) processor number (p) global index (g) l = m + CR(g; m) l = (g ? 1)=P + 1 l = mod(g ? 1; m) + 1+ to local index (l) (g=(m P))m local index (l) to g = l + m p g = (l ? 1)P + p + 1 g = mod(l ? 1; m) + 1+ global index (g) (P((l ? 1)=m) + p)m to implement these communication patterns are described in detail in 15,10,17,12,13]. The performance results presented in this paper were obtained using the communication algorithms given in 15,10,17]. We do assume that all the data to be sent from any processor i to processor j has to be collected in a packet in processor i and sent in one communication operation, so as to minimize the communication startup cost. The redistribution algorithms described in this paper are for onedimensional arrays. Multidimensional arrays can be redistributed by applying these algorithms to each dimension of the array separately. In the rest of this paper, any division involving integers should be considered as integer division unless speci ed otherwise.
3 Cyclic(x) to Cyclic(y) Redistribution: Special Case x = k y For a general cyclic(x) to cyclic(y) redistribution, there is no direct algebraic formula to calculate the set of elements to send to a destination processor, and the local addresses of these elements at the destination. Hence, we consider two special cases where x is a multiple of y, or y is a multiple of x. For the general case where there is no particular relation between x and y, we propose two algorithms called the GCD method and the LCM method, which make use of the optimized algorithms developed for the above two special cases.
Let us rst consider the special case where x is a multiple of y. Let x = k y.
Theorem 3.1 In a cyclic(x) to cyclic(y) redistribution where x = k y, if k < P, each processor communicates with k or k ? 1 processors. If k P, each processor communicates with all other processors.
Proof: Assume k < P. Since x = k y, each block of size x is divided into k sub-blocks of size y and distributed cyclically. Consider any processor p i . Assume that it has to send its rst sub-block of size y to processor p j . Then the remaining k ?1 sub-blocks of the rst block are sent to the next k ?1 processors in order. The next k(P ?1) sub-blocks of the global array are located in the other P ? 1 processors. This results in a total of k P sub-blocks. Hence, the (k + 1) th sub-block of size y in p i is also sent to p j . As a result, all sub-blocks from p i are sent to k processors starting from p j . One of these processors may be p i itself, in which case p i has to send data to k ? 1 processors. For the receive phase, consider the rst k P sub-blocks of size y in the global array corresponding to the rst P blocks of size x. Let us number these k P sub-blocks from 0 to k P ? 1. Out of these, the sub-blocks that are mapped to processor p i in the new distribution are numbered p i to P(k ? 1) + p i with stride P. These sub-blocks come from fP(k?1)+p i g?p i P + 1 = k processors. One of these processors may be p i itself, in which case p i receives data from k ? 1 processors.
If k P, each block of size x has to be divided into k sub-blocks and distributed cyclically, where the number of sub-blocks is greater than or equal to the number of processors. So, clearly each processor has to send to and receive from all other processors (all-to-all communication). 2 The algorithm for cyclic(x) to cyclic(y) redistribution, where x = k y is given in Figure 2. We call this the KY TO Y algorithm. In the send phase, each processor p calculates the destination processor p d of the rst element of its local array as p d = mod(k p; P). The rst y elements have to be sent to p d , the next y to mod(p d + 1; P), the next to mod(p d + 2; P) and so on till the end of the rst block of size x. The next k sub-blocks of size y have to be sent to the same set of k processors starting from p d . The sequence of destination processors can be stored and need not be calculated for each block of size x. In the receive phase, there are two cases depending on the value of k :-1. (k P) and (mod(P; k) = 0) : In this case, each processor p calculates the source processor of the rst block of size y of its local array as p s = p=k. The next block of size y will come from processor mod(p s +P=k; P), the next from mod(p s +2(P=k); P) and so on till the rst k blocks. The above sequence of processors is repeated for the remaining sets of k blocks of size x, and hence can be stored and reused. The data received from other processors cannot be directly stored in the local array as it has to be stored with a stride. As a result, the data has to be rst stored in a temporary bu er in memory. This gives us two choices in implementing the receive phase:-Synchronous Method: In this method, each processor waits till it receives data from all other processors, before placing any data in the local array. This increases the memory requirements of the algorithm and also increases the processor idle time. These problems worsen as the number of processors is increased, so this method is not scalable.
1. The destination processor (p d ) of the rst element of the local array is p d = mod(k p; P). 2. For each block of size x in the local array 3.
For i = 0 to k ? 1

4.
The destination processor of elements (i y + 1) to (i + 1)y of this block of size x is mod(p d + i; P). 5. Send data to other processors.
Receive Phase 1. If (k P) and (mod(P; k) = 0) then 2. The source processor (p s ) of the rst element of the local array is p s = p=k.
Synchronous Method: 3. Receive data from all processors into temporary bu ers.
Read the next block of size y from the data received from processor mod(p s + i(P=k); P).
Asynchronous Method: 3. The i th block of size y, 0 i k ? 1, is to be received from processor mod(p s + i(P=k); P).

5.
Receive data from any processor p i into a temporary bu er. 6.
Place the rst block of size y in the local array starting from the location calculated above, and the other blocks with stride x. 7. Else 8. Receive data from all processors into temporary bu ers.

10.
The source processor (p s ) of the rst element (j = i y + 1) of this block of size y is p s = mod (i P + p)=k; P] 11.
Read this block of size y from the data received from p s . Asynchronous Method: In this method, the processors do not wait for data from all processors to arrive. Instead, each processor takes any packet which has arrived and places the data into appropriate locations in the local array. This method overlaps computation and communication. Each processor posts non-blocking receive calls and waits for data from any processor to arrive. As soon as a packet is received, it places the data in appropriate locations in the local array. During this time, data from other processors may have arrived. When the processor has placed all data from the earlier packet into the local array, it takes up the next packet, and so on. This reduces processor idle time. Since all packets do not have to be in memory at the same time, it also reduces memory requirements. This method is scalable as neither processor idle time nor memory requirements increase as the number of processors is increased.
If the synchronous method is used for receiving data, the local array needs to be scanned only once and the i th block, 0 i dL=ye ? 1, of size y of the local array will be read from the data received from processor mod(p s + i(P=k); P). If the asynchronous method is used, the rst block from the data received from some processor p i will be stored starting at the location calculated above. The remaining blocks will be stored with stride x.
2. If k does not satisfy the above condition, it is necessary to calculate the source processor of the rst element (j = i y + 1) of each block of size y, 0 i dL=ye ? 1, of the local array as p s = mod (i P + p)=k; P]. The block is read from the data received from p s . The sequence of processors does not repeat itself and hence cannot be stored. In this case, the synchronous method is used.
In the synchronous method, the local array needs to be scanned only once to be lled. In the asynchronous method, array elements are lled with a certain amount of stride and the array has to be scanned P times. So, clearly the synchronous method makes better use of the cache than the asynchronous method. We have tested the performance of the KY TO Y algorithm using both synchronous and asynchronous methods on the Intel Touchstone Delta. Figure 3 compares the performance of the synchronous and asynchronous methods for a cyclic(4) to cyclic(2) redistribution of a global array of 1M integers for di erent number of processors. We observe that the asynchronous method performs better than the synchronous method, even though in this case each processor communicates with at most two other processors. This is because the asynchronous method overlaps computation and communication, and thus reduces processor idle time. The better cache utilization of the synchronous method does not improve its overall performance. Figure 4 shows the performance of the KY TO Y algorithm for a cyclic(4) to cyclic(2) redistribution on 64 processors for di erent array sizes. For small arrays, the di erence in performance between the synchronous and asynchronous methods is small, because of the small data sets. For large arrays, the di erence is signi cant because of the higher processor idle time in the synchronous method.
4 Cyclic(x) to Cyclic(y) Redistribution: Special case y = k x We now consider the special case where y is a multiple of x. Let y = k x. This is essentially the reverse of the case where x = k y. Theorem 4.1 In a cyclic(x) to cyclic(y) redistribution where y = k x, if k < P, each processor sends data to k or k ? 1 processors and receives data from k or k ? 1 processors. If k P, each processor has to communicate with all other processors (all-to-all communication).  Proof: Assume k < P. Consider the rst k P sub-blocks of size x in the global array corresponding to the rst P sub-blocks of size y. Let us number these k P sub-blocks from 0 to k P ? 1. Out of these, the sub-blocks that are located in processor p i are numbered from p i to P(k ? 1) ? 1 + p i with stride P. In the new distribution, these sub-blocks will be mapped to fP(k?1)?1+p i g?p i P +1 = k processors. One of these processors may be p i itself, in which case p i sends data to k ?1 processors. In the receive phase, since y = k x, each block of size y in the new distribution consists of k subblocks of size x which will come from k processors. Consider any processor p i . Assume that it receives its rst sub-block of size x from processor p j . Then the remaining k ? 1 sub-blocks of the rst block are received from the next k ? 1 processors in order. The other P ? 1 processors receive the next k(P ? 1) sub-blocks of the global array. This results in a total of k P sub-blocks. Hence the next sub-block in p i , which is the rst sub-block of the next block of size y, is also received from p j . As a result, all sub-blocks from p i are received from k processors starting from p j . One of these processors may be p i itself, in which case p i receives data from k ? 1 processors.
If k P, each block of size y will consist of k sub-blocks of size x, where the number of subblocks is greater than or equal to the number of processors. So, clearly each processor has to send to and receive from all other processors (all-to-all communication).

2
The algorithm for cyclic(x) to cyclic(y) redistribution, where y = k x, is given in Figure 5. We call this the X TO KX algorithm. In the send phase, there are two cases depending on the value of k :-1. (k P) and (mod(P; k) = 0): In this case, each processor p calculates the destination processor of the rst block of size x of its local array as p d = p=k. The next block of size x has to be sent to processor mod(p d + P=k; P), the next to mod(p d + 2(P=k); P), and so on till the rst k blocks. The above sequence of processors is repeated for the remaining sets of k blocks of size x, and hence need not be calculated again.
2. If k does not satisfy the above condition, it is necessary to individually calculate the destination processor of each block i of size x, 0 i dL=xe ? 1, as p d = mod (i P + p)=k; P].
In the receive phase, each processor p calculates the source processor of the rst element of its local array as p s = mod k p; P]. As in the KY TO Y algorithm, the receive phase can be implemented using either the synchronous method or the asynchronous method. If the synchronous method is used, for each block of size y of the local array, the k sub-blocks of size x are read from the packets received from the k processors starting from p s in order of processor number. If the asynchronous method is used, we know that the i th block of size x of the local array, 0 i k ? 1, will be 1. If (k P) and (mod(P; k) = 0) then 2. The destination processor (p d ) of the rst element of the local array is p d = p=k.

5.
The destination processor of the next block of size x of the local array is mod(p d + i(P=k); P). 6. Else 7. For i = 0 to dL=xe ? 1 8.
The destination processor (p d ) of the rst element (j = i x + 1) of this block of size x is p d = mod (i P + p)=k; P]. 9. Send data to other processors.
Receive Phase 1. The source processor (p s ) of the rst element of the local array is p s = mod k p; P].
Synchronous Method: 2. Receive data from all processors into temporary bu ers. 3. For each block of size y in the local array do 4.
For i = 0 to k ? 1 do

5.
Read elements (i x + 1) to (i + 1)x of the current block of size y from the packet received from processor mod(p s + i; P).
Asynchronous Method: 2. The i th block of size x, 0 i k ? 1, is to be received from processor mod(p s + i; P).

4.
Receive data from any processor p i into a temporary bu er.

5.
Place the rst block of size x in the local array starting from the location calculated above, and the other blocks with stride y. Figure 5: X TO KX algorithm for cyclic(x) to cyclic(y) redistribution, where y = k x received from processor mod(p s + i; P). Thus the local index of the rst block received from any source processor can be calculated. The remaining blocks have to be stored with stride y.
We have tested the performance of the X TO KX algorithm on the Intel Touchstone Delta for di erent array sizes and number of processors. Figure 6 compares the performance of the synchronous and asynchronous methods for a cyclic(2) to cyclic(4) redistribution of an array of 1M integers for di erent number of processors. Figure 7 compares the performance of the two methods for di erent array sizes on 64 processors. The results are similar to those obtained for the KY TO Y algorithm. The asynchronous method is found to perform better in all cases.

General Cyclic(x) to Cyclic(y) Redistribution:
Let us consider the general case of a cyclic(x) to cyclic(y) redistribution in which there is no particular relation between x and y. One algorithm for doing this is to explicitly calculate the destination and source processor of each element of the local array, using the formulae given in Table 1. We call this the General Method and is described below.

General Method
In the send phase, the destination processor of each element of the local array can be determined by rst calculating its global index based on the current distribution and then the destination processor based on the new distribution. These two calculations can be combined into a single expression to give the destination processor of element i of the local array as p d = mod fmod(i ? 1; x) + (P((i ? 1)=x) + p)x + yg=y ? 1; P]. Similarly in the receive phase, the source processor of each element of the local array can be determined by rst calculating its global index based on the new distribution and then the source processor based on the old distribution. These two calculations can be combined into a single expression to give the source processor of element i of the local array as p s = mod fmod(i ? 1; y) + (P((i ? 1)=y) + p)y + xg=x ? 1; P].
The drawback of this algorithm is that calculations are needed individually for all elements of the array. We propose two algorithms called the GCD method and the LCM method, which make use of the optimized KY TO Y and X TO KX algorithms, and hence require a lot less calculations.

GCD Method
This method takes advantage of the fact that we have developed special optimized algorithms for a cyclic(x) to cyclic(y) redistribution when x = k y (the KY TO Y algorithm) and y = k x (the X TO KX algorithm). In the GCD method, the redistribution is done as a sequence of two phases | cyclic(x) to cyclic(m) followed by cyclic(m) to cyclic(y), where m = GCD(x; y). Since both x and y are multiples of m, the KY TO Y algorithm can be used for the cyclic(x) to cyclic(m) phase, and the X TO KX algorithm can be used for the cyclic(m) to cyclic(y) phase. This is described in Figure 8. The GCD method involves the cost of having to do two separate redistributions. For small arrays, the increased communication may outweigh the savings in computation, but for large arrays in some cases, the performance is better than that of the general method. This can be observed from Figure 9 which shows the performance of a cyclic (15) to cyclic(10) redistribution, for an array of size 1M integers on the Delta. We see that for small number of processors, the GCD method performs considerably better than the general method because of the saving in the amount of computation per processor. Since the size of the global array is kept constant, as the number of processors is increased, the size of the local array in each processor becomes smaller and each processor spends less time on address calculation. Hence the performance improvement of the GCD method over the general method is also small.
One disadvantage of the GCD method is that in the intermediate cyclic(m) distribution, the block size m is smaller than both x and y. In the KY TO Y and X TO KX algorithms, all the address and processor calculations are done for blocks of size x or y. Since m is the GCD of x and GCD Method  calculations have to be done for each element, which is no better than in the general method. In this case, the general method is expected to perform better than the GCD method. Figure 10 shows the performance of cyclic(11) to cyclic(3) redistribution on the Delta for an array of size 1M integers. Since the GCD of 11 and 3 is 1, we nd that the general method always performs better than the GCD method.

LCM Method
The key to getting good performance in this two-phase approach for redistribution is to have a large value for m. One way of ensuring that m is always large is by choosing m as the LCM of x and y. Since m is a multiple of both x and y, the X TO KX algorithm can be used for the cyclic(x) to cyclic(m) phase and the KY TO Y algorithm can be used for the cyclic(m) to cyclic(y) phase. This is described in Figure 8. Also, since m is larger than both x and y, all calculations are done for this larger block size. This results in fewer calculations than in the GCD and general methods. Figures 9 and 10 compare the performance of the LCM, GCD, and general methods for an array of 1M integers on di erent number of processors. We observe that the LCM method performs better in all cases. Figure 11 compares the performance of the LCM and general methods for a cyclic(11) to cyclic(3) redistribution keeping the number of processors xed at 64 and varying the array size. We observe that for small arrays, the general method performs better because it has less communication, but for large arrays the LCM method performs better because the saving in computation is higher than the increase in communication.
Note that the timings for the GCD and LCM methods in Figures 9, 10, and 11 include the time for calculating the GCD and LCM. For the cyclic(15) to cyclic(10) redistribution, both the specialcase redistributions within the GCD and LCM algorithms were performed using the asynchronous method, since the condition ((k P) and (mod(P; k) = 0)) is satis ed in this case. For the cyclic(11) to cyclic(3) redistribution, however, the synchronous method was used in the KY TO Y  algorithm, since the condition ((k P) and (mod(P; k) = 0)) is not satis ed, and the asynchronous method was used in X TO KX algorithm, since it does not require the above condition. to calculate the sequence of local memory addresses that a given processor must access while doing a computation involving the regular array section A(l : h : s), when the array A has a cyclic(k) distribution. They show that the local memory access sequence is characterized by a nite state machine of at most k states. Stichnoth 14] de nes a cyclic(k) distribution as a disjoint union of slices, where a slice is a sequence of array indices speci ed by a lower bound, upper bound and stride (l : h : s). The processor and index sets for array assignment statements are calculated in terms of unions and intersections of slices.

Conclusions
We have presented e cient and practical algorithms for redistributing arrays between di erent cyclic(k) distributions, which is the most general form of redistribution. The algorithms are portable and independent of the communication mechanism used. We nd that the asynchronous method performs better than the synchronous method in all cases, because it reduces processor idle time. For the general case where there is no particular relation between x and y, the general method performs well for small arrays because it requires communication only once. However, for large arrays, the LCM method performs much better than the general method, because it requires a lot less address calculation. The GCD method also performs better than the general method for large arrays, provided the GCD of x and y is greater than 1. The LCM method always performs better than the GCD method because the LCM of x and y is always greater than their GCD.
The relative performance of the three methods may be a ected by changes in the underlying architecture of the system. For example, in a system with very high communication costs, the general method may perform better since it has only one communication phase. Improved scalar compilers that optimize expensive index calculations may also improve the performance of the general method.