Runtime array redistribution in HPF programs

Describes efficient algorithms for runtime array redistribution in High Performance Fortran (HPF) programs. We consider block(m)-to-cyclic, cyclic-to-block(m) and the general cyclic(x)-to-cyclic(y) type redistributions. We initially describe algorithms for one-dimensional arrays and then extend the methodology to multidimensional arrays. The algorithms are practical enough to be easily implemented in the runtime library of an HPF compiler and can also be directly used in application programs requiring redistribution. Performance results on the Intel Paragon are discussed.<<ETX>>


Introduction
High Performance Fortran (HPF) is a language designed to support portable high performance programming on a wide variety of machines, including massively parallel SIMD and MIMD systems and vector processors 4].It supports the data parallel programming model and uses Fortran 90 as the base language.HPF extends Fortran 90 in several areas by providing data distribution features, data parallel execution features, extended intrinsic functions and standard library, and EXTRINSIC procedures.At Syracuse University, we are developing an HPF compiler which translates HPF to Fortran 77 plus message passing node programs for distributed memory computers 1].
HPF provides directives (ALIGN and DIS-TRIBUTE) which specify how arrays should be partitioned among the processors of a distributed memory computer.Arrays are rst aligned to a template or index space.The DISTRIBUTE directive species how the template is to be distributed among the processors.In HPF, an array (or template) can be distributed as block(m) or cyclic(m) 4].Though the distribution of an array is speci ed at compile time, there are a number of reasons for which it may be necessary to redistribute an array during the execution of Also with the Dept. of Electrical and Computer Eng., Syracuse University y Also with the Dept. of Computer and Information Science, Syracuse University the program.
HPF has directives REDISTRIBUTE and RE-ALIGN which require arrays to be remapped.It is not practical to write intrinsic and runtime libraries for all possible distributions.Libraries can be written for a few common distributions and for any other distribution it is necessary to redistribute before calling the subroutine.This approach also eliminates the need for complex inter-procedural analysis.After returning from the subroutine it is necessary to redistribute back to the original distribution.
In many applications such as 2D FFT and the ADI method for solving multidimensional PDEs, dynamic redistribution can result in signi cant performance improvement 2].Array redistribution is an expensive operation as it involves data communication as well as computation of destination processors and addresses.It is necessary to do redistribution e ciently, otherwise the overhead of redistribution itself will o set the bene t of using a di erent distribution.In this paper, we present algorithms for redistributing arrays e ciently.We consider block(m) to cyclic, cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions.We consider the redistribution of one-dimensional arrays in Sections 3, 4 and 5 and then extend the methodology to multidimensional arrays in Section 6.We have concentrated on making the algorithms practical enough to be easily implemented in the runtime library of an HPF compiler.They can also be directly used in application programs requiring redistribution.The algorithms make good use of cache and optimize communication.We do not make any special assumptions about the machine architecture or the message passing paradigm provided on the machine.

Notations and De nitions
The notations used in this paper are given in Figure 1.We assume that all arrays are indexed starting N global array size P number of processors p i logical processor i p logical number of the processor executing the program p s source processor p d destination processor L local array size m block size Figure 1: Notations used in this paper from 1, while processors are numbered starting from 0 and that arrays are identically aligned to the template.The algorithms can be easily extended for the general case.Also, the algorithms do not specify how to perform data communication because the best communication algorithms are often machine dependent.These communication algorithms are described in detail in 9, 7].We do assume that all the data to be sent from any processor i to processor j has to be collected in a packet in processor i and sent in one communication operation, so as to minimize the communication startup cost.In the rest of this paper, any division involving integers should be considered as integer division unless speci ed otherwise.
The block(m) and cyclic(m) distributions in HPF are de ned as follows.Consider an array of size N distributed over P processors.Let us de ne the ceiling division function CD(j; k) = (j+k?1)=kand the ceiling remainder function CR(j; k) = j?k CD(j; k).Then block(m) distribution means that index j of the array is mapped to logical processor number CD(j; m) ? 1.
Note that for a valid block(m) distribution, m P N must be true.Block by de nition means the same as block(CD(N; P)).In a cyclic(m) distribution, index j of the global array is mapped to logical processor number MOD(CD(j; m) ?1; P).Cyclic by de nition means the same as cyclic (1).
In other words, in a block distribution, contiguous blocks of the array are distributed among the processors.In a cyclic distribution, array elements are distributed among processors in a round-robin fashion.In a cyclic(m) distribution, blocks of size m are distributed cyclically.The cyclic(m) distribution with 1 < m < dN=Pe is commonly referred to as a blockcyclic distribution with block size m 5].Block and cyclic distributions are special cases of the general cyclic(m) distribution.A cyclic(m) distribution with m = dN=Pe is a block distribution and a cyclic(m) distribution with m = 1 is a cyclic distribution.The formulae for conversion between local and global indices for the di erent distributions are given in Table 1.

Block(m) to Cyclic Redistribution
We rst consider the case of block(m) to cyclic redistribution.
Theorem 3.1 Let l i1 and l i2 be the local array sizes in processor p i corresponding to block(m) and cyclic distributions respectively.In a block(m) to cyclic redistribution, the number of processors to which p i has to send data is P ? 1 if l i1 P l i1 or l i1 ? 1 if l i1 < P The number of processors from which p i has to receive data is P ? 1 if l i2 P l i2 or l i2 ? 1 if l i2 < P Proof: Note that if N is not a multiple of P, l i1 may not be equal to l i2 .If l i1 < P, each of the l i1 elements of p i corresponding to a block(m) distribution will lie in a di erent processor when the array is distributed cyclically.At most one of the l i1 elements will be remapped to processor p i itself.Therefore, p i will have to send data to either l i1 or l i1 ? 1 processors.If l i1 P, then clearly p i has at least one element to send to every other processor.The result for the number of processors from which p i has to receive data can be proved similarly. 2 The algorithm for block(m) to cyclic redistribution is given in Figure 2. In the send phase, each processor only needs to calculate the destination processor of the rst element of the local array.The other elements have to be sent to the other processors in a round-robin fashion.Thus the array is scanned only once, which makes good use of the cache.In the receive phase, the data received from other processors has to be stored in contiguous memory locations in order of logical processor number.In every processor, the packet received from processor 0 is stored rst; from processor 1 second and so on.Hence addresses do not need to be calculated.If the amount of data to be received from all processors is known, the packets can be directly received into appropriate locations in the array.
In a block(m) distribution, the last element N of the global array is mapped to processor p N = (N ?1)=m.If p N < P ?1, no elements of the array are mapped to processors p N +1; p N +2; :::; P ?1.The index of the last element of the local array in processors p < p N is last index = m.The index of the last element in p N is last index = MOD(N ?1; m)+1.The index of the rst element of p s p N that is mapped to p in a cyclic distribution is given by first index = MOD p ? MODfps MOD(m; P); Pg + P; P] + 1 If m is divisible by P, the rst element of p s that is mapped to p is p+1.However, if m is not divisible by Table 1: Data Distribution and Index Conversion Note: This assumes that arrays are indexed starting from 1 and processors are numbered starting from 0 CD(j; k) = (j + k ?1)=k and CR(j; k) = j ?k CD(j; k) global index (g) to p = CD(g; m) ? 1 p = MOD(g ?1; P) p = MOD(CD(g; m) ?1; P) processor number (p) global index (g) l = m + CR(g; m) l = (g ?1)=P + 1 l = MOD(g ?1; m) + 1+ to local index (l) (g=(m P))m local index (l) to g = l + m p g = (l ?1)P + p + 1 g = MOD(l ?1; m) + 1+ global index (g) (P((l ?1)=m) + p)m P, a shift is introduced in this simple mapping which is taken into account by the MOD expression in the above equation.Hence, the number of elements to be sent from any processor p s to p is 0; if (last index < rst index) or (ps > pN) (last index ?rst index)=P + 1; otherwise where first index, last index and p N are calculated as above.

Cyclic to Block(m) Redistribution
A cyclic to block(m) redistribution is essentially the reverse of a block(m) to cyclic redistribution.The algorithm for cyclic to block(m) redistribution is given in Figure 4.In a cyclic distribution, the size of the local array in processor p is L = dN=Pe if MOD(N; P) = 0; or p < MOD(N; P) dN=Pe ? 1 otherwise In the send phase, each processor p calculates the destination processor p d and the destination local address l d of the rst element of its local array as p d = CD(p + 1; m) ? 1 and l d = m + CR(p + 1; m).The rst (m ?l d )=P + 1 elements from p s have to be sent to p d .The next set of elements starting from i = (m ?l d )=P + 2 have to be sent to processor p d1 = CD((i ?1)P + p + 1; m) ? 1.The destination local address of element i is given by l d1 = m+CR((i?1)P+p+1;m) and so (m?l d1 )=P +1 elements starting from i have to be sent to processor p d1 .This process is continued for the remaining elements of the array.Since all the elements to be sent to a particular processor are located in contiguous memory locations, there is no need to create packets.
In the receive phase, the packets received cannot be directly stored in the array as the data has to be stored with a stride equal to the number of processors.Hence the packets have to be stored in a temporary bu er in memory.This gives us two choices in implementing the receive phase :{ 1. Synchronous Method: Each processor waits till it receives packets from all other processors, before placing any data in the local array.This increases the memory requirements of the algorithm and also increases the processor idle time.These problems worsen as the number of processors is increased, so this method is not scalable.
2. Asynchronous Method: The processors do not wait for data from all processors to arrive.Instead, each processor takes any packet which has arrived and places the data from that packet into appropriate locations in the local array.This method overlaps computation and communication.Each processor posts non-blocking receive calls and waits for any packet to arrive.As soon as a packet is received, it places the data in appropriate locations in the local array.During this time, other packets may reach the processor.When the processor has placed all the data from the earlier packet into the local array, it takes up the next packet and so on.This reduces processor idle time.Since all packets do not have to be in memory at the same time, it also reduces memory requirements.This method is scalable as neither processor idle time nor memory requirements increase as the number of processors is increased.The array locations where incoming data has to be stored can be calculated as follows.The source processor (p s ) of the rst element of the local array is given by p s = MOD(m p; P).The next (P ? 1) elements will be received from the remaining processors in order of processor number.This cycle is repeated for all elements of the local array.If all packets are present in memory (Synchronous Method), the local array needs to be scanned only once to be lled.If the packets are processed one at a time (Asynchronous Method), the array elements are lled with stride P and the array has to be scanned P times.So, clearly the Synchronous Method makes better use of the cache than the Asynchronous Method.Figure 3    Put element i into the packet for processor MOD(pd + i; P).The index of the rst element of ps mapped to p is first index = MOD p ? MODfps MOD(m; P); Pg + P; P] + 1 8.
Number of elements to be received from ps is 0, if (last index < first index) (last index ?first index)=P + 1, otherwise 9.No data is to be received from processors pN + 1; pN + 2; :::; P ? 1.
10 Read the incoming data directly into appropriate locations in the array.) elements and the number of processors varied between 2 and 32.We observe that the Asynchronous Method performs much better than the Synchronous Method as it overlaps computation and communication and thus reduces processor idle time.This difference is larger for a small number of processors because in this case, the amount of data communicated per processor is larger.
5 Cyclic(x) to Cyclic(y) Redistribution For a general cyclic(x) to cyclic(y) redistribution, there is no direct algebraic formula to calculate the set of elements to send to a destination processor and the local addresses of these elements at the destination 6].Gupta et al 6] propose a virtual processor approach in which a block-cyclic distribution is considered to be either a virtual block or a virtual cyclic distribution using many virtual processors per physical processor.
There has also been some research on a slightly different problem of determining the local addresses and communication sets for array assignment statements like A(l 1 : h 1 : s 1 ) = B(l 2 : h 2 : s 2 ) where A and B have di erent cyclic(m) distributions.Chatterjee et al 3] present an approach to calculate the sequence of local memory addresses that a given processor must access and the corresponding communication sets, using a nite state machine model which requires solving linear Diophantine equations.Stichnoth 8] de nes a cyclic(m) distribution as a disjoint union of slices, where a slice is a sequence of array indices speci ed by a lower bound, upper bound and stride (l : h : s).The processor and index sets are calculated in terms of unions and intersections of slices which also involves solving linear Diophantine equations.
We propose a simpler and faster method for the cyclic(x) to cyclic(y) redistribution, which can be easily implemented and does not involve the overhead of solving linear Diophantine equations.We propose special methods for the cases where x is a multiple of y or y is a multiple of x.We use a general method when there is no particular relation between x and y.

Special case x = k y
We rst consider the special case where x is a multiple of y.Let x = k y.Theorem 5.1 In a cyclic(x) to cyclic(y) redistribution where x = k y, if k < P, each processor communicates with k or k ? 1 processors.If k P, each processor communicates with all other processors.Proof: Assume k < P. Since x = k y, each block of size x is divided into k sub-blocks of size y and distributed cyclically.Consider any processor p i .Assume that it has to send its rst sub-block of size y to processor p j .Then the remaining k ? 1 sub-blocks of the rst block are sent to the next k ? 1 processors in order.The next k(P ? 1) sub-blocks of the global array are located in the other P ? 1 processors.This results in a total of k P sub-blocks.Hence the (k+1) th sub-block of size y in p i is also sent to p j .Thus all Send Phase 1. i = 1 2. While (i L) do
Elements i to i + (m ?ld)=P have to be sent to processor pd.

Receive Phase
Synchronous Method: 1. Receive all packets.2. Get the rst element of the local array from the packet received from processor ps = MOD(m p; P). 3.For i = 2 to L do 4.
Get element i of the local array from the packet received from processor ps = MOD(ps + 1; P).
Asynchronous Method: 1. Source processor of rst element of the local array is ps = MOD(m p; P).
Receive a packet from any processor (say pi) 5.
Starting from the location calculated above, place elements from the packet into the array with stride P.
Figure 4: Algorithm for Cyclic to Block(m) Redistribution sub-blocks from p i are sent to k processors starting from p j .One of these processors may be p i itself, in which case p i has to send data to k?1 processors.For the receive phase, consider the rst k P sub-blocks of size y in the global array corresponding to the rst P blocks of size x.Let us number these k P sub-blocks from 0 to k P ? 1.Out of these, the sub-blocks that are mapped to processor p i in the new distribution are numbered p i to P(k?1)+p i with stride P.These subblocks come from fP(k?1)+pig?pi P + 1 = k processors.One of these processors might be p i itself, in which case p i receives data from k ? 1 processors.
If k P, each block of size x has to be divided into k sub-blocks and distributed cyclically, where the number of sub-blocks is greater than or equal to the number of processors.So clearly each processor has to send to and receive from all other processors (all-to-all communication).2 The algorithm for cyclic(x) to cyclic(y) redistribution, where x = k y is given in Figure 5.In the send phase, each processor p calculates the destination processor p d of the rst element of its local array as p d = MOD(k p; P).The rst y elements have to be sent to p d , the next y to MOD(p d +1; P), the next to MOD(p d +2; P) and so on till the end of the rst block of size x.The next k sub-blocks of size y have to be sent to the same set of k processors starting from p d .The sequence of destination processors can be stored and need not be calculated for each block of size x.In the receive phase there are two cases depending on the value of k :-1.(k P) and (MOD(P; k) = 0) : In this case, each processor p calculates the source processor of the rst block of size y of its local array as p s = p=k.The next block of size y will come from processor MOD(p s +P=k; P), the next from MOD(p s + 2(P=k); P) and so on till the rst k blocks.The above sequence of processors is repeated for the remaining sets of k blocks of size x and hence can be stored and used.If the Synchronous Method is used for receiving data, the local array needs to be scanned only once and the i th block, 0 i dL=ye ? 1, of size y of the local array will be read from the packet received from processor MOD(p s + i(P=k); P).If the Asynchronous Method is used, the rst block from the packet received from some processor p i will be stored starting at the location calculated above.The remaining blocks will be stored with stride x. 2. If k does not satisfy the above condition, it is necessary to calculate the source processor of the rst element (j = i y + 1) of each block of size y, 0 i dL=ye ? 1, of the local array as p s = MOD (i P + p)=k; P].The block is read from the packet received from p s .The sequence of processors does not repeat itself and hence cannot be stored.In this case, the Synchronous Method is used.Figure 6 compares the performance of the Synchronous and Asynchronous Methods on the Intel Paragon for a cyclic(4) to cyclic(2) redistribution of a global array with 1M elements.As in the case of cyclic to block(m) redistribution, we observe that the Asynchronous Method performs better than the Synchronous Method even though in this case each processor communicates with at most two other processors.

Special case y = k x
We now consider the special case where y is a multiple of x.Let y = k x.This is essentially the reverse of the case where x = k y.The algorithm for cyclic(x) to cyclic(y) redistribution, where y = k x, is given in Send Phase 1. Create packets for communication.2. Calculate the destination processor (pd) of the rst element of the local array as pd = MOD(k p; P). 3.For each block of size x in the local array do 4.
Put elements (i y + 1) to (i + 1)y of the current block of size x into the packet for processor MOD(pd + i; P).For i = 0 to k ? 1 do 6.
Read the next block of size y from the packet received from processor MOD(ps + i(P=k);P).
Asynchronous Method: 3. The i th block of size y, 0 i k ? 1, will be received from processor MOD(ps + i(P=k);P).4. For i = 0 to k ? 1 do 5.
Receive a packet from any processor pi.

6.
Place the rst block of size y starting from the location calculated above and the other blocks with stride x.
Calculate the source processor (ps) of the rst element (j = i y + 1) of this block of size y as ps = MOD (i P + p)=k; P] 11.
Read the block from the packet received from ps.  ; P) and so on till the rst k blocks.The above sequence of processors is repeated for the remaining sets of k blocks of size x, and hence need not be calculated again.2. If k does not satisfy the above condition, it is necessary to individually calculate the destination processor of each block i of size x, 0 i dL=xe?
1, as p d = MOD (i P + p)=k; P].In the receive phase, each processor p calculates the source processor of the rst element of its local array as p s = MOD k p; P].If the Synchronous Method is used to receive data, for each block of size y of the local array, the k sub-blocks of size x are read from the packets received from the k processors starting from p s in order of processor number.If the Asynchronous Method is used, we know that the i th block of size x of the local array, 0 i k ? 1, will be received from processor MOD(p s + i; P).Thus the local index of the rst block received from any source processor can be calculated.The remaining blocks have to be stored with stride y.

General Case
In a general cyclic(x) to cyclic(y) redistribution in which there is no particular relation between x and y, there is no direct way to determine which indices have to be sent to which processor and where in the local array to place incoming data.For this case, the following general method is proposed.In the send phase, each processor p scans the local array and calculates the destination processor of each element i as p d = MOD fMOD(i ?1; x) + (P((i ?1)=x) + p)x + yg=y ?1; P].The element is put in a packet for that processor.In the receive phase, each processor p calculates the source processor p s for each element i in the local array as p s = MOD fMOD(i?1; y)+(P((i?1)=y)+p)y+xg=x?1;P].After the packets from other processors are received, the processor fetches each element from the appropriate packet in order.For i = 0 to k ? 1 do 6.
Place the next block of size x of the local array into the packet for processor MOD(pd + i(P=k); P).Calculate the destination processor (pd) of the rst element (j = i x + 1) of this block of size x as pd = MOD (i P + p)=k; P]. 10.
Place this block into the packet for pd.
11 Exchange packets with other processors.
Receive Phase For i = 0 to k ? 1 do 6.
Read elements (i x + 1) to (i + 1)x of the current block of size y from the packet received from processor MOD(ps + i; P).
Asynchronous Method: 3. The i th block of size x, 0 i k ? 1, will be received from processor MOD(ps + i; P). 4. For i = 0 to k ? 1 do 5.
Receive a packet from any processor pi.

6.
Place the rst block of size x starting from the location calculated above and the other blocks with stride y.The redistribution of two and higher dimensional arrays can be classi ed into two types :-1.Shape Retaining: The shape of the local array remains unchanged after the redistribution, eg.(block,block) to (cyclic,cyclic).
2. Shape Changing: The shape of the local array changes after the redistribution, eg.(block,*) to (*,block) where '*' indicates that the corresponding dimension is collapsed.This type of redistribution is required very often for multidimensional FFT and the ADI method.The shape retaining and shape changing redistributions are quite di erent from each other and require di erent algorithms.

Shape Retaining Redistributions
A shape retaining redistribution may involve redistribution in only one dimension feg.(block,block) to (cyclic,block)g or more than one dimension feg.
(block,block) to (cyclic,cyclic)g.If the redistribution is only along one dimension, it is similar to the redistribution of one-dimensional arrays and the same algorithms described earlier can be used.If both dimensions have to be redistributed, it can either be done directly or indirectly as a series of one-dimensional redistributions.In the indirect method, the array is redistributed separately along each dimension.For example, if an array has to be redistributed from (block,block) to (cyclic,cyclic), it is rst redistributed to (block,cyclic) and then to (cyclic,cyclic).This method has the advantage that all the optimizations developed for one-dimensional arrays in the previous sections can be easily extended to multidimensional arrays.The order in which the dimensions are redistributed does not a ect the performance.This is because the order of dimensions chosen only results in a di erent set of data values being communicated, and does not a ect the amount or type of communication.So, one could also do the above redistribution as (block,block) to (cyclic,block) to (cyclic,cyclic).
In the direct method, data is sent directly to the destination processor corresponding to the new distribution.Hence the optimized algorithms developed for the one-dimensional case cannot be used.This method requires di erent algorithms for di erent number of dimensions and di erent types of redistributions and these algorithms cannot be optimized much.However data needs to be communicated only once in the direct method.Figure 8 compares the performance of the direct and indirect methods for redistributing an array of size 1K 1K from (block,block) to (cyclic,cyclic) on the Intel Paragon.The indirect method is found to perform much better even though data is communicated twice.This is because the algorithms for one-dimensional redistribution are highly optimized and a lot of the communication during each one-dimensional redistribution actually takes place in parallel.We have observed similar results for other array sizes also.Hence the indirect method is preferred.

Shape Changing Redistributions
This type of redistribution occurs when at least one dimension of the array is collapsed before or after the redistribution.Consider the redistribution from (X,*) to (*,Y) or vice-versa, where X and Y can be either block, cyclic or cyclic(m).This is basically a collapsed to distributed type of redistribution in one of the dimensions which is done as follows.Each processor divides the local array into packets along the collapsed dimension, depending on the type of the new distri- The other type of redistribution involving change of shape of the local array is of the type (X,*) or (*,X) to (Y,Z), or vice versa.That is, in either the source or target distributions, one of the dimensions is collapsed and in the corresponding target or source distributions, both the dimensions are distributed.Each case of this type requires a di erent algorithm and so has to be considered separately.

Conclusions
Runtime array redistribution is required very often in HPF programs and it needs to be done e ciently in order to get good performance.We have described e cient algorithms for block(m) to cyclic, cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions.The algorithms try to minimize address calculation and communication, and make good use of the cache.They are also practical enough to be easily implemented in the runtime library of an HPF compiler, or directly in application programs requiring redistribution.We have tested the performance of the algorithms on the Intel Paragon and observed that the Asynchronous Method performs better than the Synchronous Method as it overlaps computation and communication.
compares the

1 .
Create packets to send to other processors.2. Calculate the destination processor (pd) of the rst element of the local array as pd = MOD(p m; P).

3 .
Put the rst element into the packet for processor pd. 4. For i = 2 to L do 5.

Figure 2 :
Figure 2: Algorithm for Block(m) to Cyclic Redistribution

6 .
Exchange packets with other processors.Receive Phase 1.If (k P) and (MOD(P; k) = 0) then 2. Calculate the source processor (ps) of the rst element of the local array as ps = p=k.Synchronous Method: 3. Receive packets from all processors.4. For j = 1 to dL=xe do 5.

Figure 5 :
Figure 5: Algorithm for Cyclic(x) to Cyclic(y) Redistribution, where x = k y

Send Phase 1 .
Create packets for communication.2. If (k P) and (MOD(P; k) = 0) then 3. Calculate the destination processor (pd) of the rst element of the local array as pd = p=k.4. For j = 0 to dL=ye ? 1 do 5.

1 .
Calculate the source processor (ps) of the rst element of the local array as ps = MOD k p; P].Synchronous Method: 3. Receive packets from all processors.4. For each block of size y in the local array do 5.

Figure 7 :
Figure 7: Algorithm for Cyclic(x) to Cyclic(y) Redistribution, where y = k x 6 Redistribution of Multidimensional Arrays