Array Decompositions for Nonuniform Computational Environments

Two-dimensional arrays are useful in a large variety of scientific and engineering applications. Parallelization of these applications requires the decomposition of array elements among different machines. Several data-decomposition techniques have been studied in the literature for machines with uniform computational power. In this paper, we develop new methods for decomposing arrays into a cluster of machines with nonuniform computational power. Simulation results show that our methods provide superior decomposition over naive schemes.


Introduction
Data-parallel applications requires the partitioning of data among processors in a way that the computation load on each node is proportional to its computational power, while minimizing communication. Two-dimensional arrays are widely used in scienti c and engineering problems such as weather prediction and image processing. In this paper we discuss the decomposition of twodimensional arrays for a nonuniform computational environment (NUCE). This environment is made up of machines with di erent computational power, and the arrays must therefore be divided unequally among processors.
Most previous work has been done on parallel machines with uniform computational power. For these machines, decomposition of arrays is relatively simple. The array is partitioned equally along one or more dimensions. This is re ected in the BLOCK directive of High-Performance Fortran 1]. Belkhale and Banerjee 5] have proposed methods for partitioning a nonuniform grid. Grid problems with varying resource demands have been studied by Nicol and Saltz 13]. Nicol 12] has described the decomposition of unstructured grids into perfectly rectilinear partitions.
There has been very little work done on decomposition of arrays for nonuniform machines. Crandall and Quinn 10] have presented an algorithm for decomposing a two-dimensional data array for a heterogeneous cluster of workstations. This algorithm uses an orthogonal recursive bisection to perform the decomposition. Snyder and Socha 15] have developed a polynomial time algorithm for allocating an I J array of array points to a K K array of processors. However, their algorithm is restricted to uniform processors. Berger and Bokhari 6] have addressed the problem of decomposing a domain into multiprocessors where subparts of the domain require di erent computational loads. The domain is divided into rectangles by recursively dividing the partitions into subpartitions of equal computational load until each processor has been allocated its share of the domain.
Although our work is speci cally targeted towards a cluster of workstations, it can be generalized to a heterogeneous computing environment consisting of clusters of workstations, clusters of supercomputers, or a hybrid of both. An example of such an environment is shown in Figure 1. For such cases a multilevel decomposition may be required. An array may be decomposed among several higher-level machines, followed by decomposition within each machine.
The algorithms developed in this paper assume that all the processors are connected by an interconnection network in which the cost of unit communication is the same between all the processors (e.g., a bus). We show that the number of possible decompositions is at least O(p!2 p ), where p is the number of processors. Thus, trying all possible cases is not possible beyond a small value of p. Our simulation results show that the total amount of communication produced by our algorithms is smaller as compared to naive approaches.
The rest of this paper is organized as follows. Section 2 provides a formal description of the problem. In Section 3 we discuss the decomposition of a two-dimensional m n array into k (2 k 4) partitions. Section 4 introduces a special class of decompositions. In Section 5 we discuss the e ect of latency on the algorithms described in Section 4. In Section 6, several decomposition algorithms are presented and performance comparison of the algorithms is presented.  NUCE's are made up of processors with di erent computational power connected by a network. A special case of a NUCE is a cluster of heterogeneous workstations that are connected by Ethernet or Token Ring. An e cient decomposition would minimize interprocessor communication while simultaneously balancing the computational load among the processors in NUCE. We consider only decompositions that assign to each processor a rectangular (or square) subarray, thus decompositions such as are shown in Figure 2 those not allowed. This restriction is necessary for two reasons: 1. The local data assigned to each node is an array, making storing and referencing of data on a local processor e cient.
2. The boundaries of the local subarray are parallel to the length and the width, which makes nding the owner of a particular element relatively simple.
Several ways of decomposing a 2-D array for a four-processor NUCE such that each partition satis es the above property are shown in Figure 3. Each decomposition can be looked at as a planar graph with a number of nodes and edges. Edges can be classi ed as internal edges or external edges. Internal edges are shown by dotted lines in Figure 3. All other edges are external edges. We would like to divide the array in such a fashion that the total communication cost is minimized. Clearly, the total communication cost depends on the way data is accessed for a given application. However, for most scienti c applications the cost is minimized if the sum of the length of internal edges (representing the communication) is minimized. We will use the term Acost to represent this sum. It is worth noting that the minimization of Acost reduces the sum of the perimeter of all the rectangles, thus a decomposition that provides the best partitioning would, on the average, have each partition close to a square. In the above discussion we have assumed that the cost of communication for a unit of data is the same across all machines. In some cases the sum of internal edges, along with the external edges, is a better representation of communication cost. This is required for cases in which interactions are wraparound along rows and/or columns. If two parallel external edges belong to the same partition, then they are not counted in the cost since they do not result in any communication. We will use the term Bcost to represent this sum. In Figure 4 Acost is m + n 0 and Bcost is 2m + 2n 0 . Let a 1 ; a 2 ; a 3 : : :; a p be the relative computational power of di erent processors such that a 1 + a 2 + a 3 : : :a p = 1 and a 1 a 2 a 3 a 4 : : :: a p . The problem can then be stated formally as follows: Decompose the m n array into p partitions such that the size of each partition is in the ratio a 1 : a 2 : a 3 : a 4 : : :a p ; and Acost (or Bcost) is minimized. In practice the cost of communication on a local area network can be modeled as O( + B), where is the software overhead of sending a message (also known as the startup latency), is the transfer rate and B is the size of the message. The startup latency for communication is typically large and can have an impact on the total cost of the communication. The cost measures studied so far do not consider this setup time. We will discuss the e ect of startup latency in Section 5.

Decomposition for a small number of partitions
In this section we present several ways of decomposing a two-dimensional m n array into k (2 k 4) partitions such that the size of each partition is a i mn; 1 i k. Our basic intention is to demonstrate some key ideas that could be used for the development of algorithms for large values of k.

Two partitions
There are four possible ways to decompose an array into two partitions ( Figure 5). The decomposition in Figure 5 (a) has Acost equal to m and Bcost equal to 2m. For decomposition in Figure 5 (b) Acost is n and Bcost is 2n. Since m n, decomposition in Figure 5 (a) is always better than decomposition in Figure 5 (b) using either measure. The cost of decomposition in Figure 5 (c) and Figure 5 (d) is equal to that of Figure 5 (a) and Figure 5 Figure 6: Partitioning an array into three partitions (a) Acost = m + n(a 2 + a 3 ), Bcost = 2m + 2n(a 2 + a 3 ); (b) Acost = n + m(a 2 + a 3 ), Bcost = 2n + 2m(a 2 + a 3 ); (c) Acost = 2m, Bcost = 3m: There are 36 ways to decompose an array into three partitions. Three representative decompositions are shown in Figure 6. The decomposition in Figure 6 (a) has Acost equal to m + n(a 2 + a 3 ) and Bcost equal to 2m + 2n(a 2 + a 3 ): Decomposition derived by interchanging a 1 with a 2 or a 3 will increase the total cost. For example, if we interchange a 1 with a 2 , then Acost is m + n(a 1 + a 3 ), m + n(a 1 + a 3 ) m + n(a 2 + a 3 ), because a 1 a 2 . Also, the decomposition achieved by rotating (interchanging the orientation of internal edges by 90 degrees) the decomposition in Figure 6 (a) will result in an increase of communication cost (Figure 6 (b)). For example, the decomposition in Figure (b) has Acost equal to n + m(a 2 + a 3 ), n + m(a 2 + a 3 ) m + n(a 2 + a 3 ), because n m.
Similarly, rotating the decomposition in Figure 6 (c) will lead to a larger communication cost.
Acost of decomposition in Figure 6 (a) is better than Acost of Figure 6 (c) if m n(a 2 + a 3 ).
The following observations can be made for achieving the decomposition with the least cost: 1. Partitions should be considered in decreasing order of their size.
2. Partitioning should be done along the longer dimension.

Four partitions
There are at least 384 possible ways to decompose an array into four partitions ( Figure 7 shows four of them). If we rotate these decompositions we will get another four decompositions. For each of the eight decompositions, we can get 4! di erent decompositions by interchanging a 1 ; a 2 ; a 3 ; a 4 with each other. Acost and Bcost for these cases are shown in Figure 7. Interchanging a's in any of the decompositions shown will not reduce Acost or Bcost. For example, if we interchange a 1 with a 2 in Figure 7 (c), then Acost is m + 3n(a 1 + a 3 + a 4 ), m + 3n(a 1 + a 3 + a 4 ) m + 3n(a 2 + a 3 + a 4 ), because a 1 a 2 .
Another decomposition of an array into four partitions is shown in Figure 8. The structure of the decomposition in Figure 8 is di erent from the ones described in Figures 5 through 7, because the latter can be derived by partitioning along X-axis (Y -axis), followed by partitioning each of the subarrays along Y -axis (X-axis).
The Acost and Bcost of the decomposition in Figure 8 are m + m(a 3 + a 4 )=(a 2 + a 3 + a 4 ) + n(a 2 + a 3 + a 4 ), and 2m + m(a 3 + a 4 )=(a 2 + a 3 + a 4 ) + 2n(a 2 + a 3 + a 4 ), respectively. We get the decomposition in Figure 8 by rotating the dotted subpart of Figure 7 (b) (see Figure   9). The Acost of decomposition in Figure 8 is better than Figure 7 (b) only when m(1 ? (a 3 + a 4 )=(a 2 + a 3 + a 4 )) ? na 2 > 0 , m(a 2 =(a 2 + a 3 + a 4 )) > na 2 , m > (a 2 + a 3 + a 4 )n: This corresponds to the fact that after the rst division (removing a 1 ), the length of the remaining rectangle is less than the width, and the decomposition for the subpart corresponds to the best decomposition for k = 3 with a 2 , a 3 , and a 4 . This does not hold for Bcost, because a 2 in Figure 7 (b) has a free communication edge along the width. For Bcost, the decomposition in Figure 8 is thus better than Figure 7 (b) only when m(1 ? (a 3 + a 4 )=(a 3 + a 4 + a 2 )) > 2(a 2 )n ) m > 2(a 2 + a 3 + a 4 )n:

Discussion
From the special cases which we studied in this section we make the following observations: 1. A good method for minimizing Acost seems to be to look at all the decompositions of the type given in Figure 7. These decompositions can be achieved by partitioning along the larger dimension into several subarray, followed by partitioning each of the subarrays along the other dimension. 2. To achieve the decomposition with a good cost, the partitions should be considered in decreasing order of their sizes.
3. The partitioning should be done along the longer dimension. Whenever the remaining length is less than the remaining width, switching directions will generally, produce better decomposition.
For most decompositions, especially for a large number of partitions, the di erence between Acost and Bcost will be equal to the perimeter of the array. Hence a good decomposition for one of the cost measures will also be a good decomposition for the other cost measures. The exceptions to the rule are the cases in which a given two parallel external edges belong to the same partition.
In the following we will limit our discussion and analysis to only one of the cost measures, Acost.
However, most of the discussion can potentially be extended to the other cost measure.

XY decompositions
Based on our observation in Section 3, we nd that partitioning along the dimension with the larger size, followed by partitioning along the smaller size, leads to good decomposition in most cases.
We will use the term XY decompositions to represent this class. Without loss of generality we will assume that the X-axis is larger of the two array dimensions. An example of such a decomposition for 9 partitions is given in Figure 10. We will use the term vertical subarrays to denote the subarrays formed by partitioning along the X-axis, and the term horizontal subarrays to denote the subarrays formed by partitioning each vertical subarray along the Y-axis. Each horizontal subarray corresponds to one partition and we will use these terms interchangeably. We make the following observations regarding XY decompositions. Let the number of vertical subarrays be v. We de ne count i ; 1 i v to be the number of horizontal subarrays (partitions) assigned to the vertical subarray i.
Lemma 1 Vertical subarrays can be interchanged so that they are in an increasing order of their count values without increasing the cost.
Lemma 2 The partitions within vertical subarrays can be arranged in a decreasing order of their size without increasing the cost.
where v is the number of vertical subarrays, count i is the number of partitions in the i-th vertical subarray, and K i = P i?1 x=1 (count i ).
To prove Theorem 1 we need only to prove the following two general cases: 1. k (a x + a y + ) + k (a i + a j + ) + =k (a y + a x + ) + k (a i + a j + ) + 2. k (a x + a y + ) + k (a i + a j + ) + k (a i + a y + ) + k (a x + a j + ) + such that k k , a x a i and a x a y .
The rst case is always true, since addition is commutative. To prove the second case we need only to prove the following: k (a x + a y + ) + k (a i + a j + ) ? k (a i + a y + ) ? k (a x + a j + ) 0 , (k a x + k a i ) ? (k a i + k a x ) 0 , (k a x ? k a i ) + (k a i ? k a x ) 0 , k (a x ? a i ) + k (a i ? a x ) 0 , (k ? k )(a i ? a x ) 0; which is always true since k k and a x a i . Thus, one can always move a horizontal subarray with a high value of a i to the left by interchanging it with a horizontal subarray having a value lower than a i without increasing the cost. 2 The number of potentially di erent arrangements can be derived by nding the number of ways p can be partitioned into k parts, 1 k p such that x 1 + x 2 + x 3 + x k = p, where x 1 , x 2 , x 3 , : : : , x k are integers. This can be shown to be 2 p?1 . For each such case there are p! ways of arranging a 1 , a 2 , : : :, a p . Thus the total number of XY decompositions is O(p!2 p?1 ). Let pt(p) represent the number of cases in which x 1 x 2 x 3 : : : x k . For example, pt(5) is equal to 7, namely, 5 = 4+1 = 3+2 = 3+1+1 = 2+2+1 = 2+1+1+1 = 1+1+1+1+1. Theorem 1 thus reduces the possible cases to be considered for nding the best XY decomposition from p!2 p?1 to pt(p), where p is the number of partitions. There is no closed-form expression for pt(p) 9]. Table  1 gives the comparison for small values of p.
Lemma 3 Let the communications cost of a decomposition produced by partitioning an m n array along an X-axis followed by partitioning it along a Y-axis be Acost = dm + kn: Then the communication cost of the same array with the same decomposition rotated (partitioning along the Y-axis followed by partitioning along the X-axis) will be better only if k > d. Lemma 4 If the partitions are equal, then Acost is given by: min 1 l p f (bp=lc(bp=lc ? 1))(l ? p mod l) + (dp=le(dp=le ? 1))(p mod l)]na i + lmg:

Latency considerations
In the previous section latency was ignored in the communication cost. Latency in current implementations of message-passing software is two to three orders of magnitude larger than the transmission time required for sending a byte of information across the network. This can be signi cantly reduced by decreasing the number of software layers which the message-passing software is built upon. Several such e orts are currently underway. In this section we discuss the e ect of latency on some of the results derived in the previous section. On a broadcast network like Ethernet every message is received on all the nodes. The lower layers ignore the messages not destined for the node. However, future message-passing software should support a multicast. In such cases the software overhead of sending multiple messages over a broadcast network could be reduced by multicasting one message with the combined data for all the processors together. Each receiver would use the part for which it was the destination, which would reduce the overhead cost without increasing the actual cost of transmission. For the above scenario the latency overhead for each processor would be xed independently of the number of processors a given processor needs to send a message. Hence, the results discussed in Section 4 would still hold. When point-to-point communication is used there is a latency cost associated whenever two partitions share a common boundary. Thus, the additional cost is equal to the number of internal edges for a given decomposition multiplied by the latency cost. The number of internal edges can be evaluated by using the well-known Euler Theorem for planar graphs, V ? E + F = 2; where E is the number of edges, V is the number of vertices and F is the number of regions. Assuming p partitions, F = p + 1. Hence, E = V + p + 1 ? 2 , E = V + p ? 1: (1) We de ne any vertex to be a boundary vertex if it lies on the outside boundary of the array to be decomposed. Otherwise, the vertex is de ned to be an internal vertex. Similarly, any edge is de ned to be a boundary edge if both endpoints of that edge are boundary vertices, else it is de ned as an internal edge. Let V b , V i , E b , E i be the number of boundary vertices, internal vertices, boundary edge, and internal edges, respectively. Since E b = V b , we can derive the following by using (1): Thus the number of internal edges for a given decomposition can be derived by nding the unique corner points of partitions of a given decomposition, which can be done either by hashing or sorting. In the latter case the time requirements are O(p log p) for each decomposition.
Consider the equivalence class derived by the \rearrangement" operation described in Section 4. The number of internal edges for two di erent decompositions from the same equivalence class need not be equal. Consider the decompositions given in Figure 12 (a) and Figure 12 (b). Although they belong to the same equivalence class, the number of internal edges of the decomposition in Figure  12 (a) is less than in Figure 12 (b). A decomposition has the sorted order property if the vertical subarrays are sorted according to their count values (number of partitions) and the partitions within and between the vertical subarrays are sorted according to their size from top to bottom and left to right, respectively. A possible improvement is to consider all the rearrangements of a decomposition satisfying the sorted order property by allowing interchanges between any two vertical subarrays and/or interchanges of partitions within a vertical subarray and then choosing the decomposition with the minimum number of edges. Unfortunately, the number of such decompositions, though less than p!2 p , would still be quite large.
One would expect the di erence between the number of internal edges to be small within a given equivalence class. Making this assumption, an algorithm that considers only pt(p) decompositions as discussed in Section 4, should be able to nd a close to optimal solution for all XY decompositions when the cost of decomposition includes the e ect of latency corresponding to every internal edge.

Algorithms
In this section we describe four algorithms for decomposing an array into nonuniform-sized partitions. These are based on the ideas discussed in Sections 3, 4, and 5.

XY algorithm
A high-level algorithm that generates all the XY partitions is given in Figure 13. One can improve the quality of partitioning produced by the algorithm by using Lemma 3. For each decomposition considered, we compute the constant associated with each dimension separately (i.e., Cost = d m + k n). If d < k, then rotating the decomposition will produce a cheaper communication cost. We will call the new algorithm XY 2.

TILE algorithm
In Section 3 we observed that partitioning along the shorter dimension will reduce the communication cost. We can use this fact to improve the algorithm in the previous section. This new     algorithm (described in Figure 14) is similar to XY , with one modi cation: whenever the remaining length of the current dimension along which the vertical subarrays are partitioned becomes less than the other dimension, the order of the partitioning is switched, i.e, the XY algorithm is applied with the other dimension as length and the current dimension as width. The lower bound on the computational requirements of this new algorithm is (pt(p)). This happens when partitioning of vertical subarrays continues along one dimension and the algorithms behaves as an XY algorithm. The worst-case complexity of the algorithm is O(2 p?1 ).

Recursive Bisection algorithm
A simple algorithm was proposed by Crandall and Quinn 10] for decomposing a two-dimensional array for a NUCE (Figure 15). We shall refer to it as RecursiveBisection. RecursiveBisection partitions the current array according to the rst half of the partitions available. Two simple variations are: 1. RecursiveBisection switches the dimension along which partitioning is performed independently of the di erence between the length and the width of the array. If the aspect ratio of an array is large, the communication cost can be reduced by partitioning along the dimension with the larger size. We shall refer to this variation as \RecursiveBisection2." This algorithm, also partitions the current array according to the y partitions available such that the sum of their weights is approximately half of the total.
2. The way RecursiveBisection2 partitions the array may result in unbalanced total weights being assigned to the two partitions. For example, consider the case of four partitions with sizes 0.4, 0.4, 0.1, and 0.1, respectively. In this case the array would be partitioned into two parts with one partition having 80% of the size. We wanted to study the e ect of choosing balanced partitions by using a binpacking algorithm to divide the list into two parts with nearly equal total weight. We call this variation \RecursiveBisection3."

Simulation results
In this section we present the performance of the algorithms described in the previous sections. The quality of partitioning produced by the di erent algorithms was measured for the following parameters: 1. Number of partitions: We performed our simulation for 4, 5, 7, 10, 15, and 20 partitions. We decided upon a limit of 20 partitions because we believe that the use of such environments for data-parallel computing would typically be limited to this number.
2. Ratio of Computational power of Maximum/Minimum: We varied this ratio from 1 to 8 to study the e ect of nonuniformity between the computational units.
3. Size of the arrays: We considered arrays of di erent sizes and shapes ranging from 1K 1K to 1K 20K to study e ects based on the aspect ratio of the array to be partitioned.

Latency Costs:
We considered three di erent latency costs: 0, 100, 1000. The rst corresponds to the case when latency e ects are not relevant ( e.g., in the case of a multicast primitive as described in Section 5). The latter two cases represent practical scenarios.
Based on our preliminary experiments, we concluded that RecursiveBisection had the worst average communication cost, XY 2 produced the same as or marginally better results than XY , and RecursiveBisection3 had similar performance as RecursiveBisection2. Therefore, we will not present the performance of RecursiveBisection, RecursiveBisection3 and XY .
We decided to present comparative rather than absolute results due to the large amount of data generated by our simulation. All comparisons were made with RecursiveBisection2 because it produces very good results at a relatively low cost. A comparative study is more useful for determining the utility of a more expensive algorithm. The following summarizes the di erent results presented.
1. Tables 2 through 4 give the percentage improvement of the Tile algorithm over the Recur-siveBisection2 for di erent values of the parameters.
2. Tables 5 through 7 give the percentage improvement of the XY 2 algorithm over the Recur-siveBisection2 for di erent values of the parameters.

Zero latency
This section compares the di erent algorithms based on the total amount of communication generated. The following observations can be made from Tables 2 and 5: 1. The performances of Tile and XY 2 are better than that of RecursiveBisection2. The performance improves with number of partitions when the aspect ratio is less than the number of partitions.
2. The performances of Tile and XY 2 are comparable. However, the computational cost of XY 2 is considerably less.
3. When the aspect ratio of the array is larger than the number of partitions, the partitioning is generally along one dimension and hence all algorithms perform equally well.
We conclude that XY 2 performs better than the other schemes when the number of partitions is large and the aspect ratio is smaller than the number of partitions. In all other cases, Recursive-Bisection2 produces reasonably good-quality solutions at a very small cost. The decompositions produced by di erent algorithms for partitioning an array of size 1000 3000 into 7 partitions (a 1 = 0:5, a 2 = 0:1, a 3 = 0:1, a 4 = 0:1, a 5 = 0:1, a 6 = 0:05 and a 7 = 0:05) are given in Figure 16. The communication costs are 5750, 4500, 4500, 4666, and 4700 by using RecursiveBisection, XY 2, TILE, RecursiveBisection2, and RecursiveBisection3, respectively.

Nonzero latency
This section compares the di erent algorithms when latency costs are important. From Tables 3,  4, 6 and 7, we observe the following: 1. The performances of Tile and XY 2 are better than that of RecursiveBisection2. The performance improves with the number of partitions when the aspect ratio is less than the number of partitions.
2. The performance of XY 2 is better than that of Tile as the latency increases. This suggests that the number of internal edges produced by Tile is larger due to the change in direction when partitioning is performed. The relative performance improvements of XY 2 are better when the latency is higher, the number of partitions is larger, and the aspect ratio is not larger than the number of partitions. 3. When the aspect ratio of the array is larger than the number of partitions, the partitioning is generally along one dimension and hence all algorithms have a similar performance.
Based on the above observations, the XY 2 is the algorithm of choice when latency cost is an important factor. It produces the best results at a reasonable cost.

Conclusions
In this paper we have presented several decomposition algorithms for decomposing two-dimensional data array for a heterogeneous processors network. One important question that is relevant here is whether there are any decompositions that none of the algorithms described in this paper consider. One such decomposition is given in Figure 17. When all the partitions are of equal size it is easy to show that the communication generated by decomposition in Figure 17 (using both the communication cost and the latency overhead) is much larger then the decomposition produced by XY 2 partitioning. For di erent-sized partitions, it is di cult to evaluate the communication generated by such decompositions. Hence it is di cult to comment on the optimality of the decompositions generated by our algorithms.
Our results indicate that XY -based partitioning produces the best cost decomposition at a relatively small cost. The number of decompositions considered by XY partitioning is pt(p), where p is the number of partitions. The growth of this function is reasonable (Table 1) for practical values of p. Further, the time required for partitioning is not prohibitive for most applications (Table 8). This makes the algorithm very important for practical considerations.
In an adaptive system the resources may change over a period of time and may require remapping of data, thus these algorithms may have to be executed at runtime. If computational cost is at a premium the simple recursive bisection algorithm, which always partitions along the smaller dimension, works very well. One can also use a hybrid algorithm that uses RecursiveBisection for the rst few steps, followed by an XY partitioning.