A probabilistic analysis of a locality maintaining load balancing algorithm

This paper presents a simple load balancing algorithm and its probabilistic analysis. Unlike most of the previous load balancing algorithms, this algorithm maintains locality. The authors show that the cost of this load balancing algorithm is small for practical situations and discuss some interesting applications for data remapping.<<ETX>>


Introduction
Parallel computing is the art of executing a program on computers with more than one processors. It is important to map the program such that the total execution time is minimized. Experience with parallel computing has shown that a 'good' mapping is a critical part of executing a program on such computers. This mapping can be typically performed statically or dynamically.
For most regular and synchronous problems [FOX91], this mapping can be performed at the time of compilation by giving directions in the language to decompose the data and its corresponding computations (based on the owner computes rule). We are currently developing a compiler for Fortran D, which provides a rich set of such directives [CHOU92]. Load balancing and reduction of communication are two important issues for achieving load balancing. The directives of Fortran D can be used to provide such a mapping for a large class of regular and synchronous problems.
For some other class of problems, which are irregular in nature, achieving good mapping is considerably more difficult. Further, the nature of this irregularity may not be known, and can be derived only at run time. Many problems can be characterized as a discrete model of a physical system, and a set of values are to be calculated at every domain point of the system [NIC090]. The mapping of such problems entails mapping of regions of model domain to each processor. The computational work associated with each subdomain may change over a period of time and hence the load on each processor may become unbalanced. For many problems, the computations may be characterized as a series of phases. The output of each phase acts as an input for the next phase. Although the input may have uniform pattern, the output may be nonuniform. For example, computer vision requires the conversion of image {low level structure) into higher level structures. The processing passes through several phases. The following are some of the low-level tasks: 1. The image is converting into a set of edges by application of a Sobel operator [BALL86](to give an edge image).
2. The edge image can be used to detect lines or circles in the image. 2 3. Multiple images can be used to perform stereo for detection of motion or distance of the object.
Thus the output of phase 1 would be used as a input to phase 2 or phase 3 or both.
A typical parallelization of these tasks would require partitioning of the input image. Assume that we have an N x N image distributed on p processors such that each processor gets a N x N rectangular block (We note that it may be useful in p some cases to divide the image in each processor such that each node gets a :Jr, x :Jr, square block. However, we restrict ourselves to the previous mapping).
The number of edges in each partition in general will not be equal. However, phase 2 or 3 may require locality of edges. In such cases the load needs to be balanced in a fashion that each node has equal number of edges (assume that the computation depends on the number of edges).
In such cases a remapping needs to be performed in order to achieve load balancing and have potential improvement in performance. There are many algorithms described in the literature for mapping irregular problems (e.g. [CHOU82,HADD89,IQBA86]). Most of these algorithms perform the mapping statically and are very time consuming. For many problems, this is acceptable as the structure of the problem does not change over its execution. However, they are prohibitive for a large class of applications. There are several algorithms proposed in the literature for balancing the load at run time [CHOW79,HINZ90,NIHW85,SALE90]. However, most of these algorithms shuffle data around in a fashion that locality between data items is no longer maintained.
Most applications possess some natural locality, i.e., the computations utilize data items which have some sense of proximity. For such applications shuffling of the data to balance the load will, in general, lead to a greater and irregular communication and may significantly reduce the advantages of having the load balance.
In this paper, we analyze a simple load balancing algorithm for irregular problems. A similar algorithm has been described in [JAJA89] for load balancing for fine grained hypercube machines. We show that if irregularity is such that the compu-tation points are distributed with a certain class of distributions and the granularity (number of points per processor) is reasonably high, then the cost of this load balancing is nominal and reduces to a simple shift algorithm. Further the load balancing algorithm maintains locality which is one of the desirable features. We give some simple applications of the load balancing algorithm which could be used in several domains.
The rest of this paper is organized as follows. Section 2 describes several different versions of the load balancing algorithm. Section 3 presents analysis of the load balancing algorithm. These algorithms are developed in an architecture independent fashion using collective communication primitives. This makes them suitable for a wide variety of architectures. We make reasonable assumptions about the cost of these primitives. Section 4 presents a simple application. Finally, conclusions are presented in Section 5.

Load Balancing Algorithm
Let the data which is useful in processor P1c be given by array aLoc~c(O .. X~c -1), where X1c represents the number of useful elements in processor P~c. We assume that the data in each local array is sorted in order of locality.
The load balancing algorithm is given in Figure 1. The following variables are used in the algorithm: • prefix sum l'k = E~;J xi.
• average number of useful elements X = ~ Ei;J Xi. We assume that X is an integer (we make this assumption for ease of presentation). The algorithm can be easily modified when this is not satisfied.
In this paper, we analyze our algorithms in a reasonably architecture independent fashion. We assume a store-and-forward message passing approach for calculating the complexity of the communication. However, our algorithms are developed using collective communication, which could utilize wormhole or cut-through routing [DALL87]. Further, the main results of our paper are not dependent on the above choice. We assume that a linear array can be efficiently embedded in the architecture. This is true for popular architectures like meshes, toruses, a.nd hypercubes [QUIN87].
The time to send a. message of size S from any node to a. neighbor node is assumed to be 0( r + c.pS), where T represents the set up cost and c.p represents the inverse of the data transfer rate. For efficiency reasons our algorithms require efficient evaluation of parallel prefixes. Prefix operations are provided in hardware on CM-5 [TMC92], it is expected that it should be available on most future computer architectures.
In this paper we propose several schemes for data. movement, each approach may be suitable for particular system architectures. The time required for step 1, 2, and 4 ( Figure 1) is upper bounded by the time required for parallel prefix.
Step 3 can be completed in 0(1). We develop several algorithms for step 5. All the algorithms assume that a linear array can be embedded in the given architecture.

Approach 1
In this approach (Figure 2), each processor P1c first concatenates all packets it needs to send to its left hand side processors (i.e. Pi, i < k). At each stage, P1c shifts its packets to P~c-1 and receives packets from P1c+1, P1c then accepts and removes the packets which are targeted to P1c from the packets it received. The stage will be repeated until all packets reach their final destination. The right shift operation will follow the same procedure, but in other direction. Assume S represents the maximum size of packets (in terms of data. elements) which would be left shifted among processors, also let D represent the longest left shift distance among processors. Then in the worst case one processor may contain as many as DS data. needed to be left shifted, so the time takes to complete the left shift process would be So the worst case time complexity of this approach is O(D-r + D 2 cpS)). This approach is geared towards architectures which utilize store and forward communication method.
The other way to perform the complexity analysis is to assume that the maximum amount of data to be sent by any processor is X. In that case the complexity is O(D(-r + Xcp)).

Approach 2
In this approach (Figure 3), each processor P1c initializes a vector send~c, where , which will return a vector receive(] with receive[k] representing number of processors which will send packets to P~c. Finally, processors use this information to send and receive packets. The complexity of this algorithm is difficult to analyze. The cost of steps 1 to 3 ( Figure 3) is upper bounded by the parallel sum. The cost of step 4 and 5 in the worst case is difficult to analyze as it will depend on the network congestion and contention on which it is performed. A very loose upper bound on the complexity is O(n 2 (-r +cpS)). The performance of this algorithm should be much better in practice.

Approach 3
During the load balancing process, assume that P1c will left shift packets to Pi, where k -maxl1c ::5 i :5 k -minl~c, maxl1c and minl1c represent P1c 's maximum left shift distance and minimum left shift distance (> 0), respectively. These values can be calculated locally in 0(1) time. We observe that Plc+1 's maximum left shift distance maxllc+1 must be less than or equal to minl1c + 1. With this observation, we know that at any left shift stage, if P1c left shift packets to Pa and Plc+1 left shift packets to Pb, then a< b. So we can conclude there is no link conflict at any shift stage. This is assuming that shift is carried over on an embedded linear array. The same conclusion holds for right shift operation.   The worst case time complexity of this algorithm (assuming that each node sends out a maximum of T packets to a maximum distance of D 1 ) (Figure 4), is O(T · D( T +cpS)). This is because each shift can be in 0( D( T +cpS)) amount of time. This algorithm will be better than algorithm 1 ( Figure 2) and 2 ( Figure 3) if value T and Dis small.

Total Complexity
Thus the cost of load balancing is of the order to the cost of computing a parallel prefix followed by the time required for one of the approaches for data movement. The cost of parallel prefix is O(log n · ( T + 'P)) for hypercube architectures [RANK90]. We believe that many of the future architectures would have some hardware support for such a primitive. In such case it can be assumed that parallel prefix can be calculated in 0(1) time; such is the case for CM-5 [TMC92]. Approach 1 has a better worst case time complexity than approach 2 and 3. However in practice, approach 2 and 3 may work better.
Up to now, we have only performed the worst case complexity analysis. The worst case cost of the above algorithms makes them prohibitive for load balancing for many problems. However, as we shall show in the next section, the cost will be small if the granularity (amount of data) per node is reasonably large and the irregularity follows some reasonable distribution.

Probabilistic Analysis
We assume that each node has number of elements which are given by a distribution with mean p. and variance cr 2 • We will derive results without any assumption on the distribution and present specific results for normal distribution. Within the load balancing algorithm (Figure 1) there are two important parameters which typically affect the complexity of the algorithm, Z : the maximum number of elements at any node. This will affect the maximum number of packets which are sent out by every node, and, D the maximum amount of distance which has to be traversed by a packet sent out by any node.
In the following analysis we study properties of the above two parameters. Towards this goal we first state a general result.
Let U1 , • · ·, Un be independent and identically distributed random variables with mean 0, variance 1, distribution function F, and associated density function f. Let Then, for large n, the distribution of normalized z• is given by the extreme-value- So the ath percentile of (Z-p)fu would be x and for n = 16, 64, they are 3.6 and 3.9, respectively. It also means that for Z the ath percentile would be p+ux, implying that (Zp) would have to go as much change as ux. Consequently, probability that at least one processor will acquire a large number of elements is high even for small number of processors (if the variance is high).
In comparison with Z, distributional properties of D are considerably more involved. Let where X= n-1 (X1 + · · · + Xn)· Thus, \tk divided by the average number of elements per processor (after load balancing) represents the amount of shift which is required for the first few elements of processor k. Distributional properties of \tk are easy to observe by rewriting 13 k k Vk = (1 --) (X1 + · · · + Xk) --(Xk+l + · · · + Xn) n n and recalling that each of the X's are independent random variables. Thus behavior of each Vk is given by the properties of a normally distributed random variable. These properties of Vk 's show that more deviation from zero will occur in the middle. Since Vk indicates amount of data movement from one processor to another, it would be useful to find probabilistic bounds on size of Vk 's. For example, when n = 16, the eighth processor would encounter large data movement [variance of Vk is largest for n = 16] and since P(IVsl/u > 4) = 0.05 it follows that as much as ( 4 x u) elements may have to move from this processor to some neighboring processors with probability 0.05. If n = 64, then as much as (8 xu) elements may have to move from this processor in either direction with the same probability. However, Vk's are statistically dependent and their correlations are positive. This implies that if Vk is positive then the probability is higher than 1/2 that Yk+1 and Vk+2 will also be positive, i.e., smaller runs of positive and negative values of Vk's will be observed [a run-an uninterrupted sequence of +ve V's or -ve V's]. Now we consider properties of another random variable, vV, which is of interest in analysis of D. This variable is defined as
14 Thus, random variable W represents maximum change among all processors.
Properties of this random variable will allow us to quantify amount of data movement from one processor to others. Approximate asymptotic distribution of W' is obtained by realizing that the stochastic process generated by Vt/u...fii, ~/u...fii, · · · is a Brownian Bridge. In other words, if we define Then, as n --io oo, the behavior of the process {W 0 (t) : 0 ~ t ~ 1} is such that values oft the distribution of E(W 0 (t)) is Gaussian. Therefore, properties of this process can be used to obtain asymptotic distributions of interest. In particular, asymptotic distribution of W' is the same as the distribution of SUPo::;t::;l W 0 (t) and the latter satisfies (BILL68]: Therefore, for large n P(W' ~ X) = 1 -e-2 x 2 , x > 0.
In summary, the distribution of W, i.e., P(W ~ x), can be approximated by 1 -e-2 (x 2 /u 2 n) for x > 0. The ath percentile of W is easily obtained from this approximate distribution and is given by u...;;:J2( -ln(1-a)) 1  which represents the maximum shift in either direction. However, our algorithms perform a shift along left followed by right. Hence the above distribution is not useful for evaluating the complexity of the algorithms. We give the following result for sake of completeness. Again using properties of the Brownian Bridge, we obtain the following asymptotic distribution forD': as n--+ oo, [BILL68], Consequently, for large n, the distribution of D*, P(D* ~ x), can be approximated by 1 + 2Ei:l(-1)ie-(2i2x2/n0'2), for X> 0.
Returning back to W', it is easy to show that

2V2
Finally, we consider the behavior of the normalized maximum right shift random variables By the strong law oflarge numbers, it follows that X--+ p almost surely [SHOR86), and by Slutsky's Theorem [BICK77], asymptotic distributions of l-V* and D are 'essentially' the same as of W' I p and D' I p respectively. Consequently, for large values of n, the following approximations can be used (By symmetry, the distribution for maximum left shift should be similar.) These distributions can be used to obtain desired probability bounds on the magnitudes of amount of data items sent from one processor to another.

J.t J.t J.t
The cost of left shift is also the same. Hence total cost of load balancing = 2 · e.
The above is the expected time for completion of our algorithm. In case J.t 2:: uJ~nln n, we can prove that n Thus the probability of a shift more than 1 is very low, this result indicates that most of the data movement occur among neighbor processors.

JL
Thus for all distribution with JL = O(uVn), the effective time for data shifting on an average is 0(.\(r + 't'JL)). We will show in the next section that binomial distribution satisfies the above properties. Assuming that parallel prefix can be calculated reasonably efficiently (it can be calculated in 0( r log n) for most architectures, and nearly constant time in architectures like CM-5), the cost of load balancing should make it practical for use for many applications. Further if r is negligible when compared to 't'J' and parallel prefix can be calculated in 0(1) time, then the total cost is proportioned to O(Aft'JL). Assuming that the cost of computation is at least proportional to number of elements in every local array, this result shows that the cost of load balancing should be no greater than the cost of computation. Typically load balancing needs to be performed after several iterations of computation, thus our load balancing algorithms would add a small incremental cost if the above assumptions are satisfied.

A Simple Application
In the following we analyze the cost of load balancing for a specific instance. Assume that the input of a computational phase is a dense linear array which is distributed equally (each node has M elements). Assume that each element represents a computation with probability p (and no computation with a probability 1p) Mp(1-p)). Let maxx = maxo::c:;i<n Xi, the extra expected cost dues to load imbalance will be C(M)(E(maxx)-1' ). If the cost is greater than the expected cost of load balancing (and possibly remapping), then it will be benefit from the load balancing. That is The above analysis has to be modified suitably if the cost of parallel prefix is not 0(1).
For example, for the CM-5 the time required for a scan operation is approximate 10 p,sec, the value ofT is approximate 140 p,sec and the value of <pis approximate 0.5 p,secfword (assuming a word size of 4 bytes). Assuming lvf = 4096, n = 256, p = 0.5. 'Ne have p = 2048, u = 32, and .A = 0.25 .
Neglecting the cost of parallel prefix, we have 19 C(M) x 32 x 3.33 ~ 2(1.156r + 360cp) =} C(M) x 106.56 ~ 2.312r + 720cp =} C(M) ~ 0.022r + 6.756cp Substituting r = 140 x 10-6 sec and cp = 0.5 x 10-6 sec, =} C(M) ~ 6.458 X 10-6 Assuming a peak performance of 5 MFlops (the current CM-5 SPARC microprocessor), this implies that you need approximate 30 instructions. Thus load balancing may be preferable if the above condition is satisfied (which will be true for a large variety of applications). We should note that the value would go up if the processing speed increases (with the possible addition of vector units in CM-5).

Conclusions
In this paper, we present a simple load balancing algorithm and its probabilistic analysis. We demonstrate that the cost can be reduced to 0( .\( r + <pJL)) plus the cost of a parallel prefix. Our analysis indicate that in most practical cases the number of packets sent out by each processor is less than or equal to 2 (at most one on each side), and the size of these packets is almost surely less than or equal to the average number of elements on every node.
Our algorithms are suitable for most commercial architectures, which in most cases reduce the data movement to neighbor processors' shift operations. The algorithms also preserve the data locality between data items which is extremely important in reducing inter-processor communication.
This paper provides load balancing only along one dimension. For many cases the data is distributed along two or more dimensions. We are currently analyzing a similar load balancing algorithms for two or more dimensions.