Parallel incremental graph partitioning using linear programming

Partitioning graphs into equally large groups of nodes while minimizing the number of edges between different groups is an extremely important problem in parallel computing. For instance, efficiently parallelizing several scientific and engineering applications requires partitioning data or tasks among processors such that the computational load on each node is roughly the same, while communication is minimized. Obtaining exact solutions is computationally intractable, since graph partitioning is NP-complete. For a large class of irregular and adaptive data parallel applications (such as adaptive meshes), the computational structure changes from one phase to another in an incremental fashion. In incremental graph-partitioning problems the partitioning of the graph needs to be updated as the graph changes over time; a small number of nodes or edges may be added or deleted at any given instant. We use a linear programming-based method to solve the incremental graph-partitioning problem. All the steps used by our method are inherently parallel, and hence our approach can be easily parallelized. By using an initial solution for the graph partitions derived from recursive spectral bisection-based methods, our methods can achieve repartitioning at considerably lower cost than can be obtained by applying recursive spectral bisection from scratch.


Introduction
Graph partitioning is a well-known problem for which fast solutions are extremely important in parallel computing and in research areas such as circuit partitioning for VLSI design. For instance, parallelization of many scientific and engineering problems requires partitioning the data among the processors in such a fashion that the computation load on each node is balanced, while communication is minimized. This is a graph-partitioning problem, where nodes of the graph represent computational tasks, edges describe the communication between tasks, and each partition corresponds to one processor. Optimal partitioning would allow optimal parallelization of the computations, with the load balanced over the processors and communication time minimized. For many applications, the computational graph can be derived only at runtime, which requires that graph partitioning also be done in parallel. Since graph partitioning is NP-complete, obtaining suboptimal solutions quickly is desirable and often satisfactory. (This research was supported in part by DARPA under contract #DABT63-91-C-0028.)
For a large class of irregular and adaptive data parallel applications such as adaptive meshes [2], the computational structure changes from one phase to another in an incremental fashion. In "incremental graph-partitioning" problems, the partitioning of the graph needs to be updated as the graph changes over time; a small number of nodes or edges may be added or deleted at any given instant. A solution of the previous graph-partitioning problem can be utilized to partition the updated graph, such that the time required will be much less than the time required to reapply a partitioning algorithm to the entire updated graph. If the graph is not repartitioned, it may lead to imbalance in the time required for computation on each node and cause considerable deterioration in the overall performance. For many of these problems the graph may be modified after every few iterations (albeit incrementally), and so the remapping must have a lower cost relative to the computational cost of executing the few iterations for which the computational structure remains fixed. Unless this incremental partitioning can itself be performed in parallel, it may become a bottleneck.
For many applications, the computational graph is such that the vertices correspond to two- or three-dimensional coordinates and the interaction between computations is limited to vertices that are physically proximate. In this paper we concentrate on methods that do not require such information, and which therefore have wider applicability. Our incremental graph-partitioning algorithm uses linear programming. Starting from an initial partitioning obtained by recursive spectral bisection, which is regarded as one of the best-known methods for graph partitioning, our methods can partition the new graph at considerably lower cost than reapplying recursive spectral bisection. The quality of partitioning achieved is close to that achieved by applying recursive spectral bisection from scratch. Further, our algorithms are inherently parallel.
The rest of the paper is outlined as follows. Section 2 defines the incremental graph-partitioning problem. Section 3 describes the linear programming-based incremental graph partitioning. Experimental results of our methods on sample meshes are described in Section 4. Conclusions are given in Section 5.

Problem definition
Consider a graph $G = (V, E)$, where $V$ represents a set of vertices, $E$ represents a set of undirected edges, the number of vertices is given by $n = |V|$, and the number of edges is given by $m = |E|$. The graph-partitioning problem can be defined as an assignment scheme $M : V \to P$ that maps vertices to partitions.
We denote by $B(q)$ the set of vertices assigned to a partition $q$, i.e., $B(q) = \{v \in V : M(v) = q\}$.
The weight $w_i$ corresponds to the computation cost (or weight) of the vertex $v_i$. The cost of an edge, $w_e(v_1, v_2)$, is given by the amount of interaction between vertices $v_1$ and $v_2$. The weight of every partition can be defined as
$$W(q) = \sum_{v_i \in B(q)} w_i. \tag{1}$$
The cost of all the outgoing edges from a partition represents the total communication cost and is given by
$$C(q) = \sum_{v_i \in B(q),\ v_j \notin B(q)} w_e(v_i, v_j). \tag{2}$$
We would like to make an assignment such that the time spent by every node is minimized, i.e., $\min_q \left(W(q) + \beta C(q)\right)$, where $\beta$ represents the ratio of the cost of unit computation to the cost of unit communication on a machine. Assuming computational loads are nearly balanced ($W(0) \approx W(1) \approx \cdots \approx W(p-1)$), the second term needs to be minimized. In the literature, $\sum_q C(q)$ has also been used to represent the communication.
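The two partition costs can be computed directly from a mapping. The following is a minimal sketch, assuming the graph is stored as plain dictionaries; the function name `partition_costs` and the toy 4-cycle are illustrative and do not come from the paper.

```python
from collections import defaultdict

def partition_costs(vertex_weight, edge_weight, mapping):
    """Return (W, C): per-partition computation weight and cut cost.

    vertex_weight: dict vertex -> w_i
    edge_weight:   dict (u, v) -> w_e(u, v), one entry per undirected edge
    mapping:       dict vertex -> partition id, i.e. the assignment M
    """
    W = defaultdict(float)
    C = defaultdict(float)
    for v, w in vertex_weight.items():
        W[mapping[v]] += w
    for (u, v), w in edge_weight.items():
        pu, pv = mapping[u], mapping[v]
        if pu != pv:
            # A cut edge is an outgoing edge of both partitions it joins.
            C[pu] += w
            C[pv] += w
    return dict(W), dict(C)

# Toy example (hypothetical): a 4-cycle a-b-c-d split into two partitions.
vertex_weight = {"a": 1, "b": 1, "c": 1, "d": 1}
edge_weight = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "a"): 1}
mapping = {"a": 0, "b": 0, "c": 1, "d": 1}
W, C = partition_costs(vertex_weight, edge_weight, mapping)
```

In this example both partitions carry weight 2 and both see the same two cut edges, so the loads are balanced and only the communication term distinguishes candidate mappings.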
Assume that a solution is available for a graph $G(V, E)$ by using one of the many available methods in the literature, i.e., the mapping function $M$ is available such that
$$W(0) \approx W(1) \approx \cdots \approx W(p-1) \tag{3}$$
and the communication cost is close to optimal. Let $G'(V', E')$ be an incremental graph of $G(V, E)$:
$$V' = V \cup V_1 - V_2, \tag{4}$$
i.e., some vertices are added and some vertices are deleted. Similarly,
$$E' = E \cup E_1 - E_2, \tag{5}$$
i.e., some edges are added and some are deleted. We would like to find a new mapping $M' : V' \to P$ such that the new partitioning is as load balanced as possible and the communication cost is minimized.
The methods described in this paper assume that $G'(V', E')$ is sufficiently similar to $G(V, E)$ that this can be achieved, i.e., the number of vertices and edges added or deleted is a small fraction of the original number of vertices and edges.

Incremental partitioning
In this section we formulate incremental graph partitioning in terms of linear programming. A high-level overview of the four phases of our incremental graph-partitioning algorithm is shown in Figure 1. Some notation is in order. Let
1. $P$ be the number of partitions.
2. $B'(i)$ represent the set of vertices in partition $i$.
3. $\mu$ represent the average load for each partition, $\mu = \frac{\sum_i |B'(i)|}{P}$.
The four steps are described in detail in the following sections.
Step 1: Assign the new vertices to one of the partitions (given by $M'$).
Step 2: Layer each partition to find the closest partition for each vertex (given by $L'$).
Step 3: Formulate the linear programming problem based on the mapping of Step 1 and balance loads (i.e., modify $M'$), minimizing the total number of changes in $M'$.
Step 4: Refine the mapping in Step 3 to reduce the communication cost.
Figure 1: The different steps used in our incremental graph-partitioning algorithm.

Assigning an initial partition to the new nodes
The first step of the algorithm is to assign an initial partition to the nodes of the new graph (given by $M'(V)$). A simple method for initializing $M'(V)$ is given as follows. Let
$$M'(v) = M(v) \quad \forall v \in V \cap V'. \tag{6}$$
For all the vertices $v \in V_1$,
$$M'(v) = M(x), \quad \text{where } x \in V \cap V' \text{ minimizes } d(v, x), \tag{7}$$
and $d(v, x)$ is the shortest distance in the graph between $v$ and $x$. For the examples considered in this paper we assume that $G'$ is connected. If this is not the case, several other strategies can be used.
If $G''(V \cup V_1, E \cup E_1)$ is connected, this graph can be used instead of $G'$ for the calculation of $M'(V)$. If $G''(V \cup V_1, E \cup E_1)$ is not connected, then the new nodes that are not connected to any of the old nodes can be clustered together (into potentially disjoint clusters) and assigned to the partition that has the fewest vertices. For the rest of the paper we will assume that $M'(v)$ can be calculated using the definition in (7), although the strategies developed in this paper are, in general, independent of this mapping. Further, for ease of presentation, we will assume that the edge and the vertex weights are of unit value. All of our algorithms can be easily modified if this is not the case. Figure 2 (a) describes the mapping of each of the vertices of a graph. Figure 2 (b) describes the mapping of the additional vertices using the above strategy.
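One way to realize this initial assignment with unit edge weights is a multi-source breadth-first search that starts from all retained old vertices at once. The sketch below assumes that reading of the nearest-old-vertex rule; the name `assign_new_vertices` and the adjacency-dictionary layout are illustrative, not the paper's implementation.

```python
from collections import deque

def assign_new_vertices(adj, old_mapping):
    """Give every new vertex the partition of its nearest old vertex.

    adj:         dict vertex -> iterable of neighbours in the new graph G'
    old_mapping: dict old vertex -> partition id (deleted vertices removed)
    Returns a dict covering every vertex reachable from an old vertex.
    """
    new_mapping = dict(old_mapping)
    queue = deque(old_mapping)          # multi-source BFS from all old vertices
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in new_mapping:    # first visit = shortest unit distance
                new_mapping[v] = new_mapping[u]
                queue.append(v)
    return new_mapping

# Hypothetical path 0-1-2-3-4-5: vertices 0 and 1 are old (partition 0),
# vertex 5 is old (partition 1), and 2, 3, 4 are newly added.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
mapping = assign_new_vertices(adj, {0: 0, 1: 0, 5: 1})
```

Vertex 3 is equally far from both partitions, so its assignment depends on queue order. New vertices that are unreachable from any old vertex are simply left out of the result and would need one of the fallback strategies described above.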

Layering each partition
The above mapping would ordinarily generate partitions of unequal size. We would like to move vertices from one partition to another to achieve load balancing, while keeping the communication cost as small as possible. This is achieved by making sure that the vertices transferred between two partitions are close to the boundary of the two partitions. We assign each vertex of a given partition to a different partition it is close to (ties are broken arbitrarily):
$$L'(v) = M'(x), \tag{8}$$
where $x$ is such that
$$d(v, x) = \min_{y \in V',\ M'(y) \neq M'(v)} d(v, y) \tag{9}$$
is satisfied; $d(v, x)$ is the shortest distance in the graph between $v$ and $x$. A simple algorithm to perform the layering is given in Figure 3. It assumes the graph is connected. Let $\lambda_{ij}$ represent the number of such vertices of partition $i$ that can be moved to partition $j$. For the example case of Figure 3, labels of all the vertices are given in Figure 4. A label 2 on a vertex in partition 1 corresponds to the fact that this vertex belongs to the set that contributed to $\lambda_{12}$.
Figure 3: Layering algorithm. Notation: map[v[j]] is the mapping of vertex j; adj_i[j] is the j-th element of the local adjacency list in partition i; xadj_i[v[j]] is the starting address of vertex j in the local adjacency list of partition i; $S_i^{(j,k)}$ is the set of vertices of partition i at a distance k from a node in partition j.
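The layering step can be sketched as a breadth-first search that starts at the boundary vertices and grows inward without leaving a vertex's own partition. This is an assumed reading of the layering idea rather than the paper's Figure 3 algorithm; the names `layer_partitions`, `label` and `lam` are illustrative.

```python
from collections import Counter, deque

def layer_partitions(adj, mapping):
    """Label each vertex with the closest foreign partition.

    adj:     dict vertex -> list of neighbours (undirected, unit edge cost)
    mapping: dict vertex -> partition id (the current mapping)
    Returns (label, lam): label[v] is the closest other partition, and
    lam[(i, j)] counts the vertices of partition i labelled j.
    """
    label = {}
    queue = deque()
    # Layer 1: boundary vertices, which have a neighbour in another partition.
    for v in adj:
        for u in adj[v]:
            if mapping[u] != mapping[v]:
                label[v] = mapping[u]   # ties broken by adjacency order
                queue.append(v)
                break
    # Deeper layers: grow inward, never leaving the vertex's own partition.
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if mapping[u] == mapping[v] and u not in label:
                label[u] = label[v]
                queue.append(u)
    lam = Counter((mapping[v], label[v]) for v in label)
    return label, dict(lam)

# Hypothetical path 0-1-2-3-4-5-6 with partitions [0,0,0 | 1,1 | 2,2].
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
mapping = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
label, lam = layer_partitions(adj, mapping)
```

Restricting the search to a vertex's own partition does not change the result, because every vertex on a shortest path to the nearest foreign vertex must itself lie in the same partition. The counts in `lam` play the role of the per-pair upper bounds used by the load-balancing step.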

Load balancing
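Let $l_{ij}$ represent the number of vertices to be moved from partition $i$ to partition $j$ to achieve load balance. There are several methods for load balancing. However, since one of our goals is to minimize the communication cost, we would like to minimize $\sum_i \sum_j l_{ij}$, because this would correspond to a minimization of the amount of vertex movement (or "deformity") in the original partitions. Thus, the load-balancing step can be formally defined as the following linear programming problem.
Minimize
$$\sum_{0 \le i \ne j < P} l_{ij} \tag{10}$$
subject to
$$0 \le l_{ij} \le \lambda_{ij} \le |B'(i)|, \tag{11}$$
$$\sum_{0 \le i < P} (l_{ji} - l_{ij}) = |B'(j)| - \mu, \quad 0 \le j < P, \tag{12}$$
with $\lambda_{ij}$ the number of vertices of partition $i$ that the layering step marks as movable to partition $j$, and $\mu$ the average load. Constraint (12) corresponds to the load balance condition: the net number of vertices leaving partition $j$ equals its excess over the average load.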
The above formulation is based on the assumption that changes to the original graph are small and the initial partitioning is well balanced. Hence, moving the boundaries by a small amount will give balanced partitioning with low communication cost.
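To make the formulation concrete, the sketch below solves a toy instance by exhaustive enumeration of integer moves. Brute force stands in here for a linear programming solver purely for illustration and is viable only for tiny instances; the name `balance_by_enumeration`, the partition sizes and the bounds are all hypothetical.

```python
from itertools import product

def balance_by_enumeration(sizes, lam):
    """Solve the load-balancing program exhaustively for a toy instance.

    sizes: list of partition sizes; their sum must be divisible by P
    lam:   dict (i, j) -> upper bound on vertices movable from i to j
    Returns (total_moves, moves) with moves[(i, j)] = vertices moved from
    i to j, or None if the constraints admit no integer solution.
    """
    P = len(sizes)
    mu = sum(sizes) // P                       # average load per partition
    pairs = sorted(lam)
    best = None
    for values in product(*(range(lam[p] + 1) for p in pairs)):
        l = dict(zip(pairs, values))
        # Load balance: size after moves must equal mu for every partition.
        balanced = all(
            sizes[j]
            + sum(x for (a, b), x in l.items() if b == j)
            - sum(x for (a, b), x in l.items() if a == j) == mu
            for j in range(P)
        )
        if balanced and (best is None or sum(values) < best[0]):
            best = (sum(values), {p: x for p, x in l.items() if x})
    return best

# Three partitions in a chain 0 - 1 - 2; the average load is 10.
sizes = [13, 9, 8]
lam = {(0, 1): 4, (1, 0): 2, (1, 2): 3, (2, 1): 3}
result = balance_by_enumeration(sizes, lam)
```

Partitions 0 and 2 share no boundary in this example, so the surplus of partition 0 has to pass through partition 1: three vertices move from 0 to 1 and two from 1 to 2, for five moves in total.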
There are several approaches to solving the above linear programming problem. We decided to use the simplex method because it has been shown to work well in practice and because it can be easily parallelized. (We have used a dense version of the simplex algorithm; the total time can potentially be reduced by using a sparse representation.) The simplex formulation of the example in Figure 2 is given in Figure 5. The corresponding solution is $l_{03} = 8$ and $l_{12} = 1$. The new partitioning is given in Figure 6.
The above set of constraints may not have a feasible solution. One approach is to relax the constraint in (11) and not have $l_{ij} \le \lambda_{ij}$ as a constraint. Clearly, this would achieve load balance but may lead to major modifications in the mapping. Another approach is to replace the constraint in (12) by
$$\sum_{0 \le i < P} (l_{ji} - l_{ij}) = \frac{|B'(j)| - \mu}{\Delta}, \quad 0 \le j < P. \tag{13}$$
Assuming $C > \Delta > 1$, this would not achieve load balancing in one step, but several such steps can be applied to achieve load balancing. If a feasible solution cannot be found with a reasonable value of $\Delta$ (within an upper bound $C$), it would be better to start partitioning from scratch or to solve the problem by adding only a fraction of the nodes at a given time, i.e., to solve the problem in multiple stages. Typically, such cases arise when all the new nodes correspond to a few partitions and the amount of incremental change is greater than the size of one partition.
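As a hypothetical illustration of staged balancing (the numbers are not from the paper), assume the relaxed constraint asks each stage to remove only a fraction $1/\Delta$ of every partition's imbalance, where $\Delta$ is the relaxation factor. For four partitions of sizes 16, 8, 8 and 8, two stages would proceed roughly as follows:

```latex
\begin{align*}
\mu &= \tfrac{16 + 8 + 8 + 8}{4} = 10, \\
\text{stage 1 } (\Delta = 2):\quad
  & \tfrac{1}{2}\,(16 - 10,\ 8 - 10,\ 8 - 10,\ 8 - 10) = (3,\ -1,\ -1,\ -1)
  \;\Rightarrow\; \text{sizes } (13,\ 9,\ 9,\ 9), \\
\text{stage 2 } (\Delta = 1):\quad
  & (13 - 10,\ 9 - 10,\ 9 - 10,\ 9 - 10) = (3,\ -1,\ -1,\ -1)
  \;\Rightarrow\; \text{sizes } (10,\ 10,\ 10,\ 10).
\end{align*}
```

Each stage moves at most three vertices out of the overloaded partition, so the per-pair upper bounds are less likely to be violated than in a single step that moves all six.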

Refinement of partitions
The formulation in the previous section achieves load balance but does not try explicitly to reduce the number of cross-edges. The minimization term in (10) and the constraint in (11) indirectly keep the cross-edges to a minimum under the assumption that the initial partition is good. In this section we describe a linear programming-based strategy to reduce the number of cross-edges while still maintaining the load balance. This is achieved by finding all the vertices of partition $i$ on the boundary between partitions $i$ and $j$ for which the cost of edges to the vertices in $j$ is larger than the cost of edges to local vertices (Figure 7). Moving such a vertex from partition $i$ to $j$ decreases the total cost of cross-edges but affects the load balance. In the following, a linear programming formulation is given that moves the vertices while keeping the load balance.
Let $M''(k) : V' \to P$ represent the mapping of each vertex after the load-balancing step. Let $out(k, j)$ represent the number of edges of vertex $k$ in partition $M''(k)$ connected to partition $j$ ($j \ne M''(k)$), and let $in(k)$ represent the number of vertices that vertex $k$ is connected to in partition $M''(k)$. Let $b_{ij}$ represent the number of vertices in partition $i$ that have at least as many outgoing edges to partition $j$ as local edges:
$$b_{ij} = \left|\{v \in B''(i) : out(v, j) - in(v) \ge 0\}\right|.$$
We would like to maximize the number of vertices moved so that moving a vertex will not increase the cost of cross-edges. The inequality in the above definition can be changed to a strict inequality. We leave the equality, however, since by including such vertices the number of points that can be moved can be larger (because these vertices can be moved to satisfy load balance constraints without affecting the number of cross-edges).
The refinement problem can now be posed as the following linear programming problem: Maximize
$$\sum_{0 \le i \ne j < P} l_{ij}$$
subject to
$$0 \le l_{ij} \le b_{ij},$$
$$\sum_{0 \le i < P} (l_{ij} - l_{ji}) = 0, \quad 0 \le j < P.$$
This refining step can be applied iteratively until the effective gain from the movement of vertices is small. After a few steps, the inequalities ($l_{ij} \le b_{ij}$) need to be replaced by strict inequalities ($l_{ij} < b_{ij}$); otherwise, vertices having an equal number of local and nonlocal neighbors may move between boundaries without reducing the total cost. The simplex formulation of the example in Figure 6 is given in Figure 8, and the new partitioning after refinement is given in Figure 9.
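The per-pair bounds used in the refinement step can be counted in a single pass over the adjacency lists. The sketch below assumes unit edge costs and the non-strict inequality; the name `refinement_candidates` and the five-vertex graph are illustrative.

```python
from collections import Counter

def refinement_candidates(adj, mapping):
    """Count, per ordered partition pair, the vertices worth moving.

    adj:     dict vertex -> list of neighbours (unit edge cost)
    mapping: dict vertex -> partition id after load balancing
    Returns b with b[(i, j)] = number of vertices v in partition i whose
    edges into partition j are at least as many as their local edges.
    """
    b = Counter()
    for v, nbrs in adj.items():
        i = mapping[v]
        per_partition = Counter(mapping[u] for u in nbrs)
        local = per_partition.pop(i, 0)          # local edges of v
        for j, out in per_partition.items():     # edges from v into j != i
            if out - local >= 0:
                b[(i, j)] += 1
    return dict(b)

# Hypothetical graph: "x" sits in partition 0 but has two neighbours in
# partition 1 and only one in partition 0.
adj = {
    "x": ["a", "p", "q"],
    "a": ["x", "c"],
    "c": ["a"],
    "p": ["x", "q"],
    "q": ["x", "p"],
}
mapping = {"x": 0, "a": 0, "c": 0, "p": 1, "q": 1}
b = refinement_candidates(adj, mapping)
```

Vertex `x` is a strict improvement candidate, whereas `p` and `q` are counted only because of the equality case: moving either one leaves the number of cross-edges unchanged, which is why the text recommends switching to a strict inequality after a few iterations.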

Experimental results
In this section, we present experimental results for the linear programming-based incremental partitioning presented in the previous section (we will use the term Incremental Graph Partitioner (IGP) to refer to this algorithm). The timings are given for 32 partitions on a 1-node and a 32-node CM-5.
We have used two sets of adaptive meshes for our experiments. These meshes were generated using the DIME environment [11]. The initial mesh of the first set is given in Figure 10. The other incremental meshes are generated by making refinements in a localized area of the initial mesh. These meshes represent a sequence of refinements in a localized area. The meshes contain 1071, 1096, 1121, 1152, and 1192 nodes, respectively.
The partitioning of the initial mesh (1071 nodes) was determined using recursive spectral bisection (RSB). This was the partitioning used by algorithm IGP to determine the partition of the incremental mesh (of size 1096). The repartitioning of the next set of refinements (with 1121, 1152, and 1192 nodes, respectively) was achieved using the partitioning obtained by IGP for the previous mesh in the sequence. The results show that, even after multiple refinements, the quality of partitioning achieved is comparable to that achieved by recursive spectral bisection from scratch; thus, this method can be used for repartitioning over several stages. The time required for repartitioning is about half of the time required for partitioning using RSB. The algorithm provides a speedup of around 15 to 20 on a 32-node CM-5.
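For context, the reported speedup can be restated as parallel efficiency. Efficiency is taken here as the standard ratio of speedup $S$ to processor count $P$; the paper itself reports only the speedup, so the figures below are derived rather than measured.

```latex
\[
E = \frac{S}{P}: \qquad
\frac{15}{32} \approx 0.47
\quad\text{to}\quad
\frac{20}{32} \approx 0.63 .
\]
```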
Most of the time spent by our algorithm is in the so-[...]
Figure 14: Incremental graph partitioning using linear programming and its comparison with spectral bisection from scratch for the meshes in Figure 12 and Figure 13. Legend: p, parallel timing on a 32-node CM-5; s, timing on a one-node CM-5; SB, Spectral Bisection; IGP, Incremental Graph Partitioner; IGPR, Incremental Graph Partitioner with Refinement.
[...] with 68, 139, 229, and 672 additional nodes over the mesh in Figure 12. The partitioning achieved by algorithm IGP for the meshes in Figure 13, using the partition of the mesh in Figure 12, is given in Figure 14. The number of stages required (by choosing an appropriate value of $\Delta$, as described in the section on load balancing) were 1, 1, 2, and 3, respectively. It is worth noting that although the load imbalance created by the additional nodes was severe, the quality of partitioning achieved for each of the cases was close to that of applying recursive spectral bisection from scratch. Further, the sequential time is at least an order of magnitude better than that of recursive spectral bisection. The CM-5 implementation improved the time required by a factor of 15 to 20. The time required for repartitioning the meshes in Figure 14 (b) and Figure 14 (c) is close to that required for the meshes in Figure 10. The timings for the meshes in Figure 14 (d) and 14 (e) are larger because they use multiple stages.
The above results show that IGP can be effectively used for repartitioning, at a fraction of the cost, to achieve solutions similar in quality to those obtained by applying recursive spectral bisection from scratch. Further, the algorithm can be parallelized effectively.

Conclusions
In this paper we have presented a novel linear programming-based formulation for solving incremental graph-partitioning problems. The quality of partitioning produced by our methods is close to that achieved by applying the best partitioning methods from scratch. Further, the time needed is a small fraction of the latter and our algorithms are inherently parallel. We believe the methods described in this paper are of critical importance to the parallelization of the adaptive and incremental problems described earlier.