Runtime support for parallelization of data-parallel applications on adaptive and nonuniform computational environments

In this paper, we discuss the runtime support required for the parallelization of unstructured data-parallel applications on nonuniform and adaptive environments. The approach presented is reasonably general and is applicable to a wide variety of regular as well as irregular applications. We present performance results for the solution of an unstructured mesh on a cluster of heterogeneous workstations.


Introduction
Most computing environments consist of a cluster of nodes connected by a high-speed interconnection network. Node architectures include high-performance SIMD and MIMD parallel computers as well as numerous high-performance workstations. By pooling as many resources as possible, these environments represent the largest machine to which a researcher has access. This pool of resources may change over the lifetime of the computation due to machine failures or differing usage patterns. It should be possible to add or remove computational resources without significantly affecting the other machines and without changing the existing software. In such an environment an individual machine can be dedicated to a single user's computation or shared by users. The former has the advantage of providing static computing capability for each machine, while the latter has a higher rate of utilization. The resources available to the user may be classified as:
1. Static: Computational resources are fixed throughout the completion of all tasks.
2. Dynamic: Computational resources vary dynamically throughout the computation because of sharing among users.
3. Adaptive: Computational resources remain fixed for a reasonable interval of time followed by a change.
Efficient parallelization of data-parallel applications requires careful attention to:
Load Balance: The computational load on each processor should be proportional to the processor's computational power.
Data Partitioning: Data should be partitioned such that nonlocal data accesses are minimized. This results in low communication costs.
Several methods of data partitioning to achieve efficient parallelization of data-parallel applications for static computational environments have been discussed in the literature and are part of data-parallel languages such as High-Performance Fortran [17] and potential extensions [16]. Limited research has been targeted towards parallel compilers and runtime support for nonuniform and/or adaptive environments. Nedeljkovic and Quinn [23] developed a data-parallel C compiler with dynamic load balancing for a network of workstations. Siegell and Steenkiste [29] implemented a runtime system that supports automatically generated programs with dynamic load balancing for workstations. Keyser, Lust, and Roose [22] implemented a parallel 2-D multiblock Euler/Navier-Stokes solver with adaptive block refinement and runtime load balancing for different parallel architectures, including clusters of workstations.
In this paper we discuss the runtime support required for the parallelization of unstructured mesh applications on a cluster of workstations. Many of these optimizations and issues are equally important for the parallelization of a wide variety of structured as well as unstructured applications on an adaptive computing environment. The software developed is part of the STANCE (Software Techniques for Adaptive and Nonuniform Computational Environments) runtime library [18].

The remainder of the paper is organized as follows. Section 2 discusses the computational environment and the important issues and major contributions of this research. Section 3 describes the runtime support library. Section 4 presents performance measures for nonuniform and adaptive environments. Section 5 presents the performance of the library on a cluster of heterogeneous workstations connected by Ethernet. We conclude in Section 6.
2 Computational environment
Our model is restricted to the Single Program Multiple Data (SPMD) model of execution. In this model the same program is executed on all processors. Parallelism is achieved by partitioning the data structures and associated computations among processors. We are targeting a nonuniform computational environment where the available computational resources may change adaptively.
These changes should be gradual enough that remapping is not required as soon as the computational resources adapt. Data-parallel programs execute by iterating through a sequence of several phases. There is an implicit synchronization at the end of execution of every phase. We assume that remapping can be performed after a phase is completed. The effect of a change in computational resources during the execution of one phase is not expected to cause the overall performance to deteriorate significantly.
A minimal amount of computational resources is available for the remapping and redistribution of data. Clearly, one can terminate the process as soon as it stops performing effective computation for the given data-parallel application. However, when the resource becomes available again this may require spawning a new process, which may be considerably more expensive.
It is currently left to the programmer to choose the specific places in the program where checks are made to determine whether the effects of any change in available computational resources warrant a redistribution of the data.

Important issues and contributions
In the following we describe the important issues for the parallelization of unstructured data-parallel applications on adaptive environments:
1. Fast Methods for Remapping: The amount of available computational resources may change during computation, which may require redistributing data items to achieve load balancing. It is important that this redistribution be done such that locality is maintained after the redistribution. Most unstructured data-parallel applications can be represented as computational graphs. We use a simple architecture-independent transformation that permutes all the nodes of the graph such that locality is improved. Let $T : V \to \{1, 2, 3, \ldots, n\}$ define the above permutation. The goal of this transformation is to achieve good partitioning for a wide range of partitions. Several methods for achieving this transformation are described in [7, 19] and elaborated on in Section 3. Mapping and remapping become relatively easy once this transformation is available.
2. Minimization of Communication Cost: Several optimizations can be performed to reduce the amount of communication, including the removal of duplicate accesses and message coalescing [27]. For many data-parallel applications the accesses are symmetric. We describe in Section 3 several methods to reduce communication requirements for such cases.
3. Minimization of Redistribution Cost: There are several good ways to repartition data. The communication cost of redistribution can be reduced by choosing a repartitioning that minimizes the amount of data movement among the processors. We describe several strategies in Section 3.
4. Address Translation: Parallel loops can be transformed into an inspector and an executor [27]. The inspector examines the data references and computes the off-processor data to be fetched. It also computes where the data will be stored once it is received. The executor uses this information to perform its computation. The use of a one-dimensional representation removes the necessity for maintaining explicit translation tables. The only information required at every node is the current partitioning of a one-dimensional list (memory requirements are proportional to the number of processors). This can be used to locally determine the location of all the data items.
3 Runtime support
Parallelization of iterative and unstructured data-parallel applications requires four major phases (see Figure 1). The first phase involves data partitioning. In this phase the nodes of the graph are renumbered to improve locality, which makes it easy to repartition the graph when the available resources change. The next two phases concern analyzing data-access patterns and communication between processors. The last phase involves load balancing, in which the load on each processor is monitored and, if necessary, the data is redistributed to balance the load. In static environments phase C tends to be executed multiple times, while phase B is executed once. In adaptive environments and/or adaptive applications (those whose computational structure adapts after every few iterations), phase B is executed whenever data is redistributed.

[Figure 1: Phase A, Data Partitioning: transforms a graph into a one-dimensional list. Phase B, Inspector: translates indices, generates schedules. Phase C, Executor: uses schedules for data movement, executes computations. Phase D, Load Balancing: monitors load on each processor, redistributes data.]

[Figure 2: Mapping a graph into one-dimensional space using recursive coordinate bisection]

A large number of unstructured data-parallel applications [8] can be represented as computational graphs from the perspective of parallel computing. The nodes of these graphs represent tasks that can be executed concurrently, while the edges represent the interactions between them. Further, the computational graphs derived from many applications are such that the vertices correspond to two- or three-dimensional coordinates, and the interaction between computations is limited to vertices that are physically proximate. Several graph-partitioning methods are described in the literature.
There are simple and fast heuristics for achieving partitioning by clustering physically proximate nodes (based on coordinate information) in two or three dimensions. Important heuristics include recursive coordinate bisection, inertial bisection, scattered decomposition, geometry-based partitioners, and index-based partitioners [9, 12, 13, 6, 25, 30, 32]. There are a number of methods that use explicit edge information to achieve better partitioning. Important heuristics include simulated annealing, mean-field annealing, recursive spectral bisection, recursive spectral multisection, mincut-based methods, and genetic algorithms [1, 11, 10, 14, 15, 21, 20, 26].
When computational resources are nonuniform, the parallelization of this computational graph requires partitioning the graph such that each processor is assigned nodes with computational weight proportional to the computational capabilities of that processor, and the number of cross edges is minimized. In adaptive environments the graph must be remapped, according to the new computational capabilities of the processors, whenever the available computational resources adapt. Many of the above methods are computationally expensive and thus are not suitable for such environments.
We have shown that computational graphs representing applications from the physical domain (i.e., embedded in two or three dimensions) can be transformed into a simple architecture-independent one-dimensional representation that encapsulates the locality in these graphs (see Figure 2). This representation allows for a fast mapping of the computational graph onto the underlying computational resources at the time of execution. Let the nodes of the vertex set be numbered from 1 through n. The architecture-independent transformation permutes all the nodes of the graph such that locality is improved. Let $T : V \to \{1, 2, 3, \ldots, n\}$ define the above permutation. The goal of this transformation is to achieve good partitioning for a wide range of partitions. Several methods for achieving this transformation are described in [19, 7]. After the initial transformation it is inexpensive to partition the one-dimensional list among the processors according to their computational capability, since partitioning is equivalent to assigning contiguous blocks of vertices to each partition. The size of each block is proportional to the weight of the partition. When the computational resources adapt, the same transformation can be used for repartitioning. Several algorithms for achieving this transformation and their performance are described in [19].
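A minimal sketch of the block assignment described above, assuming zero-based indices and half-open intervals. The name `partition_intervals` and its arguments are hypothetical and are not taken from the STANCE library; the rounding rule is likewise an assumption.

```python
from itertools import accumulate

def partition_intervals(n, weights):
    """Split indices 0..n-1 into contiguous blocks proportional to weights.

    Returns one half-open interval (start, end) per processor.
    """
    total = sum(weights)
    # Cumulative weight fractions give the block boundaries.
    bounds = [0] + [round(n * c / total) for c in accumulate(weights)]
    bounds[-1] = n  # guard against rounding drift on the last boundary
    return list(zip(bounds[:-1], bounds[1:]))

# Three processors, the first twice as capable as the other two.
intervals = partition_intervals(100, [2, 1, 1])
print(intervals)
```

Repartitioning after the capabilities change amounts to calling the same function with new weights; the one-dimensional ordering itself is left untouched.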

Inspector
In this section we outline the preprocessing needed by the inspector to generate the arguments required by the executor to perform the computations. The inspector has two main functions: data referencing, and generating a communication schedule [27].

[Figure 3: Translation table replicated on each processor]

When a one-dimensional transformation is used (Section 3.1), each processor is assigned an interval of data elements. Storing the first and last elements belonging to every processor in the transformed space is sufficient to generate the (processor, local index) tuple. The size of this list is proportional to the number of processors. It can be replicated on each processor (see Figure 3). To find the home processor of a particular element the list is searched until the processor holding the element is found. A processor holds an element if the element is greater than or equal to the first element that belongs to the processor, and less than or equal to the last element that belongs to it. The local address of a particular element is computed by subtracting the first element that belongs to its home processor from the element's index. Although the computation cost of the translation using this table is significant, it is negligible compared to the cost of the communication needed for dereferencing under the simple scheme.
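The translation from a global element to a (processor, local index) tuple can be sketched as follows. This is an illustration under stated assumptions: the intervals are sorted, contiguous, and nonempty, the names `make_translator` and `translate` are hypothetical, and a binary search is used in place of the linear scan the text describes.

```python
from bisect import bisect_right

def make_translator(intervals):
    """intervals: one (first, last) pair of global indices per processor,
    both ends inclusive, sorted by processor."""
    firsts = [first for first, _ in intervals]

    def translate(g):
        # Rightmost processor whose first element is <= g.
        proc = bisect_right(firsts, g) - 1
        if proc < 0 or g > intervals[proc][1]:
            raise IndexError(f"element {g} is outside the partitioned list")
        # Local address = element minus the first element of its home processor.
        return proc, g - intervals[proc][0]

    return translate

translate = make_translator([(0, 49), (50, 74), (75, 99)])
print(translate(60))  # (1, 10)
```

The table holds two integers per processor, so replicating it on every processor costs memory proportional to the number of processors rather than to the number of data elements.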
Communication schedules
Communication schedules are used to fetch nonlocal data elements into a local buffer and/or to scatter local data elements to other processors. Each processor provides the following information to generate a communication schedule:
1. Local list: local references to be gathered from or scattered to other processors
2. Processor list: processors to be gathered from or scattered to
3. Data size: size of the data elements involved in the gathering or scattering
The following information is available at a given processor P at this stage:
1. Send list: a list of arrays that store the local references of processor P that must be sent to other processors. The size of each array is maintained.
2. Permutation list: an array that stores the placement order in the local buffer of P for the data elements that processor P will receive when the schedule is used in the executor phase. It also includes information about the sizes of the messages that P will receive from other processors.
Efficient generation of communication schedules for nonlocal references can be done using two phases. The first phase removes duplicate accesses to avoid fetching a data item more than once. This is done by using a hash [...]
For many irregular applications the accesses are symmetric (commutative) in nature (e.g., iterative techniques for the finite element method). If nodes $n_1$ and $n_2$ are stored on different processors and there is an edge between them, then the processor that stores $n_1$ will access $n_2$ and vice versa. One can exploit this symmetry to eliminate the communication required to generate the communication schedule. Although a processor may be able to determine the nodes it needs to send to every processor, it will not be able to determine the order in which these nodes are sent. Sorting the nodes by index determines the correct order. This optimization is useful only when the cost of sorting is much smaller than the cost of off-processor accesses.
We have developed two methods for building communication schedules based on the above optimizations. We shall refer to them as schedule sort1 and schedule sort2. In schedule sort1 we sort both the sending list and the permutation list of each processor in increasing order. Each segment of the permutation list, which points to the locations of the nodes that will be received from a particular processor, is sorted according to the local references of these nodes in their home processor. Each segment of the sending list is sorted independently; thus the contents of each message are sent in increasing order and received in the same order (see Figure 4). Sorting the sending list can be avoided if a restriction is added that the nodes are traversed in increasing order according to their local references when building a communication schedule. We shall refer to this method as schedule sort2.
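The symmetry argument can be illustrated with a small sketch in the spirit of the sorting-based schedules; it is not the library's implementation. It assumes each processor knows the edges incident to its own nodes and an `owner` function mapping a global index to its processor. The function name `build_schedule` is hypothetical. Sets remove duplicate references, and sorting by global index gives both sides the same message order without any exchange of messages.

```python
from collections import defaultdict

def build_schedule(me, edges, owner):
    """Return (send, recv): dicts mapping a remote processor to the sorted
    global node indices to send to it / receive from it."""
    send, recv = defaultdict(set), defaultdict(set)
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if owner(a) == me and owner(b) != me:
                send[owner(b)].add(a)  # b's owner will reference my node a
                recv[owner(b)].add(b)  # and I reference its node b
    return ({q: sorted(s) for q, s in send.items()},
            {q: sorted(s) for q, s in recv.items()})

# Two processors: nodes 0..3 live on processor 0, nodes 4..7 on processor 1.
owner = lambda g: 0 if g < 4 else 1
edges = [(0, 1), (3, 4), (3, 5), (2, 5), (6, 7), (2, 4)]
send0, recv0 = build_schedule(0, edges, owner)
send1, recv1 = build_schedule(1, edges, owner)
print(send0, recv0)  # {1: [2, 3]} {1: [4, 5]}
```

What processor 0 plans to send to processor 1 matches, element for element, what processor 1 expects to receive from processor 0, which is the property the sorting step is meant to guarantee.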

Executor
The executor uses the communication schedules generated by the inspector to move data between the processors in the environment and to perform the necessary computations. There are two basic primitives, gather and scatter. Gather is used to fetch off-processor elements, while scatter is used to send off-processor elements.
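As a rough single-process illustration of the gather primitive, the sketch below simulates all processors in one Python program. The data layout (`send`, `recv_perm`) and the function name `gather` are assumptions made for this example and do not reflect the library's actual interface; a real executor would exchange the packed messages over the network.

```python
def gather(data, send, recv_perm, nprocs):
    """data[p]: local values on processor p.
    send[p][q]: local indices on p whose values are sent to q.
    recv_perm[q][p]: buffer slots on q for the values arriving from p.
    Returns buffers[q], the off-processor values as seen by q."""
    buffers = []
    for q in range(nprocs):
        size = sum(len(slots) for slots in recv_perm[q].values())
        buffers.append([None] * size)
    for p in range(nprocs):
        for q, local_ids in send[p].items():
            message = [data[p][i] for i in local_ids]  # pack one message
            for slot, value in zip(recv_perm[q][p], message):
                buffers[q][slot] = value  # place in schedule order
    return buffers

data = [[10, 11, 12, 13], [20, 21, 22, 23]]
send = [{1: [2, 3]}, {0: [0, 1]}]
recv_perm = [{1: [0, 1]}, {0: [0, 1]}]
buffers = gather(data, send, recv_perm, nprocs=2)
print(buffers)  # [[20, 21], [12, 13]]
```

A scatter would follow the same schedule in the opposite direction, writing buffered values back to the local indices of their home processor.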

Minimizing the amount of data movement
There are several ways to achieve the repartitioning such that contiguous blocks are assigned to every processor. We will use the term arrangements to represent each of the possible ways of partitioning. There are p! arrangements for p processors. We discuss a simple strategy for the minimization of the communication cost of redistributing data items. The two factors contributing to data redistribution time are the amount of data to be transferred and the number of messages generated.
The amount of data movement can be reduced by finding a new arrangement that maximizes the overlap between the original intervals and the new intervals. For example, consider a list of 100 elements and 5 processors with the following ratios of computational capabilities: $P_0 = 0.27$, $P_1 = 0.18$, $P_2 = 0.34$, $P_3 = 0.07$, and $P_4 = 0.14$. Let us assume that the one-dimensional list is divided among the processors using the arrangement $(P_0, P_1, P_2, P_3, P_4)$. If the computational capabilities of the processors adapt to 0.10, 0.13, 0.29, 0.24, 0.24, respectively, then dividing the list according to the original arrangement $(P_0, P_1, P_2, P_3, P_4)$ will yield 29 overlapped elements (see Figure 5 (a)). If the list is divided using the arrangement $(P_0, P_3, P_1, P_2, P_4)$, the number of overlapped elements will increase to 65 (see Figure 5 (b)). The number of messages generated can also be taken into account by incorporating it into the cost of redistribution. Using the first arrangement (Figure 5 (a)), the number of messages needed to redistribute the data is 5; the number of messages needed to redistribute the data for the latter arrangement (Figure 5 (b)) is only 3.
Choosing the best arrangement by trying out all cases is feasible only for a small number of processors. Figure 6 gives a simple greedy algorithm which generates only a subset of all the arrangements, considering data overlap and number of messages generated. Our simulations show that this algorithm (MinimizeCostRedistribution (MCR)) produces good suboptimal results. The algorithm MOVE, which is used by MCR, is described in Figure 7. The time requirement for this algorithm is $O(p^3)$, where p is the number of processors.
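For a small number of processors the exhaustive search mentioned above is easy to write down. The sketch below is that brute-force baseline, not the greedy MCR algorithm of Figure 6, and it scores arrangements by overlap only, ignoring the number of messages. All function names are hypothetical, block sizes are given as element counts, and the old arrangement is assumed to be in processor order.

```python
from itertools import permutations

def intervals_for(order, sizes):
    """Assign contiguous half-open intervals to processors in the given order."""
    out, start = {}, 0
    for p in order:
        out[p] = (start, start + sizes[p])
        start += sizes[p]
    return out

def overlap(old, new):
    """Number of elements that stay on the same processor."""
    return sum(max(0, min(old[p][1], new[p][1]) - max(old[p][0], new[p][0]))
               for p in old)

def best_arrangement(old_sizes, new_sizes):
    """Try all p! arrangements of the new blocks; keep the largest overlap."""
    procs = range(len(old_sizes))
    old = intervals_for(procs, old_sizes)
    return max(permutations(procs),
               key=lambda order: overlap(old, intervals_for(order, new_sizes)))

old_sizes, new_sizes = [6, 3, 1], [2, 3, 5]
best = best_arrangement(old_sizes, new_sizes)
print(best)  # (0, 2, 1)
```

In this small case keeping the original order retains 3 elements in place, while placing processor 2 between processors 0 and 1 retains 4, so less data has to move.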

Adaptive load balancing
When the available computational resources adapt, a remapping of data items may be required to maintain good load balance. This can be divided into four phases:
1. Monitoring local load on each processor.
2. Exchanging load information between processors.
3. Making a decision to remap; if remapping is required, choosing the appropriate partitioning of the array to minimize data movement.
4. If remapping is required, performing the data movement.
In our current implementation each processor monitors its own load and sends it to a controller processor, which makes the decision about repartitioning the data. Centralized load-balancing algorithms are suitable for an environment with a small number of processors. This currently requires sending the load information as separate messages to the controller, which broadcasts the decision to all the processors. When better resource management tools are available, we hope to have distributed strategies.
The goal of a good parallelization for the targeted environment is to minimize the idle time on any given processor. Using information from the current phase, the data (and associated computations) should be redistributed such that the idle time for the next phase is minimized. This assumes that the computational resources allocated for the data-parallel computation are the same as for the previous phase. The controller determines from time to time whether the remapping of data is profitable. Remapping is considered profitable if its cost is offset by an improvement in time for the next phase. If it is not profitable, the controller broadcasts an appropriate message to all the processors, and computations are resumed for the next phase. Otherwise, the controller computes new data intervals for each processor based on its estimated computational capability in the previous phase. The new intervals are broadcast to all the processors and the data is redistributed among the processors.
The frequency of this load-balancing check has to be set based on the following:
1. The overhead of load balancing. This should represent a small fraction of the time between successive load-balancing steps.
2. The rate at which the underlying computational resources adapt. If the computational environment adapts slowly, the frequency can be low. Clearly, if the computational resources adapt very frequently, effective parallelization will not be possible.
Techniques to choose the best frequency are outside the scope of this paper. The controller receives the new computational capability of the processors and determines whether remapping the data is profitable. Remapping is considered profitable if the change in the load is expected to improve the overall computation time for the environment in the next phase by enough to offset the cost of remapping. If remapping is not profitable, the controller processor broadcasts an appropriate message to the processors and computations are resumed for the next iteration.
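One way the controller's decision could be sketched is shown below. It is an illustration under stated assumptions rather than the library's rule: capability is estimated as data items processed per unit time in the last phase (the metric described in the experimental section), the next phase is assumed to do the same work as the last one, and the function names and the `remap_cost` argument are hypothetical.

```python
def estimate_capabilities(times, counts):
    """Relative speed of each processor: items per unit time, normalized to sum to 1."""
    rates = [n / t for t, n in zip(times, counts)]
    total = sum(rates)
    return [r / total for r in rates]

def remap_is_profitable(times, counts, remap_cost):
    """times[i]: compute time of processor i in the last phase.
    counts[i]: number of data items processor i owned in that phase."""
    current = max(times)  # the slowest processor determines the phase time
    total_rate = sum(n / t for t, n in zip(times, counts))
    balanced = sum(counts) / total_rate  # phase time if load matched capability
    return current - balanced > remap_cost

times, counts = [10.0, 5.0], [100, 100]
print(estimate_capabilities(times, counts))
print(remap_is_profitable(times, counts, remap_cost=1.0))  # True
print(remap_is_profitable(times, counts, remap_cost=5.0))  # False
```

Here the second processor is twice as fast, so a balanced split would shorten the phase from 10 to about 6.7 time units; remapping pays off only if it costs less than that difference.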

Other communication optimizations
Latency is an important factor when performing parallel computing on a general network. The number of messages generated by our library could be reduced significantly by using multicast. Our library has the ability to use multicast to perform all communications between processors in the environment if the network supports multicast (e.g., Ethernet [3], ATM [2]).

Performance measures
The performance of a parallel application is usually measured in terms of speedup and efficiency. It is difficult to define analogous terms for a nonuniform computational environment. In this section we give a general definition of efficiency that is suitable for data-parallel applications in a nonuniform environment. Let the amount of time required for computing a task be given by $T(p_i)$ on processor $i$ if it is executed sequentially. Thus, processor $i$ can complete $\frac{1}{T(p_i)}$ of the task per unit time.
Collectively all the processors can complete (assuming no parallelization overheads) $\sum_{i=1}^{n} \frac{1}{T(p_i)}$ of the task per unit time. Thus, one can define the efficiency of parallelization as
$$E(p_1, p_2, \ldots, p_n) = \frac{1}{T(p_1, p_2, \ldots, p_n) \sum_{i=1}^{n} \frac{1}{T(p_i)}}$$
where $T(p_1, p_2, \ldots, p_n)$ represents the time taken for completing the task when processors $p_1, p_2, \ldots, p_n$ are all used together.
For adaptive computational environments, assume that $T(p_1, p_2, \ldots, p_n)$ is the total time taken for completing the task. Let the fraction of the whole task which could have been completed by processor $i$ during that time be given by $f_i(T)$. Then the efficiency of the parallelization can be given by
$$E(p_1, p_2, \ldots, p_n) = \frac{1}{\sum_{i=1}^{n} f_i(T)}.$$
Unfortunately, the value of $f_i(T)$ is difficult to compute in an adaptive environment.
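For the static nonuniform case, the definition amounts to dividing the ideal time (the reciprocal of the combined processing rate) by the measured parallel time. A small numerical sketch follows; the function name `efficiency` and the sample timings are made up for illustration.

```python
def efficiency(seq_times, parallel_time):
    """seq_times[i]: time for processor i to run the whole task alone.
    parallel_time: measured time with all processors used together."""
    aggregate_rate = sum(1.0 / t for t in seq_times)  # task fraction per unit time
    ideal_time = 1.0 / aggregate_rate                 # no parallelization overheads
    return ideal_time / parallel_time

# One processor needs 10 time units alone, the other 20; together they took 8.
e = efficiency([10.0, 20.0], 8.0)
print(round(e, 4))  # 0.8333
```

With identical processors this reduces to the usual definition: two processors that each need 10 time units alone and finish together in 5 give an efficiency of 1.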

Experimental results
In this section we study the effectiveness of the different optimizations suggested in the previous section. We evaluated the library on a cluster of SUN4 workstations connected by Ethernet using the P4 message-passing environment.
Table 1 shows the execution time of MinimizeCostRedistribution in seconds. Its execution time is small, even for 20 processors. Table 2 shows the average cost of remapping different array sizes (floating point) over 100 randomly generated samples. These results show that using the heuristic reduced the cost of data remapping in all cases. They also show that the total time required for remapping (with or without the optimization) is very small. This is critical for effective parallelization.
We parallelized the loop in Figure 8. The indirection array corresponds to the unstructured mesh in Figure 9. The mesh has 30269 vertices and 44929 edges. The loop was repeated 500 times. The nodes of the mesh were transformed into a one-dimensional array using Recursive Spectral Bisection-based indexing [19].
The load-balancing algorithm requires an estimate of the current computational resources available on a given processor. There are several ways of estimating the computational resources available to the data-parallel applications on a given processor. One metric we have used is the average computation time per data item. Each processor computes this information by dividing the total time spent on the computation by the number of data elements it owned. This assumes that the variation in computational cost per data unit is relatively small.

[Table 5: Execution time of the parallelized loop for 500 iterations in an adaptive environment (in seconds).]
We first measured the performance of the library in a static environment. Table 3 shows the time required to build a communication schedule using the different methods described in Section 3. Simple Strategy corresponds to the time for building the communication schedule when an explicit translation table is used (which requires communication). Sort1 and Sort2 correspond to the time for building the communication schedule using schedule sort1 and schedule sort2, respectively. For a fixed graph, as the number of processors increases, the cost of sorting-based schedules will decrease because the amount of data assigned to each processor decreases. When the number of processors increases, the number of message setups increases, adversely affecting the simple strategy. The time requirements for the latter two schemes can be reduced by improving our current software. Table 4 gives the execution time of the library in static environments. These results show that a reasonable efficiency can be achieved in most cases.
We used the same environment as above to measure the performance in a controlled adaptive environment. The performance was measured using the following initial conditions:
1. A constant competing load was added to one of the processors (processor 1).
2. The graph was decomposed assuming all the processors had equal computational ratio.
We performed the following experiments:
1. The parallel loop was executed for 500 iterations without any load balancing.
2. The loop was executed for 10 iterations. A check was made after 10 iterations. Using the information gathered for the 10 iterations, a remapping was performed and was used for the remaining 490 iterations.
The results are presented in Table 5. As expected, these results show that using the remapping substantially reduces the time required for execution. The cost of load balancing (remapping and building the new communication schedule) is close to the time required for completing a few iterations of the parallel loop, while the cost of performing the load balance check is an order of magnitude lower. These results show that even if a check is done every 10 iterations, the overhead of performing this check will be small compared to the total execution cost; however, if the environment adapts during that time, the potential advantages of the remapping can be substantial. The frequency of this check and when the remapping should be performed are important parameters for achieving good performance, but are beyond the scope of this paper.

Conclusions
In this paper we have presented several optimizations necessary for the parallelization of data-parallel applications on an adaptive and nonuniform computational environment. The library was evaluated on a cluster of workstations using P4 in static and adaptive environments. We showed that our runtime library can be used for effective parallelization in the above environment. Several methods described in the paper are preliminary approaches for solving the subproblems. We are currently investigating improved methods for achieving similar goals, but at a considerably lower runtime overhead. Although the library was targeted towards solving an unstructured grid on a cluster of workstations, we believe many of the techniques developed in this paper are relevant for efficient solution of other regular as well as irregular data-parallel applications in a nonuniform and adaptive computational environment.