Software Support for Irregular and Loosely Synchronous Problems

A large class of scientific and engineering applications may be classified as irregular and loosely synchronous from the perspective of parallel processing. We present a partial classification of such problems. This classification has motivated us to enhance Fortran D to provide language support for irregular, loosely synchronous problems. We present techniques for the parallelization of such problems in the context of Fortran D.


Introduction
Although parallel computer systems have been widely available for several years, they have not yet fulfilled their enormous promise. In spite of the widespread interest in parallel systems, few scientists and engineers are using parallel machines to do their most important calculations, relying instead on conventional supercomputers. There are two reasons for this. First, parallel computer systems have only recently become powerful enough to outperform conventional supercomputers. Second, and more importantly, there exists no machine-independent programming interface for parallel machines that can achieve an efficiency comparable to programs hand coded in languages that reflect the specific underlying architectures. This second problem is particularly troublesome because it puts the parallel programming investment at risk: if a program is converted at great effort to run on a parallel machine, the investment may be lost when the next generation of parallel computers emerges with an entirely different programming interface. Today, each new parallel architecture requires a significantly different software implementation.

Software Model
An important lesson learned from using parallel machines has been the need for a close coupling between software and applications. Even though the problems that we and others have looked at tend to be in a limited domain, predominantly scientific and engineering simulations, we expect this lesson to be valid in general. Good performance for a parallel machine requires a good mapping of the problem onto the machine. Getting this mapping "right" seems to imply a close coupling between the application requirements and the software environment. Good mappings for many large problems have been discovered by users tuning their codes "by hand" using relatively crude software approaches. The Caltech Computation Project, for example, developed 50 successful parallel applications using node Fortran or C plus message passing on distributed memory MIMD multicomputers. Building on that success requires a more automatic method of detecting and implementing good problem mappings. Our thesis is that providing such an environment will be a great help toward establishing a portable programming model for parallel machines.
The success of hand-parallelization should be contrasted with the experience of parallelizing compilers, where false dependencies often prevent the compiler from exploiting the available parallelism. We can understand this as follows: the problem has a computational graph (such as a mesh for many signal processing or partial differential equation algorithms) that needs to be mapped onto the underlying parallel machine topology. In hand-coding programs, users are responsible for identifying the problem and machine topology and performing the mapping. The automatic compiler approach to parallelizing the C, Fortran or ADA code version of the problem fails when the compiler is unable to identify the underlying graph and the relation between program components. This can happen for a number of reasons.
1. The compiler's analysis can simply fail, reporting a dependence when none exists. (This is a particular problem in the loosely synchronous problems in Section 3, due to the data structures required there.) In these cases, there is little the programmer can do except complain to the compiler vendor.
2. An actual dependence may be an artifact of a sequential optimization, such as reusing an array's storage to save memory. In these cases, it is often possible to rewrite the program to allow parallelization, if the user can detect the problem.
3. The program may use an inherently sequential algorithm, or an algorithm with limited parallelism. For example, the standard method of solving a tridiagonal system uses a first-order recurrence that cannot be directly parallelized. In this case, the best option is to change to a different algorithm.
Our experience has been that fully automatic compilers often fail on realistic applications, although they may perform better on individual loop nests. Languages such as *LISP, C* and CM Fortran have succeeded on larger-scale problems because, unlike Fortran 77 or C, these "data-parallel" languages properly express the structure of the problem and its computation. Generalizing from the above discussion, we feel that successful parallel software models must either provide a mechanism for the programmer to express the decomposition (as in C with message passing extensions) or provide this mechanism indirectly (as in C*). We feel that the interaction of applications and software support (languages, run time systems) is very important for parallel computing. In other words, parallel computing demands "high-level" software support: software that precisely and effectively captures the structure of the application, resulting in automatic generation of good parallel programs. Our belief is that there is no need to write software designed for a single specialized domain. On the other hand, it is very hard to design universal software models. Instead, we define broad classes of computations (we now have a total of about ten) which together cover a large range, and each of which is itself large enough to warrant individually tailored, category-specific software support. We believe that our approach can be effectively extended to a much broader range of applications. Although this work was motivated by our Fortran D compiler project for SIMD and MIMD distributed memory machines, we believe the classification can immediately be used for these applications with other languages including C, C++ and ADA.

Problem Classification
We have classified problems into five broad categories in terms of the parallelization and software support issues they address: synchronous, loosely synchronous, asynchronous, embarrassingly parallel, and loosely synchronous complex.

Each problem category covers a broad range of applications. Current data parallel languages such as C* and Fortran D provide language support for expressing regular synchronous and loosely synchronous problems. The success of the Fortran D compiler project is partly due to our experience in parallelizing this class of scientific applications. In this paper we examine scientific applications that are irregular and loosely synchronous in nature. We present an overview of techniques for parallelizing such problems. Although we use specific applications as examples, our parallelization techniques are applicable to other disciplines and are in no way restricted to these particular codes. We propose language extensions and compiler techniques that are useful for successfully expressing such problems in a data parallel language such as Fortran D.
Section 2 provides a review of the architectural classification of problems. In Section 3 we describe different subclasses of irregular and loosely synchronous problems. In Section 4, we discuss several parallelization strategies for the inclusion of these problems in the solution space of Fortran D.

Problem Architectures
We have looked at many applications in a detailed survey [20]. Our analysis of problem architecture, or structure, is based on a breakdown of each problem into spatial (data) and temporal (control) aspects. Following Fox [14], we describe three problem architecture classes in terms of their temporal (time or synchronization) structure. The temporal structure of a problem is analogous to the hardware classification into SIMD and MIMD. The spatial structure of a problem provides the computational graph of the problem at a given instant and is analogous to the interconnect or topology of the hardware. The detailed spatial structure is important in determining the performance of an implementation, but it does not affect the broad categories.
Synchronous problems are data parallel with the restriction that the time dependence of each data point is computed by the same operations. Both algorithmically and in the natural SIMD implementation, the problem is synchronized microscopically at each computer clock cycle. Such problems are particularly common in academia as they naturally arise in any description of a system in terms of identical fundamental units. We believe that Fortran D (in its current version) should be able to address almost all of these problems.
Loosely synchronous problems are also typically data parallel, but now we allow different data points to be evolved with distinct algorithms. Points are also often connected in an irregular, data-dependent manner; for this reason we sometimes refer to this class as "irregular problems." Such problems appear when one describes the world macroscopically in terms of the interactions between irregular inhomogeneous objects evolved in a time synchronized fashion. Loosely synchronous problems are spatially irregular but temporally regular. This class is the main focus of this paper.
The asynchronous problem class is irregular in space and time. Because of this irregularity, it is difficult to give general methods for parallelizing asynchronous problems. Some run well with functional decompositions, some require real-time synchronization techniques, and some have never been run successfully on massively parallel machines. For a detailed description of these classes, the reader is referred to [19].
The class of embarrassingly parallel problems contains those problems that are totally disconnected in space and time. In these problems, no synchronization or communication is needed at all. (Actually, there is typically a final synchronized phase to collect the computed answers, but this uses only a small part of the total time.) Depending on the structure of the problem at each point, these can be run efficiently on either SIMD or MIMD hardware. We believe that Fortran D and other data-parallel languages should be able to express these problems well.
The class of loosely synchronous complex problems contains problems that are asynchronous collections of loosely synchronous problems. A typical command and control application belongs in this class. Each of the tasks in such an application is synchronous or loosely synchronous and can be parallelized individually. An overall asynchronous expert system coordinates the interaction between these tasks.

Types of Loosely Synchronous Problems
General purpose mapping tools and runtime support must be able to handle a reasonably broad range of problems. As mentioned in the previous sections, we intend to develop a parallel software environment for what we call loosely synchronous problems, linked to the Fortran D compiler project at Rice and Syracuse Universities. This concept has been explained in detail in [13,14,15]. The current Fortran D is designed to handle the special cases of synchronous problems and loosely synchronous problems with regular interconnection patterns. In extending the Fortran D environment, we have found it useful to divide this class of problems into several subclasses, which are described below. All loosely synchronous problems can, by definition, be divided into a sequence of concurrent computational phases. The differences between the subclasses lie in how the phases are separated and when the computation and communication patterns within phases are set. In the remainder of this section, we will describe several subclasses of loosely synchronous problems, illustrated by actual applications. We present these subclasses to give an idea of the types of problems we plan to address, but we do not claim at this point to be in a position to present any kind of formal taxonomy. As described in Section 3.5, our classification is of course not complete, and we are continuing our study of problem structures [13, 14].

Static Single Phase Computations
A static single phase computation consists of a single concurrent computational phase, which may be executed repeatedly without change. Examples of static single phase computations are iterative solvers using sparse matrix-vector multiplications (e.g. [32]) and explicit unstructured mesh fluids calculations (e.g. [42]). The key problem in efficiently executing these programs is partitioning the data and computation to minimize communication while balancing load. This partitioning then dictates the program's synchronization and communication requirements, which must also be computed. Because the computational pattern is only set at run time, this cannot be done directly by the compiler; instead, calls to a run-time environment must be generated to do the partitioning dynamically. Reducing the overhead of these calls, both by reusing information computed in the calls and by performing the calls efficiently, is also vital for high efficiency. The PARTI library [10] and the Kali compiler [17] introduced the inspector/executor paradigm to perform these optimizations.
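The essence of the paradigm can be sketched as follows: an inspector examines the indirection arrays once and builds a communication schedule, and an executor reuses that schedule on every iteration. The routine names build_schedule and gather_exchange below are hypothetical stand-ins, not the actual PARTI or Kali interfaces.

C Inspector/executor sketch. ia is an indirection array and nlocal
C the number of locally owned elements; build_schedule and
C gather_exchange are hypothetical runtime-support routines.
C Inspector: run once; examine ia to find which off-processor
C elements of x are referenced, and build a communication schedule.
      call build_schedule(ia, nlocal, schedule)
C Executor: run every iteration; the schedule drives a gather of the
C needed off-processor values of x into the buffer xbuf, after which
C the sweep references only local data.
      do iter = 1, niter
         call gather_exchange(x, xbuf, schedule)
         do i = 1, nlocal
            y(i) = y(i) + a(i) * xbuf(ia(i))
         end do
      end do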
In the remainder of this section, we describe some of the details that must be considered in implementing these kernels.
In some cases, there is a straightforward relationship between the way we partition distributed arrays and the way we partition work. Figure 1 depicts a sparse matrix-vector multiply. The integer array col is used to represent the sparsity structure of the matrix. Loop S1 sweeps over the matrix rows, while loop S2 sweeps over the columns of the sparse matrix and calculates the required inner product. If the sparse matrix-vector multiply in Figure 1 is to be carried out repeatedly, it is reasonable to partition x and y between processors in a conforming manner. In such a problem, we can follow the common convention of carrying out the computational work associated with computing a value for distributed array element y(i) on the processor onto which y(i) is mapped [16].
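In outline, the kernel of Figure 1 has the following shape; the value array a, the vectors x and y, and the bounds n and ncol(i) are assumed names, since the text specifies only the indirection array col and the loop labels S1 and S2.

C Sparse matrix-vector multiply y = A x. col(i,j) gives the column
C index of the j-th nonzero of row i and a(i,j) its value; ncol(i)
C is the number of nonzeros in row i. (a, x, y, n, and ncol are
C assumed names.)
C S1: sweep over the matrix rows
      do i = 1, n
         y(i) = 0.0
C S2: sweep over the nonzero columns of row i (inner product)
         do j = 1, ncol(i)
            y(i) = y(i) + a(i,j) * x(col(i,j))
         end do
      end do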
There are other common cases in which the assignment of distributed array elements to processors and the assignment of work to processors cannot be coupled in such a straightforward fashion. Figure 2 depicts a simplified loop that sweeps over the edges of a mesh: a flux across each mesh edge is calculated from flow variables stored in array x and accumulated into array y. Indirection is used to index array x on the right hand side of S3, and indirection is used to index array y on the left hand sides of S4 and S5. In this loop, it appears to be advantageous to assign each iteration of the loop to a single processor. By doing this, we avoid having either to recalculate or to communicate values for flux, since y(n1) and y(n2) appear on the left hand sides of statements S4 and S5. We can see that we must now determine separately how to partition distributed array elements and loop iterations.
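A sketch consistent with this description of Figure 2 follows; the edge array nde, the bound nedge, and the particular flux expression are assumptions.

C Sweep over mesh edges. Edge e joins mesh points n1 and n2; a flux
C across the edge is computed from flow variables in x (S3) and
C accumulated into y at both endpoints (S4, S5).
C (nde, nedge, and the flux formula are assumed names.)
      do e = 1, nedge
         n1 = nde(e,1)
         n2 = nde(e,2)
C S3: flux computed from indirectly indexed x
         flux = 0.5 * (x(n1) - x(n2))
C S4, S5: flux accumulated into indirectly indexed y
         y(n1) = y(n1) + flux
         y(n2) = y(n2) - flux
      end do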

Multiple Phase Computations
A multiple phase computation consists of a series of dissimilar loosely synchronous computational phases. Such applications usually have several parallelizable loops that involve a variety of distributed arrays. In this section, we will only consider the case where each individual phase is a static single phase computation as defined above. Examples of these computations include unstructured multigrid (e.g. [26]), parallelized sparse triangular solvers (e.g. [4, 1]), particle-in-cell codes (e.g. [38, 24]), and vortex blob calculations [3]. The key problem in implementation is again partitioning computation and data, but now the task is complicated because the interfaces between phases must be considered in the partitioning. The synchronization and communication requirements are similarly complicated by the multiple phases. As for static single phase computations, this partitioning must be performed at run time. Saltz and his coworkers have recently extended the PARTI library to include incremental routines which will be applicable to these problems [29]. It is not clear whether further extensions will also be needed. It is clear, however, that these computations can again be handled by this kind of run-time preprocessing. In the remainder of this section, we describe the unstructured multigrid application to show some of the implementation complexities of this class.
Unstructured multigrid codes [26] carry out mesh relaxation over each of several increasingly refined meshes M1, ..., Mn. Figure 3 and Figure 4 depict two levels of these meshes from a fluid dynamics code that we have parallelized. Both of these grids represent the same physical geometry, but the grid in Figure 4 is more highly refined than the grid in Figure 3. The algorithm alternates between sweeping over each mesh and moving data between meshes, as shown in Figure 5. The meshes M1, ..., Mn should be partitioned so that
1. sweeps over each mesh Mi do not require excessive amounts of interprocessor communication,
2. the computation involved in sweeping over each mesh exhibits good load balance, and
3. interpolations and projections require only modest amounts of data movement.
We have partitioned the grids in our example using the partitioner described in [35] with good results, but there are many other possible partitioners.

Figure 5.
C Greatly oversimplified multiple mesh computation - sweep over
C coarse mesh, transfer information to fine mesh, sweep over fine
C mesh, and transfer information back to coarse mesh. xc, yc are
C coarse mesh variables; xf, yf are fine mesh variables.
C Typically these computations are carried out in an iterative manner.
C Sweep over coarse mesh
      do i = 1, ncoarse
         do j = 1, kcoarse
            yc(i) = yc(i) + ac(i,j) * xc(ic(i,j))
         end do
      end do
C Transfer data from coarse mesh to fine mesh
      do i = 1, nfine
         do j = 1, ninterpf(i)
            xf(i) = xf(i) + weightf(i,j) * yc(interpf(i,j))
         end do
      end do
C Sweep over fine mesh
      do i = 1, nfine
         do j = 1, kfine
            yf(i) = yf(i) + af(i,j) * xf(if(i,j))
         end do
      end do
C Transfer data from fine mesh to coarse mesh
      do i = 1, ncoarse
         do j = 1, ninterpc(i)
            xc(i) = xc(i) + wc(i,j) * yf(interpc(i,j))
         end do
      end do

Adaptive Irregular Computations

An adaptive irregular computation consists of a loosely synchronous computation executed repeatedly in which the data access pattern changes between iterations. The changes may be gradual, reflecting adiabatic changes in the physical domain, or large-scale, reflecting additions to a data structure. Molecular dynamics applications often exhibit the first behavior because interactions between particles are implemented by neighbor lists which change as the atoms move [6]. Adaptive PDE solvers are often examples of the second behavior, as discussed below. Other examples with which we are familiar include some vision algorithms, including region growing and labeling [7, 41], statistical physics simulations near critical points [8], and the particle sorting phase of a direct Monte Carlo simulation [9]. The key problem in implementing these algorithms is to react quickly to changes in the data structure. The physical and numeric properties of these algorithms typically guarantee that large-scale restructuring of data is needed only infrequently. New constructs are needed, however, to communicate this to the underlying system software.
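For the neighbor-list case, the resulting program structure can be sketched as follows; build_schedule, gather_exchange, and compute_forces are hypothetical runtime-support routines, as in the inspector/executor sketch above.

C Adaptive irregular sketch: the neighbor list nbr changes only
C occasionally. The communication schedule is rebuilt when the list
C changes and reused across the intervening time steps.
C (build_schedule, gather_exchange, and compute_forces are
C hypothetical; the neighbor-list update itself is elided.)
      do step = 1, nsteps
         if (list_changed) then
            call build_schedule(nbr, nlocal, schedule)
            list_changed = .false.
         end if
         call gather_exchange(x, xbuf, schedule)
         call compute_forces(xbuf, nbr, f)
      end do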
Adaptive algorithms are useful for solving the Euler and Navier-Stokes problems that arise in aerodynamics. In these algorithms, mesh refinement is carried out in portions of a computational domain where it is estimated that additional resolution may be required (e.g., see [39,30]). The grid in Figure 6 is an adaptive refinement of the grid in Figure 4. The initial mesh-point distribution is determined from the geometry of the airfoil to be simulated. Adaptive mesh refinement is achieved by adding new points in regions of large flow gradients. A simple version of the algorithm is presented in Figure 7. The remapping needs to be performed before the inner do loop is executed.

Figure 7.
C Adaptive Two Mesh Algorithm
C Coarse mesh Uc covers entire domain
C Refined mesh Ur covers "active" portion of domain
C Location, shape, and size of refined mesh all change
      do kc = 1 to K
         Sweep over Uc
         Flag region of Uc that should be refined.
         If flagged region is not empty:
            Modify shape of Ur
            Interpolate boundary values for Ur from Uc.
            do kr = 1 to Kr
               Sweep over Ur
            end do
            Inject values of Ur into Uc
      end do

Implicit Multiphase Loosely Synchronous Computations

An implicit multiphase computation is one containing irregular inter-iteration dependencies. The problems discussed thus far have consisted of a sequence of clearly demarcated computational phases. There are a number of problems in which there are inter-iteration dependencies that might at first appear to inhibit parallelization. These data dependency patterns can, however, be fully predicted before a program enters the irregular loop or loops. Figure 8 shows the back substitution phase of a sparse matrix factorization, a simple algorithm of this type.
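A minimal back-substitution kernel of the general kind shown in Figure 8 can be sketched as follows, for an upper triangular system stored by rows; all array names here are assumptions.

C Sparse back substitution: solve U x = b with U upper triangular.
C ju(i,j) gives the column of the j-th off-diagonal nonzero in row i,
C u(i,j) its value, d(i) the diagonal, nz(i) the off-diagonal count.
C The reference x(ju(i,j)), with ju(i,j) > i, is the irregular
C inter-iteration dependence discussed in the text.
      do i = n, 1, -1
         x(i) = b(i)
         do j = 1, nz(i)
            x(i) = x(i) - u(i,j) * x(ju(i,j))
         end do
         x(i) = x(i) / d(i)
      end do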
This is similar to solving sparse triangular systems of linear equations arising from ILU preconditioning methods [36,37]. Another example of this class is the tree generation phase of the adaptive fast multipole algorithms for particle dynamics [18,33].
The key problem in implementing these algorithms is to detect and exploit opportunities for partial parallelization. In Figure 8, it is often possible to carry out many simultaneous row substitutions. The sparsity structure of the system determines which row substitutions can be carried out concurrently; however, this information is only available at run time. In such problems, we carry out a form of runtime preprocessing with the goal of defining a sequence of loosely synchronous computational phases. On bus-based shared memory multiprocessors, we have demonstrated that it is possible to integrate runtime parallelization with compilers [34]. We anticipate that it will also be possible to link runtime parallelization with compilers aimed at scalable multiprocessors, and we have carried out preliminary work in this area. A more difficult problem is that of runtime aggregation of work and data. When we carry out sparse computations such as sparse triangular solves or sparse direct factorizations [11], our runtime preprocessing can determine the number and content of the concurrent computational phases that will comprise a computation. We will call this process runtime aggregation or runtime tiling. There have been a variety of numerical algorithms to carry out what we call runtime tiling for multiprocessor and vector computers; a small subset of this extensive collection of methods may be found in [21, 2].
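One common form of such preprocessing is level scheduling, sketched below for the back-substitution kernel above; this is offered as a generic illustration of runtime tiling, not as the specific method of [34] or [21, 2].

C Runtime preprocessing (level scheduling) for the back substitution
C sketch above. Row i may be processed once every row it references
C is done, so its level is one more than the deepest level among its
C off-diagonal references. Rows sharing a level never reference one
C another, so each level forms one loosely synchronous phase whose
C rows may be substituted concurrently.
      do i = n, 1, -1
         lev = 0
         do j = 1, nz(i)
            lev = max(lev, level(ju(i,j)))
         end do
         level(i) = lev + 1
      end do
C The number of concurrent phases is the maximum value in level.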

Static and Dynamic Structured Problems
This class of problems consists of highly structured computations on sets of subdomains that are coupled in an irregular manner. The computations on each individual subdomain are frequently highly structured, but the computational relationship between the subdomains is known only at runtime. Furthermore, the relationship between the subdomains frequently changes dynamically during the course of a computation. The examples described in this sub-section differ from the examples described in the previous four sub-sections in that the previous problems consist of irregularly coupled "points," whereas we now deal with collections of nontrivial structures. Examples of such problems include the adaptive mesh method described below and a combined hydrodynamics and particle astrophysics simulation implemented by Edelson at Syracuse [12]. The key to efficiency on these problems is to aggressively apply optimizations to the regular subproblems, which can be implemented with lower overheads. Also, the larger granularity of the coupled subproblems can be exploited to reduce preprocessing overheads and also reduce memory requirements [5].
An example of this class is shock profiling as described in [5]. The basic problem is to solve a partial differential equation in the presence of a shock, computing the profile (detailed shape) of the shock. Resolution of the profile implies that a highly refined grid must be used in a neighborhood of the shock. The method initially computes the solution on a coarse mesh. An error estimator is then applied to determine the regions that will be covered by a refined mesh. An example mesh from this two-level refinement is shown in Figure 9. The solution is time-dependent. Time-marching on the refined mesh is performed by taking many (e.g., 100) time steps on the refined mesh for a single coarse-grid time step. The refined mesh is dynamic: its location, shape, and size all change. This means that the relationship of the two meshes will change during the execution of the program. Hence the structure of the computation changes with time, and a non-uniform communication pattern arises due to the sharing of data between grids. This example also generalizes to a full structured adaptive multigrid. An example of a mesh employed in such a full structured adaptive multigrid, used in a solution of the Euler equations, may be seen in Figure 10.
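In outline, the method just described has the following time-marching structure; all routine names are hypothetical, and the figure of 100 fine steps per coarse step is the example quoted above.

C Two-level time marching for shock profiling. The refined mesh is
C relocated as the shock moves, and many refined-mesh steps are
C taken per coarse-mesh step. (All routine names are hypothetical.)
      do tc = 1, ncoarsesteps
C One time step on the coarse mesh covering the whole domain
         call coarse_step(uc)
C Error estimator locates the region to be covered by the fine mesh
         call estimate_error(uc, region)
C Move and reshape the refined mesh over that region
         call move_refined_mesh(ur, region, uc)
C Many fine-mesh steps (e.g. 100) per coarse-mesh step
         do tf = 1, 100
            call fine_step(ur)
         end do
C Inject refined-mesh values back into the coarse mesh
         call inject(ur, uc)
      end do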

Conclusions
In this paper, we presented a partial classification of scientific and engineering applications which are irregular and loosely synchronous from the perspective of parallel processing. This classification should be helpful in extending Fortran D to permit its application to a large class of loosely synchronous problems. A few important tasks are necessary to accomplish this. While we have made significant progress on each of these tasks, much work remains to be carried out.
First, there is a need for the development of automatic and semi-automatic data partitioners and a strategy for incorporating them in a compiler. Currently, partitioners are designed using programmers' a priori knowledge about a problem's computational structure and its expected computational behavior. There has been significant progress in the development of robust partitioners for static single phase loosely synchronous calculations (see, e.g., [35,22]), but much work remains to be done in order to deal with the other problem classes. Similarly, we have proposed a scheme for integrating data partitioners into compilers that appears to be appropriate for static single phase and perhaps for multiphase loops [29]. Much work is needed to generalize these methods before they are able to handle the more challenging classes of computations. Some preliminary work along these lines has been reported in [28] and [25].
Time dependent or iterative loosely synchronous computational problems can exhibit a range of dynamic behaviors. These behaviors can be divided into three rough categories:
(A) the data dependency pattern is static and does not change between iterations;
(B) the data dependency pattern is modified on occasion, but between changes the dependency pattern remains static for many iterations;
(C) the data dependency pattern changes every iteration.
Problems in category A would fall either into the class of static, single phase loosely synchronous computations (Section 3.1) or into the class of static, multiple phase loosely synchronous computations (Section 3.2), while problems in categories B and C would fall into the class of unstructured adaptive problems (Sections 3.3, 3.4) or structured adaptive problems (Section 3.5). It is also useful to categorize irregular problems by whether a given iteration or time-step is composed of multiple, dissimilar loosely synchronous computational phases. In such cases, it is often necessary to partition a problem in a way that takes into account all of the computational phases in an iteration. Further, there are issues related to partitioning and runtime aggregation [28, 21, 2] which can affect the performance of these problems.

Second, we need to standardize extensions to Fortran D to facilitate the specification of partitioning strategies and irregular meshes. These extensions will be used to
1. indicate which loops in a program should be taken into account when considering how to partition distributed arrays,
2. allow users to force the selection of a particular partitioner,
3. allow users to assert that a given set of loop dependencies can or cannot change when the loop is iteratively invoked, and
4. allow users to specify the granularity with which parallelism is to be exploited.
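To give the flavor of such extensions, the sketch below pairs each of the four goals with a hypothetical directive form; the syntax and keywords are illustrative assumptions, not actual or proposed Fortran D syntax.

C Hypothetical directive forms for the four goals above; these are
C illustrative assumptions, not actual Fortran D syntax.
C$ PARTITION BASED ON (loop_s1, loop_s2)    goal 1: relevant loops
C$ USE PARTITIONER (recursive_bisection)    goal 2: force a partitioner
C$ ASSERT STATIC DEPENDENCIES (edge_loop)   goal 3: pattern is fixed
C$ PARALLEL GRANULARITY (subdomain)         goal 4: grain of parallelism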
In [29], we have proposed extensions (and developed runtime support) that fulfill the first two of the above-mentioned goals. There is also a need for the development of new data structures targeted towards problems in which highly structured computations on a set of subdomains are coupled in an irregular manner. We are particularly interested in representing structured adaptive problems in which subdomains are coupled by irregular tree dependency structures.
In this volume, in the paper by Sussman et al., we describe portable runtime support for static single phase and multiphase problems, and for static structured problems. This runtime support is oriented towards distributed memory MIMD architectures. The runtime support for static single phase and multiphase problems has also been ported to SIMD architectures, but the static structured runtime support has not yet been implemented on a SIMD architecture. There is still a clear need for the development of appropriate runtime support for adaptive irregular computations, implicit multiphase loosely synchronous computations, and dynamic structured problems, targeted towards SIMD and MIMD distributed memory architectures.