Compiling Fortran 90D/HPF for Distributed Memory MIMD Computers

Distributed memory multiprocessors are increasingly being used to provide high performance for advanced calculations with scientific applications. Distributed memory machines offer significant advantages over their shared memory counterparts in terms of cost and scalability, though it is widely accepted that they are difficult to program given the current status of software technology. Currently, distributed memory machines are programmed using a node language and a message passing library. This process is tedious and error prone because the user must perform the task of data distribution and communication for non-local data access. 
This thesis describes an advanced compiler that can generate efficient parallel programs when the source programming language naturally represents an application's parallelism. Fortran 90D/HPF described in this thesis is such a language. Using Fortran 90D/HPF, parallelism is represented with parallel constructs, such as array operations, where statements, forall statements, and intrinsic functions. The language provides directives for data distribution. Fortran 90D/HPF gives the programmer powerful tools to express a problem with natural data parallelism. To validate this hypothesis, a prototype of Fortran 90D/HPF was implemented. The compiler is organized around several major units: language parsing, partitioning data and computation, detecting communication and generating code. 
The compiler recognizes the presence of communication patterns in the computations in order to generate appropriate communication calls. Specifically, this involves a number of tests on the relationships among subscripts of various arrays in a statement. The compiler includes a specially designed algorithm to detect communications and to generate appropriate collective communication calls to execute array assignments and forall statements. 
The Fortran 90D/HPF compiler performs several types of communication and computation optimizations to improve the performance of the generated code. 
Empirical measurements show that the performance of the output of the Fortran 90D/HPF compiler is comparable to that of corresponding hand-written codes on several systems. 
We hope that this thesis assists in the widespread adoption of parallel computing technology and leads to a more attractive and powerful software development environment to support application parallelism that many users need.


Introduction
problems. In Fortran 90D/HPF, parallelism is represented with parallel constructs such as array operations, where statements, forall statements, and intrinsic functions. This gives the programmer a p o werful tool to express the data parallelism natural to a problem. This paper presents the design of a prototype compiler for Fortran 90D/HPF. The compiler takes as input a program written in Fortran 90D/HPF. Output is a SPMD (Single Program Multiple Data) program with appropriate data and computation partitioning and communication calls for distributed memory MIMD machines. Therefore, the user can still program using a data parallel language but is relieved of the responsibility to perform data distribution and communication.
The rest of this paper is organized as follows. The compiler system overview is described in Section 2. Data partitioning, and computation partitioning are discussed in Sections 3, and 4. Section 5 presents the communication primitives and communication generation for Fortran 90D/HPF programs. In Section 6, we present the runtime support system including the intrinsic functions. Section 7 summarizes our initial experience using the current v ersion of the compiler. It also presents a comparison of the performance with hand-written parallel code. Section 8 presents a summary of related work. Finally, a summary and conclusions are presented in Section 9.

Compiler System Overview
Our Fortran 90D/HPF compiler exploits only the parallelism expressed in the data parallel constructs. We do not attempt to parallelize other constructs such a s do loops and while loops, since they are used only as naturally sequential control constructs in this language. The foundation of our design lies in recognizing commonly occurring computation and communication patterns. These patterns are then replaced by calls to optimized run-time support system routines. The run-time support system includes parallel intrinsic functions, data distribution functions, communication primitives and several other miscellaneous routines.
The basic structure of our Fortran 90D/HPF compiler is organized around four major modules: parsing, partitioning, communication detection and insertion, and code generation. Given a syntactically correct Fortran90D/HPF program, the rst step of the compilation is to generate a parse tree. The front-end to parse Fortran 90 for the compiler was obtained from ParaSoft Corp. This module parses the input program into an abstract syntax tree, performs semantic analysis to annotate the tree with type information, and builds up a symbol table it also performs error checking.

Data Partitioning
Distributed memory systems solve the memory bottleneck o f v ector supercomputers by h a ving separate memory for each processor. However, distributed memory systems requirehigh locality f o r good performance. Therefore, the distribution of data across processors is of critical importance to the e ciency of a parallel program in a distributed memory system.
Fortran D provides users with explicit control over data partitioning with data alignment and distribution speci cations. It has three main compiler directives. The DECOMPOSITION directive is used to declare the name, dimensionality, and the size of each problem domain. A decomposition is simply an abstract representation of a problem or index domain. We call it \template" (a name chosen to describe DECOMPOSITION in HPF 3]).
The ALIGN directive speci es ne-grain parallelism, mapping each array element o n to one or more elements of the template. There may b e m ultiple templates representing di erent problem mappings, but an array m a y be aligned to only one template at any time. All scalars are replicated. An array not explicitly aligned to any template serves as its own template.
The DISTRIBUTE directive speci es coarse-grain parallelism, grouping template elements and mapping them (and aligned array elements) to the nite resources of the machine. Each dimension of the template is distributed in either block or cyclic fashion. The selected distribution can a ect the ability of the compiler to minimize communication and load imbalance in the resulting program.
The Fortran 90D/HPF compiler maps arrays to the physical processors using a three stage mapping as shown in Figure 1. This three stage mapping has also been proposed in HPF 3]. Stage 1: ALIGN directives are processed to compute functions that map the array index domain to the template index domain and vice versa. Stage 2: Each dimension of a template is mapped onto the logical processor grid based on the DISTRIBUTE directives. Block divides the template into contiguous chunks. Cyclic speci es a round-robin division of the template. The mapping functions and ;1 to generate relationship between global and local indices are computed. These function have been studied extensively by Koelbel 9]. Stage 3: The logical processor grid is mapped onto the physical system. This mapping can change from one system to another, but the data mapping onto the logical processor grid does not need to change. This enhances portability across a large number of architectures.
By performing the above three stage mapping, the compiler is decoupled from the speci cs of a given machine or con guration. We n o w g i v e the compilation techniques of Stage 1. The compilation of distribution directives and Stages 2 and 3 are described in more detail 10].
Compiling the ALIGN Directive (Stage 1) Alignment determines which portions of two or more arrays will be mapped to the same processor.
Clearly, i f a r r a ys involved in the same computation are aligned in such a manner that after distribution their respective sections lie on the same processor then the number of non-local accesses will be reduced. The DECOMPOSITION directive de nes the shape and rank of a given template.
Let A be an m-dimensional array a n d TE MPLbe an n-dimensional template. The general form of the alignment directive i s C$ ALIGN A(i1 *], ... ,im *]) WITH TEMPL(f1(ia 1 ) *], ... ,fn(ia m ) *]). The speci ed elements of A are aligned to those of TE M PL . The template is eventually distributed on a set of processors and the compiler guarantees that the array elements aligned to the same element of the template will be mapped to the same processor.
An alignment function, f k , is required to be an a ne function. That is, f k = s k i a k + o k or f k = o k . The parameters i a k , s k , and o k correspond to the three components of the alignment: The symbol \*" indicates that the corresponding dimension is replicated or collapsed. It may Algorithm 1 (Compiling Align directives) Input: Fortran 90D/HPF syntax tree with some alignment functions to template Output: Fortran 90D/HPF syntax tree with identical alignment functions to template Method: For each aligned array, and for each dimension of that array, carry out the following steps: Step 1. Extend aligned arrays to match template size.
Step 2. Determine local shape of arrays.
Step 3. Apply alignment functions to the aligned arrays.
Step 4. Transform into canonical form.
appear in both the array and the template subscripts. The array rank (number of dimensions) m may be di erent from the rank of the template, n. F or example, the directive C$ ALIGN A(i,*) WITH TEMPL(i + 1 ) requires the second dimension of the array A be collapsed (not distributed), while the directive C$ ALIGN A(i) WITH TEMPL(*,i + 1 ) forces replication of array A along the rst dimension of the template TE MPL .
Algorithm 1 gives the steps in the algorithm used by our Fortran 90D/HPF compiler to process the align directives.
The following example illustrates the steps and all the transformations performed to transform array indices from the array index domain to template index domain and vice versa.
Consider the Fortran 90D/HPF code fragment s h o wn in Figure 2. There are three arrays ODD(N/2), EVEN(N/2) and NUM(N). Elements of the array ODD are aligned with odd elements of TEMPL. Similarly, elements of the array EVEN are aligned with the even elements of TEMPL. NUM is aligned identically with TEMPL. Hence, ODD and EVEN are aligned with odd and even indices of NUM respectively, because they are aligned to the same template.
Step 1. Extend aligned a r r ays to match template size. Note that we assume that the array size is equal to or smaller than the template size in the distributed dimension(s). If an array size is smaller than the template size in the distributed dimension, the compiler extends the array size to match the template size. For example, ODD and EVEN arrays are extended to size N to match the template TEMPL's size, which i s N. Note that an array is only extended in the distributed compute and store the local index in a table. However, this introduces a level of indirection for each access. Furthermore, storing an index table also requires memory space that can be potentially as large as the distributed array itself. Many of the commercial compilers (e.g., Dec MPP Fortran 6], CM-Fortran 4], and Cray M P P F ortran 13]) extend arrays to the nearest power of two, whereas we extend in the distributed dimension to match the template size.
Step 2. Determine local shape o f a r r ays. In this step, the compiler determines the local shape and size of the distributed arrays based on the processor grid information associated with the corresponding template. In the above example, the template TEMPL is distributed on P processors. P is a compile-time parameter for each dimension of DISTRIBUTION directive. Hence, the compiler determines the size of the distributed dimension of arrays as ODD(dN=Pe), EVEN(dN=Pe) a n d NUM(dN=Pe). Since our compiler produces SPMD code, array declarations are the same in every processor.
Step 3. Apply alignment functions to the aligned a r r ays. In this step, all indices of each Step 4. Transform into canonical form 1 . In this step, the compiler simpli es all functions applied in Step 3 by performing symbolic manipulation and partial evaluation of constants. For example, statement (1) becomes NUM(I)=ODD(I).
The above simpli cation of indices helps the compiler to choose e cient collective c o m m unication routines. Our communication detection algorithm is based on symbolically comparing the lhs and rhs reference patterns and determining if the pattern is associated with one of the collective communication routines. In the above statement the compiler compares lhs and rhs indices and determines that no communication is required, because both of the array reference patterns are given by I and are aligned to the same template. However, if the rhs are ODD(I+2), it will be recognized as a shift communication.
Step 5. Compute f ;1 (i). For each array, w e compute the inverse alignment function f ;1 (i) corresponding to each f(i) f ;1 (i) is stored in the Distributed A rray Descriptor (DAD) 14]. In DAD, for each alignment function (a ne) we store constants a and b, where f(i) = ai + b. This function is needed when any computation needs to be performed using the original index of an array. F or example, the last statement in Figure 2 calls the intrinsic function MAXLOC to nd the location of the maximum element i n t h e a r r a y ODD. This function must be evaluated using the original array indices. The inverse function for array ODD is f ;1 (i) = i+1 2 . MAXLOC will return the location of the maximum value in the original array index domain by applying the f ;1 function.

Computation Partitioning
Once the data is distributed, there are several alternatives for assigning computations to processing elements (PEs) for each instance of a forall statement. Note that we i n ternally transform all array statements into equivalent forall representations. One of the most common methods for computation assignment i s t o u s e t h e owner computes rule. In the owner computes rule, the computation is assigned to the PE owning the lhs data element. This rule is simple to implement and performs well in many cases. Most of the current implementations of parallelizing compilers use the owner computes rule 5, 1 5 ]. However, it may not be possible to apply the owner computes rule for every case. The following examples describe how our compiler performs computation partitioning.
Example 1 (canonical form) Consider the following statement, taken from the Jacobi relaxation program In the above example, as in a large number of scienti c computations, the forall statement c a n be written in the canonical form. In this form, the subscript value in the lhs is identical to the forall iteration variable. In such cases, the iterations can be easily distributed using the owner computes rule.  However, this type of forall statement will result in either Case 3 or Case 4 in Figure 2. The generated code will be in the following order. 1. Improved p erformance o f F ortran 90D/HPF programs. T o a c hieve good performance, interprocessor communication must be minimized. By developing a separate library of interprocessor communication routines, each routine can be optimized. This is particularly important given that the routines will be used by m a n y programs compiled through the compiler.

Communication Primitives
In order to perform a collective c o m m unication on array elements, the communication primitive needs the following information: send processors list, receive processors list, local index list of the source array, and local index list of the destination array. There are two w ays of determining the above information. 1) Using a preprocessing loop to compute the above v alues or, 2) based on the type of communication the above information may be implicitly available, and therefore, not require preprocessing. We classify our communication primitives into unstructured and structured communication, respectively.
Our structured communication primitives are based on a logical grid con guration of the processors. Hence, they use grid-based communications such as shift along dimensions, broadcast along dimensions, etc. The following summarizes some of the structured communication primitives implemented in our compiler. transfer: Single source to single destination message. multicast: Broadcast along a dimension of the logical grid. overlap shift: Shifting data into overlap areas in one or more grid dimensions. This is particularly useful when the shift amount i s k n o wn at compile time. This primitive uses that fact to avoid intra-processor copying of data and directly stores data in the overlap areas 20]. temporary shift: This is similar to overlap shift except that the data is shifted into a temporary array. This is useful when the shift amount is not a compile time constant. This shift may require intra-processor copying of data.
concatenation: This primitive concatenates a distributed array and the resultant array ends up in all the processors participating in this primitive.
We h a ve implemented two sets of unstructured communication primitives: 1) where the communicating processors can determine the send and receive lists based only on local information, and hence, only require preprocessing that involves local computations, 9] and 2) where to determine the send and receive lists preprocessing itself requires communication among the processors 21]. precomp read: This primitive is used to bring all non-local data to the place it is needed before the computation is performed. postcomp write: This primitive is used to store remote data by sending it to the processors that own the data after the computation is performed. Note that these two primitives require only local computation in the preprocessing loop.
gather: This is similar to precomp read except that preprocessing loop itself may require communication.
scatter: This is similar to postcomp write except that preprocessing loop itself may require communication.
The compiler must recognize the presence of collective c o m m unication patterns in the computations in order to generate the appropriate communication calls. Speci cally, this involves a number of tests on the relationships among the subscripts of various arrays in a forall statement. These tests should also include information about array alignments and distributions. We use pattern matching techniques similar to those proposed by Li and Chen 22]. Further, we extend the above tests to include unstructured communication. Table 1

Communication Generation
Having The second subscript of B marked as multicast and the rst one as no communication.  The array B is marked as requiring gather communication since the subscript can only be known at runtime. The receiving processors can know what non-local data they need from other processors, but a processor may not know what local data it needs to send to other processors.
For simplicity, in this example we assume that the indirection array V is replicated. If it is not replicated, the indirection array m ust also be communicated to compute the receive list on each processor. Once the scheduling is completed, every processor knows exactly which non-local data elements it needs to send to (and receive from) other processors. The task of scheduler2 is to determine exactly which send and receive communications must be carried out by each processor. The scheduler rst gures out how m a n y messages each processor will have to send and receive during the data exchange. Each processor computes the number of elements (receive list) and the local index of each element it needs from all other processors. In schedule2 routine, processors communicate to combine these lists (a fan-in type of communication). At the end of this processing, each processor contains the send and receive list. After this point, each processor transmits to the appropriate processors a list of required array elements (local list). Each processor now has the information required to set up the communication schedule.
The schedule isch can also be used to carry out identical patterns of data exchanges on several di erent, but identically distributed arrays or array sections. The same schedule can be reused repeatedly to carry out a particular pattern of data exchange on a single distributed array. In these cases, the cost of generating the schedules can be amortized by executing it only once. This analysis can be performed at compile time. Hence, if the compiler recognizes that the same schedule can be reused, it does not generate code for scheduling but passes a pointer to the already existing schedule.
The gather and scatter operations are powerful enough to provide the ability t o r e a d a n d write distributed arrays with vectorized communication facility. These two primitives are available in PARTI (Parallel Automatic Runtime Toolkit at ICASE) 23] designed to e ciently support irregular patterns of distributed array accesses. PARTI and other communication primitives and intrinsic functions form the run-time support system of our Fortran 90D compiler.

Run-time Support System
The Fortran 90D/HPF compiler relies on a very powerful run-time support system. The run-time support system consists of functions that can be called from the node programs of a distributed memory machine. Intrinsic functions support many of the basic data parallel operations in Fortran 90. They not only provide a concise means of expressing operations on arrays, but also identify parallel computation patterns that may be di cult to detect automatically. F ortran 90 provides intrinsic functions for operations such as shift, reduction, transpose, reshape, and matrix multiplication. The intrinsic functions that may induce communication can be divided into ve categories as shown in Table 2. The rst category requires data to be transferred using fewer overhead structured shift communications operations. The second category of intrinsic functions require computations based on local data followed by the use of a reduction tree on the processors involved in the execution of the intrinsic function. The third category uses multiple broadcast trees to spread data. The fourth category is implemented using unstructured communication patterns. The fth category is implemented using existing research on parallel matrix algorithms 16]. Some intrinsic functions can be further optimized for the underlying hardware architecture. Table 3 presents a sample of performance numbers for a subset of the intrinsic functions on iPSC/860. A detailed performance study is presented in 14]. The times in the table include both the computation and communication times for each function. For most of the functions we were able to obtain almost linear speedups. In the case of the TRANSPOSE function, going from one processor to two or four actually results in an increase in the time due to communication requirements. However, for larger size multiprocessors the times decrease, as expected. Arrays may be redistributed across subroutine boundaries. A dummy argument that is distributed di erently from its actual argument in the calling routine is automatically redistributed upon entry to the subroutine by the compiler, and is automatically redistributed back to its original distribution at subroutine exit. These operations are performed by the redistribution primitives which transform from block to cyclic or vice versa.
When a distributed array is passed as an argument to some of the run-time support primitives, it is also necessary to provide information such as its size, distribution among the nodes of the distributed memory machine, etc. All this information is stored into a structure called distributed array descriptor (DAD) 14].

Experimental Results
To illustrate the performance of our compiler, we present b e n c hmark results from four programs and the rst 10 Livermore loop kernels. Gauss solves a system of linear equations with partial pivoting. Nbody program simulates the universe using the algorithm in 16]. Option program predicts the stock option pricing using stochastic volatility European model. Piprogram calculates the value of , using numerical integration. The Livermore kernels are 24 loops abstracted from actual production codes that have been widely used to evaluate the performance of various computer systems. Data for all programs were block distributed and were written outside of the compiler group at NPAC b y experienced message passing programmers. Tables 4 and 5 show the performance of compiler generated codes (F90D=HPF) and handwritten f77+ MP code. The tables contain data from running these programs with a varying number of processors on Intel iPSC/860. The compiler generated codes and hand-written codes use the Express message passing library. Timings were taken using extime() function having an accuracy of one microsecond. The programs were compiled by using Parasoft Express Fortran compiler, which calls Portland Group if77 release 4.0 compiler with all optimizations turned on (-O4).
We observe that the performance of the compiler generated codes are usually within a factor of 2 of the hand-written codes. This is due to the fact that an experienced programmer can incorporate more optimizations than our compiler currently does. For example, a programmer can combine or eliminate some of the communication or some of the intra-processor temporary copying. The compiler uses a more generic packing routine, whereas a programmer can combine communication for the same source and destination for di erent arrays. Another observation is that our run-time system shift routine is slower than the programmer's shift routines. 8 Summary of Related Work Callahan and Kennedy 24] proposed distributed-memory compilation techniques based on datadependence driven program transformations. These techniques were implemented in a prototype compiler in the ParaScope programming environment. Currently, a F ortran 77D compiler is being developed at Rice 25,26]. The Fortran 77D compiler introduces and classi es a number of advanced optimizations needed to achieve acceptable performance they are analyzed and empirically evaluated for stencil computations. SUPERB 5] is a semi-automatic parallelization tool designed for MIMD distributed-memory machines. It supports arbitrary user-speci ed contiguous rectangular distributions, and performs dependence analysis to guide interactive program transformations. KALI 27,9 ] is the rst compiler system that supports both regular and irregular computations on MIMD machines. KALI requires that the programmer explicitly partition loop iterations onto the processor grid. An inspector/executor strategy is used for run-time preprocessing of the communication for irregularly distributed arrays. Dataparallel C 28, 29] is a variant of the original C* programming language, designed by Thinking Machines Corporation for its Connection Machines processor array. Data parallel C extends C to provide the programmer access to a parallel virtual machine. ARF is a compiler for irregular computations

Summary and Conclusions
Fortran 90D/HPF is a language that incorporates parallel constructs and allows users to specify data distributions. In this paper, we h a ve presented a design for a Fortran 90D/HPF compiler for distributed memory machines. Speci cally, t e c hniques for the processing of distribution directives, computation partitioning, communication detection, and generation were presented. Our design is both e cient and portable. We presented preliminary performance results from our compiler. We believe that the methodology presented in this paper to compile Fortran 90D/HPF can be used by the designers and implementors of the HPF compiler.