Compiling distribution directives in a Fortran 90D compiler

Data partitioning and mapping is one of the most important steps in writing a parallel program, especially a data parallel one. Recently, Fortran D and, subsequently, High Performance Fortran (HPF) have been proposed to allow users to specify data distributions and alignments for the arrays in their programs. This paper presents the design of the data partitioning module of a Fortran 90D compiler, which processes the alignment and distribution directives.


Introduction
Distributed memory systems solve the memory bottleneck of vector supercomputers by having separate memory for each processor. However, distributed memory systems demand high locality for good performance. Therefore, the distribution of data across processors is of critical importance to the performance of a parallel program in a distributed memory system. The focus of this paper is to describe the design and implementation of the data partitioning module in the Fortran 90D compiler. We discuss how to distribute data and manage computations given the data distribution directives. Specifically, we show how the alignment and distribution directives can be systematically processed to produce efficient code. Details of other modules can be found in [1, 2].
Fortran D provides users with explicit control over data partitioning with data alignment and distribution specifications [3]. The distribution directives can be used with Fortran 77 or Fortran 90. In this paper we consider Fortran 90. Fortran D has three compiler directives: DECOMPOSITION, DISTRIBUTE, and ALIGN.
The DECOMPOSITION directive is used to declare the name, dimensionality, and size of each problem domain. A decomposition is simply an abstract problem or index domain. We call it a "template" (the name "template" has been chosen to describe "DECOMPOSITION" in HPF [4]). Arrays in a program are mapped to templates using the ALIGN directive. There may be multiple templates representing different problem mappings, but an array may be aligned to only one template at any point in time. All scalars are replicated. An array not explicitly aligned to any template serves as its own template. The DISTRIBUTE directive specifies the mapping of the template onto a logical processor grid. Each dimension of the template is distributed in a block, cyclic or irregular manner; the symbol "*" marks dimensions that are not distributed (i.e., collapsed or replicated). The selected distribution can affect the ability of the compiler to minimize communication and load imbalance in the resulting program.

Design Methodology
The Fortran 90D compiler maps arrays to physical processors by using a three-stage mapping, as shown in Figure 1. This three-stage mapping has also been proposed in HPF [4].
Stage 1: ALIGN directives are processed to compute functions that map the array index domain to the template index domain and vice versa. Also, the local shape of the arrays is determined.
Stage 2: Each dimension of the template is mapped onto the logical processor grid based on the DISTRIBUTE directive.
Stage 3: The logical processor grid is mapped onto the physical system. This mapping can change from one system to another, but the data mapping onto the logical processor grid does not need to change. This enhances portability across a large number of architectures.
By performing the above three stage mapping, the compiler is decoupled from the specifics of a given machine or configuration.
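As a rough illustration of how the three stages compose, the following Python sketch locates one element of a 1-D array. The function names, the choice of a cyclic distribution in stage 2, and the node table used in stage 3 are assumptions made for this example; they are not the compiler's actual interfaces.

```python
def align(array_index, stride=1, offset=0):
    """Stage 1 (assumed linear alignment): array index -> template index."""
    return stride * array_index + offset


def distribute_cyclic(template_index, nprocs):
    """Stage 2 (cyclic, for illustration): template index -> (logical proc, local index)."""
    return template_index % nprocs, template_index // nprocs


def to_physical(logical_proc, node_table):
    """Stage 3: logical processor -> physical node via a machine-specific table."""
    return node_table[logical_proc]


def locate(array_index, nprocs, node_table, stride=1, offset=0):
    """Compose the three stages for a single array element."""
    t = align(array_index, stride, offset)
    p, i = distribute_cyclic(t, nprocs)
    return to_physical(p, node_table), i


if __name__ == "__main__":
    # Same program, two hypothetical machines: only the node table differs.
    print(locate(5, 4, [0, 1, 2, 3], offset=1))
    print(locate(5, 4, [0, 1, 3, 2], offset=1))
```

Only the node table changes between the two calls, which mirrors the point that stages 1 and 2 are independent of the target machine.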

Compiling the ALIGN Directive (Stage 1)
Alignment of data arrays to templates is specified by the ALIGN directives. In this section, we describe how the ALIGN directive is processed in our compiler.
Alignment determines which portions of two or more arrays will be in the same processor for a particular data partitioning. Clearly, if arrays involved in the same computation are aligned in such a manner that, after distribution, their respective sections lie on the same processors, then the number of non-local accesses is reduced.
Alignment is a relation that specifies a one-to-one correspondence between elements of a pair of array objects. The template is defined by a DECOMPOSITION directive with its shape and rank. Let A be an m-dimensional array and TEMPL be an n-dimensional template. The general form of the alignment directive is

C$ ALIGN A(i_1[*], ..., i_m[*]) WITH TEMPL(f_1(i_{u_1})[*], ..., f_n(i_{u_n})[*])

The exhibited elements of A are aligned to those of TEMPL. The template is eventually distributed on a set of processors. The compiler guarantees that the array elements aligned to the same element of the template will be mapped to the same processor.
The Fortran 90D compiler requires that each of A's subscripts i_1, ..., i_m appear exactly once on the right-hand side of the relation, so that a one-to-one correspondence with a section of TEMPL is established. This restriction does not permit skew alignments such as aligning A(I) with TEMPL(I, I) or A(I, J) with TEMPL(I + J). The order of the axes in the array may differ from the order of the axes in the template (not necessarily i_k = i_{u_k}). This permits transpose-style alignments such as aligning A(I, J) with TEMPL(J, I). The symbol "*" shows the replication or collapse of the corresponding dimension. It may appear in both the array and the template subscripts. The array rank (the number of dimensions) m may differ from the rank of the template, n. For example, the directive

C$ ALIGN A(i,*) WITH TEMPL(i+1)

requests that the second dimension of the array A be collapsed, while the directive

C$ ALIGN A(i) WITH TEMPL(*,i+1)

forces replication of array A along the first dimension of the template TEMPL.
The alignment function f_k is required to be a linear function, f_k(i_{u_k}) = s_k i_{u_k} + o_k. The parameters i_{u_k}, s_k, and o_k correspond to the three components of the alignment function: axis, stride, and offset. Misalignment in the axis or stride components causes irregular communication, and misalignment in the offset component causes nearest-neighbor communication [5].
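The axis, stride, and offset components can be made concrete with a small Python sketch. It assumes each template dimension is described either by a triple (axis, stride, offset) or by the string "*" for replication; this representation and the function names are hypothetical and chosen only for illustration.

```python
def make_alignment(stride, offset):
    """Return (forward, inverse) for the linear function f(i) = stride * i + offset."""
    if stride == 0:
        raise ValueError("stride must be non-zero for a one-to-one alignment")

    def forward(i):
        return stride * i + offset

    def inverse(t):
        # None means no array element is aligned to template index t.
        q, r = divmod(t - offset, stride)
        return q if r == 0 else None

    return forward, inverse


def template_index(array_index, spec):
    """Map an array index tuple to a template index tuple.

    spec has one entry per template dimension: (axis, stride, offset), or "*"
    when the array is replicated along that template dimension. Array axes
    that no entry refers to are collapsed.
    """
    out = []
    for entry in spec:
        if entry == "*":
            out.append("*")
        else:
            axis, stride, offset = entry
            out.append(stride * array_index[axis] + offset)
    return tuple(out)


if __name__ == "__main__":
    # Transpose-style alignment: A(I, J) with TEMPL(J, I).
    print(template_index((3, 7), [(1, 1, 0), (0, 1, 0)]))
    # Collapsed second dimension: A(i, *) with TEMPL(i + 1).
    print(template_index((3, 7), [(0, 1, 1)]))
    # Replication along the first template dimension: A(i) with TEMPL(*, i + 1).
    print(template_index((3,), ["*", (0, 1, 1)]))
```

The inverse returned by make_alignment corresponds to the "vice versa" mapping of stage 1: it recovers the array index from a template index when one exists.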

Data Distribution (Stage 2)
In this section, we describe how the Fortran 90D compiler distributes the template on the logical processor grid (Figure 1). In this phase, the compiler uses information provided by the DISTRIBUTE directives.
The DISTRIBUTE directive assigns an attribute to each dimension of the template. Each attribute describes the mapping of the data in that dimension of the template onto the logical processor grid. For example, one dimension of a template may be given the BLOCK attribute and another the CYCLIC attribute.

The inverse distribution function μ^{-1}(p, i, P, N) → I transforms the local index i in processor p back into the global index I.
The term global index will be used to refer to the index of a data item within the global array (global name space) while the term local index will denote the index of a data item within a logical processor.
The BLOCK attribute indicates that blocks of global indices are mapped to the same processor. The block size depends on the size of the template dimension, N, and the number of processors, P, on which that dimension is distributed. This results in a very simple and efficient distribution function, as shown in the first column of Table 1. The CYCLIC attribute indicates that global indices of the template in the specified dimension should be assigned to the logical processors in a round-robin fashion. The last column of Table 1 shows the CYCLIC distribution functions. This also yields an optimal static load balance, since the first N mod P processors get ⌈N/P⌉ elements and the rest get ⌊N/P⌋ elements.
In addition, these distribution functions are efficient and simple to compute. Although cyclic distribution functions provide a good static load balance, the locality is worse than with block distributions because cyclic distributions scatter data.
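The index conversions described above can be sketched in Python as follows. The cyclic functions follow the formulas p = I mod P and I = iP + p given in Table 1. The block functions assume a block size of ⌈N/P⌉, which is a common convention but is an assumption here, since it is not stated in the text. All function names are hypothetical.

```python
def cyclic_to_local(I, P, N):
    """Global index I -> (owner p, local index i), round-robin over P processors."""
    return I % P, I // P


def cyclic_to_global(p, i, P, N):
    """Inverse of cyclic_to_local: I = i * P + p."""
    return i * P + p


def cyclic_cardinality(p, P, N):
    """Number of elements owned by p: the first N mod P processors get one extra."""
    return N // P + (1 if p < N % P else 0)


def block_size(P, N):
    """Assumed block size: ceil(N / P)."""
    return -(-N // P)


def block_to_local(I, P, N):
    """Global index I -> (owner p, local index i) for contiguous blocks."""
    b = block_size(P, N)
    return I // b, I % b


def block_to_global(p, i, P, N):
    """Inverse of block_to_local."""
    return p * block_size(P, N) + i


def block_cardinality(p, P, N):
    """Number of elements owned by p; trailing processors may hold fewer."""
    b = block_size(P, N)
    return max(0, min(b, N - p * b))


if __name__ == "__main__":
    N, P = 10, 4
    print([cyclic_cardinality(p, P, N) for p in range(P)])
    print([block_cardinality(p, P, N) for p in range(P)])
```

With N = 10 and P = 4, the cyclic cardinalities differ by at most one element across processors, whereas under the assumed block size the last processor holds a short block.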

Grid Mapping Functions (Stage 3)
There are several advantages of decoupling logical processors from physical system configurations. These advantages include locality, portability and grouping.
Locality: Multiple accesses to consecutive memory locations are called spatial locality. Spatial locality is very important for distributed memory machines. Arrays representing spatial locations are distributed across the parallel computer. For instance, it makes sense to have data distributed in such a way that processors that need to communicate frequently are neighbors in the hardware topology. It has been shown that this is extremely important in common regular problems in scientific applications, such as relaxation [6]. Our template is a d-dimensional mesh. If this template is BLOCK distributed on a d-dimensional grid of processors, neighboring array elements (spatial locality) will be in neighboring processors. The grid topology is therefore a very good topology for spatial locality. Fortran 90D makes the logical processor topology a grid whose number of dimensions matches that of the template.
Portability: The physical topology of the system may be a grid, a tree, a hypercube, etc. The mapping for the best (possible) grid topology changes from one physical topology to another. To enhance the portability of our compiler, we separate the physical and logical topologies. Therefore, porting the compiler from one hardware platform to another involves changing only the functions that map the logical grid topology to the target hardware.
Grouping: Operations on a subset of dimensions in arrays are very common in scientific programming, e.g., row and column operations on matrices. Fortran 90 provides intrinsic functions such as SPREAD, SUM, MAXVAL and CSHIFT that let a user specify operations along different dimensions by specifying the DIM parameter. These dimensional operations conceptually group elements in the same dimension. The dimensional array operations result in "dimensional array communications". We have designed a set of collective communication routines that operate along one or more dimensions (groups of processors) of the grid. For example, we have developed spread (broadcast along a dimension), shift along dimensions, and concatenate communications. The usage of these primitives is discussed in [2].
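The portability and grouping points can be illustrated with a short Python sketch. It assumes a row-major numbering of the logical grid and, as one possible stage 3 mapping for a hypercube, a binary reflected Gray code; neither choice is taken from the paper, and all names are hypothetical.

```python
from itertools import product


def coords_to_rank(coords, shape):
    """Row-major linearization of logical grid coordinates (assumed numbering)."""
    rank = 0
    for c, n in zip(coords, shape):
        rank = rank * n + c
    return rank


def rank_to_coords(rank, shape):
    """Inverse of coords_to_rank."""
    coords = []
    for n in reversed(shape):
        rank, c = divmod(rank, n)
        coords.append(c)
    return tuple(reversed(coords))


def groups_along(dim, shape):
    """Partition logical ranks into groups whose members differ only in coordinate dim."""
    others = [range(n) for k, n in enumerate(shape) if k != dim]
    groups = []
    for fixed in product(*others):
        group = []
        for c in range(shape[dim]):
            coords = list(fixed)
            coords.insert(dim, c)
            group.append(coords_to_rank(coords, shape))
        groups.append(group)
    return groups


def gray(rank):
    """Binary reflected Gray code: one way to place a 1-D logical grid on a hypercube."""
    return rank ^ (rank >> 1)


if __name__ == "__main__":
    shape = (2, 3)
    print(groups_along(0, shape))  # processors that cooperate on a DIM=1 operation
    print(groups_along(1, shape))  # processors that cooperate on a DIM=2 operation
    print([gray(r) for r in range(8)])
```

A collective routine that operates along one grid dimension would run independently within each group returned by groups_along. Swapping gray for another function is the kind of change that porting to a different physical topology would require, while the logical grid code stays the same.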

An Experiment with Distributions
A data distribution that minimizes the communication requirements for one application does not necessarily do the same for another application, since different applications may have different reference patterns.
An advantage of being able to specify different distribution directives is the ability to experiment with various distributions without extensive recoding. Hence, with a few experiments, a user can choose one of the best distributions for his/her application. We illustrate this using the example of a 1-D FFT.
FFT is a difficult algorithm to compile efficiently because of the butterfly communication that an efficient implementation requires. Automatic recognition of such patterns is difficult. Our compiler uses unstructured communication to implement such communication patterns.
We use FFT to illustrate the difference in performance when data distribution is changed.
Figure 2 shows the performance of the FFT for block and cyclic distributions. Cyclic distribution performs better for two main reasons. First, it distributes data cyclically, and hence far-off elements are stored in closer processors. This reduces the communication requirements compared to block distribution [7]. Second, for unstructured communication, the destination processors and locations must be calculated using the distribution functions. The overhead of computing these functions is lowest for cyclic distribution, resulting in less overhead in the generated FFT code.

Figure 1: Three stage array mapping.

The directive C$ DISTRIBUTE TEMPL(BLOCK, CYCLIC) distributes the template TEMPL blockwise in the first dimension and cyclically in the second dimension. A Fortran 90D program is written in the global name space. Therefore, the array and template indices refer to indices in the global name space. Parallelizing the program onto a distributed memory machine requires mapping a global index onto the pair (processor number, local index), because on a distributed memory machine each node has a separate name space. For the above index transformations, we define data-distribution functions (index-conversion functions) as given in Definition 1 below.

Definition 1: A data-distribution function μ for each dimension of the template maps three integers, μ(I, P, N) → (p, i), where I is the global index, 0 ≤ I < N, P is the number of processors, and N is the size of the global index space. The pair (p, i) represents the processor p (0 ≤ p < P) and the local index i on p.

Table 1: Data distribution functions (refer to Definition 1). N is the size of the global index space. P is the number of processors. N and P are known at compile time and N ≥ P. I is the global index, i is the local index, and p is the owner of that local index i.

                    Block-distribution    Cyclic-distribution
global to proc      ...                   p = I mod P
local to global     ...                   I = iP + p
cardinality         ...                   ...

Figure 2: Comparison of Fortran 90D compiler generated BLOCK and CYCLIC distributions of 1-dimensional FFT codes on a 16-node Intel iPSC/860 (time in seconds).
[2] Z. Bozkus et al. Compiling the FORALL statement on MIMD parallel computers. Technical Report SCCS-389, Northeast Parallel Architectures Center, July 1992.