PCRC-based HPF compilation


Introduction
HPF has been around for a while [1]. Some early expectations, such as efficient and robust compilers arriving on the market within a few years of the specification's release, have not been fully realized, but we do see HPF compilers from PGI, IBM, and DEC. A number of other companies, including SUN, have HPF compiler groups. In any case, HPF provides an excellent context for research and development on parallel compilation systems, because it coherently embodies many of the issues and concepts that have emerged over several years concerning parallel processing on distributed memory machines.
One of the motivations for HPF was that it is just too difficult to build a parallelizing compiler that produces efficient code from raw FORTRAN applications. HPF asks the application programmer to help the compiler. Experience to date shows that, even with this help, it remains non-trivial to build a full-featured and high-performance compiler for a language as complicated as HPF. This is especially true if the node program generated by the compiler must be directly concerned with low-level communication issues, at the level of calls to send() and receive() operations. Thus, PCRC emphasizes the use of a runtime library [2]. The node program will be implemented in terms of higher-level operations, which are more easily generated by a compiler and more easily understood by humans. Of course, performance remains an issue, since this implies the compiler must relinquish some opportunities for global optimization. It is unclear how much performance is actually gained from those global optimizations.
As part of the PCRC effort, an HPF compiler is being constructed based on this approach. Two aspects are being addressed through this effort. One is to evaluate the effectiveness of the PCRC-based approach to parallel compiler construction; the other is to see the performance profile in comparison with "lower level" approaches.
At present, the system has been partially constructed. As we will illustrate, the runtime-based approach allows us to attack a full range of issues encountered in real-world compiler construction. For instance, the compilation of procedure calls is implemented as a first priority, which is unusual in academic compiler work.
This paper describes the design and implementation of our system. In particular, the approaches to issues such as directive analysis and communication detection are discussed in detail. Section 2 provides an overview of the system architecture and some global considerations. Section 3 describes some of the key technologies. Section 4 puts everything together, and illustrates how the compilation is done with some examples of generated code. Section 5 gives a few results from a benchmark comparison.

PCRC runtime
The architecture of the NPAC PCRC runtime is discussed in section 3.2. It basically consists of three groups of functions. One is distributed data management; the second is various data movement routines; the third is computational functions corresponding to HPF intrinsic functions. The library is implemented in C++, and provides a Fortran interface to the compiler. Section 4 gives a flavor of the Fortran interface.

HPFfe
HPFfe is a compiler front-end for High Performance Fortran Version 1.0. Its main thrust is its complete coverage of HPF 1.0 syntax and most of the compile-time checkable semantics. As a result, Fortran 90 is fully covered. Besides the syntax and semantics modules, a class library extended from Sage++ [7] is incorporated in the front-end, which allows us to write transformations effectively. For a more detailed description of HPFfe, the reader is referred to [3] or Chapter 10 of [6].

Transformation modules
The compilation can be divided into two major phases: a program analysis phase and a program transformation phase. In the first phase, the compiler uses the available information to detect what kind of communication pattern is needed in the program. The second phase carries out the actual transformation, according to the record from the first phase, to generate the node program. It can be subdivided into two parts: program format transformation and node program generation. These modules will be discussed further in section 4.

Key technologies
We describe three technologies employed in our compilation system, which are essential both to the compiler construction work and to the performance of the generated code. They are the distributed data descriptor, the NPAC runtime kernel, and the communication detection algorithm. Other methods used in handling various issues of the compilation will be illustrated in section 4 as we present a complete node program generated by the compiler.

Distributed data descriptor
Explicit array data distribution is a core concept of HPF. It frees the compiler from the task of data partitioning. Data distribution directives, such as ALIGN and DISTRIBUTE, provide a convenient way to describe how arrays in a global address (index) space are distributed among the processors of a distributed memory machine. An effective mechanism for telling the node program about the data distribution is a key to effective compiler construction and runtime function implementation. We employ the notion of a distributed data descriptor, or DAD, for this purpose. Similar mechanisms are also used in other compilers (such as the PGI compiler, the shpf compiler, and the previous NPAC F90D compiler), but actual designs differ considerably. Our experience has shown that designing an effective DAD is non-trivial if it has to support various data distributions (such as block-cyclic, collapsed, replicated, etc.) and various dynamics of a distributed array during the course of a program execution (such as rank-reduced sectioning, passing to a subroutine, etc.) while still retaining runtime efficiency.
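As a rough illustration of what such a descriptor has to encode, the following Python sketch maps a global index to an owning processor and a local index for the BLOCK and CYCLIC(k) formats. It is a minimal sketch under stated assumptions, not the PCRC implementation: the function names are hypothetical, indices are 0-based, and the alignment is assumed to be the identity.

```python
def block_owner(g, n, p):
    """Map 0-based global index g of an n-element dimension, BLOCK-distributed
    over p processors, to (processor, local index)."""
    b = -(-n // p)               # block size = ceil(n / p)
    return g // b, g % b


def cyclic_owner(g, p, k=1):
    """Same mapping for CYCLIC(k): blocks of k elements are dealt to the
    p processors round-robin. k = 1 is plain CYCLIC."""
    blk = g // k                 # index of the k-sized block containing g
    return blk % p, (blk // p) * k + g % k


# Example: element 13 of a 16-element array on 4 processors.
print(block_owner(13, 16, 4))
print(cyclic_owner(13, 4))
print(cyclic_owner(13, 4, k=2))
```

A real descriptor must additionally carry alignment strides and offsets, collapsed and replicated dimensions, and section information, which is where most of the design difficulty described above comes from.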
A notional tabular representation of the DAD is given in figure 2. This picture gives a feel for the information held in the actual array descriptor, although

Runtime kernel
The kernel of the NPAC library is a C++ class library. It is descended from the run-time library of an earlier research implementation of HPF [5], with influences from the Fortran 90D run-time and the CHAOS/PARTI libraries. The kernel is currently implemented on top of MPI. The library design is solidly object-oriented, but efficiency is maintained as a primary goal. Inlining is used extensively, and dynamic memory allocation, unnecessary copying, true procedure calls, virtual functions and other forms of indirection are generally avoided unless they have clear organizational or efficiency advantages.
The overall architecture of the library is illustrated in figure 3. At the top level there are several compiler-specific interfaces to a common run-time kernel. The four interfaces shown in the figure are illustrative. They include two different Fortran interfaces (used by different HPF compilers), a user-level
The largest part of the kernel is concerned with global communication and arithmetic operations on distributed arrays. These are represented on the right-hand side of figure 3. The communication operations supported include HPF/F90 array intrinsic operations such as CSHIFT; the function pcrc_write_halo, which updates ghost areas of a distributed array; the function remap, which is equivalent to a Fortran 90 array assignment between a conforming pair of sections of two arbitrarily distributed HPF arrays; and various gather- and scatter-type operations allowing irregular patterns of data access. Arithmetic operations supported include all F95 array reduction and matrix arithmetic operations, and HPF combining scatter. A complete set of HPF standard library functions is under development.
Nearly all these operations (including many of the arithmetic operations) are based on reusable schedules, in the PARTI/CHAOS mold. As well as supporting the inspector-executor compilation strategy, this organization is convenient in an object-oriented setting: a communication pattern becomes an object. As an illustration, consider the reduction operations. All reductions from a distributed array to a global result are described by an abstract base class using virtual functions for local block reductions. Specific instances such as SUM or PRODUCT are created by deriving concrete classes that instantiate the arithmetic virtual functions. This is a cleaner and more type-secure (hence, potentially, more efficiently compilable) alternative to passing function pointers to a generic reduction function.
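The class structure described for reductions can be sketched as follows. This is only an illustration in Python rather than the library's C++: the class and method names are hypothetical, and the distributed array is simulated as a list of per-process local blocks, with the cross-process combination done by a plain loop instead of message passing.

```python
from abc import ABC, abstractmethod


class Reduction(ABC):
    """Abstract reduction from a distributed array to a single global result."""

    @abstractmethod
    def identity(self):
        """Neutral element of the operation."""

    @abstractmethod
    def combine(self, a, b):
        """Arithmetic step supplied by each concrete reduction."""

    def reduce_block(self, block):
        # Local block reduction, expressed through the overridden arithmetic.
        acc = self.identity()
        for value in block:
            acc = self.combine(acc, value)
        return acc

    def execute(self, local_blocks):
        # Each simulated process reduces its own block, then the partial
        # results are combined (a real runtime would communicate here).
        partials = [self.reduce_block(block) for block in local_blocks]
        return self.reduce_block(partials)


class Sum(Reduction):
    def identity(self):
        return 0

    def combine(self, a, b):
        return a + b


class Product(Reduction):
    def identity(self):
        return 1

    def combine(self, a, b):
        return a * b


blocks = [[1, 2], [3], [4, 5, 6]]   # three processes' local data
print(Sum().execute(blocks))        # 21
print(Product().execute(blocks))    # 720
```

Only the arithmetic differs between Sum and Product; the traversal and combination logic lives once in the base class, which is the point of the design described above.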
For regular data movement operations a schedule consists of lists of source and destination blocks for local copies or send or receive operations. A block is defined as a multi-dimensional local array section parametrized by an offset and two short vectors of extents and strides. Where blocks are non-contiguous due to striding, or several blocks need to be communicated between the same pair of processors to execute a schedule, data is agglomerated by copying from user space to a buffer before sending, and copied back after receiving.
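The block representation and the agglomeration step can be illustrated with a small Python sketch. The helper names are hypothetical and local memory is modelled as a flat list; it only shows how an offset plus extent and stride vectors enumerate a possibly non-contiguous section, and how that section is packed into a contiguous buffer and unpacked again.

```python
from itertools import product


def block_indices(offset, extents, strides):
    """Yield flat indices of a block: offset + sum(i_d * strides[d])."""
    for idx in product(*(range(e) for e in extents)):
        yield offset + sum(i * s for i, s in zip(idx, strides))


def pack(local, offset, extents, strides):
    """Copy a (possibly strided) block from user space into a send buffer."""
    return [local[k] for k in block_indices(offset, extents, strides)]


def unpack(buf, local, offset, extents, strides):
    """Copy a received buffer back into the destination block."""
    for k, value in zip(block_indices(offset, extents, strides), buf):
        local[k] = value


# A 3x4 local array stored row by row; take columns 1 and 3 of every row.
src = list(range(12))
buf = pack(src, offset=1, extents=(3, 2), strides=(4, 2))
dst = [0] * 12
unpack(buf, dst, offset=1, extents=(3, 2), strides=(4, 2))
print(buf)   # [1, 3, 5, 7, 9, 11]
```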
All the data movement schedules are dependent on the infrastructure on the left-hand side of figure 3. This provides the distributed array descriptor, and basic support for traversing distributed data ("distributed control"). Important substructures in the array descriptor are the range object, which describes the distribution of an array global index over a process dimension, and the group object, which describes the embedding of an array in the active processor set.
At the time of writing the kernel is fully functional and quite mature; two of the four interfaces illustrated are complete, and the others are in progress.

Communication classification and detection
HPF directives release the compiler from the task of choosing the data distribution, and the owner-computes rule (or other heuristics) more or less releases the compiler from computation partitioning. Thus, essentially two pieces of work are left for the compiler to do: communication detection and node program generation.
Taking the following array assignment as example,
In general the conditions for no communication may be non-trivial to compute. In our scheme, no communication is assumed if the conditions defined below for shift communication hold, but with a shift amount of zero (a sufficient but not exhaustive test). Shift communication implies that communication is needed, but a shift along the array's template is adequate to move corresponding elements onto the same processor. For instance,
The condition for shift communication is based on the concept of shift-homomorphism. Consider the fragment of HPF in figure 4. Assume $t_x$ and $t_y$ are normalized to be multiples of $p$. The array sections in the assignment are shift-homomorphic if they have the same extent (number of elements) and

$$\frac{a_x \cdot x_s}{a_y \cdot y_s} = \frac{t_x}{t_y} \qquad (1)$$

(a different definition applies if both templates are cyclically distributed).
If this condition holds, the section assignment can be implemented by shifting the values of X along template TY, then performing a local copy. We omit the proof of this claim and the formula for the shift amount.
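The test itself is cheap to evaluate at compile time. The Python sketch below checks it under two assumptions that are not confirmed by the surviving text: that condition (1) reads a_x·x_s / (a_y·y_s) = t_x / t_y, and that a_x and a_y denote the alignment strides of X and Y (figure 4 is only partly legible). The function names are hypothetical, strides are assumed positive, and the shift amount is not computed because its formula is not given.

```python
from fractions import Fraction


def section_extent(lo, hi, step):
    """Number of elements in the Fortran triplet lo:hi:step (step > 0)."""
    return max(0, (hi - lo) // step + 1)


def shift_homomorphic(x_sec, y_sec, a_x, a_y, t_x, t_y):
    """True if X(x_sec) = Y(y_sec) satisfies the assumed form of condition (1).

    x_sec and y_sec are (lower, upper, stride) triplets; a_x and a_y are the
    assumed alignment strides; t_x and t_y are the template extents."""
    same_extent = section_extent(*x_sec) == section_extent(*y_sec)
    x_s, y_s = x_sec[2], y_sec[2]
    # Exact rational comparison avoids floating-point rounding.
    return same_extent and Fraction(a_x * x_s, a_y * y_s) == Fraction(t_x, t_y)


# X(1:9:2) = Y(2:14:3), with X(i) aligned to T(3*i-1), Y(i) to T(2*i+1), T(48).
print(shift_homomorphic((1, 9, 2), (2, 14, 3), a_x=3, a_y=2, t_x=48, t_y=48))  # True
# Changing the Y stride breaks the ratio, so this falls through to remap.
print(shift_homomorphic((1, 9, 2), (2, 10, 2), a_x=3, a_y=2, t_x=48, t_y=48))  # False
```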
Remap communication is the final catch-all: the bag into which all other section assignments are put.
Appropriate functions are provided in the PCRC runtime to support the three situations. For instance, a pcrc_write_halo() function is designed to deal efficiently with shift communications, and a pcrc_remap() function is designed to handle remap communications. For a detailed derivation of our communication detection algorithm, the reader is referred to [9] or Chapter 8 of [6]. Section 4 will also give a specific application of the algorithm.

      ...
      X(x_l:x_u:x_s) = Y(y_l:y_u:y_s)
Figure 4: Generic array section assignment

Putting the pieces together
The NPAC compiler is implemented as a translator from HPF to Fortran 77. It focusses on exploitation of explicit forall parallelism in the source HPF program. The transformation modules perform two basic functions, program analysis and transformation. In this section, we describe these modules and give concrete fragments of node code generated by our compiler.

Program analysis
In the program analysis phase, the following items are examined to prepare basic information for the next phase: processor information, including rank and size in each rank; template information, also including rank and size in each rank; distribution information for each template; align information for each distributed array; variable references in each forall statement; and array dummy arguments of procedures. The first four items are obtained from PROCESSORS, TEMPLATE, DISTRIBUTE and ALIGN statements respectively. Their translation in the node program is straightforward: a DAD is generated for each array declaration, as illustrated later in this section. In translating a forall statement into a FORTRAN DO construct to be executed on a sequential machine, the "owner computes" rule is used to assign the computation to each node processor. For example: If A is a non-partitioned array and B is a partitioned array, then a broadcast is needed. If the array is a partitioned one, the communication needed depends on the reference pattern of the forall index. Detection of the communication pattern was discussed in section 3.3.
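For the simplest case, the owner-computes rule amounts to intersecting the forall index range with the range of indices each processor owns. The following Python sketch shows this for a one-dimensional array with BLOCK distribution and identity alignment; the function name is hypothetical and the generated Fortran 77 loop bounds would be computed by the runtime rather than by code like this.

```python
def local_bounds(lo, hi, n, p, q):
    """1-based inclusive global iteration range that processor q (0-based)
    executes for FORALL (i = lo:hi) over an n-element array that is
    BLOCK-distributed on p processors. Returns None if q owns no iteration."""
    b = -(-n // p)                       # block size = ceil(n / p)
    first = max(lo, q * b + 1)           # first owned index inside lo:hi
    last = min(hi, (q + 1) * b, n)       # last owned index inside lo:hi
    return (first, last) if first <= last else None


# FORALL (i = 3:14) on A(16) over 4 processors: each node gets its own DO range.
for q in range(4):
    print(q, local_bounds(3, 14, 16, 4, q))
```

Every iteration is executed by exactly one processor, namely the owner of the left-hand side element, so the per-node ranges partition 3:14.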

Program transformation
From the implementation point of view, most of the transformation needed to deal with each part can be subdivided into two phases: format transformation and node program generation. In format transformation, the components of the actual source program are changed, making them suitable for further processing while keeping the semantics fixed. For example, simple array assignments can be trivially converted to forall statements, and treated as such in the next phase. The language features encountered in the second phase are thus narrowed down. Since the transformation keeps the semantics of the original program unchanged, it is possible to further divide the whole process into different small parts, each of which takes care of a particular issue in format transformation. This method helps us separate the transformation program into different modules, implemented and tested independently.
The program generation phase carries out the actual translation work and generates the node program. Below we will use simple examples to illustrate the translations done for different language components. For simplification, the examples only involve one-dimensional arrays. The scheme introduced here can be generalized to deal with multi-dimensional arrays and array sections. This generalization is implemented in our HPF compiler framework.

Housekeeping: memory management and address translation
There are two memory allocation strategies used in our compiler: dynamically allocate a temporary for each RHS term, or allocate a "ghost area" for arrays that appear in RHS contexts where they need a small shift along the processor grid. The first method is used to handle "remap" communication. When a call to pcrc_remap is needed, a temporary array is allocated with the same alignment and distribution as the LHS target array, and the RHS term is copied to the temporary array. The second method is used to handle "shift" communication efficiently. If the compiler detects the need for a shift, a "ghost area" is added to the RHS array. "Edge" elements are transferred using pcrc_write_halo. This saves the cost of copying a whole array.
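The ghost-area strategy can be illustrated with a toy simulation. The Python sketch below is not the PCRC interface: the function names are hypothetical, the processes are simulated as a list of local arrays, the array length is assumed to divide evenly among them, and no wraparound is performed at the ends of the processor grid. It shows that only the edge elements move, after which a shifted assignment such as X(i) = Y(i+1) becomes a purely local copy.

```python
def scatter_with_ghost(global_data, p, w):
    """Split into p equal blocks, each padded with w ghost cells per side."""
    b = len(global_data) // p
    return [[None] * w + global_data[q * b:(q + 1) * b] + [None] * w
            for q in range(p)]


def write_halo(locals_, w):
    """Fill ghost cells from the edge elements owned by neighbouring blocks."""
    p = len(locals_)
    for q in range(p):
        if q > 0:                         # left ghost <- left neighbour's last w owned
            locals_[q][:w] = locals_[q - 1][-2 * w:-w]
        if q < p - 1:                     # right ghost <- right neighbour's first w owned
            locals_[q][-w:] = locals_[q + 1][w:2 * w]


def shifted_copy(locals_, w, shift=1):
    """Compute X(i) = Y(i + shift) locally (requires 0 <= shift <= w)."""
    result = []
    for loc in locals_:
        b = len(loc) - 2 * w              # number of owned elements
        result.append([loc[w + j + shift] for j in range(b)])
    return result


y = scatter_with_ghost(list(range(8)), p=2, w=1)
write_halo(y, w=1)
print(shifted_copy(y, w=1))   # [[1, 2, 3, 4], [5, 6, 7, None]]
```

The trailing None marks the one element with no source value, since Y(i+1) is undefined past the end of the array.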
As well as memory allocation, the node program must deal with translation between global array subscripts and local (node) subscripts. The run-time provides various functions to help with this translation. The node program linearizes subscript computations for multi-dimensional arrays. Linearization of array segments, in conjunction with use of the DAD inquiry functions provided in the runtime library, is important for implementing transcriptive features of HPF procedures, such as the INHERIT directive. Unnecessary copy-in and copy-out in caller or callee are generally avoided.
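The two steps, global-to-local translation followed by linearization, can be sketched as below. This is an illustrative Python sketch with hypothetical function names, assuming BLOCK distribution, identity alignment, 1-based subscripts and Fortran (column-major) storage; the actual runtime functions are not shown in this paper.

```python
def global_to_local(g, n, p):
    """1-based global subscript -> (owner, 1-based local subscript) for an
    n-element dimension BLOCK-distributed over p processors."""
    b = -(-n // p)                        # block size = ceil(n / p)
    return (g - 1) // b, (g - 1) % b + 1


def linearize(subs, extents):
    """Column-major offset (0-based) of 1-based subscripts in a local array
    with the given local extents."""
    offset, stride = 0, 1
    for s, e in zip(subs, extents):
        offset += (s - 1) * stride
        stride *= e
    return offset


# A(8,6) with the first dimension BLOCK over 2 processors, second collapsed:
# each node holds a local 4 x 6 segment.
owner, li = global_to_local(6, 8, 2)
print(owner, li)                          # 1 2
print(linearize((li, 3), (4, 6)))         # 9
```

Working with a single linear offset means a callee can address an inherited array segment from the descriptor alone, without the data being copied into a fresh local array.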

DAD generation
The compiler must generate code to initialize the distributed array descriptors (DADs) passed to run-time functions and sub-programs. Using the PCRC runtime Fortran interface, DAD initialization is straightforward.
The HPF program
For each processor array, a grp value is created with the appropriate shape. For each template dimension, a rng value is created to record its distribution code, distribution stride and offset. For each partitioned array, a dad value is created to record its shape and its alignment stride and offset; this is the DAD handle for the array. These are all integer handles to runtime objects. At the end of the program, destructors will be called for the created objects.
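The handle-based scheme can be pictured with the following Python sketch. Everything in it is hypothetical: the class names, fields and the Runtime table are invented for illustration and do not reproduce the PCRC Fortran interface. It only shows the idea that the node program holds integer handles while the runtime owns the group, range and descriptor objects.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Group:                  # one processor arrangement
    shape: Tuple[int, ...]


@dataclass
class Range:                  # one template dimension
    extent: int
    dist_code: str            # e.g. "BLOCK" or "CYCLIC"
    group: int                # handle of the Group it is distributed over


@dataclass
class Dad:                    # one partitioned array
    shape: Tuple[int, ...]
    ranges: Tuple[int, ...]   # handles of the Range for each dimension
    align_stride: Tuple[int, ...]
    align_offset: Tuple[int, ...]


class Runtime:
    """Toy object table: create() returns an integer handle."""

    def __init__(self):
        self._objects = []

    def create(self, obj):
        self._objects.append(obj)
        return len(self._objects) - 1

    def get(self, handle):
        return self._objects[handle]

    def destroy_all(self):    # stands in for the destructors at program end
        self._objects.clear()


rt = Runtime()
grp = rt.create(Group(shape=(4,)))                               # PROCESSORS P(4)
rng = rt.create(Range(extent=48, dist_code="BLOCK", group=grp))  # T(48), BLOCK
dad = rt.create(Dad(shape=(16,), ranges=(rng,),
                    align_stride=(3,), align_offset=(-1,)))      # X(i) WITH T(3*i-1)
print(grp, rng, dad)          # 0 1 2
rt.destroy_all()
```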

Expressions and assignment
Some preliminary work has already been done in the format transformation phase, and the major task of this phase is to deal with forall statements and scalar assignments.

Discussion
The PCRC-based HPF compilation system described above has been partially implemented. From this experience, we see the runtime-based approach to compiler construction as a viable methodology in compiler research and development, as well as education. It always emphasizes the "bigger" picture, without getting lost in fine points. Automatic generation of message passing programs from data distribution specifications has been explored for some time in the context of various data parallel languages [10], [11], [12], [13] and [14].
In [13], the support from run-time functions is relatively weak; the compiler needs to generate send and receive primitives to accomplish communication. Though this may yield more efficient code after extensive program analysis, the compiler may become too complicated to be operational.
The most recent paper on HPF compiling was [15], in which a local set enumeration method was used to generate the local part of a loop iteration and derive the communication set. Comparatively speaking, we believe our run-time support method for getting values is more straightforward and efficient, especially for regular access to the array data.
Emphasizing the runtime in compilation system construction is essentially taking a divide-and-conquer philosophy. It allows a complicated system to be cleanly divided into two large pieces. Different people can independently work on different pieces. Once some function is well understood in the runtime, it may be inlined in the compiler-generated code, or used directly by the compiler to improve performance. A rich runtime becomes a valuable infrastructure supporting different compiler constructions. This is the idea of PCRC.

Figure 1: Compilation system overview

System Overview
There are three major components in our system (see figure 1): a full-featured HPF 1.0 front-end, HPFfe [3]; a set of transformation modules; and the PCRC runtime [4].

Figure 2: A representation of the distributed array descriptor

Figure 3: PCRC runtime architecture

needed depends on whether each pair of corresponding elements are in the same processor. Because of the two-level mapping (alignment and distribution) defined in HPF, the answer may not be readily obtainable. Our basic strategy is to classify the communication requirement in an array assignment (the basis for everything else) into three categories, namely, no communication, shift communication, and remap communication. We have developed a theory that lets the compiler detect them. The meaning of no communication is self-evident. Here is a reasonably straightforward example:
      REAL X(16), Y(16)
!HPF$ TEMPLATE T(48)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE T(BLOCK) ONTO P
!HPF$ ALIGN X(i) WITH T(3*i-1)
!HPF$ ALIGN Y(i) WITH T(2*i+1)
      ...
      X(1:9:2) = Y(2:14:3)
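That the assignment X(1:9:2) = Y(2:14:3) needs no communication can be checked by brute force: compose alignment and BLOCK distribution for every pair of corresponding elements and compare owners. The Python sketch below does exactly that; the helper names are hypothetical and the enumeration is for illustration only, since the compiler decides this symbolically rather than element by element.

```python
def owner(t, template_size, nprocs):
    """0-based processor owning 1-based template cell t under BLOCK."""
    block = -(-template_size // nprocs)   # ceil(48 / 4) = 12 cells per processor
    return (t - 1) // block


def triplet(lo, hi, step):
    """Indices selected by the Fortran section lo:hi:step."""
    return list(range(lo, hi + 1, step))


x_idx = triplet(1, 9, 2)       # X elements written: 1, 3, 5, 7, 9
y_idx = triplet(2, 14, 3)      # Y elements read:    2, 5, 8, 11, 14

# ALIGN X(i) WITH T(3*i-1) and ALIGN Y(i) WITH T(2*i+1), T(48) BLOCK onto P(4).
pairs = [(owner(3 * i - 1, 48, 4), owner(2 * j + 1, 48, 4))
         for i, j in zip(x_idx, y_idx)]
print(pairs)                   # [(0, 0), (0, 0), (1, 1), (1, 1), (2, 2)]
print(all(px == py for px, py in pairs))   # True
```

Each Y element sits three template cells to the right of its X partner (cells 5, 11, 17, 23, 29 against 2, 8, 14, 20, 26), yet no pair straddles a block boundary, so every copy is local.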
Both compilers achieve about the same performance on a single node, but generally our compiler exhibits better speedup on multiple processors, presumably due to more effective handling of communication. The synthetic benchmark involves no communication: it is a forall assignment involving large arrays. It suggests that (unlike the PGI compiler) we deal with address translation efficiently, even for the cyclic distribution format. (Speedup is relative to an equivalent sequential program compiled with the IBM Fortran compiler.) While these examples are necessarily selective, in general we find that (on code that both compilers can successfully compile) the NPAC compiler compares very favourably with the commercial compiler.