Evaluation of High Performance Fortran through Application Kernels

Abstract. Since the definition of the High Performance Fortran (HPF) standard, we have been maintaining a suite of application kernel codes with the aim of using them to evaluate the available compilers. This paper presents the results and conclusions from this study, for sixteen codes, on compilers from IBM, DEC, and the Portland Group Inc. (PGI), and on three machines: a DEC Alphafarm, an IBM SP-2, and a Cray T3D. From this, we hope to show the prospective HPF user that scalable performance is possible with modest effort, yet also where the current weaknesses lie.


Introduction
In this paper, we shall first motivate the use of the High Performance Fortran language as a means of exploiting the parallelism within a program. We shall then clarify the purpose of the NPAC HPF Applications suite, and explain the methodology by which these codes have been benchmarked. In essence, we shall be comparing the performance of the codes against the ideal of a perfectly scaling code with no overhead from the use of the HPF language. Finally, we shall discuss how near or far the compilers and codes tested are from meeting this aggressive standard, and the possible reasons why.

The High Performance Fortran Language
High Performance Fortran is a definition agreed on by vendors and users for exploiting the data parallelism already implicit in the Fortran 90 language. The aim is to provide additional constructs with which the user and compiler can produce a scalable executable, with performance comparable to hand-tuned message-passing code. The principal means by which this is achieved is through the use of compiler directives: statements which a traditional Fortran compiler would ignore as a commented line, but which an HPF compiler would use to ascertain how data arrays are to be distributed and how the code may be executed in parallel [HPFF94]. In addition, HPF has introduced several new features into the Fortran language, the most obvious of which are new intrinsic functions (mostly through the HPF_LIBRARY module) and the FORALL statement/construct for more generalized array expressions than are possible with the Fortran 90 array syntax and WHERE statement/construct. It is interesting to note that at present some of these HPF language features are being considered for inclusion in the forthcoming Fortran 95 standard.
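To illustrate the flavour of these directives, the following is a minimal sketch (the array names and extents are purely illustrative, not taken from the suite):

```fortran
! A block-distributed array updated with a FORALL; the !HPF$ lines
! are comments to a plain Fortran compiler but directives to HPF.
      INTEGER i
      REAL a(1000), b(1000)
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ ALIGN b(:) WITH a(:)
      FORALL (i = 2:999) a(i) = 0.5 * (b(i-1) + b(i+1))
```

Because `a` and `b` are aligned and distributed together, each processor can evaluate its portion of the FORALL with only boundary communication.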
The HPF approach has several strengths as a means of parallelizing existing code [RYHF96]. First of all, the HPF computation model has been defined so as to have a single execution thread. Whilst this does present problems of efficiency in ensuring all non-distributed data are kept identical across all processors, it also means that any changes from HPF statements are by definition benign with respect to the code's behaviour when compiled for one or for many processors. This should be contrasted with the message-passing case, where it is more usual to have separate parallel and serial versions, which the developer is then obliged to maintain individually.
The other major attribute of HPF is that it is a standard, designed and agreed on by major vendors and users. This protection of software investment means that, like the MPI standard for message passing [MPIF94], users can compile the same code for different platforms ranging from a single workstation to a dedicated massively parallel processing machine. Whilst the idea of data-parallel languages has been discussed since the late 1960s [Ric95], HPF is the first, and thus far only, portable standard available for consideration.

The High Performance Fortran Applications Suite
The NPAC HPF Applications (HPFA) suite is a set of programs collected and developed over a number of years to provide feedback on available HPF compilers. From this, one is able to provide quantitative details on the strengths and weaknesses of these compilers.
All the codes in the HPFA kernel suite benchmarked here have one of two origins: they were either ports from existing Fortran programs, or they were written from scratch. Where the codes were originally Fortran 77, this usually required extensive rewriting to make use of the HPF array data-parallelism syntax. However, where the codes originated from implementations for machines such as the MasPar or the Thinking Machines CM series, the work required was usually a simple one-to-one replacement of function names or language features.
The codes benchmarked for this paper exploit a range of HPF language features and intrinsic functions to express their parallelism; each code is typically under 500 lines in length. The distribution of the arrays for most of these problems is along one dimension, and typically block or cyclic-1. Where distributed arrays are passed into subroutines, descriptive mapping is used to assure the compiler of the correct data distribution. The reasons for these somewhat conservative decisions are largely historical, dating from when it was felt that complete and efficient HPF implementations would not be immediately available. In a similar vein, whilst the intent is to cover as wide a range of different applications as is feasible, a balance had to be struck between using codes which could be parallelized, and which would fit into the present HPF regular-data framework.
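As an example of the descriptive mapping just mentioned, a subroutine receiving an already-distributed array might be declared as sketched below (the routine and array names are hypothetical):

```fortran
! The asterisk form of DISTRIBUTE asserts that the incoming array
! is already distributed BLOCK-wise, so the compiler need not
! generate remapping code at the call boundary.
      SUBROUTINE relax(x, n)
      INTEGER n
      REAL x(n)
!HPF$ DISTRIBUTE x *(BLOCK)
      x = 0.5 * (CSHIFT(x, 1) + CSHIFT(x, -1))
      END
```

Descriptive mapping is the conservative choice: it documents the caller's distribution rather than forcing a redistribution on entry.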
These codes are also available from the NPAC website, for use by anyone to test their HPF implementation [NPA96].

Compilers and Platforms
For the benchmarking, the following compiler and platform configurations were available, executing on 1, 2, 4, and 8 processors.
- Portland Group Inc. PGHPF v2.1-1 compiler on an IBM SP-2, installed July 1996. This is a largely complete HPF implementation, and none of the missing features had any impact on the HPFA codes.
- IBM XLHPF v1.1 compiler on an IBM SP-2, installed March 1996. This is an implementation of the subset specification of HPF, plus some other features.
- DEC Fortran 90/POE v4.0 compiler on a DEC Alphafarm connected via a Gigaswitch, installed in February 1996. This is a full implementation of the HPF language, albeit with certain parallelism features disabled.
- PGI PGHPF v2.1 compiler on a Cray T3D, installed June 1996. As with the IBM SP-2 installation, this is a largely complete HPF implementation.
In addition, Fortran 90 single-processor runs were made so as to ascertain the additional overhead of using HPF on each of these machines:
- PGI PGHPF on the IBM SP-2 without the `-pghpf' execution flag, for comparisons with the PGI PGHPF IBM SP-2 runs.
- IBM XLF90, for comparisons with the IBM XLHPF runs.
- DEC Fortran 90 without the `-wsf' parallel software environment compiler flag, for comparisons with the DEC HPF runs.
All three machines examined are distributed-memory multiprocessor machines. It is generally expected that the HPF language will perform closer to hand-written message-passing codes on shared-memory (or virtual shared-memory, as in the case of the Hewlett-Packard/Convex Exemplar series) architecture machines.
In all cases, at least eight timings were made at each configuration, and the minimum execution times used. These timings refer to the wall-clock execution time, as provided by the Fortran 90 `SYSTEM_CLOCK()' intrinsic.
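The timing idiom is sketched below, assuming only the standard Fortran 90 intrinsic:

```fortran
! Wall-clock timing with SYSTEM_CLOCK; COUNT_RATE converts the
! integer tick difference into seconds.
      INTEGER t0, t1, rate
      REAL elapsed
      CALL SYSTEM_CLOCK(t0, COUNT_RATE=rate)
!     ... code section being timed ...
      CALL SYSTEM_CLOCK(t1)
      elapsed = REAL(t1 - t0) / REAL(rate)
```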
Access to the IBM SP-2 was provided courtesy of the Cornell Theory Center, to the DEC Alphafarm via the Northeast Parallel Architectures Center, and to the Cray T3D via the Edinburgh Parallel Computing Centre.

Benchmark Results
The results presented are an attempt to show each code's behaviour with respect to the number of participating processors. The information we wish to extract is the overhead induced by the use of HPF over that from a Fortran 90 execution on a single processor, and the subsequent scaling in the execution times. In addition, timing calls have been inserted into the codes so as to determine the times spent on purely computational tasks and on combined communication and computation (the latter because it is sometimes impossible to separate the times spent on communication and computation within an HPF program statement or intrinsic function). The graphical profiler from the PGI compiler was used to determine the parts of the code which contain communications, and the observations fed back into the programs by inserting explicit calls to timing routines around the areas of interest. This methodology obviously suffers from extrapolating the PGI implementation to those from the other vendors, but since we have not seen any obvious inaccuracies in the PGI profiler's report on which lines of code depend on communications, we believe this provides a realistic picture.
From these data, it is possible to indirectly gauge the performance of these codes compared to the (usually unobtainable) ideal situation of:
- No difference in execution times between the serial Fortran 90 and the HPF code on one processor.
- Execution times which scale down linearly with the number of processors (i.e., speed-up equal to the processor count).
Within the parallel computing community, the question often asked is how an HPF code compares with a functionally equivalent hand-coded message-passing version. Writing, and presumably optimizing, message-passing calls into the sixteen codes in this study would be the ideal means by which to answer this question. However, this was deemed infeasible within the available timeframe, and instead the results presented here compare the performance with the ideal situation listed above as the metric for `how good' the tested compilers are.
Table 1 shows the speed-up figures for the HPF implementations by PGI and IBM on an IBM SP-2. Speed-up here is defined as the one-processor HPF execution time divided by the execution time taken by a particular configuration. As a guide to the overhead of using HPF, this is also done for the Fortran 90 version of the code.
An identical exercise is performed in table 2 for the HPF implementations by DEC on an eight-workstation Alphafarm, and by PGI on a Cray T3D. It should be noted that the Cray T3D Fortran 90 runs were performed with the Cray Fortran 90 compiler, rather than the PGI product, mainly because, unlike the case with the IBM SP-2, it was not immediately obvious how to `switch off' the HPF features of the PGI compiler on the T3D.
Finally, table 3 gives the wall-clock execution times of the four configurations examined on the sixteen HPFA codes, on a single processor running the HPF code.
Table 1. Speed-up results for the PGI PGHPF and IBM XLHPF/XLF90 compilers on the IBM SP-2. The numbers presented here are speed-ups with respect to the one-processor HPF codes, for the Fortran 90 serial run and the 2, 4, and 8 processor HPF runs; by definition the speed-up for the one-processor HPF runs is `1.0'. The configuration(s) which gave the best speed-up have been highlighted. A dash represents where the HPF compiler was unable to compile the code, whether due to documented limitations or due to unknown compilation errors.
These provide an indication of the spread in execution times amongst the different products.
The following subsections, 4.1–4.4, describe the behaviour of the HPFA codes on the compiler and hardware configurations listed in Section 3.1, as well as elaborating on the results given in tables 1–3.

PGI PGHPF Timings on the IBM SP-2
Of the sixteen codes examined, eight displayed a speed-up of 7.0 or higher at eight processors, with six also having a low Fortran 90 to HPF overhead. Moreover, two of the codes had a speed-up higher than 8.0, due to better cache hit rates with the smaller problem size given to each processor. From these results, one may infer that the following features are implemented well by the PGHPF compiler on the IBM SP-2: INDEPENDENT do loops, WHERE mask operations with no communications, simple near-neighbour CSHIFT() operations, and the SUM(), TRANSPOSE() and SPREAD() functions. On the other hand, three codes performed noticeably badly. The features which appear to have caused problems are: masked CSHIFT() operations with communications, and MATMUL() with communications.
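The near-neighbour CSHIFT() pattern that performed well is of the kind sketched below (a hypothetical four-point stencil, not one of the HPFA codes):

```fortran
! Each CSHIFT here induces only a regular boundary exchange
! between neighbouring processors, a pattern compilers optimize well.
      INTEGER n
      PARAMETER (n = 256)
      REAL u(n,n), unew(n,n)
!HPF$ DISTRIBUTE u(*, BLOCK)
!HPF$ ALIGN unew(:,:) WITH u(:,:)
      unew = 0.25 * (CSHIFT(u, 1, DIM=1) + CSHIFT(u, -1, DIM=1)
     &             + CSHIFT(u, 1, DIM=2) + CSHIFT(u, -1, DIM=2))
```

Adding a mask (e.g., wrapping the shifts in a WHERE over a distributed logical array) turns this into the masked-CSHIFT case which, as noted above, performed badly.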
On the whole, this configuration performs well on codes with little or no interprocessor communication. The exception is the ADI code which, being mostly embarrassingly parallel, should also have scaled well.

IBM XLHPF Timings on the IBM SP-2
Being a subset-HPF implementation, the XLHPF compiler was unable to compile one code because of a missing feature, as used by the Hough transformation code. In addition, it was found that two other codes caused unknown compile-time errors.
Of the remaining thirteen codes, five had a speed-up of 7.0 or better at eight processors, with two codes also having a low Fortran 90 to HPF overhead. The features which the XLHPF compiler implemented well appear to be: WHERE mask operations with no communications, simple near-neighbour CSHIFT(), and the SUM(), TRANSPOSE() and SPREAD() intrinsics. The features which performed badly are: INDEPENDENT do loops which called pure subroutines with array sections, and MATMUL() with interprocessor communications.
As with the PGI PGHPF on the IBM SP-2, the IBM XLHPF compiler appears to perform best on embarrassingly parallel problems; it notably scaled better on the ADI code than the PGHPF compiler. However, it still suffers from being subset HPF, and the implementation of INDEPENDENT do loops is still lacking.

DEC HPF & Fortran 90 Timings on the DEC Alphafarm
DEC was the first vendor to offer a syntactically complete HPF compiler, but on the system benchmarked it had the most disappointing performance. Perhaps the most telling statistic is that, of the sixteen codes, nine had single-processor Fortran 90 timings comparable to or better than the eight-processor HPF runs. Of the other seven codes, three had a speed-up figure above 2.0, with the rest delivering performances comparable to that from a single processor. In mitigation, it should be mentioned that for two of the codes the INDEPENDENT directive does not function with the compiler release which was used.

PGI PGHPF Timings on the Cray T3D
The Cray T3D is generally acknowledged as having a superior communications network to that of the IBM SP-2, in particular with a better latency. However, this does not appear to be reflected in its performance with respect to the SP-2 port: five codes obtained a speed-up of 7.0 or better at eight processors, and these codes are essentially embarrassingly parallel, with little interprocessor communication (although the aforementioned case of the ADI code in Section 4.1 again shows disappointing speed-up). The codes where the IBM SP-2 port bettered the Cray T3D version are precisely those with substantial near-neighbour communications, namely from CSHIFT() operations. On the other hand, this port contains the same features as the SP-2 version, in particular offering users the option of expressing their code's parallelism with the INDEPENDENT do-loop directive.

Discussion
This exercise has demonstrated that today one can write HPF codes which scale well and have acceptable Fortran 90 to HPF overheads. In this context, it would be difficult to envisage a message-passing program outperforming such codes. However, it currently appears that such codes should preferably either have few interprocessor communications, or have them as simple operations, such as CSHIFT() and SPREAD(), which the compiler can easily optimize.
Where the HPFA codes examined did not scale well, this was generally due either to obvious gaps in the implementations (e.g., full INDEPENDENT do loops in the DEC and IBM compilers), or to comparatively complicated communication patterns (e.g., MATMUL() on non-local data, masked CSHIFT() operations). That profilers are now available to pinpoint these problematic parts of the code was of major assistance in this report. However, more information could still be given to the user to optimise their HPF codes: such as when arrays have been remapped, temporary arrays have been created, or computations have been unnecessarily duplicated.
In conclusion, we have demonstrated that present-day compilers of the HPF language are capable of good scalability and low overheads. Nonetheless, it is very easy to construct codes which do not scale well, and for these cases the user must be provided with the information needed to identify and perhaps bypass these bottlenecks. This is more pertinent with HPF programming than with message passing, since with HPF the ease of coding and the freedom to re-express a given computation are much greater.

Table 3. Wall-clock execution times in seconds for the PGI PGHPF compiler on an IBM SP-2, the IBM XLHPF/XLF90 compiler on an IBM SP-2, the DEC HPF compiler on a DEC Alphafarm, and the PGI PGHPF compiler on a Cray T3D, for a single-processor HPF run.