Interpreting the performance of HPF/Fortran 90D

We present a novel interpretive approach for accurate and cost-effective performance prediction in a high performance computing environment, and describe the design of a source-driven HPF/Fortran 90D performance prediction framework based on this approach. The performance prediction framework has been implemented as part of an HPF/Fortran 90D application development environment. A set of benchmarking kernels and application codes is used to validate the accuracy, utility, usability, and cost-effectiveness of the performance prediction framework. The use of the framework for selecting appropriate compiler directives and for application performance debugging is demonstrated.


Introduction
Although currently available High Performance Computing (HPC) systems possess large computing capabilities, few existing applications are able to fully exploit this potential. The fact remains that the development of efficient application software capable of exploiting available computing potential is non-trivial and is largely governed by the availability of sufficiently high-level languages, tools, and application development environments.
A key factor contributing to the complexity of parallel/distributed software development is the increased degrees of freedom that have to be resolved and tuned in such an environment. Typically, during the course of parallel/distributed software development, the developer is required to select between available algorithms for the particular application; between possible hardware configurations and amongst possible decompositions of the problem onto the selected hardware configuration; between different communication and synchronization strategies; and so on. The set of reasonable alternatives that have to be evaluated is very large, and selecting the best alternative among these is a formidable task. Consequently, evaluation tools form a critical part of any software development environment.

1063-9535/94 $4.00 © 1994 IEEE
In this paper we present a novel interpretive approach for accurate and cost-effective performance prediction in a high performance computing environment, and describe the design of a source-driven HPF¹/Fortran 90D performance prediction framework based on this approach. The interpretive approach defines a comprehensive characterization methodology which abstracts system and application components of the HPC environment. Interpretation techniques are then used to interpret performance of the abstracted application in terms of parameters exported by the abstracted system. System abstraction is performed off-line through a hierarchical decomposition of the computing system. Application abstraction is achieved automatically at compile time. The performance prediction framework has been implemented as a part of the HPF/Fortran 90D application development environment [1] developed at the Northeast Parallel Architectures Center (NPAC), Syracuse University. The environment integrates an HPF/Fortran 90D compiler, a functional interpreter and the source-based performance prediction tool, and is supported by a graphical user interface. The current implementation of the environment framework is targeted to the iPSC/860 hypercube multicomputer system.
A set of benchmarking kernels and application codes is used to validate the accuracy, utility, and usability of the performance prediction framework. The use of this framework for selecting appropriate compiler directives and for application performance debugging is demonstrated. The rest of the paper is organized as follows: Section 2 gives an overview of HPF/Fortran 90D. Section 3 introduces the interpretive performance prediction approach, and the system and application characterization methodologies. Section 4 describes the design of the HPF/Fortran 90D performance prediction framework. Section 5 presents numerical results to validate the approach and the framework. Section 6 discusses some related research. Finally, Section 7 presents some concluding remarks and discusses future extensions to the project.

¹ High Performance Fortran
2 An Overview of HPF/Fortran 90D
High Performance Fortran (HPF) [2] is based on the research language Fortran 90D [3] and provides a minimal set of extensions to Fortran 90 to support the data parallel programming model. Extensions incorporated into HPF/Fortran 90D provide a means for explicit expression of parallelism and data mapping. These extensions include compiler directives, which are used to advise the compiler how data objects should be assigned to processor memories, and new language features such as the forall statement and construct.
HPF adopts a two-level mapping using the PROCESSORS, ALIGN, DISTRIBUTE, and TEMPLATE compiler directives to map data objects to abstract processors. The data objects (typically array elements) are first aligned with an abstract index space called a template. The template is then distributed onto a rectilinear arrangement of abstract processors. The mapping of abstract processors to physical processors is implementation dependent. Data objects not explicitly distributed are mapped according to an implementation-dependent default distribution (e.g. replication). Supported distributions include BLOCK and CYCLIC.
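The effect of the two supported distributions can be illustrated with a minimal sketch. It assumes a one-dimensional template with 0-based indices and a BLOCK size of ceiling(n/p); the function names `block_owner` and `cyclic_owner` are introduced here for illustration only and are not part of the framework.

```python
from math import ceil

def block_owner(i, n, p):
    """Abstract processor owning 0-based index i when n template
    elements are BLOCK-distributed over p processors."""
    block = ceil(n / p)      # assumed block size: ceiling(n / p)
    return i // block

def cyclic_owner(i, p):
    """Abstract processor owning 0-based index i under CYCLIC
    distribution over p processors (round-robin assignment)."""
    return i % p

n, p = 10, 4                 # illustrative sizes, not taken from the paper
block_map = [block_owner(i, n, p) for i in range(n)]
cyclic_map = [cyclic_owner(i, p) for i in range(n)]
print("BLOCK :", block_map)
print("CYCLIC:", cyclic_map)
```

With BLOCK, neighbouring elements stay on the same processor, whereas CYCLIC spreads them round-robin; this difference is what drives the communication pattern of the generated code.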
Our current implementation of the HPF compiler and performance prediction framework supports a formally defined subset of HPF. The term HPF/Fortran 90D is used to refer to this subset.

An Interpretive Approach to Performance Prediction
The essence of the interpretive approach is the application of interpretation techniques to performance prediction through an appropriate characterization of the HPC system and the application. It consists of four modules as follows (see Figure 1):

1. The Systems Module, which abstracts the HPC system into a set of parameters representing its performance.

2. The Application Module, which abstracts the application into a graph of parameterized abstraction units.

3. The Interpretation Engine (or module), which predicts the performance of the application on the HPC system by interpreting the execution costs of the abstracted application in terms of the parameters exported by the abstracted system.

4. The Output Module, which communicates the predicted performance metrics and provides the application developer with the required information at the required granularity.

The four modules are briefly described below. A detailed discussion of the performance interpretation approach can be found in [4].

Systems Module
The systems module abstracts an HPC system by hierarchically decomposing it to form a rooted tree structure called the System Abstraction Graph (SAG). Each node of the SAG is a System Abstraction Unit (SAU) which abstracts a part of the HPC system into a set of parameters representing its performance. An SAU is composed of four components: (1) Processing Component (P), (2) Memory Component (M), (3) Communication/Synchronization Component (C/S), and (4) Input/Output Component (I/O); each component parameterizes relevant characteristics of the associated system unit.
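One way to picture the SAG is as a tree of records, each holding the four component parameter sets. The sketch below is an assumption about representation, not the framework's actual data structure: the class `SAU`, its field names, and the three-level shape (system, host and cube, nodes) are hypothetical. The node parameters are the clock rate and memory sizes quoted for the iPSC/860 later in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SAU:
    """Hypothetical System Abstraction Unit: four parameter sets plus children."""
    name: str
    processing: Dict[str, float] = field(default_factory=dict)   # P component
    memory: Dict[str, float] = field(default_factory=dict)       # M component
    comm_sync: Dict[str, float] = field(default_factory=dict)    # C/S component
    io: Dict[str, float] = field(default_factory=dict)           # I/O component
    children: List["SAU"] = field(default_factory=list)

def count_units(sau: SAU) -> int:
    """Number of SAUs in the rooted tree below (and including) sau."""
    return 1 + sum(count_units(child) for child in sau.children)

# Assumed decomposition: system -> (SRM host, i860 cube) -> 8 i860 nodes.
nodes = [
    SAU(f"i860 node {k}",
        processing={"clock_mhz": 40.0},
        memory={"icache_kbytes": 4, "dcache_kbytes": 8, "main_mbytes": 8})
    for k in range(8)
]
cube = SAU("i860 cube", children=nodes)
host = SAU("SRM host")
sag = SAU("iPSC/860 system", children=[host, cube])
print(count_units(sag))
```

The communication and I/O parameter sets are left empty here because the paper does not list their values.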

Application Module
Application abstraction is performed in two steps. Machine-independent application abstraction is performed by recursively characterizing the application description into Application Abstraction Units (AAUs). Each AAU represents a standard programming construct (such as iterative, conditional, sequential) or a communication/synchronization operation, and parameterizes its behavior. AAUs are combined to abstract the control structure of the application, forming the Application Abstraction Graph (AAG). The communication/synchronization structure of the application is superimposed onto the AAG by augmenting the graph with a set of edges corresponding to the communications or synchronization between AAUs. The resulting structure is the Synchronized Application Abstraction Graph (SAAG). The second step consists of machine-specific augmentation and is performed by the machine-specific filter. This step incorporates machine-specific information (such as introduced compiler transformations/optimizations) into the SAAG based on a mapping defined by the user.

Interpretation Engine
The interpretation engine consists of two components: an interpretation function that interprets the performance of an individual AAU, and an interpretation algorithm that recursively applies the interpretation function to the SAAG to predict the performance of the corresponding application. An interpretation function is defined for each AAU type to compute its performance in terms of parameters exported by the associated SAU. Models and heuristics are defined to handle accesses to the memory hierarchy, overlap between computation and communication, and user experimentation with system and run-time parameters. Details of these models and the complete set of interpretation functions can be found in [4].
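The recursive structure of the interpretation algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not the interpretation functions of [4]: the cost formulas, the `AAU` fields, and the SAU parameter names (`op_time`, `loop_overhead`, `cond_overhead`, `latency`, `per_byte`) are all hypothetical, and memory hierarchy effects and computation/communication overlap are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AAU:
    """Hypothetical Application Abstraction Unit."""
    kind: str                 # "Seq", "IterD", "CondtD" or "Comm"
    ops: int = 0              # Seq: number of operations
    trips: int = 1            # IterD: iteration count
    prob: float = 1.0         # CondtD: assumed probability that the body runs
    nbytes: int = 0           # Comm: message size in bytes
    body: List["AAU"] = field(default_factory=list)

def interpret(aau: AAU, sau: Dict[str, float]) -> float:
    """Apply the interpretation function for aau.kind, recursing into the body."""
    inner = sum(interpret(child, sau) for child in aau.body)
    if aau.kind == "Seq":
        return aau.ops * sau["op_time"] + inner
    if aau.kind == "IterD":
        return aau.trips * (sau["loop_overhead"] + inner)
    if aau.kind == "CondtD":
        return sau["cond_overhead"] + aau.prob * inner
    if aau.kind == "Comm":
        return sau["latency"] + aau.nbytes * sau["per_byte"]
    raise ValueError(f"unknown AAU kind: {aau.kind}")

# Placeholder parameters in arbitrary time units (not measured values).
sau = {"op_time": 1.0, "loop_overhead": 0.5, "cond_overhead": 0.25,
       "latency": 100.0, "per_byte": 0.5}

# One communication followed by a 10-trip loop of 4 operations.
program = AAU("Seq", body=[
    AAU("Comm", nbytes=8),
    AAU("IterD", trips=10, body=[AAU("Seq", ops=4)]),
])
print(interpret(program, sau))
```

Because every AAU type has its own function and the algorithm only walks the graph, changing a system parameter requires re-running the walk rather than the application.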

Output Module
The output module provides an interactive interface through which the user can access estimated performance statistics. The user has the option of selecting the type of information, and the level at which the information is to be displayed. Available information includes cumulative execution times, the communication time/computation time breakup, and existing overheads and wait times. This information can be obtained for an individual AAU, cumulatively for a branch of the AAG (i.e. sub-AAG), or for the entire AAG.

Design of the HPF/Fortran 90D Performance Prediction Framework
The HPF/Fortran 90D performance prediction framework is based on the HPF source-to-source compiler technology [5], which translates HPF into loosely synchronous, SPMD Fortran 77 + Message-Passing codes. It uses this technology in conjunction with the performance interpretation model to provide performance estimates for HPF/Fortran 90D applications on a distributed memory MIMD multicomputer. Performance prediction is performed in two phases as described below:

Phase 1 - Compilation
The compilation phase is based on the HPF/Fortran 90D compiler. Given a syntactically correct HPF/Fortran 90D program, this phase performs the following steps:

1. The first step parses the program to generate a parse tree. Array assignment statements and where statements are transformed into equivalent forall statements with no loss of information.

2. The partitioning step processes the compiler directives and, using these directives, partitions the data and computation among the processors.

Phase 2 - Interpretation
Phase 2 is implemented as a sequence of parses: (1) The abstraction parse generates the application abstraction graph (AAG) and synchronized application abstraction graph (SAAG). (2) The interpretation parse performs the actual interpretation using the interpretation algorithm. (3) The output parse generates the required performance metrics.

Abstraction Parse:
The abstraction parse intercepts the SPMD program structure produced in Phase 1 and abstracts its execution and communication structures to generate the corresponding AAG and SAAG (as defined in Section 3). A communication table is generated to store the specifications and status of each communication/synchronization.
The abstraction parse also identifies all critical variables in the application description, a critical variable being defined as a variable whose value affects the flow of execution, e.g. a loop limit. The critical variables are then resolved either by tracing their definition paths or by allowing the user to explicitly specify their values.
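The two resolution routes for critical variables (tracing definitions, or taking a user-specified value) can be sketched as below. The representation is an assumption made for illustration: definitions are stored in a hypothetical dictionary mapping a variable to a constant, to another variable, or to a binary expression, and `resolve` is not a function of the actual framework.

```python
import operator

# Supported binary operators in this toy definition table (assumption).
OPS = {"add": operator.add, "mul": operator.mul, "div": operator.floordiv}

def resolve(name, definitions, user_values):
    """Return the integer value of a critical variable.

    A user-specified value takes precedence; otherwise the definition
    path is traced. Raises KeyError if the variable cannot be resolved.
    """
    if name in user_values:
        return user_values[name]
    expr = definitions[name]            # KeyError: nothing to trace
    if isinstance(expr, int):           # constant definition
        return expr
    if isinstance(expr, str):           # copy of another variable
        return resolve(expr, definitions, user_values)
    op, left, right = expr              # binary expression
    lhs = left if isinstance(left, int) else resolve(left, definitions, user_values)
    rhs = right if isinstance(right, int) else resolve(right, definitions, user_values)
    return OPS[op](lhs, rhs)

# Hypothetical program: local_n = n / nprocs is used as a loop limit.
definitions = {"n": 256, "nprocs": "p", "local_n": ("div", "n", "nprocs")}
user_values = {"p": 4}                  # value supplied by the user
print(resolve("local_n", definitions, user_values))
```

Here `n` is resolved by tracing, while `p` has no definition in the program text and must come from the user, which mirrors the fallback described above.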

Interpretation Parse:
The interpretation parse performs the actual performance interpretation using the interpretation algorithm. For each AAU in the SAAG, the corresponding interpretation function is used to generate the performance measure associated with it. Performance metrics maintained at each AAU are its computation, communication and overhead times, and the value of the global clock. In addition, cumulative metrics are maintained for the entire SAAG. The interpretation parse has provisions to take into consideration a set of compiler optimizations (for the generated Fortran 77 + MP code), such as loop re-ordering. These can be turned on or off by the user.

Output Parse
The final parse communicates estimated performance metrics to the user. The output interface provides three types of outputs. The first type is a generic performance profile of the entire application broken up into its communication, computation and overhead components. Similar measures for each individual AAU and for sub-graphs of the AAG are also available. The second form of output allows the user to query the system for the metrics associated with a particular line (or a set of lines) of the application description. Finally, the system can generate an interpretation trace which can be used as input to the ParaGraph [6] visualization package. The user can then use the capabilities provided by the package to analyze the performance of the application.
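The second form of output, a query by source line, amounts to aggregating per-AAU metrics over a line range. The sketch below assumes each AAU record carries the source line it was abstracted from; the `Metric` record, the `query` function, and all numbers are illustrative placeholders rather than the framework's interface or results.

```python
from collections import namedtuple

# Hypothetical per-AAU record: source line plus its three time components.
Metric = namedtuple("Metric", "line comp comm overhead")

def query(metrics, first, last):
    """Sum the metrics of all AAUs whose source line lies in [first, last]."""
    selected = [m for m in metrics if first <= m.line <= last]
    result = {
        "comp": sum(m.comp for m in selected),
        "comm": sum(m.comm for m in selected),
        "overhead": sum(m.overhead for m in selected),
    }
    result["total"] = result["comp"] + result["comm"] + result["overhead"]
    return result

# Placeholder values in arbitrary time units.
metrics = [
    Metric(line=10, comp=2.0, comm=0.0, overhead=0.5),
    Metric(line=12, comp=1.0, comm=4.0, overhead=0.25),
    Metric(line=20, comp=3.0, comm=0.0, overhead=0.0),
]
profile = query(metrics, 10, 12)
print(profile)
```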

Abstraction & Interpretation of HPF/Fortran 90D Parallel Constructs
The abstraction/interpretation of the HPF/Fortran 90D parallel constructs, i.e. forall, array assignments, and where, is described below.

forall Statement: The forall statement generalizes array assignments to handle new shapes of arrays by specifying them in terms of array elements or sections. The element array may be masked with a scalar logical expression. Its semantics are an assignment to each element or section (for which the mask expression evaluates true), with all the right-hand sides being evaluated before any left-hand sides are assigned. The order of iteration over the elements is not fixed.

Phase 1 translates the forall statement into a three-level structure consisting of a collective communication level, a local computation level and another collective communication level, to be executed by each processor. The processor that is assigned an iteration of the forall loop is responsible for computing the right-hand-side expression of the assignment statement, while the processors that own an array element used in the left-hand side or right-hand side of the assignment statement must communicate that element to the processor performing the computation. Consequently, the first communication level fetches off-processor data required by the computation level. Once this data has been gathered, computations are local. The final communication level writes the calculated values to off-processor array elements.

Phase 2 then generates a corresponding sub-AAG using the application abstraction model. The communication level translates into a sequential (Seq) AAU, corresponding to the index translations and message packing performed, and a communication (Comm) AAU. The computation level generates an iterative (IterD) AAU which may contain a conditional (CondtD) AAU (depending on whether a mask is specified). The abstraction of the forall statement is shown in Figure 2. In this example, the final communication phase is not required as no off-processor data needs to be written.

where Statement: Like the array assignment statement, the HPF/Fortran 90D where statement is also a special case of the forall statement and is handled in a similar way.
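The mapping from the three-level forall structure to a sub-AAG can be sketched as a small constructor. The AAU type names (Seq, Comm, IterD, CondtD) follow the text; everything else, including the dictionary representation and the function names `node`, `abstract_forall` and `kinds`, is a hypothetical illustration rather than the framework's code.

```python
def node(kind, *children):
    """Build a minimal AAU record: a type name and a list of child AAUs."""
    return {"kind": kind, "children": list(children)}

def abstract_forall(reads_remote, writes_remote, masked):
    """Return the sub-AAG (a list of top-level AAUs) for one forall statement."""
    compute = node("Seq")                       # local right-hand-side evaluation
    if masked:
        compute = node("CondtD", compute)       # mask guards the assignment
    levels = []
    if reads_remote:                            # first communication level
        levels += [node("Seq"), node("Comm")]   # index translation/packing, then message
    levels.append(node("IterD", compute))       # local computation level
    if writes_remote:                           # final communication level
        levels += [node("Seq"), node("Comm")]
    return levels

def kinds(nodes):
    """Pre-order list of AAU type names, for inspection."""
    out = []
    for n in nodes:
        out.append(n["kind"])
        out.extend(kinds(n["children"]))
    return out

# A masked forall that reads off-processor data but writes only local elements.
sub_aag = abstract_forall(reads_remote=True, writes_remote=False, masked=True)
print(kinds(sub_aag))
```

The call above corresponds to the case described in the text where the final communication level is omitted because no off-processor data is written.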

Abstraction of the iPSC/860 System
Abstraction of the iPSC/860 hypercube system to generate the corresponding SAG was performed off-line using a combination of assembly instruction counts, measured timings, and system specifications. The processing and memory components were generated using system specifications provided by the vendor, while iterative and conditional overheads were computed using instruction counts. The communication component was parameterized using benchmarking runs. These parameters abstracted both low-level primitives and the high-level collective communication library used by the compiler. Benchmarking runs were also used to parameterize the HPF parallel intrinsic library. The intrinsics included circular shift (cshift), shift to temporary (tshift), global sum operation (sum), global product operation (product), and the maxloc operation, which returns the location of the maximum in a distributed array. Characterization of the SRM (host) and the communication channel connecting the SRM to the i860 cube was performed in a similar manner.

5 Validation/Evaluation of the Interpretation Framework
In this section we present numerical results obtained using the current implementation of the HPF/Fortran 90D performance prediction framework.
In addition to validating the viability of the interpretive approach, this section has the following objectives:

1. To validate the accuracy of the performance prediction framework for applications on a high performance computing system. The aim is to show that the predicted performance metrics are accurate enough to provide realistic information about the application performance and to be used as a basis for design tuning.

2. To demonstrate the utility of the framework and the metrics generated for efficient HPC application development. The results presented illustrate the framework's utility for: (1) application design and directive selection; and (2) application performance debugging.

3. To demonstrate the usability (ease of use) of the performance interpretation framework and its cost-effectiveness.
The high performance computing system used is an iPSC/860 hypercube connected to an 80386-based host processor. The particular configuration of the iPSC/860 consists of 8 i860 nodes. Each node has a 4 KByte instruction cache, 8 KByte data cache and 8 MBytes of main memory. The node operates at a clock speed of 40 MHz and has a theoretical peak performance of 80 MFlop/s for single precision and 40 MFlop/s for double precision. The validation application set was selected from the NPAC HPF/Fortran 90D Benchmark Suite [7]. The suite consists of a set of benchmarking kernels and "real-life" applications and is designed to evaluate the efficiency of the HPF/Fortran 90D compiler and, specifically, automatic partitioning schemes. The selected application set includes kernels from standard benchmark sets like the Livermore Fortran Kernels and the Purdue Benchmark Set, as well as real computational problems. The applications are listed in Table 1.

Table 1: Validation Application Set

Validating Accuracy of the Framework
Accuracy of the performance prediction framework is validated by comparing estimated execution times with actual measured times. For each application, the experiment consisted of varying the problem size and number of processing elements used. Measured timings represent an average of 1000 runs. The results are summarized in Table 2. Error values listed are percentages of the measured time and represent maximum/minimum absolute errors over all problem sizes and system sizes. For example, the N-Body computation was performed for 16 to 4094 bodies on 1, 2, 4, and 8 nodes of the iPSC/860. The minimum absolute error between estimated and measured times was 0.09% of the measured time while the maximum absolute error was 5.9%.
The obtained results show that in the worst case, the interpreted performance is within 20% of the measured value, the best case error being less than 0.001%. The larger errors are produced by the benchmark kernels, which have been specifically coded to task the compiler. Further, it was found that the interpreted performance typically lies within the variance of the measured times over the 1000 runs. This indicates that the main contributors to the error are the tolerance of the timing routines and fluctuations in the system load. Predicted metrics typically serve either as the first-cut performance estimate of an application or as a relative performance measure to be used as a basis for design tuning. In either case, the interpreted performance is accurate enough to provide the required information.
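The error metric described above (absolute error as a percentage of the measured time, reported as a minimum and maximum over all configurations) can be written out as a short sketch. The function name and the sample timings are placeholders chosen for exact arithmetic; they are not values from Table 2.

```python
def abs_error_pct(estimated, measured):
    """Absolute error between estimated and measured time, as a % of measured."""
    return abs(estimated - measured) / measured * 100.0

# Hypothetical (estimated, measured) pairs, one per problem size / system size.
runs = [(1.5, 2.0), (4.0, 4.0), (9.0, 8.0)]
errors = [abs_error_pct(est, meas) for est, meas in runs]
min_error, max_error = min(errors), max(errors)
print(f"min {min_error}%  max {max_error}%")
```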

Validating Utility of the Framework
The utility of the performance prediction framework is validated through the following experiments: (1) selecting the appropriate HPF/Fortran 90D directives based on the predicted performance, and (2) using the tool to analyze different components of the execution time and their distributions with respect to the application. These experiments are described below:

Appropriate Directive Selection
To demonstrate the utility of the interpretive framework in selecting HPF compiler directives, we compare the performance of the Laplace solver for 3 different distributions (DISTRIBUTE directive) of the template, namely (BLOCK,BLOCK), (BLOCK,*) and (*,BLOCK), and corresponding alignments (ALIGN directive) of the data elements to the template. These three distributions (on 4 processors) are shown in Figure 3. Figures 4 & 5 compare the performance of each of the three cases for different system sizes using both measured times and estimated times. These graphs can be used to select the best directives for a particular problem size and system configuration. For the Laplace solver, the (BLOCK,*) distribution is the appropriate choice. Further, since the maximum absolute error between estimated and measured times is less than 1%, directive selection can be accurately performed using the interpretive framework. Using the interpretive framework is also significantly more cost-effective, as will be demonstrated in Section 5.3.
In the above experiment, performance interpretation was source driven and can be automated. This exposes the utility of the framework as a basis for an intelligent compiler capable of selecting appropriate directives and data decompositions. Similarly, it can also enable such a compiler to select code optimizations such as the granularity of the computation phase.
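An automated selection loop of the kind suggested above could take the following form: predict the execution time for each candidate distribution and keep the minimum. This is a sketch under assumptions. The `select_directive` function is hypothetical, the predictor is a stand-in lookup rather than a call into the interpretation engine, and the timings are placeholders that do not reproduce Figures 4 and 5.

```python
def select_directive(candidates, predict):
    """Return the candidate with the smallest predicted time, plus all predictions."""
    timings = {directive: predict(directive) for directive in candidates}
    best = min(timings, key=timings.get)
    return best, timings

# Placeholder predictions in arbitrary time units (not results from the paper).
placeholder_times = {
    "(BLOCK,BLOCK)": 1.5,
    "(BLOCK,*)": 1.25,
    "(*,BLOCK)": 2.0,
}

best, timings = select_directive(placeholder_times.keys(), placeholder_times.get)
print("selected DISTRIBUTE:", best)
```

In a compiler setting, `predict` would be replaced by a run of the abstraction and interpretation parses for the program compiled under each directive.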

Application Performance Debugging
The performance metrics generated by the framework can be used to analyze the performance contribution of different parts of the application description and to identify bottlenecks.A performance profile for the phases (Figure 6) of the parallel stock option pricing application is shown in Figure 7. Phase 1 creates the (distributed) option price lattice while Phase 2, which requires no communication, computes the call prices of stock options.
Application performance debugging using conventional means involves instrumentation, execution and data collection, and post-processing of this data. Further, this process requires a running application and has to be repeated to evaluate each design modification. Using the interpretive framework, this information (at all levels required) is available during application development (without requiring a running application).

Validating Usability of the Framework
The interpreted performance estimates for the experiments described above were obtained using the interpretive framework running on a SPARCstation 1+. The framework provides a friendly menu-driven, graphical user interface to work with and requires no special hardware other than a conventional workstation and a windowing environment. Application characterization is performed automatically (unlike most approaches), while system abstraction is performed off-line and only once. Application parameters and directives were varied from within the interface itself. Typical experimentation on the iPSC/860 (to obtain measured execution times) consisted of editing code, compiling and linking using a cross compiler (compiling on the front end is not allowed, in order to reduce its load), transferring the executable to the iPSC/860 front end, loading it onto the i860 nodes, and then finally running it. The process had to be repeated for each instance of each experiment. Relative experimentation times for different implementations of the Laplace solver (Section 5.2.1) using measurements and the performance interpreter are shown in Figure 8. Experimentation using the interpretive approach required approximately 10 minutes for each of the three implementations. Experimentation using measurements, however, took a minimum of 27 minutes (for the (Blk,*) implementation) and required almost 1 hour for the (*,Blk) case. Clearly, the measurements approach is not feasible, especially when a large number of options have to be evaluated. Further, the iPSC/860, being an expensive resource, is shared by various development groups in the organization. Consequently, its usage can be restrictive and the required configuration may not be immediately available. The comparison above validates the convenience and cost-effectiveness of the framework for experimentation during application development.

Related Work
Existing performance prediction approaches and models for multicomputer systems can be broadly classified as analytic, simulation, monitoring or hybrid (which make use of a combination of the above techniques along with possible heuristics and approximations).
Analytic techniques use mathematical models to abstract the system and application, and solve these models to obtain performance metrics. A general approach for analytic performance prediction for shared memory systems has been proposed by Siewiorek et al. in [8], while probabilistic models for parallel programs based on queueing theory have been presented in [9]. The above approaches require users to explicitly model the application along with the HPC system. A source-based analytic performance prediction model for Dataparallel C has been developed by Clement et al. [10]. The approach uses a set of assumptions and specific characteristics of the language to develop a speedup equation for applications in terms of system costs.
Simulation techniques simulate the hardware and the actual execution of a program on that hardware. These techniques are typically expensive in terms of the time and computing resources required. A simulation-based approach is used in the SiGLe system (Simulator at Global Level) [11].
A hybrid approach is presented in [15], where the runtime of each node of a stochastic graph representing the application is modeled as a random variable. The distributions of these random variables are then obtained using hardware monitoring.
The layered approach presented in [16] uses a methodology based on application and system characterization. The developer is required to characterize the application as an execution graph and define its resource requirements in this system.

In this paper, we described a novel interpretive approach for accurate and cost-effective performance prediction on high performance computing systems. A comprehensive characterization methodology is used to abstract the system and application components of the HPC environment into a set of well-defined parameters. An interpreter engine then interprets the performance of the abstracted application in terms of the parameters exported by the abstracted system. A source-driven HPF/Fortran 90D performance prediction framework based on the interpretive approach has been implemented as part of the HPF/Fortran 90D integrated application development environment. The current implementation of the environment framework is targeted to the iPSC/860 hypercube system.
Numerical results using benchmarking kernels and application codes from the NPAC HPF/Fortran 90D Benchmark Suite were presented to validate the accuracy, utility, and usability of the performance prediction framework. The use of the framework for selecting appropriate compiler directives and for application performance debugging was demonstrated. We are currently working on developing an intelligent HPF/Fortran 90D compiler based on the source-based interpretation model. This tool will enable the compiler to automatically evaluate directives and transformation choices and optimize the application at compile time. Future development of the framework will involve moving it to high performance distributed computing systems and exploiting its potential as a

Figure 2: Abstraction of the forall Statement

Figure 1: Systems Module, Application Module, Interpretation Engine, Output Module
Balasundaram et al. [14] use "training routines" to benchmark the performance of the architecture and then use this information to evaluate different data decompositions.