Data access reorganizations in compiling out-of-core data parallel programs on distributed memory machines

This paper describes optimization techniques for translating out-of-core programs written in a data parallel language to message passing node programs with explicit parallel I/O. We demonstrate that straightforward extension of in-core compilation techniques does not work well for out-of-core programs. We then describe how the compiler can optimize the code by (1) determining appropriate file layouts for out-of-core arrays, (2) permuting the loops in the nest(s) to allow efficient file access, and (3) partitioning the available node memory among references based on I/O cost estimation. Our experimental results indicate that these optimizations can reduce the amount of time spent in I/O by as much as an order of magnitude.


Introduction
The use of massively parallel machines to solve large-scale computational problems in physics, chemistry, and other sciences has increased considerably in recent times. Many of these problems have computational requirements that stretch the capabilities of even the fastest supercomputers available today. In addition to requiring a great deal of computational power, these problems usually deal with large quantities of data, up to a few terabytes. Main memories are not large enough to hold this much data, so data needs to be stored on disks and fetched during the execution of the program. Unfortunately, the performance of the I/O subsystems of massively parallel computers has not kept pace with their processing and communication capabilities. Hence, the performance bottleneck is the time taken to perform disk I/O.
In this paper we describe data access reorganization strategies for the efficient compilation of out-of-core data parallel programs on distributed memory machines. In particular, we address the following issues: (1) how to estimate the I/O costs associated with different access patterns in out-of-core computations, (2) how to reorganize data on disks to reduce I/O costs, and (3) when multiple out-of-core arrays are involved in the computation, how to allocate memory to individual arrays to minimize I/O accesses. The rest of the paper is organized as follows. Section 2 introduces our model and Section 3 explains our out-of-core compilation strategy. Section 4 discusses how I/O optimizations can reduce the I/O cost of loop nests. Section 5 presents experimental results. Section 6 discusses related work and concludes.

Model for Out-of-Core Compilation
In the SPMD model, parallelism is achieved by partitioning data among processors. To achieve load balance, express locality of access, and reduce communication, several distribution and alignment strategies are often used. Many parallel languages or language extensions provide directives that enable the expression of mappings from the problem domain to the processing domain. The compiler uses the information provided by these directives to compile global name space programs for distributed memory computers. Examples of parallel languages which support data distributions include Vienna Fortran [9] and HPF [4].
Explicit or implicit distribution of data results in each processor having a local array associated with it. For large data sets, local arrays cannot entirely fit in local memory, and parts of them have to be stored on disk. We refer to such local arrays as out-of-core local arrays. The out-of-core local arrays of each processor are stored in separate files called local array files. We assume that each processor has its own logical disk, with the local array files stored on that disk. If a processor needs data from a local array file of another processor, the required data will first be read by the owning processor and then communicated to the requesting processor.

Compilation Strategy
In order to translate out-of-core programs, the compiler has to take into account the data distribution on disks, the number of disks used for storing data, etc. The portions of the local arrays currently required for computation are fetched from disk into memory. These portions are called (data) tiles. Each processor performs the computation on its tiles.
Figure 1 shows the steps involved in translating an out-of-core program consisting of a single loop nest. The compilation consists of two phases. In the first phase, called the in-core phase, the arrays in the source program are partitioned according to the distribution information and the bounds of the local arrays are computed. The second phase, called the out-of-core phase, involves adding appropriate statements to perform I/O and communication. The local arrays are first tiled according to the node memory available on each processor. The resulting tiles are analyzed for communication. The loops are then modified to insert the necessary I/O calls.
Consider the loop nest shown in Figure 2:(A), where lb_k, ub_k and s_k are the lower bound, upper bound and step size, respectively, for loop k. This nest will be translated by the compiler into the node program shown in Figure 2:(B). In this translated code, loops IT and JT are called tiling loops, and loops IE and JE are called element loops. Note that communication is allowed only at tile boundaries (outside the element loops). For the sake of clarity, we will write this translated version as shown in Figure 2:(C): all communication statements and element loops are omitted, and in the computation part each reference is replaced by its sub-matrix version.
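The tiling-loop/element-loop structure described above can be sketched in executable form. The fragment below is only an illustrative stand-in for a generated node program: a single processor, a trivial doubling computation, and in-memory lists standing in for local array files are all our assumptions, not the compiler's actual output.

```python
N, S = 8, 4          # local array extent and tile size (we assume S divides N)

# stand-ins for local array files on the logical disk
disk_A = [[float(i * N + j) for j in range(N)] for i in range(N)]
disk_B = [[0.0] * N for _ in range(N)]

for IT in range(0, N, S):          # tiling loops: one iteration per data tile
    for JT in range(0, N, S):
        # "read tile": in a real node program this is one parallel-I/O request
        tile = [[disk_A[IT + ie][JT + je] for je in range(S)] for ie in range(S)]
        for IE in range(S):        # element loops: pure in-memory computation
            for JE in range(S):
                tile[IE][JE] = 2.0 * tile[IE][JE]
        for IE in range(S):        # "write tile" back to the local array file
            for JE in range(S):
                disk_B[IT + IE][JT + JE] = tile[IE][JE]
```

Communication, if any, would be placed at the tile boundaries, i.e. between the tiling loops and the element loops.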

I/O Optimizations
We first consider the example shown in Figure 3:(A), assuming that A, B and C are column-major out-of-core arrays. (In this example, using HPF-like directives, the array A is distributed row-block and the array B column-block across the processors, while the array C is replicated; notice that in out-of-core computations, compiler directives apply to data on disks.) The compilation is performed in two phases as described before. In the in-core phase, using the array distribution information, the compiler computes the local array bounds and partitions the computation. In the second phase, tiling of the data is carried out using the information about the available node memory size. The I/O calls to fetch the necessary data tiles for A, B and C are inserted, and finally the node program is generated. Figure 3:(B) shows the straightforward node program. The I/O cost of an optimized version is much better than that of the original, provided that 3nS ≤ M. Note that, in order to keep the calculations simple, we have assumed that at most n elements can be requested in a single I/O call. The rest of the paper explains how to obtain I/O-optimized node programs. Our approach consists of three steps: (1) determining the most appropriate file layouts for all arrays referenced in the nest, (2) permuting the loops in the nest in order to maximize locality, and (3) partitioning the available memory across references based on I/O cost.
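As a rough illustration of how I/O costs for step (3) can be estimated, the sketch below counts one I/O call per maximal contiguous run in the file when a rectangular tile is read. This simplified counting model, and the function name io_calls, are our own assumptions for illustration, not the paper's exact cost formulas.

```python
def io_calls(array_shape, tile_shape, layout):
    """Estimate the number of I/O calls needed to read a tile, counting one
    call per maximal contiguous run of file elements (a simplification)."""
    # Order dimensions from slowest- to fastest-varying in the file.
    dims = list(range(len(array_shape)))
    if layout == "column-major":
        dims.reverse()           # first dimension varies fastest
    # Find the longest fastest-varying suffix fully spanned by the tile;
    # those dimensions merge with the one just before them into single runs.
    j = len(dims)
    while j > 0 and tile_shape[dims[j - 1]] == array_shape[dims[j - 1]]:
        j -= 1
    calls = 1
    for d in dims[:max(j - 1, 0)]:
        calls *= tile_shape[d]
    return calls

# An S x S tile of a column-major N x N array needs S calls (one per column
# segment), while an N x S tile spanning whole columns needs a single call.
N, S = 1024, 64
print(io_calls((N, N), (S, S), "column-major"))   # many short reads
print(io_calls((N, N), (N, S), "column-major"))   # one contiguous read
```

This is the intuition behind the layout and loop-order choices that follow: tiles that span the fast-varying dimension of the file cost far fewer I/O calls.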
We assume that the file layout for any out-of-core array may be either row-major or column-major, and that there is only one distinct reference per array.

Definition:
The Order of a term is the greatest symbolic value it contains. For example, the order of (S + n) is n, whereas the order of S is S. A term that contains neither n nor S is called a constant-order term.
After listing all possible layout combinations term by term, our layout determination algorithm chooses the combination with the greatest number of constant-order and/or S-order terms. For our example, combination 6 is an optimum combination, since it contains two S-order terms (S and 2S).
Notice that there may be more than one optimum combination.
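The selection step can be sketched as follows. Each candidate layout combination is summarized by the symbolic orders of its summed cost terms ("const" < "S" < "n"), and the combination with the most constant-order and S-order terms wins. The candidate term lists below are illustrative placeholders, not the actual entries of Table 1.

```python
def best_combination(combinations):
    """combinations: iterable of (id, list of term orders).
    Picks the combination with the most 'const' and 'S' order terms."""
    def score(terms):
        return sum(1 for t in terms if t in ("const", "S"))
    return max(combinations, key=lambda kv: score(kv[1]))

# hypothetical candidates: combination 1 has only n-order terms,
# combination 6 has two S-order terms, as in the running example
candidates = {
    1: ["n", "n", "n"],
    6: ["S", "S", "n"],
}
best = best_combination(candidates.items())
print(best[0])   # combination 6 is preferred
```

Ties are possible, which is exactly the observation that more than one optimum combination may exist.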

Deciding Loop Order
Our technique next determines an optimal loop order for efficient file access. Returning to our example, for combination 6, TCost(IT) = 2n/p, TCost(JT) = S, TCost(KT) = 2S, and TCost(LT) = n. The desired loop permutation from outermost to innermost is LT, IT, KT, JT, assuming p ≥ 2. Considering the temporal locality for the array being written to, the compiler interchanges LT and IT, and obtains the order IT, LT, KT, JT.
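The two steps above (sort by non-increasing TCost, then move the written array's loop outermost) can be sketched directly. The concrete values n = 1024, S = 16 and p = 4 are assumed only to make the running example executable.

```python
n, S, p = 1024, 16, 4            # assumed values for illustration

# TCost values for combination 6, as in the running example
tcost = {"IT": 2 * n // p, "JT": S, "KT": 2 * S, "LT": n}

# step 1: tiling loops from outermost to innermost by non-increasing TCost
order = sorted(tcost, key=tcost.get, reverse=True)
print(order)                      # LT, IT, KT, JT (assuming p >= 2)

# step 2: temporal-locality interchange -- IT indexes the array being
# written, so it is moved to the outermost position
order.remove("IT")
order.insert(0, "IT")
print(order)                      # IT, LT, KT, JT
```

Note that for very small p the relative magnitude of 2n/p changes, which is why the preferred order depends on p.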

Memory Allocation Scheme
Since each node has a limited memory capacity and in general a loop nest may contain a number of arrays, the memory should be partitioned optimally.
Definition: The column-conformant (row-conformant) position of an array reference is the first (last) index position of it.
Our scheme starts with tiles of size S in each dimension of each reference. For example, if a loop nest contains a one-dimensional array and a three-dimensional array, it first allocates a tile of size S for the one-dimensional array and a tile of size S x S x S for the three-dimensional array. This allotment implies the memory constraint S^3 + S ≤ M, where M is the size of the node memory. The scheme then divides the array references in the nest into two disjoint groups depending on their file layouts. For the row-major (column-major) group, the compiler considers all loop indices in turn. For each loop whose index appears in at least one row-conformant (column-conformant) position and does not appear in any other position of any reference in this group, it increases the tile size in the row-conformant (column-conformant) position(s) to the full array size. Of course, the memory constraint should be adjusted accordingly.
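A minimal sketch of this expansion rule follows. It ignores the memory-constraint adjustment and any cross-group inconsistencies, and the data representation (references as tuples of loop-index names) is our own assumption.

```python
def allocate_tiles(groups, S, n):
    """groups: {layout: {ref_name: tuple of loop indices}}.
    The conformant position is the last index for the row-major group
    and the first index for the column-major group."""
    tiles = {}
    for layout, refs in groups.items():
        conf = -1 if layout == "row-major" else 0
        for name, idx in refs.items():
            tile = [S] * len(idx)
            # loop indices occurring in NON-conformant positions of any
            # reference in this group
            others = [ix for r in refs.values()
                      for pos, ix in enumerate(r)
                      if pos != (len(r) + conf) % len(r)]
            if idx[conf] not in others:
                tile[conf] = n        # expand to full array size
            tiles[name] = tuple(tile)
    return tiles

# running example: A and C row-major, B column-major
groups = {"row-major": {"A": ("IT", "JT"), "C": ("LT", "KT")},
          "column-major": {"B": ("KT", "IT")}}
print(allocate_tiles(groups, 16, 1024))
```

With these groups, JT and KT are expanded, giving S x n tiles for A and C and an n x S tile for B, which matches the allocation derived for the running example.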
For our running example, the compiler first allocates a tile of size S x S for each reference. It then divides the array references into two groups. (Note that, in some cases, the actual value of p can change the preferred loop order. Note also that, after these tile-size adjustments, any inconsistency between the two groups, due to a common loop index, should be resolved by not changing the original tile sizes in the dimensions in question.)

The references fall into two groups: A[IT, JT] and C[LT, KT] in the first group, and B[KT, IT] in the second group.
The complexity of our heuristic is Θ(lmd + 2^m + l log(l)), where l is the number of loops, m is the number of distinct array references, and d is the maximum number of array dimensions over all references. The log term comes from sorting once the TCost value for each loop index has been computed. Since in practice l, m and d are very small (e.g., 2 or 3), all the steps are inexpensive and the approach is efficient. It should also be noted that if the desired loop permutation is not legal (i.e., not semantics-preserving), the compiler keeps the original loop order and applies only the memory allocation algorithm.

Experimental Results
The technique introduced in this paper was applied on an IBM SP-2 by hand using PASSION [7], a run-time library for parallel I/O. PASSION routines can be called from C and Fortran, and an out-of-core array can be associated with different layouts. All the reported times are in seconds. The experiments were performed for different values of the slab ratio (SR), the ratio of the available node memory to the combined size of the out-of-core local arrays. (When the desired loop permutation is illegal, another option is to try the next most desirable loop permutation; our choice of keeping the original order is simpler and guarantees that the optimized program will be at least as good as the original one.) Figure 4 presents the normalized I/O times of four different versions of our first example (Figure 3) with 4K x 4K (128 MByte) double arrays: the unoptimized version (Original), an optimized version using column-major layout for all arrays (Col-Opt), an optimized version using row-major layout for all arrays (Row-Opt), and the version optimized by our approach (Opt). Figure 5 illustrates the speedups for the Original and Opt versions. We define two kinds of speedup: the speedup obtained for each version by increasing the number of processors, which we call Sp, and the speedup obtained by using the Opt version instead of the Original when the number of processors is fixed. We call this second speedup the local speedup (Sl), and the product Sp x Sl is termed the combined speedup (see Figure 6:(A)). We conclude the following:

(1) The Opt version performs much better than all other versions.
(2) When the slab ratio is decreased, the effectiveness of our approach increases (see Figure 4).
(3) As shown in Figure 5, the Opt version also scales better than the Original for all slab ratios.
(4) It is clear from Figure 6:(A) that the combined speedup is much higher for small slab ratios. Note that the combined speedups are super-linear, as the algorithm (loop order) is changed in the Opt version.
(5) When the slab ratio is very small, the optimized versions with fixed layouts for all files also perform much better than the Original.

Processor Coefficient and Memory Coefficient
The I/O optimizations introduced in this paper can be evaluated in two different ways: (1) First, a problem that is solved by the Original version using a fixed slab ratio on p processors can, in principle, be solved in the same or less time on p' processors with the same slab ratio using the Opt version. The ratio p/p' is termed the processor coefficient (PC).
(2) Second, a problem that is solved on a fixed number of processors with a slab ratio sr by the Original version can, in principle, be solved in the same or less time on the same number of processors with a smaller slab ratio (less memory) sr' by the Opt version. We call the ratio sr/sr' the memory coefficient (MC).
The larger these coefficients are, the better, as they indicate reductions in the processor and memory requirements of the application program, respectively.
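The two definitions can be made concrete with a small calculation. The timing table below is a hypothetical placeholder, NOT measured results from our experiments; it only illustrates how PC is read off a pair of timing curves (MC is computed analogously over slab ratios).

```python
# hypothetical I/O times (seconds) per processor count -- illustration only
t_orig = {2: 100.0, 4: 55.0, 8: 30.0}   # Original version
t_opt  = {2: 22.0, 4: 12.0, 8: 7.0}     # Opt version

def processor_coefficient(p):
    """PC = p / p', where p' is the smallest processor count on which the
    Opt version is at least as fast as the Original version on p."""
    target = t_orig[p]
    p_prime = min(q for q in sorted(t_opt) if t_opt[q] <= target)
    return p / p_prime

# With these hypothetical numbers, Opt on 2 processors already beats
# Original on 8 processors, giving PC = 8 / 2.
print(processor_coefficient(8))
```

A PC of 4 would mean the optimized program needs only a quarter of the processors to match the unoptimized program's I/O time.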
Figures 6:(B) and (C) show the PC and MC curves, respectively, for our example with 4K x 4K (128 MByte) double arrays. It can be observed that there is a slab ratio, called the critical slab ratio, beyond which the shape of the PC curve does not change. In Figure 6:(B) the critical slab ratio is 1/64. Below this ratio, independent of the node memory capacity, for a given p it is possible to find the corresponding p', where p and p' are as defined above. Similarly, it can be observed that there is a number of processors beyond which the shape of the MC curve does not change.
In Figure 6:(C) that number is 8. This means that beyond that number of processors, given an sr, it is possible to find the corresponding sr', where sr and sr' are as defined above.
We believe that the final PC and MC curves give enough information about the performance of the I/O optimizations.

Related Work and Conclusions
Previous work on compiler optimizations to improve locality has concentrated on iteration-space tiling. In [8] and [5], iteration-space tiling is used to optimize cache performance.
There is some work on the compilation of out-of-core programs. In [2], the functionality of ViC*, a compiler-like preprocessor for out-of-core C*, is described. In [6], compiler support for handling out-of-core arrays on parallel architectures is discussed. In [1], a strategy to compile out-of-core programs on distributed-memory message-passing systems is offered. It should be noted that our optimization technique is general in the sense that it can be incorporated into any out-of-core compilation framework for parallel or sequential machines.
In this paper we presented how a basic in-core compilation method can be extended to compile out-of-core programs. However, the code generated by such a straightforward extension may not perform well. We proposed a three-step I/O optimization process by which the compiler can improve the code generated by that method.
Our work is unique in the sense that it combines data transformations (layout determination) and control transformations (loop permutation) in a unified framework for optimizing out-of-core programs on distributed-memory message-passing machines.

Definition:
Assume a loop index IT, an array reference R with an associated file layout, and an array index position r. Also assume a data tile whose size is S in each dimension except the rth, where its size is n, provided that n = O(N) >> S, where N is the size of the array in the rth dimension. Then the Index I/O Cost of IT with respect to R, layout and r is the number of I/O calls required to read such a tile from the associated file into memory if IT appears in the rth position of R; otherwise the Index I/O Cost is zero. The Index I/O Cost is denoted ICost(IT, R, r, layout) [3].

Definition:
The Basic I/O Cost of a loop index IT with respect to a reference R is the sum of the Index I/O Costs of IT over all index positions of R:

BCost(IT, R, layout) = Σ_r ICost(IT, R, r, layout)

Definition:
The Array Cost of an array reference R is the sum of the BCost values of all loop indices with respect to R:

ACost(R, layout) = Σ_IT BCost(IT, R, layout)

Determining File Layouts
Our heuristic for determining file layouts for out-of-core local arrays first computes the ACost values for all arrays under all possible layouts. It then chooses the combination that allows the compiler to perform the most efficient file access. Consider the assignment statement in Figure 3:(B). The term-by-term additions of the ACost values for the different combinations are shown in Table 1.

Definition:
The Total I/O Cost of a loop index IT is the sum of the Basic I/O Costs of IT with respect to each distinct array reference it surrounds:

TCost(IT) = Σ_R BCost(IT, R, layout_R)

where R ranges over the array references and layout_R is the layout of the associated file as determined in the previous step. Our algorithm for finding the desired loop permutation (1) calculates TCost(IT) for each tiling loop IT, (2) permutes the tiling loops from outermost to innermost according to non-increasing values of TCost, and (3) applies the necessary loop interchange(s) to improve the temporal locality for the tile being updated.
For our running example, the references fall into two groups: A[IT, JT] and C[LT, KT] in the first group, and B[KT, IT] in the second. Since JT and KT appear in the row-conformant positions of the first group and do not appear elsewhere in that group, our algorithm allocates data tiles of size S x n for A[IT, JT] and C[LT, KT]. Similarly, since KT appears in the column-conformant position of the second group and does not appear elsewhere in that group, the algorithm allocates a data tile of size n x S for B[KT, IT]. After these tile allocations the tiling loops KT and JT disappear, and the node program shown in Figure 3:(C) is obtained. If instead we assume a fixed column-major file layout for all arrays, then TCost(IT) = S + n/p, TCost(JT) = n, TCost(KT) = S + n, and TCost(LT) = S (from the first row of Table 1).

Figure 5. Speedups for unoptimized and optimized versions of our example with 4K x 4K double arrays.

So, from outermost to innermost, KT, JT, IT, LT is the desirable loop permutation. Considering the temporal locality for the array being written to, the compiler interchanges KT and IT, and the order IT, JT, KT, LT is obtained. If, on the other hand, we assume a fixed row-major layout for all arrays, then TCost(IT) = n/p + S, TCost(JT) = S, TCost(KT) = n + S, and TCost(LT) = n. From outermost to innermost, KT, LT, IT, JT is the desirable loop permutation. Considering the temporal locality, our compiler takes IT to the outermost position, so the final loop order is IT, KT, LT, JT. It should be emphasized that although, for reasonable values of M, the costs obtained under the assumption of fixed disk layouts are better than that of the unoptimized version, they are much worse than the one obtained by our approach.