A Framework for Integrated Communication and I/O Placement

This paper describes a framework for analyzing dataflow within an out-of-core parallel program. Dataflow properties of the FORALL statement are analyzed and a unified I/O and communication placement framework is presented. This placement framework can be applied to many problems, including the elimination of redundant I/O incurred in communication. The framework is validated by applying it to optimize I/O and communication in out-of-core stencil problems. Experimental performance results on an Intel Paragon show significant reduction in I/O and communication overhead.

[...] optimization within a FORALL construct. We showed that existing frameworks do not extend directly to out-of-core problems and cannot exploit the FORALL semantics. We presented a unified framework for the placement of I/O and communication calls and applied it to optimize communication for stencil applications. Using the experimental results, we demonstrated that correct placement of I/O and communication calls can completely eliminate extra file I/O from communication and that, as a result, significant performance improvement can be obtained.


Introduction
It is widely acknowledged in high-performance computing circles that parallel input/output requires substantial improvement in order to make scalable computers truly usable. A parallel application may perform input/output for several reasons. These include real-time I/O, initial/final read-write, checkpointing, and out-of-core computations [Bor96].
We focus on the problem of supporting out-of-core computations. Out-of-core computations are those computations whose primary data sets are stored in files in secondary memory. Specifically, we concentrate on compiling out-of-core programs developed using High Performance Fortran (HPF) [Hig93]. HPF is a data-parallel language that provides explicit language directives to partition data over processors in certain pre-defined decomposition patterns such as BLOCK and CYCLIC. This data distribution results in each processor storing a local array associated with each array distributed in the HPF program. HPF also provides data-parallel program constructs such as FORALL [Hig93].
In this paper, we describe a dataflow framework for optimizing communication in out-of-core problems. We focus on communication optimization within a single out-of-core FORALL construct. Unlike the available dataflow frameworks for optimizing inter-processor communication [KN94,KS95,GSS95], our framework takes a unified approach to placing I/O and communication calls while preserving the characteristics of these calls. All the current frameworks focus on improving communication performance by vectorizing messages, eliminating redundant communication, and overlapping communication with computation. However, these frameworks do not directly extend to out-of-core problems. Another limitation of these frameworks is that they do not make efficient use of the copy-in-copy-out semantics of the HPF FORALL construct. We illustrate these points by applying two communication placement frameworks [KN94,KS95] to an out-of-core problem performing stencil computations (also called a regular problem). We then compare the results with an integrated I/O and communication placement framework, which achieves substantial performance improvement by simultaneously reordering I/O and communication calls.
The paper is organized as follows. Section 2 introduces various dataflow definitions that will be used throughout the paper. In Section 3, we present an out-of-core regular problem and analyze its communication and I/O pattern. This problem is used as a running example throughout the paper. Section 4 presents an integrated I/O and communication framework and describes its application in eliminating extra file I/O from communication. Section 5 presents experimental performance results of optimizing out-of-core communication in stencil problems using our framework. Finally, we conclude in Section 6.

Background
Our program representation is based on [KS95]. Let G = (N, E) be the interval flow graph representing an HPF program, with node set N and edge set E. Let s and e be the unique start and end nodes of G. Every edge in E can be classified as an entry, forward, or backward edge. Let a Tarjan interval T(h) represent a set of program flow nodes that correspond to a loop in the program text. T(h) has a unique header h, where $h \notin T(h)$. For every node n of the interval flow graph G, we define Succ(n) and Pred(n) as the sets of successor and predecessor nodes of n. The edges induce the following traversal orders over G: given a forward edge (m, n), a Forward order visits m before n and a Backward order visits m after n. Let Header denote the header node of the interval T(n). [Bor96] describes the properties of the interval flow graph.
To analyze the dataflow properties of the FORALL statement, we use the classical dataflow definitions, i.e., USE, DEF, and KILL. A variable is said to be USEd if it is referenced in an expression. A variable is said to be DEFed if it is initialized in an expression. The variable is said to be LIVE until it is defined again (in other words, KILLed). We can extend these definitions to objects such as arrays. An array is said to be INJURED if some elements of the array are overwritten; otherwise the array can be considered LIVE. An array is said to be ACTIVE if some of its elements are either USEd or DEFed, and these elements constitute the Active set of the array.
Recall that the FORALL statement has copy-in-copy-out semantics [Hig93]. Consequently, during the execution of a FORALL statement, old as well as new values of an array can be LIVE. In other words, the FORALL statement satisfies the DELAYED KILL property [Bor96]. We use the variable DKILL to represent an array that satisfies the DELAYED KILL property.
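The effect of copy-in-copy-out semantics can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the averaging stencil, the function names `forall_relax` and `sequential_relax`, and the sample data are hypothetical and do not come from the paper. The first function reads only old values (so old and new values are LIVE at the same time), while the second lets each update become visible immediately, as an ordinary sequential loop would.

```python
def forall_relax(a):
    """Copy-in-copy-out: every right-hand side reads the OLD array values."""
    old = list(a)              # copy-in: snapshot of the old values
    new = list(a)              # boundary points keep their old values
    for i in range(1, len(a) - 1):
        # Assumed 3-point averaging stencil (illustrative, not from the paper).
        new[i] = (old[i - 1] + old[i + 1]) / 2
    return new                 # copy-out: all updates become visible together


def sequential_relax(a):
    """Ordinary loop semantics: each update is visible to later iterations."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = (a[i - 1] + a[i + 1]) / 2
    return a


data = [0.0, 4.0, 8.0, 4.0, 0.0]
forall_result = forall_relax(data)
loop_result = sequential_relax(data)
print(forall_result)
print(loop_result)
```

The two results differ at the last interior point, because the sequential loop reads a neighbor that has already been overwritten. This is why an out-of-core implementation of FORALL has to keep the old values available until they are no longer needed.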
We now define some dataflow variables that will be used for analyzing communication and I/O access patterns in out-of-core programs. Let $\mathit{Active}^p_n$ denote the set of elements that will be used in computation in processor $p$ at a node $n$ in the interval flow graph. Similarly, let $\mathit{Incore}^p_n$ denote the set of elements read by processor $p$ at node $n$. $\mathit{Active}^p_n$ and $\mathit{Incore}^p_n$ are used to compute the send and receive sets for each processor, $\mathit{Send}^p_n$ and $\mathit{Recv}^p_n$. Using $\mathit{Send}^p_n$ and $\mathit{Recv}^p_n$, we can compute the set of elements communicated at a node $n$, $\mathit{Comm}_n$, as $\bigcup_i \{\mathit{Send}^i_n + \mathit{Recv}^i_n\}$. Similarly, we compute the set of in-core elements at node $n$, $\mathit{Incore}_n$, as $\bigcup_i \mathit{Incore}^i_n$. For every node $n$, every processor $p$, and $\mathit{Send}^p_n$, we define $\mathit{Eio}^p_n$ as the set of elements that will be sent by $p$ but are not members of $\mathit{Incore}^p_n$. Formally, $\mathit{Eio}^p_n = \mathit{Send}^p_n - (\mathit{Incore}^p_n \cap \mathit{Send}^p_n)$.
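These set definitions map directly onto set operations. The following Python sketch is a minimal illustration, not the paper's implementation: the function names `extra_io`, `comm_set`, and `incore_set` are hypothetical, and the sample sets are taken from the description of processor 0 in the first stripmined iteration of the running example (it holds a(1) to a(9) in memory and must send a(16)).

```python
def extra_io(send, incore):
    """Eio = Send - (Incore & Send): elements sent but not already in memory."""
    return send - (incore & send)


def comm_set(send_by_proc, recv_by_proc):
    """Comm_n: union over processors of their Send and Recv sets at a node."""
    result = set()
    for p in send_by_proc:
        result |= send_by_proc[p] | recv_by_proc[p]
    return result


def incore_set(incore_by_proc):
    """Incore_n: union over processors of their Incore sets at a node."""
    result = set()
    for elements in incore_by_proc.values():
        result |= elements
    return result


# Processor 0, first stripmined iteration (values assumed from the running example).
incore_p0 = set(range(1, 10))      # a(1)..a(9) are in memory
send_p0 = {16}                     # a(16) must be sent to processor 1
eio_p0 = extra_io(send_p0, incore_p0)
print(eio_p0)
```

A non-empty result means that the processor has to perform a file read only to satisfy communication. If the element to be sent were already in memory, for instance `extra_io({9}, incore_p0)`, the result would be the empty set.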
For any data set $d \in$ Incore, Send, or Recv, the following predicates are [...]

Figure 1:2 presents an HPF example in which an out-of-core array a is distributed over 4 processors in BLOCK fashion. This example is used as a running example throughout the paper. It performs one-dimensional relaxation using 3-point stencil computations. The interior points of the array a are updated using a FORALL construct. To preserve the FORALL semantics, it is necessary to use temporaries to store initial and intermediate data. Since the primary data sets are stored in files, it is necessary to use two different files: the source local array file (LAF) for reading initial data and a temporary LAF for storing the updated intermediate data. After the computation is over, the temporary LAF can be renamed as the source LAF. Figure 1:3 shows the pseudo-code for the stripmined program (assuming that the available memory per processor is 10). There are two stripmined iterations; each iteration reads the initial data from the source file into an in-core local array (ICLA) temp and writes the intermediate results from an ICLA temp1 to the temporary file. After reading the ICLA, each iteration performs communication (if required). For example, in the first iteration, processors 0, 1, and 2 send elements a(16), a(32), and a(48) to processors 1, 2, and 3, respectively. In the second iteration, processors 1, 2, and 3 send elements a(17), a(33), and a(49) to processors 0, 1, and 2, respectively. Note that this is an example of the Receiver-driven In-core communication method [Bor96].

[...] space while the bounds for the in-core computation are given in the local stripmined space (i.e., lb=1 and ub=8). For example, Read 1:9 0 means that processor 0 is reading elements a(1) to a(9), Comm 0 16 → 1 represents communication of element a(16) from processor 0 to processor 1, and Write 1:8 0 means that processor 0 is writing elements a(1) to a(8).

I/O and Communication Optimization: An Example
From the computation pattern, it is easy to determine the communication pattern for each stripmined iteration [Bor96]. For example, in the first iteration, processor 0 needs to send element a(16) to processor 1. However, since processor 0 does not have element a(16) in memory, it needs to read it from the LAF and send it to processor 1. Similarly, processors 1 and 2 need to read elements a(32) and a(48) from their LAFs and send them to their respective destinations. These file reads are termed extra since the elements read are not required for computation by the owner processor. In the second iteration, processors 1, 2, and 3 also perform extra file accesses to read elements a(17), a(33), and a(49), respectively. To prevent violation of the FORALL semantics, the old values of elements a(17), a(33), and a(49) are read from the source LAF and communicated to the appropriate processors. It should be observed that elements a(17), a(33), and a(49) are brought into memory in the first iteration and could be communicated before or after they are overwritten, thus minimizing extra file accesses. The example also performs redundant reads of some elements. For example, in the first iteration, processor 0 reads elements a(1) to a(9) but writes modified values of elements a(1) to a(8), while retaining the old set of elements a(1) to a(9) in the form of temporaries. In the second iteration, processor 0 again reads the old values of elements a(8) and a(9). Therefore, these two reads are partially redundant. Such partially redundant reads can be eliminated if it is possible to determine which elements can be reused across iterations.
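The extra file reads described above can be reproduced with a short simulation. The Python sketch below is an illustration under assumptions: the array has 64 elements distributed in BLOCK fashion over 4 processors, each stripmined iteration computes 8 elements, and the in-core set is the strip plus its stencil neighbors clipped to the local block. All function names (`owner`, `incore`, `recv`, `extra_reads`) are hypothetical.

```python
N, P, STRIP = 64, 4, 8       # array size, processors, elements computed per iteration
BLOCK = N // P               # elements owned by each processor


def owner(e):
    """Processor owning element a(e) under BLOCK distribution (1-based e)."""
    return (e - 1) // BLOCK


def strip_bounds(p, it):
    """Global bounds of the strip computed by processor p in iteration it."""
    lo = p * BLOCK + it * STRIP + 1
    return lo, lo + STRIP - 1


def incore(p, it):
    """Local elements read into the ICLA: the strip plus neighbors, clipped to p's block."""
    lo, hi = strip_bounds(p, it)
    first, last = p * BLOCK + 1, (p + 1) * BLOCK
    return set(range(max(lo - 1, first), min(hi + 1, last) + 1))


def recv(p, it):
    """Off-processor elements required by the 3-point stencil."""
    lo, hi = strip_bounds(p, it)
    return {e for e in (lo - 1, hi + 1) if 1 <= e <= N and owner(e) != p}


def extra_reads(it):
    """(owner, element) pairs that an owner must read from its LAF only to send."""
    extra = []
    for p in range(P):
        for e in sorted(recv(p, it)):
            q = owner(e)
            if e not in incore(q, it):
                extra.append((q, e))
    return extra


first_iteration = extra_reads(0)
second_iteration = extra_reads(1)
print(first_iteration)
print(second_iteration)
```

Under these assumptions, the first iteration yields extra reads of a(16), a(32), and a(48) by processors 0, 1, and 2, and the second iteration yields extra reads of a(17), a(33), and a(49) by processors 1, 2, and 3. The same helper also shows the partially redundant reads: `incore(0, 0)` and `incore(0, 1)` share the elements a(8) and a(9).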
As observed before, for our running example, communication requires both inter-processor communication (i.e., communication of in-core data) and file I/O. To reduce the communication cost, it is very important to minimize the file I/O cost (or the number of file accesses). The file accesses generated by the program can be classified into (1) Compulsory: accesses required to read and write in-core data, and (2) Extra: accesses required for communicating off-processor out-of-core elements. The file I/O cost can be reduced by (1) eliminating partially redundant compulsory file accesses and (2) minimizing extra file accesses by communicating in-core data whenever possible. The second optimization requires reordering computation and placing the communication calls so that only in-core data is communicated [Bor96]. In an out-of-core application, the computation order is decided by the data access pattern, that is, by the placement of the read/write calls. Therefore, to minimize the overhead due to file I/O in communication, it is important that both communication and I/O calls are placed at appropriate positions.

A Framework for Integrated I/O and Communication Placement
In Section 3, we described the compilation of an out-of-core FORALL statement. We observed that the implementation of an out-of-core FORALL requires extra file accesses during communication and that a naive implementation results in reading redundant data. In this section, we propose an integrated I/O and communication placement framework that exploits the DELAYED KILL property of the FORALL construct and applies the array access information to improve overall performance. Note that the indeterminacy in the FORALL execution order allows our framework to freely reorder in-core computations. Specifically, our framework reorders in-core computation such that communication involves only inter-processor communication. Consequently, all extra file accesses are eliminated.

The Correctness Criteria
Our integrated framework imposes the following correctness requirements:
- Safety: All data either communicated or read is used immediately.
- Sufficiency: Every in-core computation is preceded by an appropriate Read call and each non-local reference is preceded by appropriate communication.
- Balance: For every Send, there is exactly one matching Recv. Note that this condition does not apply to Read (see footnote 6).

In the presence of the DELAYED KILL type of computation, the definition of Safety is considerably weakened. Hence, it is more appropriate to term it Weak Safety. Note that Weak Safety and Sufficiency are applicable to both file access and communication calls, while Balance is applicable only to communication calls. Therefore, our framework is able to take a unified approach to placing file access and communication calls while honoring their individual characteristics.

Eliminating Extra File Accesses in Communication
It should be observed that extra file accesses are generated because an array section (see footnote 7) is used several times in the stripmined FORALL iterations: once by the processor that owns the section and, in the remaining cases, by other processors. If it is possible for the processors to perform computation on the common array section in the same iteration, the communication will involve only inter-processor data transfer and extra file accesses can be eliminated. To satisfy this condition, we add the following constraint to the correctness criteria.

Strict Safety Constraint
- Strict Safety: Everything that is read or communicated (i.e., sent and received) will be used only once.

The Safety and Strict Safety criteria require that the data read by processor $i$ at node $n$, $\mathit{Incore}^i_n$, be used immediately and not be used anywhere else in the computation. Computation in any processor $j$ at node $n'$ that requires elements of $\mathit{Incore}^i_n$ (in other words, $\mathit{Recv}^j_{n'} \subseteq \mathit{Incore}^i_n$) should therefore be placed at node $n$. Then, processor $i$ needs to send only in-core data ($\mathit{Send}^i_n \subseteq \mathit{Incore}^i_n$). Applying this condition to every processor, we observe that if node $n$ satisfies Strict Safety, $\mathit{Comm}_n$ is subsumed by $\mathit{Incore}_n$; therefore, the set Eio is empty and all extra I/O is eliminated. Here, $n$ and $n'$ are nodes of the interval flow graph denoting the initial placement of the computation (in other words, the placement of Read calls). To find $i, j$ and $n, n'$, it is necessary to perform both Forward and Backward flow analysis.

Footnote 6: We currently use synchronous I/O calls.
Footnote 7: An element can be considered a special case of a section.
Let us now define a predicate $\mathit{Incl}^i_j(n, n')$ as follows:
- $\mathit{Incl}^i_j(n, n') \stackrel{\mathrm{df}}{=} \mathit{True}$ if $\mathit{Recv}^j_{n'} \subseteq \mathit{Incore}^i_n$ or $\mathit{Recv}^i_n \subseteq \mathit{Incore}^j_{n'}$

For a processor $i$, the solution of $\mathit{Incl}^i_j(n, n')$ for any processor $j$ ($j \neq i$) gives the node pair $(n, n')$ satisfying the inclusion property. The inclusion property is then verified for every Incore and Recv set in the program. If all the Incore and Recv sets satisfy the inclusion property, the computation is said to be balanced. For a balanced computation, one can eliminate extra I/O by reordering computations. We illustrate this optimization using our running example (Figure 1). Table 1 lists the values of the dataflow variables corresponding to the stripmined iterations (Figure 1). There are two stripmined iterations; for each iteration, Incore gives the set of elements that are brought into memory by each processor (ICLA). The corresponding Active, Send, and Recv sets are also shown.

[...] $\mathit{Recv}^2_2 \subseteq \mathit{Incore}^1_6$, i.e., the data required by the ICLA of processor 2 at node 2 (first stripmined iteration) is part of the ICLA of processor 1 at node 6 (second stripmined iteration). The entries in positions [0,0] and [3,3] denote that processors 0 and 3 do not perform communication at nodes 2 and 6, respectively (in other words, in the first and second stripmined iterations). Such entries are called non-solution entries. The number of solution entries in the $i$-th row or $j$-th column denotes the number of times processor $i$ or $j$ performs communication.

Table 2. Inclusion matrix for the running example.
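The inclusion property can be checked mechanically. The Python sketch below is a hedged illustration rather than the paper's algorithm: the Incore and Recv sets are reconstructed from the textual description of the running example (they are not copied from Table 1), nodes 2 and 6 stand for the first and second stripmined iterations, empty Recv sets are treated as non-solutions, and the names `solution_entry`, `eio`, and the schedule `reordered` are hypothetical.

```python
NODES = (2, 6)   # flow-graph nodes of the first and second stripmined iterations

# INCORE[p][n] and RECV[p][n], reconstructed from the running example (assumed).
INCORE = {
    0: {2: set(range(1, 10)),  6: set(range(8, 17))},
    1: {2: set(range(17, 26)), 6: set(range(24, 33))},
    2: {2: set(range(33, 42)), 6: set(range(40, 49))},
    3: {2: set(range(49, 58)), 6: set(range(56, 65))},
}
RECV = {
    0: {2: set(), 6: {17}},
    1: {2: {16},  6: {33}},
    2: {2: {32},  6: {49}},
    3: {2: {48},  6: set()},
}


def solution_entry(i, j):
    """Node pairs (n, n2) such that the non-empty Recv of i at n lies in Incore of j at n2."""
    return [(n, n2) for n in NODES for n2 in NODES
            if RECV[i][n] and RECV[i][n] <= INCORE[j][n2]]


def eio(step):
    """Elements read from file only to be sent; step[p] is the node executed by p."""
    missing = set()
    for p, n in step.items():
        for e in RECV[p][n]:
            q = (e - 1) // 16            # owner under BLOCK distribution of 64 elements
            if e not in INCORE[q][step[q]]:
                missing.add(e)
    return missing


matrix = {(i, j): solution_entry(i, j)
          for i in range(4) for j in range(4) if i != j}

original = [{p: 2 for p in range(4)}, {p: 6 for p in range(4)}]
# One possible reordering consistent with the inclusion property (an assumption).
reordered = [{0: 2, 1: 6, 2: 2, 3: 6}, {0: 6, 1: 2, 2: 6, 3: 2}]

print(matrix[(2, 1)], matrix[(1, 2)])
print([eio(s) for s in original], [eio(s) for s in reordered])
```

With these sets, entry [2,1] holds the tuple (2,6) and entry [1,2] holds the tuple (6,2). In the original order every step needs extra reads, whereas in the reordered schedule, where neighboring processors execute different stripmined iterations in the same step, the extra I/O set is empty in both steps.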
1. In the first step, choose a random processor $i$. For our problem, let us choose processor 2. For this processor, select a solution entry from the second row, e.g., entry [2,1], which corresponds to the solution tuple (2,6). It states that $\mathit{Recv}^2_2 \subseteq \mathit{Incore}^1_6$. Therefore, the sections of the local arrays of processors 2 and 1 corresponding to nodes 2 and 6 (in the interval flow graph) should be brought into memory.
2. In the second step, using the inclusion matrix, determine whether the ICLA of processor 1 requires any off-processor data. This can be found by checking the first row of the inclusion matrix for solution entries containing node 6. The entry [1,2] corresponds to the solution tuple (6,2), which indicates that $\mathit{Recv}^1_6 \subseteq \mathit{Incore}^2_2$. Note that the array section of processor 2 corresponding to node 2 is already in memory. Therefore, the communication between processors 1 and 2 will involve only inter-processor communication.
3. The first two steps have scheduled the ICLAs of processors 1 and 2. The third step tries to schedule the ICLAs of the remaining processors so that there are no extra I/O accesses. Consider processor 0. In the 0th row, the only solution entry involves processor 1 at node 2. Since the ICLA of processor 1 at node [...]

Table 3 presents performance results for column and square tiles. In each experiment, the time required to read and write local data, LIO, and the time required for performing communication, COMM, were measured for unordered and ordered (after placing the I/O and communication calls) access patterns, and the communication gain was computed. Each table presents LIO and COMM for 5- and 9-point stencils with different processor grids and different array sizes. Since the local computation time is negligible compared to LIO, we have not reported the computation cost. Each experiment was performed for a memory ratio of 1/4 (i.e., the ratio of the size of available memory to that of the out-of-core array). Note that for the unordered cases, COMM includes the cost of inter-processor communication and extra file I/O.
From Table 3, we can observe that by reordering communication and I/O calls, the communication cost COMM is significantly reduced. For example, for a 9-point stencil problem running on 64 processors using an 8K*8K array and column tiles, COMM without ordering is 2.06 seconds and with ordering is 0.05 seconds (therefore, the communication gain is 39). For the same problem, if square tiles are used, the communication gain is 35992. This increase in the gain is due to the additional I/O cost incurred when accessing square tiles.