A global communication optimization technique based on data-flow analysis and linear algebra

Reducing communication overhead is extremely important in distributed-memory message-passing architectures. In this article, we present a technique to improve communication that considers the data access patterns of the entire program. Our approach is based on a combination of traditional data-flow analysis and a linear algebra framework, and it works on structured programs with conditional statements and nested loops but without arbitrary goto statements. The distinctive features of the solution are the accuracy in keeping communication set information, support for general alignments and distributions including block-cyclic distributions, and the ability to simulate some of the previous approaches with suitable modifications. We also show how optimizations such as message vectorization, message coalescing, and redundancy elimination are supported by our framework. Experimental results on several benchmarks show that our technique is effective in reducing the number of messages (an average of 32% reduction), the volume of the data communicated (an average of 37% reduction), and the execution time (an average of 26% reduction).


Introduction
Distributed memory multiprocessors such as the IBM SP-2 and the Intel Paragon are attractive for high performance computing in that they offer potentially high levels of flexibility, scalability and performance.
But the need for explicit message passing, resulting from the lack of a globally shared address space, renders programming these machines a difficult task. The main objective behind efforts such as High Performance Fortran (HPF) [39] and Fortran D [30] is to raise the level of programming by allowing the user to write programs with a shared address space view, augmented with directives that specify data mapping.
The compilers for such languages are responsible for partitioning the computation and inserting the necessary commands that implement the message passing required for access to non-local data.
On such machines, the time (cost) to access non-local data is usually orders of magnitude higher than accessing local data. For example, on the Intel Paragon the processor cycle time is 20 nanoseconds whereas the remote memory access time is between 10,000 and 30,000 nanoseconds depending on the distance between communicating processors [29]. Therefore, it is imperative that the frequency and volume of non-local accesses are reduced as much as possible. In particular, in message-passing programs, the startup cost for the messages can easily dominate the execution time. For example, on the Intel Paragon the message startup time is approximately 1,720 times the transfer time per word; on the IBM SP-2 this figure is around 360 [16]. These figures indicate that optimizing communication is very important. Several software efforts have been aimed at reducing the communication overhead. The main goal of these optimizations is to increase the performance of programs by combining messages in various ways to reduce the overall communication overhead. The most common optimization technique used by previous researchers is message vectorization [51,8,30,11,10]. In message vectorization, instead of naively inserting send and recv operations just before references to non-local data, communication is hoisted to outer loops. Essentially this optimization replaces many small messages with one large message, thereby reducing the number of messages. For example, consider the program fragment shown in Figure 1(a) and assume that all arrays are distributed across processors block-wise in the second dimension. Figures 1(b) and (c) show naively inserted messages and message vectorization, respectively, for a processor p before loop bounds reduction (a technique to allow processors to execute only those iterations which have assignments that write to local memory [30]) and guard insertion (a technique that guarantees correct execution of statements within loop nests).
The notation send{B,q,n} means that n elements of array B should be sent to processor q; recv{B,q,n} is defined similarly. For this discussion, we are not concerned with exactly which elements are sent and received. Notice that the version in Figure 1(c) reduces the message startup cost as well as the latency. Some researchers [30,2] also considered message coalescing, a technique that combines messages due to different references to the same array, and message aggregation, which combines messages due to references to different arrays with the same destination processor into a single message. In general, due to private physical memory spaces, generating communication code for message-passing architectures can be very difficult, because it requires the correct non-local elements to be transferred to the memories of the processors that will use them. Optimizing compilers for data-parallel languages automate this time-consuming task of deriving node programs based on the data distribution specified by the programmer. This problem can be illustrated using the program fragment given in Figure 2(a), assuming that arrays X and Y are distributed block-wise across two processors, 0 and 1. The RSDs corresponding to these two communications are also shown next to the loop statements. Notice that all communication is from processor 0 to processor 1. The problem here is that a data-flow approach based on RSDs cannot combine these communications into a single descriptor; and since the communication cannot be hoisted out of the t loop because of a data dependence [52,48], the redundant communication will occur T times.
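To make the startup-versus-volume tradeoff behind message vectorization concrete, the following Python sketch uses a simple cost model in which every message pays a fixed startup cost plus a per-word transfer cost (the startup-to-transfer ratio of 1,720 comes from the Paragon figures above; the function names are ours):

```python
def comm_cost(num_messages, words_per_message, startup=1720.0, per_word=1.0):
    """Each message pays a fixed startup cost plus a per-word transfer cost."""
    return num_messages * (startup + words_per_message * per_word)

def naive(n):
    """One send/recv pair per iteration: n messages of one word each."""
    return comm_cost(n, 1)

def vectorized(n):
    """Communication hoisted out of the loop: one message of n words."""
    return comm_cost(1, n)
```

For 100 iterations the naive version pays the startup cost 100 times, whereas the vectorized version pays it once, so the latter is cheaper by almost two orders of magnitude under this model.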
On the other hand, we represent these sets in our framework as S_i := {[d] : ∃α (d = 1 + 4α and 1 ≤ d ≤ 197)} and S_j := {[d] : ∃α (1 + d = 3α and 50 ≤ d ≤ 299)}. Then, by using the Omega library [31], we derive the code shown in Figure 2(b), which can enumerate all the elements in S_i ∪ S_j. As a result, each element will be communicated once and only once. It should be stressed that the same problem with RSDs can occur with set difference (−) operations. For instance, the RSD difference between (1:1000:3) and (1:1000:7) cannot be represented as a single RSD. Unfortunately, the inaccuracies originating from the union (and difference) operations on the RSDs accumulate as the data-flow process proceeds, making the final communication sets imprecise.
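Both points, that enumerating the union communicates each element exactly once and that such a union has no single-RSD representation, can be checked by direct enumeration. In the sketch below, plain Python sets stand in for Omega-style sets, with the strides and bounds taken from the sets above:

```python
# The two communication sets, enumerated element by element.
S_i = {d for d in range(1, 198) if (d - 1) % 4 == 0}     # d = 1 + 4*alpha, 1 <= d <= 197
S_j = {d for d in range(50, 300) if (d + 1) % 3 == 0}    # 1 + d = 3*alpha, 50 <= d <= 299
union = S_i | S_j    # a set: each element occurs once, so it is sent only once

# The union of two RSDs need not be an RSD: (1:1000:3) union (1:1000:7)
# has non-constant gaps, so no single (lower:upper:stride) triplet describes it.
u = sorted(set(range(1, 1001, 3)) | set(range(1, 1001, 7)))
gaps = {b - a for a, b in zip(u, u[1:])}
```

The `gaps` set contains more than one distinct value, which is exactly why a single regular section descriptor cannot capture the union.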
In this paper, we make the following contributions: (2) We present two different approaches, primarily for hoisting communication and for minimizing the number of messages, respectively, that are aimed at reducing communication overhead, and we show the tradeoff between the two. Both approaches are accurate; using the linear algebra framework proposed by Ancourt et al. [7], they are able to handle the optimization problem at the granularity of individual array elements.
(3) We show that the global communication sets resulting from our analysis can be enumerated by our use of the Omega library [42,31] from the University of Maryland. Although the Omega library works on Presburger formulas, and the best known asymptotic upper bound of any algorithm for verifying Presburger formulas is O(2^(2^(2^n))), the library is much more efficient for the practical cases that arise in compilation.
(4) We compare our approach both qualitatively and quantitatively to previous work, which focused on a single loop nest at a time.
The remainder of this paper is organized as follows. Section 2 briefly describes some important concepts such as control flow graphs, interval analysis, dependence analysis and the linear algebra framework used throughout the paper. We present our approach in detail in Section 3 and show how it uses both the linear algebra framework and data-flow analysis. Section 4 discusses the effect of hoisting communication vis-a-vis reducing the number of messages. In Section 5, we present details of communication generation. Section 6 reports experimental results on a 16-node IBM SP-2 distributed-memory message-passing machine and shows that our technique is effective in reducing the number of messages, the volume of communication, and the execution time. Section 7 discusses related work and Section 8 concludes the paper.

Preliminaries
The main idea of this work is to show that a global communication optimization problem can be put into a linear algebra framework and that doing so might be useful in practice. Our approach gives the compiler the ability to represent communication sets globally as equalities and inequalities, as well as to use polyhedron scanning techniques to perform optimizations such as redundant communication elimination and global message coalescing, which were not possible under loop-nest based communication optimization schemes. The following subsections give information about the basic concepts used throughout the paper.

Control Flow Graph (CFG)
We concentrate on structured programs with conditional statements and nested loops but without arbitrary goto statements. Our technique, however, can be extended to deal with jumps out of loops as well. We assume that array subscript functions, loop bounds and conditional expressions are affine functions of enclosing loop indices and symbolic constants. We also assume that the number of processors is known beforehand.
A basic block is a sequence of consecutive statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching except possibly at the end [4]. A control flow graph (CFG) is a directed graph whose nodes are basic blocks; it represents the flow-of-control information of the program.
For our purposes, the CFG can be thought of as a directed graph G = (V, E) where each v ∈ V represents either a basic block or a (reduced) interval that represents a loop, and each e ∈ E represents an edge between blocks. In this paper, depending on the context, we use the term node interchangeably for a statement, a block or an interval. Two unique nodes s and t denote the start and terminal nodes, respectively, of a CFG. One might think of these nodes as dummy statements. It is assumed that every node n ∈ V lies on a path from s to t. We define the sets of all successors and predecessors of a node n as succ(n) = {m | (n,m) ∈ E} and pred(n) = {m | (m,n) ∈ E}, respectively. We say node i dominates node j in the CFG if every path from s to j goes through i; we write this relation as j ∈ dom(i).
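The dominance relation can be computed with the standard iterative algorithm. The sketch below uses the conventional formulation, in which the set associated with a node n collects the nodes that dominate n, over a CFG given as a successor map (the node names are ours):

```python
def dominators(succ, start):
    """For each node n, compute the set of nodes that appear on every path
    from start to n (i.e., the dominators of n), by iterating to a fixpoint."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for m in ss:
            pred[m].add(n)
    dom = {n: set(nodes) for n in nodes}   # start with the most optimistic guess
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            ps = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*ps) if ps else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom
```

On a diamond-shaped CFG s → {a, b} → t, the algorithm reports that t is dominated only by s and itself, since neither a nor b lies on every path to t.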
The CFGs we consider have the following additional properties: (a) empty else branches are added to if/endif constructs; (b) all the non-local references in the loop bounds and if-conditions are taken just above the respective constructs; and (c) as in [22], any edge that goes directly from a block with more than one successor to a block with more than one predecessor is split. This last transformation, shown in Figure 3, eliminates critical edges, that is, edges going from a node with more than one successor to a node with more than one predecessor.

Interval Analysis
We assume that prior to our analysis, the compiler has performed all loop level transformations [9,48,52] to enhance parallelism (e.g., loop permutation, loop distribution) and optimize communication. Our technique is based on interval analysis performed on the CFG. As explained in [5], the interval analysis consists of a contraction phase and an expansion phase. For programs written in a structured language, an interval corresponds to a loop, and there is a well defined algorithm to partition a CFG into disjoint intervals [4]. We use a version of the interval detection algorithm that identifies Tarjan's intervals [45].
The contraction phase collects information about what is generated and what is killed inside each interval. Then the interval is reduced to a single node and annotated with the information collected. This is a recursive procedure that stops when the reduced CFG contains no more cycles. In other words, the main purpose of this phase is to percolate the influence of each node outward, into an increasingly global context.
After the contraction phase, the expansion phase is run. In each step of this phase, a node (reduced interval) is expanded, and the information regarding the nodes in that interval is computed. In our case, at each step of the expansion phase, the communication required for the intervals (loops) is determined. Figure 4 shows the two phases of the interval analysis for an example CFG based on Tarjan intervals: first the contraction phase is run, and then the expansion phase is executed. In this figure, as shown by the dashed arrows, the contraction phase proceeds from left to right, whereas the expansion phase proceeds in the reverse direction. As an example, the block marked with 3,4 represents an interval (a loop) containing blocks 3 and 4. It is also possible to adapt our approach to work with an interval-flow graph, which is basically a CFG with an interval structure imposed on it [27,28,37].
It should be noted that since we assume that our input programs are structured, irreducible (intermediate) CFGs [4] cannot occur during our analysis.
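The contraction/expansion structure can be sketched over a loop-nesting tree, a stand-in for the interval hierarchy of the CFG. In the fragment below (class, function, and set names are ours, and the per-interval information is an abstract set), contraction summarizes innermost intervals first, and expansion pushes context back down outermost first:

```python
class Interval:
    """A loop interval with nested child intervals (illustrative stand-in)."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.summary = None

def contract(iv, local_info):
    """Contraction: summarize each interval from its children, innermost first
    (post-order), and annotate the interval with the collected information."""
    s = set(local_info.get(iv.name, ()))
    for c in iv.children:
        s |= contract(c, local_info)
    iv.summary = s
    return s

def expand(iv, inherited, out):
    """Expansion: push the enclosing context back down, outermost first
    (pre-order), combining it with each interval's own summary."""
    out[iv.name] = inherited | iv.summary
    for c in iv.children:
        expand(c, out[iv.name], out)
```

A usage example: for an `outer` interval containing an `inner` one, contraction percolates the inner information outward, and expansion then makes the outer context visible inside.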

Data Dependence
Let S_x and S_y be two statements (not necessarily distinct) enclosed by nested loops. Data dependences determine which iterations of the loops can be executed in parallel. A flow dependence exists from statement S_x to statement S_y if S_x writes a value that is subsequently (in sequential execution) read by S_y. Such a dependence implies that instances of S_x and S_y must execute in an order that preserves this relation; that is, some of the nest levels must be executed sequentially. An anti-dependence exists between S_x and S_y if S_x reads a value that is subsequently modified by S_y. An output dependence exists between S_x and S_y if S_x writes a value that is subsequently written by S_y as well. Data dependences are loop-independent if the accesses to the same memory location occur in the same loop iteration; if the accesses occur in different loop iterations, they are said to be loop-carried.
Note that in that case not all loop nest levels need to contribute to the dependence. The outermost loop level that contributes to the dependence is said to carry that dependence. An in-depth discussion of data dependence analysis techniques is beyond the scope of this paper and can be found elsewhere [48,52].

Linear Algebra Framework
HPF-like languages provide compiler directives that allow the user to perform data allocation onto local memories. The compiler then uses these distribution directives to partition computation across processors.
It has been shown in [7] that linear algebra provides a powerful framework to generate code for distributed-memory message-passing machines, taking compiler directives into account.
Most of the compilers for distributed-memory message-passing machines use the owner-computes rule, which simply assigns each computation to the processor that owns the data being computed [30,51,8]. In this paper, we also assume the owner-computes rule; our framework, however, can be modified to handle the cases where this rule is relaxed. In such cases, the LHS references can also introduce communication.
For clarity of the presentation, we do not consider relaxing the owner-computes rule in this paper.
Our approach uses the affine framework introduced by Ancourt et al. [7]. In this framework, data arrays, templates and processors are all declared as Cartesian grids as in HPF [39]. The data arrays are first aligned to templates and then these templates are distributed across the memories of the processors.
Consider a program fragment compiled under a scheme based on HPF-like directives and the owner-computes rule. A cyclic(C) attribute indicates that the template (or array) dimension in question will be partitioned into blocks of size C and that these blocks are assigned to processors in a round-robin fashion. The block and cyclic(1) attributes are just two common special cases of the general cyclic(C) distribution.

Let R_L = X(α_L i + β_L) and R_R = X(α_R i + β_R). In the rest of the paper, for the sake of simplicity, we will sometimes refer to the subscript expressions as data (array) elements when the intention is clear. Assuming p and q denote two processors, we define the following sets.
Own(X, q) = {d | d ∈ X and d is owned by q}
Compute(X, R_L, q) = {i | α_L i + β_L ∈ Own(X, q) and i_l ≤ i ≤ i_u}
View(X, R_R, q) = {d | ∃ī s.t. ī ∈ Compute(X, R_L, q) and d = α_R ī + β_R and i_l ≤ ī ≤ i_u}
CommSet(X, R_R, p, q) = Own(X, q) ∩ View(X, R_R, p).
Intuitively, the set Own(X,q) refers to the elements mapped onto processor q through compiler directives.
Similar Own sets are defined for the other arrays as well. The set of iterations to be executed by q due to a LHS reference R_L is given by Compute(X,R_L,q). Of course, during the execution of this local iteration set, some elements (local or non-local) denoted by the RHS reference R_R will be required; the set View(X,R_R,q) defines these elements. Finally, CommSet(X,R_R,p,q) defines the elements that should be communicated from processor q to processor p due to reference R_R.
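These four sets can be enumerated directly for a small hypothetical one-dimensional case: arrays of 12 elements block-distributed across two processors, a loop i = 1..10, a LHS reference Y(i) and a RHS reference X(i+1). All sizes and references in this sketch are our assumptions, not taken from the paper's figures:

```python
N, P = 12, 2                     # 12-element arrays, two processors (assumed)
B = N // P                       # block size for a block distribution

def own(q):
    """Block distribution: processor q owns elements q*B+1 .. (q+1)*B."""
    return set(range(q * B + 1, (q + 1) * B + 1))

iters = range(1, 11)             # loop i = 1..10 (assumed bounds)

def compute(q):
    """Owner-computes: q executes iteration i iff it owns the LHS element Y(i)."""
    return {i for i in iters if i in own(q)}

def view(p):
    """Elements of X touched by the RHS reference X(i+1) over p's iterations."""
    return {i + 1 for i in compute(p)}

def comm_set(p, q):
    """Elements processor q must send to processor p."""
    return own(q) & view(p)
```

Here processor 0 executes i = 1..6 and reads X(2)..X(7), so the only non-local element it needs is X(7), owned by processor 1; no communication flows in the other direction.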
It should be noted that in general there may be more than one RHS reference, and the computation may involve multi-dimensional arrays and a multi-level nest, in which case d and i denote data and iteration vectors, respectively. Also, in the most general case, α, α_L and α_R are matrices, and β, β_L and β_R are vectors.
The definition of the Own set above is rather informal. For a more precise definition, we take into account the block-cyclic distribution and define the Own set as
Own(X, q) = {d | ∃t, c, l such that t = α d + β and t = C P c + C q + l and a_l ≤ d ≤ a_u and p_l ≤ q ≤ p_u and t_l ≤ t ≤ t_u and 0 ≤ l ≤ C − 1},
where P = p_u − p_l + 1. In this formulation, t = α d + β represents the alignment information and t = C P c + C q + l denotes the distribution information. In other words, each array element d is mapped onto a point in a local two-dimensional array. This point can be represented by a pair (c,l) and gives the local address of the data item in a processor. Simple block and cyclic(1) distributions can easily be handled within this framework by setting c = 0 and l = 0, respectively. As an example, Figure 5(a) shows the global and local addresses of a one-dimensional array distributed in a block-cyclic manner across three processors with C = 4. Figures 5(b) and (c), on the other hand, illustrate two-dimensional views of the global and local addresses, respectively. For each processor, the horizontal dimension corresponds to the c coordinate whereas the vertical dimension denotes l. For example, the 55th element of the (global) array is mapped onto processor 1 with c = 4 and l = 3 as its local coordinates. The relation t = α d + β can be generalized by adding a replication matrix V, which eliminates the replicated dimension from the equations: V t = α d + β. In the case where no replication is specified, V is the identity matrix. Also, in order to take the collapsed dimensions (the dimensions that are not distributed across processors) into account, another projection matrix Y can be used: Y t = C P c + C q + l.
All the elements on a collapsed dimension are stored on the same processor. Notice that these projection matrices are only useful if we adhere to a matrix form for describing the relations. We do not need them if the relations are described on a per dimension basis. In the rest of the paper we assume that identity alignment is used and arrays are directly distributed across processors. For an in-depth discussion of the linear algebra framework for compiling distributed-memory programs, we refer the reader to Ancourt et al. [7].
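For simple one-dimensional cases with identity alignment, the (q, c, l) coordinates can be computed directly from a 0-based global index. The helper below (a sketch; the function name is ours) reproduces the example above, the 55th element with C = 4 and three processors:

```python
def block_cyclic_coords(d, C, P):
    """Map a 0-based global index d to (q, c, l) under a cyclic(C) distribution
    over P processors, so that d = C*P*c + C*q + l with 0 <= l <= C-1."""
    q = (d // C) % P      # owning processor
    c = d // (C * P)      # cycle number (horizontal local coordinate)
    l = d % C             # offset within a block (vertical local coordinate)
    return q, c, l
```

Element 55 lands on processor 1 with local coordinates c = 4 and l = 3, and substituting back gives 4·3·4 + 4·1 + 3 = 55, matching the distribution equation.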

Parafrase-2 and Omega Library
Parafrase-2 [41] is used as the front end in our compilation framework. It is a parallelizing compiler implemented as a source-to-source code restructurer that consists of several passes for analysis, transformation, parallelism detection and code generation. In order to obtain the loops that enumerate the elements in the ownership and communication sets, we use the Omega library [31]. This library is essentially a set of C++ classes for manipulating integer tuple relations and sets defined using Presburger formulas. We implemented a framework that obtains data access information from Parafrase-2 internal structures and feeds it into the Omega library; when all the required sets have been obtained, the framework converts these sets back to internal Parafrase-2 structures.

Data-flow Analysis using a Linear Algebra Framework
In this section, we define our data-flow framework in detail. First, we introduce some important sets and operations on them.

Definition of Sets and Operations Communication Descriptors and Communication Sets
A communication descriptor can be defined as a pair ⟨R, S⟩, where R is an array identifier (name) and S is the communication set associated with R. The exact definition of a communication set depends on the context in which it is used. Throughout our analysis, a communication set is defined as {d | d is owned by q and is required by (or should be transferred to, or has already been transferred to) p}, except for the KILL set, which defines the set of elements written (killed) by q. In these set definitions, d refers to a multi-dimensional array element.

Operations on Communication Sets
Since we define a communication set as a list of equalities and inequalities (this is how the Omega library represents a set), it can be represented as S = {d | P(d)} where P(·) is a predicate. Let S_1 = {d | P(d)} and S_2 = {d | Q(d)} be two communication sets. We define the operations +_c, −_c, and ∩_c on communication sets as follows: S_1 +_c S_2 = {d | P(d) or Q(d)}, S_1 −_c S_2 = {d | P(d) and not Q(d)}, and S_1 ∩_c S_2 = {d | P(d) and Q(d)}. Note that the operations 'or', 'and' and 'not' can be performed by using the corresponding Omega operations on sets which contain equalities and inequalities. When there is no ambiguity, we drop the subscripts and simply write +, −, and ∩. It should be noted that although these operations are similar to those given by Gong et al. [18], there is an important difference.
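Treating a communication set as a predicate makes the three operations one-liners. The sketch below mirrors the definitions above, with Python closures standing in for the equality/inequality lists that the Omega library would manipulate:

```python
def plus_c(P, Q):
    """Union: d is in the result iff P(d) or Q(d)."""
    return lambda d: P(d) or Q(d)

def minus_c(P, Q):
    """Difference: d is in the result iff P(d) and not Q(d)."""
    return lambda d: P(d) and not Q(d)

def isect_c(P, Q):
    """Intersection: d is in the result iff P(d) and Q(d)."""
    return lambda d: P(d) and Q(d)
```

For example, with P(d) = (1 ≤ d ≤ 10) and Q(d) = (d is even), the difference enumerates the odd elements of 1..10 and the intersection the even ones.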

Operations on Sets of Communication Descriptors
Since we keep the communication sets accurately in terms of equalities and inequalities, we can optimize communication at the granularity of individual array elements. It should be noted that our analysis works with sets of equalities and inequalities; as compared with the previous approaches based on RSDs, our technique may therefore be slower. In order to alleviate this problem, we do not operate on the contents of the sets in every data-flow equation to be evaluated; instead, we represent the sets with symbolic names and postpone the real computation on them until the end of the analysis, where the communication code should be generated. For example, suppose that a data-flow equation requires combining two sets S_x = {[x] : Q_1(x)} and S_y = {[y] : Q_2(y)} where Q_1 and Q_2 are predicates consisting of equalities and inequalities. Instead of forming the set {[z] : Q_1(z) or Q_2(z)} immediately and using it in subsequent computations, our approach represents the resulting set abstractly as S_x + S_y. When the whole process is finished, the resulting sets are re-written in terms of equalities and inequalities, and the simplify utility of the Omega library is used to simplify them. Our experience shows that this approach requires a manageably small amount of symbolic expression manipulation support and is fast in practice (see Section 6 for a cost analysis of the compilation time). Next we present our data-flow framework.
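The postponement idea can be sketched as a small expression tree whose leaves are symbolic set names and whose internal nodes are deferred union and difference operations; materialization happens exactly once, at the end. Class and helper names here are ours, and finite enumeration over a universe stands in for Omega's simplify:

```python
class Sym:
    """A deferred communication-set expression; leaves are symbolic set names."""
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):    # stands for the +_c (union) operation
        return Sym('union', self, other)
    def __sub__(self, other):    # stands for the -_c (difference) operation
        return Sym('diff', self, other)

def leaf(name):
    return Sym('leaf', name)

def materialize(node, env, universe):
    """Evaluate the deferred expression once, at communication-generation time."""
    if node.op == 'leaf':
        pred = env[node.args[0]]
        return {d for d in universe if pred(d)}
    a = materialize(node.args[0], env, universe)
    b = materialize(node.args[1], env, universe)
    return a | b if node.op == 'union' else a - b
```

During the data-flow passes only cheap tree nodes are built; the potentially expensive set arithmetic runs once per final set.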

Local (Intra-Interval) Analysis
In order to make the data-flow analysis task easier, the CFG of the program is traversed prior to the local analysis phase, and for each LHS reference a pointer is stored in the header of all enclosing loop nests. This allows the compiler to reach a LHS reference inside a loop quickly during the data-flow analysis. The local analysis part of our framework computes KILL, GEN and POST GEN sets for each interval. Then the interval is reduced to a single node and annotated with this information.
Let R_L(ī) and R_R(ī) be the data elements obtained from references R_L and R_R, respectively, with a specific iteration vector ī. The computation of the KILL set proceeds in the forward direction; that is, the nodes within the interval are traversed in topological sort order. Let KILL(i,q) be the set of elements written (killed) by processor q in node i, and Modified(i,q) be the set of elements that may be killed along any path from the beginning of the interval to node i (including node i). Modified(i,q) is thus obtained by combining KILL(i,q) with the Modified sets of the predecessors of i, and the Modified set of the last node in the interval is used to reduce the interval into a single node. Notice that i is used to denote a node in the CFG whereas ī is used for an iteration vector. In order to see how the computation of the KILL set proceeds, consider the example in Figure 6. Although for the sake of presentation we show the analysis here in terms of communication sets, the data-flow analysis is actually performed on sets of communication descriptors, since in general there may be accesses to several arrays. That is, the KILL set for a program that refers to arrays N_1, N_2, ... is as follows: KILL(i,q) = {⟨N_1, KILL_N1(i,q)⟩, ⟨N_2, KILL_N2(i,q)⟩, ...}. Since we concentrate on the computation of the KILL set for a single array, we use KILL(i,q). Similar simplifications will be used in the presentation of the computation of the GEN(i,p,q) and POST GEN(i,p,q) sets as well.
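A minimal sketch of this forward phase, assuming that the merge at join points is a union (a may-modify approximation) and that nodes are visited in topological order:

```python
def propagate_kill(topo_order, pred, kill):
    """Modified(i) = KILL(i) combined with the Modified sets of i's predecessors,
    computed over the nodes of an interval in topological sort order."""
    modified = {}
    for i in topo_order:
        m = set(kill.get(i, ()))          # elements killed in node i itself
        for j in pred.get(i, ()):
            m |= modified[j]              # elements that may be killed on the way to i
        modified[i] = m
    return modified
```

The Modified set of the last node then summarizes what the whole interval may kill, which is what the reduced node is annotated with.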
GEN(i,p,q) is the set of elements required by processor p from processor q at node i with no preceding write (assignment) to them. The computation of the GEN set proceeds in the backward direction, i.e., the nodes within each interval are traversed in reverse topological sort order. The elements that can be communicated at the beginning of a node are the elements required by any RHS reference within the node, except the ones that are written by the owner before being referenced. Notice that this process involves considering all the LHS references within an interval for a given RHS reference; this leads to an exponential cost. However, there are two factors that make the analysis affordable. First, the scope of the analysis is a single interval (loop nest); in practice, the number of distinct references in a loop nest is small. Second, since, as mentioned earlier, prior to the analysis we keep pointers to all LHS references within a loop nest, we do not have to traverse the parse tree once more to search for the LHS references. Assuming ī = (i_1, ..., i_n) and ī′ = (i′_1, ..., i′_n), let ī′ ⪯ ī mean that ī′ is lexicographically less than or equal to ī; and let ī′ ⪯_k ī mean that i′_j = i_j for all j < k and (i′_k, ..., i′_n) ⪯ (i_k, ..., i_n). Since a node can contain multiple RHS references, we first define gen(i, R_R, p, q) as the set of elements to be sent by processor q to processor p at node i due to reference R_R. In that case we can compute GEN(i,p,q) = ∪_{R_R} gen(i, R_R, p, q). For the sake of explanation, we assume one RHS reference per node, and use only GEN(i,p,q) in the following; the extension to multiple RHS references per node is straightforward. Let Comm(i,p,q) be the set of elements that may be communicated at the beginning of interval i to satisfy communication requirements from the beginning of i to the last node in the interval that contains i.
Then, for an array X, we have GEN(i,p,q) = {d | ∃ī, Y s.t. ī_l ⪯ ī ⪯ ī_u and d ∈ Own(X,q) and d = R_R(ī) and R_L(ī) ∈ Own(Y,p) and there is no j̄ such that j̄ ⪯_level(i) ī and d = R_L′(j̄)}. In addition, a similar equation is used to reduce an interval into a single node. In the definition of GEN, R_R denotes the RHS reference, and R_L denotes the LHS reference of the same statement. R_L′, on the other hand, refers to any LHS reference within the same interval. Notice that while R_L′ is a reference to the same array as R_R, R_L can be a reference to any array (e.g., array Y in the formulation above). level(i) gives the nesting level of the interval (loop), with the value 1 corresponding to the outermost loop in the nest. If the dependence is loop-independent, the textual positions of the references in the nest may also need to be taken into account when computing the GEN set; in that case the formulation of the GEN set should contain terms showing the precedence relations between references. For the sake of simplicity, we assume that all the dependences we deal with are loop-carried.
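The element-level exclusion in the GEN definition can be illustrated for a single one-dimensional reference. In the hypothetical loop sketched below, iteration i reads element i−1, which iteration i−1 has already written, so only the element read before any write survives into the gen set (all names, bounds, and references here are ours):

```python
def gen(iters, own_q, compute_p, rhs, lhs_writes):
    """Elements q must send to p: owned by q, read by p's iterations via rhs,
    and not written earlier in the interval by any LHS reference."""
    out = set()
    for idx, i in enumerate(iters):
        if i not in compute_p:
            continue
        d = rhs(i)                      # element read by iteration i
        if d not in own_q:
            continue
        # exclude d if some lexicographically earlier iteration already wrote it
        written_before = any(lhs(j) == d
                             for j in iters[:idx] for lhs in lhs_writes)
        if not written_before:
            out.add(d)
    return out
```

With iterations 1..5, RHS reading element i−1 and LHS writing element i, only element 0 is read before being written locally, so it is the only element that needs to be fetched.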
After the interval is reduced, the GEN set for it is recorded, and an operator F is applied to the last part of this GEN set to propagate it to the outer interval. As an example, consider Figure 6 once more, this time concentrating on the computation of the GEN sets due to array Y. Notice that array Y is written only in statement (line) 7. To keep the presentation simpler, we do not show the remaining GEN sets in this interval.
We should emphasize that computing the GEN sets gives us all the communication that can be vectorized or coalesced above a loop nest; i.e., our analysis easily handles message vectorization and message coalescing [30]. Finally, POST GEN(i,p,q) is the set of elements required by processor p from processor q at node i with no subsequent write to them. The computation of POST GEN(i,p,q) proceeds in the forward direction. Its computation is similar to those of the KILL(i,q) and GEN(i,p,q) sets, so we do not discuss it in detail.

Data-flow Equations
In our framework, any communication incurred is placed at the beginning of the nodes. Here, we concentrate on the computation of a communication set called RECV. The actual send and recv sets used by the code generator are produced in a later pass of the compiler from the RECV sets discussed here using two projection functions as explained in Section 5. Our data-flow analysis framework consists of a backward and a forward pass. In the backward pass, the compiler determines sets of data elements that can safely be communicated at specific points. The forward pass eliminates redundant communication and determines the final set of elements (if any) that should be communicated at the beginning of each node i. The data-flow equations that we present here are aggressive in the sense that a communication incurred by a non-local reference is hoisted to the highest point possible in the CFG. Later in Section 4 we discuss how to refine this approach to control communication hoisting. The input for the equations consists of the GEN(i,p,q), KILL(i,q) and POST GEN(i,p,q) sets for each i as computed during the local analysis.
The data-flow equations for the backward analysis are given by Equations (1) and (2) in Figure 7. The intuition behind them is as follows: (1) a node should not fetch data needed by a successor unless it dominates that successor; and (2) a successor should ignore what a predecessor has received so far unless that predecessor dominates it.
RECV IN(i,p,q) and RECV OUT(i,p,q) denote the set of communication descriptors containing the elements that have been communicated so far (at the beginning and end of the node i, respectively) from q to p.
On the other hand, RECV(i,p,q) denotes the set of communication descriptors containing the elements that should be communicated from q to p at the beginning of node i and is finally used by the communication generation portion of the compiler to generate the actual send and recv commands as explained in Section 5.
Equation (3) simply says that the communication set arriving at a join node can be found by intersecting the sets along all the joining paths. Equation (4) is used to compute the RECV set, which corresponds to the elements that can be communicated at the beginning of the node except the ones that have already been communicated (RECV IN). Finally, the elements that have been communicated by the end of node i (that is, the RECV OUT set) are simply the union of the elements communicated up to the beginning of i and the elements communicated at the beginning of i, as expressed by Equation (5).
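The forward pass can be sketched as follows. This is a simplified reading of Equations (3)-(5): it assumes a node order in which the sets can be evaluated directly and omits the dominance conditions of the backward pass (function and variable names are ours):

```python
def forward_pass(topo_order, pred, desired):
    """RECV_IN(i): intersection of RECV_OUT over i's predecessors (cf. Equation (3));
    RECV(i): desired elements minus those already received (cf. Equation (4));
    RECV_OUT(i): RECV_IN(i) union RECV(i) (cf. Equation (5))."""
    recv_in, recv, recv_out = {}, {}, {}
    for i in topo_order:
        ins = [recv_out[j] for j in pred.get(i, ())]
        recv_in[i] = set.intersection(*ins) if ins else set()
        recv[i] = set(desired.get(i, ())) - recv_in[i]
        recv_out[i] = recv_in[i] | recv[i]
    return recv
```

On a diamond CFG where node s already communicates elements {1, 2}, a later node needing {1, 2, 3} fetches only element 3: the elements guaranteed to have arrived along every incoming path are never re-communicated.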

Global Data-flow Analysis
Our approach starts by computing the GEN, KILL and POST GEN sets for each node. Then the contraction phase of the analysis reduces the intervals from innermost to outermost and annotates them with GEN, KILL and POST GEN sets. When a reduced CFG with no cycles is reached, the expansion phase starts and the RECV sets for each interval are computed, this time from outermost to innermost. There is one important point to note: before starting to process the next inner graph, the RECV IN set of the first node in this graph is set to the RECV set of the interval that contains it. More formally, in the expansion phase we set RECV IN(i,p,q) at the kth pass equal to RECV(i,p,q) from the (k-1)st pass. This assignment then triggers the next pass in the expansion phase. Before the expansion phase starts, RECV IN(i,p,q) for the first pass is set to the empty set. Figure 8 shows the overall algorithm, COMM-OPT, followed by the compiler to generate the send and recv sets. Notice that, due to Equations (1) and (2) in Figure 7, a datum can only be communicated when it is safe to do so (i.e., the semantics of the program is preserved). In the forward analysis, the RECV sets contain only the elements that need to be communicated; therefore no stale data is used.
INPUT: A connected CFG. OUTPUT: A processed CFG with optimized communication calls.
Step (a) Pre-processing phase: (a.1) The CFG is traversed and in each loop a pointer for each LHS it encloses is stored; (a.2) The CFG is traversed to add empty else branches to "if" constructs and to eliminate the critical edges; (a.3) The "dominance" relation for each node in the CFG is computed.
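As an illustration of Step (a.2), a critical edge (an edge from a node with several successors to a node with several predecessors) can be eliminated by inserting an empty node on it, giving the placement algorithm a spot that belongs to the edge alone. The sketch below assumes a toy adjacency-list CFG; the `split` node naming is made up.

```python
def split_critical_edges(succs):
    """Return a new successor map with every critical edge split.

    succs: dict mapping node -> list of successor nodes.
    """
    # Build the predecessor map once.
    preds = {n: [] for n in succs}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)

    out = {n: list(ss) for n, ss in succs.items()}
    fresh = 0
    for n, ss in list(succs.items()):
        for s in ss:
            # Critical edge: n has several successors, s several predecessors.
            if len(ss) > 1 and len(preds[s]) > 1:
                new = f"split{fresh}"
                fresh += 1
                out[new] = [s]                 # empty node forwards to s
                out[n] = [new if x == s else x for x in out[n]]
    return out
```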
Step (b) Initialization phase: For each node in the initial CFG, the KILL, GEN and POST GEN sets are computed in terms of symbolic set names.
Step (c) Contraction phase: Until a CFG with no cycles is reached, each CFG is recursively handled by reducing its intervals and annotating each interval with its KILL, GEN and POST GEN sets.
Step (d) Expansion phase: For each intermediate CFG, the following is repeated: (d.1) using data-flow Equations (1) and (2) in Figure 7, the SAFE IN sets are computed in the backward direction; (d.2) using data-flow Equations (3), (4) and (5) in Figure 7, the RECV sets are computed in the forward direction; (d.3) the CFG is expanded, and Equation (6) is used to trigger the data-flow activity in the new CFG.
Step (e) Substitution phase: The symbolic set names in the resultant RECV sets are replaced with actual sets consisting of equalities and inequalities.
Step (f) Set generation phase: The Omega library is called to generate the send and recv sets used by the code generator from the RECV sets.

Example
We use the synthetic benchmark program shown in Figure 9(a) on page 29 to illustrate our framework. We concentrate on the communication placement at the higher-level CFG, which is acyclic. Figure 9(b) shows the message-vectorized program with communication calls before loop bounds reduction and guard insertion. The notation send{B,q} means that some elements of array B should be sent to q; recv{B,q} is defined similarly. To keep the code readable, we omitted from the figure the number of elements communicated. In this example communication arises only due to references to array B. A loop-based communication analysis places eight send and eight recv calls (in fact, these are themselves loop nests) for the eight RHS references marked in bold in Figure 9(b). The communication points for these references are just above the corresponding loop nests. For example, the communication required by reference B(i-1,j-1) in line 33 of Figure 9(b) would be performed in line 29. Notice that in this example array B is written only once (in line 38). Table 1 gives the data-flow sets for this example.
Without loss of generality, assume that after local analysis the GEN and KILL sets are obtained as shown in the second and third columns of Table 1. For this example, the data-flow analysis framework achieves the following: the communication sets due to references B(i-1,j) and B(i-1,j+1) in line 45 of Figure 9(b) are combined; that is, our approach handles message coalescing easily. Similarly, the communication sets due to the references in lines 33, 11 and 4 can be combined and performed above line 2 in Figure 9(b). The resulting optimized program is shown in Figure 9(c). It should be emphasized that the final communication sets are precise, i.e., there is no overestimation. Moreover, these communication sets can be enumerated using the Omega library [31]. Notice that since all communication sets are expressed in terms of abstract processors p and q, message aggregation can, if desired, also be performed easily.
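Coalescing the sets for B(i-1,j) and B(i-1,j+1) amounts to taking the union of the referenced element sets before any message is generated. The following toy computation, with made-up bounds and no distribution (purely for illustration), shows how the union removes the overlap that two separate messages would transfer twice.

```python
N = 8  # made-up array extent for illustration

def elems(shift_j):
    """Elements of B touched by the reference B(i-1, j+shift_j)
    for i, j ranging over [2, N]."""
    return {(i - 1, j + shift_j) for i in range(2, N + 1)
                                 for j in range(2, N + 1)}

ref1 = elems(0)          # elements read through B(i-1, j)
ref2 = elems(1)          # elements read through B(i-1, j+1)
coalesced = ref1 | ref2  # one message carries the union, no duplicates
```

Here each reference touches 49 elements, but the coalesced set has only 56: the 42-element overlap is communicated once instead of twice.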

Extension for Inter-procedural Analysis
It is relatively straightforward to extend our analysis to work inter-procedurally. In a simple inter-procedural setting, our approach can be used as follows. We first build a call graph [4] in which each node corresponds to a procedure and there is a directed edge from node P1 to node P2 if and only if P1 calls P2. We assume that there are no recursive procedure calls. We then traverse the call graph in two steps corresponding to the backward and forward analyses. In the backward analysis, we traverse the graph so that a node is visited only after all of the nodes it calls have been visited. When a node Pk is visited, the compiler runs our algorithm for the backward analysis. After the algorithm terminates, we summarize the node's communication using three sets: GEN, KILL, and POST GEN. Notice that these three sets completely define the communication behavior of Pk. Subsequently, Pk is collapsed into a single node annotated with these sets (of course, all formal parameters are replaced with actual parameters). When the whole program is reduced to a single node, the forward analysis starts. This time we traverse the call graph so that a node is visited only after all the nodes that call it have been visited. During the visit of a procedure, we compute the RECV sets for each node within it.
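The two traversal orders are simply the two topological orders of the acyclic call graph: callees-first for the backward analysis, callers-first for the forward analysis. A minimal sketch, using a made-up call graph:

```python
from graphlib import TopologicalSorter

# Made-up call graph: caller -> list of callees.
calls = {"main": ["solve", "io"], "solve": ["kernel"], "io": [], "kernel": []}

# TopologicalSorter treats the dict as "node -> its predecessors", so feeding
# `calls` directly makes callees come first: the backward-analysis order.
backward_order = list(TopologicalSorter(calls).static_order())

# Reversing it gives callers-first: the forward-analysis order.
forward_order = list(reversed(backward_order))
```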
It should be noted that there are several inter-procedural communication optimization algorithms (e.g., [25], [26], [15]) with different degrees of sophistication, and a detailed treatment of communication optimization across procedure boundaries is beyond the scope of this paper. However, we believe that for most regular programs the simple scheme outlined above is sufficient. Figure 9 shows the message-vectorized (b) and globally optimized (c) versions of the example program. The message-vectorized version is obtained using the popular dependence-analysis-based vectorization approach: after determining the outermost loop at which vectorization can be applied, the element-wise messages are combined and lifted out of the enclosing loops. The globally optimized version is generated using the approach discussed in this paper.

Hoisting Communication vs. Minimizing the Number of Messages
The approach explained so far is focused on hoisting communication as far as possible and, in general, results in a reduction in communication volume as well as in the number of messages. However, as also pointed out by others, hoisting the communication too eagerly can, under some circumstances, lead to excessive buffer requirements [35] and an increase in the number of communication calls inserted [13]. In particular, failing to take resource constraints into account may affect the correctness of the communication placement: for example, if the buffer requirements exceed the maximum available buffer, the program may stall [37]. One way to prevent these problems is to avoid hoisting communication aggressively and to avoid breaking messages into smaller ones. Since optimal placement of communication is NP-hard [17], we present a simple heuristic that stops accumulating communication sets as soon as it encounters a node that satisfies a predicate P(i). The content of this predicate depends on the specific implementation; a few alternatives are presented in Table 2. For example, in [35] the third alternative has been used. An implementation can also employ a combination of these alternatives. As an example, consider the predicate obtained by the conjunction of the first and second alternatives, i.e., P(i) = {KILL(i,q) ≠ ∅ and GEN(i,p,q) = ∅}.
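The cutoff heuristic can be sketched as a walk from the communication site toward the CFG entry that stops at the first node satisfying P(i). The sketch below hard-codes the conjunctive predicate just given; the path and set representations are illustrative assumptions, not the compiler's data structures.

```python
def hoist_point(path_up, gen, kill):
    """Return the highest node to which communication may be hoisted.

    path_up: nodes from the communication site toward the CFG entry.
    gen/kill: per-node descriptor sets (for fixed processors p, q).
    """
    point = path_up[0]
    for i in path_up[1:]:
        # P(i): KILL(i,q) != empty and GEN(i,p,q) == empty
        # -> stop accumulating; do not hoist past this node.
        if kill[i] and not gen[i]:
            break
        point = i
    return point
```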
The data-flow equations given in Figure 11 implement this idea: they stop hoisting a communication set at a node satisfying P(i) instead of breaking it into smaller sets, each requiring a message of its own. This also eliminates some of the complexity of the resultant code. A possible impact of the new approach is shown in Figure 10, where SG and SK denote the GEN and KILL sets, respectively, for the node shown. The two approaches described in this paper behave similarly for the cases shown in Figures 10(a) and (b), but they differ when a node performs only writes and no reads. To compare our new approach with the previous one (Figure 7), consider the example program fragment given in Figure 12 on page 33, a modified version of the last part of the program shown in Figure 9. Columns two, three and four of the top part of Table 3 show the GEN, KILL and POST GEN sets, respectively, corresponding to the line numbers given in the first column. The fifth and sixth columns of the top part of Table 3 show the SAFE IN and RECV sets, respectively, of the previous approach. Although we obtain some reduction in communication volume, the number of messages is three, which is larger than that of the loop-based approach (column 7) that uses message vectorization alone. The bottom part of Table 3, on the other hand, presents the SAFE IN and RECV sets obtained by our new approach. In that case the number of messages is one, and we obtain a reduction in communication volume as well.
The main advantages of the new approach are shorter compilation time, less complex send/recv loops, and a reduced number of communication messages. However, in real programs, when a communicated array is written by the owner processor it is usually written entirely; therefore, the two approaches discussed behave similarly in practice.

Communication Generation
Our communication code generator uses the Omega library from the University of Maryland [42,31]. After the RECV(i,p,q) sets are obtained in terms of symbolic expressions, they are rewritten in terms of equalities and inequalities. Then the Omega library is called to generate the send and recv loops.
Let us now consider the example given in Figure 9 (and Table 1) once more to show how the communication sets are generated. We first concentrate on the computation of S4 +c S11 +c S33. The compiler keeps this set as a symbolic expression until the code generation phase, where it inserts the equalities and inequalities corresponding to S4, S11, and S33 and then calls the Omega library to enumerate the elements. Figure 13 shows the communication sets for S4, S11, S33 and S0 = S4 +c S11 +c S33 as represented in Omega.
A set element in this figure is represented as a quadruple [q,p,d1,d2], meaning that the array element indexed by [d1,d2] should be transferred from q to p. Later, in code generation, the projection function proj_R := {[q,P,d1,d2] -> [q,d1,d2]} is applied to this set to generate the recv set, and similarly the projection function proj_S := {[P,p,d1,d2] -> [p,d1,d2]} is applied to generate the send set, for a particular processor P. Notice that deriving the send and recv sets from a common set ensures correctness. In Figure 13, l1 and c1 denote the coordinates of an element to be communicated in the source (sending) processor, whereas l2 and c2 denote its coordinates in the target (receiving) processor; l3 and c3 refer to the coordinates of the LHS reference in the same statement. Notice that the bounds on l2 are adjusted in the appropriate directions to accommodate the received (non-local) elements, and the entire procedure works on the local address space, similar to the one shown in Figure 5(c) on page 13.
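The two projections can be illustrated on an explicit, already enumerated set of quadruples; in the compiler itself they are applied symbolically within Omega before enumeration. All tuple values below are made up for the example.

```python
# A toy enumerated RECV set of quadruples (q, p, d1, d2):
# element [d1, d2] must travel from processor q to processor p.
RECV = {(0, 1, 2, 3), (0, 1, 2, 4), (2, 1, 5, 5)}

def recv_set(RECV, P):
    # proj_R := {[q, P, d1, d2] -> [q, d1, d2]}:
    # what processor P must receive, tagged by sender q.
    return {(q, d1, d2) for (q, p, d1, d2) in RECV if p == P}

def send_set(RECV, P):
    # proj_S := {[P, p, d1, d2] -> [p, d1, d2]}:
    # what processor P must send, tagged by receiver p.
    return {(p, d1, d2) for (q, p, d1, d2) in RECV if q == P}
```

Because both projections are drawn from the same RECV set, every element some processor expects to receive is exactly matched by an element some processor sends.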
The Omega library is then used to generate the loops that enumerate the [q,d1,d2] and [p,d1,d2] triples. Finally, the loops are converted to Fortran and the internal data structures of the compiler are updated. As an example, the code enumerating the triples for (S25 +c S33) -c (S4 +c S11 +c S33) is shown in Figure 14(a) on page 35 as C code for the send set and in Figure 14(b) for the recv set. In these codes, process(.) is an implementation-specific function that handles the resulting elements. These codes enumerate the elements, and only the elements, that should be communicated between q and p. The remaining sets are computed and enumerated similarly. Notice that redundant equalities and inequalities can be eliminated before the code generation phase by using the 'simplify' utility provided by the Omega library.
As a final note, although our use of the Omega library increases the compilation time compared to previous approaches based on RSDs, this increase was not an issue for the programs we experimented with and was more than compensated by the run-time gains due to optimized communication, as explained in the next section.

Experiments
In this section we report experimental results for eight programs that exhibit regular communication behavior. The salient characteristics of these programs are given in Table 4: the REFS column shows the number of references in each program, the C REFS column gives the number of references that require communication, the ITER column shows how many times the outermost timing loop is iterated, each dimension of an array used in the experiments is set to the value shown in the SIZE column, and the DISTR column shows how the highest-dimensional arrays in the program are distributed (a 'D' in a dimension means that the dimension is distributed across processors, while a '*' denotes a non-distributed dimension). We experimented with four distributions: block (BLK), cyclic (CYC), cyclic(4) (CYC(4)), and cyclic(7) (CYC(7)). The last two distributions are taken into account to demonstrate the effectiveness of our approach with block-cyclic distributions, where most previous techniques fail; one of the block sizes is a power of two whereas the other one is prime. Gupta and Banerjee [21] note that for tred2 the block-cyclic distribution is the best choice. We also found that in addx a block-cyclic distribution performs best (depending on the number of processors used).
We have found that, for all of these programs except hydro m, our two global optimization approaches given in Figures 7 and 11 generate the same communication. We did not use run-time resolution [40] as the base version; instead, we considered the message-vectorized version with loop bounds reduction as the base version.
Since most compilers for message-passing architectures apply some form of message vectorization, we felt that it would be unfair to compare our method against run-time resolution without loop bounds reduction. Notice, however, that even in a single loop nest our global optimization approach subsumes most local optimizations, including message vectorization, message coalescing, and message aggregation. For all the programs except hydro m, we refer to the globally optimized version as opt. In the hydro m code, opt refers to the approach given in Figure 7, whereas opt* denotes the approach given in Figure 11. For all the programs and versions, we also applied an optimization that we call communication pattern reuse.
For example, assuming a (*,D) distribution for all arrays, in a statement such as X(i,j) = Y(i,j-1) + Z(i,j-1), arrays Y and Z have the same communication structure; therefore, we can generate the communication loops only once and reuse them, with a different name, for each array. This optimization has not been fully implemented yet.
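Communication pattern reuse can be sketched as generating the transfer loop once, closed over the pattern, and invoking it per array. The pattern and the `send` callback below are illustrative stand-ins for the generated Fortran communication loops.

```python
def make_transfer(pattern):
    """Build one transfer routine from a communication pattern.

    pattern: iterable of (dest_processor, index) pairs, generated once
    from the shared access structure (e.g., Y(i,j-1) and Z(i,j-1)).
    """
    def transfer(array, send):
        # The same loop body serves every array with this pattern.
        for p, idx in pattern:
            send(p, array[idx])
    return transfer

# Made-up pattern for illustration: two elements go to processor 0.
pattern = [(0, 0), (0, 1)]
xfer = make_transfer(pattern)   # communication loop generated only once
# Reused for each array that shares the pattern: xfer(Y, send); xfer(Z, send)
```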
We now briefly discuss the implementation status of our framework. We have finished the implementation of the analysis presented in this paper. Table 13 on page 43 summarizes the improvement in the number of messages for our programs. For hydro m there are two rows, corresponding to our two methods (opt and opt*, from top to bottom). Overall, there is a 32% reduction in the number of messages. The improvement with 16 processors is slightly higher than that with 8 processors because with 16 processors there are, in general, more communication messages to optimize. It is also interesting to note that our optimization technique achieves a 33% improvement with block-cyclic distributions (CYC(4) and CYC(7)), where most of the previous techniques fail. As expected, for hydro m our second approach, which controls communication hoisting, performs better than aggressive hoisting. Table 14 shows the percentage improvement in communication volume across all processors. We note that in both the 8 and 16 processor cases we have on average a 37% improvement over the base version; considering block-cyclic distributions alone, we have a 40% improvement. As mentioned earlier, these counts are collected dynamically at run time using the performance analysis tools available on the SP-2. It should also be emphasized that most of the improvements on adi and tomcatv result from a single nest, meaning that an aggressive loop-level optimizer that applies a combination of vectorization, coalescing, and aggregation could obtain similar improvements.
Finally, Table 15 gives the improvement in execution times. We note that the performance improvement for programs such as hydro, adi, tomcatv, and swim is very good, whereas for eflux and tred2 the improvement is only modest. This is because the communication in this second group of codes is either small compared to the total execution time or difficult to optimize, so there is not much opportunity for improvement. Overall, we have a 26% improvement, and our approach improves performance in all cases. Table 16 shows the compilation times in milliseconds for our programs under different distributions.
We can conclude that a hypothetical global optimization approach using RSDs to represent communication sets may be able to eliminate at most 64% of the compilation time. This is a theoretical bound as we do not know of any RSD based framework with zero cost that can handle block-cyclic distributions globally.
Given the gains in execution time, we believe that the extra overhead our approach incurs at compile time is tolerable. In general, over several runs, the extra compilation time will be amortized. Moreover, we can expect Omega-like tools to become much faster in the future.
The RUN column shows the percentages of execution times spent on executing the communication loops (without communication statements). On the average, only 7% of the execution time is spent on communication loops; therefore, the overhead incurred by our Omega-based approach at run-time is reasonable.
We also compared the compilation time taken by our Omega-based global approach with that of an approach based on processor-tagged descriptors (PTDs) [44], an enhanced form of RSDs built on top of Parafrase-2. PTDs provide an efficient way of describing distributed sets of iterations and regions of data, and are based on a single set representation parameterized by the processor location for each dimension of a virtual mesh. Table 18 shows the overall compilation times of the Omega-based approach (OME), the PTD-based approach (PTD), and the percentage increase (INC) when going from PTD to OME for pure block (BLK) and pure cyclic (CYC) distributions, as the PTDs cannot compile for general block-cyclic distributions.
The results show that using Omega instead of an RSD-like approach increases the compilation time by 7% to 27%, averaging 19% for both block and cyclic distributions.

Related Work
Several papers have addressed the problem of generating local address and communication sets for HPF programs in which arrays are distributed using general block-cyclic distributions [7,14,24,33,34,46,47]. Of these, Ancourt et al. [7] use a linear algebra framework, which renders their approach general; the remaining approaches are very efficient but only for a restricted class of mappings. Given the limited generality of those approaches, their use in communication optimizations of the kind discussed in this paper appears to be limited.
Most of the previous efforts considered communication optimization at loop level. Although each approach has its own unique features, the general idea has been the use of an appropriate combination of message vectorization, message coalescing and message aggregation [10,11,30,51,8,52].
Table 18: Total compilation times (in milliseconds) of the Omega-based approach and the PTD-based approach. The OME and PTD columns give the compilation times obtained using the Omega-based and the PTD-based approaches, respectively; the INC column shows the percentage increase when going from PTD to OME.
More recently, some researchers have proposed techniques based on data-flow analysis in order to optimize communication globally. Granston and Veidenbaum [19] propose an algorithm that applies combined flow and dependence analysis to programs with parallel constructs. Their algorithm detects partial redundancies across loop nests and in the presence of conditionals. However, their approach is not directly applicable to programs with general data distributions.
Gong et al. [18] also optimize communication using data-flow analysis. Gupta et al. [22] present a framework to optimize communication based on data-flow analysis and available section descriptors. Their approach is aggressive in exploiting locally available data but fails to support general block-cyclic distributions, and the representation they use makes it difficult to embed alignment and distribution information. Moreover, the communication set information they compute may not be precise.
Hanxleden and Kennedy [27,28] present a code placement framework for optimizing communication caused by irregular array references. Although the framework provides global data-flow analysis, it treats arrays as indivisible entities and is thus limited in exploiting the information available at compile time.
In contrast, Kennedy and Nedeljkovic [32] offer a global data-flow analysis technique using bit vectors. Although this approach is efficient, it is not as precise as the approach presented in this paper, and they do not indicate how their method could be extended to handle general block-cyclic distributions.
Kennedy and Sethi [35,36,37] show the necessity of incorporating resource constraints into a global communication optimization framework. They take into account a limited buffer size constraint and illustrate how strip-mining improves the efficacy of communication placement. Their approach works with multiple nests but not with general block-cyclic distributions. Since they do not give any experimental results, a direct quantitative comparison of their work with ours is not possible. Their work defines a data-flow variable called SAFE, which can be used in a manner similar to our predicate P(i). Kennedy and Sethi [35,36,37] do not use a linear algebra framework; later work from the dHPF project at Rice [2,3] includes the use of the Omega library for message optimizations.
The IBM pHPF compiler [13,23] achieves both redundancy elimination and message combining globally, but message combining is feasible only if the messages have identical patterns or one pattern is a subset of another. General block-cyclic distributions, however, can lead to complicated data access patterns and communication sets which, we believe, can be represented more precisely within a linear algebra framework.
Yuan et al. [49,50] present a communication optimization approach based on array data-flow analysis.
The cost of the analysis is managed by partitioning the optimization problem into subproblems, and solving the subproblems one at a time. Since that approach is also based on RSDs, it has difficulty in handling block-cyclic distributions.
Adve et al. [2,3] describe an integer set based approach for analysis and code generation for data parallel programs that uses the Omega library [31]. They consider performing message vectorization and message coalescing for general access patterns. Their method can also work with computation decomposition schemes that are not based on the owner-computes rule. These papers do not show how their techniques handle global communication optimization for multiple loop nests in the case of block-cyclic distributions.
Interval analysis used in this paper was first introduced by Allen and Cocke [5]. They used it to solve several data-flow problems; the analysis was then extended by Gross and Steenkiste [20] to array sections.
The approach proposed by Gupta et al. [22] mentioned above refines the technique by Gross and Steenkiste using loop-carried dependences.
In this paper we used ideas from the linear algebra framework [7] and from data-flow analysis [5,4] developed for performing optimizations on the CFG representation of programs. We have shown that these two techniques blend together naturally, making the global communication optimization problem tractable even in the presence of general block-cyclic distributions. We should emphasize that the data-flow equations given in Figures 7 and 11 are only two representative solutions showing how the global communication problem can be cast in the linear algebra framework. We believe most of the previous approaches can also be put into this framework by re-defining their communication and ownership sets in terms of equalities and inequalities. This would not only give those approaches the capability to handle arbitrary alignments and distributions, but would also provide high accuracy in manipulating the communication sets.

Summary
Management of accesses to non-local data to minimize communication costs is critical for scaling performance on distributed-memory message-passing machines. In this paper, we presented a global communication optimization scheme based on two complementary techniques: data-flow analysis and a linear algebra framework. The combination of these techniques allows us to optimize communication globally and to use polyhedron scanning techniques to enumerate global communication sets effectively for HPF-like alignments and distributions, including block-cyclic distributions. Our framework takes control flow into account and achieves message vectorization, message coalescing, message aggregation and redundant communication elimination in a unified setting. The cost of the analysis is managed by keeping the communication sets symbolic until the end of the data-flow analysis, at which point the Omega library is called to generate the actual sets in terms of equalities and inequalities. The experimental results demonstrate the effectiveness of our approach in reducing both the number of messages and the volume of the data communicated. Future work will address the development of performance models that give the compiler the ability to estimate the profitability of message aggregation and coalescing globally.