Runtime Support for In-Core and Out-of-Core Data-Parallel Programs

Rajeev Thakur
Graduate School of Syracuse University, Computer Engineering

Follow this and additional works at: https://surface.syr.edu/eecs

Part of the Computer Engineering Commons

Recommended Citation
Thakur, Rajeev, "Runtime Support for In-Core and Out-of-Core Data-Parallel Programs" (1995). Electrical Engineering and Computer Science. 118.
https://surface.syr.edu/eecs/118

This Dissertation is brought to you for free and open access by the College of Engineering and Computer Science at SURFACE. It has been accepted for inclusion in Electrical Engineering and Computer Science by an authorized administrator of SURFACE. For more information, please contact surface@syr.edu.
RUNTIME SUPPORT FOR IN-CORE AND OUT-OF-CORE DATA-PARALLEL PROGRAMS

by

RAJEEV THAKUR

B.E., University of Bombay, India, 1990
M.S., Syracuse University, 1992

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering in the Graduate School of Syracuse University

May 1995

Approved ____________________________

Professor Alok Choudhary

Date ____________________________
Abstract

Distributed memory parallel computers or distributed computer systems are widely recognized as the only cost-effective means of achieving teraflops performance in the near future. However, the fact remains that they are difficult to program and advances in software for these machines have not kept pace with advances in hardware. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be directly used in application programs for greater efficiency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory machines.

In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents efficient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. The performance of these algorithms on the CM-5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparing with experimental results.

A number of applications deal with very large data sets which cannot fit in main memory, and hence have to be stored in files on disks, resulting in out-of-core programs. This thesis also describes the design and implementation of efficient runtime support for out-of-core computations. Several optimizations for accessing out-of-core data are presented. An Extended Two-Phase Method is proposed for accessing sections of out-of-core arrays efficiently. This method uses collective I/O and the I/O workload is divided among processors dynamically, depending on the access requests. Performance results obtained using this runtime support for out-of-core programs on the Touchstone Delta are presented.
# Contents

List of Tables viii

List of Figures x

Acknowledgments xiii

1 Introduction 1
   1.1 Distributed Memory Parallel Computers 2
   1.2 Software for Distributed Memory Computers 2
   1.3 Data-Parallel Languages 4
   1.4 Need for High Performance I/O 5
   1.5 Contributions of this Thesis 6
   1.6 Related Work 8
   1.7 Organization of this Thesis 11

2 Issues in Runtime Support 12
   2.1 Runtime Support for Regular Problems 12
   2.2 Runtime Support for Irregular Problems 13
   2.3 Runtime Support for Compilers 15
      2.3.1 Overview of HPF 15
      2.3.2 Compiler and Runtime Support for HPF 17
   2.4 Runtime Support for Out-of-Core Programs 21
      2.4.1 Out-of-Core Applications 22
3 Runtime Support for Array Redistribution

3.1 Introduction ................................................................. 25
  3.1.1 Need for Redistribution ........................................... 26
3.2 Notations and Definitions .............................................. 28
3.3 Block\((m)\) to Cyclic Redistribution .................................. 31
3.4 Cyclic to Block\((m)\) Redistribution .................................... 35
3.5 Cyclic\((x)\) to Cyclic\((y)\) Redistribution ............................... 38
  3.5.1 Special case \(x = k \cdot y\) ........................................... 38
  3.5.2 Special case \(y = k \cdot x\) ........................................... 42
  3.5.3 General Case ........................................................ 46
3.6 Redistribution of Multidimensional Arrays ......................... 50
  3.6.1 Shape Retaining Redistribution .................................. 51
  3.6.2 Shape Changing Redistribution .................................. 53
3.7 Circular Redistribution .................................................. 55
  3.7.1 Circular Block\((m)\) ↔ Cyclic Redistribution .................. 55
  3.7.2 Circular Cyclic ↔ Block\((m)\) Redistribution ................... 56
  3.7.3 Circular Cyclic\((x)\) ↔ Cyclic\((y)\) Redistribution ............... 56

4 Runtime Support for All-to-All Collective Communication .... 58

4.1 Architecture and Communication Issues ............................ 59
  4.1.1 CM-5 ................................................................. 59
  4.1.2 Touchstone Delta .................................................. 61
  4.1.3 Performance Models ............................................... 62
4.2 All-to-All Communication on a Fat Tree ............................ 63
  4.2.1 Linear Exchange (LEX) ........................................... 63
  4.2.2 Pairwise Exchange (PEX) ......................................... 64
  4.2.3 Recursive Exchange (REX) ........................................ 65
  4.2.4 Balanced Exchange (BEX) ........................................ 67
  4.2.5 Performance of Algorithms on the CM-5 ......................... 68
4.3 All-to-All Communication on a 2D Mesh ............................. 71
  4.3.1 Pairwise Exchange for Power-Of-Two Mesh (PEX) ............... 72
<table>
<thead>
<tr>
<th>Section</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.3.2 Pairwise Exchange for General Mesh (PEX-GEN)</td>
<td>74</td>
</tr>
<tr>
<td>4.3.3 PEX-GEN with Shift (PEX-GEN-SHIFT)</td>
<td>74</td>
</tr>
<tr>
<td>4.3.4 General Algorithm for any Mesh (GEN)</td>
<td>77</td>
</tr>
<tr>
<td>4.3.5 Indirect Pairwise Exchange (IPEX)</td>
<td>78</td>
</tr>
<tr>
<td>4.3.6 Recursive Exchange (REX)</td>
<td>78</td>
</tr>
<tr>
<td>4.3.7 Performance of Algorithms on the Delta</td>
<td>81</td>
</tr>
<tr>
<td>4.3.8 Model Validation</td>
<td>86</td>
</tr>
</tbody>
</table>

5 Runtime Support for Out-of-Core Programs: (I) Models and Local Optimizations 90

<table>
<thead>
<tr>
<th>Section</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>5.1 Introduction</td>
<td>90</td>
</tr>
<tr>
<td>5.2 PASSION Runtime Library</td>
<td>91</td>
</tr>
<tr>
<td>5.3 Models</td>
<td>93</td>
</tr>
<tr>
<td>5.3.1 Architectural Model</td>
<td>93</td>
</tr>
<tr>
<td>5.3.2 Data Storage and Access Models</td>
<td>94</td>
</tr>
<tr>
<td>5.4 Runtime Support for the Local Placement Model</td>
<td>97</td>
</tr>
<tr>
<td>5.4.1 Out-of-Core Array Descriptor (OCAD)</td>
<td>100</td>
</tr>
<tr>
<td>5.5 Optimizations</td>
<td>100</td>
</tr>
<tr>
<td>5.5.1 Data Sieving</td>
<td>101</td>
</tr>
<tr>
<td>5.5.2 Data Prefetching</td>
<td>108</td>
</tr>
<tr>
<td>5.5.3 Data Reuse</td>
<td>112</td>
</tr>
</tbody>
</table>

6 Runtime Support for Out-of-Core Programs: (II) Collective I/O 114

<table>
<thead>
<tr>
<th>Section</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>6.1 Introduction</td>
<td>114</td>
</tr>
<tr>
<td>6.2 Need for Collective I/O</td>
<td>115</td>
</tr>
<tr>
<td>6.3 Extended Two-Phase Method for Collective I/O</td>
<td>117</td>
</tr>
<tr>
<td>6.3.1 Reading Sections of Out-of-Core Arrays</td>
<td>117</td>
</tr>
<tr>
<td>6.3.2 Writing Sections of Out-of-Core Arrays</td>
<td>120</td>
</tr>
<tr>
<td>6.4 Partitioning I/O Among Processors</td>
<td>121</td>
</tr>
<tr>
<td>6.4.1 Static Partitioning</td>
<td>122</td>
</tr>
<tr>
<td>6.4.2 Dynamic Partitioning</td>
<td>122</td>
</tr>
</tbody>
</table>
6.5 Performance ................................................. 124
  6.5.1 Reading Common Sections ......................... 125
  6.5.2 Reading Overlapping Sections ..................... 127
  6.5.3 Reading Distinct Sections .......................... 129
  6.5.4 Writing Distinct Sections .......................... 131
  6.5.5 Accessing Sections with Non-Unit Strides .......... 131
  6.5.6 Scalability ............................................ 132
6.6 Advantages of Extended Two-Phase Method ............ 135

7 Conclusions ................................................. 140

Bibliography .................................................. 143

Vita .............................................................. 159
List of Tables

3.1 Data Distribution and Index Conversion .......................... 31
4.1 Communication Schedule for PEX on 8 Processors .............. 64
4.2 Communication Schedule for REX on 8 processors .............. 66
4.3 Communication Schedule for BEX on 8 processors .............. 69
4.4 Performance of PEX (time in sec.) .............................. 81
4.5 Performance of PEX-GEN (time in sec.) ......................... 82
4.6 Performance of PEX-GEN-SHIFT (time in sec.) ................. 83
4.7 Performance of GEN on power-of-two mesh (time in sec.) .... 84
4.8 Performance of GEN on non power-of-two mesh (time in sec.) .. 84
4.9 Performance of REX on Delta (time in sec.) ................... 85
4.10 Performance of IPEX (time in sec.) ............................. 85
5.1 Some of the PASSION Routines ................................. 99
5.2 Performance of Direct Read/Write versus Data Sieving (time in sec.) 107
5.3 I/O requirements of Direct Read and Data Sieving Methods .... 107
5.4 Performance of Median Filtering using $3 \times 3$ window (time in sec.) 110
5.5 Performance of Median Filtering using $5 \times 5$ window (time in sec.) 110
5.6 Performance of Data Reuse ........................................ 113
6.1 Time (sec.) for Reading Common Sections ....................... 126
6.2 Time (sec.) for Reading Overlapping Sections .................. 128
6.3 Time (sec.) for Reading Distinct Sections ...................... 130
6.4 Time (sec.) for Writing Distinct Sections ...................... 131
6.5 Time (sec.) for Reading Sections with Non-unit Strides . . . . . . . 132
6.6 Time (sec.) for Writing Sections with Non-unit Strides . . . . . . . 133
6.7 Performance for different number of processors . . . . . . . . . . . . 134
6.8 Performance for large sections of large arrays . . . . . . . . . . . . . 136
List of Figures

2.1 Phases of Compilation ................................. 18
3.1 Need for Redistribution ............................. 27
3.2 2D FFT Program using Redistribution ............... 28
3.3 Notations ........................................... 29
3.4 Block\((m)\) to Cyclic Redistribution .................. 32
3.5 Algorithm for Block\((m)\) to Cyclic Redistribution ....... 34
3.6 Algorithm for Cyclic to Block\((m)\) Redistribution .......... 36
3.7 Performance of Cyclic to Block Redistribution, array size 1M integers 37
3.8 Cyclic\((4)\) to Cyclic\((2)\) Redistribution ................ 39
3.9 KY_TO_Y Algorithm for Cyclic\((x)\) to Cyclic\((y)\) Redist., where \(x = ky\) 41
3.10 KY_TO_Y Algorithm, cyclic\((4)\) to cyclic\((2)\), array size 1M integers ... 42
3.11 KY_TO_Y Algorithm, cyclic\((4)\) to cyclic\((2)\), 64 processors ........ 43
3.12 X_TO_KX Algorithm for Cyclic\((x)\) to Cyclic\((y)\) Redist., where \(y = kx\) 45
3.13 X_TO_KX Algorithm, cyclic\((2)\) to cyclic\((4)\), array size 1M integers ... 46
3.14 X_TO_KX Algorithm, cyclic\((2)\) to cyclic\((4)\), 64 processors ........ 47
3.15 General Cyclic\((x)\) to Cyclic\((y)\) Redistribution ................ 47
3.16 GCD and LCM Methods ................................ 48
3.17 GCD, LCM and General Methods; cyclic\((15)\) to cyclic\((10)\); 1M array .... 49
3.18 GCD, LCM and General Methods; cyclic\((11)\) to cyclic\((3)\); 1M array ... 50
3.19 LCM v/s General Methods, cyclic\((11)\) to cyclic\((3)\), 64 processors ... 51
3.20 (block,block) to (cyclic,cyclic) Redistribution, 1K × 1K array ... 52
3.21 \((cyclic(11),cyclic(11))\) to \((cyclic(3),cyclic(3))\) Redist., \(1K \times 1K\) array  
3.22 Circular \(cyclic(11)\) to \(cyclic(3)\) Redistribution, array size \(1M\) integers  
4.1 CM-5 fat tree  
4.2 CM-5 Data Network with 64 Processing Nodes  
4.3 Pairwise Exchange (PEX)  
4.4 Recursive Exchange (REX) on CM-5  
4.5 Balanced Exchange (BEX)  
4.6 Performance on 32 node CM-5 for different message sizes  
4.7 Performance on CM-5 for message size 512 bytes  
4.8 Performance on CM-5 for message size 2 Kbytes  
4.9 PEX on \(2 \times 4\) mesh  
4.10 Pairwise Exchange for General Mesh (PEX-GEN)  
4.11 Processor Shift  
4.12 Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT)  
4.13 General Algorithm for any Mesh (GEN)  
4.14 Indirect Pairwise Exchange (IPEX)  
4.15 Recursive Exchange (REX) on Delta  
4.16 Performance of algorithms on a \(16 \times 32\) mesh  
4.17 Performance of algorithms on a \(16 \times 9\) mesh  
4.18 Performance of power-of-two algorithms for message size 16 Kbytes  
4.19 Performance of power-of-two algorithms for message size 1 Kbytes  
4.20 Observed and Predicted times (PEX, GEN)  
4.21 Observed and Predicted times (REX, IPEX)  
5.1 Software Architecture  
5.2 Data Storage and Access Models  
5.3 HPF Program Fragment: Solving Laplace’s Equation by Jacobi Iteration Method  
5.4 Example of OCLA, ICLA and LAF  
5.5 Out-of-Core Array Descriptor (OCAD)
Acknowledgments

I have been fortunate to have Alok Choudhary as my advisor throughout the course of my graduate studies. He has helped me in too many ways to mention. I greatly appreciate his guidance, support, encouragement and enthusiasm. Working with him has been a pleasure and a great learning experience.

I wish to thank Thong Dang, Geoffrey Fox, Salim Hariri, David Kotz and Sanjay Ranka for serving on my thesis committee and for their feedback on drafts of this thesis. I thank Geoffrey Fox and everyone else at NPAC for providing an excellent environment for my work. I thank Thong Dang for getting me interested in Computational Fluid Dynamics and inspiring me in many other ways. I am grateful to all my friends particularly Rajesh Bordawekar, Sachin Damle, Kanad Chakraborty, Ravi Ponnumasamy, Sachin More, Kevin Roe and Mario Campuzano for their help at all times and making my stay at Syracuse pleasant and enjoyable.

I cannot express in words my gratitude towards my parents, grandparents and sister. All my achievements have been possible only because of their constant encouragement, love and support. I am fortunate to have such a wonderful family. I dedicate this thesis to them.
Chapter 1

Introduction

Massively Parallel Processors (MPPs) with peak performance greater than 100 Gflops have already made their advent into the supercomputing arena. As a result, parallel computers are increasingly being used in many applications which require a great deal of computational power. Examples of such applications include many large scale computations in physics, chemistry, biology, engineering, medicine and other sciences, which have been identified as Grand Challenge Applications [56, 131]. Many applications dealing with information technology, such as multimedia systems, video on demand, video compression and decompression, also require a large amount of compute power. It is estimated that a computer capable of 1 Tflops ($10^{12}$ flops) or more would be required to solve these applications in a reasonable amount of time. It is widely recognized that rather than conventional vector supercomputers, parallel computers or distributed systems provide the only cost-effective means of achieving teraflop performance.

However, software support for parallel computers has lagged far behind advances in hardware. Programming a parallel machine can prove to be quite tedious. In order to get the best performance from a parallel computer, the programmer has to pay attention to many low-level implementation details. Also, the programs are very often not portable. One way to tackle this problem is by using advanced compilers and runtime support systems. The research presented in this thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on
distributed memory parallel computers. This runtime support can be directly used in application programs for greater efficiency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory parallel computers.

1.1 Distributed Memory Parallel Computers

In a distributed memory computer, the memory is physically distributed among processors. Each processor has its own local memory and all processors are connected together by an interconnection network such as a hypercube, mesh, torus, fat tree or some other topology. A processor can directly access only its own local memory. If data from some other processor is needed, it can be obtained by explicit message passing. Examples of such systems are the Intel iPSC/860, Touchstone Delta and Paragon; Thinking Machines CM-5; IBM SP-1, SP-2; Meiko CS-2; Cray T3D; nCUBE-2 etc.

The main advantage of distributed memory computers is that they are scalable. Advances in interconnection network technology have made it possible to connect hundreds or thousands of processors into one large high-performance parallel computer. Hence, such machines are also called Massively Parallel Processors (MPPs). Each processor of an MPP is typically a powerful microprocessor like a DEC Alpha, IBM PowerPC, Sun Sparc or Intel i860. Since these microprocessors are also used in workstations or personal computers, they are available at competitive prices. Hence, MPPs also prove to be very cost-effective.

1.2 Software for Distributed Memory Computers

Distributed memory machines are relatively hard to program. The most common programming model for distributed memory computers is the Single Program Multiple Data (SPMD) model in which each processor runs the same program, but on different data sets. The program running on each node is essentially a sequential program
CHAPTER 1. INTRODUCTION

(usually in C or Fortran) interspersed with calls to message passing routines. The node program is written in the local address space and the absence of a single global address space is what makes message passing programming difficult. The programmer has to decide how data should be partitioned among the processors and has to manage communication between processors explicitly. Interprocessor communication is at least an order of magnitude more expensive than accessing data in local memory. This forces the programmer to pay considerable attention to saving communication costs. Since each machine has its own architectural features and idiosyncrasies, the programmer tries to hardwire the program to exploit these peculiarities and get the best performance. Hence a great deal of effort is required to port such programs to other machines. These details also make it difficult to convert existing sequential programs to parallel programs. Another problem is that each parallel machine has its own message passing library which is quite different from that of other machines. Some of the common communication libraries are NX for Intel machines, CMMD for the CM-5, MPI for the IBM SP-1 and SP-2 etc. A program written using the NX library for the Intel Paragon cannot be directly run on the CM-5 and vice-versa because the message passing routines are totally incompatible.

There have been several attempts to provide some measure of portability in parallel programs. There are number of portable communication libraries like EXPRESS [90], PVM [114], PICL [47], P4 [19] etc. which provide a communication layer above the native message passing library of the system. These libraries provide a well-defined set of communication routines which remain the same for any system. For example, a program written using EXPRESS or PVM can be run on almost any parallel computer and even on a network of workstations. The EXPRESS or PVM routines make calls to the message passing library provided on the system. However, this portability is at the cost of slightly lower performance because of this additional layer of communication software.

There has also been an effort to develop a standard Message Passing Interface (MPI) [83]. The Message Passing Interface Forum, a group of researchers from industry, academia and research laboratories, has defined a set of library interface standards
for message passing called MPI. MPI has tried to make use of the most attractive features of a number of existing message passing systems. The goal is to establish a practical, portable, efficient and flexible standard for message passing. The definition of a message passing standard such as MPI provides vendors with a clearly defined set of routines that they can implement efficiently, or in some cases provide hardware support for, thereby providing better performance. But the difficulty of explicitly doing message passing still remains.

1.3 Data-Parallel Languages

Since message passing programming is difficult, the user would prefer to write a program in global name space without any message passing, and have a compiler translate it into a message passing program. This requires advanced compiler technology, but it would result in parallel programming being easy and portable. Researchers have found it very hard to build compilers which can parallelize sequential programs written in standard C or Fortran 77. Hence, standard sequential languages have been augmented with directives and constructs to aid the compiler in generating message passing code. Fortran D [43, 58] is one such language. Fortran D consists of a set of extensions to Fortran 77 which specify how data is to be distributed among the processors of a distributed memory machine. The Fortran D compiler developed at Rice University [122, 59] can translate a Fortran D program into a Fortran 77 plus message passing node program. Another such language is Fortran 90D [128]. Fortran 90D consists of a set of extensions to Fortran 90, similar to those used in Fortran D. The Fortran 90D compiler developed at Syracuse University [13] translates Fortran 90D programs to Fortran 77 plus message passing node programs. Vienna Fortran [21, 130] and CM Fortran [121] also allow the user to write programs in global address space.

Recently, the High Performance Fortran Forum, comprising a group of researchers from industry, academia and research laboratories, developed a standard language called High Performance Fortran (HPF) [57, 67]. HPF was designed to provide a
CHAPTER 1. INTRODUCTION

portable extension to Fortran 90 for writing data-parallel applications. It includes features for mapping data to processors, specifying data-parallel operations, and methods for interfacing HPF programs to other programming paradigms. It is expected that HPF will be a standard programming language for computationally intensive applications on many types of machines, such as massively parallel MIMD and SIMD multiprocessors as well as traditional vector processors. In order for HPF to be successful, it needs a powerful compiler. Compiler technology for HPF is still in its infancy and current versions of HPF compilers are not robust and mature enough to compile large production codes efficiently. However once good compilers are available, the modern features and powerful capabilities of HPF are expected to make it the new popular version of Fortran for scientists and engineers solving complex large-scale problems [67].

1.4 Need for High Performance I/O

Another important issue in high performance computing is providing support for high performance parallel input-output (I/O). I/O for parallel systems has drawn increasing attention in the last few years as it has become apparent that I/O performance rather than CPU or communication performance may be the limiting factor in future computing systems. Large scale scientific computations, in addition to requiring a great deal of computational power, also deal with large quantities of data. At present, a typical Grand Challenge Application could require 1Gbyte to 4Tbytes of data per run [38]. These figures are expected to increase by orders of magnitude as teraflop machines make their appearance. Although supercomputers have very large main memories, the memory is not large enough to hold this much amount of data. Hence, data needs to be stored on disk and the performance of the program depends on how fast the processors can access data from disks. In order to have a balanced system [75], it is essential that the I/O bandwidth is comparable to the CPU and communication bandwidth. Unfortunately, the performance of the I/O subsystems of MPPs has not kept pace with their CPU and communication capabilities. It is still
orders of magnitude more expensive to do I/O than to do computation or communication. Improvements are needed in both hardware and software in order to improve I/O performance.

1.5 Contributions of this Thesis

This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be used for performing many of the commonly required operations in application programs written using a distributed memory programming model. The use of runtime support makes it easier to write application programs and provides greater efficiency and portability. We mainly focus on runtime support for regular computations in which the communication and I/O patterns are known statically.

This runtime support can also be used together with a compiler to translate programs written in a high-level data-parallel language like HPF to node programs for distributed memory parallel computers. In fact it forms an essential part of the Fortran 90D/HPF compiler developed at Syracuse University [13]. Runtime support helps to separate the machine dependent aspects of compilation from the machine independent aspects. The compiler can do all the machine independent transformations and the runtime system can be optimized for each different machine. Thus, a portable and efficient compiler can be obtained by porting the runtime system to different machines. The runtime support discussed in this thesis is general and can be used in any other HPF compiler or a compiler for any other data-parallel language.

In distributed memory programs, arrays are distributed across processors in some fashion. For a number of reasons, it is often necessary to change the distribution, or redistribute the arrays during the course of program execution. This thesis presents efficient algorithms for runtime array redistribution. Since array distribution and redistribution is rigorously defined in HPF, the algorithms are developed for redistributing arrays as defined in HPF. The algorithms are independent of the communication mechanism used and hence are portable. A novel approach is proposed for
performing the general cyclic($x$) to cyclic($y$) redistribution using two methods, called the GCD Method and the LCM Method, which have low runtime overhead. We have implemented all the algorithms on the Intel Touchstone Delta and they are found to perform very well for different number of processors and array sizes.

A collective communication pattern which arises very often in many applications such as two-dimensional Fast Fourier Transform, parallel quicksort as well as in array redistribution, is the all-to-all communication pattern, also called complete exchange. This thesis presents several algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. Previous work in this area tried to minimize link contention by increasing the number of communication steps. The algorithms proposed in this thesis take advantage of the fact that in many of the present generation machines like the Touchstone Delta and Paragon, the communication links have excess bandwidth and can tolerate a certain amount of link contention. Hence communication can be performed in fewer steps by allowing a small amount of link contention to exist. The performance of these algorithms on the CM-5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters has been developed and validated by comparing with experimental results.

A large number of applications deal with very large data sets which cannot fit in main memory, and hence have to be stored in files on disks. This thesis also describes the design and implementation of efficient runtime support for out-of-core computations. The runtime system supports three different models of data storage and access. A number of optimizations have been incorporated for improved performance. A new method, called the Extended Two-Phase Method, is proposed for accessing sections of out-of-core arrays efficiently. This method uses collective I/O in which processors cooperate to perform I/O in an efficient manner by combining several I/O requests into fewer larger requests, eliminating multiple file accesses for the same data and reducing contention for the I/O subsystem. A dynamic scheme is used for partitioning the I/O workload among processors, depending on the access requests. Performance results obtained using this runtime support for out-of-core
programs on the Touchstone Delta are presented.

A Parallel Compiler Runtime Consortium (PCRC) [45] has recently been formed with the goal of developing a common runtime support for high level parallel languages. We believe that the research presented in this thesis would be very useful to this consortium. It is also useful to vendors developing commercial HPF compilers, such as The Portland Group Inc. (PGI), Applied Parallel Research (APR), Digital Equipment Corporation (DEC) and others. Application programmers writing distributed memory programs would also benefit a great deal by using the ideas proposed in this thesis.

1.6 Related Work

This section briefly describes some of the related research in parallel languages, parallel compilers, runtime support systems and support for high-performance parallel I/O.

The concept of defining processor arrays and distributing data to them was first introduced in the programming language BLAZE [82] in the context of shared memory systems with non-uniform access times. This research was continued in the Kali [65] programming language for distributed memory machines. The Kali compiler uses the inspector/executor strategy to parallelize irregular computations. The compilation system SUPERB [129] parallelizes sequential programs semi-automatically for distributed memory machines. The SUPERB tool transforms sequential Fortran programs with data distribution annotations into parallel programs. Compilers for functional languages like Id Nouveau and Crystal have been developed for distributed memory machines. The Crystal compiler generates communication statements by studying the access patterns of the arrays in a statement. Split-C [32] is a parallel extension of C intended for high performance programming on distributed memory multiprocessors. It provides a global address space and allows a mixture of shared memory, message passing and data-parallel programming styles for both regular and
irregular problems. A compiler for Split-C is being developed at the University of California, Berkeley, with a runtime support system which uses Active Messages [124]. pC++, a data-parallel extension to C++, has been developed at Indiana University [5]. A Fortran D compiler is being developed at Rice University [122].

Some research has been done in developing compilers which can automatically determine a good distribution and alignment of arrays instead of having the user specify them. Gupta [50] has proposed a constraint based approach to automatically determine a good data distribution. This method has been incorporated in the PARADIGM compiler [49, 112]. The PARADIGM project at the University of Illinois aims at developing a fully automated compiler to translate sequential programs to parallel programs for distributed memory computers. The problem of automatic alignment of arrays has been studied by Chatterjee et al. [22] and Li et al. [79].

There has been some research in runtime support for applications with irregular data access patterns. The PARTI/CHAOS toolkit is a collection of runtime library routines to handle irregular computations [35, 102]. These primitives have been integrated with the Fortran D compiler at Rice University, the Fortran 90D/HPF compiler at Syracuse University and the Vienna Fortran compiler at the University of Vienna. Compilation methods for irregular problems have been investigated by Ponnusamy [93], Das [34] and Hanxleden [125]. Agrawal et al. [1] describe how runtime support can be integrated with a compiler to solve irregular block-structured problems.

Research has also been done in the area of array redistribution. Gupta et al. [53] and Koelbel [66] provide closed form expressions for determining the send and receive processor sets and data sets for redistributing arrays between block and cyclic distributions. An analytical model for evaluating the communication cost of data redistribution is presented in [63]. A multiphase approach to redistribution is discussed in [64]. Wakatani and Wolfe [126] describe a method of array redistribution called Strip Mining Redistribution in which parts of an array are redistributed in sequence, instead of redistributing the entire array at one time as a whole. The reason for doing
this is to try to overlap the communication involved in redistribution with some of the computation in the program. Kalns and Ni [62] present a technique for mapping data to processors so as to minimize the total amount of data that must be communicated during redistribution. Ramaswamy and Banerjee [99] discuss algorithms for redistribution based on a mathematical representation for regular distributions called PITFALLS. There has also been some research on the closely related problem of determining the local addresses and communication sets for array assignment statements like $A(l_1 : h_1 : s_1) = B(l_2 : h_2 : s_2)$ where $A$ and $B$ have different cyclic($m$) distributions [23, 110, 111, 52].

Algorithms for all-to-all communication (complete exchange) on a hypercube have been proposed by Bokhari [6], and also by Johnsson and Ho [61]. Complete exchange algorithms for a two-dimensional mesh are discussed in [103, 7, 113, 51, 54]. Optimal communication algorithms on the hypercube have been developed by Fox and Furmanski [42]. Algorithms for scheduling irregular communication patterns have been proposed by Wang and Ranka [100, 127] and also by Shankar and Ranka [106, 107, 108].

Compiling out-of-core data-parallel programs is a fairly new topic and there has been very little research in that area. A model and compilation strategy for out-of-core data-parallel programs is discussed in [10]. Bordawekar [11, 8] is developing a compiler for out-of-core HPF programs which uses the runtime library [116, 25, 115, 118] discussed in Chapters 5 and 6 of this thesis. Cormen and Colvin [31] are developing a compiler-like preprocessor for out-of-core C*, called ViC*, which translates out-of-core C* programs to standard C* programs with calls to a runtime library for I/O. Paleczny et al. [89] are also developing techniques for compiling out-of-core data-parallel programs. Brezany et al. [18] are working on compilation techniques for out-of-core programs in the context of Vienna Fortran. Language extensions for out-of-core data-parallel programs are proposed in [8, 17, 109, 18].

There has been a lot of effort to improve parallel I/O performance both by hardware and software means. Various RAID schemes (Redundant Arrays of Inexpensive
Disks) are proposed in [91] for better performance, reliability, power consumption and scalability. An excellent overview of RAID concepts is given in [24]. Disk striping where data is distributed across disks at a small granularity so that each block is distributed across all disks is proposed in [101]. File declustering, where different blocks of a file are stored on distinct disks is suggested in [81]. This is used in the Bridge File System [39], in Intel's Concurrent File System (CFS) [92] and in various RAID schemes [91]. Vesta is a parallel file system which supports logical partitioning of files [29, 27]. The RAMA [84] file system distributes file blocks across disks randomly using a hash function, instead of the usual striped layout. Runtime libraries for parallel I/O, such as the Panda Library [104, 105] and the Jovian Library [4], are being developed. Portable parallel file systems such as VIP-FS [55], PIOUS [85] and PPFS [60] have been developed recently. Techniques for improving I/O performance using collective I/O have also been proposed. Two-phase I/O is a technique for performing collective I/O using a runtime library [37, 12]. Disk-directed I/O is a technique for performing collective I/O at the file system level [69, 70, 71].

1.7 Organization of this Thesis

The rest of this thesis is organized as follows. Chapter 2 gives an overview of some of the issues in providing runtime support for in-core and out-of-core data-parallel programs. Runtime support for array redistribution is discussed in Chapter 3. Chapter 4 describes runtime support for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. Runtime support for out-of-core programs is discussed in Chapters 5 and 6. Chapter 5 describes three models of data storage and access for out-of-core programs, and a number of local optimizations for accessing out-of-core data efficiently. Chapter 6 describes the Extended Two-Phase Method for accessing sections of out-of-core arrays using collective I/O. Finally, conclusions are presented in Chapter 7, along with some ideas for future work.
Chapter 2

Issues in Runtime Support

This chapter gives a brief overview of the various issues involved in providing runtime support for programs on distributed memory parallel computers. Runtime support can either be used directly in message passing application programs, or used together with a compiler to translate programs written in a high-level data-parallel language such as HPF.

2.1 Runtime Support for Regular Problems

There are many scientific applications which have very regular array access patterns. These access patterns may arise either from the underlying physical domain being studied, or the type of algorithm used. Examples of such applications include

- Matrix Multiplication, LU Decomposition and other operations in dense linear algebra
- Solving Partial Differential Equations (PDEs) on regular meshes
- Fast Fourier Transform

The main characteristic of such applications is that the communication pattern is known statically before program execution. Thus all the necessary optimizations can be performed beforehand at compile time. In such applications, data is usually
distributed using either a block, cyclic, or block-cyclic distribution [119]. The array subscripts used in the program are usually linear functions of the loop indices.

Many different communication patterns are possible depending on the array access pattern in the program. Li and Chen [78] characterize many of the commonly occurring communication patterns. A library of collective communication routines is needed for carrying out communication efficiently. A particular type of communication pattern which occurs frequently is the all-to-all communication pattern. Efficient algorithms for all-to-all communication are presented in Chapter 4 of this thesis. Runtime support is also needed for array redistribution. Although arrays are distributed among processors at the start of the program, it is very often necessary to change the distribution during program execution, which is called array redistribution. This involves calculation of source and destination processor and index sets as well as interprocessor communication. Runtime support for array redistribution is discussed in detail in Chapter 3 of this thesis.

2.2 Runtime Support for Irregular Problems

There is another set of scientific applications in which the array access pattern is irregular. This is usually due to the fact that the underlying physical domain is irregularly connected. Examples of such applications include

- Computational Fluid Dynamics (CFD) codes using unstructured meshes
- Molecular dynamics codes
- Sparse iterative linear systems solvers

In irregular problems, arrays are accessed using one or more level of indirection. An example of this is the following loop

\[
\text{do } i=1, n \\
\quad A(x(i)) = B(y(i)) + C(z(i)) \\
\text{end do}
\]

A regular distribution of data, such as block, cyclic or block-cyclic, may not be the best
distribution for these problems because it may result in higher communication cost. A graph partitioning algorithm is normally used to determine the best distribution in the case of irregular applications. Such a distribution is called an irregular distribution. Due to the indirection in the array access, the communication pattern is not known statically at compile time. It depends on the values of the indirection arrays, which may not be known until runtime. Thus a runtime resolution scheme is needed to determine the communication required.

Runtime support for irregular problems has been studied by a number of researchers [93, 102, 34, 35, 68]. One way of detecting and performing the communication is by using an inspector-executor [68, 102] approach. A parallel loop is transformed into two constructs called an inspector and an executor. During program execution, the inspector examines the data references made by a processor, and calculates what off-processor data needs to be fetched and where that data will be stored once it is received. The executor loop then uses the information from the inspector to perform the actual communication and computation.

PARTI [35] is a library of runtime routines for solving irregular problems on distributed memory computers. PARTI primitives can be directly used to generate the inspector/executor pairs. Each inspector produces a communication schedule, which is essentially a pattern of communication for gathering or scattering data. In order to avoid duplicate accesses, a list of off-processor data references is stored locally for each processor in a hash table. For each new off-processor data reference required, a search through the hash table is performed in order to determine if this reference has already been accessed. If the reference has not previously been accessed, it is stored in the hash table, otherwise it is discarded. The primitives thus only fetch a single copy of each unique off-processor distributed array reference. The executor contains embedded PARTI primitives to communicate data. The primitives issue instructions to gather, scatter or accumulate (i.e. scatter followed by add) data according to the schedule created by the inspector. Latency or startup cost is reduced by packing small messages intended for the same destination into one large message.
2.3 Runtime Support for Compilers

Data-parallel languages like HPF [57] and pC++ [5] have recently been developed to provide support for high performance programming on parallel machines. These languages provide a framework for writing portable parallel programs independent of the underlying architecture and other idiosyncrasies of the machine. The same program can be run on different machines by simply using a different compiler for each machine. Compilers for such languages usually rely on runtime support to carry out many of the commonly required operations usually involving communication. Runtime support helps to separate the machine dependent aspects of compilation from the machine independent aspects. The compiler can do all the machine independent transformations and the runtime system can be optimized for each different machine. Thus, a portable and efficient compiler can be obtained by simply porting the runtime system to different machines.

We briefly describe some of the issues related to providing runtime support for an HPF compiler. We do not discuss runtime support for compiling irregular problems; that is explained in detail in [93]. We first outline the salient features of HPF in order to explain the runtime support needed for an HPF compiler.

2.3.1 Overview of HPF

HPF was designed to be a standard portable programming language for writing efficient computationally intensive parallel programs. HPF uses Fortran 90 as its base language and provides several extensions to Fortran 90. The new HPF language features fall into four categories with respect to Fortran 90: new directives, new language syntax, new library routines, and some language restrictions. The new directives are structured comments that suggest implementation strategies or assert facts about a program to the compiler. They may affect the efficiency of the computation performed, but do not change the value computed by the program. Compiler directives form the heart of the HPF language. They start with the prefix !HPF$, CHPF$ or *HPF$ which would actually be treated as comments in Fortran 90. Some of the
new language features are \texttt{FORALL} statement and \texttt{FORALL} construct. HPF extends
the Fortran 90 library of intrinsic functions to include additional reduction functions,
combining scatter functions, prefix and suffix functions, and sorting functions.

The HPF approach is based on two key observations. First, the overall efficiency
of the program can be increased if many operations are performed concurrently by
different processors, and secondly, the efficiency of a single processor is likely to be
the highest if the processor performs computations on data elements stored in its
local memory. Therefore, HPF provides means for explicit expression of parallelism
and data mapping. The data alignment and distribution directives in HPF allow the
programmer to inform the compiler how to partition arrays. Arrays are first aligned
to a \textit{template} or index space using the \texttt{ALIGN} directive. The \texttt{DISTRIBUTE} directive
specifies how the template is to be distributed among a set of abstract processors.
The mapping of abstract processors to physical processors is not specified by HPF
and is language-processor dependent. The combination of alignment (from arrays
to templates) and distribution (from templates to processors) thus determines the
mapping of an array to the processors. The data mapping is declared using the
directives: \texttt{PROCESSORS, ALIGN, DISTRIBUT}E and, optionally, \texttt{TEMPLATE}. In addition,
arrays may be remapped at runtime. For this, the array must be declared using
the \texttt{DYNAMIC} directive and the actual remapping is specified using the executable
directives \texttt{REALIGN} and \texttt{RE\textit{DISTRIBUTE}}.

In HPF, an array may be aligned with another in many ways including shifts,
strides, or any other linear combination of a subscript (ie., $n \times i + m$), transposition
of indices, and collapse or replication of array dimensions. Irregular alignments are not
allowed. The template can be distributed as \texttt{BLOCK, CYCLIC} or \texttt{CYCLIC}(\textit{m}). In a block
distribution, contiguous blocks of the array are distributed among the processors. In
a cyclic distribution, array elements are distributed among processors in a round-
robin fashion. In a cyclic(\textit{m}) distribution, blocks of size \textit{m} are distributed cyclically.
Irregular distributions are not allowed in version 1.0 of HPF.
2.3.2 Compiler and Runtime Support for HPF

An HPF compiler which translates in-core HPF programs to message passing node programs for distributed memory parallel computers has been developed at Syracuse University \[13, 15\]. It translates HPF programs to Fortran 77 programs with calls to a runtime library \[2, 14, 119\]. The compiler only exploits the parallelism expressed in data parallel constructs such as \texttt{FORALL}, \texttt{WHERE} and array assignment statements. It does not attempt to parallelize other constructs such as \texttt{DO} loops and \texttt{WHILE} loops, since they are used only as naturally sequential control constructs in HPF. The compiler mainly recognizes commonly occurring computation and communication patterns. These patterns are then replaced by calls to the optimized runtime support system routines. The runtime support system includes parallel intrinsic functions, data redistribution routines, communication primitives and several other miscellaneous routines.

The basic structure of the HPF compiler is organized around four major modules — parsing, data partitioning, communication detection and insertion, and code generation as shown in Figure 2.1. The compiler first creates a parse tree from the input HPF program. Also, each array assignment statement and \texttt{WHERE} statement is internally transformed into its equivalent \texttt{FORALL} statement, so that the subsequent steps only need to deal with \texttt{FORALL} statements. The partitioning module processes the data distribution directives, namely \texttt{TEMPLATE}, \texttt{DISTRIBUTE} and \texttt{ALIGN}, and partitions data and computation among processors. After partitioning, the parallel constructs in the node program are sequentialized since the code will be executed on a single processor. Array operations and \texttt{FORALL} statements in the original program are transformed into \texttt{DO} loops. The communication module detects communication requirements and inserts appropriate communication primitives.

Finally, the code generator produces \textit{loosely synchronous} \[44, 46\] SPMD code. The generated code is structured as alternating phases of local computation and communication. Local computations consist of operations by each processor on the data in its own memory. The communication phase includes any transfer of data among processors, possibly with arithmetic or logical computation on the data as it
is transferred (e.g. reduction functions). In such a model, processors do not need to synchronize during local computation. But, if two or more processors communicate with each other, they are implicitly synchronized during the communication.

**Communication Library**

The HPF compiler described above relies on a very powerful runtime support system which includes a library of collective communication routines [13], a library of intrinsic functions [2, 14] and other runtime routines such as for array redistribution [119]. The HPF compiler produces calls to collective communication routines instead of generating individual send and receive calls inside the compiled code. This is done
CHAPTER 2. ISSUES IN RUNTIME SUPPORT

for the following reasons:

1. *Improved performance*: To achieve good performance, interprocessor communication must be minimized. By developing a separate library of interprocessor communication routines, each routine can be optimized.

2. *Increased portability of the compiler*: By separating the communication library from the basic compiler design, portability is enhanced because to port the compiler, only the machine specific low-level communication calls in the library need to be changed.

3. *Better estimation of communication costs*: The cost of collective communication routines can be determined more precisely, which enables one to make a better estimate of the time a program will take.

The compiler must recognize the presence of collective communication patterns in the program in order to generate the appropriate communication calls. This involves a number of tests on the relationship among subscripts of various arrays in a FORALL statement. These tests also include information about array alignments and distributions. The compiler uses pattern matching techniques to detect communication patterns [13, 15].

**Intrinsic Library**

Intrinsic functions form an important feature of Fortran 90 and HPF. They directly support many of the basic data-parallel operations on arrays and provide a means for expressing operations on arrays concisely. HPF includes all Fortran 90 intrinsic functions and also adds a number of new intrinsic procedures in three categories: system inquiry intrinsics, mapping inquiry intrinsics, and computational intrinsics. The computational intrinsic functions fall into the following main categories:-

- *Simple Reduction Functions*: ALL, ANY, COUNT, MAXVAL, MINVAL, PRODUCT, SUM.
CHAPTER 2. ISSUES IN RUNTIME SUPPORT

- **Combining Scatter Functions**: ALL_SCATTER, ANY_SCATTER, COUNT_SCATTER, MAXVAL_SCATTER, MINVAL_SCATTER, PRODUCT_SCATTER, SUM_SCATTER

- **Prefix and Suffix Functions**: ALL_PREFIX, ANY_PREFIX, COUNT_PREFIX, MAXVAL_PREFIX, MINVAL_PREFIX, PRODUCT_PREFIX, SUM_PREFIX, ALL_SUFFIX, ANY_SUFFIX, COUNT_SUFFIX, MAXVAL_SUFFIX, MINVAL_SUFFIX, PRODUCT_SUFFIX, SUM_SUFFIX

- **Array Sorting Functions**: GRADE_UP, GRADE_DOWN

- **Array Manipulation Functions**: CSHIFT, EOSHIFT, TRANSPOSE.

- **Array Location Functions**: MAXLOC, MINLOC

- **Array Construction Functions**: SPREAD, MERGE, PACK, UNPACK

- **Vector and Matrix Multiplication Functions**: DOT_PRODUCT, MATMUL.

- **Bit Manipulation functions**: IAND, IOR, POPCNT, POPPAR, LEADZ

- **Elemental Intrinsics functions**: SIN, COS, TAN

It is necessary to have a library of these intrinsic functions which can be called from the node programs of a distributed memory machine [14]. The HPF compiler detects calls to intrinsic functions in the HPF program and replaces them with calls to these routines. Since arrays in HPF can have up to seven dimensions and can be distributed in many different ways, it is not feasible to write intrinsic libraries for all possible distributions. A more practical approach is to write optimized routines for a few distributions. If the distribution of an array is different from what is expected by the intrinsic library, the array must first be redistributed to the desired distribution, and after returning from the intrinsic, it must be redistributed back to the original distribution. It is essential to use efficient algorithms for redistribution, so as to minimize the redistribution overhead. Efficient algorithms for redistributing arrays are discussed in Chapter 3 of this thesis.
CHAPTER 2. ISSUES IN RUNTIME SUPPORT

2.4 Runtime Support for Out-of-Core Programs

An out-of-core program is one in which all the data required by the program cannot fit in main memory at the same time. Hence data needs to be stored in files on secondary storage such as disks. During program execution, only a portion of the data can be present in memory at any time. This makes it necessary to move data back and forth between main memory and disks. Since the cost of performing I/O is very high compared to computation and communication, this can prove to be very expensive. Hence it is necessary to do the I/O in out-of-core programs efficiently to get good performance.

The I/O performance is best when a processor makes large contiguous requests instead of many small requests. This is because of the high latency time associated with any I/O operation. In a parallel computer, when many processors need to do I/O simultaneously, there is the additional problem of contention for the I/O system. Hence it is necessary to schedule I/O requests of different processors as well as reorder and combine I/O requests within and across processors into large contiguous requests. We believe that this can best be done by having a library of optimized routines which can be directly called from an out-of-core program.

Another important factor is the portability of out-of-core programs. There is no standard parallel file system interface or Application Program Interface (API) at present. Each parallel machine has its own parallel file system with a completely different interface from that of any other parallel file system. Hence programs written directly using calls to the parallel file system are not portable to other systems. This problem can be overcome by using runtime support. The application programs can call a runtime library which provides a common interface for all machines. The routines in the runtime library can be implemented using the native parallel file system on each different machine.

Thus runtime support is needed in the case of out-of-core programs for efficiency and portability. As in the in-core case, this runtime support can also be used together with a compiler to translate out-of-core programs written in a high-level language like
HPF to node programs with calls to the runtime library for I/O and communication. Chapters 5 and 6 of this thesis describe in detail the design and implementation of runtime support for out-of-core programs, including a number of optimizations.

2.4.1 Out-of-Core Applications

There are a number of applications which deal with very large data sets, resulting in out-of-core programs. These applications exist in diverse areas such as large scale scientific computations, database and information processing, hypertext and multimedia systems, information retrieval and many others.

del Rosario and Choudhary [38] have done a survey of the I/O requirements of several Grand Challenge applications. They find that the data requirements of these applications range from 1 Gbyte to 4 Tbytes per run. Some of the applications they surveyed are:

• Imaging of Planetary Data: The spacecraft Magellan has gathered more than 3 Tbytes of data from the surface of the planet Venus. To produce a three-dimensional rendering of the surface of Venus at 200 Mbytes of data per frame would require over 13 Gbytes/sec. of I/O throughput at 50 frames per second [48].

• Climate Prediction: A climate prediction code using the General Circulation Model (GCM) has the following requirements on the Intel Touchstone Delta. For a 100-year atmosphere run with 300 km$^2$ resolution and 0.2 simulated years/machine hour, the simulation takes 3 weeks run time and generates 1144 Gbytes of data at 38 Mbytes per simulation minute. For a 1000-year coupled atmosphere-ocean run with a 150 km$^2$ resolution, the atmospheric simulation takes about 30 weeks, while ocean simulation takes 27 weeks; the process produces 40 Mbytes of data per simulation minute, or a total of 20 Tbytes of data for the entire simulation [40].
- **Four-Dimensional Data Assimilation**: This application from NASA incorporates actual space-time observations into mathematical and computational models in order to create a unified, complete description of the atmosphere. The algorithm for this currently operates on a 3 Tbyte database with single runs producing 100Mbytes to several Gbytes of output. These figures are expected to increase by orders of magnitude in the near future.

Cormen [30] has also compiled a list of several applications which deal with huge data sets. These include applications in computational biology, computational fluid dynamics, combinatorial search, conjugate gradient solvers, genetic algorithms, geophysics, region growing, logic simulation, computational physics, meteorology, molecular dynamics, ocean modeling, partial differential equations solvers, synthetic aperture radar image processing, visualization and graphics. The Applications Working Group of the Scalable I/O Initiative has provided a description of the I/O requirements of several applications in biology, chemistry, physics, earth sciences, engineering and graphics [97]. Fox [41] presents a performance analysis of applications on a hyper-cube machine with a hierarchical memory at each node, consisting of a fast cache and slower main memory. The applications considered are the N-body problem, Poisson’s equation solver, matrix multiplication, LU Decomposition, Fast Fourier Transform and neural networks. This performance analysis can be extended to out-of-core applications in which the memory hierarchy includes a much slower secondary memory.

There have also been studies of the file-access characteristics of application programs on multiprocessor systems. Kotz and Nieuwejaar [74] present the results of a three week tracing study in which all file-related activity on an Intel iPSC/860 running production, parallel scientific applications at NASA Ames Research Center was recorded. An interesting result of this study was that there were a large number of small strided requests to the file system. They found that 96.1% of all reads were for fewer than 4000 bytes, but those reads transferred only 2% of all data read. For writing, 89.4% of all requests were for fewer than 4000 bytes, but those writes transferred only 3% of all data written. Another study of the file-access characteristics
of production applications on the CM-5 at the National Center for Supercomputing Applications, University of Illinois, also found similar results of a large number of small requests [98]. This shows that although it is well-known that I/O should be done in large chunks to minimize the effect of high I/O latency, many parallel out-of-core applications actually access small strided data sets. Hence, it is necessary to provide runtime support for accessing small strided data efficiently. This issue is addressed in this thesis and optimizations, such as data sieving and collective I/O using an Extended Two-Phase Method, are proposed to access strided data in an efficient manner.
Chapter 3

Runtime Support for Array Redistribution

3.1 Introduction

In distributed memory parallel computers, arrays have to be distributed among processors in some fashion. The distribution can either be regular i.e. block, cyclic or block-cyclic as in Fortran D [43] and High Performance Fortran (HPF) [57, 67]; or irregular in which there is no function specifying the mapping of arrays to processors. The distribution of an array does not need to remain fixed throughout the program. In fact, it is very often necessary to change the distribution of the array at runtime, which is called array redistribution. This involves calculating source and destination processor and index sets as well as data communication. It is essential to use efficient algorithms for redistribution, otherwise the performance of the program may degrade considerably.

This chapter describes efficient algorithms for runtime array redistribution. Since array distribution and redistribution is rigorously defined in HPF, the algorithms are developed for redistributing arrays as defined in HPF. We consider block($m$) to cyclic, cyclic to block($m$) and the general cyclic($x$) to cyclic($y$) type redistributions. For the general cyclic($x$) to cyclic($y$) redistribution, there is no direct algebraic formula to calculate the source and destination processor and index sets [53]. We use a novel approach for doing the general cyclic($x$) to cyclic($y$) redistribution, where we first
develop optimized algorithms for two special cases — when \( x \) is a multiple of \( y \), or \( y \) is a multiple of \( x \). For the general case, we propose two algorithms called the \textit{GCD Method} and the \textit{LCM Method} which make use of the optimized algorithms developed for the above special cases. We initially describe algorithms for one-dimensional arrays and then extend the methodology to multidimensional arrays. The algorithms proposed have low runtime overhead. We study the performance of these algorithms on the Intel Touchstone Delta.

### 3.1.1 Need for Redistribution

HPF provides directives (\texttt{ALIGN} and \texttt{DISTRIBUTE}) which specify how arrays should be partitioned among the processors of a distributed memory computer. Arrays are first aligned to a \textit{template} or index space. The \texttt{DISTRIBUTE} directive specifies how the template is to be distributed among the processors. In HPF, an array (or template) can be distributed as \texttt{BLOCK}(m) or \texttt{CYCLIC}(m) \cite{57}. Though the distribution of an array is specified at compile time, there are a number of reasons for which it may be necessary to redistribute an array during the execution of the program.

- HPF has a directive called \texttt{REDISTRIBUTE} by which an array can be redistributed anywhere in the program provided the array was declared as \texttt{DYNAMIC}. \texttt{REDISTRIBUTE} can be considered as an executable directive.

- HPF provides intrinsic functions which operate on arrays. It is not practical to write intrinsic and runtime libraries for all possible distributions, especially since arrays can have up to seven dimensions. Libraries can be written for a few common distributions and for any other distribution it is necessary to redistribute before calling the subroutine. After returning from the subroutine it is necessary to redistribute back to the original distribution. We call this type of redistribution as a \textit{circular redistribution}. This is illustrated in Figure 3.1 which shows an HPF program calling the \texttt{MATMUL} intrinsic with all arrays distributed as \texttt{(BLOCK,*)}. The \texttt{MATMUL} library routines have been written for \texttt{(BLOCK,BLOCK)} and \texttt{(CYCLIC,CYCLIC)} distributions. So it is necessary to redistribute to one
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

Figure 3.1: Need for Redistribution

of these distributions before calling the intrinsic and then redistribute back to (BLOCK,*) after the intrinsic.

- Arrays and array sections can be passed as arguments to subroutines. The array (or array section) can be specified to have any distribution in the subroutine. If this distribution is different from that in the calling program, the array (or array section) needs to be redistributed before the subroutine is called. At the end of the subroutine, the array (or array section) needs to be redistributed to the original distribution.

- In many applications such as 2D FFT and the ADI method for solving multidimensional PDEs, dynamic redistribution can result in significant performance improvement [20].

An example of an HPF program using redistribution is shown in Figure 3.2. This is a two-dimensional FFT program in which the array is first block distributed along
REAL A(1024,1024)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK,*) ONTO P
!HPF$ DYNAMIC A

............

CALL ROW FFT(A)
!HPF$ REDISTRIBUTE A(*,BLOCK) ONTO P
CALL COL FFT(A)

............

END

Figure 3.2: 2D FFT Program using Redistribution

rows. A one-dimensional FFT is first performed along each row of the array, which can be done without any communication. The array is then redistributed so that it is block distributed along columns. A one-dimensional FFT is performed along each column of the redistributed array, which again does not require any communication.

3.2 Notations and Definitions

The notations used in this chapter are given in Figure 3.3. We assume that all arrays are indexed starting from 1, while processors are numbered starting from 0 and that arrays are identically aligned to the template. The algorithms can be easily extended for the general case. We also assume that the number of processors on which the array is distributed remains the same before and after the redistribution.

The block($m$) and cyclic($m$) distributions in HPF are defined as follows. Consider an array of size $N$ distributed over $P$ processors. Let us define the ceiling division
function $CD(j, k) = (j + k - 1)/k$ and the ceiling remainder function $CR(j, k) = j - k \times CD(j, k)$. Then block($m$) distribution means that index $j$ of the array is mapped to logical processor number $CD(j, m) - 1$. Note that for a valid block($m$) distribution, $m \times P \geq N$ must be true. Block by definition means the same as block($CD(N, P)$). In a cyclic($m$) distribution, index $j$ of the global array is mapped to logical processor number $MOD(CD(j, m) - 1, P)^1$. Cyclic by definition means the same as cyclic(1).

Suppose we have 4 processors and an array of length 26. Distributing the array as BLOCK results in this mapping of array elements to processors:

<table>
<thead>
<tr>
<th>Proc. 0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proc. 1</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
<td>12</td>
<td>13</td>
<td>14</td>
</tr>
<tr>
<td>Proc. 2</td>
<td>15</td>
<td>16</td>
<td>17</td>
<td>18</td>
<td>19</td>
<td>20</td>
<td>21</td>
</tr>
<tr>
<td>Proc. 3</td>
<td>22</td>
<td>23</td>
<td>24</td>
<td>25</td>
<td>26</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

---

$^1 MOD(a, b) = a \text{ modulo } b$
Distributing the array as CYCLIC results in this mapping of array elements to processors:

<table>
<thead>
<tr>
<th>Proc. 0</th>
<th>1</th>
<th>5</th>
<th>9</th>
<th>13</th>
<th>17</th>
<th>21</th>
<th>25</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proc. 1</td>
<td>2</td>
<td>6</td>
<td>10</td>
<td>14</td>
<td>18</td>
<td>22</td>
<td>26</td>
</tr>
<tr>
<td>Proc. 2</td>
<td>3</td>
<td>7</td>
<td>11</td>
<td>15</td>
<td>19</td>
<td>23</td>
<td></td>
</tr>
<tr>
<td>Proc. 3</td>
<td>4</td>
<td>8</td>
<td>12</td>
<td>16</td>
<td>20</td>
<td>24</td>
<td></td>
</tr>
</tbody>
</table>

Distributing the array as CYCLIC(3) results in this mapping of array elements to processors:

<table>
<thead>
<tr>
<th>Proc. 0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>13</th>
<th>14</th>
<th>15</th>
<th>25</th>
<th>26</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proc. 1</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>16</td>
<td>17</td>
<td>18</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Proc. 2</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>19</td>
<td>20</td>
<td>21</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Proc. 3</td>
<td>10</td>
<td>11</td>
<td>12</td>
<td>22</td>
<td>23</td>
<td>24</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

In other words, in a block distribution, contiguous blocks of the array are distributed among the processors. In a cyclic distribution, array elements are distributed among processors in a round-robin fashion. In a cyclic(m) distribution, blocks of size m are distributed cyclically. The cyclic(m) distribution with $1 < m < \lceil N/P \rceil$ is commonly referred to as a block-cyclic distribution with block size m [43]. Block and cyclic distributions are special cases of the general cyclic(m) distribution. A cyclic(m) distribution with $m = \lceil N/P \rceil$ is a block distribution and a cyclic(m) distribution with $m = 1$ is a cyclic distribution. The formulae for conversion between local and global indices for the different distributions are given in Table 3.1.

The redistribution algorithms proposed in this thesis are intended to be portable. Hence, we do not specify how data communication should be performed because the best communication algorithms are often machine dependent. Redistribution requires all-to-many personalized communication in general and in many cases it requires all-to-all personalized communication. These communication algorithms are described in detail in Chapter 4 of this thesis and in [117, 95, 120]. The performance results presented in this chapter have been obtained using these algorithms. We do assume that all the data to be sent from any processor $i$ to processor $j$ has to be
Table 3.1: Data Distribution and Index Conversion

Note: This assumes that arrays are indexed starting from 1 and processors are numbered starting from 0.

\[ CD(j, k) = (j + k - 1)/k \quad \text{and} \quad CR(j, k) = j - k \times CD(j, k) \]

<table>
<thead>
<tr>
<th>global index ((g)) to processor number ((p))</th>
<th>(\text{BLOCK}(m))</th>
<th>(\text{CYCLIC})</th>
<th>(\text{CYCLIC}(m))</th>
</tr>
</thead>
<tbody>
<tr>
<td>(l = m + CR(g, m))</td>
<td>(p = CD(g, m) - 1)</td>
<td>(p = \text{MOD}(g - 1, P))</td>
<td>(p = \text{MOD}(CD(g, m) - 1, P))</td>
</tr>
<tr>
<td>(g = l + m)</td>
<td>(l = (g - 1)/P + 1)</td>
<td>(l = \text{MOD}(g - 1, m) + 1 + (g/(mP))m)</td>
<td></td>
</tr>
</tbody>
</table>

collected together in a contiguous packet in processor \(i\) and sent in one communication operation, so as to minimize the communication startup cost. In the rest of this chapter, any division involving integers should be considered as integer division unless specified otherwise.

### 3.3 Block\((m)\) to Cyclic Redistribution

We first consider the case of block\((m)\) to cyclic redistribution, a few examples of which are shown in Figure 3.4.

**Theorem 3.1** Let \(l_{11}\) and \(l_{12}\) be the local array sizes in processor \(p_i\) corresponding to block\((m)\) and cyclic distributions respectively. In a block\((m)\) to cyclic redistribution, the number of processors to which \(p_i\) has to send data is

\[ P - 1 \quad \text{if} \quad l_{11} \geq P \]
\[ l_{11} \text{ or } l_{11} - 1 \quad \text{if} \quad l_{11} < P \]

The number of processors from which \(p_i\) has to receive data is at most

\[ \begin{align*}
  CD(N, m) & \quad \text{if} \quad l_{12} \geq CD(N, m) \\
  l_{12} & \quad \text{if} \quad l_{12} < CD(N, m)
\end{align*} \]
Figure 3.4: Block(m) to Cyclic Redistribution
Proof: Note that if $N$ is not a multiple of $P$, $l_1$ may not be equal to $l_2$. If $l_1 < P$, each of the $l_1$ elements of $p_i$ corresponding to a block$(m)$ distribution will lie in a different processor when the array is distributed cyclically. At most one of the $l_1$ elements will be remapped to processor $p_i$ itself. Therefore, $p_i$ will have to send data to either $l_1$ or $l_1 - 1$ processors. If $l_1 \geq P$, then clearly $p_i$ has at least one element to send to every other processor.

In a block$(m)$ distribution, the number of processors with data is given by $CD(N, m)$. Therefore, in the receive phase, if $l_2 < CD(N, m)$, each processor will receive data from at most $l_2$ processors. If $l_2 \geq CD(N, m)$, each processor will receive data from at most $CD(N, m)$ processors. \hfill \square

The algorithm for block$(m)$ to cyclic redistribution is given in Figure 3.5. In the send phase, each processor only needs to calculate the destination processor of the first element of the local array. The other elements have to be sent to the other processors in a round-robin fashion. Thus the array is scanned only once, which makes good use of the cache. In the receive phase, the data received from other processors has to be stored in contiguous memory locations in order of logical processor number. In every processor, the data received from processor 0 is stored first; from processor 1 second and so on. Hence addresses do not need to be calculated. If the amount of data to be received from all processors is known, the data can be directly received into appropriate locations in the array.

In a block$(m)$ distribution, the last element $N$ of the global array is mapped to processor $p_N = (N - 1)/m$. If $p_N < P - 1$, no elements of the array are mapped to processors $p_N + 1, p_N + 2, ..., P - 1$. The index of the last element of the local array in processors $p < p_N$ is $last\_index = m$. The index of the last element in $p_N$ is $last\_index = MOD(N - 1, m) + 1$. The index of the first element of $p_i \leq p_N$ that is mapped to $p$ in a cyclic distribution is given by

$$first\_index = MOD[p - MOD\{p, MOD(m, P), P\} + P, P] + 1$$
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

send phase

1. The destination processor \( p_d \) of the first element of the local array is \( p_d = \text{MOD}(p, m, P) \).
2. Destination processor of element \( i, i = 2 : N \), is \( \text{MOD}(p + i, P) \).

receive phase

1. Last processor with data is \( p_N = (N - 1)/m \)
2. For \( p_s = 0 \) to \( p_N \) do
3. If \( (p_s = p_N) \) then
4. \( \text{last\_index} = \text{MOD}(N - 1, m) + 1 \)
5. Else
6. \( \text{last\_index} = m \)
7. The index of the first element of \( p_s \) mapped to \( p \) is
   \[ \text{first\_index} = \text{MOD}[p - \text{MOD}\{p, \text{MOD}(m, P), P\} + P, P] + 1 \]
8. Number of elements to be received from \( p_s \) is 0, if \( (\text{last\_index} < \text{first\_index}) \),
   and \( (\text{last\_index} - \text{first\_index})/P + 1 \), otherwise
9. No data is to be received from processors \( p_N + 1, p_N + 2, ..., p - 1 \).
10 The received data is to be stored in order of processor number.

Figure 3.5: Algorithm for Block(m) to Cyclic Redistribution

If \( m \) is divisible by \( P \), the first element of \( p_s \) that is mapped to \( p \) is \( p + 1 \). However, if \( m \) is not divisible by \( P \), a shift is introduced in this simple mapping which is taken into account by the \( \text{MOD} \) expression in the above equation. Hence, the number of elements to be sent from any processor \( p_s \) to \( p \) is

\[
0, \quad \text{if } (\text{last\_index} < \text{first\_index}) \text{ or } (p_s > p_N) \\
(\text{last\_index} - \text{first\_index})/P + 1, \quad \text{otherwise}
\]

where \( \text{first\_index}, \text{last\_index} \) and \( p_N \) are calculated as above.
3.4 Cyclic to Block\((m)\) Redistribution

A cyclic to block\((m)\) redistribution is essentially the reverse of a block\((m)\) to cyclic redistribution. The algorithm for cyclic to block\((m)\) redistribution is given in Figure 3.6. In a cyclic distribution, the size of the local array in processor \(p\) is

\[
L = \begin{cases} 
\left\lceil \frac{N}{P} \right\rceil & \text{if } \text{MOD}(N, P) = 0, \text{ or } p < \text{MOD}(N, P) \\
\left\lceil \frac{N}{P} \right\rceil - 1 & \text{otherwise}
\end{cases}
\]

In the send phase, each processor \(p\) calculates the destination processor \(p_d\) and the destination local address \(l_d\) of the first element of its local array as \(p_d = CD(p + 1, m) - 1\) and \(l_d = m + CR(p + 1, m)\). The first \((m - l_d)/P + 1\) elements from \(p\) have to be sent to \(p_d\). The next set of elements starting from \(i = (m - l_d)/P + 2\) have to be sent to processor \(p_{d1} = CD((i - 1)P + p + 1, m) - 1\). The destination local address of element \(i\) is given by \(l_{d1} = m + CR((i - 1)P + p + 1, m)\) and so \((m - l_{d1})/P + 1\) elements starting from \(i\) have to be sent to processor \(p_{d1}\). This process is continued for the remaining elements of the array. Thus blocks of elements have to be sent to different processors.

In the receive phase, the data received cannot be directly stored in the array as the data has to be stored with a stride equal to the number of processors. Hence the data received has to be stored in a temporary buffer in memory. This gives us two choices in implementing the receive phase:

1. **Synchronous Method:** Each processor waits till it receives data from all other processors, before placing any data in the local array. This increases the memory requirements of the algorithm and also increases processor idle time. These problems worsen as the number of processors is increased, so this method is not scalable.

2. **Asynchronous Method:** The processors do not wait for data from all processors to arrive. Instead, each processor takes any packet which has arrived and places the data from that packet into appropriate locations in the local array. This
Send Phase

1. \( i = 1 \)
2. While \((i \leq L)\) do
3. The destination processor \((p_d)\) and destination local address \((l_d)\) of element \(i\) of the local array is \(p_d = CD((i - 1)P + p + 1, m) - 1\) and \(l_d = m + CR((i - 1)P + p + 1, m)\)
4. The destination processor of elements \(i\) to \(i + (m - l_d)/P\) is \(p_d\).
5. \(i = i + (m - l_d)/P + 1\)

Receive Phase

**Synchronous Method:**
1. Receive packets from all processors.
2. The source processor of the first element of the local array \(p_s = MOD(m, p, P)\).
3. The source processor of element \(i, i = 2 : N,\) is \(p_s = MOD(p_s + 1, P)\).

**Asynchronous Method:**
1. The source processor of the first element of the local array is \(p_s = MOD(m, p, P)\).
2. The source processor of the \(k^{th}\) element, \(2 \leq k \leq P,\) is \(MOD(p_s + k - 1, P)\).
3. For \(i = 0\) to \(P - 1\) do
4. Receive a packet from any processor \(p_i\)
5. Starting from the location calculated above, place elements from the packet into the array with stride \(P\).

Figure 3.6: Algorithm for Cyclic to Block (\(m\)) Redistribution

method overlaps computation and communication. Each processor posts non-blocking receive calls and waits for any packet to arrive. As soon as a packet is received, it places the data in appropriate locations in the local array. During this time, other packets may reach the processor. When the processor has placed all the data from the earlier packet into the local array, it takes up the next packet and so on. This reduces processor idle time. Since all packets do not have to be in memory at the same time, it also reduces memory requirements. This method is scalable as neither processor idle time nor memory requirements increase as the number of processors is increased.
The array locations where incoming data has to be stored can be calculated as follows. The source processor \(p_s\) of the first element of the local array is given by
\[ p_s = MOD(mp, P). \]
The next \((P - 1)\) elements will be received from the remaining processors in order of processor number. This cycle is repeated for all elements of the local array. If all packets are present in memory (Synchronous Method), the local array needs to be scanned only once to be filled. If the packets are processed one at a time (Asynchronous Method), the array elements are filled with stride \(P\) and the array has to be scanned \(P\) times. So, clearly the Synchronous Method makes better use of the cache than the Asynchronous Method. Figure 3.7 compares the performance of the Synchronous and Asynchronous Methods on the Intel Touchstone Delta for a global array with 1M \(2^{20}\) integers and the number of processors varied between 8 and 128. We observe that the Asynchronous Method performs much better than the Synchronous Method as it overlaps computation and communication and thus reduces processor idle time. This difference is larger for a small number of processors because the amount of data communicated per processor is larger.
3.5 Cyclic($x$) to Cyclic($y$) Redistribution

For a general cyclic($x$) to cyclic($y$) redistribution, there is no direct algebraic formula to calculate the set of elements to send to a destination processor and the local addresses of these elements at the destination [53]. Several complex schemes have been proposed for performing this redistribution [110, 111, 23, 99]. Since redistribution has to be done at runtime, a simple and efficient algorithm with minimum runtime overhead is necessary. We propose a new method for performing the general cyclic($x$) to cyclic($y$) redistribution, which has low runtime overhead. We first consider two special cases where $x$ is a multiple of $y$, or $y$ is a multiple of $x$, and develop optimized algorithms for these two special cases. For the general case where there is no particular relation between $x$ and $y$, we have developed two algorithms called the GCD Method and the LCM Method, which make use of the optimized algorithms developed for the above two special cases.

3.5.1 Special case $x = k \cdot y$

We first consider the special case where $x$ is a multiple of $y$. Let $x = k \cdot y$. An example of this is given in Figure 3.8.

Theorem 3.2 In a cyclic($x$) to cyclic($y$) redistribution where $x = k \cdot y$, if $k < P$, each processor communicates with $k$ or $k - 1$ processors. If $k \geq P$, each processor communicates with all other processors.

Proof: Assume $k < P$. Since $x = k \cdot y$, each block of size $x$ is divided into $k$ sub-blocks of size $y$ and distributed cyclically. Consider any processor $p_i$. Assume that it has to send its first sub-block of size $y$ to processor $p_j$. Then the remaining $k - 1$ sub-blocks of the first block are sent to the next $k - 1$ processors in order. The next $k(P - 1)$ sub-blocks of the global array are located in the other $P - 1$ processors. This results in a total of $k \cdot P$ sub-blocks. Hence the $(k + 1)^{th}$ sub-block of size $y$ in $p_i$ is also sent to $p_j$. Thus all sub-blocks from $p_i$ are sent to $k$ processors starting from
p_i. One of these processors may be p_i itself, in which case p_i has to send data to k - 1 processors. For the receive phase, consider the first kP sub-blocks of size y in the global array corresponding to the first P blocks of size x. Let us number these kP sub-blocks from 0 to kP - 1. Out of these, the sub-blocks that are mapped to processor p_i in the new distribution are numbered p_i to P(k - 1) + p_i with stride P. These sub-blocks come from \( \frac{P(k-1)+p_i - p_i}{P} + 1 = k \) processors. One of these processors might be p_i itself, in which case p_i receives data from k - 1 processors.

If \( k \geq P \), each block of size x has to be divided into k sub-blocks and distributed cyclically, where the number of sub-blocks is greater than or equal to the number of processors. So clearly each processor has to send to and receive from all other processors (all-to-all communication).

The algorithm for cyclic(x) to cyclic(y) redistribution, where \( x = ky \) is given in Figure 3.9. We call this the KY TO Y algorithm. In the send phase, each processor p calculates the destination processor p_d of the first element of its local array as p_d = MOD(ki, P). The first y elements have to be sent to p_d, the next y to \( MOD(p_d + 1, P) \), the next to \( MOD(p_d + 2, P) \) and so on till the end of the first block of size x. The next k sub-blocks of size y have to be sent to the same set of k processors starting from p_d. The sequence of destination processors can be stored and need not be calculated for each block of size x. In the receive phase there are two cases depending on the value of k :-

![Image of Cyclic(4) and Cyclic(2) Redistribution]
1. \((k \leq P)\) and \((\text{MOD}(P,k) = 0)\): In this case, each processor \(p\) calculates the source processor of the first block of size \(y\) of its local array as \(p_s = p/k\). The next block of size \(y\) will come from processor \(\text{MOD}(p_s + P/k, P)\), the next from \(\text{MOD}(p_s + 2(P/k), P)\) and so on till the first \(k\) blocks. The above sequence of processors is repeated for the remaining sets of \(k\) blocks of size \(x\) and hence can be stored and used. If the Synchronous Method is used for receiving data, the local array needs to be scanned only once and the \(i^{th}\) block, \(0 \leq i \leq \lceil L/y \rceil - 1\), of size \(y\) of the local array will be read from the packet received from processor \(\text{MOD}(p_s + i(P/k), P)\). If the Asynchronous Method is used, the first block from the packet received from some processor \(p_s\) will be stored starting at the location calculated above. The remaining blocks will be stored with stride \(x\).

2. If \(k\) does not satisfy the above condition, it is necessary to calculate the source processor of the first element \((j = iy + 1)\) of each block of size \(y\), \(0 \leq i \leq \lceil L/y \rceil - 1\), of the local array as \(p_s = \text{MOD}[(iP + p)/k, P]\). The block is read from the packet received from \(p_s\). The sequence of processors does not repeat itself and hence cannot be stored. In this case, the Synchronous Method is used.

We have tested the performance of the KY\_TO\_Y Algorithm using both Synchronous and Asynchronous Methods on the Intel Touchstone Delta. Figure 3.10 compares the performance of the Synchronous and Asynchronous Methods for a cyclic(4) to cyclic(2) redistribution of a global array of 1M integers for different number of processors. We observe that the Asynchronous Method performs better than the Synchronous Method, even though in this case each processor communicates with at most two other processors. This is because the Asynchronous Method overlaps computation and communication and thus reduces processor idle time. The better cache utilization of the Synchronous Method does not improve its overall performance. Figure 3.11 shows the performance of the KY\_TO\_Y Algorithm for a cyclic(4) to cyclic(2) redistribution on 64 processors for different array sizes. For small arrays, the difference in performance between the Synchronous and Asynchronous Methods is small, because of the small data sets. For large arrays, the difference is significant because
Send Phase

1. The destination processor \((p_d)\) of the first element of the local array is
\[ p_d = \text{MOD}(kp, P). \]
2. For each block of size \(x\) in the local array
3. For \(i = 0\) to \(k - 1\)
4. The destination processor of elements \((iy + 1)\) to \((i + 1)y\) of this block of size \(x\) is \(\text{MOD}(p_d + i, P)\).

Receive Phase

1. If \((k \leq P)\) and \(\text{MOD}(P, k) = 0\) then
2. The source processor \((p_s)\) of the first element of the local array is \(p_s = p/k\).

Synchronous Method:
3. Receive data from all processors.
4. For \(j = 1\) to \([L/x]\) do
5. For \(i = 0\) to \(k - 1\) do
6. Read the next block of size \(y\) from the data received from processor \(\text{MOD}(p_s + i(P/k), P)\).

Asynchronous Method:
3. The \(i^{th}\) block of size \(y\), \(0 \leq i \leq k - 1\), is to be received from processor \(\text{MOD}(p_s + i(P/k), P)\).
4. For \(i = 0\) to \(k - 1\) do
5. Receive data from any processor \(p_s\).
6. Place the first block of size \(y\) in the local array starting from the location calculated above, and the other blocks with stride \(x\).
7. Else
8. Receive data from all processors.
9. For \(i = 0\) to \([L/y] - 1\) do
10. The source processor \((p_s)\) of the first element \((j = iy + 1)\) of this block of size \(y\) is \(p_s = \text{MOD}[(iP + p)/k, P]\)
11. Read this block of size \(y\) from the data received from \(p_s\).

Figure 3.9: KY_TO_Y Algorithm for Cyclic\((x)\) to Cyclic\((y)\) Redist., where \(x = k\ y\)
CHAPTER 3.  RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

Figure 3.10: KY_TO_Y Algorithm, cyclic(4) to cyclic(2), array size 1M integers

of the higher processor idle time in the Synchronous Method.

3.5.2 Special case $y = kx$

We now consider the special case where $y$ is a multiple of $x$. Let $y = kx$. This is essentially the reverse of the case where $x = ky$.

**Theorem 3.3** In a cyclic($x$) to cyclic($y$) redistribution where $y = kx$, if $k < P$, each processor sends data to $k$ or $k−1$ processors and receives data from $k$ or $k−1$ processors. If $k \geq P$, each processor has to communicate with all other processors (all-to-all communication).

**Proof:** Assume $k < P$. Consider the first $kP$ sub-blocks of size $x$ in the global array corresponding to the first $P$ sub-blocks of size $y$. Let us number these $kP$ sub-blocks from 0 to $kP−1$. Out of these, the sub-blocks that are located in processor $p_i$ are numbered from $p_i$ to $P(k−1)+p_i$ with stride $P$. In the new distribution, these sub-blocks will be mapped to $\frac{(P(k−1)+p_i)−p_i}{P}+1 = k$ processors. One of these processors
might be $p_i$ itself, in which case $p_i$ sends data to $k-1$ processors. In the receive phase, since $y = k \times x$, each block of size $y$ in the new distribution consists of $k$ sub-blocks of size $x$ which will come from $k$ processors. Consider any processor $p_i$. Assume that it receives its first sub-block of size $x$ from processor $p_j$. Then the remaining $k-1$ sub-blocks of the first block are received from the next $k-1$ processors in order. The other $P-1$ processors receive the next $k(P-1)$ sub-blocks of the global array. This results in a total of $kP$ sub-blocks. Hence the next sub-block in $p_i$, which is the first sub-block of the next block of size $y$, is also received from $p_j$. Thus all sub-blocks from $p_i$ are received from $k$ processors starting from $p_j$. One of these processors may be $p_i$ itself in which case $p_i$ receives data from $k-1$ processors.

If $k \geq P$, each block of size $y$ will consist of $k$ sub-blocks of size $x$, where the number of sub-blocks is greater than or equal to the number of processors. So clearly each processor has to send to and receive from all other processors (all-to-all communication).
The algorithm for cyclic($x$) to cyclic($y$) redistribution, where $y = k \times x$, is given in Figure 3.12. We call this the X_TO_KX Algorithm. In the send phase, there are two cases depending on the value of $k$:

1. $(k \leq P)$ and $(MOD(P, k) = 0)$: In this case, each processor $p$ calculates the destination processor of the first block of size $x$ of its local array as $p_d = p/k$. The next block of size $x$ has to be sent to processor $MOD(p_d + P/k, P)$, the next to $MOD(p_d + 2(P/k), P)$ and so on till the first $k$ blocks. The above sequence of processors is repeated for the remaining sets of $k$ blocks of size $x$, and hence need not be calculated again.

2. If $k$ does not satisfy the above condition, it is necessary to individually calculate the destination processor of each block $i$ of size $x$, $0 \leq i \leq \lceil L/x \rceil - 1$, as $p_d = MOD([i \times P + p]/k, P]$.

In the receive phase, each processor $p$ calculates the source processor of the first element of its local array as $p_s = MOD[k \times p, P]$. As in the KY_TO_Y algorithm, the receive phase can be implemented using either the Synchronous Method or the Asynchronous Method. If the Synchronous Method is used to receive data, for each block of size $y$ of the local array, the $k$ sub-blocks of size $x$ are read from the packets received from the $k$ processors starting from $p_s$ in order of processor number. If the Asynchronous Method is used, we know that the $i^{th}$ block of size $x$ of the local array, $0 \leq i \leq k - 1$, will be received from processor $MOD(p_s + i, P)$. Thus the local index of the first block received from any source processor can be calculated. The remaining blocks have to be stored with stride $y$.

We have tested the performance of the X_TO_KX Algorithm on the Intel Touchstone Delta for different array sizes and number of processors. Figure 3.13 compares the performance of the Synchronous and Asynchronous Methods for a cyclic($2$) to cyclic($4$) redistribution of an array of 1M integers for different number of processors. Figure 3.14 compares the performance of the two methods for different array sizes on 64 processors. The results are similar to those obtained for the KY_TO_Y Algorithm.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

Send Phase

1. If \((k \leq P)\) and \((MOD(P, k) = 0)\) then
2. The destination processor \((p_d)\) of the first element of the local array is \(p_d = p/k\).
3. For \(j = 0\) to \([L/y] - 1\)
4. \hspace{1em} For \(i = 0\) to \(k - 1\)
5. \hspace{2em} The destination processor of the next block of size \(x\) of the local array is \(MOD(p_d + i(P/k), P)\).
6. Else
7. \hspace{1em} For \(i = 0\) to \([L/x] - 1\)
8. \hspace{2em} The destination processor \((p_d)\) of the first element \((j = i x + 1)\) of this block of size \(x\) is \(p_d = MOD[(i P + p)/k, P]\).

Receive Phase

1. The source processor \((p_s)\) of the first element of the local array is element of the local array is \(p_s = MOD[k p, P]\).

Synchronuous Method:
2. Receive data from all processors.
3. For each block of size \(y\) in the local array do
4. \hspace{1em} For \(i = 0\) to \(k - 1\) do
5. \hspace{2em} Read elements \((i x + 1)\) to \((i + 1)x\) of the current block of size \(y\) from the packet received from processor \(MOD(p_s + i, P)\).

Asynchronous Method:
2. The \(i^{th}\) block of size \(x\), \(0 \leq i \leq k - 1\), is to be received from processor \(MOD(p_s + i, P)\).
3. For \(i = 0\) to \(k - 1\) do
4. \hspace{1em} Receive data from any processor \(p_i\).
5. \hspace{2em} Place the first block of size \(x\) in the local array starting from the location calculated above, and the other blocks with stride \(y\).

Figure 3.12: X TO KX Algorithm for Cyclic(x) to Cyclic(y) Redist., where \(y = k x\)
Figure 3.13: X_TO_KX Algorithm, cyclic(2) to cyclic(4), array size 1M integers

The Asynchronous Method is found to perform better in all cases.

3.5.3 General Case

Let us consider the general case of a cyclic($x$) to cyclic($y$) redistribution in which there is no particular relation between $x$ and $y$ (Figure 3.15). One algorithm for doing this is to explicitly calculate the destination and source processor of each element of the local array, using the formulae given in Table 3.1. We call this General Algorithm and is described below.

General Algorithm

In the send phase, the destination processor of each element of the local array can be determined by first calculating its global index based on the current distribution and then the destination processor based on the new distribution. These two calculations can be combined into a single expression to give the destination processor of element $i$ of the local array as $p_d = MOD[\{MOD(i-1, x) + (P((i-1)/x) + p)x + y\}/y-1, P]$. Similarly in the receive phase, the source processor of each element of the local array
can be determined by first calculating its global index based on the new distribution and then the source processor based on the old distribution. These two calculations can be combined into a single expression to give the source processor of element \(i\) of the local array as

\[ p_s = \text{MOD}([\text{MOD}(i-1, y) + (P((i-1)/y) + p)y + x]/x - 1, P). \]

The drawback of this algorithm is that calculations are needed individually for all elements of the array. We propose two algorithms called the GCD Method and the LCM Method, which make use of the optimized KY_TO_Y and X_TO_KX algorithms and hence require a lot less calculations.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

<table>
<thead>
<tr>
<th>GCD Method</th>
<th>LCM Method</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. Let $m = GCD(x, y)$.</td>
<td>1. Let $m = LCM(x, y)$.</td>
</tr>
<tr>
<td>2. Redistribute from cyclic($x$) to cyclic($m$) using the KY_TO_Y Algorithm.</td>
<td>2. Redistribute from cyclic($x$) to cyclic($m$) using the X_TO_KX Algorithm.</td>
</tr>
<tr>
<td>3. Redistribute from cyclic($m$) to cyclic($y$) using the X_TO_KX Algorithm.</td>
<td>3. Redistribute from cyclic($m$) to cyclic($y$) using the KY_TO_Y Algorithm.</td>
</tr>
</tbody>
</table>

Figure 3.16: GCD and LCM Methods

GCD Method

This method takes advantage of the fact that we have developed special optimized algorithms for a cyclic($x$) to cyclic($y$) redistribution when $x = ky$ (the KY_TO_Y Algorithm) and $y = kx$ (the X_TO_KX Algorithm). In the GCD Method, the redistribution is done as a sequence of two phases — cyclic($x$) to cyclic($m$) followed by cyclic($m$) to cyclic($y$), where $m = GCD(x, y)$. Since both $x$ and $y$ are multiples of $m$, the KY_TO_Y Algorithm can be used for the cyclic($x$) to cyclic($m$) phase and the X_TO_KX Algorithm can be used for the cyclic($m$) to cyclic($y$) phase. This is described in Figure 3.16. The GCD Method involves the cost of having to do two separate redistributions. For small arrays, the increased communication may outweigh the savings in computation, but for large arrays in some cases, the performance is better than that of the General Method. This can be observed from Figure 3.17 which shows the performance of a cyclic(15) to cyclic(10) redistribution, for an array of size 1M integers on the Delta. We see that for small number of processors, the GCD Method performs considerably better than the General Method because of the saving in the amount of computation per processor. Since the size of the global array is kept constant, as the number of processors is increased, the size of the local array in each processor becomes smaller and each processor spends less time on address calculation. Hence the improvement of the GCD Method over the General Method is also small.
CHAPTER 3.  RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

One disadvantage of the GCD Method is that in the intermediate cyclic($m$) distribution, the block size $m$ is smaller than both $x$ and $y$. In the KY_TO_Y and X_TO_KX algorithms, all the address and processor calculations are done for blocks of size $x$ or $y$. Since $m$ is the GCD of $x$ and $y$, $m$ can even be equal to 1 in some cases. When $m = 1$, calculations have to be done for each element, which is no better than in the General Method. In this case the General Method performs better than the GCD Method. Figure 3.18 shows an example of cyclic(11) to cyclic(3) redistribution on the Delta for an array of size 1M integers. Since the GCD of 11 and 3 is 1, we find that the General Method always performs better than the GCD Method.

**LCM Method**

The key to getting good performance in this two-phase approach for redistribution is to have a large value for $m$. One way of ensuring that $m$ is always large is by choosing $m$ as the LCM of $x$ and $y$. Since $m$ is a multiple of both $x$ and $y$, the X_TO_KX Algorithm can be used for the cyclic($x$) to cyclic($m$) phase and the KY_TO_Y algorithm can be used for the cyclic($m$) to cyclic($y$) phase. This is described in Figure 3.16.
Also, since $m$ is larger than $x$ and $y$, all calculations are done for this larger block size. This results in fewer calculations than in the GCD and General Methods. Figures 3.17 and 3.18 compare the performance of the LCM, GCD and General Methods for an array of 1M integers on different number of processors. We observe that the LCM Method performs better in all cases. Figure 3.19 compares the performance of the LCM and General Methods for a cyclic(11) to cyclic(3) redistribution keeping the number of processors fixed at 64 and varying the array size. We observe that for small arrays, the General Method performs better because it has less communication, but for large arrays the LCM method performs better because the saving in computation is higher than the increase in communication.

### 3.6 Redistribution of Multidimensional Arrays

The redistribution of two and higher dimensional arrays can be classified into two types :-

![Figure 3.18: GCD, LCM and General Methods; cyclic(11) to cyclic(3); 1M array](image-url)
Figure 3.19: LCM v/s General Methods, cyclic(11) to cyclic(3), 64 processors

1. **Shape Retaining**: The shape of the local array remains unchanged after the redistribution, e.g. (block,block) to (cyclic,cyclic).

2. **Shape Changing**: The shape of the local array changes after the redistribution, e.g. (block,*) to (*,block) where '*' indicates that the corresponding dimension is collapsed. This type of redistribution is used for example in 2D FFT and the ADI method for solving multidimensional partial differential equations.

The shape retaining and shape changing redistributions are quite different from each other and require different algorithms.

### 3.6.1 Shape Retaining Redistribution

A shape retaining redistribution may involve redistribution in only one dimension {e.g. (block,block) to (cyclic,block)} or more than one dimension {e.g. (block,block) to (cyclic,cyclic)}. If the redistribution is only along one dimension, it is similar to the redistribution of one-dimensional arrays and the same algorithms described earlier can be used. If both dimensions have to be redistributed, it can either be done directly.
or indirectly as a series of one-dimensional redistributions. In the Indirect Method, the array is redistributed separately along each dimension. For example, if an array has to be redistributed from (block, block) to (cyclic, cyclic), it is first redistributed to (block, cyclic) and then to (cyclic, cyclic). This method has the advantage that all the optimizations developed for one-dimensional arrays in the previous sections can be easily extended to multidimensional arrays. The order in which the dimensions are redistributed does not affect the performance. This is because the order of dimensions chosen only results in a different set of data values being communicated, and does not affect the amount or type of communication. So, one could also do the above redistribution as (block, block) to (cyclic, block) to (cyclic, cyclic).

In the Direct Method, data is sent directly to the destination processor corresponding to the new distribution. Hence the optimized algorithms developed for the one-dimensional case cannot be used. This method requires different algorithms for different number of dimensions and different types of redistributions, and these algorithms cannot be optimized much. However, data needs to be communicated
only once in the Direct Method. Figure 3.20 compares the performance of the Direct and Indirect Methods for redistributing an array of size 1K × 1K integers from (block,block) to (cyclic,cyclic) on the Touchstone Delta. The Indirect Method is found to perform much better even though data is communicated twice. This is because the algorithms for one-dimensional redistribution are highly optimized and a lot of the communication during each one-dimensional redistribution actually takes place in parallel.

Another interesting observation comes from Figure 3.21 which compares the performance of the Direct and Indirect Methods for a (cyclic(11),cyclic(11)) to (cyclic(3),cyclic(3)) redistribution of a 1K × 1K array. The indirect redistribution is done in two ways — using the General Method and the LCM Method for each cyclic(11) to cyclic(3) redistribution. We find that the General Method performs better than the LCM Method. This is because in the General Method, for each one-dimensional redistribution, the destination and source processors need to be calculated for each row (or column) of the array and the entire row (or column) has to be communicated. In the LCM Method, this communication needs to be done twice. Since the amount of communication is much higher than the amount of computation, the General Method performs better. In the Direct Method, destination and source processor calculations have to be done for each element of the local array. If the local array size is $L \times L$, $L^2$ calculations have to be done. In the Indirect Method, calculations are done for each column during the column redistribution and for each row during the row redistribution. Hence the number of calculations is only $L + L = 2L$. Therefore, the Direct Method performs considerably worse than any of the Indirect Methods in this case.

### 3.6.2 Shape Changing Redistribution

This type of redistribution occurs when at least one dimension of the array is collapsed before or after the redistribution. Consider the redistribution from (X,* ) to (*,Y) or vice-versa, where X and Y can be either block, cyclic or cyclic(m). This is basically a
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION

Figure 3.21: (cyclic(11), cyclic(11)) to (cyclic(3), cyclic(3)) Redist., 1K × 1K array collapsed to distributed type of redistribution in one of the dimensions, which is done as follows. Each processor divides the local array into packets along the collapsed dimension, depending on the type of the new distribution Y. The processors then exchange packets with other processors. At the receiving end, packets are placed in the local array in order of source processor number. Data from the received packets may have to be placed in the local array either contiguously or with a stride, depending on the type of distributions X and Y.

The other type of redistribution involving change of shape of the local array is of the type (X,* or (*,X) to (Y,Z), or vice versa. That is, in either the source or target distributions, one of the dimensions is collapsed and in the corresponding target or source distributions, both dimensions are distributed. Each case of this type requires a different algorithm and so has to be considered separately. Note that we do not use the Indirect Method for shape changing redistributions.
3.7 Circular Redistribution

Very often, it is necessary to redistribute from a distribution $X$ to a distribution $Y$ and then sometime later in the program it is required to redistribute back from $Y$ to $X$. We call this $X \rightarrow Y \rightarrow X$ type redistribution as a circular redistribution and is denoted as $X \leftrightarrow Y$. The redistribution from $X$ to $Y$ is called a forward redistribution and from $Y$ to $X$ is called a backward redistribution. For example when the main program calls a subroutine with a different distribution, it is necessary to redistribute from say $X$ to $Y$. After returning from the subroutine, it is necessary to redistribute back to $X$ to restore the original distribution.

In the case of a circular redistribution, the backward redistribution is essentially the reverse of the forward redistribution. The calculation of the destination processors and addresses done during the forward redistribution can be saved and reused in the backward redistribution. Thus, no calculations need be done during the backward redistribution. This decrease in computation is at the cost of increased memory for saving the information calculated in the forward redistribution.

3.7.1 Circular Block($m$) $\leftrightarrow$ Cyclic Redistribution

We first consider a circular block($m$)$\leftrightarrow$cyclic redistribution. In the forward block($m$)$\rightarrow$cyclic redistribution, each processor calculates the destination processor to which the first element of the local array is to be sent. This information can be saved (A). In the receive phase, each processor calculates how much data is to be received from other processors. This information can also be saved (B). In the backward cyclic$\rightarrow$block($m$) redistribution, Information (B) is sufficient for each processor to know how many elements to send to other processors. Information (A) is sufficient for the receiving processor to know where to store incoming data. Thus, no calculations need to be done in the backward redistribution phase. However, this saving in computation is at the cost of increased memory requirements. Information (A) can be stored in a single variable. Information (B) requires an array of length equal to the number of processors.
3.7.2 Circular Cyclic $\leftrightarrow$ Block($m$) Redistribution

This is essentially the reverse of a circular block($m$) $\leftrightarrow$ cyclic redistribution. In the send phase of the forward cyclic$\rightarrow$block($m$) redistribution, each processor stores the set of destination processors and the number of elements sent to them (A). In the receive phase, each processor stores the source processor of the first element of the local array (B). In the backward block($m$)$\rightarrow$block(cyclic) redistribution, Information (B) is sufficient to carry out the send phase and Information (A) can be used by the receiving processor to know how many elements are to be received from which processor. Information (B) can be stored in a single variable. Information (A) requires an array of length equal to the number of processors.

3.7.3 Circular Cyclic($x$) $\leftrightarrow$ Cyclic($y$) Redistribution

Let us first consider the special case of a circular cyclic($x$) $\leftrightarrow$ cyclic($y$) redistribution where $x = ky$. In the forward cyclic($x$)$\rightarrow$cyclic($y$) redistribution, each processor calculates the sequence of processors to which sub-blocks of size $y$ have to be sent. This information is stored (A). In the receive phase, each processor calculates the source processors for blocks of size $y$ of the local array. This information is also stored (B). In the backward cyclic($y$)$\rightarrow$cyclic($x$) redistribution where $x = ky$, information (B) can be used in the send phase to determine the sequence of processors to send blocks of size $y$. In the receive phase, information (A) can be used to determine the source processors for blocks of size $y$ of the local array. Thus, no calculations need to be done in the backward redistribution phase. In the forward redistribution phase, since the sequence of processors to send sub-blocks of size $y$ remains the same for every block of size $x$, information (A) can be stored in an array of size $x/y = k$. Similarly, information (B) also requires an array of size $k$.

The other special case $y = kx$ is similar.

For the general case where there is no particular relation between $x$ and $y$, if the LCM or GCD Methods are used, information can be stored and reused in the same
Figure 3.22: Circular cyclic(11) to cyclic(3) Redistribution, array size 1M integers

manner as described above for the $x = k y$ case.

If the General Method is used, then in the send phase of the forward redistribution, each processor calculates the destination processor of each element of the local array. This information can be stored (A). In the receive phase, each processor calculates the source processor of each element of the local array. This can also be stored (B). For the backward $cyclic(y) \rightarrow cyclic(x)$ redistribution, Information (B) gives the set of destination processors and Information (A) gives the set of source processors. This saves a lot of computation in the backward redistribution, since it would otherwise be required to compute the source and destination processors for each individual element. However the amount of storage required is twice the size of the local array. Figure 3.22 compares the performance of a circular $cyclic(11) \leftrightarrow cyclic(3)$ redistribution with and without saving any information in the forward redistribution, for an array of size 1M integers. We observe that there is considerable improvement in performance by reusing the information.
Chapter 4

Runtime Support for All-to-All Collective Communication

Programs on distributed memory parallel computers typically require interprocessor communication. In loosely synchronous parallel programs, all processors perform similar operations but on different data sets. Hence it is very likely that a group of processors or even all processors may need to perform communication at the same time. This makes it possible for processors to cooperate with each other to perform communication efficiently, which is known as collective communication. Depending on the type of communication required, different communication patterns are possible. These can be generally classified into the following types:

- **All-to-all**: All processors need to send data to all other processors.
- **All-to-many**: All processors need to send data to some other processors.
- **Many-to-all**: Some processors need to send data to all other processors.
- **Many-to-many**: Some processors need to send data to some other processors.

Efficient algorithms are needed to implement these communication patterns on different network topologies. In this chapter, we consider the all-to-all communication pattern in detail. The other communication patterns, namely all-to-many, many-to-all and many-to-many, have been studied in [100, 127, 106, 107, 108].
The all-to-all communication pattern (also called complete exchange\textsuperscript{1}) occurs frequently in many important parallel computing applications such as array redistribution, parallel quicksort, some implementations of two-dimensional Fast Fourier Transform, matrix transpose etc. It is the densest form of communication because all processors need to communicate with all other processors. This can result in severe link contention and degrade performance considerably. Hence, it is necessary to use efficient algorithms in order to get good performance over a wide range of message sizes and number of processors.

We describe several algorithms for all-to-all communication on fat-tree and two-dimensional mesh interconnection topologies. We use the Thinking Machines CM-5 as an example machine with a fat tree topology and the Intel Touchstone Delta as an example machine with a two-dimensional mesh topology. Since these machines have different architectures and communication capabilities, different algorithms are needed to get the best performance on each of them. We present four algorithms for the CM-5 and six algorithms for the Delta. Extensive performance results on the CM-5 and Delta are presented and analyzed. We have also developed analytical models to estimate the performance of the algorithms on the basis of system parameters. The analytical models are validated by comparing with experimental results.

\section*{4.1 Architecture and Communication Issues}

\subsection*{4.1.1 CM-5}

The CM-5 is a scalable distributed memory multiprocessor system which can be scaled up to 16K processors. It supports both SIMD and MIMD programming models. Each node on the CM-5 is a SPARC processor and has four optional vector processors. The CM-5 has two internal networks that support interprocessor communication — the \textit{Control Network} and the \textit{Data Network}. The control network is responsible for communication patterns in which many processors may be involved in the processing.

\textsuperscript{1}We use the words “all-to-all communication” and “complete exchange” interchangeably in this chapter.
of each datum, such as global reduction operations, parallel prefix operations and processor synchronization. The data network supports point-to-point communication and has a fat tree topology [76, 77] as shown in Figures 4.1 and 4.2. It is actually a 4-ary fat tree or quad tree, where each node has four children. Each internal node of the fat tree is implemented as a set of switches. The number of switches per node depends on where it is in the tree. Nodes at level 1 have two switches. The number of switches per node doubles for each higher level till level 3, from where on it quadruples. Each level 1 or level 2 switch has two parents and four children; switches at higher levels have four parents and four children. The data network has a guaranteed system-wide bandwidth of 5 Mbytes/sec. Messages are divided into packets of size 20 bytes of which 4 bytes is the header. The routing algorithm for the Data Network compares the destination address with the source address to determine how far up the tree the message must travel. The message can then take any path up the tree. This allows the switches to perform load balancing on the fly. Once the message has reached the necessary height in the tree, it must follow a particular path down. A detailed discussion of the interprocessor communication overhead on the CM-5 is given in [96, 16, 94].
4.1.2 Touchstone Delta

The Intel Touchstone Delta is a 16 × 32 mesh of computational nodes, each of which is an Intel i860/XR microprocessor. The two-dimensional mesh interconnection network has bidirectional links and uses wormhole routing. The links have a maximum bandwidth of 10 Mbytes/sec in each direction. Messages are divided into packets of size 512 bytes of which 32 bytes is the header. It uses deterministic XY routing in which packets are first sent along the X dimension and then along the Y dimension. In other words, at most one turn is allowed, and that turn must be from the X dimension to the Y dimension. For a 2D mesh, the XY routing algorithm is guaranteed to be deadlock free [33].

In wormhole routing, a packet is divided into a number of flits (flow control digits) for transmission. The size of a flit is typically the same as the channel width. The header flit of a packet determines the route. As the header advances along the route, the remaining flits follow in a pipeline fashion. If the header flit encounters a channel already in use, it is blocked until the channel becomes available. The flow control within the network blocks the trailing flits and they remain in flit buffers along the
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION

established route. Once a channel has been acquired by a packet, it is reserved for the packet. The channel is released when the last flit has been transmitted on the channel. The pipelined nature of wormhole routing makes the communication latency almost insensitive to path length in the absence of network contention. The network latency for wormhole routing is \((L_f/B)D + L/B\), where \(L_f\) is the length of each flit, \(B\) is the channel bandwidth, \(D\) is the path length, and \(L\) is the length of the message \([87]\). Thus, if \(L_f << L\), the path length \(D\) will not significantly affect the network latency unless it is very large. Further details of wormhole routing can be found in \([87]\).

4.1.3 Performance Models

Barnett et. al. \([3]\) have proposed algorithms and performance models for global combine operations on a wormhole routed mesh. We use similar models for our all-to-all communication algorithms, which take into account link conflicts and other characteristics of the underlying communication system. The following notations are used in our models:

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>(\alpha)</td>
<td>startup time per message</td>
</tr>
<tr>
<td>(\beta_{tx})</td>
<td>transfer time per byte for an exchange with no link conflicts</td>
</tr>
<tr>
<td>(\beta_{sr})</td>
<td>transfer time per byte to send to and receive from different processors with no link conflicts</td>
</tr>
<tr>
<td>(\beta_{sat})</td>
<td>transfer time per byte on a saturated link</td>
</tr>
<tr>
<td>(\beta_s)</td>
<td>transfer time per byte for a single send-recv with no link conflicts</td>
</tr>
<tr>
<td>(L)</td>
<td>number of bytes to be exchanged per processor pair</td>
</tr>
<tr>
<td>(f(i))</td>
<td>maximum number of messages contending for a saturated link at step (i)</td>
</tr>
<tr>
<td>(P)</td>
<td>total number of processors</td>
</tr>
</tbody>
</table>

The time taken for an exchange operation may be different from the time to send to and receive from different processors, because in the latter case the incoming and outgoing messages may traverse links with different amount of contention. Hence, we use \(\beta_{tx}\) or \(\beta_{sr}\), depending on the algorithm. We assume that the time taken is independent of distance, a property of both CM-5 and Delta \([95, 3]\). Thus, the time
required for an exchange step $i$ is given by

$$T = \alpha + L \max(\beta_{ex}, f(i)\beta_{sat})$$

We assume that conflicting messages share the bandwidth of a network link. The network may have excess bandwidth, enabling multiple messages to traverse a link in the same direction without conflict. In other words, $\beta_{sat} < \beta_{ex}$, $\beta_{sat} < \beta_{sr}$, $\beta_{sat} < \beta_{s}$.

Also, in any expression in this chapter, the division of two integer variables should be considered as *integer division*, i.e. $5/2 = 2$.

### 4.2 All-to-All Communication on a Fat Tree

In this section, we describe four algorithms for all-to-all communication on the fat tree topology of the CM-5.

#### 4.2.1 Linear Exchange (LEX)

The Linear Exchange algorithm is the simplest of the four algorithms. In step $i$, $0 \leq i < P$, processor $i$ receives messages from every processor except itself. The algorithm clearly requires $P$ steps. Since at every step one processor receives from all other processors, there is a lot of link contention. At step $i$, every processor sends data to processor $i$. Processor $i$ has two links to its parent node and $P - 1$ processors simultaneously need to use these links. Hence, the maximum number of messages contending for a link at any step is $\lceil \frac{P-1}{2} \rceil$. The time taken for any step $i$ is

$$T(i) = \alpha + L \max(\beta_{s}, \lceil \frac{P-1}{2} \rceil \beta_{sat})$$

The cost of LEX is obtained by summing over all steps of the algorithm:

$$T_{LEX} = \sum_{i=1}^{P-1} [\alpha + L \max(\beta_{s}, \lceil \frac{P-1}{2} \rceil \beta_{sat})] = \alpha (P-1) + L (P-1) \max(\beta_{s}, \lceil \frac{P-1}{2} \rceil \beta_{sat})$$
do $i=1, P-1$
    destination = xor(mynumber, $i$)
    Exchange with destination
end do

Figure 4.3: Pairwise Exchange (PEX)

Table 4.1: Communication Schedule for PEX on 8 Processors

<table>
<thead>
<tr>
<th>Step 1</th>
<th>Step 2</th>
<th>Step 3</th>
<th>Step 4</th>
<th>Step 5</th>
<th>Step 6</th>
<th>Step 7</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 ↔ 1</td>
<td>0 ↔ 2</td>
<td>0 ↔ 3</td>
<td>0 ↔ 4</td>
<td>0 ↔ 5</td>
<td>0 ↔ 6</td>
<td>0 ↔ 7</td>
</tr>
<tr>
<td>2 ↔ 3</td>
<td>1 ↔ 3</td>
<td>1 ↔ 2</td>
<td>1 ↔ 5</td>
<td>1 ↔ 4</td>
<td>1 ↔ 7</td>
<td>1 ↔ 6</td>
</tr>
<tr>
<td>4 ↔ 5</td>
<td>4 ↔ 6</td>
<td>4 ↔ 7</td>
<td>2 ↔ 6</td>
<td>2 ↔ 7</td>
<td>2 ↔ 4</td>
<td>2 ↔ 5</td>
</tr>
<tr>
<td>6 ↔ 7</td>
<td>5 ↔ 7</td>
<td>5 ↔ 6</td>
<td>3 ↔ 7</td>
<td>3 ↔ 6</td>
<td>3 ↔ 5</td>
<td>3 ↔ 4</td>
</tr>
</tbody>
</table>

4.2.2 Pairwise Exchange (PEX)

We consider the Pairwise Exchange (PEX) algorithm which has been shown to be the best algorithm for a hypercube network, on which it guarantees no link contention at any step [6, 103]. The algorithm is described in Figure 4.3. It requires $P-1$ steps and the communication schedule is as follows. At step $i$, $1 \leq i \leq P-1$, each processor exchanges a message with the processor determined by taking the exclusive-or of its processor number with $i$. Therefore, this algorithm has the property that the entire communication pattern is decomposed into a sequence of pairwise exchanges. The communication schedule of the pairwise exchange algorithm for 8 processors is given in Table 4.1. The entry $i \leftrightarrow j$ in the table indicates that processors $i$ and $j$ exchange messages.

The time taken by this algorithm on the CM-5 can be estimated as follows. When a processor has to communicate with another processor in its cluster of $4^k$ processors, $k \geq 1$, the message has to travel a maximum of $k$ levels up the tree. When a cluster
of 16 processors exchange among themselves, the messages have to travel either 1 or 2 levels up the tree depending on the source and destination. There are 32 links from level 0 to level 1 and 16 links from level 1 to 2 for this cluster, enough for the 16 processors to exchange among themselves without contention. However, when processors in a cluster of 16 need to exchange with processors in another cluster of 16, there are only 8 links from a level 2 node to a level 3 node, which results in 16 processors contending for 8 links. A similar bottleneck exists at higher levels. For example, a level 3 node has 32 links upwards and downwards, and 64 processors in its subtree.

The communication schedule of PEX is such that in the first 15 steps, processors exchange completely within a cluster of 16 processors and after that they exchange across clusters. Hence, in the first 15 steps there is no contention. From step 16 onwards, there are a maximum of 2 messages contending for a link. The time taken for step \( i \) is given by

\[
T(i) = \begin{cases} 
\alpha + L \beta_{ex} & \text{for } 1 \leq i \leq 15 \\
\alpha + L \max(\beta_{ex}, 2\beta_{sat}) & \text{for } i > 15 
\end{cases}
\]

The time for the entire PEX algorithm can be obtained by summing over all steps

\[
T_{PEX} = \sum_{i=1}^{15}[\alpha + L \beta_{ex}] + \sum_{i=16}^{P-1}[\alpha + L \max(\beta_{ex}, 2\beta_{sat})]
\]

which can be simplified to

\[
T_{PEX} = (P - 1)\alpha + L [15\beta_{ex} + (P - 16) \max(\beta_{ex}, 2\beta_{sat})]
\]

For a complete exchange on 16 processors, this algorithm has no contention.

### 4.2.3 Recursive Exchange (REX)

The Recursive Exchange algorithm is described in Figure 4.4. The number of processors is halved in each step and each processor exchanges data with the corresponding
Table 4.2: Communication Schedule for REX on 8 processors

<table>
<thead>
<tr>
<th>Step 1</th>
<th>Step 2</th>
<th>Step 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 ↔ 4</td>
<td>0 ↔ 2</td>
<td>0 ↔ 1</td>
</tr>
<tr>
<td>1 ↔ 5</td>
<td>1 ↔ 3</td>
<td>2 ↔ 3</td>
</tr>
<tr>
<td>2 ↔ 6</td>
<td>4 ↔ 6</td>
<td>4 ↔ 5</td>
</tr>
<tr>
<td>3 ↔ 7</td>
<td>5 ↔ 7</td>
<td>6 ↔ 7</td>
</tr>
</tbody>
</table>

A processor sends all the data intended for all processors in the other half to only one processor in that half, which forwards that data to the remaining processors in a later step. The number of steps required is $\lg P$ and each message is of size $L \times P/2$. The communication schedule for REX on 8 processors is given in Table 4.2. Although this algorithm takes less number of steps than LEX and PEX, the amount of data transmitted in each step is much higher. Since it is a store-and-forward type algorithm, each step incurs the additional overhead of reshuffling data.

In step $i$, $1 \leq i \leq \lg P$ each processor $j$ exchanges with processor $j \pm \frac{P}{2}$. Communication always takes place either entirely within a cluster of 16 processors or entirely across clusters. In steps $1$ to $\lg P - 4$, communication takes place across clusters, so the maximum number of messages contending for a link is 2. In steps $\lg P - 3$ to $\lg P$, communication takes place within a cluster of 16 processors, so there is no contention. Hence, the time taken by REX is

$$T_{REX} = \sum_{i=1}^{\lg P - 4} [\alpha + L \frac{P}{2} \max (\beta_{ex}, 2\beta_{sat})] + \sum_{i=\lg P - 3}^{\lg P} [\alpha + L \frac{P}{2} \beta_{ex}]$$

which can be simplified to

$$T_{REX} = \alpha \lg P + L \frac{P}{2} [4\beta_{ex} + (\lg P - 4) \max (\beta_{ex}, 2\beta_{sat})]$$
Figure 4.4: Recursive Exchange (REX) on CM-5

4.2.4 Balanced Exchange (BEX)

In the PEX algorithm, the communication schedule is such that all processors in a cluster first exchange completely with each other and then exchange with processors in other clusters. In other words, all the communication is either entirely within the cluster or entirely across clusters. As explained above, this gives rise to contention from step 16 onwards. An improvement in performance can be expected if there is a balance of local and long distance communication at every step, which will reduce contention in step 16 onwards. The Balanced Exchange (BEX) algorithm provides such a schedule. BEX is a simple modification of PEX and is described in Figure 4.5. For the purpose of determining the communicating pairs of processors, we define a mapping between the physical number of a processor and its virtual number as

\[
\text{virtual no} = \text{MOD}(\text{physical no} + 1, P)
\]

Balanced exchange consists of using the pairwise exchange algorithm with this mapping and the virtual processor numbers. The communication schedule for BEX is given in Table 4.3.
The BEX algorithm has the property that in steps 0 to \(P/2 - 1\), two processors in each cluster of size \(P/2\) communicate across clusters while the rest communicate within the cluster. In steps \(P/2\) to \(P - 1\), two processors in each cluster of size \(P/2\) communicate within the cluster, while the other processors communicate across clusters. In steps 0 to 15, there is no contention as in the PEX algorithm. In step \(i > 15\), instead of \(2^{|\log i|}\) processors contending for \(2^{|\log i| - 1}\) links, there are \((2^{|\log i|} - 2)\) processors contending for \(2^{|\log i| - 1}\) links. Hence the maximum contention for any link at step \(i > 15\) is \(\frac{2^{|\log i|} - 2}{2(|\log i| - 1)}\) on an average. The total time taken by BEX is

\[ T_{BEX} = \sum_{i=1}^{15} [\alpha + L \beta_{ex}] + \sum_{i=16}^{P-1} [\alpha + L \max(\beta_{ex}, \frac{2^{|\log i|} - 2}{2(|\log i| - 1)} \beta_{sat})] \]

which can be simplified to

\[ T_{BEX} = (P - 1)\alpha + L [15\beta_{ex} + (P - 16) \max(\beta_{ex}, \frac{2^{|\log i|} - 2}{2(|\log i| - 1)} \beta_{sat})] \]

4.2.5 Performance of Algorithms on the CM-5

We have implemented the above algorithms on the CM-5 using the message passing library CMMD Version 3.0 Beta. Figure 4.6 compares the communication time of
the four algorithms on a 32 node CM-5 with message size varied between 0 and 2048 bytes. As expected, LEX performs much worse than the other algorithms, so we do not consider it any further. For small message sizes, the performance of PEX, REX and BEX is virtually indistinguishable. However, for large message sizes, PEX performs much better than REX and BEX performs better than PEX. This is because of the following reasons. First, even though the number of steps in REX is only \( \log_2 P \), as compared to \( P - 1 \) steps in PEX, the message size in REX remains constant at \( L / P \), whereas the size of each message in PEX is \( L \). Also, REX uses a store-and-forward approach in which a message is sent from source to destination processor through one or more intermediate processors. Sending a message from source to destination through \( k \) intermediate processors costs \( k \) times more than sending it directly. In addition, each node needs to buffer and reshuffle data in REX so that appropriate data can be sent to the appropriate node. These two overheads outweigh the savings in the number of communication steps. BEX performs the best because it maintains a balance of local and remote communication at each step.

Figures 4.7 and 4.8 show the performance of the algorithms on up to 256 processors for message size 512 bytes and 2 Kbytes respectively. PEX and BEX perform better than REX for small number of processors because the overhead of message size and number of steps dominate for REX. As the number of processors increases, the overhead of the larger number of messages in PEX and BEX is higher than the overhead of larger message size and reshuffling in REX, and therefore, REX performs better.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION

Figure 4.6: Performance on 32 node CM-5 for different message sizes

Figure 4.7: Performance on CM-5 for message size 512 bytes
This section describes algorithms for all-to-all communication on the two-dimensional mesh topology of the Intel Touchstone Delta. A mesh is a low-dimension high-bandwidth network which performs well when there is no link contention. However, a dense communication pattern like complete exchange results in a lot of link contention which can degrade performance considerably. Also, the algorithms discussed above for the CM-5 and the existing algorithms for a hypercube assume that the number of processors is a power-of-two. This is a valid assumption for hypercube and fat-tree architectures because the number of processors is always a power-of-two. However, on the Delta, the user can allocate a mesh which may not be a power-of-two and may even be an odd number (eg. 5x4 or 3x3). So, it is necessary to develop algorithms which work even on non power-of-two meshes.

Bokhari and Berryman describe two algorithms for a circuit-switched mesh, which assume that the number of processors is a power-of-two [7]. Scott has shown that $a^3/4$
is the lower bound on the number of phases required to perform complete exchange on an $a \times a$ mesh such that there is no link contention in any phase [103]. However, if we allow link contention to exist, the operation can be performed in fewer steps. We have adopted this approach of allowing a small amount of link contention to exist, thereby reducing the number of steps and keeping all processors active at every step. This approach also takes advantage of the fact that the communication links in the Delta have excess bandwidth [3], so that a small number of contending messages will not affect the communication time.

We consider six algorithms on the Delta, for power-of-two and non-power-of-two meshes. For the analysis of the algorithms, we assume that the mesh has $r$ rows and $c$ columns. Hence, $P = r \times c$.

### 4.3.1 Pairwise Exchange for Power-Of-Two Mesh (PEX)

The Pairwise Exchange algorithm described earlier for the CM-5 can also be used on the Delta without any modification, as long as the number of processors is a power of two. However, since the mesh architecture is different from a fat tree architecture, the link contention caused by this algorithm on the Delta is different from that on a fat tree. Figure 4.9 shows the communication pattern of PEX on a $2 \times 4$ mesh. The complete exchange requires seven steps. In steps 1, 4 and 5 there is no contention. In steps 2, 3, 6 and 7, the maximum number of messages contending for a link is 2. Messages traveling in opposite directions on a link do not contend, because the links on the Delta are bidirectional.

Since each step of the algorithm involves an exchange between pairs of processors, the maximum number of messages contending for a link at any step is limited by $max(r,c)/2$. An exact expression for the maximum number of messages contending for a link at step $i$ is

$$f(i) = 2^{\lfloor \log_{max(MOD(i,c),i/c)} \rfloor}$$

This expression was obtained empirically. We studied the communication pattern of PEX for a large number of processors and mesh configurations. We found that there
**Figure 4.9**: PEX on $2 \times 4$ mesh

is a relation between the step number $i$, the shape of the mesh and the maximum link contention in that step. That relation is given by the above expression for $f(i)$.

The time taken for step $i$ is

$$T(i) = \alpha + L \max(\beta_{\text{ex}}, f(i)\beta_{\text{sat}})$$

The cost of PEX can be determined by summing over all steps of the algorithm:

$$T_{\text{PEX}} = \sum_{i=1}^{P-1} [\alpha + L \max(\beta_{\text{ex}}, f(i)\beta_{\text{sat}})] = (P - 1)\alpha + L \sum_{i=1}^{P-1} \max(\beta_{\text{ex}}, f(i)\beta_{\text{sat}})$$

If the number of processors is not a power-of-two, the exclusive-or function does not create all the processor pairs in $P - 1$ steps, so the algorithm is not directly applicable.
We have extended the basic pairwise exchange algorithm, so that it works even when the number of processors is not a power-of-two. We call this algorithm Pairwise Exchange for General Mesh (PEX-GEN) and is described in Figure 4.10. The algorithm first finds the smallest power-of-two (say $q$) greater than the number of processors and uses this number to schedule $q - 1$ steps of the pairwise exchange. In each step, every processor checks to see if the calculated destination processor number is less than the actual number of processors. If so, it exchanges data with the processor, else it goes ahead to the next step. Thus, the algorithm requires $q - 1$ steps where $q$ is the nearest power-of-two larger than the number of processors. Clearly, the algorithm takes more steps than necessary and many processors remain idle in several of the steps. However, this reduces the link contention in each step. The maximum contention in each step is upper bounded by that in the PEX algorithm.

### 4.3.3 PEX-GEN with Shift (PEX-GEN-SHIFT)

The motivation for the Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) algorithm can be explained with the help of Figure 4.11(a). Assume that the user has allocated a mesh of 20 processors which may be organized in any way (ie $4 \times 5$, $5 \times 4$, $6 \times 3$, or $3 \times 6$). The algorithm involves replacing the XOR operation with a shift operation. The number of steps required to complete the algorithm is $q - 1$ where $q$ is the nearest power-of-two larger than the number of processors. The maximum contention in each step is upper bounded by that in the PEX algorithm.
or $2 \times 10$ etc.). The processor numbers will be 0 to 19. The nearest power of two larger than 20 is 32. The PEX-GEN algorithm will require 31 steps. The processor pairs created by the exclusive-or function are such that in steps 1 through 15, processors 0 to 15 exchange completely among themselves and do not communicate with any processor in the range 16 to 31. Similarly, processors 16 to 19 exchange completely among themselves and do not communicate with any processor in the range 0 to 15. In steps 16 through 31, the processors in one half (0 — 15) exchange with processors in the second half (16 — 31). Since there are only 4 processors in the second half, several of the processors in the first half do not do any communication. Thus the communication pattern is such that in the first 15 steps, all the processors in the first half are active and in the next 16 steps, several of them are inactive. Since each step involves communication with link contention, there is high link contention in steps 1 — 15 and very little or no link contention in steps 16 — 31. In general, if there are $P$ processors where $P$ is not a power of two, PEX-GEN will require $q - 1$ steps where $q = 2^{\lceil \log_2 P \rceil}$. In the first $\lfloor (q - 1)/2 \rfloor$ steps, the first $q/2$ processors are active and in the remaining steps, several of them are inactive.

A better algorithm is one which balances the contention such that all steps have more or less equal contention and equal number of inactive processors. This can be
Achieved by defining virtual processor numbers such that the real processors 0 — 19 are numbered 6 — 25 as shown in Figure 4.11(b). The processor numbers are shifted by an amount equal to half the absolute difference between the number of processors and the nearest power of two. In other words, in the range 0 — 31, the actual processors are numbered 6 — 25, and there are no processors numbered 0 — 5 and 26 — 31. Thus the empty space which earlier existed only in the half 16 — 31 is now equally divided among the two halves. So, even in the first 15 steps of the algorithm, there are equal number of idle processors in both halves, which balances the contention among all the steps of the algorithm. We call this algorithm Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) and is described in Figure 4.12. This algorithm also takes $q - 1$ steps where $q$ is the smallest power-of-two larger than the number of processors. The maximum contention at each step is upper bounded by that for the PEX algorithm.
4.3.4 General Algorithm for any Mesh (GEN)

The above algorithms require one less than a power of two number of steps, because they use the exclusive-or function to obtain processor pairs which exchange with each other. For non power-of-two meshes, it would be advantageous to have an algorithm which requires only \( P - 1 \) steps. Figure 4.13 describes such an algorithm, which we call the General Algorithm for any Mesh (GEN), because it works for any number of processors. In the GEN algorithm, processor pairs do not exchange with each other. Instead, at step \( i \), a processor \( j \) sends data to processor \( MOD(j + i, P) \) and receives data from processor \( MOD(j - i + P, P) \). This algorithm requires only \( P - 1 \) steps, for any value of \( P \).

The maximum contention at step \( i \) is obtained empirically as

\[
f(i) = \min[MOD(i, c), c - MOD(i, c)] + \min[i/c, (P - i)/c]
\]

The total time for all steps is given by:

\[
T_{GEN} = \sum_{i=1}^{P-1} \left[ \alpha + L \max(\beta_{sr}, f(i)\beta_{sat}) \right] = (P - 1)\alpha + L \sum_{i=1}^{P-1} \max(\beta_{sr}, f(i)\beta_{sat})
\]
4.3.5 Indirect Pairwise Exchange (IPEX)

The Indirect Pairwise Exchange (IPEX) algorithm aims at reducing link contention in the direct Pairwise Exchange (PEX) algorithm. In IPEX, each processor communicates only with the processors in its row and column. The algorithm is described in Figure 4.14. Each exchange along a row is followed by a complete exchange along a column. During the row exchange, each processor sends $L_r$ bytes of data to the destination processor, out of which $L(r-1)$ bytes are intended for other processors in the same column as the destination processor. This is followed by a complete exchange along the columns (involving messages of $L$ bytes), in which the data received during the row exchange is sent to the appropriate processors in the same column. This entire operation requires $r(c-1)$ communication steps. Finally, an additional complete exchange is required along the columns for processors to exchange their own data directly with processors in the same column. In this phase, data is sent directly from source to destination, requiring $r-1$ exchange steps. Hence, the total number of steps required is $r(c-1) + (r-1) = rc - 1 = P - 1$.

The maximum link contention at any step is the same as for pairwise exchange along a row or column which is $2^{\lfloor \log i \rfloor}$, where $i$ is the step number along the row or column. Hence, the total time required for IPEX is given by:

$$T_{IPEX} = \sum_{i=1}^{r-1} [\alpha + L \max(\beta_{ex}, 2^{\lfloor \log i \rfloor} \beta_{sat})] + \sum_{j=1}^{r-1} \{\alpha + L \max(\beta_{ex}, 2^{\lfloor \log j \rfloor} \beta_{sat})\} + \sum_{i=1}^{r-1} [\alpha + L \max(\beta_{ex}, 2^{\lfloor \log i \rfloor} \beta_{sat})]$$

4.3.6 Recursive Exchange (REX)

The Recursive Exchange algorithm is described in Figure 4.15. It is similar to that for the CM-5, except it is recursively applied to both dimensions of the mesh. The mesh is first recursively halved in the $x$ direction and messages are exchanged over...
each cut. This takes \( \lg c \) steps. The mesh is then recursively halved in the \( y \) direction and messages are exchanged over each cut, which takes \( \lg r \) steps. Thus, the total number of steps is \( \lg c + \lg r = \lg P \) and the message size in each step is \( L \times P/2 \). This algorithm also works only for power-of-two meshes, since the mesh is divided by two in each step. REX has an indirect form of communication, in the sense that data is sent from source processor to destination processor through one or more intermediate processors.

Since the mesh is recursively divided by two and each processor in one partition communicates with its mirror image in the other partition, the maximum number of messages contending for a link at step \( i \) is

\[
f(i) = \begin{cases} 
  \frac{c}{2^i} & \text{for } 1 \leq i \leq \lg c \\
  \frac{r}{2^{\lg c}} & \text{for } \lg c < i \leq \lg P 
\end{cases}
\]
size = c
pos = 0
x = my x-coordinate
y = my y-coordinate
bytes = $L \times P/2$
do i=1, \lg c
    size = size/2
    if ($x < (size + pos)$) then
        destx = $x + size$
    else
        destx = $x - size$
        pos = pos + size
    end if
    dest = y \times c + destx
    exchange message of size “bytes” with dest
end do
size = r
pos = 0
do i=1, \lg r
    size = size/2
    if ($y < (size + pos)$) then
        desty = $y + size$
    else
        desty = $y - size$
        pos = pos + size
    end if
    dest = desty \times c + x
    exchange message of size “bytes” with dest
end do

Figure 4.15: Recursive Exchange (REX) on Delta
Table 4.4: Performance of PEX (time in sec.)

<table>
<thead>
<tr>
<th>Message Size (bytes)</th>
<th>Mesh Configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4 × 4</td>
</tr>
<tr>
<td>256</td>
<td>0.004</td>
</tr>
<tr>
<td>1K</td>
<td>0.008</td>
</tr>
<tr>
<td>4K</td>
<td>0.023</td>
</tr>
<tr>
<td>8K</td>
<td>0.034</td>
</tr>
<tr>
<td>16K</td>
<td>0.064</td>
</tr>
</tbody>
</table>

The cost of REX can be obtained by summing over all steps of the algorithm:

\[ T_{REX} = \sum_{i=1}^{\lg P} [\alpha + \frac{L}{2} P \max(\beta_{cx}, f(i)\beta_{sat})] \]

which can be expanded to

\[ T_{REX} = \alpha \lg P + \frac{L}{2} \sum_{i=1}^{\lg c} \max(\beta_{cx}, c/2^i \beta_{sat}) + \frac{L}{2} P \sum_{i=\lg c+1}^{\lg P} \max(\beta_{cx}, \frac{r}{2^{i-\lg c}} \beta_{sat}) \]

4.3.7 Performance of Algorithms on the Delta

We implemented all the algorithms on the Intel Touchstone Delta and studied their performance for different mesh configurations and message sizes. As suggested in [80], we use forced messages (which provide higher bandwidth but also higher startup cost) if the message size is greater than or equal to 1.5 Kbytes and unforced messages if the message size is less than 1.5 Kbytes.

The performance of PEX is shown in Table 4.4. The number of processors is varied from 16 to 512 with message size varied from 256 bytes to 16 Kbytes. Message size refers to the amount of data communicated in each send and receive operation, so the total amount of data communicated increases as the number of processors is increased. Hence, the time taken increases almost linearly with the number of
processors. In a mesh, the time taken depends not only on the number of processors, but also on the mesh configuration. The maximum contention in PEX is \( \max(r, c)/2 \). Thus, for a fixed number of processors, the time taken will be minimum for a square mesh and maximum for a mesh which is a linear array.

The performance of PEX-GEN is given in Table 4.5. We have chosen some mesh sizes which are non power-of-two. We observe that for mesh sizes which are only slightly less than the nearest higher power-of-two, the performance is close to that of PEX for that power-of-two. But, if the mesh size is only slightly higher than the nearest smaller power-of-two, the time taken is almost twice the time taken by PEX for that power-of-two. For example, the time taken by PEX-GEN on a 16 × 9 mesh is much higher than the time taken by PEX on a 16 × 8 mesh, but the time taken by PEX-GEN on a 16 × 14 mesh is very close to the time taken by PEX on a 16 × 16 mesh. This is because of the difference in the number of steps required. Another interesting observation is that the time taken by PEX-GEN on a 16 × 30 mesh is in fact higher than the time taken by PEX on a 16 × 32 mesh. This is because since the processors are numbered in row major order, a change in the number of columns from a power-of-two to a non power-of-two, changes the communication pattern in the mesh completely for an algorithm which uses the exclusive-or function to determine processor pairs. Hence, there is more contention in the 16 × 30 case than in the 16 × 32 case.

Table 4.6 shows the performance of PEX-GEN-SHIFT. In all cases it performs...
Table 4.6: Performance of PEX-GEN-SHIFT (time in sec.)

<table>
<thead>
<tr>
<th>Message Size (bytes)</th>
<th>Mesh Configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4 × 5</td>
</tr>
<tr>
<td>256</td>
<td>0.008</td>
</tr>
<tr>
<td>1K</td>
<td>0.017</td>
</tr>
<tr>
<td>4K</td>
<td>0.036</td>
</tr>
<tr>
<td>8K</td>
<td>0.071</td>
</tr>
<tr>
<td>16K</td>
<td>0.129</td>
</tr>
</tbody>
</table>

no worse than PEX-GEN. In most cases, the improvement in performance is not significant.

Table 4.7 gives the performance of GEN on a power-of-two mesh. GEN performs better than PEX for small message sizes and small number of processors. However, for large number of processors (≥ 64) and large message sizes (> 1 Kbytes) PEX performs better. The GEN algorithm has a certain amount of asymmetry in the communication in the sense that each communication operation consists of a send to one processor and a receive from some other processor. Thus, the incoming and outgoing messages may traverse a different number of links with different amounts of contention, and the path which has the highest amount of contention adversely affects the communication time. On the other hand, in the PEX algorithm, processor pairs exchange with each other at every step, so the incoming and outgoing messages travel the same number of links with the same amount of contention.

The performance of GEN on non power-of-two meshes is given in Table 4.8. GEN reduces the number of steps from \( q - 1 \) in PEX-GEN and PEX-GEN-SHIFT, where \( q = 2^{\lfloor \log P \rfloor} \), to \( P - 1 \). For small number of processors, PEX-GEN performs the best and the improvement in performance is higher when \( q - P \) is large. However, if \( q - P \) is small and the number of processors is large, the performance of PEX-GEN-SHIFT tends to that of PEX and and the performance of GEN tends to that for a power-of-two mesh. So in this case, PEX-GEN-SHIFT performs better than GEN.
Table 4.7: Performance of GEN on power-of-two mesh (time in sec.)

<table>
<thead>
<tr>
<th>Message Size (bytes)</th>
<th>Mesh Configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4 × 4</td>
</tr>
<tr>
<td>256</td>
<td>0.004</td>
</tr>
<tr>
<td>1K</td>
<td>0.008</td>
</tr>
<tr>
<td>4K</td>
<td>0.018</td>
</tr>
<tr>
<td>8K</td>
<td>0.037</td>
</tr>
<tr>
<td>16K</td>
<td>0.069</td>
</tr>
</tbody>
</table>

Table 4.8: Performance of GEN on non power-of-two mesh (time in sec.)

<table>
<thead>
<tr>
<th>Message Size (bytes)</th>
<th>Mesh Configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4 × 5</td>
</tr>
<tr>
<td>256</td>
<td>0.004</td>
</tr>
<tr>
<td>1K</td>
<td>0.009</td>
</tr>
<tr>
<td>4K</td>
<td>0.025</td>
</tr>
<tr>
<td>8K</td>
<td>0.052</td>
</tr>
<tr>
<td>16K</td>
<td>0.098</td>
</tr>
</tbody>
</table>

We observe from Table 4.9 that REX does not perform well for any mesh configuration and message size even though it requires only \( \log P \) steps. There are several reasons for this. First, there is a lot of link contention in each step. Second, the message size per step is increased to \( L P/2 \) instead of \( L \) in the other algorithms. Third, the indirect form of communication requires a lot of data buffering and shuffling in order to send the appropriate data to the appropriate node. Also, this algorithm has high memory requirements because the large message size requires more memory per node. For example, with the other algorithms we could run tests with message sizes up to 2M bytes, but with REX we could only go up to 16 Kbytes on 512 processors.

Table 4.10 gives the performance of IPEX. For large meshes and large message sizes, IPEX performs better than any of the direct algorithms. This is because in
large meshes, direct algorithms result in a lot of link contention. The reduction in contention by IPEX is larger than the cost of sending messages indirectly, hence IPEX performs better. The message size and contention in each step in IPEX is much smaller than in REX. The additional memory required per node in IPEX is $L r$, compared to $L \times P/2$ in REX. In small meshes, the cost of communicating indirectly is higher than the saving in contention, so direct algorithms perform better.

A comparison of the power-of-two algorithms on a $16 \times 32$ mesh for different message sizes is shown in Figure 4.16. We can see that for this mesh size IPEX performs the best, for reasons explained above. This is followed by PEX and GEN. REX always performs the worst in spite of its $\lg P$ steps. The relative performance of non power-of-two algorithms on a $16 \times 9$ mesh is shown in Figure 4.17. For this
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION

Figure 4.16: Performance of algorithms on a $16 \times 32$ mesh

mesh size, GEN clearly performs the best. For messages up to 2 Kbytes, the performance of PEX-GEN and PEX-GEN-SHIFT is almost indistinguishable. But for larger messages, PEX-GEN-SHIFT performs better than PEX-GEN.

The performance of the power-of-two algorithms for different mesh sizes keeping the message size constant at 16 Kbytes is shown in Figure 4.18. The corresponding graph for a message size of 1 Kbytes is shown in Figure 4.19. We observe that for a large message size of 16 Kbytes, PEX performs the best for meshes with up to 200 processors. For larger meshes, IPEX performs the best. For a message size of 1 Kbytes, we see that GEN performs the best on meshes with up to 125 processors. When the number of processors is between 125 and 420, PEX performs the best and for larger systems, IPEX performs the best.

4.3.8 Model Validation

We have validated the models developed to predict the performance of the algorithms by comparing the predicted times with the actual times observed on the Delta. For
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION

Figure 4.17: Performance of algorithms on a $16 \times 9$ mesh

Figure 4.18: Performance of power-of-two algorithms for message size 16 Kbytes
this purpose, we use typical values for the communication costs on the Delta [80, 3],
namely for unforced messages $\alpha = 75 \mu s$, $\beta_{cx} = 0.35 \mu s$ and for forced messages
$\alpha = 150 \mu s$, $\beta_{cx} = 0.2 \mu s$. We assume that $\beta_{sr} \approx \beta_{cx}$ and $2\beta_{sat} \approx \beta_{cx}$, as done
in [3], i.e. two messages can travel on a link in the same direction without conflict.
Figures 4.20 and 4.21 show that the observed and predicted times agree very closely.

We were not able to validate the models on the CM-5 because accurate values for
$\beta_{sat}$ relative to $\beta_{cx}$ or $\beta_{sr}$ for the CM-5 are not available in the literature.
Figure 4.20: Observed and Predicted times (PEX, GEN)

Figure 4.21: Observed and Predicted times (REX, IPEX)
Chapter 5

Runtime Support for Out-of-Core Programs: (I) Models and Local Optimizations

5.1 Introduction

There are a number of applications which deal with very large quantities of data. These applications exist in diverse areas such as large scale scientific computations, database applications, hypertext and multimedia systems, information retrieval, visualization etc. The number of such applications and their data requirements keep increasing day by day. Although supercomputers have very large main memories, the memory is not large enough to hold all the data required by these applications. For example, a typical Grand Challenge Application at present could require 1Gbyte to 4Tbytes of data per run [38]. These figures are expected to increase by orders of magnitude as teraflop machines make their appearance. Hence, data needs to be stored on disk and the performance of the program depends on how fast processors can access data from disks. A poor I/O capability can severely degrade the performance of the entire program.

Almost all present generation parallel computers provide some kind of hardware and system software support for parallel I/O [29, 92, 9, 36]. But, the I/O performance observed at the application level is usually much lower than what the hardware can
support. There are several reasons for this. First, the data access patterns of many parallel programs are such that they result in a large number of small requests to the file system [74]. Since the I/O latency is very high, this results in poor performance. Second, the interface to any parallel file system does not currently allow a programmer to specify strided accesses using a single read or write call; though there are some recent proposals to rectify this [28, 88]. Third, the interface does not provide support for processors to make a collective I/O request. So file systems cannot perform any optimizations based on the knowledge of the access requests of all processors. Finally, the programmer cannot specify access requests using a high level description, but instead has to explicitly manipulate file pointers. This makes it difficult for the programmer to perform optimizations at the application level, for example prefetching to overlap I/O with computation, because of the complexities involved in managing buffers and file pointers.

We have developed a runtime system called PASSION (Parallel and Scalable Software for Input-Output) [116, 25] which aims to alleviate many of these problems and provide better software support for out-of-core programs. We believe that high-level interfaces that facilitate the use of semantic knowledge about the accesses from parallel programs are necessary for simple and portable application programming. A high-level interface can at the same time provide enough information so that I/O can be done in an efficient manner. The PASSION Runtime Library accepts high-level requests from the application program, translates them to the low-level interface supported by the parallel file system, and performs optimizations for efficient I/O. This chapter discusses the basic design of PASSION and the various models, techniques and optimizations used in it.

5.2 PASSION Runtime Library

The PASSION Runtime Library provides routines to efficiently perform the I/O required in programs involving out-of-core multidimensional arrays. It provides support for loosely synchronous [46] out-of-core computations which use a Single Program
Multiple Data (SPMD) Model. The library can either be directly used by application programmers, or a compiler can translate out-of-core programs written in a high-level data-parallel language like HPF to node programs with calls to the library for I/O. PASSION provides the user with a simple high-level interface, which is a level higher than any of the existing parallel file system interfaces, as shown in Figure 5.1. For example, the user only needs to specify what section of the array needs to be read/written in terms of its lower-bound, upper-bound and stride in each dimension, and the PASSION Runtime Library will fetch it in an efficient manner. A number of optimizations such as Data Sieving, Data Prefetching, Data Reuse, and the Extended Two-Phase Method have been incorporated in the library [116, 115, 118].
5.3 Models

This section discusses the architectural model and the data storage and access models used by the PASSION Runtime Library.

5.3.1 Architectural Model

An important goal in the design of PASSION has been to make it architecture independent as far as possible. The architectural model assumed by PASSION is that of any general distributed memory computer in which the processors are connected together in some fashion. The system is assumed to be provided with a set of disks and I/O nodes. The I/O nodes can either be dedicated processors or some of the compute nodes may also serve as I/O nodes [72]. Each processor may either have its own local disk or all processors may share the set of disks. We prefer if the file system allows us to control the way files are stored on disks, particularly the number of I/O nodes (or disks) across which the file is striped and the stripe size. The I/O subsystem may have a separate interconnection network or it can share the same network which connects the processors together.

PASSION has been implemented on the Intel Touchstone Delta using the native Concurrent File System (CFS) and the NX message passing library. Hence it can run without modification on other Intel machines such as the Paragon and iPSC/860. All the performance results in this chapter as well as in Chapter 6 have been taken on the Touchstone Delta. The computation and communication hardware on the Delta is described in Section 4.1.2. The I/O system on the Delta consists of 32 dedicated I/O nodes, each an Intel 80386 microprocessor. Each I/O node is connected to two disks, resulting in a total of 64 disks. A Concurrent File System (CFS) [92] is provided for parallel access to files. By default, a file is striped across all 64 disks in a round-robin fashion in blocks of size 4 Kbytes. It is possible for the user to specify the disks on which a file is to be stored, but the stripe size is fixed. The CFS provides the user with a Unix-like interface and the parallel reads and writes are handled transparently. The performance of the CFS on the Touchstone Delta has been studied in detail in [9]. The
CFS supports four modes of file access. In Mode 0, each processor has an independent file pointer. In Modes 1 – 3, processors have a common file pointer. All PASSION routines have been implemented using Mode 0.

5.3.2 Data Storage and Access Models

Since PASSION is used in programs having large arrays which do not fit in main memory, the arrays have to be stored on disks in some fashion. PASSION supports three basic models of storing and accessing arrays, called the Local Placement Model (LPM), the Global Placement Model (GPM) and the Partitioned In-Core Model (PIM).

Local Placement Model (LPM)

In this model, the global array is divided into local arrays belonging to each processor. Since the local arrays are out-of-core, they have to be stored in files on disks. The local array of each processor is stored in a separate file called the Local Array File (LAF) of that processor as shown in Figure 5.2(I). The node program explicitly reads from and writes into the file when required. The simplest way to view this model is to think of each processor as having another level of memory which is much slower than main memory. If the I/O architecture of the system is such that each processor has its own disk, the LAF of each processor will be stored on the disk attached to that processor. If there is a common set of disks for all processors, the LAF will be distributed across one or more of these disks. In other words, we assume that each processor has its own logical disk with the LAF stored on that disk. The mapping of logical disks to physical disks depends on how much control the parallel file system provides the user. At any time, only a portion of the local array is fetched and stored in main memory. The size of this portion depends on the amount of memory available. The portion of the local array which is in main memory is called the In-Core Local Array (ICLA). All computations are performed on the data in the ICLA. Thus, during the course of the program, parts of the LAF are fetched into the ICLA, the new values are computed and the ICLA is stored back into appropriate locations in
Figure 5.2: Data Storage and Access Models
the LAF.

Global Placement Model (GPM)

In this model, the global array is stored in a single file called the Global Array File (GAF) as shown in Figure 5.2(II) and no local array files are created. The global array is only logically divided into local arrays in keeping with the SPMD programming model. But, there is a single global array on disk. The PASSION runtime system fetches the appropriate portion of each processor’s local array from the global array file, as requested by the user, in an efficient manner. The advantage of the Global Placement Model is that it does not require the initial local array file creation phase in the Local Placement Model. The disadvantage is that each processor’s data may not be stored contiguously in the GAF, resulting in higher I/O latency time. Also, explicit synchronization is required when a processor needs to access data belonging to another processor. Hence an optimized method, such as the Extended Two-Phase Method proposed in Chapter 6, is needed to perform I/O efficiently.

Partitioned In-Core Model (PIM)

The Partitioned In-Core Model, illustrated in Figure 5.2(III), is a variation of the Global Placement Model. The array is stored in a single global array file as in the Global Placement Model, but there is a difference in the way data is accessed. In the Partitioned In-Core Model, the global array is logically divided into a number of partitions, each of which can fit in the main memory of all processors combined. Thus the computation on each partition is essentially an in-core problem and no I/O is required during the computation on the partition. Hence the name Partitioned In-Core Model. This model is useful when the data access pattern in the program has good locality. Otherwise, creating in-core partitions itself is difficult. The Extended Two-Phase Method is used for I/O in this model.
5.4 Runtime Support for the Local Placement Model

Let us first consider runtime support for the Local Placement Model. Consider the HPF program fragment shown in Figure 5.3, which solves Laplace’s equation by Jacobi iteration method. Let us assume that arrays A and B are very large and hence out-of-core. We note that this may not be the best algorithm for solving Laplace’s equation with an out-of-core data set as discussed in [41], but we only use it for the purpose of explanation.

The arrays A and B are distributed as (block,block) on a $4 \times 4$ grid of processors as shown in Figure 5.4. In the Local Placement Model, the out-of-core local array of each processor is stored in a separate local array file. Consider the out-of-core local array (OCLA) on processor P5, shown in Figure 5.4(B). This is stored in the local array file (LAF) shown in Figure 5.4(D). Depending on the amount of memory available, the OCLA is divided into slabs each of which can fit in the in-core local array (ICLA). Program execution proceeds by fetching a slab from the LAF to the ICLA, doing the computation on that slab, storing the results back to the file, and so on for the remaining slabs.

The computation in this example is a stencil computation in which the value of each element $(i, j)$ is calculated using the values of its corresponding four neighbors, namely $(i-1, j), (i+1, j), (i, j-1)$ and $(i, j+1)$. Also, the computation in the current iteration uses values computed in the previous iteration. Thus to calculate the values at the four boundaries of the local array, P5 needs the last row of the local array of P1, the last column of the local array of P4, the first row of the local array of P9 and the first column of of the local array of P6. Before each iteration of the program, P5 needs to get these rows and columns from its neighboring processors. If the local array was in-core, these rows and columns would have been placed in the overlap areas shown in Figure 5.4(B). This is done so as to obtain better performance by retaining the DO loop even at the boundary. Since the local array is out-of-core, these overlap areas are provided in the local array file. The local array file basically consists of
parameter (n=1024)
real A(n,n), B(n,n)
...........
!HPF$ PROCESSORS P(4,4)
!HPF$ TEMPLATE T(n,n)
!HPF$ DISTRIBUT T(BLOCK,BLOCK) ONTO P
!HPF$ ALIGN with T :: A, B
...........
FORALL (i=2:n-1, j=2:n-1)
   A(i,j) = (B(i,j-1) + B(i,j+1) + B(i+1,j)
            + B(i-1,j))/4
...........
B = A

Figure 5.3: HPF Program Fragment: Solving Laplace’s Equation by Jacobi Iteration Method

Array distributed on 16 processors
(A)

Overlap Area

Out-of-core Local Array on P5
(B)

In core Local Array on P5
(C)

Local Array File on P5
(D)

Figure 5.4: Example of OCLA, ICLA and LAF
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 99

Table 5.1: Some of the PASSION Routines

<table>
<thead>
<tr>
<th>PASSION Routine</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>1  PASSION_read</td>
<td>Read entire LAF into ICLA</td>
</tr>
<tr>
<td>2  PASSION_write</td>
<td>Write entire ICLA to LAF</td>
</tr>
<tr>
<td>3  PASSION_read_section</td>
<td>Read a regular section (with stride) from LAF to ICLA</td>
</tr>
<tr>
<td>4  PASSION_write_section</td>
<td>Write a regular section (with stride) from ICLA to LAF</td>
</tr>
<tr>
<td>5  PASSION_read_prefetch</td>
<td>Prefetch a regular section</td>
</tr>
<tr>
<td>6  PASSION_prefetch_wait</td>
<td>Wait for a prefetch to complete</td>
</tr>
<tr>
<td>7  PASSION_read_reuse</td>
<td>read_section with data reuse</td>
</tr>
<tr>
<td>8  PASSION_global_read</td>
<td>Read a regular section (with stride) from global array file</td>
</tr>
<tr>
<td>9  PASSION_global_write</td>
<td>Write a regular section (with stride) to global array file</td>
</tr>
<tr>
<td>10 PASSION_oc_shift</td>
<td>Shift type collective communication on out-of-core data</td>
</tr>
<tr>
<td>11 PASSION_oc_multicast</td>
<td>Multicast communication on out-of-core data</td>
</tr>
</tbody>
</table>

the local array stored in either row-major or column-major order. In either case, the local array file will consist of the local array elements interspersed with overlap area as shown in Figure 5.4(D). The in-core local array also needs overlap area for the same reason as for the out-of-core local array.

At the end of each iteration, processors need to exchange boundary data with neighboring processors. In the in-core case, this would be done using a shift type collective communication routine to directly communicate data from the local memory of the processors. In the out-of-core case, this communication also requires I/O. In an out-of-core shift type collective communication, each processor reads the boundary data from its local array file and communicates it to the neighboring processor. The processor also receives the data sent by neighboring processors and stores it in appropriate locations in the local array file.

We have developed a library of routines to do the I/O as well as the out-of-core communication required in programs such as the example described above. The routines use a number of optimizations which are described in Section 5.5. Table 5.1 lists some of the PASSION routines and their function.
5.4.1 Out-of-Core Array Descriptor (OCAD)

The runtime routines require information about the array such as its size, distribution among the nodes of the distributed memory machine, storage pattern etc. All this information is stored in a data structure called the Out-of-Core Array Descriptor (OCAD) and passed as a parameter to the runtime routines. Before any of the runtime routines are called, the compiler or user must fill the necessary information in the OCAD. The structure of the OCAD is given in Figure 5.5. Rows 1 and 2 contain the lower and upper bounds of the in-core local array (excluding overlap area) in each dimension. The lower and upper bounds of the in-core local array in each dimension including overlap area are stored in rows 3 and 4. The size of the global array in each dimension is given in row 5. Row 6 contains the size of the out-of-core local array. Row 7 specifies the number of processors assigned to each dimension of the global array. The format in which the out-of-core local array is stored in the local array file is given in Row 8. The array is stored in the order in which array elements are accessed in the program, so as to reduce the I/O cost. The entry for the dimension which is stored first is set to 1, the entry for the dimension which is stored second is set to 2 and so on. For example, for a two-dimensional array, x,y = 1,2 means that the array is stored on disk in column major order and x,y = 2,1 means that the array is stored in row major order. This information enables the runtime system to determine the location of any array element (i,j) on the disk. Row 9 contains information about the distribution of the global array. Since the array can be distributed as BLOCK(m) or CYCLIC(m), where m is the block-size, the value of m is stored in Row 10 of the OCAD.

5.5 Optimizations

A number of optimizations, such as data sieving, data prefetching and data reuse, have been incorporated in the PASSION Runtime Library.
5.5.1 Data Sieving

Studies of file access characteristics by Kotz and Nieuwejaar [74] have shown that many scientific applications actually make strided accesses to the file. Hence, all the PASSION runtime routines for reading or writing data from/to files support the reading/writing of regular sections of arrays with strides. We define a regular section of an array as any portion of an array which can be specified in terms of its lower bound, upper bound and stride in each dimension. The need for reading array sections from disks may arise due to a number of reasons, for example FORALL or array assignment statements involving sections of out-of-core arrays.

Consider the array of size (11,11) shown in Figure 5.6, which is stored in a file. Suppose it is required to read the section (2:16;2:3;9:2) of this array. The elements to be read are circled in the figure. The interface to any parallel file system does not currently allow a programmer to read or write strided data using a single call. Hence the only direct way of reading the required data is to explicitly move the file pointer to each element and read it individually, which requires as many reads as the number of elements. We call this the Directed Read Method. A major disadvantage of this
method is the large number of I/O calls and low granularity of data transfer. Since
the I/O latency is very high, this method proves to be very expensive. For example,
on the Intel Touchstone Delta using one processor and one disk, it takes 16.06 ms. to
read 1024 integers from a file as one block, whereas it takes 1948 ms. to read all of
them individually.

Suppose it is required to read a section of a two-dimensional array specified by
$$(l_1 : u_1 : s_1, l_2 : u_2 : s_2).$$  The number of array elements in this section is
$$\lfloor (u_1 - l_1)/s_1 \rfloor + 1 \times \lfloor (u_2 - l_2)/s_2 \rfloor + 1.$$ Therefore, in the direct read method,

\begin{align*}
\text{No. of I/O requests} &= \lfloor (u_1 - l_1)/s_1 \rfloor + 1 \times \lfloor (u_2 - l_2)/s_2 \rfloor + 1 \\
\text{No. of array elements read per access} &= 1
\end{align*}

Thus in this method, the number of I/O requests is very high and the number of
elements accessed per request is very low, which is undesirable.

We propose a much more efficient method called \textit{Data Sieving} to read or write
out-of-core array sections having strides in one or more dimensions. Data sieving can
be explained with the help of Figure 5.7. As explained earlier, each processor has an
out-of-core local array (OCLA) associated with it. The OCLA is (logically) divided
into slabs, each of which can fit in main memory (ICLA). The OCLA shown in the
figure has four slabs. Let us assume that it is necessary to read the array section

\begin{figure}[h]
\centering
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
(1,1) & . & . & . & . & . & . & . & . & (1,11) \\
\hline
. & (2,3) & . & . & . & . & (2,9) & . & . & . \\
\hline
. & . & . & . & . & . & . & . & . & . \\
\hline
. & . & . & . & . & . & . & . & . & . \\
\hline
. & . & . & . & . & . & . & . & . & . \\
\hline
. & . & . & . & . & . & . & . & . & . \\
\hline
. & . & . & . & . & . & . & . & . & . \\
\hline
(10,3) & . & . & . & . & . & (10,9) & . & . & . \\
\hline
(11,1) & . & . & . & . & . & . & . & . & (11,11) \\
\hline
\end{tabular}
\caption{Accessing Sections of the OCLA}
\end{figure}
shown in Figure 5.7, specified by \((l_1 : u_1 : s_1, l_2 : u_2 : s_2)\), into the ICLA. Although this section spans three slabs of the OCLA, because of the stride all the data elements can fit in the ICLA.

In data sieving, the entire block of data from column \(l_2\) to \(u_2\) if the storage is column major, or the entire block from row \(l_1\) to \(u_1\) if the storage is row major, is read into a temporary buffer in main memory using one read call. The required data is then extracted from this buffer and placed in the ICLA. Hence the name data sieving. A major advantage of this method is that it requires only one I/O call and the rest is data transfer within main memory. The main disadvantage is the high memory requirement. Another disadvantage is the extra amount of data that is read from disk. However, we have found that the savings in the number of I/O calls increases performance considerably. For this method, assuming column major storage,

\[
\text{No. of I/O requests} = 1
\]

\[
\text{No. of array elements read per access} = (u_2 - l_2 + 1) \times \text{rows}
\]

Data sieving is a way of combining multiple I/O requests into one request so as to
reduce the effect of high I/O latency time. A similar method called message coalescing is used in interprocessor communication, where small messages are combined into a single large message in order to reduce the effect of communication latency. However, data sieving is different because instead of coalescing the required data elements together, it actually reads even unwanted data elements so that large contiguous blocks are read. The useful data is then filtered out by the runtime system in an intermediate step and passed on to the program. The unwanted data read into main memory is dynamically discarded.

Data sieving can also be considered from the following perspective. In the direct method, even though a separate read call is required for each individual element, it may not result in a disk access each time. There is usually some form of caching done at the I/O nodes and if the requested data lies in the cache, it can be read from the cache itself. In spite of this, we find that reading individual elements is a lot more expensive than reading one large chunk. This is probably because of the overhead of making several requests to the I/O nodes and looking up the software cache at the I/O node each time. Data sieving can be considered as a way of doing software caching in user space at the compute node itself. An entire chunk of data starting from the first element in the section is effectively cached at the compute node and all the required elements are supplied from this cache. The file system sees only a single request or at most a few requests, depending on the amount of memory available for this cache.

In the future when file systems support a strided interface, data sieving can also be implemented at the file system level. The I/O processors can read large chunks of data from the file, perform sieving, and send only the required data to the compute processors.

**Reducing the Memory Requirement**

If the stride in the array section is large or the number of rows in the section is small compared to the total number of rows in the out-of-core array, the amount of memory required to read the entire block from column \( l_2 \) to \( u_2 \) may be quite large.
There may not be enough main memory available to store this entire block. Hence instead of reading the entire section in a single call, a smaller subsection is fetched from the file, depending on the amount of memory available. Sieving is performed on this subsection, then the next subsection is read, sieving is performed on it, and so on. This reduces the memory requirements of the program considerably and increases the number of I/O requests only slightly. Let us assume that the array is stored in column major order and \( n \) columns of the OCLA can fit in main memory. Then for this case

\[
\text{No. of I/O requests} = \lceil (u_2 - l_2 + 1)/n \rceil
\]

\[
\text{No. of array elements read per access} = n \times n rows
\]

**Writing Array Sections**

Suppose it is required to write an array section \((l_1 : u_1 : s_1, l_2 : u_2 : s_2)\) from the ICLA to the LAF. The issues involved here are similar to those described above for reading array sections. A *Direct Write Method* can be used to write each element individually, but it suffers from the same problems of large number of I/O requests and low granularity of data transfer. To reduce the number of I/O requests, a method similar to the data sieving method described above needs to be used. If we directly use data sieving in the reverse direction, i.e. elements from the ICLA are placed at appropriate locations in a temporary buffer with stride, and the buffer is written to disk, the data in the buffer between the strided elements will overwrite the corresponding data elements on disk. In order to maintain data consistency, it is necessary to first read the entire subsection from the LAF into the temporary buffer. Then, data elements from the ICLA can be stored at appropriate locations in the buffer and the entire buffer can be written back to disk. This is similar to what happens in cache memories when there is a write miss. In that case, a whole line or block of data is fetched from main memory into the cache and then the processor writes data into the cache. Thus, writing sections using sieving requires twice the amount of I/O compared to reading sections, because for each write to disk the corresponding block has to first be fetched into memory. Therefore, for writing array sections
No. of I/O requests = $2\left(\frac{u_2 - l_2 + 1}{n}\right)$

No. of array elements transferred per access = $n \times nrows$

**Performance**

Table 5.2 gives the performance of data sieving versus the direct method for reading and writing array sections. An array of size $2K \times 2K$ is distributed among 64 processors in one dimension along columns. So each processor's local array file consists of an array of size $2K \times 32$ stored in column major order. Each processor needs to read/write a section from/to its local array file. We measured the time taken by the `PASSION_read_section()` and `PASSION_write_section()` routines for reading and writing sections of the out-of-core local array on each processor. We observe that data sieving provides tremendous improvement over the direct method in all cases. The reason for this is large number of I/O requests in the direct method, even though the total amount of data accessed is higher in data sieving.

Table 5.3 gives the number of I/O requests and the total amount of data transferred for each of the array sections considered in Table 5.2. We observe that in the data sieving method, the number of data elements transferred is more or less the same for all cases. This is because the total amount of data transferred depends only on the lower and upper bounds of the section and is independent of the stride. Hence the time taken using data sieving does not vary much for all the sections we have considered. However, there is a wide variation in time for the direct method, because only those elements belonging to the section are read. The time is lower for small sections and higher for large sections.

We observe that even for writing array sections, data sieving performs better than direct write even though it requires reading the section before writing. As expected, `PASSION_write_section()` takes about twice the time as `PASSION_read_section()` when using data sieving. All PASSION routines involving array sections use data sieving for greater efficiency.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 107

Table 5.2: Performance of Direct Read/Write versus Data Sieving (time in sec.)

Global array size 2K × 2K real nos. (single precision), 64 processors,
OCLA size 2K × 32, slab size = 16 columns

<table>
<thead>
<tr>
<th>Array Section</th>
<th>PASSION_read_section</th>
<th>PASSION_write_section</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Direct Read</td>
<td>Sieving</td>
</tr>
<tr>
<td>(1:2048:2, 1:32:2)</td>
<td>52.95</td>
<td>1.970</td>
</tr>
<tr>
<td>(1:2048:4, 1:32:4)</td>
<td>14.03</td>
<td>1.925</td>
</tr>
<tr>
<td>(100:2048:6, 5:32:4)</td>
<td>7.881</td>
<td>1.606</td>
</tr>
<tr>
<td>(1024:2048:2, 1:32:3)</td>
<td>18.43</td>
<td>1.745</td>
</tr>
</tbody>
</table>

Table 5.3: I/O requirements of Direct Read and Data Sieving Methods

Global array size 2K × 2K real nos. (single precision), 64 processors,
OCLA size 2K × 32, slab size = 16 columns

<table>
<thead>
<tr>
<th>Array Section</th>
<th>No. of I/O requests</th>
<th>No. of array elements read</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Direct Read</td>
<td>Sieving</td>
</tr>
<tr>
<td>(1:2048:2, 1:32:2)</td>
<td>16384</td>
<td>2</td>
</tr>
<tr>
<td>(1:2048:4, 1:32:4)</td>
<td>4096</td>
<td>2</td>
</tr>
<tr>
<td>(10:1024:3, 3:22:3)</td>
<td>2373</td>
<td>2</td>
</tr>
<tr>
<td>(100:2048:6, 5:32:4)</td>
<td>2275</td>
<td>2</td>
</tr>
<tr>
<td>(1024:2048:2, 1:32:3)</td>
<td>5643</td>
<td>2</td>
</tr>
</tbody>
</table>
5.5.2 Data Prefetching

In the model of computation and I/O described earlier, the OCLA is divided into a number of *slabs*, each of which can fit in the ICLA. Program execution proceeds as follows: a slab of data is fetched from the LAF to the ICLA; the computation is performed on this slab and the slab is written back to the LAF. This is repeated on other slabs till the end of the program. Thus I/O and computation form distinct phases in the program. A processor has to wait while each slab is being read or written as there is no overlap between computation and I/O. This is illustrated in Figure 5.8(A) which shows the time taken for computation and I/O on 3 slabs. For simplicity, reading, writing and computation are shown to take the same amount of time, which may not be true in certain cases.

The time taken by the program can be reduced if it is possible to overlap computation with I/O in some fashion. A simple way of achieving this is to issue a non-blocking I/O read request for the next slab immediately after the current slab has been read. This is called *Data Prefetching*. Kotz and Ellis [73] discuss in detail the effects of prefetching in parallel file systems. Since the read request is non-blocking, the reading of the next slab can be overlapped with the computation being performed on the current slab. If the computation time is comparable to the I/O time, this can
result in significant performance improvement. Figure 5.8(B) shows how prefetching can reduce the time taken for the example in Figure 5.8(A). Since the computation time is assumed to be the same as the read time, all reads other than the first one get overlapped with computation. The total reduction in program time is equal to the time for reading two slabs, as only two of the three reads can be overlapped in this example.

Note that the parallel file system may do some prefetching on its own. For example, for each read request on the Delta, the CFS automatically prefetches the next seven blocks of the file into the I/O node, resulting in a total of at least eight blocks being read. For many applications, the prefetching done at the file system level may not be sufficient and could even degrade performance, unless the user can control the prefetching policy. This is because different applications have different access patterns which may not match well with the prefetching policy of the file system. Most of the parallel file systems at present do not allow the user to control prefetching. On the other hand, the prefetching done by PASSION is controlled by the user. Since the user knows what data is needed next, it can be prefetched correctly. Prefetching can be done in PASSION using the routine \texttt{PASSION\_read\_prefetch()} and the routine \texttt{PASSION\_prefetch\_wait()} can be used to wait for the prefetch to complete.

Performance

We use an out-of-core median filtering program to illustrate the performance of data prefetching. Median filtering is frequently used in computer vision and image processing applications to smooth the input image. Each pixel is assigned the median of the values of its neighbors within a window of a particular size, say $3 \times 3$ or $5 \times 5$ or larger. We have implemented a parallel out-of-core median filtering program using PASSION runtime routines for I/O and communication. The image is distributed among processors in one dimension along columns and stored in local array files. Depending on the window size, each processor needs a few columns from its right and left neighbors. This requires a shift type communication which is implemented using the routine \texttt{PASSION\_oc\_shift()}. 

Table 5.4: Performance of Median Filtering using $3 \times 3$ window (time in sec.)

<table>
<thead>
<tr>
<th>Procs</th>
<th>4 slabs</th>
<th>8 slabs</th>
<th>16 slabs</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Prefetch</td>
<td>No Prefetch</td>
<td>Prefetch</td>
</tr>
<tr>
<td>4</td>
<td>36.37</td>
<td>46.56</td>
<td>33.63</td>
</tr>
<tr>
<td>8</td>
<td>18.32</td>
<td>23.37</td>
<td>16.72</td>
</tr>
<tr>
<td>32</td>
<td>5.340</td>
<td>6.830</td>
<td>5.260</td>
</tr>
<tr>
<td>64</td>
<td>5.650</td>
<td>5.850</td>
<td>4.970</td>
</tr>
</tbody>
</table>

Table 5.5: Performance of Median Filtering using $5 \times 5$ window (time in sec.)

<table>
<thead>
<tr>
<th>Procs</th>
<th>4 slabs</th>
<th>8 slabs</th>
<th>16 slabs</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Prefetch</td>
<td>No Prefetch</td>
<td>Prefetch</td>
</tr>
<tr>
<td>4</td>
<td>81.47</td>
<td>94.09</td>
<td>79.25</td>
</tr>
<tr>
<td>8</td>
<td>41.81</td>
<td>47.76</td>
<td>41.35</td>
</tr>
<tr>
<td>16</td>
<td>21.57</td>
<td>25.41</td>
<td>21.40</td>
</tr>
<tr>
<td>32</td>
<td>11.36</td>
<td>12.83</td>
<td>11.40</td>
</tr>
<tr>
<td>64</td>
<td>7.110</td>
<td>9.010</td>
<td>6.810</td>
</tr>
</tbody>
</table>

Tables 5.4 and 5.5 show the performance of median filtering on the Intel Touchstone Delta for windows of size $3 \times 3$ and $5 \times 5$ respectively. The image is of size $2K \times 2K$ pixels. We assume this to be out-of-core for the purpose of experimentation. The number of processors is varied from 4 to 64 and the size of the ICLA is varied in each case in such a way that the number of slabs varies from 4 to 16. Since the Touchstone Delta has 64 disks, each processor’s LAF can be stored on a separate disk.

The following observations can be made from these tables:-

1. In all cases, prefetching improves performance considerably. In some cases, the improvement is close to 40%. Figures 5.9 and 5.10 show the relative performance with and without prefetching when the number of slabs is 8.
Figure 5.9: Median Filtering using $3 \times 3$ window

Figure 5.10: Median Filtering using $5 \times 5$ window
2. Without prefetching, as the number of slabs is increased, the time taken increases. This is because a larger number of slabs means a smaller slab size which results in a larger number of I/O requests.

3. With prefetching, as the number of slabs is increased, the time taken decreases in most cases. Since the first slab can never be prefetched, all processors have to wait for the first slab to be read. As the slab size is reduced, the wait time for the first slab is also reduced and there is more overlap of computation and I/O. However, the number of I/O requests increases. When the slab size is large, a reduction in the slab size by half improves performance because the saving in the wait time for the first slab is higher than the increase in time due to the larger number of I/O requests. But when the slab size is small (64 processor case with 8 or 16 slabs), the higher number of I/O requests costs more than the decrease in wait time for the first slab. Hence the performance actually degrades in this case.

5.5.3 Data Reuse

In out-of-core programs, data is fetched from files many times. If a portion of the data currently fetched into memory is also needed for the computation on the next data set, then that portion already in memory can be reused instead of reading it again with the next data set. For example, consider the Laplace equation solver discussed earlier (Figure 5.3). Suppose the array is distributed along columns. Then the computation of each column requires one column from the left and one column from the right. The computation of the last column requires one column from the overlap area and the computation of the column in the overlap area cannot be performed without reading the next column from the disk. Hence, instead of reading the column in the overlap area again with the next set of columns, it can be reused by moving it to the first column of the array and the last column can be moved to the overlap area before the first column (Figure 5.11). If this move is not done, it would be necessary to read the two columns again from the disk along with data for the next slab. The reuse thus
eliminates the reading of two columns in this example. In general, the amount of data reuse would depend on the intersection of the sets of data needed for computations involving two consecutive slabs. The PASSION routine `PASSION_read_reuse()` can be used to perform data reuse.

![Diagram showing data reuse and overlap areas](image)

**Figure 5.11: Data Reuse**

<table>
<thead>
<tr>
<th>Array size</th>
<th>Time in sec.</th>
<th>Without Reuse</th>
<th>With Reuse</th>
</tr>
</thead>
<tbody>
<tr>
<td>2K × 2K</td>
<td>75.12</td>
<td>71.71</td>
<td></td>
</tr>
<tr>
<td>4K × 4K</td>
<td>274.7</td>
<td>269.1</td>
<td></td>
</tr>
</tbody>
</table>

Table 5.6: Performance of Data Reuse

Laplace Equation solver, 64 processors

Table 5.6 shows the performance improvement obtained by using data reuse for the Laplace Equation Solver program on the Intel Touchstone Delta using 64 processors. Two different global array sizes are used — 2K × 2K and 4K × 4K. Reuse provides good performance improvement even in this case where only two columns can be reused.
Chapter 6

Runtime Support for Out-of-Core Programs: (II) Collective I/O

6.1 Introduction

Chapter 5 discusses runtime support for the Local Placement Model where there is a separate local array file for each processor. Each processor can directly access only its own local array file. However, the situation is quite different in the Global Placement and Partitioned In-Core Models. In these two models, there is a single global array file containing the entire out-of-core array. Each processor accesses the data it needs from this common file. A processor may, in general, need to access any arbitrary section of the global array, with or without stride. The global array may be stored in the global array file in either row-major or column-major order. As a result, the data required by each processor may not be located contiguously in the file. Also, the requests of some processors may overlap. In the extreme case, all processors may want to access the same section of the array. If each processor directly tries to read the data it needs, it may result in a large number of low granularity requests and multiple requests for the same data.

In loosely synchronous SPMD programs, all processors perform similar operations but on different data sets [46]. Hence, if one processor needs to read data from disks, it is very likely that a group of processors or maybe all processors need to read data from disks at about the same time. This makes it possible for the requesting
processors to cooperate in reading or writing out-of-core data in an efficient manner, which is known as collective I/O. This chapter proposes a method, called the Extended Two-Phase Method, which uses collective I/O for accessing sections of out-of-core arrays efficiently. PASSION provides two routines, \texttt{PASSION\_global\_read()} and \texttt{PASSION\_global\_write()}, to read/write arbitrary array sections with strides from/to a global array file, using the Extended Two-Phase Method.

### 6.2 Need for Collective I/O

The need for collective I/O can be explained using the following example. Consider the large out-of-core array shown in Figure 6.1 and assume that it is stored in a file in column-major order. Four processors (0 – 3) need to access a block of rows each as shown. Since the access requests are orthogonal to the file storage order, the data requested by each processor is located non-contiguously in the file. Also, the requests of different processors lie interleaved in the file. A portion of 0’s request lies at the start of the file, followed by some unwanted data (gap), then a portion of 1’s request followed by a gap, then a portion of 2’s request followed by a gap, then a portion of 3’s request followed by a gap, then again a portion of 0’s request, and so on. To read the data using the interface provided by most of the existing parallel file systems, each processor has to explicitly seek to the appropriate location in the file, read the small chunk of data, then seek to the next location, and so on. We call this the \textit{Direct Method}. The Vesta file system [29] and the nCUBE file system [36] do provide support for the user to specify a logical view of the data to be read, and use a single call to read data. But each processor’s request is serviced independently and there is no collective optimization based on the requests of all processors.

The drawback of the Direct Method is a large number of low granularity requests which may arrive from different processors in any order. Since I/O latency is very high, this usually results in poor performance. The basic premise behind collective I/O is that if a group of processors need to read/write data at the same time, they...
can cooperate among themselves to perform I/O efficiently in large chunks and in the right order. This is usually possible in loosely synchronous SPMD programs in which all processors perform similar operations on different data sets. Hence, a group of processors may need to perform I/O at the same time. An appropriate interface needs to be provided for the user to specify a collective I/O request. The PASSION Runtime Library provides such an interface [26].

Given an appropriate collective I/O interface, the collective I/O operation can be implemented either as a library on top of the file system, or at the file system level itself. A technique called Two-Phase I/O [37, 12] has been proposed for doing collective I/O at the library level. In this method, I/O is done in two phases. In the first phase, processors cooperate to read data in large chunks, and in the second phase they do an in-core redistribution of the data. Disk-directed I/O [69] is a technique which proposes to do collective I/O at the file system level. In disk-directed I/O, a collective I/O request is sent to all I/O nodes which determine the order and timing of the flow of data.
6.3 Extended Two-Phase Method for Collective I/O

The Two-Phase Method [37, 12] is a collective I/O technique for reading an entire in-core array from a file into a distributed array in main memory, and conversely to write a distributed in-core array to a file. I/O is done in two phases. In the first phase, processors always read data assuming a conforming distribution. A conforming distribution is defined as a distribution of an array among processors such that each processor's local array lies contiguously in the file. This results in each processor reading a single large chunk of data. For an array stored in a file in column-major order, a column-block distribution is the conforming distribution. In the second phase, data is redistributed among processors to whatever is the desired distribution. Since I/O cost is orders of magnitude more than communication cost, the cost incurred by the second phase is negligible. This Two-Phase approach is found to give consistently good performance for all distributions [37, 12].

We have extended the basic Two-Phase Method to access arbitrary sections of out-of-core arrays. This Extended Two-Phase Method performs I/O for out-of-core arrays efficiently by

- partitioning the I/O workload among processors dynamically, depending on the access requests,
- combining several I/O requests into fewer larger granularity requests,
- reordering requests so that data is accessed in proper sequence,
- eliminating multiple disk accesses for the same data, and
- reducing contention for disks.

6.3.1 Reading Sections of Out-of-Core Arrays

We first describe the Extended Two-Phase Method for reading regular sections of out-of-core arrays. We define a regular section of an array as a section which can
be specified by a lower-bound, upper-bound and stride in each dimension. For the purpose of explanation, we consider the case where each processor needs to read some regular section of a two-dimensional array stored in the file in column-major order. The Extended Two-Phase Method can actually be used for arrays with any number of dimensions, stored in any order in the file and accessed by any number of processors. Section 6.6 discusses how the general case can be implemented.

In the Extended Two-Phase Method, the I/O workload is divided among processors. For this, we assign ownership to portions of the file such that a processor can directly access only the portion of the file it owns. The file is effectively logically divided into domains. The portion of the file which a processor can directly access is called its File Domain (FD). For a file stored in column-major order, the file domain of each processor is some set of columns of the array. The issue of how to select file domains is important because it determines how the I/O workload gets divided among processors. This is discussed in detail in Section 6.4.

Let us assume that each processor needs to read some regular section \((l_1 : u_1 : s_1, l_2 : u_2 : s_2)\) of the array in global coordinates. In the first step of the Extended Two-Phase Method, processors exchange their own access information (the indices \(l_1, u_1, s_1, l_2, u_2, s_2\)) with other processors, so that each processor knows the access requests of other processors. This information is stored in a data structure called the File Access Descriptor (FAD). The FAD contains exactly the same information on all processors. This exchange phase is not required if the collective I/O interface itself provides information about other processors' access requests.

Since each processor knows its own file domain and the access requests of other processors, it can determine what portion of the data in its file domain is needed by other processors. This is done by computing the intersection of the requests of other processors from the FAD and its own file domain. This information is stored in a data structure called the File Domain Access Table (FDAT). Thus the FDAT of a processor contains information about which portions of its file domain have been requested by other processors.

Each processor now has to read data from its file domain, as determined by the
FDAT. For example, Figure 6.2 shows the file domain of processor 0 and, for some access pattern, the portions of this file domain which have been requested by other processors. A simple way of reading would be to read all the data needed by processor 0, followed by that needed by processor 1 and so on in order of processor number. But, as in the case of Figure 6.2, this may result in too many small accesses which are not in sequence. For the read to be done efficiently, it is important that the FDAT be analyzed so that the file is accessed in sequence and contiguously, as far as possible.

We have devised a very general method for analyzing the information in the FDAT, which ensures that the file is read contiguously and in sequence. Each processor calculates the minimum of the lower-bounds and the maximum of the upper-bounds of all sections in its FDAT. This effectively determines the smallest section which contains all the data that needs to be read from the file domain (for example, section ABCD in Figure 6.2). This section may also contain some data which is not required by any processor. If the processor tries to read only the useful data, it may result in a number of small strided accesses. To avoid this, it uses the optimization data sieving described in Chapter 5. The processor reads a column of the section at a time in a single operation into a temporary buffer. This may include some unwanted data. The useful data is extracted from the temporary buffer and placed in communication.
buffers depending on which processors need the data. The entire section is read from the file domain in this fashion. It is possible to read more than one column at a time if there is enough temporary memory available to do sieving on the set of columns. This forms the first phase of the Extended Two-Phase Method.

The second phase of the Extended Two-Phase Method consists of communicating the data read in the first phase to the respective processors. The information in the FDAT is sufficient for each processor to determine what data has to be sent to which processor. Since each processor knows the file domains of other processors and its own access request, it can calculate how much data it needs to receive from other processors, as well as the locations in main memory where the received data needs to be placed.

The two phases of the Extended Two-Phase Method can either be done distinctly by performing all the I/O first and then the communication, or they can be overlapped (pipelined) by reading smaller portions of data and communicating it.

### 6.3.2 Writing Sections of Out-of-Core Arrays

The algorithm for writing sections is essentially the reverse of the algorithm for reading sections. From the information in the File Access Descriptor (FAD), each processor can determine what portion of the section it needs to write lies in the file domains of other processors. Each processor also computes its own File Domain Access Table (FDAT), which indicates how much data it needs to receive from other processors, to be written to its file domain. The first phase of the Extended Two-Phase Method for writing is to perform communication to get this data from other processors.

The second phase is to write the data to the file in sequence and contiguously. The FDAT is analyzed in the same way as in the read algorithm. Each processor determines the minimum and maximum of all indices in its FDAT. This effectively determines the smallest section which includes all the data that needs to be written to the file domain. It may also include some data which is not being written by any processor. The processor writes the useful data in this section one column at a time using data sieving. However, for writing using data sieving, we cannot directly use
the reverse of the method used for reading. If the useful data is placed at appropriate locations (possibly with stride) in a temporary buffer and the buffer is written to the file, the contents of the buffer between the useful data elements will overwrite the data in the file. To maintain data consistency, it is necessary to first read the entire column from the file into the temporary buffer. Then, the data elements to be written in that column can be stored at appropriate locations in the buffer and the entire column can be written back to disk. Thus, writing sections requires twice the amount of I/O compared to reading sections, because to write each column, the corresponding column has to first be read into memory. It is possible to avoid this extra reading in cases where the entire column contains useful data to be written. This requires each processor to do a more extensive analysis of the FDAT, to make sure that there are no “holes” between the data sets being written by different processors in the same column, and also no strides in the section being written by each processor. As in the case of reading sections, it is possible to do sieving with more than one column at a time if there is enough temporary memory available.

If the sections requested to be written by different processors have some elements in common, there is a data consistency problem. The result depends on how the Extended Two-Phase Method has been implemented. In our implementation, if there are write requests from multiple processors to the same location, the data from the highest numbered processor is written to the file.

6.4 Partitioning I/O Among Processors

In the Extended Two-Phase Method, processors cooperate to perform I/O. The exact partitioning of the I/O workload among processors depends on how file domains are defined. In general, the partitioning of I/O can be done either statically or dynamically.
6.4.1 Static Partitioning

One way of partitioning I/O is to assign a block of columns of the entire out-of-core array to each processor, as if the array were distributed among processors in a column block fashion. Thus the file domain of each processor is a block of columns of the array, which is stored contiguously in the file. The file domains are predefined and fixed. The size of each domain can be determined from the size of the array and the number of processors, and is independent of the access requests. This is called a static partitioning of I/O among processors. Note that this is just a logical partitioning of the file among processors, so that each processor directly accesses only a particular portion of the file (its file domain). Figure 6.3(A) shows the file domains with static partitioning when there are four processors.

6.4.2 Dynamic Partitioning

The main drawback of static partitioning is that the partitioning is independent of the type of access requests. For many types of access patterns, this may result in an imbalance of I/O among processors. Some processors may perform more I/O than others and some may not perform any I/O at all. For example, consider the access pattern in Figure 6.3. If the partitioning is done statically, the access requests span the file domains of only two processors. Thus only two processors (1 and 2) perform all the I/O; the remaining two processors (0 and 3) only wait to receive data sent by 1 and 2. Another drawback of static partitioning is that as the size of the out-of-core array is increased keeping the number of processors fixed, the size of each file domain also increases. Hence, for the same access pattern, as the size of the out-of-core array is increased, the access requests span the file domains of fewer processors resulting in greater imbalance of I/O and lower performance.

The I/O throughput can potentially be improved by partitioning the I/O workload among processors dynamically, depending on the array sections requested. This is illustrated in Figure 6.3(B). For a file stored in column-major order, each processor calculates the lowest and highest among the columns of the sections requested by
all processors. The section formed by these columns and all the rows of the out-of-core array is called the *bounding section*. The bounding section includes the sections requested by all processors and it lies contiguously in the file. Figure 6.3(B) shows the bounding section for the given access requests. File domains are determined by dividing the bounding section among the processors in a column-block fashion. Thus the file domain of each processor is a contiguous chunk of the bounding section.

If the requested sections span all the columns of the out-of-core array, the dynamically selected file domains are exactly the same as those determined statically. But if the sections span only a few columns, as in Figure 6.3, dynamic partitioning provides a much better balance of I/O among processors. The memory requirements of the Extended Two-Phase Method are also reduced, because the file domain of each processor is smaller. In the static case, if all requested sections lie in a single processor’s file domain, all the requested data may not fit in the memory of that processor. Hence I/O and communication may need to be done in stages several times. This is less likely to occur in the case of dynamic partitioning, because the requested data is more evenly divided among processors.
1. Exchange access information with other processors and fill in the File Access Descriptor (FAD).

2. Determine the smallest section which includes the sections requested by all processors, called the bounding section.

3. The file domain of each processor is determined by dividing this bounding section among the processors in a column-block manner for arrays stored in column-major order.

4. Compute the intersection of the FAD and this processor's file domain, and fill in the File Domain Access Table (FDAT).

5. Calculate the minimum of the lower-bounds and the maximum of the upper-bounds of all sections in the FDAT to determine the smallest section containing all the data needed from the file domain.

6. Read this section using data sieving and communicate the data to the requesting processors.

Figure 6.4: Extended Two-Phase Method for Reading Sections of Out-of-Core Arrays

The algorithm for reading sections of out-of-core arrays using the Extended Two-Phase Method with dynamic partitioning is given in Figure 6.4. If the array is stored in the file in row-major order instead of column-major order, the only difference would be that the file domains would be defined in terms of row-blocks and data sieving would be done along rows.

6.5 Performance

We extensively tested the performance of the Extended Two-Phase Method versus the Direct Method on the Intel Touchstone Delta, for many different access patterns,
using both static and dynamic partitioning. The access patterns can be classified into three basic types:

1. *Common Sections:* All processors need to access exactly the same section of the array.

2. *Overlapping Sections:* Parts of the section requested by a processor may overlap with parts of the sections requested by other processors.

3. *Distinct Sections:* The section requested by each processor does not have any data in common with the section requested by any other processor.

### 6.5.1 Reading Common Sections

Table 6.1 compares the performance of the Extended Two-Phase Method and the Direct Method for reading common sections. The array size is $4K \times 4K$ and the number of processors is 16. Figure 6.5 shows approximately where each of these sections is located in the array. The performance of the Extended Two-Phase Method was measured using both static and dynamic partitioning. We observe that in all cases, the Extended Two-Phase Method performs considerably better than the Direct Method. This is primarily because, in the Extended Two-Phase Method, the common section is read only once and then broadcast to other processors, whereas in the Direct Method, all processors simultaneously try to read the same section from the file, resulting in extra I/O overhead.

In all cases, the Extended Two-Phase Method takes a lot less time with dynamic partitioning than with static partitioning. In the case of static partitioning, for a $4K \times 4K$ array with 16 processors, each processor's file domain is of size $4K \times 256$. Thus all sections except V lie in the file domains of only a few processors. But with dynamic partitioning, each section is evenly divided among all available processors, resulting in higher I/O throughput. Since Section V spans all 4096 columns, the statically and dynamically selected file domains are identical, resulting in identical performance.
Table 6.1: Time (sec.) for Reading Common Sections

Array size $4K \times 4K$ real nos. (single precision), 16 processors

<table>
<thead>
<tr>
<th>No.</th>
<th>Array Section</th>
<th>Direct Read</th>
<th>Extended Two-Phase Static</th>
<th>Dynamic</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td>(1:100:1, 1:100:1)</td>
<td>1.632</td>
<td>1.027</td>
<td>0.431</td>
</tr>
<tr>
<td>II</td>
<td>(200:300:1, 200:300:1)</td>
<td>1.867</td>
<td>0.883</td>
<td>0.363</td>
</tr>
<tr>
<td>III</td>
<td>(400:800:1, 400:800:1)</td>
<td>6.265</td>
<td>3.692</td>
<td>1.056</td>
</tr>
<tr>
<td>IV</td>
<td>(32:64:1, 128:1024:1)</td>
<td>9.995</td>
<td>2.780</td>
<td>1.318</td>
</tr>
<tr>
<td>V</td>
<td>(1:16:1, 1:4096:1)</td>
<td>52.06</td>
<td>3.241</td>
<td>3.241</td>
</tr>
<tr>
<td>VI</td>
<td>(1:4096:1, 1:16:1)</td>
<td>1.216</td>
<td>2.024</td>
<td>0.420</td>
</tr>
</tbody>
</table>

Figure 6.5: The common sections listed in Table 6.1 (not to scale)
In cases where the sections span a large number of columns (e.g. Section V), the Extended Two-Phase Method provides a significant improvement over the Direct Method. This is because for such cases, the Direct Method results in a large number of small requests spread across the entire file. In the Extended Two-Phase Method, the I/O gets evenly divided among all processors, the requests are reordered and data is read in large chunks.

### 6.5.2 Reading Overlapping Sections

Table 6.2 compares the performance of the Extended Two-Phase Method and the Direct Method for reading overlapping sections. Figure 6.6 shows approximately where these sections are located in the array. To represent these overlapping sections for all processors concisely, we use the following notation. Each processor’s request is denoted by \((l_1 + ov_1 \times p : u_1 + ov_1 \times p : s_1, l_2 + ov_2 \times p : u_2 + ov_2 \times p : s_2)\), where \(p\) is the processor number and \(ov_1, ov_2\) are some constants. The amount of overlap can be changed by varying \(ov_1\) and \(ov_2\). For example, the notation \((1:100:1, 1+10p:100+10p:1)\) in row I of Table 6.2 represents a group of overlapping sections with processor 0 requesting section \((1:100:1, 1:100:1)\), processor 1 requesting section \((1:100:1, 11:110:1)\), processor 2 requesting section \((1:100:1, 21:120:1)\) and so on. The sections in cases I — IV overlap along columns, the sections in cases V — VII overlap along rows and the sections in case VIII overlap in both dimensions.

We observe that the Extended Two-Phase Method with dynamic partitioning performs the best in all cases. The sections in cases I and II are of the same size, but they differ in the amount of overlap. The sections in case I have more overlap than those in case II. Since the total number of columns of the out-of-core array spanned by the sections in case I is less than that by the sections in case II, it takes less time to read the sections in case I. Sections in cases IV, V and VI span only a few columns. For these cases, the Direct Method performs better than the Extended Two-Phase Method with static partitioning. This is because static partitioning results in only a few processors performing I/O. But the Extended Two-Phase Method with
Table 6.2: Time (sec.) for Reading Overlapping Sections

Array size 4K × 4K real nos. (single precision), 16 processors

<table>
<thead>
<tr>
<th>No.</th>
<th>Array Section</th>
<th>Direct Read</th>
<th>Extended Two-Phase</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>(p = processor number)</td>
<td></td>
<td>Static</td>
</tr>
<tr>
<td>I</td>
<td>(1:100;1, 1+10p;100+10p;1)</td>
<td>2.000</td>
<td>1.830</td>
</tr>
<tr>
<td>II</td>
<td>(1:100;1, 1+50p;100+50p;1)</td>
<td>4.627</td>
<td>1.859</td>
</tr>
<tr>
<td>III</td>
<td>(400:800;1, 400+100p;800+100p;1)</td>
<td>8.097</td>
<td>3.348</td>
</tr>
<tr>
<td>IV</td>
<td>(1:4096;1, 1+8p;16+8p;1)</td>
<td>1.152</td>
<td>3.374</td>
</tr>
<tr>
<td>V</td>
<td>(1+50p;100+50p;1, 1:100;1)</td>
<td>1.579</td>
<td>1.994</td>
</tr>
<tr>
<td>VI</td>
<td>(400+100p;800+100p;1, 400:800;1)</td>
<td>7.442</td>
<td>11.84</td>
</tr>
<tr>
<td>VII</td>
<td>(1+8p;16+8p;1, 1:4096;1)</td>
<td>50.32</td>
<td>2.992</td>
</tr>
<tr>
<td>VIII</td>
<td>(200+100p;400+100p;1, 200+100p;400+100p;1)</td>
<td>3.104</td>
<td>2.986</td>
</tr>
</tbody>
</table>

Figure 6.6: The overlapping sections listed in Table 6.2 (not to scale)
dynamic partitioning performs better than the Direct Method, since there is a better
distribution of I/O among processors. The sections in case VII span all columns of
the array, which is the worst case for the Direct Method. Finally, in case VIII, the
sections have overlap in both dimensions and again the Extended Two-Phase Method
with dynamic partitioning takes the least time.

6.5.3 Reading Distinct Sections

Table 6.3 compares the performance of the Extended Two-Phase Method and the
Direct Method for reading distinct sections. Figure 6.7 shows approximately where
these sections are located in the array. We use the same notation as above, \((l_1 + ov\times
p : u_1 + ov \times p : s_1, l_2 + ov\times p : u_2 + ov \times p : s_2)\), for representing distinct sections.
The overlap factors \(ov1\) and \(ov2\) need to be large enough to ensure that the sections
are distinct.

In all cases, the Extended Two-Phase Method with dynamic partitioning per-
forms the best. The relative performance of the three methods is similar to that for
overlapping sections in Table 6.2. The sections in case I are located along rows, so
the requests of different processors lie in separate locations in the file. Hence the
Extended Two-Phase Method performs only slightly better. The sections in cases II — IV are located along columns and so the requests of different processors lie in-
terleaved in the file. Hence the Extended Two-Phase Method performs considerably
better. Static partitioning does not give good performance for the sections in case II
because they span only a few columns. The best case for the Extended Two-Phase
Method is case IV since the sections span all the columns. The sections in cases V and
VI lie partly interleaved in the file. Even for these cases, the Extended Two-Phase
Method performs the best.
Table 6.3: Time (sec.) for Reading Distinct Sections

Array size 4K × 4K real nos. (single precision), 16 processors

<table>
<thead>
<tr>
<th>No.</th>
<th>Array Section</th>
<th>Direct Read</th>
<th>Extended Two-Phase</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Static</td>
<td>Dynamic</td>
</tr>
<tr>
<td>I</td>
<td>(1:100:1, 1+100p:100+100p:1)</td>
<td>1.976</td>
<td>1.954</td>
</tr>
<tr>
<td>II</td>
<td>(1+100p:100+100p:1, 1:100:1)</td>
<td>1.633</td>
<td>2.182</td>
</tr>
<tr>
<td>III</td>
<td>(200+200p:400+200p:1, 512:1024:1)</td>
<td>8.016</td>
<td>5.680</td>
</tr>
<tr>
<td>IV</td>
<td>(1+32p:16+32p:1, 1:4096:1)</td>
<td>51.63</td>
<td>4.823</td>
</tr>
<tr>
<td>VI</td>
<td>(1+32p:32+32p:1, 1+100p:1024+100p:1)</td>
<td>12.02</td>
<td>2.991</td>
</tr>
</tbody>
</table>

Figure 6.7: The distinct sections listed in Table 6.3 (not to scale)
Table 6.4: Time (sec.) for Writing Distinct Sections

Array size $4K \times 4K$ real nos. (single precision), 16 processors

<table>
<thead>
<tr>
<th>No.</th>
<th>Array Section</th>
<th>Direct Write</th>
<th>Extended Two-Phase Static</th>
<th>Extended Two-Phase Dynamic</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td>$p = (1:100:1, 1+100p:100+100p:1)$</td>
<td>1.944</td>
<td>3.250</td>
<td>2.801</td>
</tr>
<tr>
<td>II</td>
<td>$p = (1+100p:100+100p:1, 1:100:1)$</td>
<td>1.182</td>
<td>2.901</td>
<td>0.854</td>
</tr>
<tr>
<td>IV</td>
<td>$p = (1+32p:16+32p:1, 1:4096:1)$</td>
<td>24.85</td>
<td>10.25</td>
<td>10.25</td>
</tr>
<tr>
<td>V</td>
<td>$p = (200+200p:400+200p:1, 1+200p:512+200p:1)$</td>
<td>5.155</td>
<td>5.461</td>
<td>4.401</td>
</tr>
<tr>
<td>VI</td>
<td>$p = (1+32p:32+32p:1, 1+100p:1024+100p:1)$</td>
<td>8.233</td>
<td>4.994</td>
<td>4.274</td>
</tr>
</tbody>
</table>

6.5.4 Writing Distinct Sections

We only consider the case where each processor writes a distinct section to the file, because it is unlikely that processors will want to write overlapping or common sections. Table 6.4 compares the performance of the Extended Two-Phase Method and the Direct Method for writing distinct sections. The sections chosen are the same as those for reading in Table 6.3 (Figure 6.7).

We use the most general algorithm for writing in the Extended Two-Phase Method, which requires an extra read for each write. Hence for the sections in case I, the Direct Method performs better because it does not require the extra read and also these sections are small with few columns. The sections in cases II – IV lie interleaved in the file, so the Extended Two-Phase Method with dynamic partitioning performs much better than the Direct Method. The sections in cases V and VI lie partly interleaved in the file and even for these cases, the Extended Two-Phase Method performs considerably better.

6.5.5 Accessing Sections with Non-Unit Strides

We have also tested the performance for accessing sections with non-unit strides. When an array section has a non-unit stride, each element of the array lies strided
Table 6.5: Time (sec.) for Reading Sections with Non-unit Strides

Array size $4K \times 4K$ real nos. (single precision), 16 processors

<table>
<thead>
<tr>
<th>No.</th>
<th>Array Section</th>
<th>Direct Read</th>
<th>Extended Two-Phase Static</th>
<th>Dynamic</th>
</tr>
</thead>
<tbody>
<tr>
<td>II</td>
<td>$(1+250p:250+250p:2, 1+250p:250+250p:2)$</td>
<td>53.13</td>
<td>3.610</td>
<td>2.842</td>
</tr>
<tr>
<td>IV</td>
<td>$(1+64p:64+64p:2, 500:2500:3)$</td>
<td>96.20</td>
<td>4.759</td>
<td>3.848</td>
</tr>
<tr>
<td>V</td>
<td>$(500:2500:3, 1+64p:64+64p:2)$</td>
<td>130.7</td>
<td>4.574</td>
<td>2.340</td>
</tr>
</tbody>
</table>

in the file. The only way of reading such array sections using a direct method is to explicitly seek to each individual element and read only that element. This results in very low granularity of data transfer, which is very expensive. The Extended Two-Phase Method overcomes this drawback of the Direct Method by reordering requests and using data sieving for larger granularity reads.

Table 6.5 shows the performance of the Extended Two-Phase Method for reading sections with non-unit strides. The sections in case I span almost the entire array, with stride equal to the number of processors. Hence static and dynamic partitioning take exactly the same time. The sections in cases II and III are located diagonally across the out-of-core array. The sections in case IV are located along columns and the sections in case V are located along rows. In all cases, the Extended Two-Phase Method is more than 20 times faster than the Direct Method. Table 6.6 shows the performance of the Extended Two-Phase Method for writing sections with non-unit strides. The sections chosen are the same as in Table 6.5. Even for writing sections, the Extended Two-Phase Method provides considerable performance improvement.

### 6.5.6 Scalability

We have also studied the scalability of the Extended Two-Phase Method for large number of processors, large array sections and large out-of-core arrays. Since dynamic
Table 6.6: Time (sec.) for Writing Sections with Non-unit Strides

Array size 4K × 4K real nos. (single precision), 16 processors

<table>
<thead>
<tr>
<th>No.</th>
<th>Array Section</th>
<th>Direct Write</th>
<th>Extended Two-Phase Static</th>
<th>Dynamic</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td>((p+1:4096:nprocs, p+1:4096:nprocs))</td>
<td>53.28</td>
<td>22.77</td>
<td>22.77</td>
</tr>
<tr>
<td>III</td>
<td>((1+200p:500+200p:3, 1+200p:500+200p:3))</td>
<td>44.64</td>
<td>8.696</td>
<td>7.516</td>
</tr>
<tr>
<td>IV</td>
<td>((1+64p64+64p2, 500:2500:3))</td>
<td>71.35</td>
<td>8.858</td>
<td>7.279</td>
</tr>
<tr>
<td>V</td>
<td>((500:2500:3, 1+64p64+64p2))</td>
<td>79.24</td>
<td>7.724</td>
<td>4.405</td>
</tr>
</tbody>
</table>

partitioning always performs better or at least as good as static partitioning, we only consider dynamic partitioning for the scalability experiments. Table 6.7 shows the timings obtained by varying the number of processors requesting array sections from 4 to 128, for both reading and writing. We have selected a few sections in each category, i.e. common, overlapping, distinct, and also sections with non-unit strides. Note that each processor is requesting a section, so as the number of processors is increased, the amount of I/O required increases.

The main observation is that the Extended Two-Phase Method scales well with the number of processors. In many cases, the time taken increases only slightly as the number of processors is increased, which indicates that we are able to get higher I/O throughput by increasing the number of processors. For example, for the sections in case I, the time taken increases from 1.282 sec. to only 2.130 sec. when the number of processors is increased from 4 to 128. In some cases, such as case II, the time taken even decreases. The Direct Method performs quite poorly as the number of processors is increased, especially for sections in cases II, IV and VIII. The Extended Two-Phase Method also scales well for writing sections. For small number of processors, the Direct Method takes less time for writing than the Extended Two-Phase Method. This is because of the extra read required for every write in the Extended Two-Phase Method. However, for large number of processors (≥ 16), the Extended Two-Phase Method performs better. For sections with non-unit strides, the Extended Two-Phase
Table 6.7: Performance for different number of processors

\[ I = (400:800:1, 400:800:1), \text{Figure 6.5(III)} \]
\[ II = (1:16:1, 1:4096:1), \text{Figure 6.5(V)} \]
\[ III = (400:800:1, 400+25p800+25p1), \text{Figure 6.6(III)} \]
\[ IV = (1+8p16+8p1, 1:4096:1), \text{Figure 6.6(VII)} \]
\[ V = (1+25p16+25p1, 1:4096:1), \text{Figure 6.7(IV)} \]
\[ VI = (1+32p32+32p1, 1+24p1024+24p1), \text{Figure 6.7(VI)} \]
\[ VII = (p+1:4096:nprocs, p+1:4096:nprocs) \]
\[ VIII = (500:2500:3, 1+32p32+32p2) \]

Time in sec., \(4K \times 4K\) array, \(DR = \text{Direct Read}\), \(ETP = \text{Extended Two-Phase (dynamic partitioning)}\), \(DW = \text{Direct Write}\)

### Reading Common Sections

<table>
<thead>
<tr>
<th>Section</th>
<th>Procs=4</th>
<th>Procs=8</th>
<th>Procs=16</th>
<th>Procs=32</th>
<th>Procs=64</th>
<th>Procs=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td>2.620</td>
<td>1.282</td>
<td>3.184</td>
<td>1.040</td>
<td>4.421</td>
<td>1.056</td>
</tr>
<tr>
<td>II</td>
<td>12.16</td>
<td>4.315</td>
<td>13.95</td>
<td>3.099</td>
<td>19.65</td>
<td>3.241</td>
</tr>
<tr>
<td>III</td>
<td>3.079</td>
<td>1.748</td>
<td>5.208</td>
<td>1.659</td>
<td>6.850</td>
<td>1.991</td>
</tr>
<tr>
<td>IV</td>
<td>13.75</td>
<td>4.450</td>
<td>13.77</td>
<td>3.391</td>
<td>19.63</td>
<td>2.902</td>
</tr>
</tbody>
</table>

### Reading Overlapping Sections

<table>
<thead>
<tr>
<th>Section</th>
<th>Procs=4</th>
<th>Procs=8</th>
<th>Procs=16</th>
<th>Procs=32</th>
<th>Procs=64</th>
<th>Procs=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>IV</td>
<td>3.704</td>
<td>1.803</td>
<td>2.306</td>
<td>1.585</td>
<td>4.125</td>
<td>1.638</td>
</tr>
</tbody>
</table>

### Reading Distinct Sections

<table>
<thead>
<tr>
<th>Section</th>
<th>Procs=4</th>
<th>Procs=8</th>
<th>Procs=16</th>
<th>Procs=32</th>
<th>Procs=64</th>
<th>Procs=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>VI</td>
<td>0.982</td>
<td>1.037</td>
<td>1.803</td>
<td>2.218</td>
<td>3.994</td>
<td>3.058</td>
</tr>
</tbody>
</table>

### Writing Distinct Sections

<table>
<thead>
<tr>
<th>Section</th>
<th>Procs=4</th>
<th>Procs=8</th>
<th>Procs=16</th>
<th>Procs=32</th>
<th>Procs=64</th>
<th>Procs=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>V</td>
<td>759.2</td>
<td>22.82</td>
<td>216.6</td>
<td>15.83</td>
<td>210.8</td>
<td>9.331</td>
</tr>
<tr>
<td>VI</td>
<td>56.44</td>
<td>1.342</td>
<td>77.78</td>
<td>1.440</td>
<td>83.87</td>
<td>1.870</td>
</tr>
</tbody>
</table>

### Reading Sections with Non-Unit Strides

<table>
<thead>
<tr>
<th>Section</th>
<th>Procs=4</th>
<th>Procs=8</th>
<th>Procs=16</th>
<th>Procs=32</th>
<th>Procs=64</th>
<th>Procs=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>VII</td>
<td>147.5</td>
<td>30.11</td>
<td>81.45</td>
<td>31.40</td>
<td>64.53</td>
<td>26.42</td>
</tr>
<tr>
<td>VIII</td>
<td>9.041</td>
<td>1.612</td>
<td>18.83</td>
<td>1.603</td>
<td>35.17</td>
<td>2.972</td>
</tr>
</tbody>
</table>

### Writing Sections with Non-Unit Strides

<table>
<thead>
<tr>
<th>Section</th>
<th>Procs=4</th>
<th>Procs=8</th>
<th>Procs=16</th>
<th>Procs=32</th>
<th>Procs=64</th>
<th>Procs=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>VII</td>
<td>668.7</td>
<td>42.75</td>
<td>147.3</td>
<td>39.11</td>
<td>84.54</td>
<td>31.40</td>
</tr>
<tr>
<td>VIII</td>
<td>9.041</td>
<td>1.612</td>
<td>18.83</td>
<td>1.603</td>
<td>35.17</td>
<td>2.972</td>
</tr>
</tbody>
</table>
Method performs considerably better than the Direct Method. For the sections in case VIII, the stride depends on the number of processors, and so the total amount of I/O decreases as the number of processors is increased.

Table 6.8 shows the performance for accessing large sections of a large out-of-core array of size $16K \times 16K$ single precision real numbers (file size 1Gbyte). Figure 6.8 shows approximately where these sections are located in the array. We consider common, overlapping and distinct sections for reading, and distinct sections for writing. The trend in the results is the same as in Table 6.7 for a $4K \times 4K$ array, which confirms the scalability of the Extended Two-Phase Method and its superiority over the Direct Method. We observe that the Direct Method performs a lot worse for accessing large sections than for small sections, whereas the Extended Two-Phase Method performs consistently well for sections of any size. The relative performance of the two methods for reading and writing the sections in case VI of Table 6.8 is illustrated in Figures 6.9 and 6.10 respectively.

### 6.6 Advantages of Extended Two-Phase Method

The Extended Two-Phase Method with dynamic partitioning provides a very general way of accessing arbitrary sections of out-of-core arrays in an efficient manner. The first phase performs I/O optimizations at the cost of interprocessor communication in the second phase. Since communication cost is orders of magnitude lower than I/O cost, the overhead of communication is negligible. In all our experiments, we found the time spent on communication to be less than 5% of the total time. The Extended Two-Phase Method combines many small requests of different processors into single larger requests, thus providing larger granularity of data transfer and lower latency time. Another advantage is that multiple accesses by different processors to the same data in the file are eliminated. For example, if all processors need to read exactly the same section of the array, it will be read only once from the file and then broadcast to other processors over the interconnection network. Similarly, if the requests of two or more processors are overlapping, the overlapping portion will only be read once.
Table 6.8: Performance for large sections of large arrays

\[
\begin{align*}
    & \text{I} = (5000:6000:1,5000:6000:1) \\
    & \text{II} = (1+100p:300+100p:1,4000:8000:1) \\
    & \text{III} = (1+100p:400+100p:1,2000+20p:2800+20p:1) \\
    & \text{IV} = (4000:8000:1,1+4p:8+4p:1) \\
    & \text{V} = (1+100p:100+100p:1,1+100p:1024+100p:1) \\
    & \text{VI} = (1+20p:16+20p:1,4000:12000:1)
\end{align*}
\]

Time in sec., 16K × 16K array, DR = Direct Read,
ETP = Extended Two-Phase (dynamic partitioning), DW = Direct Write

<table>
<thead>
<tr>
<th>Sec-</th>
<th>Proc=4</th>
<th>Proc=8</th>
<th>Proc=16</th>
<th>Proc=32</th>
<th>Proc=64</th>
<th>Proc=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>tion</td>
<td>DR</td>
<td>ETP</td>
<td>DR</td>
<td>ETP</td>
<td>DR</td>
<td>ETP</td>
</tr>
<tr>
<td>I</td>
<td>23.65</td>
<td>7.89</td>
<td>43.45</td>
<td>7.39</td>
<td>58.99</td>
<td>7.39</td>
</tr>
<tr>
<td>II</td>
<td>53.30</td>
<td>26.51</td>
<td>103.3</td>
<td>28.10</td>
<td>132.3</td>
<td>28.50</td>
</tr>
<tr>
<td>III</td>
<td>13.31</td>
<td>5.06</td>
<td>24.11</td>
<td>6.49</td>
<td>31.49</td>
<td>7.40</td>
</tr>
<tr>
<td>IV</td>
<td>0.683</td>
<td>0.699</td>
<td>0.841</td>
<td>0.939</td>
<td>1.343</td>
<td>1.173</td>
</tr>
<tr>
<td>V</td>
<td>10.97</td>
<td>5.38</td>
<td>19.31</td>
<td>8.47</td>
<td>26.52</td>
<td>10.58</td>
</tr>
<tr>
<td>VI</td>
<td>57.29</td>
<td>21.94</td>
<td>74.05</td>
<td>23.05</td>
<td>127.3</td>
<td>32.88</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Sec-</th>
<th>Proc=4</th>
<th>Proc=8</th>
<th>Proc=16</th>
<th>Proc=32</th>
<th>Proc=64</th>
<th>Proc=128</th>
</tr>
</thead>
<tbody>
<tr>
<td>tion</td>
<td>DW</td>
<td>ETP</td>
<td>DW</td>
<td>ETP</td>
<td>DW</td>
<td>ETP</td>
</tr>
<tr>
<td>V</td>
<td>7.198</td>
<td>12.91</td>
<td>19.21</td>
<td>18.98</td>
<td>32.90</td>
<td>23.31</td>
</tr>
<tr>
<td>VI</td>
<td>48.35</td>
<td>44.18</td>
<td>71.85</td>
<td>52.07</td>
<td>151.4</td>
<td>73.34</td>
</tr>
</tbody>
</table>

Figure 6.8: The sections listed in Table 6.8 (not to scale)
Figure 6.9: Scalability Results, reading sections in case VI of Table 6.8

Figure 6.10: Scalability Results, writing sections in case VI of Table 6.8
from the file.

The Extended Two-Phase Method also provides a great deal of flexibility in dividing I/O among processors. This can be done by defining the file domains appropriately. We have described one way of doing this dynamically as a function of the access requests. This performs better than a static assignment, but it may be possible to do even better. For example, instead of dividing the bounding section in a column-block manner among the processors, it could be divided in a block-cyclic fashion, so that if the bounding section includes some unwanted columns, they get evenly distributed. Another approach is to divide I/O among processors so that the I/O requests of different processors go to different disks or I/O nodes. For this, we need to know on which disks the requested sections are located. This information is not provided by any of the parallel file systems at present, but we expect it to be available in future versions of the file systems. Furthermore, if the ratio of processors to disks on the machine is very high, it is possible to have only a few processors perform I/O, so as to reduce contention for disks.

The Extended Two-Phase Method can be used for accessing arrays with any number of dimensions and any storage order. For the dynamic partitioning scheme we have proposed, the file domains for an \( n \)-dimensional array can be obtained by first calculating the \( n \)-dimensional bounding section of all requests. This section is divided among processors such that the file domain of each processor is stored contiguously in the file.

Array sections other than those which can be represented by a lower-bound, upper bound and stride in each dimension, for example sections with non-uniform strides, can also be accessed using the Extended Two-Phase Method. This requires a more general notation for representing such sections. The data structures such as the File Access Descriptor and the File Domain Access Table need to be modified to handle this, but the basic idea remains the same.

It is not necessary that all processors must call the Extended Two-Phase read/write routine. It is possible to define a group of processors involving only those processors that need to access data from the file. This is similar to process groups in MPI [83].
Only the processors in the group need to call the Extended Two-Phase routine and participate in the two-phase process. The I/O workload can be divided among the processors in this group.

The Extended Two-Phase Method is not specific to any particular machine, file system or architecture. It can be easily implemented on top of any of the existing file system interfaces or portable interfaces such as the proposed MPI-IO interface [28], resulting in portable implementations. It can also be easily modified and tuned for any particular system. This only requires defining the file domains appropriately, and possibly using a different algorithm for interprocessor communication.
Chapter 7

Conclusions

This thesis addresses several issues in providing runtime support for in-core as well as out-of-core data-parallel programs. This runtime support can be used for performing many of the commonly required operations in application programs written using a distributed memory programming model. The use of runtime support makes it easier to write application programs and provides greater efficiency and portability. It can also be used together with a compiler to translate in-core as well as out-of-core programs written in a high-level data-parallel language like HPF to node programs for distributed memory parallel computers. Runtime support helps to separate the machine dependent and machine independent aspects of the translation. The compiler can do all the machine independent transformations and the runtime libraries can be optimized for each different machine. Thus, a portable and efficient compiler can be obtained by porting the runtime library to different machines.

This thesis proposes efficient algorithms for runtime array redistribution which is frequently required in parallel programs. We have considered block\((m)\) to cyclic, cyclic to block\((m)\) and the general cyclic\((x)\) to cyclic\((y)\) redistributions for one-dimensional and multidimensional arrays. In all cases, the Asynchronous Method was found to perform better than the Synchronous Method because it overlaps computation and communication, and hence reduces processor idle time. For the general cyclic\((x)\) to cyclic\((y)\) redistribution, the LCM Method performs the best. The thesis also shows how circular redistributions can be performed efficiently by saving address information.
in the forward redistribution and reusing it in the backward redistribution.

This thesis presents several algorithms for all-to-all collective communication for the fat tree and two-dimensional mesh network topologies. The performance of these algorithms was studied on the CM-5 and Touchstone Delta. An analytical model for estimating the time taken by these algorithms was developed and validated by comparing with experimental results. The conclusion is that the choice of algorithm depends on the message size and the number of processors. No one algorithm performs the best for all message sizes and number of processors.

Due to the large memory requirements of many applications, it is becoming increasingly important to provide support for out-of-core programs. This thesis also describes the design and implementation of a runtime library for out-of-core computations. This library can either be directly used in out-of-core applications, or used together with a compiler to translate out-of-core programs written in a language such as HPF. The runtime library supports three different models for data storage and access. Runtime routines have been developed for all these models. Several optimizations, such as data sieving, data prefetching and data reuse, have been incorporated in the runtime library. The optimizations have provided significant performance improvement. The thesis also proposes the Extended Two Phase Method for reading and writing sections of out-of-core arrays efficiently. This method uses collective I/O in which processors cooperate to perform I/O in an efficient manner by combining several I/O requests into fewer larger requests, eliminating multiple file accesses for the same data and reducing contention for the I/O subsystem. A dynamic scheme is used for dividing I/O among processors, depending on the access requests. The Extended Two Phase Method showed impressive performance benefits over a Direct Method for many different access patterns.

There are a number of areas in which the research presented in this thesis can be extended. The area of runtime support for parallel I/O is particularly promising. A useful feature of the Extended Two-Phase Method is the flexibility it provides in defining file domains. We have studied one way of selecting file domains. A good research problem is to determine the best way to define file domains depending on
the pattern of access requests as well as the distribution of the file on disks. Scientific data is very often stored in standard data formats such as HDF [86] or NetCDF [123]. It would be useful to design and implement efficient runtime support for data stored in files using these formats. The PASSION Runtime Library has currently been implemented on the Intel Touchstone Delta, Paragon and iPSC/860 systems. It would be an interesting project to port it to the IBM SP-2 using the PIOFS file system. A unique feature of PIOFS (and its predecessor Vesta) is that it supports logical partitioning of files [29]. It is not entirely clear how widely this logical partitioning can be used. Implementing a general runtime system like PASSION using PIOFS can provide some insight into the usefulness of logical partitioning.
Bibliography


BIBLIOGRAPHY


Vita

Name: Rajeev Thakur

Place of Birth: Bombay, India

Date of Birth: July 15, 1969

Previous Degrees: B.E., Computer Engineering, University of Bombay, 1990
M.S., Computer Engineering, Syracuse University, 1992

Experience:
Summer Intern,
Intel Supercomputer Systems Division,
Beaverton, OR, May 1993 – August 1993

Graduate Fellow
Syracuse University, August 1994 – May 1995

Graduate Research Assistant,
Northeast Parallel Architectures Center,
Syracuse University, August 1991 – August 1994

Graduate Research Assistant,
Department of Electrical and Computer Engineering,
Syracuse University, January 1991 – August 1991