Benchmarking the Computation and Communication Performance of the CM-5



Introduction
Overview
The rest of this paper is organized as follows. Section 2 gives a brief description of the CM-5 architecture. Section 3 introduces the test configurations and the message-passing library that were used to perform our experiments. Section 4 gives the computational performance of the SPARC processor and the vector units. Section 5 presents the benchmarks used to measure communication performance from one node to another. Section 6 addresses the global operations provided by the CM-5. Section 7 shows how meshes and hypercubes can be simulated on the fat-tree network topology. Section 8 presents the estimation of the performance of a Gaussian elimination kernel code on the CM-5.

CM-5 System Overview
The CM-5 is a scalable distributed-memory computer system which can efficiently support up to 16,384 computation nodes. Each node contains a SPARC microprocessor and a portion of the global memory, connected to the rest of the system through a network interface. Every node in the CM-5 is connected to two inter-processor communication networks: the data network and the control network. This section gives a brief overview of the CM-5 processing nodes and the data and control networks, which are of particular importance in our study.

Processing Nodes
Each CM-5 computation node consists of a SPARC microprocessor, a custom network interface that connects the node to the rest of the system through the data and control networks, a local memory of up to 128 Mbytes, and an associated memory controller unit (Figure 1-a). The SPARC has a clock rate of 33 MHz and a 64 KB cache that is used for both instructions and data. The SPARC is also responsible for managing the communication with other system components via the network interface.
Node memory is allocated in 8 MB chunks and controlled by a special memory controller. Optionally, this memory controller can be replaced by up to four vector units (Figure 1-b). In this configuration, the size of each memory unit may be either 8 or 32 MB. The scalar microprocessor is able to issue vector instructions to any subset of the vector units. Each vector unit has a vector instruction decoder, a pipelined ALU, and 64 64-bit registers, like a conventional vector processor (Figure 2). The 16 MHz vector unit allows one memory operation and one arithmetic operation per clock cycle, which gives a peak performance of 16 Mflops for single arithmetic operations like add or multiply. It can, however, perform a multiply-and-add operation in only one cycle, which increases the peak performance to 32 Mflops for this operation. To summarize, a node with four vector units has 256 64-bit data registers, 32 to 128 MB of DRAM memory, and 64 to 128 Mflops peak performance for floating-point arithmetic operations.
All the components inside a node are connected via a 64-bit bus. The bandwidth of the local memory can go up to 512 MBytes per second when vector units are attached.

The Control Network
The CM-5 control network provides high bandwidth and low latency for global operations, such as broadcast, reduction, parallel prefix, and barrier synchronization, where all the nodes are involved.
The CM-5 control network has three subnetworks responsible for handling the global operations: a broadcast subnetwork, which is responsible for broadcast operations; a combining subnetwork, which supports global operations such as reduction and parallel prefix; and a global subnetwork, which takes care of synchronization.

The Data Network
The data network is a high bandwidth network optimized for bulk transfers where each message has one source and one destination. It is a message-passing-based point-to-point routing network that guarantees delivery. In addition, it is deadlock free and has fair conflict arbitration.
The network architecture is based on fat-tree (quad-tree) topology with a network interface at all the leaf nodes. Each internal node of the fat-tree is implemented by a set of switches. The number of switches per node doubles for each higher layer until level 3, and from there on it quadruples. Figure 3 illustrates a data network having 16 nodes. The communication switches are labeled as (i,j), where i shows the number of the child switch and j the number of the parent switch.
The CM-5 is designed to provide a point-to-point peak transfer bandwidth of 5 MBytes/sec between any two nodes in the system. However, if the destination node is within the same 4-node cluster or 16-node cluster, the transfer can reach a peak bandwidth of 20 MBytes/sec and 10 MBytes/sec, respectively.

Test System
Our experiments were performed on a 32-node CM-5 at the Northeast Parallel Architecture Center at Syracuse University and on an 864-node CM-5 (recently upgraded to 896 nodes) at the Army High Performance Research Center at the University of Minnesota. Both machines are timeshared and run under CMOST version 7.2. No one else was using the systems while we were running our benchmarking programs.
The CM-5 processing nodes can be grouped into one or more logical partitions, each of which is controlled by a partition manager. Each partition uses separate processors and network resources and has equal access to the shared system resources. For example, Minnesota's 864-node CM-5 machine is divided into 32-, 64-, 256-and 512-node partitions.
Most of the values reported in this paper were measured by using a set of short benchmark codes written in C with calls to the CM message-passing library (CMMD Version 3.0 Final). The codes were compiled with the Gnu C compiler with all optimizations turned on in order to benefit from the full potential of the hardware. The precision of the CM-5 clock is one microsecond. The timings were estimated by recording the CM node busy-time for an average of 100 repetitions of the experiment and dividing the total time by the number of repetitions. CM node busy-time is the duration in which the user code is executed on a certain node within its own operating system time-sharing slice. We used the CM Fortran language [5] (Version 2.1.1.2), which partitions and stores the vectors directly into the vector unit memories, to measure the vector unit performance.
As might be expected, testing the hardware system by using high-level software (e.g., CM Fortran or C compilers and CMMD message-passing software) influences the performance negatively. Performance is bounded by the software's ability to exploit the capabilities of the hardware.
CM-5 Message-Passing Library: CMMD
CMMD [6] provides facilities for cooperative message passing between processing nodes. We used the nodeless model of programming, where all the processing nodes execute the same SPMD (Single-Program Multiple-Data) program and the partition manager acts simply as an I/O server.
At the lowest layer, CMMD implements active messages [19], which provide fast packet-based communication and simple low-latency array transfer. When a message is to be sent across the data network, it is divided into a group of 20-byte packets; 16 bytes of each packet carry the user data, and the remaining 4 bytes contain control information such as the destination and the message size [7].

SPARC Performance
We ran a set of benchmark programs to measure the computational speed of the SPARC microprocessor for basic integer and floating-point operations. Execution times for the basic arithmetic operations were the same when all the operands were stored in the registers. We obtained a peak performance of 22 Mips for integer add-multiply and 11 Mips for other integer operations. Floating-point performance was 22 Mflops for add-multiply and 11 Mflops for other operations. When the operands are not in registers but are available in the on-board cache, computational performance drops sharply because of the overhead of accessing the cache. The execution times for various arithmetic operations when the operands are initially stored in the cache are given in Table 1. In the "operation" column, an entity like x&y&z indicates any combination of these three operands in an arithmetic statement, e.g., x = y op z, y = x op z, and so on, where op indicates an arithmetic operator.

Vector Performance
The vector processing performance of the CM-5 can be characterized by three length-related parameters: R_inf, N_1/2, and N_v [9]. R_inf is the asymptotic performance obtained as the vector length tends to infinity, N_1/2 is the vector length needed to reach one-half of R_inf, and N_v is the vector length needed to make vector mode faster than scalar mode. The values of these three parameters depend on the operations being performed.
To evaluate the performance of the CM-5 vector units, we first measured the execution times of some vector operations which are frequently used in scientific application codes. The execution rates for each operation are shown in Figure 4 for vector lengths of up to 32 KB. We then derived the length-related performance parameters for each vector operation. The results for double-precision and single-precision data are given in Tables 2 and 3, respectively.
R_inf is important for estimating the peak performance. Double-precision operations are always faster than the single-precision ones, since the vector unit registers are configured as 64-bit registers and all the internal buses are 64 bits wide. Manipulating a scalar operand (operations 1 and 3) is faster than manipulating a vector operand (operations 2 and 4). This is because the scalar operand comes for free, while the vector operands in operations 2 and 4 require a memory or cache access to load the corresponding vector into the vector registers.
Additions and multiplications give about the same timings. Although addition might be expected to be faster, the cycle time is stretched to handle one addition, one multiplication, or one multiply-add operation per clock cycle. Therefore, a multiply-add operation gives twice the Mflops rate of a single add or multiply operation. N_1/2 is a good measure of the impact of overhead. For finite vector lengths, a start-up time is associated with each vector operation; N_1/2 parameterizes this start-up time. Using the vector units to process vectors shorter than N_1/2 results in a significant loss of performance. We obtained large values of N_1/2, which indicates that efficient use of the vector units begins at large vector lengths on the CM-5. N_1/2 is longer for single-precision data than for double-precision data. This is, in fact, related to the higher Mflops rating of the double-precision data, as explained above.
N_v measures both the overhead and the speed of scalars relative to vectors. The node processor can manipulate vectors of up to about 20 data items faster than the vector units can. Tables 2 and 3 also show the achievable peak rate in Gigaflops when the vectors are distributed across all the vector units. The peak performance figures indicate that, even for 512 nodes, the peak performance is close to the number of processors multiplied by the peak speed of a single node. This is a good indication of the scalability of the vector processing capability. For these kinds of simple loops there is an insignificant amount of overhead, but it should not be forgotten that the overhead penalties encountered in real problems may be much larger.

Point-to-Point Communication Benchmarks
In distributed-memory machines like the CM-5, data items are physically distributed among the node memories. Thus the performance of the communication primitives used to access non-local data is crucial. Point-to-point communication benchmarks measure basic communication properties of the CM-5 data network by performing the ping-pong test between a pair of nodes. The transmission time is recorded as half of the time of a round-trip message in the ping-pong test.
We used blocking sends and receives that transfer varying sizes of data blocks between two nodes. Both the source and the destination nodes take active parts in this exchange process, and the receiving node waits until it receives the last data byte from the data network.
Regression analysis of the transmission time allows the calculation of the start-up time and the asymptotic bandwidth between a pair of nodes. The total transmission time T between two nodes can be formulated as

T(l) = t_startup + l * t_send,

where l is the message length in bytes, t_startup is the time to set up the communication, and t_send is the transfer time for one unit (byte) of data.
The asymptotic data transfer rate is approximately the reciprocal of the per-byte transfer time (i.e., 1/t_send).

Nearest-Neighbor Communication
In this experiment we studied the communication time for sending a single message to another node in the same cluster of four nodes, for different message sizes. This represents the shortest possible distance a message can travel. Figure 5 shows the communication time for messages of size 0-10 KB between two neighboring nodes on a 32-node CM-5. The communication time increases linearly with increasing message size. To establish a communication link between two nodes, a preliminary handshake is required. This start-up time is observed to be 84.65 microseconds. Using a linear chi-square fit, we can model the communication time for aligned messages within a cluster of four processors as a function of message size:

T(l) = 84.65 + 0.117 * l microseconds. (1)

The thick appearance of the curve in Figure 5 is due to a sawtooth effect caused by data alignment patterns. Figure 6 shows a smaller section (for message sizes of 320-576 bytes) of the previous graph to magnify this sawtooth effect. As indicated by the dips in the curve, when the message length is a multiple of the packet payload size, the communication time drops to a local minimum. On the CM-5, unaligned message transfers are more costly than aligned message transfers, but the communication time differences between byte-aligned, word-aligned, and double-word-aligned data are negligible. As stated earlier, each data packet contains 16 bytes of user data. Misalignment causes hardware complications, since memory is typically aligned on a word boundary; a misaligned memory access is therefore performed by using several aligned memory accesses. In addition, since the network interface accepts only word and double-word writes, odd-sized buffers cannot be moved into the data registers efficiently.
We also studied the maximum bandwidth that can be sustained for a single message traveling the shortest possible distance, for message sizes up to 32 Kbytes.
Figure 7 illustrates that the transfer rate (approximately 1/t_send) for an aligned buffer is around 8.5 MB/sec. This bandwidth is significantly lower than the theoretical peak bandwidth of 20 MB/sec. In the current CMMD implementation, a node's ability to inject data into the network is much lower than the network's capacity to accept the data [14]. Assembler codes can achieve close to 18 MB/sec moving data from one node's registers to another's [18]. However, C codes with calls to the CMMD library tend to run slower, partly because the C compiler's output is never as efficient as hand-crafted assembler code.

Effect of Distance on Communication
In this section we examine how the communication between any two nodes compares with the communication between two nearest neighbors. We measured the communication time from node 0 to every other node using the same strategy as in the previous section. The transmission time difference between the nearest neighbor and the neighbor at the maximum distance is less than 5 microseconds on a 512-node CM-5. The results are consistent for both short (16-byte) and long (1-Kilobyte) messages.

Global Communication Benchmarks
The CM-5 hardware supports a rich set of global (cooperative) operations. Global operations involve data transfer among the processors, possibly with an arithmetic or logical computation on the data while it is being transferred. Collective communication patterns, such as reduction, broadcast, concatenation, and synchronization, are very important in the implementation of high-level language constructs for distributed-memory machines.
We measured the performance of the communication networks by using a set of benchmark programs employing the global operations provided by the CM-5 hardware.

Scans
A scan (parallel prefix) operation creates a running tally of results in each processor, in the order of the processor identifier. Assuming that A[j] represents the element A in the jth processor and R[j] represents the result R in the jth processor, an inclusive scan with a summation operator performs the following operation:

R[j] = A[0] + A[1] + ... + A[j], 0 <= j <= Number of Processors - 1.

Table 4 summarizes the performance of scan operations using different data types on a 32-node CM-5. Integer scan operations take about 6 microseconds. On the other hand, the double-precision minimum/maximum scans and add scans are about 3 to 5 times slower than the integer scans.
In a segmented scan, independent scans are computed simultaneously on different subgroups (or segments) of the nodes. The beginnings of segments are determined at run time by an argument called the segment bit. Table 4 shows the performance of the segmented scan operations on a 32-node CM-5, assuming the segment bit of a processor is turned on with a probability of 10%. Computing integer segmented scans takes slightly longer than regular scans, primarily because of testing the extra condition at run time. Timings for the double-precision maximum or minimum segmented scans are almost equal to those for regular scans, but the time for a double-precision segmented add scan is almost twice that of the corresponding regular scan.
The CM-5 control network has integer arithmetic hardware that can compute various forms of scan operations. Integer minimum, maximum, and logical segmented scans are also supported by the hardware. On the other hand, single- and double-precision floating-point scan operations are handled partially by software, which results in much longer times. While the floating-point minimum and maximum scans take partial advantage of the hardware, the floating-point add scan is performed almost completely in software. This is the reason the add scans and segmented add scans are so costly.

Reductions
A reduction operation takes an input value from each node, applies a global operation such as summation, minimum, or bitwise xor to all the values, and returns the result to all the nodes.
We measured the speed of the combining subnetwork for various types of reduction operations (Table 4). Double-precision reduction operations take 4 to 6 times longer than integer reductions. Again, this can be explained by the same reasons described above.

Concatenation
Some computations on distributed data structures require that each processor receive data from all the other processors. For example, in the classical N-body algorithm, every particle interacts with every other particle. Concatenation is a cumulative operation that appends the value from each processor to the values of all the preceding processors, in processor-identifier order.
Assume that there are P processors, and the B = N/P data elements of a large vector are distributed among these processors so that processor p contains the subvector V_p = V[pB .. (p+1)B-1]. The global concatenate operation stores the resultant vector V[0 .. N-1] in every node. We tested the effects of message size and number of processors on the execution time of the concatenation operation. Figure 10 shows the time required for the concatenation operation using 32-, 64-, 256-, and 512-node partitions. We can derive the following equation for the concatenation operation:

T(l, P) = 23.44 + 0.975 * (P * l) microseconds,

where P is the number of processors in the partition and l is the size of the local portion of the data to be concatenated. Note that the time for concatenation depends on P only through its contribution to the resultant message size; the operation itself is otherwise independent of P.
From Figure 10 it is clear that the time for concatenation on 512 nodes is about 16 times larger than the time on 32 nodes, which may be surprising when compared to scan operations. The resultant vector of about N data items must be delivered to every node, which puts on the order of N * P data items into the network and may cause congestion, especially for large messages. Therefore, as the message length and the number of processors increase, the vertical distance between the lines increases.

One-to-All Broadcast
In SPMD-style programming, one of the basic types of communication is broadcasting a value from one node to all the other nodes. For example, spreading a pivot row to all other nodes is a common operation in LU decomposition and many other linear algebra computations. On the CM-5, any node can broadcast a buffer of a specified length to all other nodes within the partition.
We measured the performance of the broadcast subnetwork using the CMMD broadcast intrinsics. The results for 32-, 64-, 256-, and 512-node partitions are shown in Figure 11.
We can derive the following equations for a 32- and a 512-node CM-5, respectively:

T(l) = 6.96 + 1.15 * l microseconds, (2)
T(l) = 7.40 + 1.24 * l microseconds. (3)

The broadcast time is almost the same for the 32- and 64-node partitions, and for the 256- and 512-node partitions. Since the broadcast is implemented in the network in a spanning-tree fashion, the number of hops (or switches traversed) slightly affects the timings. Since values can be broadcast in 3 hops in 32- and 64-node partitions (which communicate via the third level of the fat-tree), broadcasting there is faster than in 256- and 512-node partitions, which require 4 and 5 hops, respectively. Moreover, the initial setup times for different-sized partitions differ slightly, as seen in the above equations.

Synchronization
Synchronization is very important in MIMD machines, since they are fundamentally asynchronous and must be synchronized prior to most communication steps. Many machines also use the common communication network for synchronization, causing significant performance degradation. The CM-5 uses a separate barrier synchronization network (the control network) to carry out synchronization efficiently. We measured the delay of a global synchronization on the CM-5 and found that it takes 5 microseconds, independent of the number of nodes in the partition.

Embedding of Other Topologies into the CM-5 Fat-Tree

Embedding of a Mesh into the Fat-Tree
A wrap-around mesh (torus) can be embedded into the CM-5 fat-tree-based architecture by using the shuffle row-major mapping [17]. The physical node number corresponding to a logical mesh point is found by shuffling the row and column binary numbers of that point in the mesh topology. If a processor's location is row=abcd and col=efgh, then bitwise shuffling of row and col gives the bit string aebfcgdh. This kind of mapping preserves the locality of 2x2, 4x4, etc. submeshes. A representative example is illustrated in Figure 12.
Logical_ProcNum_TO_Coordinate() and Coordinate_TO_Physical_ProcNum() are two basic routines used for mapping a point on an m x n mesh to a node of the fat-tree. The former calculates the coordinate location of a point on the mesh; it is also useful for determining the neighbors of a point on the mesh. The latter transforms a given location on the mesh into a physical node number on the fat-tree. getbit() returns the bit of a string at the specified position. These routines are listed in Figure 13 for reference. Table 5 displays the timings for shift operations in a given direction, which are very common in mesh topologies. We simulated 16x32, 8x64, 4x128, and 2x256 meshes mapped to the fat-tree topology on a 512-node CM-5. We can deduce from Table 5 that mesh bandwidths are at about 4 Mbytes per second, which is less than the expected 5 Mbytes/sec bandwidth between arbitrary nodes. The main reason is the contention that occurs in the data network when all the nodes send long data messages at the same time.

Embedding of a Hypercube into the Fat-Tree
For many computations, the required communication pattern is similar to the connections of a hypercube architecture. These include bitonic sort, the Fast Fourier Transform, and many divide-and-conquer strategies [17]. This section discusses the time requirements for such types of communication patterns.
A d-dimensional hypercube network connects 2^d processing elements (PEs). Each PE has a unique index in the range [0, 2^d - 1]. Let (b_{d-1} b_{d-2} ... b_0) be the binary representation of the PE index p, and let ~b_k be the complement of bit b_k. A hypercube network directly connects pairs of processors whose indices differ in exactly one bit; i.e., processor (b_{d-1} b_{d-2} ... b_0) is connected to processors (b_{d-1} ... ~b_k ... b_0), 0 <= k <= d-1. We use the notation p^(k) to represent the number that differs from p in exactly bit k.
Node p of a logical hypercube is mapped onto node p of the CM-5 (Figure 14). We consider communication patterns in which data may be transmitted from one processor to another if they are logically connected along one dimension. At a given time, data is transferred from PE p to PE p^(k) and from PE p^(k) to PE p.
The communication times measured for a logical hypercube on the CM-5 using this mapping are shown in Figure 15. The first two dimensions of the cube require only the first level of the fat-tree to be traversed, while the 8th dimension needs five levels on a 512-node CM-5. We observe that all six plots are almost horizontal, from which we conclude that the time required for swapping data along different dimensions is approximately the same for all dimensions and that it scales linearly with the size of the message. Having more switches at the higher levels is one reason this performance can be achieved: more bandwidth is available as we go up in the network connection tree. The rate of transfer is between 3.3 Mbytes/sec and 3.6 Mbytes/sec. This is close to the peak bandwidth for long-range communication on the CM-5.

Performance Estimation for Gaussian Elimination
Modeling of basic computation and communication primitives is often used in estimating the performance of a given program [20]. We illustrate how to estimate the performance of a program by using the results stated in the previous sections. A Gaussian elimination code that uses the row-oriented algorithm with partial pivoting [8] is given in Figure 16. Assuming that there are P nodes, the rows of the matrix A[N][N] are distributed using a block-mapping strategy, such that the first N/P rows are assigned to node 0, the second N/P rows are assigned to node 1, and so on. The code gives just enough detail about the elimination phase; the back-substitution phase is not shown here.
The elimination phase is performed column by column. The outer loop, which iterates over pivots, is executed in parallel by all processors. Within the loop body there are computational phases separated by communication phases. The computational phases include finding the maximum value of the current column among the rows owned, computing the multipliers, updating the permutation vector in which the pivoting sequence is saved, and reducing the part of the nonpivot rows. The communication phases include a reduction operation to determine the pivot value in a column, another reduction operation to find the maximum row number (pivot) in the case of a tie among the processors, and a broadcast operation to announce the pivot row to all nodes. This code uses collective communication primitives but does not attempt to overlap computation and communication.
The costs of the communication operations (as modeled by our benchmarking programs) required for the Gaussian elimination are given in Tables 6 and 7. We counted the number of arithmetic operations performed in the inner loop bodies to determine the computation time of one iteration. The execution time of each iteration is multiplied by the number of iterations to obtain the estimated time; there are N iterations for a matrix of size N x N.
We counted each conditional expression as one arithmetic operation (according to the type of test), as in the GENESIS benchmark suite [10]. The fraction of the time a conditional test evaluates to true depends on the specific values assigned to a specific processor at a given time; we assumed the condition yields a true value 50% of the time, which is a close approximation on average. This code was executed on a 32-node CM-5. The measured results are compared with the estimated results in Table 7 and are found to be within 10% of the estimates for matrices smaller than 512 x 512. For a 512 x 512 coefficient matrix there is a bigger discrepancy, since the matrix is too big to fit into the cache, and extra memory overhead is therefore incurred to fetch the data into the cache.
As seen, such modeling can be very useful in performance prediction for different algorithms on the CM-5. This information can be used to choose optimal algorithms, to optimize program codes, and to automate performance estimation at compile time by using the cost function of each basic primitive.

Conclusions
In this paper we presented a benchmarking study of the computation and communication performance of the CM-5 multicomputer. We formulated the communication overhead in terms of message size and latency.
Using the vector units becomes more efficient than using only the SPARC microprocessor when vector lengths exceed about twenty. Half the peak performance is reached at vector lengths of 100-200 for single-precision numbers and 200-300 for double-precision numbers. The vector units deliver up to 30 Mflops per node, which results in about a 15 Gflops processing rate for a 512-node CM-5.
Communication benchmarks show that the data network has a start-up latency of 84 microseconds and a bandwidth of 8.5 MB/sec for unidirectional transfers between two nearest-neighbor nodes. Communication latencies for misaligned messages are longer than latencies for aligned messages. Message transmission latencies and bandwidths are independent of partition size and vary only slightly with the number of network levels crossed.
There are several global operations that use the control network for communication. The concatenation operation requires time linearly proportional to the size of the resultant array. The reduction operations take about 5 microseconds for integers and 15-20 microseconds for floating-point numbers. Scans and segmented scans are quite fast and can be completed in 6-7 microseconds for integers.
We simulated basic communication primitives of mesh and hypercube topologies on the CM-5. The bandwidth for hypercube-type communication was less than 4 MB/sec, even in cases when all communication passed through the root of the CM-5 interconnection network. For mesh-type communication patterns, the bandwidth was again about 4 MB/sec. The CM-5 data and control networks were found to be highly scalable: the performance figures remained constant for most operations as we evaluated similar primitives from 32 to 512 nodes.
We used the timing results of the computation and communication primitives in estimating the execution time of a small program. We implemented the Gaussian elimination algorithm with partial pivoting on the CM-5. The real execution time of the algorithm was found to be close to the estimated time, which shows that the results of our study can be used for static performance estimation at compile time, before running a program.