Title

Compiler techniques for enhancing data locality

Date of Award

1999

Degree Type

Dissertation

Embargo Date

1-2012

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical Engineering and Computer Science

Advisor(s)

Alok Choudhary

Keywords

Cache, Memory, Compiler

Subject Categories

Computer Sciences | Physical Sciences and Mathematics

Abstract

Many applications are memory intensive and thus are bounded by memory latency and bandwidth. While improving memory performance is an important problem for uniprocessor machines, it is even more important for shared memory multiprocessors such as CC-NUMA (cache-coherent non-uniform memory access) machines. An architectural solution to the long memory latencies problem is to adopt a hierarchical memory system, where the higher levels (e.g., caches) are expensive, small, and fast whereas the lower levels (e.g., memories, secondary and tertiary storages) are cheaper, larger, and slower. Since the access latency varies from level to level, it is very important to utilize this memory hierarchy fully by having the majority of data accesses satisfied from the higher levels, i.e., caches. This dissertation first proposes a data transformation framework to improve cache locality. Instead of modifying the execution order of loop iterations as done by most of the current compiler techniques, the data transformations modify the memory layouts of large arrays. We illustrate the cases where data (memory layout) transformations are preferable to loop transformations and show how to generate code after a data layout transformation. Later we present a unified data locality optimization framework where both loop and data transformations are used in concert to improve cache locality. We demonstrate that such a unified framework achieves better results than frameworks based on pure loop or pure data transformations. We also quantify the benefits by giving performance numbers obtained on a single processor as well as multiple processors of SGI/Cray Origin 2000, a CC-NUMA machine. Our performance results on a uniprocessor as well as multiprocessors show marked improvements in cache misses and overall execution times. This unified technique can also be used for main memory-secondary storage hierarchy. We also briefly discuss how our approach can be made to work in an inter-procedural setting.

Access

Surface provides description only. Full text is available to ProQuest subscribers. Ask your Librarian for assistance.

http://libezproxy.syr.edu/login?url=http://proquest.umi.com/pqdweb?did=730191511&sid=1&Fmt=2&clientId=3739&RQT=309&VName=PQD