Publications

Reducing the Traffic of Loop-Based Programs Using a Prefetch Processor

Abstract

Large cache block sizes are used to take advantage of spatial locality and amortize long memory latency over more words. However, the cost of large cache block sizes is increased memory traffic requirements, especially for applications that show poor spatial locality. Software prefetching is usually presumed to increase memory traffic. We present an architecture that uses a separate processor devoted to prefetching that improves execution time and at the same time allows the cache block size to be reduced, thereby reducing memory traffic. Simulation results show that our architecture reduces traffic at the microprocessor chip boundary by between 15% and 67% while reducing execution time by up to 68% for eight scientific and signal processing benchmarks.

Date
March 25, 1997
Authors
Stephen P. Crago, Alvin M. Despain