Hiding a thousand-cycle memory latency through L2 prefetching
Location: Irisa, Rennes
Team(s): Caps
Supervisor: André Seznec (direct phone: 02 99 84 73 36, email: seznec@irisa.fr)
The trend of the last 20 years in the electronics industry indicates that microprocessor frequencies will soon reach 10 GHz while main memory access times will remain at a few tens of nanoseconds. The gap between processor performance and main memory will therefore continue to grow, and the penalty for a data or instruction access that misses the on-die memory hierarchy will soon be on the order of thousands of instructions.
While there exists a class of applications whose working set moves very slowly and does not exhaust an L2 cache of a few megabytes, the memory gap represents the major obstacle to performance growth for another class of applications. For many of these applications, the parameters of a run depend heavily on the performance of the platform itself: the user will run the largest workload he or she can get executed. Hiding the memory gap for those applications is a major issue.
The common current solution to the increasing gap between main memory access time and processor performance is to add more and more L2 or even L3 cache space. Within a very few years, manufacturers will be able to implement several megabytes of static memory on the same die as a very wide-issue SMT superscalar processor or a multiprocessor, and it would be natural to use this space for L2 caches. However, enlarging the L2 cache yields diminishing returns for most applications, unless the working set comes to fit entirely in the L2 cache. To hide memory latencies on cache misses, prefetching techniques have also been proposed, both hardware and software. Unfortunately, most currently proposed techniques can only hide a few tens of instruction slots, i.e., the difference between an L1 cache hit and an L2 cache hit, but not the complete memory access time.
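As an illustration of this class of techniques, the following is a minimal sketch, in C, of a classic stride prefetcher built around a reference prediction table; the table size, confidence threshold, and prefetch distance are illustrative assumptions, not parameters fixed by this proposal.

    #include <stdint.h>

    #define RPT_SIZE 256  /* assumed number of tracked load instructions */

    /* One reference prediction table entry: the last address seen by a
       given load, the stride between its last two accesses, and a small
       saturating confidence counter. */
    struct rpt_entry {
        uint64_t last_addr;
        int64_t  stride;
        int      confidence;
    };

    static struct rpt_entry rpt[RPT_SIZE];

    /* Called on every load: updates the entry indexed by the load's PC
       and returns an address to prefetch, or 0 while the access pattern
       is still unstable. */
    uint64_t stride_prefetch(uint64_t pc, uint64_t addr)
    {
        struct rpt_entry *e = &rpt[pc % RPT_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);

        if (stride != 0 && stride == e->stride) {
            if (e->confidence < 3)
                e->confidence++;
        } else {
            e->stride = stride;
            e->confidence = 0;
        }
        e->last_addr = addr;

        /* Prefetching one stride ahead typically hides only the L1/L2
           gap; covering a full memory access would require a much
           larger lookahead. */
        return (e->confidence >= 2) ? addr + (uint64_t)e->stride : 0;
    }

Such a prefetcher issues its request only one access ahead of the demand stream, which is precisely why it cannot cover a latency of hundreds or thousands of cycles.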
In this study, we propose to investigate a different way of prefetching in order to hide the L2 cache miss latency (i.e., thousands of instruction slots). As a working hypothesis, we will assume that very substantial storage space (i.e., several megabytes) is available for implementing prefetching structures; we will also assume that prefetches of very different granularities can be issued (from a single cache line to a full physical page, but also a sparse set of lines inside a page). Two dimensions will be explored first: reuse of the same static memory address access pattern (i.e., the same flow of addresses), and detection of dynamic access patterns (e.g., strides, linked lists, ...). The objective is to hide a latency of a thousand cycles.
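To make the first of these two dimensions concrete, here is a minimal sketch, again in C, of an address-flow record-and-replay structure: the stream of L2 miss addresses is logged once and, when a short prefix of it recurs, the logged flow is replayed far ahead of the demand accesses. The log capacity and match length are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define LOG_SIZE (1 << 19)  /* assumed: 512K addresses, about 4 MB */

    static uint64_t miss_log[LOG_SIZE];  /* recorded flow of miss addresses */
    static size_t   log_len;

    /* Recording phase: append each L2 miss address to the flow log. */
    void record_miss(uint64_t addr)
    {
        if (log_len < LOG_SIZE)
            miss_log[log_len++] = addr;
    }

    /* Replay phase: if the n most recent miss addresses match some
       position in the log, return a pointer just past the match; the
       addresses following it can then be prefetched hundreds of
       accesses ahead of the processor. */
    const uint64_t *match_flow(const uint64_t *recent, size_t n)
    {
        for (size_t i = 0; i + n < log_len; i++)
            if (memcmp(&miss_log[i], recent, n * sizeof recent[0]) == 0)
                return &miss_log[i + n];
        return NULL;  /* no recurrence of the flow found */
    }

The linear scan is used here only for clarity; a hardware implementation would locate the replay point associatively or through a hashed index into the log.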