Multiple cycle ahead pipelined branch predictors vs. branch predictor
hierarchies
Location: Irisa, Rennes
Team(s): Caps
Supervisor: André Seznec (direct phone:
02 99 84 73 36, email: seznec@irisa.fr)
The path to performance on superscalar processors combines
instruction-level parallelism with high clock frequency. This results
in very deep pipelines (20 cycles on the Intel Pentium 4!). The
penalty paid on each branch misprediction (either conditional or
indirect) is huge and will continue to grow in next-generation
processors as they feature more instruction-level parallelism.
Even a small increase in branch prediction accuracy may therefore
yield an effective performance benefit on such processors. The silicon
area needed to implement very accurate branch predictors featuring
256 Kbytes of storage or even more, together with complex logic, will
represent only a very small part of the die in future processor
generations. Unfortunately, it is unlikely that such a predictor
would be able to deliver a result in less than 3 or 4 cycles.
The processor front end has to predict the instruction flow at
a tremendous rate (more than one basic block per cycle) while
prediction accuracy must not be sacrificed. Usual branch predictors
use information (branch histories, fetch addresses, ...) available
in a given cycle to generate the next fetch addresses.
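As a concrete illustration of such a predictor, the following is a minimal sketch of a gshare-style two-level scheme: the global branch history is XORed with low-order fetch-address bits to index a table of 2-bit saturating counters. The class name, table size, and counter initialization are illustrative choices, not part of any specific design discussed here.

```python
class GsharePredictor:
    """Sketch of a gshare-style two-level branch predictor."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        # 2-bit saturating counters, initialized to weakly not-taken (1).
        self.counters = [1] * (1 << index_bits)
        self.history = 0  # global branch history register

    def _index(self, fetch_addr):
        # Fold fetch address and global history into one table index.
        return (fetch_addr ^ self.history) & self.mask

    def predict(self, fetch_addr):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(fetch_addr)] >= 2

    def update(self, fetch_addr, taken):
        # Train the counter for this (address, history) pair ...
        i = self._index(fetch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # ... then shift the branch outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

After a few occurrences of a strongly biased branch, the counter reached through the (address, history) index saturates and the predictor follows the bias.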
The solution used so far to deal with this issue is to rely
on a hierarchy of predictions: a first prediction performed in a
single cycle, followed by a second, more accurate one performed
in two or more cycles (e.g., implicit fall-through + two-level branch
predictor on the Intel Pentium Pro, line predictor + hybrid branch
predictor on the Alpha 21264). The disadvantage of this method is
that many fetch cycles must be cancelled when the two predictors
disagree, resulting in pipeline bubbles. Such a strategy wastes a
significant part of the instruction fetch bandwidth and loses
many cycles on hard-to-predict branches. Therefore, part of the
front end has to be oversized to sustain the required throughput.
It should also be noted that when making the branch predictor more
complex increases its access latency, the extra accuracy
may be counterbalanced by extra bubbles in the pipeline front end.
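The cost of such a hierarchy can be sketched with a toy front-end model (not a model of any specific CPU): whenever the slow, accurate predictor disagrees with the fast one, the fetches issued in the meantime are squashed, costing as many bubble cycles as the slow predictor's latency. The function name and the latency value are illustrative assumptions.

```python
def frontend_cycles(fast_preds, slow_preds, latency=2):
    """Toy model of a two-level prediction hierarchy.

    fast_preds: per-branch outputs of the 1-cycle predictor.
    slow_preds: per-branch outputs of the slower, accurate predictor.
    Returns the total fetch cycles, adding `latency` bubble cycles
    each time the slow predictor overrides the fast one.
    """
    cycles = 0
    for fast, slow in zip(fast_preds, slow_preds):
        cycles += 1          # the fetch cycle itself
        if fast != slow:
            cycles += latency  # squash in-flight fetches and redirect
    return cycles
```

Even a modest disagreement rate multiplied by a 2-cycle or 3-cycle override latency translates into a visible loss of fetch bandwidth, which is the effect described above.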
We propose to investigate an alternative solution that would allow
huge and (hopefully) very accurate branch predictors to be used
without wasting bubble cycles in the pipeline front end. In [1],
pipelined multiple-block-ahead branch prediction, we proposed to
predict fetch addresses using information available 2 or more cycles
ahead of the fetch cycle. Such an approach makes it possible to use
very large predictor tables and some extra complex logic in the
predictor, while still delivering the prediction in time for fetch.
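The timing idea can be sketched with a toy model, assuming a prediction for block n is computed from the address of block n-2 (ahead = 2): each fetch launches a table lookup whose result is only consumed `ahead` blocks later, so a lookup taking `ahead` cycles still completes in time. This is only an illustration of the pipelining principle, not the exact mechanism of [1].

```python
from collections import deque

class AheadPipeline:
    """Toy model of `ahead`-block-ahead prediction: the table entry for
    a block gives the address expected `ahead` blocks later, and each
    lookup result becomes available `ahead` fetch steps after launch."""

    def __init__(self, ahead=2):
        self.ahead = ahead
        self.table = {}
        # Lookups launched but whose results are not yet available.
        self.in_flight = deque([None] * ahead)

    def train(self, addr, target):
        # Learn that `target` is fetched `ahead` blocks after `addr`.
        self.table[addr] = target

    def advance(self, block_addr):
        """Fetch `block_addr`: launch its lookup, and return the result
        of the lookup launched `ahead` blocks ago, which is exactly the
        prediction arriving in time for the current fetch."""
        self.in_flight.append(self.table.get(block_addr))
        return self.in_flight.popleft()
```

With ahead = 2, the lookup launched while fetching some block delivers its result two blocks later: the table read has two full cycles to complete, yet no bubble is inserted in the fetch stream.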
Bibliography
[1] A. Seznec, S. Jourdan, P. Sainrat, P. Michaud, "Multiple-Block
Ahead Branch Predictors", Proceedings of the 7th Conference
on Architectural Support for Programming Languages and Operating
Systems, Boston, October 1996.