Scheduling mechanisms for wide-issue dynamic order execution superscalar
processors
Localisation : Irisa, Rennes
Equipe(s) : Caps
Responsable : André Seznec / Pierre Michaud (tél.
direct : 02 99 84 73 36, email : seznec@irisa.fr)
The path for performance on superscalar processors should combine
instruction level parallelism with high clock frequency. However,
constructors are facing obstacles for implementing both features.
For instance, a very high clock frequency is enabled through very
deep pipeline on Intel Pentium 4 while Alpha EV8 will feature a
very wide instruction level parallelism and SMT parallelism with
a less aggressive clock. Among the mechanismthat may prevent a very
high clock frequency is the scheduling logic (wake-up logic+ selection
logic). The scheduling logic determines which instructions have
become fireable in the instruction window. The complexity of the
scheduling logic increases with the number of instructions that
can be issued on each cycle, but also with the depth of the window
of instructions waiting for execution. More precisely the hardware
complexity of the wake-up logic is directly proportionnal to the
number of inputs and outputs of an instruction as well as with the
number of instructions in the window: an entry in the instruction
window must check the availability of all its entries, the availability
of every result must be forwarded through the whole instruction
window. It is critical to perform the loop (wake-up logic-selection
logic) in a single cycle since this is required to enable direct
chaining of a dependent instruction after a one-cycle latency instruction.
We propose to investigate new solutions for reducing the complexity
of the scheduling logic for wide-issue dynamic execution superscalar
processors.
Two orthogonal directions will be explored. First, current implementations
of selection logic assume instructions (or micro-operations) with
2 operands and 1 result. One way to reduce the complexity of the
wake-up logic would be to use a larger granularity for firing instructions.
A group of dependent instructions might share a single entry in
the "instruction group window". The entries would have
to be enlarged to support groups of instructions with a set of N
operands and M results, but the number of entries would be reduced.
A group of instructions would become fireable as a whole. The first
advantage of such a technic would be the possibility to eliminate
some intermediate result write and read in the register file (and
also the propagation of its availability to the wake-up logic).
Therefore it might reduce the overall complexity of the scheduling
logic. The second advantage of this technic is to releave part of
the stress on the loop (wake-up logic-selection logic) since some
of the direct chaining of dependent instruction after a one-cycle
latency instruction can be hiden in instruction groups. On the other
hand, instructions would have to be grouped to efficiently use the
"instruction group window". It should be noticed that
a trace (or decoded) cache would be the right place to handle such
grouping, moreover IA32 ISA should be a good ISA for naturally allow
this grouping. Using an "instruction group window" also
raises the issue of instruction validation: instructions have to
be validated in groups. The study will have to check whether and
when this approach is valid. Particularly, the size of the groups
of instructions and the number of input operands and results will
have to be determined. Strategies for building instruction groups
at decode time (trace build time ) will be explored.
Second, we have recently proposed a prescheduling technic [1] to
reduce the width of the instruction window that should be searched
for fireable instructions. It consists of computing an approximate
execution cycle for each instruction and to send instructions to
reservation stations based on this predicted schedule: instructions
do not compete in the selection logic before they are likely to
be fireable. While preliminary results are encouraging [1], some
parameters are still to be investigated. Directions for further
researches include the use of cache behavior for better prescheduling
(including taking in account hit/miss prediction) , using the effective
scheduling for computing future prescheduling ( a la trace cache)
and applying prescheduling on a clustered architecture. Moreover
prescheduling might also be appplied to groups of instructions (as
those that could be built above) instead of individual instructions.
This might allow to reach better prescheduling heuristics than greedy
execution.
Finally, combinations of the techniques described above will have
to studied to explore the best tradeoffs between instruction window
size, hardware complexity, critical path.
[1] P. Michaud, A. Seznec, "Data-flow Prescheduling for Large
Instruction Windows in Out-of-order Processors", Proceedings
of HPCA-7, Jan. 2001
|