Project contacts: Thierry Lafage, André Seznec
Introduction
Realistic microarchitecture simulations require realistic inputs
from a feeder (either a trace collection tool or an emulator). Realistic
inputs consist of the whole user and operating system activity
of a realistic target workload. However, the feeder by itself induces
a significant execution overhead, particularly when running instructions
which are not simulated (e.g. initialization phase). Therefore,
microarchitecture studies are generally performed using the beginning
of the application (or after skipping a few hundred million
instructions).
The main goal of the calvin2+DICE and LiKE
toolset is to enable microarchitecture simulations of
long applications by providing a very fast feeder able
to skip billions of instructions with a very limited
overhead compared with the original
application. Ideally, this feeder would also be able to
catch the activity of the operating system.
General Approach
To trace programs or to perform on-the-fly simulations, static
code annotation is generally a more efficient technique than instruction-set
emulation. However, instruction-set emulation is generally a much
more flexible approach: 1) it makes it possible to implement different
tracing/simulation strategies without the need to (re)instrument
the target programs and 2) all user activity (including dynamically
linked code and dynamically compiled code) can be traced and simulated.
Our approach takes advantage of both static code annotation for
its efficiency and instruction-set emulation for its flexibility.
A fast mode of execution is used to rapidly (i.e. with
a very low overhead) position the target program in interesting
simulation states. This mode relies on a direct execution of a lightly
instrumented version of the program on the host processor. On the
other hand, an instruction-set emulator is used to actually trace
the target program or enable on-the-fly simulations (emulation
mode). This emulator is embedded in the target program and can
take control during execution in fast mode.
At run time, the target program switches from the fast mode to
the emulation mode whenever switching events happen. Switching
events are monitored by the statically added code in the target
program. This code only tests whether a switching event has occurred,
and on a switching event gives control to the emulator. Note
that mode switching is deterministic since the annotation
code drives it. Switching back from emulation mode to fast mode
is managed by the emulator and is possible at any moment.
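The two-mode scheme described above can be sketched as a small state machine. This is only an illustration under stated assumptions: every name here (run_state, checkpoint, emulate_one) is hypothetical and not taken from calvin2 or DICE, the switching event is assumed to be "a given number of checkpoints executed", and the emulator is assumed to return to fast mode after a fixed instruction budget.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout: none of them come from calvin2 or DICE. */
typedef enum { MODE_FAST, MODE_EMULATION } exec_mode;

typedef struct {
    exec_mode mode;
    uint64_t checkpoints_seen;  /* checkpoints executed in fast mode */
    uint64_t switch_threshold;  /* assumed switching event: this many checkpoints */
    uint64_t emulated;          /* instructions emulated since the last switch */
    uint64_t emulation_budget;  /* emulator returns to fast mode after this many */
} run_state;

/* Statically inserted checkpoint code: it only tests whether the
 * switching event has occurred and, if so, hands control to the emulator. */
void checkpoint(run_state *s)
{
    assert(s->mode == MODE_FAST);
    s->checkpoints_seen++;
    if (s->checkpoints_seen >= s->switch_threshold) {
        s->mode = MODE_EMULATION;
        s->emulated = 0;
    }
}

/* One emulated instruction: the emulator itself decides when to
 * switch back to fast mode, which can be at any instruction. */
void emulate_one(run_state *s)
{
    assert(s->mode == MODE_EMULATION);
    s->emulated++;
    if (s->emulated >= s->emulation_budget) {
        s->mode = MODE_FAST;
        s->checkpoints_seen = 0;
    }
}
```

Because the switch to emulation mode depends only on state updated by the checkpoints themselves, two runs with the same input switch at the same checkpoint, which is the determinism property noted above.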
Light Static Code Annotation with calvin2
calvin2 is a static code annotation tool which uses the
SALTO library
to instrument SPARC assembly code.
calvin2 lightly instruments the target programs by inserting
checkpoints: the fast execution mode is the direct
execution of the instrumented programs. The checkpoint code sequence
consists of a few instructions (about 10) which check whether
control has to be given to DICE, the emulator. Switching from the
fast mode to the emulation mode is triggered by a switching event.
Checkpoint Insertion
The number of inserted checkpoints directly determines the execution
overhead in fast mode, so checkpoints must not be too numerous.
In contrast, their number and distribution among the executed code
determine the dynamic accuracy of mode switching (fast mode to
emulation mode). For instance, if checkpoints were uniformly
distributed every 500 dynamic instructions, we could choose to enter
emulation mode at the Nth instruction +/- 250 (ideal accuracy)
at a cost of only around 2% in performance (10 instructions added every
500 instructions). Unfortunately, checkpoints have to be inserted
at code generation, i.e. statically, so their dynamic distribution
cannot be controlled directly.
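The arithmetic behind the uniform-distribution example can be made explicit. The helper names below are hypothetical. The sketch only restates the figures from the text, and it assumes that every added checkpoint instruction costs as much as an average program instruction.

```c
#include <assert.h>

/* Fraction of extra dynamic instructions added by checkpoints:
 * checkpoint_len instructions inserted every `spacing` instructions. */
double checkpoint_overhead(unsigned checkpoint_len, unsigned spacing)
{
    assert(spacing > 0);
    return (double)checkpoint_len / (double)spacing;
}

/* Worst-case distance, in instructions, between a requested switch
 * point and the nearest checkpoint when checkpoints are uniform. */
unsigned switch_uncertainty(unsigned spacing)
{
    return spacing / 2;
}
```

With the values from the text, checkpoint_overhead(10, 500) gives 0.02 (about 2%) and switch_uncertainty(500) gives 250 instructions.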
For this reason, we have run experiments on the SPEC95 benchmarks
to characterize the distribution of the checkpoints executed when
they are inserted at procedure calls and inside each path of loops.
These experiments also allowed us to estimate the execution slowdown
in fast mode.
In short, for all the programs but fpppp, given a dynamic
switching event, there is a very high probability (90+%) that the
execution mode switch actually happens within 100 to 200 instructions.
Such a checkpoint layout also leads us to expect low execution slowdowns
(1.56 at most). Inserting checkpoints at procedure calls and inside each
path of loops is therefore quite acceptable.
Switching Events
We call a switching event the event that, during execution
in fast mode, makes the next executed checkpoint pass control to
DICE. Switching back from the emulation mode to the fast mode is
determined by the simulation user (e.g. a given number of instructions
emulated or some point reached in the execution). Four different
types of switching event have been implemented so far (see here for more
details).
DICE: A Dynamic Inner Code Emulator
DICE emulates SPARC V9 instruction-set architecture (ISA)
code: it manages the emulation execution mode of target programs.
DICE is a piece of C and assembly code (archive library) which is
embedded in (linked with) the target application. As such, it can
receive control and return to direct execution at any moment
during the execution by saving/restoring the host processor state.
DICE works with programs instrumented by calvin2: the inserted
checkpoints are used to give control to it.
DICE enables simulation by calling user-defined analysis routines
between emulated instructions. Analysis routines have direct
access to all information in the target program state, including
the complete memory state and register values.
Emulation Core
Figure 1: DICE main processing loop.
The DICE emulation core consists of the traditional fetch-decode-interpret
loop shown in Fig. 1. Each instruction
is fetched from the program text segment, decoded and interpreted. Trace
collection or on-the-fly simulation is made possible by calling specific
user-provided routines at each iteration of the main emulator loop.
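The shape of such a loop can be illustrated with a toy example. This is a sketch, not DICE code: it uses a made-up two-opcode ISA instead of SPARC V9, and the names toy_cpu, analysis_fn, emulate and count_hook are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy ISA, NOT SPARC V9: assumed encoding is opcode << 24 | 24-bit immediate. */
enum { OP_ADD = 1, OP_HALT = 2 };

typedef struct {
    uint32_t pc;   /* index of the next instruction in the text array */
    int64_t  acc;  /* single accumulator register */
} toy_cpu;

/* User-provided analysis routine, called once per loop iteration. */
typedef void (*analysis_fn)(const toy_cpu *cpu, uint32_t insn, void *ctx);

/* Fetch-decode-interpret loop; returns the number of instructions emulated. */
uint64_t emulate(toy_cpu *cpu, const uint32_t *text, size_t n,
                 analysis_fn hook, void *ctx)
{
    uint64_t count = 0;
    assert(cpu != NULL && text != NULL);
    while (cpu->pc < n) {
        uint32_t insn = text[cpu->pc];             /* fetch */
        uint32_t op   = insn >> 24;                /* decode */
        int32_t  imm  = (int32_t)(insn & 0xFFFFFF);
        if (hook)
            hook(cpu, insn, ctx);                  /* user analysis routine */
        count++;
        if (op == OP_HALT)                         /* interpret */
            break;
        if (op == OP_ADD)
            cpu->acc += imm;
        cpu->pc++;                                 /* unknown opcodes are skipped */
    }
    return count;
}

/* Example analysis routine: counts the instructions it is shown. */
void count_hook(const toy_cpu *cpu, uint32_t insn, void *ctx)
{
    (void)cpu;
    (void)insn;
    ++*(uint64_t *)ctx;
}
```

Passing count_hook with a pointer to a uint64_t counter yields an instruction count. A trace collector would instead write insn and the relevant addresses to an output stream from inside the hook.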
Processor Model
DICE can emulate SPARC V9 ISA code and models
the architectural resources of an UltraSPARC processor.
These resources are used to keep in memory the state of the target
program. They consist of an in-memory copy of all the SPARC V9
non-privileged registers: general-purpose registers, floating-point
registers, and control registers (PC, nPC, ...).
User Interface
DICE provides an interface which allows users to access dynamic
information in order to trace/simulate the target programs. Various
levels of detail are available and are configured at compilation
time by defining (or not) preprocessing macros.
The DICE user interface is implemented through global variables and
a few function declarations. The functions (user analysis routines)
are to be defined by the user, and are called by DICE under
well-defined circumstances (before/after each instruction emulation,
at system calls, and at checkpoints). Depending on the DICE configuration,
some of these functions may or may not be called.
Each user analysis routine can access the host processor logical
resources (general-purpose registers, floating-point registers and
control registers) through the parameters passed to it or directly
through the memory model of the host processor.
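An interface of this kind (user-defined routines, global variables, compile-time configuration) might look as follows. All identifiers are assumptions made for illustration: the text does not give the real macro, type or function names used by DICE, so TRACE_BEFORE_INSN, reg_view, user_before_insn, user_at_syscall and emulator_step are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration macro: set to 0 to compile the
 * per-instruction callback out of the emulator loop entirely. */
#define TRACE_BEFORE_INSN 1

/* Hypothetical view of the modeled registers passed to the routines. */
typedef struct {
    uint64_t gpr[32];  /* general-purpose registers */
    uint64_t pc, npc;  /* control registers */
} reg_view;

/* Declared by the (hypothetical) emulator header, defined by the user. */
void user_before_insn(const reg_view *regs);
void user_at_syscall(const reg_view *regs);

/* Global variables through which the user collects results. */
uint64_t insn_count;
uint64_t syscall_count;

void user_before_insn(const reg_view *regs)
{
    assert(regs != 0);
    insn_count++;
}

void user_at_syscall(const reg_view *regs)
{
    assert(regs != 0);
    syscall_count++;
}

/* Emulator side: which routines get called depends on the configuration. */
void emulator_step(const reg_view *regs, int is_syscall)
{
#if TRACE_BEFORE_INSN
    user_before_insn(regs);
#endif
    if (is_syscall)
        user_at_syscall(regs);
}
```

Selecting callbacks with preprocessor macros rather than run-time flags means that a disabled level of detail costs nothing in the emulator loop, at the price of rebuilding the emulator for each configuration.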
More details about DICE internals are presented here.
Performance Evaluation of calvin2+DICE
Programs instrumented with calvin2 and linked with DICE
have two modes of execution: the fast mode and the emulation mode.
In order to evaluate execution slowdowns incurred by both execution
modes, we collected execution times of the SPEC95 benchmarks, running
entirely either in fast mode, or in emulation mode.
On average over all the SPEC95 benchmarks, the fast mode execution
slowdown ranges from 1.07 to 1.82 depending on the switching event
type. The average emulation mode slowdown for instruction and data
address trace generation (to /dev/null) is 117.33.
LiKE: The Linux Kernel Emulator
DICE has been extended to LiKE (Linux Kernel Emulator) in order
to trace/simulate the operating system level activity. This extension
allowed us to complete our emulator: we modeled the privileged processor
resources (privileged registers), we added support for the privileged
instructions, and we made it work in a true 64-bit environment,
since we ran it on an UltraSPARC I workstation running Linux 2.2.8.
LiKE is a dynamically loadable module and can be incorporated
into a very slightly patched Linux kernel at any moment after kernel
boot. This feature is important because it lowers the degree of
intrusiveness of our tool: when the LiKE module is loaded,
the kernel is in a realistic state and LiKE does not disturb it
since it is only added to it. Also, the kernel can boot and be used
at full speed.
The current implementation uses an external shared variable to
drive LiKE. When this variable is set and the current process
is traced by DICE, LiKE takes control of the system calls made
by the current process.
However, LiKE is a preliminary version: so far it manages
to emulate the core of some system calls. This tool needs further
development to enable complete on-the-fly simulations. We plan to
set up a shared memory space (shared between the traced OS and the
traced processes) where the state of the on-the-fly simulator would
be updated either by a module connected to LiKE (when the kernel
has control in emulation mode) or by a library linked with
DICE (when one of the processes has control and is in emulation
mode).