Let us consider a simple sequence of instructions:
1. read of the hardware clock counter
2. conditional branch
3. a load
4. read of the hardware clock counter
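A rough user-level analogue of this sequence can be sketched in Python. This is an illustration only: `time.perf_counter_ns` stands in for the hardware clock counter, a list lookup stands in for the load, and the names `timed_sequence` and `collect` are hypothetical. Interpreter overhead dominates the absolute values, so only the spread between runs is of interest.

```python
import time

def timed_sequence(data, index, flag):
    """Time one pass over: clock read, conditional branch, load, clock read."""
    start = time.perf_counter_ns()   # 1. read the clock (stand-in for the cycle counter)
    if flag:                         # 2. conditional branch
        index += 1
    value = data[index]              # 3. load
    end = time.perf_counter_ns()     # 4. read the clock again
    return end - start, value

def collect(samples=1000):
    """Repeat the sequence and return the elapsed times in nanoseconds."""
    data = list(range(4096))
    timings = []
    for i in range(samples):
        index = (i * 61) % (len(data) - 1)   # leaves room for index + 1
        elapsed, _ = timed_sequence(data, index, i % 3 == 0)
        timings.append(elapsed)
    return timings

if __name__ == "__main__":
    timings = collect()
    print("min:", min(timings), "max:", max(timings), "distinct:", len(set(timings)))
```

Running it several times typically shows different minimum, maximum and distinct counts, which is the kind of run-to-run variation discussed here, although at a much coarser granularity than a cycle counter would give.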
The number of cycles needed to execute this short sequence depends on:
1. Correct or incorrect branch prediction (both direction and target)
2. Hit or miss of the instructions in the ITLB
3. Hit or miss of the instructions in the instruction cache
4. Hit or miss of the data load in the data cache
5. Hit or miss of the data load in the data TLB
6. In case of a miss in one of those caches, hit or miss in the L2 cache
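To give a feel for how many cases these events already produce, the combinations can be enumerated. The sketch below makes simplifying assumptions that are not in the text: each event occurs once for the whole sequence, only L1 instruction and data cache misses go on to the L2 cache, and TLB outcomes are purely hit or miss. The function name `enumerate_outcomes` is hypothetical.

```python
from itertools import product

def enumerate_outcomes():
    """List every combination of the hit/miss events for the short sequence."""
    branch = ("predicted", "mispredicted")
    tlb = ("hit", "miss")
    # An L1 access either hits, or misses and then hits or misses in the L2 cache.
    l1 = ("L1 hit", "L1 miss, L2 hit", "L1 miss, L2 miss")
    # Order of the fields: branch, ITLB, I-cache, D-cache, DTLB
    return list(product(branch, tlb, l1, l1, tlb))

outcomes = enumerate_outcomes()
print(len(outcomes))  # 2 * 2 * 3 * 3 * 2 = 72
```

Even under these assumptions a four-instruction sequence has 72 distinct hit/miss scenarios, before any pipeline or buffer state is taken into account.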
In addition to these binary states (present/absent or correct/wrong), the execution time of a sequence also depends on the precise state of every instruction in every stage of the execution pipeline. This state is very complex. For instance, on an in-order execution processor such as the Sun UltraSparc II, up to 4 instructions may be executed per cycle and the pipeline features 9 stages. On out-of-order execution processors such as the Compaq Alpha 21264 or the Intel Pentium 4, the state is even more complex: more instructions can be in flight in the pipeline at the same time (up to 80 instructions on the Alpha 21264, more than 100 instructions on the Pentium 4), and the state of each instruction is itself complex: register renaming, waiting for execution, ...
Moreover, modern superscalar processors feature numerous buffers that aim at optimizing performance, for instance write buffers, victim buffers and prefetch buffers. The response time of the memory hierarchy when servicing a miss depends on the state of all these buffers. In addition, the response time of the memory on an L2 cache miss depends on any conflicting event on the memory system or on the system bus.
For a Sun workstation featuring an UltraSparc II and running Solaris, we report here estimates of the minimum number of blocks or entries that are displaced from the data and instruction L1 caches, the L2 cache, and the instruction and data TLBs by a single operating system interruption. Intuitively, this represents a minimal evaluation of the perturbation introduced by the interruption. We also report the "minimum" cumulated perturbation over 100 consecutive interruptions. These numbers are reported for a non-loaded machine (no other heavy process running), since on a loaded machine more blocks will be evicted on average.
L1 data cache:
The UltraSparc L1 data cache is a 16-Kbyte direct-mapped cache. It features 512 32-byte cache sectors. A miss fetches only 16 bytes; the second 16-byte block is fetched only on demand. The state of a sector location is therefore represented by the physical address of the data sector mapped onto it and the presence or absence of each of the two halves of the sector.
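The sector state described above can be modelled in a few lines. This is a simplified sketch, not a description of the actual hardware: it assumes the sector index and the tag are both derived from the same byte address, and the names `locate`, `SectorState` and `access` are hypothetical.

```python
CACHE_BYTES = 16 * 1024
SECTOR_BYTES = 32
HALF_BYTES = 16
NUM_SECTORS = CACHE_BYTES // SECTOR_BYTES   # 512 sector locations

def locate(address):
    """Return (sector index, half, tag) for a byte address in a direct-mapped cache."""
    sector = (address // SECTOR_BYTES) % NUM_SECTORS
    half = (address // HALF_BYTES) % 2
    tag = address // CACHE_BYTES
    return sector, half, tag

class SectorState:
    """State of one sector location: the resident tag and which halves are present."""
    def __init__(self):
        self.tag = None
        self.present = [False, False]

def access(cache, address):
    """Simulate a load; return True on a hit, False on a miss (16 bytes fetched)."""
    sector, half, tag = locate(address)
    entry = cache[sector]
    if entry.tag == tag and entry.present[half]:
        return True
    if entry.tag != tag:              # another sector was mapped here: displace it
        entry.tag = tag
        entry.present = [False, False]
    entry.present[half] = True        # only the requested 16-byte half is fetched
    return False

cache = [SectorState() for _ in range(NUM_SECTORS)]
print(access(cache, 0), access(cache, 8), access(cache, 16))  # False True False
```

The third access misses even though it falls in the same 32-byte sector as the first two, because the second half of a sector is fetched only on demand.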
On a non-loaded machine, most operating system calls touch about 80-200 data cache sectors (with a peak around 100-110 cache sectors), while, depending on the runs, 1-10% of the operating system calls displace almost all the blocks from the cache. For 100 consecutive interruptions, the number of displaced blocks always exceeded 11,500 in our experiments.
L1 instruction cache and the conditional branch predictor:
The 16-Kbyte instruction cache on the UltraSparc is 2-way set-associative and features 32-byte cache blocks. On the UltraSparc, the branch predictor is incorporated in the I-cache: a 2-bit counter is associated with every pair of instructions and a prediction of the address of the next 4-instruction block is associated with every 4-instruction group.
The state of a cache set can be represented by the ordered set of the addresses of the instruction blocks mapped onto it and the associated branch prediction information. An operating system call will flush part of the I-cache, and therefore also part of the branch prediction information.
We measured that, on a non-loaded UltraSparc machine, most operating system calls displace around 250 32-byte blocks of instructions, while 100 consecutive operating system calls displace at least 30,000 blocks.
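The amount of branch prediction state that goes away with the displaced blocks follows from the I-cache organization. The arithmetic below is a sketch that assumes fixed 4-byte SPARC instructions and that the prediction fields of a block are lost whenever the block is displaced; the helper name `prediction_state_lost` is hypothetical.

```python
CACHE_BYTES = 16 * 1024
BLOCK_BYTES = 32
WAYS = 2
INSTRUCTION_BYTES = 4   # assumption: fixed 4-byte SPARC instructions

BLOCKS = CACHE_BYTES // BLOCK_BYTES                        # 512 blocks
SETS = BLOCKS // WAYS                                      # 256 sets of 2 blocks
INSTRUCTIONS_PER_BLOCK = BLOCK_BYTES // INSTRUCTION_BYTES  # 8 instructions
COUNTERS_PER_BLOCK = INSTRUCTIONS_PER_BLOCK // 2   # one 2-bit counter per pair
TARGETS_PER_BLOCK = INSTRUCTIONS_PER_BLOCK // 4    # one next-address field per group of 4

def prediction_state_lost(displaced_blocks):
    """Return (2-bit counters, next-address fields) lost with the displaced blocks."""
    return (displaced_blocks * COUNTERS_PER_BLOCK,
            displaced_blocks * TARGETS_PER_BLOCK)

counters, targets = prediction_state_lost(250)
print(counters, targets)  # 1000 500
```

Under these assumptions, a typical operating system call that displaces about 250 of the 512 blocks also discards about 1000 of the 2048 2-bit counters and 500 of the 1024 next-address predictions.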
TLBs:
The UltraSparc II features a data TLB and an instruction TLB. Both TLBs have 64 entries, are fully associative, and use a Not Last Used replacement policy. The global state of a TLB can be represented by the set of the addresses of the pages mapped by the TLB and the state of the logic needed for implementing the replacement policy.
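A toy model of such a TLB is sketched below. It assumes that "Not Last Used" means the victim may be any entry except the most recently accessed one, and it picks the victim at random among the candidates; the real replacement logic may differ. The class name `NotLastUsedTLB` and its `seed` parameter are hypothetical.

```python
import random

class NotLastUsedTLB:
    """Fully associative TLB; the victim is any entry except the last one used."""

    def __init__(self, entries=64, seed=0):
        if entries < 2:
            raise ValueError("need at least 2 entries to spare the last used one")
        self.capacity = entries
        self.pages = []            # resident page numbers
        self.last_used = None      # state of the replacement logic
        self.rng = random.Random(seed)

    def access(self, page):
        """Return True on a hit; on a miss, install the page and return False."""
        if page in self.pages:
            self.last_used = page
            return True
        if len(self.pages) < self.capacity:
            self.pages.append(page)
        else:
            # Every slot is a candidate except the one holding the last used page.
            candidates = [i for i, p in enumerate(self.pages) if p != self.last_used]
            self.pages[self.rng.choice(candidates)] = page
        self.last_used = page
        return False

tlb = NotLastUsedTLB()
misses = sum(not tlb.access(page) for page in range(64))
print(misses, len(tlb.pages))  # 64 64
```

In this model the global state is exactly the pair (`pages`, `last_used`), matching the description of the TLB state as the set of mapped pages plus the replacement logic state.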
We experimentally measured that, on a non-loaded machine, every operating system invocation displaces a significant number of data TLB entries (at least 16, and 52 on average), but only a few instruction TLB entries (6 on average). For 100 consecutive operating system invocations, the minimum cumulated number of displaced entries always exceeded 4,500 for the data TLB, but only 600 for the instruction TLB.
L2 cache:
The UltraSparc II processor is used in conjunction with a 1-Mbyte L2 cache featuring 64-byte blocks.
In the vast majority of cases, an operating system invocation displaced between 850 and 950 blocks. The minimum cumulated number of displaced blocks for 100 operating system invocations always exceeded 95,000.
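To put the cumulated figures in perspective, they can be compared with the capacity of each structure. The sketch below only divides the lower bounds reported above by the number of entries derived from the stated sizes; the dictionary `STRUCTURES` and the function `turnover` are hypothetical names.

```python
STRUCTURES = {
    # name: (capacity in entries, minimum displaced over 100 interruptions)
    "L1 data cache": (16 * 1024 // 32, 11_500),
    "L1 instruction cache": (16 * 1024 // 32, 30_000),
    "data TLB": (64, 4_500),
    "instruction TLB": (64, 600),
    "L2 cache": (1024 * 1024 // 64, 95_000),
}

def turnover(capacity, displaced):
    """Number of times the whole structure could have been replaced."""
    return displaced / capacity

for name, (capacity, displaced) in STRUCTURES.items():
    ratio = turnover(capacity, displaced)
    print(f"{name}: {capacity} entries, {displaced} displaced, {ratio:.1f}x capacity")
```

Since the displaced counts are lower bounds, the ratios are lower bounds as well: over 100 interruptions, each structure loses the equivalent of several times its full content.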
Summary:
On a Sun workstation featuring an UltraSparc II and running Solaris, the five memorization structures considered here lose a significant amount of volatile, non-architectural hardware state on each operating system invocation.
While the numbers presented here are only valid for this platform, the same conclusion will hold for other processors and other operating systems for PCs and workstations. Moreover, other processors (e.g. Alpha 21264, Pentium III) feature more complex branch prediction mechanisms that are even more affected by the operating system than the one on the UltraSparc II.