A high-throughput, unpredictable random number generator...
Let us consider randomness as the outcome of an experiment; two extreme cases are of interest.
The first is the truly random number generator, which is based on a non-deterministic process. In that case, the outcome cannot be guessed, under any circumstances, before it actually happens.
The second extreme case is pseudo-randomness. The process takes as input a (short) sequence of numbers, the seed. It is called a pseudo-random number generator if, by considering only its outputs, it is computationally infeasible to distinguish it from a truly random number generator. By "computationally infeasible", we mean not achievable in polynomial time and memory, and by "distinguish", we mean that the probability of making the correct decision is significantly greater than 1/2.
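This notion can be stated more formally. The following is a standard formulation (ours, not taken from the original text): for every polynomial-time distinguisher D, a uniformly drawn seed s of length n, and a truly random string r of the same length as the output of the generator G,

    \left| \Pr[D(G(s)) = 1] - \Pr[D(r) = 1] \right| \le \varepsilon(n),

where \varepsilon(n) is negligible in n; the distinguisher's advantage over a random guess is then vanishingly small.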
Random bits produced by peripheral events (e.g., by entropy-gathering daemons), or by HAVEGE, are, strictly speaking, neither truly random bits nor pseudo-random bits. They are not truly random because the process that produces them is deterministic: one could theoretically reproduce the sequence if one were able to reproduce all the past events on the machine. They are not pseudo-random either, since there is no (short) seed that would allow an exact reproduction of the random sequence.
The randomness results instead from an inability to control or predict with sufficient accuracy the events involved in the generation process. For the sake of convenience, we will speak in that case of unpredictable randomness.
In microprocessor systems built around modern superscalar processors, the precise number of cycles needed to execute a (very short) sequence of instructions depends on many internal states of hardware components inside the microprocessor, as well as on events external to the process. Let us consider a simple sequence of instructions, such as the illustrative sketch below.
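The original sequence is not reproduced here; the following C fragment is a hypothetical stand-in that exhibits the same timing-relevant features, namely a memory access and a data-dependent conditional branch:

    /* Illustrative only: the execution time of this short sequence is not
       determined by the code alone, but by volatile hardware states. */
    int step(const int *T, int i, int threshold, int *sum) {
        if (T[i] > threshold)   /* load T[i]: L1 hit or miss? TLB hit or miss?  */
            *sum += T[i];       /* branch: correctly predicted or mispredicted? */
        return i + 1;
    }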
The number of cycles needed to execute this short sequence will depend, among other factors, on the presence or absence of the operands in the data cache and of the address translations in the TLBs, and on whether the conditional branch is correctly or incorrectly predicted.
In addition to these binary statuses (present/absent or correct/wrong), the execution time of a sequence also depends on the precise status of all the instructions in all the stages of the execution pipeline. This status is very complex. For instance, on an in-order execution processor such as the Sun UltraSparc II, up to 4 instructions may be executed per cycle and the pipeline features 9 stages. On out-of-order execution processors such as the Compaq Alpha 21264 or the Intel Pentium 4, the status is even more complex, since more instructions can be in flight in the pipeline at the same time (up to 80 instructions on the Alpha 21264, more than 100 instructions on the Pentium 4), and the status of each instruction is itself complex: register renaming, waiting for execution, etc.
Moreover, modern superscalar processors feature numerous buffers that aim at optimizing performance, for instance write buffers, victim buffers and prefetch buffers. The response time of the memory hierarchy when servicing a miss depends on the status of all these buffers. In addition, the response time of the memory on an L2 cache miss will depend on any conflicting event on the memory system or on the system bus.
Any operating system invocation modifies the contents of the instruction and data caches, the translation buffers (TLBs), the L2 cache and the branch prediction structures.
For a Sun workstation featuring an UltraSparc II and running Solaris, we report here estimates of the minimum numbers of blocks or entries that are displaced from the L1 data and instruction caches, the L2 cache, and the instruction and data TLBs by a single operating system interruption. Intuitively, this represents a minimal evaluation of the perturbation introduced by the interruption. We also report the "minimum" cumulated perturbation over 100 consecutive interruptions. These numbers were measured on a non-loaded machine (no other heavy process running), since on a loaded machine more blocks (on average) will be evicted.
The UltraSparc L1 data cache is 16 Kbytes and direct-mapped. It features 512 32-byte cache sectors. A miss fetches only 16 bytes; the second 16-byte block is fetched only on demand. The state of a sector location is therefore represented by the physical address of the data sector mapped onto it and the presence/absence of each of the two halves of the sector.
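As a side note, the mapping of an address onto one of these 512 sector locations can be sketched as follows; the helper below is our illustration, not part of HAVEGE (a direct-mapped cache selects the sector purely from address bits):

    #include <stdint.h>

    /* Hypothetical helper: for a 16-Kbyte direct-mapped cache with
       32-byte sectors, address bits [13:5] select the sector. */
    static unsigned int sector_index(uintptr_t addr) {
        return (addr >> 5) & 511;   /* 512 sectors, indices 0..511 */
    }

Two addresses that agree on bits [13:5] compete for the same sector: accessing one can evict the other.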
On a non-loaded machine, most operating system calls touch about 80-200 data cache sectors (with a peak around 100-110 cache sectors), while, depending on the runs, 1-10% of the operating system calls displace almost all the blocks from the cache. For 100 consecutive interruptions, the number of displaced blocks always exceeded 11,500 in our experiments.
The 16Kbyte instruction cache on the UltraSparc is 2-way set-associative and features 32-byte cache blocks. On the UltraSparc, the branch predictor is incorporated in the I-cache: a 2-bit counter is associated with every pair of instructions and a prediction of the address of the next 4-instruction block is associated with every 4-instruction group.
The state of a cache set can be represented by the ordered set of the addresses of the instruction blocks mapped onto it, together with the associated branch prediction information. An operating system call will flush part of the I-cache, and will therefore also flush part of the branch prediction information.
We measured that, on a non-loaded UltraSparc machine, most operating system calls displace around 250 32-byte blocks of instructions, while 100 consecutive operating system calls displace at least 30,000 blocks.
The UltraSparc II features a data TLB and an instruction TLB. Both TLBs have 64 entries, are fully associative and feature a Not-Last-Used replacement policy. The global state of a TLB can be represented by the set of the addresses of the pages mapped by the TLB and the state of the logic needed for implementing the replacement policy.
We experimentally measured that, on a non-loaded machine, every operating system invocation displaces a significant number of data TLB entries (52 on average, with a minimum of 16!), but only displaces a few instruction TLB entries (6 on average). For 100 consecutive operating system invocations, the minimum cumulated number of displaced entries always exceeded 4,500 for the data TLB, but only 600 for the instruction TLB.
The UltraSparc II processor is used in conjunction with a 1 Mbyte L2 cache featuring 64-byte blocks.
In the vast majority of cases, an operating system invocation displaced between 850 and 950 blocks. The minimum cumulated number of displaced blocks for 100 operating system invocations always exceeded 95,000.
On a Sun workstation featuring an UltraSparc II and running Solaris, the five memorization structures considered lose a significant amount of volatile non-architectural hardware information on each operating system invocation.
While the numbers presented here are only valid for this platform, the same conclusion will hold for other processors and other operating systems for PCs and workstations. Moreover, other processors (e.g., the Alpha 21264 or the Pentium III) feature more complex branch prediction mechanisms, which are even more affected by the operating system than the ones on the UltraSparc II.
External events introduce uncertainty in the volatile hardware states of the processor. Quantifying the volume of this uncertainty is difficult, since there exists no method to determine the entropy of a source that delivers random data several thousand bits at a time (i.e., the unmonitorable hardware states after each OS interrupt).
Below is the algorithm we used to empirically measure the average uncertainty introduced by an OS interrupt on the I-cache + branch predictor (this method does not work on the Itanium):
int Entrop[SIZEENTROPY];   /* entropy-gathering table; SIZEENTROPY is a power of two */
int A = 0, K = 0;

while (NBINTERRUPT < NMININT) {    /* NBINTERRUPT is updated inside HARDTICK() */
    /* data-dependent conditional branch that exercises the branch predictor */
    if (A == 0)
        A++;
    else
        A--;
    /* fold the hardware clock counter into the table through shifts and XORs */
    Entrop[K] = (Entrop[K] << 5) ^ (Entrop[K] >> 27)
                ^ HARDTICK()
                ^ (Entrop[(K + 1) & (SIZEENTROPY - 1)] >> 31);
    K = (K + 1) & (SIZEENTROPY - 1);
    /** loop body repeated XX times (unrolled) **/
}
HARDTICK() is a function that reads the hardware clock counter, tests the difference with the last read value, and increments the NBINTERRUPT counter whenever this difference is higher than a threshold, which indicates that there was an interruption between two successive calls to the function.
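The text does not list the code of HARDTICK(); the following is a minimal sketch of what such a function might look like on an x86 processor, assuming the rdtsc instruction for reading the cycle counter and a hypothetical tuning constant THRESHOLD:

    #include <stdint.h>

    #define THRESHOLD 2000          /* assumed: cycle gap that flags an interrupt */

    static uint64_t lastread;       /* counter value at the previous call */
    int NBINTERRUPT = 0;            /* number of interruptions detected   */

    /* Read the x86 time-stamp counter; other platforms would read their
       own cycle counter (e.g., the %tick register on the UltraSparc). */
    static inline uint64_t read_cycles(void) {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static unsigned int HARDTICK(void) {
        uint64_t now = read_cycles();
        /* an abnormally large gap between two successive calls means the
           process was interrupted in between */
        if (now - lastread > THRESHOLD)
            NBINTERRUPT++;
        lastread = now;
        return (unsigned int)now;
    }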
Throughout the loop, HARDTICK() is called many times, and the result of each read is combined through exclusive-ORs and shifts into the Entrop array. The unrolling factor (XX) is chosen in such a way that the loop body fits in the instruction cache (e.g., 16 Kbytes for the Pentium III or the UltraSparc II) and does not overflow the branch prediction structures.
The while loop is run until the number of encountered interruptions reaches NMININT. SIZEENTROPY is the size of the table used for gathering the read values of the hardware clock counter. The flow of instructions executed by the loop body of the algorithm is completely deterministic. Therefore, in the absence of operating system interrupts, the contents of the instruction cache and of the branch predictor should be completely predictable: we checked that the iterations of the loop just after an interruption exhibit much more uncertainty than the iterations that occur long after the last interruption.
The Entrop array gathers (part of) the uncertainty injected in the instruction cache and the branch predictor by the operating system interrupts. The content of the Entrop array is recorded in a file and the Entrop array is reinitialized to zero.
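As an illustration, the recording step could look like the following sketch; the helper is ours, not from the original code:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: dump the gathered bits to a file, then clear
       the table so the next measurement starts from a known state. */
    static void record_and_reset(int *entrop, size_t n, FILE *out) {
        fwrite(entrop, sizeof entrop[0], n, out);
        memset(entrop, 0, n * sizeof entrop[0]);
    }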
Fixing SIZEENTROPY to 65536, we determined the threshold for NMININT above which the content of the Entrop array can be CONSISTENTLY considered as random, i.e., it cannot be distinguished from truly random numbers by our battery of tests (DieHard + ent + FIPS 140 + the NIST suite). By consistently, we mean that we obtained such results under many different workloads.
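To give a concrete idea of what such tests compute, the sketch below estimates the byte-level Shannon entropy of a recorded Entrop file, the kind of statistic reported by the ent tool (this is our illustration, not the tool's source):

    #include <math.h>
    #include <stdio.h>

    /* Byte-level Shannon entropy in bits per byte; random-looking data
       should score close to the maximum of 8. */
    static double byte_entropy(FILE *f) {
        unsigned long count[256] = {0}, total = 0;
        int c;
        while ((c = fgetc(f)) != EOF) { count[c]++; total++; }
        double h = 0.0;
        for (int i = 0; i < 256; i++)
            if (count[i]) {
                double p = (double)count[i] / (double)total;
                h -= p * log2(p);
            }
        return h;
    }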
Uncertainty extracted from I-cache + branch predictors for supported platforms

Processor      | Operating System | NMININT | Av. unc. per OS int.
---------------|------------------|---------|---------------------
UltraSparc II  | Solaris          | 32      | 64 Kbits
UltraSparc III | Solaris          | 32      | 64 Kbits
Pentium III    | Linux            | 64      | 32 Kbits
Pentium III    | Windows          | 128     | 16 Kbits
PowerPC 7450   | MacOS X          | 32      | 64 Kbits
Athlon XP      | Linux/Windows    | 64      | 32 Kbits
Pentium IV     | Linux            | 64      | 32 Kbits
Pentium IV     | Windows          | 64      | 32 Kbits
Itanium 1      | Linux            | 256     | 8 Kbits
For a pseudo-random number generator, at any step in the computation, one can define its internal state as the values of its internal variables and tables. This internal state determines the future behavior of the generator and the sequence of numbers that it will generate.
At any moment, one can also define the internal state of the HAVEGE generator as the contents of its internal variables (the Walk table, the values of the pointers, the values in the result table) together with the values of all the volatile hardware states in the processor (branch predictors, instruction and data caches, ...), on the system bus and in the memory system that determine the execution time of the sequences of instructions in HAVEGE.
If one were able to capture the internal state of the HAVEGE generator at a given point, then one would (theoretically) be able to replicate the generated sequence from this point until the next occurrence of an interruption, or until a new external event on the system bus or the memory system.
However, collecting the internal state of the HAVEGE generator is infeasible at user level. Even the user cannot collect and reestablish the internal state. For instance, to indirectly determine that a particular cache block is present, one has to access it, but doing so modifies other internal states (an effect analogous to Heisenberg's uncertainty principle).
Another possible means of reproducing the sequence would be to set the internal state of the generator to a known state at the beginning of the execution, and then to try to follow the sequence of instructions by monitoring only the events external to the process. But the internal state at the initialization of the generator depends on many parameters, such as the exact (cycle-accurate) date of launching, the exact contents of the disks and their (cycle-accurate) response times, the memory contents, etc. Moreover, to follow the internal state, one would also have to monitor every external event, as well as the internal events during the initialization phase.
We describe below part of the volatile information that belongs to the internal state of the HAVEGE generator on an UltraSparc II workstation. A very similar analysis can be done for the other processors.
From the HAVEGE perspective, each of the 512 32-byte cache lines of the L1 data cache can be considered to be in one of seven possible states. Either it maps one of the two possible 32-byte blocks A0 or A1 from the Walk table, or it maps a block irrelevant to HAVEGE. If it maps a relevant block, then it maps one or both of its 16-byte subblocks; this gives three sub-states for each of A0 and A1, plus the irrelevant state, hence seven states in total. The L1 data cache can therefore be in any of 7**512 possible states.
The inner loop body of the self-modifying walk (just) fits in the L1 instruction cache. The L1 instruction cache is two-way set-associative and features 256 sets; branch prediction is embedded in the instruction cache.
Only two instruction blocks I0 and I1 from the main HAVEGE loop are mapped onto a given set of the instruction cache. An LRU replacement policy is implemented on the instruction cache. Therefore, from the HAVEGE perspective, each set has 7 possible states, B denoting an instruction block independent of the HAVEGE algorithm: (I0, I1), (I1, I0), (B, I0), (B, I1), (I0, B), (I1, B) and (B, B). From the HAVEGE perspective, the instruction cache can thus be in at least 7**256 different states.
On the UltraSparc II, the instruction cache also handles the branch prediction, for both targets and directions. For targets, from the HAVEGE perspective, two states are visible: either the predicted target is correct or it is not. On the UltraSparc II, the main loop in HAVEGE 2.0 features 11 iterations; each of these iterations features 21 conditional branches and 2 calls to HardClock, i.e., 11 x (21 + 2) = 253 predicted targets. From the HAVEGE perspective, the target predictor therefore features at least 2**253 different states. A 2-bit counter is associated with each of the 11 x 21 = 231 conditional branches; therefore, from the HAVEGE perspective, the direction predictor can be in at least 4**231 different states.
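These counts can be put on a common scale by expressing them in bits; the following arithmetic is ours, not a figure from the text. With \log_2 7 \approx 2.81:

    \log_2(7^{512}) \approx 1437, \quad \log_2(7^{256}) \approx 719, \quad \log_2(2^{253}) = 253, \quad \log_2(4^{231}) = 462,

i.e., close to 3,000 bits of volatile state visible to HAVEGE in these structures alone, consistent with the "thousands of binary internal volatile hardware states" mentioned below.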
Considering only the two memorization structures above, it appears that the internal state of the HAVEGE generator includes thousands of binary internal volatile hardware states. A significant fraction of these states is destroyed (modified) on each interruption, and each of these states influences the self-modifying walk in HAVEGE.
Moreover, many other volatile hardware states (pipeline states, buffer contents and statuses, the L2 cache, etc.) are part of the HAVEGE internal state. Some of the volatile hardware states touched by an operating system interrupt and belonging to the internal HAVEGE state are correlated: for instance, when an instruction block is evicted, its associated branch prediction information is evicted with it, and the next instruction block also has a high probability of being evicted. Nevertheless, there is little correlation between the blocks evicted from the data cache and the blocks evicted from the instruction cache.