A faster Marsaglia's Xorshift pseudo-random generator in unsafe C#

The built-in pseudo-random number generator in .NET, System.Random, is suitable for a wide range of applications. It offers a reasonable speed and a very reasonable level of randomness. But if you need a lot of random numbers, and aren't too worried about their quality, you might go looking for an alternative.

One such alternative is a family of algorithms discovered by George Marsaglia, collectively called Xorshift. Xorshift is fast, has a period of around 2^128 and passes Marsaglia's Diehard tests of randomness, provided the initial state is good.

But just how much faster is it than System.Random, given a C# implementation? Let's find out!

Measurement set-up

For my purpose, I needed to generate terabytes of pseudorandom data, in the form of small byte arrays. Hence all my comparisons concentrate on filling byte arrays. To perform these tests, I used QueryThreadCycleTime, a recently introduced function which returns the number of CPU cycles consumed by a specific thread. This is great for comparing performance of CPU-bound algorithms, since it excludes most of the measurement noise introduced by the OS preempting the thread and giving the CPU to a different process in the middle of a measurement.
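For reference, here's roughly how QueryThreadCycleTime can be called from C# via P/Invoke. This is just a sketch, not the exact benchmark code:

// requires: using System; using System.Runtime.InteropServices;
[DllImport("kernel32.dll", SetLastError = true)]
static extern bool QueryThreadCycleTime(IntPtr threadHandle, out ulong cycleTime);

[DllImport("kernel32.dll")]
static extern IntPtr GetCurrentThread();

// Cycles consumed by the current thread so far:
ulong cycles;
QueryThreadCycleTime(GetCurrentThread(), out cycles);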

The basic structure of each test is the same: measure the time taken to perform 200 calls to NextBytes, each filling an array 32,768 bytes long. These 200 calls are executed in a 20-iteration loop of 10 calls each, for a total of 6,553,600 random bytes generated per test. The RNG seed is fixed.

The program runs the above measurement 1000 times, recording the best time ever observed. I find that for deterministic CPU-bound tasks, recording the best time is much better than recording the average, since this eliminates measurement imperfections caused by other things running on the test machine (like the CPU cache getting flushed when the CPU got borrowed to do something else). This also means that the initial JIT compilation does not impact the results. Since the core task is always there but the imperfections come and go, the minimum time ever seen is a good approximation for how long the task takes under ideal conditions.
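A minimal sketch of that measurement loop, using the P/Invoke declarations above and assuming an rng object exposing NextBytes (the names here are illustrative, not the actual benchmark code):

var buf = new byte[32768];
ulong best = ulong.MaxValue;
for (int run = 0; run < 1000; run++)
{
    ulong start, end;
    QueryThreadCycleTime(GetCurrentThread(), out start);
    for (int outer = 0; outer < 20; outer++)
        for (int inner = 0; inner < 10; inner++)
            rng.NextBytes(buf);            // 200 calls x 32,768 bytes = 6,553,600 bytes
    QueryThreadCycleTime(GetCurrentThread(), out end);
    best = Math.Min(best, end - start);    // keep the best time ever observed
}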

Among other considerations is making sure that the build runs fully optimized, which is easiest to achieve by running a Release build with the debugger detached. I also try to make the OS as idle as possible by shutting down all non-essential programs and services. All tests were performed on an Intel i5 3570K at stock frequency, and the test ran as a 64-bit process.

Speed comparisons

The code samples below show only the core loop, which assumes that the array to be filled has a length that's a multiple of some number. To keep the comparisons to Random fair, I actually wrote the code that handles odd lengths correctly, but it's not shown below for conciseness. I also made sure to test that each Xorshift implementation produces exactly the same sequence of bytes, regardless of which buffer lengths are requested. All of these things can be seen in the full code (link at the end of the article).
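To give an idea of what that wrapper involves, here's a minimal sketch of one way to handle lengths that aren't a multiple of 4. It is not the repository's code; in particular, a version that keeps the byte stream identical across differently-sized requests additionally has to carry unused bytes over to the next call rather than discard them as done here:

public void NextBytes(byte[] buf)
{
    int bulk = buf.Length & ~3;       // largest multiple of 4 that fits
    FillBuffer(buf, 0, bulk);         // fast path: 4 bytes per iteration
    if (bulk < buf.Length)
    {
        // generate one more 32-bit value and use as many of its bytes as needed
        uint t = x ^ (x << 11);
        x = y; y = z; z = w;
        w = w ^ (w >> 19) ^ (t ^ (t >> 8));
        for (int i = bulk; i < buf.Length; i++)
            buf[i] = (byte) (w >> ((i - bulk) * 8));
    }
}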

I will quote the results in CPU cycles. The documentation for QueryThreadCycleTime warns about interpreting the returned count. A quick test showed that on my PC, this counter grows by roughly 3.4 billion counts every second, matching my CPU clock frequency exactly, so it seems that the cycles measured are, in fact, CPU clock cycles. Also, whenever I mention “MB”, I mean 1 million bytes.
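For completeness, the clock-rate sanity check mentioned above can be as simple as this (a sketch, reusing the P/Invoke declarations from earlier):

ulong before, after;
QueryThreadCycleTime(GetCurrentThread(), out before);
var sw = System.Diagnostics.Stopwatch.StartNew();
while (sw.ElapsedMilliseconds < 1000) { }   // busy-wait so the thread keeps the CPU
QueryThreadCycleTime(GetCurrentThread(), out after);
Console.WriteLine(after - before);          // roughly 3.4 billion on a 3.4 GHz CPU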

Baseline: built-in Random

Random is provided by the framework, so I've got no code to show here. Let's get straight to the numbers: generating 6,553,600 bytes takes 167.0 million cycles. This amounts to roughly 134 MB per second.

Result: 25.5 cycles per byte; 134 MB/s.

Simple translation of Xorshift

This is a direct translation of the Xorshift code to safe C#. It generates four random bytes per iteration of the inner loop:

void FillBuffer(byte[] buf, int offset, int offsetEnd)
{
    while (offset < offsetEnd)
    {
        uint t = x ^ (x << 11);
        x = y; y = z; z = w;
        w = w ^ (w >> 19) ^ (t ^ (t >> 8));
        buf[offset++] = (byte) (w & 0xFF);
        buf[offset++] = (byte) ((w >> 8) & 0xFF);
        buf[offset++] = (byte) ((w >> 16) & 0xFF);
        buf[offset++] = (byte) ((w >> 24) & 0xFF);
    }
}

The values x, y, z and w are all fields in a class, maintaining the RNG state between calls. We'll look at how this affects the speed later.
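For context, the surrounding class looks something like this. The seed values shown are the defaults from Marsaglia's paper, used here purely as an illustration; the benchmark uses its own fixed seed:

class XorshiftRng
{
    // RNG state, preserved between calls to FillBuffer
    uint x = 123456789;
    uint y = 362436069;
    uint z = 521288629;
    uint w = 88675123;

    // FillBuffer (above) and a NextBytes wrapper that handles odd lengths go here
}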

Result: 2.92 cycles per byte; 1160 MB/s. That's 8.7x faster than Random.

Using unsafe for the byte array access

This code is basically the same as before, but uses unsafe to write each generated byte to the target array:

unsafe void FillBuffer(byte[] buf, int offset, int offsetEnd)
{
    fixed (byte* pbytes = buf)
    {
        byte* pbuf = pbytes + offset;
        byte* pend = pbytes + offsetEnd;
        while (pbuf < pend)
        {
            uint t = x ^ (x << 11);
            x = y; y = z; z = w;
            w = w ^ (w >> 19) ^ (t ^ (t >> 8));
            *(pbuf++) = (byte) (w & 0xFF);
            *(pbuf++) = (byte) ((w >> 8) & 0xFF);
            *(pbuf++) = (byte) ((w >> 16) & 0xFF);
            *(pbuf++) = (byte) ((w >> 24) & 0xFF);
        }
    }
}

Result: 2.63 cycles per byte; 1290 MB/s. That's an 11% improvement over the fully-safe version.

This code is actually quite silly. The loop produces 4 bytes at a time in an integer variable, so it makes sense to write that directly to the buffer in one go:

protected unsafe override void FillBuffer(byte[] buf, int offset, int offsetEnd)
{
    fixed (byte* pbytes = buf)
    {
        uint* pbuf = (uint*) (pbytes + offset);
        uint* pend = (uint*) (pbytes + offsetEnd);
        while (pbuf < pend)
        {
            uint t = x ^ (x << 11);
            x = y; y = z; z = w;
            w = w ^ (w >> 19) ^ (t ^ (t >> 8));
            *(pbuf++) = w;
        }
    }
}

Result: 1.85 cycles per byte; 1840 MB/s. That's 42% faster than the silly-unsafe version.

Locals instead of fields

The x, y, z and w variables seen above are class fields, storing the RNG state between invocations. Let's try copying this state into locals for the duration of the loop and then back:

protected unsafe override void FillBuffer(byte[] buf, int offset, int offsetEnd)
{
    uint x = _x, y = _y, z = _z, w = _w; // copy the state into locals temporarily
    fixed (byte* pbytes = buf)
    {
        uint* pbuf = (uint*) (pbytes + offset);
        uint* pend = (uint*) (pbytes + offsetEnd);
        while (pbuf < pend)
        {
            uint t = x ^ (x << 11);
            x = y; y = z; z = w;
            *(pbuf++) = w = w ^ (w >> 19) ^ (t ^ (t >> 8));
        }
    }
    _x = x; _y = y; _z = z; _w = w;
}

Result: 1.13 cycles per byte; 3030 MB/s. That's a 65% improvement over the same code reading and writing fields directly.

Unrolled loop

Now let's take this code and rewrite it to perform four iterations of the loop one after another. In a later section I'll go into a lot more detail about the idea behind this, as well as how I arrived at this code. For now let's just see the new inner loop:

while (pbuf < pend)
{
    uint tx = x ^ (x << 11);
    uint ty = y ^ (y << 11);
    uint tz = z ^ (z << 11);
    uint tw = w ^ (w << 11);
    *(pbuf++) = x = w ^ (w >> 19) ^ (tx ^ (tx >> 8));
    *(pbuf++) = y = x ^ (x >> 19) ^ (ty ^ (ty >> 8));
    *(pbuf++) = z = y ^ (y >> 19) ^ (tz ^ (tz >> 8));
    *(pbuf++) = w = z ^ (z >> 19) ^ (tw ^ (tw >> 8));
}

Result: 0.880 cycles per byte; 3870 MB/s. That's 28% faster than the non-unrolled version.

64-bit variant

In the code shown so far, the state variables and all the computations are 32 bits wide. Just to see how much faster it could go, I changed everything to 64-bit values. Perhaps unsurprisingly, it ran almost exactly twice as fast as the 32-bit variant.

Result: 0.445 cycles per byte; 7650 MB/s. That's 98% faster than the 32-bit version.

Unfortunately it generates a different pseudo-random sequence of bytes. Since I'm no expert on RNGs, and Xorshift is said to require a good seed to work well, I have little confidence that my 64-bit variant produces a comparable level of randomness.
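For illustration, here's roughly what the 64-bit inner loop looks like. This fragment assumes the same fixed/unsafe scaffolding and the same local-variable trick as before, and keeps the 32-bit shift constants purely for simplicity; a carefully designed 64-bit Xorshift might well use different constants:

ulong x = _x, y = _y, z = _z, w = _w;
ulong* p = (ulong*) (pbytes + offset);
ulong* pend = (ulong*) (pbytes + offsetEnd);
while (p < pend)
{
    ulong tx = x ^ (x << 11);
    ulong ty = y ^ (y << 11);
    ulong tz = z ^ (z << 11);
    ulong tw = w ^ (w << 11);
    *(p++) = x = w ^ (w >> 19) ^ (tx ^ (tx >> 8));
    *(p++) = y = x ^ (x >> 19) ^ (ty ^ (ty >> 8));
    *(p++) = z = y ^ (y >> 19) ^ (tz ^ (tz >> 8));
    *(p++) = w = z ^ (z >> 19) ^ (tw ^ (tw >> 8));
}
_x = x; _y = y; _z = z; _w = w;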

Locals instead of fields

Reading a field is typically so fast as to be completely irrelevant, so replacing fields with locals might sound surprising. But this is a very tight loop! One would expect a good JIT compiler to use registers for locals, and I believe the spec actually allows the CLR to optimize field accesses in the same way, so long as the updates remain visible across memory barriers. However, for one reason or another, the 64-bit JIT compiler does not actually do this. See for yourself:

; using fields
; uint tx = x ^ (x << 11);
mov    edx, dword ptr [rdi+20h]   ; read x from memory
mov    eax, edx
shl    eax, 0Bh
xor    edx, eax
; using locals
; uint tx = x ^ (x << 11);
mov    eax, ebx                   ; read x from ebx
shl    eax, 0Bh
xor    ebx, eax
mov    ecx, ebx

The number of instructions is the same, but we're long past the point where the number of instructions bears much relation to execution time. What's important is the presence of that memory access. Even if it hits the L1 cache, it's still significantly more expensive than reading a register. Let's see what impact this has on the three implementations we've seen so far:

  • safe with locals: 1260 MB/s, same code with fields: 1160 MB/s;
  • unsafe with locals: 3030 MB/s, same code with fields: 1840 MB/s;
  • unrolled with locals: 3870 MB/s, same code with fields: 1770 MB/s.

The thing about fields is that a different thread could theoretically read them at any time – something that isn't ever true of locals¹. So allocating CPU registers for locals is guaranteed to be side-effect free, but doing so for fields can change program behaviour. Since .NET tools generally seem to avoid optimizations that alter program behaviour, I wonder if this might be why the 64-bit CLR went this route.

Here's another example of what switching to locals does. This one shows the shift of [x, y, z, w] by 32 bits:

; using fields
; x = y; y = z; z = w;
mov    eax, dword ptr [r10+24h]
mov    dword ptr [r10+20h], eax
mov    eax, dword ptr [r10+28h]
mov    dword ptr [r10+24h], eax
mov    eax, dword ptr [r10+2Ch]
mov    dword ptr [r10+28h], eax
; using locals
; x = y; y = z; z = w;
mov    r9d, ebx
mov    ebx, edi
mov    edi, r8d

This time it's even more obvious how important it is to use locals in such a tight loop. I haven't had the time to test this with the 32-bit runtime, but I wouldn't be surprised if the results are completely different there.

Loop unrolling

Loop unrolling is a pretty old optimization technique that's still useful, although the reasons why it helps have changed a lot. The idea is to rewrite the code in such a way as to perform more than one logical iteration of the loop per actual iteration.

If you haven't encountered this before, here is the basic idea:

// Before unrolling
for (int i = 0; i < end; i++)
{
    compute(i);
}

// After unrolling by 2
for (int i = 0; i < end; )
{
    compute(i++);
    compute(i++);
}

// After unrolling by 4
for (int i = 0; i < end; )
{
    compute(i++);
    compute(i++);
    compute(i++);
    compute(i++);
}

It's sometimes also possible to increment i just once at the end of the loop. Loop unrolling works best when each computation is independent of i and of the previous computation, and only has a significant effect when the loop body takes very little time.
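For example, a sketch of that single-increment form of the 4x unrolled loop above:

// Unrolled by 4 with a single index update per iteration
for (int i = 0; i < end; i += 4)
{
    compute(i);
    compute(i + 1);
    compute(i + 2);
    compute(i + 3);
}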

On older processors, the biggest advantage of this approach was a reduction in the number of times the processor had to take a branch. Once CPUs started using pipelined execution, branches became expensive, because taking a branch meant that all the work in the pipeline had to be discarded, stalling the CPU for several clock cycles while the pipeline re-filled from the start of the loop. Later processors would try to predict whether a branch would be taken, with approaches ranging from very simple (a backwards branch is always predicted to be taken) to advanced (basing the prediction on the exact history of the branch over the last few times it was hit).²

With branch prediction in place, this particular benefit of loop unrolling all but disappeared, but a new one appeared: processors could now execute more than one instruction per clock cycle, as long as certain conditions were met. Jumping back to the start of the loop could mess up these conditions, preventing the CPU from executing as many instructions per cycle as it otherwise could.

Another benefit of loop unrolling – and perhaps the most obvious one – is the elimination of code that doesn't actually need to run on every iteration, and can be executed just once every N iterations.

Unrolling Xorshift

So the main purpose of unrolling the Xorshift loop is to eliminate code that doesn't need to run on every iteration of the loop. It might not be obvious that it has such code, but there's always the loop condition check. If we know the array's length is a multiple of N bytes, we only need to check whether we've reached the end of the array once every N bytes:

while (pbuf < pend)
{
    uint t = x ^ (x << 11);
    x = y; y = z; z = w;
    *(pbuf++) = w = w ^ (w >> 19) ^ (t ^ (t >> 8));

    t = x ^ (x << 11);
    x = y; y = z; z = w;
    *(pbuf++) = w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}

Moreover, once you write the loop body twice, you can start noticing some redundancies in all that data flowing around those four state variables (x, y, z and w). For example, x is assigned twice, but the result of the first assignment is only used once, to compute the new value for the temporary variable t. This assignment can be eliminated by moving the computation of the second t to the start of the loop, when the input value it needs is still available in y:

while (pbuf < pend)
{
    uint tx = x ^ (x << 11);
    uint ty = y ^ (y << 11);

    y = z; z = w;
    *(pbuf++) = w = w ^ (w >> 19) ^ (tx ^ (tx >> 8));

    x = y; y = z; z = w;
    *(pbuf++) = w = w ^ (w >> 19) ^ (ty ^ (ty >> 8));
}

Let's carry on shuffling those assignments around. Pick a variable read and trace where that value comes from. Then, if that value is still available in another variable, use that variable instead and remove the now-redundant assignment. The result:

while (pbuf < pend)
{
    uint tx = x ^ (x << 11);
    uint ty = y ^ (y << 11);
    x = z;
    y = w;
    *(pbuf++) = z = w ^ (w >> 19) ^ (tx ^ (tx >> 8));
    *(pbuf++) = w = z ^ (z >> 19) ^ (ty ^ (ty >> 8));
}

At this point it becomes clear that this loop could use some more unrolling, since it still has those two assignments that just shift the data around. Intuitively it looks like it might be possible to eliminate them completely. This intuition turns out to be correct; by repeating the above code twice and repeating the assignment elimination process, we arrive at the final variant:

while (pbuf < pend)
{
    uint tx = x ^ (x << 11);
    uint ty = y ^ (y << 11);
    uint tz = z ^ (z << 11);
    uint tw = w ^ (w << 11);
    *(pbuf++) = x = w ^ (w >> 19) ^ (tx ^ (tx >> 8));
    *(pbuf++) = y = x ^ (x >> 19) ^ (ty ^ (ty >> 8));
    *(pbuf++) = z = y ^ (y >> 19) ^ (tz ^ (tz >> 8));
    *(pbuf++) = w = z ^ (z >> 19) ^ (tw ^ (tw >> 8));
}

I am quite pleased with how this turned out! It's much more aesthetically satisfying than the original loop body (to my programmer eye, anyway). Here it is again, for contrast:

while (pbuf < pend)
{
    uint t = x ^ (x << 11);
    x = y; y = z; z = w;
    *(pbuf++) = w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}

Yuck! The unrolled code is so much nicer. Do note, however, that the old code is only slightly slower (~30%), and the variant that used fields instead of locals was actually about 4% faster before unrolling than after (1840 vs 1770 MB/s). Lesson: don't assume; measure!

Conclusions

                    MB/s   Cycles per byte   Compared to Random   Compared to Safe Xorshift
Random               134            25.500                    -                        0.1x
Safe Xorshift       1160             2.920                   9x                           -
Unsafe silly        1290             2.630                  10x                        1.1x
Unsafe              1840             1.850                  14x                        1.6x
Unsafe+locals       3030             1.130                  23x                        2.6x
Unrolled+locals     3870             0.880                  29x                        3.3x
64-bit              7650             0.445                  57x                        6.6x

Repeatability

I'm showing all results to three significant figures because that's really how repeatable these numbers were on my test machine. In fact, most results were repeatable to four significant figures, but since some weren't, I decided to round them to 3 s.f. If some of the percentages seem a bit off, that's because they were calculated on the un-rounded figures.

Does it need to be so fast?

There isn't much one can do with 7.6 GB of random data every second. A top-of-the-line computer can barely even write to RAM that quickly. So does it need to be so fast?

Firstly, this was mostly done out of curiosity and for fun, rather than because it needed to be this fast. But there's also a good reason behind all this. Even if you can only consume the random data at 100 MB/s, an algorithm managing 134 MB/s will end up consuming 75% of a CPU core just for this. An algorithm generating 7,650 MB/s will only consume about 1% of a CPU core in the same situation.

Further improvements

I've tried a whole bunch of variations of the final code, including unrolling by 8, incrementing pbuf just once per iteration (by 4), shuffling the arithmetic expressions around, and a few other things I don't remember. They all made the code slower. If you manage to come up with a faster implementation in pure C#, do post a comment!

C# for fast code

There are a lot of people who are skeptical of C#'s suitability for fast code. I'd say that most of them don't actually have a good understanding of what C# can do quickly and where it fails. In my experience, the kind of fast code where C# succeeds is like the above: tight loops doing integer arithmetic on locals, or something along those lines.

If you start using any of the C# niceties, things go downhill. Niceties include a List<T>, a foreach loop, a lambda, a class to hold intermediate results. You have to eliminate common subexpressions by hand, even when they are obviously side-effect-free by virtue of being locals. C# won't notice a line of code that could be hoisted out of the loop. C# won't be clever about inferring the value of a boolean expression from the if statement it's contained in. But if you do all of these things by hand, it can be pretty fast.
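To illustrate the kind of manual work involved, here's a small made-up example (the variable names are hypothetical, and whether the JIT catches any particular case depends on the runtime version):

// Written naively: scale * factor + offset is loop-invariant, but you can't
// count on the JIT hoisting it out of the loop for you
for (int i = 0; i < items.Length; i++)
    total += items[i] * (scale * factor + offset);

// Hoisted by hand into a local, which the JIT will happily keep in a register
double k = scale * factor + offset;
for (int i = 0; i < items.Length; i++)
    total += items[i] * k;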

Source code

If you want to reproduce these results or have a look at all the code in detail, see DemoXorshift on my BitBucket account. If you just want to grab the fastest variant of Xorshift, see here, but remember that you'll have to allow unsafe code, which has consequences for low-trust scenarios.


  1. C#'s locals captured by lambdas don't count, as they're technically fields. 

  2. This history of "first add pipelines, then branch prediction" is a little over-simplified: some CPUs had pipelines and branch prediction way before others, and some started out with both at once. But it's true of Intel CPUs: the first pipelined Intel CPU appears to be 80286, while branch prediction doesn't make an appearance until P5 (which was also the first Intel CPU to offer superscalar execution, conveniently keeping loop unrolling relevant despite the presence of a branch predictor). 
