programming, optimization

The Duff Device

Duff's Device explained: the famous C loop unrolling trick, why it worked in 1983 and what modern compilers teach us today.

June 2025

Duff’s Device was discovered by Tom Duff in November 1983, while he was working at Lucasfilm. He needed to speed up a real-time animation program that was running about 50% too slowly. The program had to copy data into the programmed I/O data register of an Evans & Sutherland Picture System II, and the straightforward loop just wasn’t fast enough.

What he came up with is one of the most famous pieces of C code ever written. It’s also one of the most confusing.

The idea

Let’s start with a simple loop that copies some memory.

for (int i = 0; i < len; i++) {
    *output++ = *input++;
}

For longer loops, the overhead of checking the loop condition adds up. On every iteration the loop checks the counter against len, increments it, and jumps back. That’s a lot of bookkeeping for a single copy operation.

Loop unrolling reduces this overhead by doing multiple operations per iteration.

// assuming len is a positive multiple of 8 (the do-while body always runs at least once)
int n = len / 8;
do {
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
} while (--n > 0);

Now we only pay the loop overhead once for every 8 copies. But what if len isn’t a multiple of 8? You’d need to handle the remainder separately. This is where Duff’s Device gets creative.
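For comparison, the conventional fix is a second cleanup loop for the leftovers. A minimal sketch, where copy_unrolled is a hypothetical helper name and copying int elements is an assumption made only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy len ints, 8 per iteration plus a cleanup loop. */
void copy_unrolled(int *output, const int *input, size_t len)
{
    assert(output != NULL && input != NULL);

    size_t blocks = len / 8;  /* number of full groups of 8 */
    size_t rest   = len % 8;  /* 0..7 leftover elements */

    /* main loop: one condition check per 8 copies */
    while (blocks-- > 0) {
        *output++ = *input++;
        *output++ = *input++;
        *output++ = *input++;
        *output++ = *input++;
        *output++ = *input++;
        *output++ = *input++;
        *output++ = *input++;
        *output++ = *input++;
    }

    /* cleanup loop: at most 7 single copies */
    while (rest-- > 0) {
        *output++ = *input++;
    }
}
```

This version also works for len == 0, at the cost of a second loop in the source.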

The trick

// assumes len > 0: the loop body always runs at least once
int n = (len + 8 - 1) / 8;
switch (len % 8) {
    case 0: do { *output++ = *input++;
    case 7:      *output++ = *input++;
    case 6:      *output++ = *input++;
    case 5:      *output++ = *input++;
    case 4:      *output++ = *input++;
    case 3:      *output++ = *input++;
    case 2:      *output++ = *input++;
    case 1:      *output++ = *input++;
    } while (--n > 0);
}

It looks like someone tricked the parser, but it really is valid C. It exploits two features of the language: the fall-through behavior of switch statements, and the fact that case labels are just jump targets that may appear anywhere inside the switch body, even in the middle of a do-while loop.

The switch acts as a computed jump into the middle of the loop body. On the first pass, it jumps to the case label that matches len % 8, so only the leftover copies run (or all eight, when the remainder is 0). After that, the do-while loop takes over, copying 8 elements at a time until n reaches zero.
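To make the control flow concrete, here is the same code wrapped in a function, with a trace for len = 11 in the comments. This is only a sketch: duff_copy is a hypothetical name, the int element type is an assumption, and the assert spells out the len > 0 precondition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper around the device. Requires len > 0. */
void duff_copy(int *output, const int *input, size_t len)
{
    assert(len > 0);            /* with len == 0 the body would still copy 8 */

    size_t n = (len + 7) / 8;   /* len = 11 -> n = 2 passes through the loop */

    switch (len % 8) {          /* len = 11 -> remainder 3, jump to case 3   */
    case 0: do { *output++ = *input++;
    case 7:      *output++ = *input++;
    case 6:      *output++ = *input++;
    case 5:      *output++ = *input++;
    case 4:      *output++ = *input++;
    case 3:      *output++ = *input++;   /* first pass starts here: 3 copies */
    case 2:      *output++ = *input++;
    case 1:      *output++ = *input++;
            } while (--n > 0);  /* n: 2 -> 1, loop again and copy all 8;     */
    }                           /* n: 1 -> 0, done after 3 + 8 = 11 copies   */
}
```

Some compilers warn about the implicit fall-through between the case labels; here it is intentional.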

As Duff himself wrote: “after 10 years of writing C there are still little corners that I haven’t explored fully.”

Making it readable

If we disentangle the two constructs, it becomes easier to follow.

int n = (len + 8 - 1) / 8;

// first: handle the remainder (a remainder of 0 copies one full block of 8)
switch (len % 8) {
    case 0: *output++ = *input++;
    case 7: *output++ = *input++;
    case 6: *output++ = *input++;
    case 5: *output++ = *input++;
    case 4: *output++ = *input++;
    case 3: *output++ = *input++;
    case 2: *output++ = *input++;
    case 1: *output++ = *input++;
}

// then: copy 8 at a time
while (--n > 0) {
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
    *output++ = *input++;
}

Same logic, no interleaved control structures. The switch handles the leftover elements (or one full block of 8 when there are none), and the while loop does the rest in chunks of 8. In both versions, the loop condition is checked roughly once per 8 copies instead of once per copy.

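I chose a memory copy operation only for demonstration purposes. Duff’s original loop did not increment the destination pointer at all, because every value went to the same memory-mapped register. For ordinary memory copies, the standard C library memcpy should almost always be preferred - implementations are typically tuned for the target architecture and are very hard to beat by hand.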
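For reference, the memcpy equivalent of the loops above is a one-liner. A small sketch, with copy_ints as a hypothetical helper and int elements assumed as before; note that memcpy takes a size in bytes and that the two regions must not overlap (memmove is the function for overlapping regions).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len ints between non-overlapping buffers. */
void copy_ints(int *output, const int *input, size_t len)
{
    /* memcpy counts bytes, so make sure the byte count cannot overflow */
    assert(len <= SIZE_MAX / sizeof *input);

    memcpy(output, input, len * sizeof *input);
}
```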

What the compiler killed

Here’s the twist. When the XFree86 project removed Duff’s Device from the X server for its 4.0 release in 2000, the server got faster and the binary got smaller.

The account usually cited for this, a Linux kernel mailing list post by Theodore Ts’o from 2000, puts most of the blame on code size: all those unrolled copies bloated the binary and put pressure on the instruction cache. Hand-unrolled code can also get in the compiler’s way, because the interleaved switch and do-while create a control flow pattern that is hard for an optimizer to analyze. Simpler, straightforward loops give the compiler room to do what it does best.

Modern compilers like GCC and Clang perform loop unrolling automatically when their heuristics judge it worthwhile. The -funroll-loops flag requests it explicitly, and with profile-guided optimization the compiler can usually make better decisions about when and how much to unroll than a programmer writing code by hand. You can see exactly what the compiler does on Compiler Explorer.
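A loop like the following is the kind of input an optimizer handles well. It is only a sketch: copy_plain is a hypothetical name, and the restrict qualifiers are an added assumption that the two buffers never overlap. Compiling it with something like gcc -O3 -S and reading the assembly shows what the compiler chose; the result depends on compiler version and target, and some compilers simply replace the whole loop with a call to memcpy.

```c
#include <stddef.h>

/* Hypothetical helper: the plain loop, left for the compiler to optimize.
   restrict promises that output and input do not alias. */
void copy_plain(int *restrict output, const int *restrict input, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        output[i] = input[i];
    }
}
```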

But compilers don’t just unroll loops. They also apply SIMD vectorization - that’s processing multiple data elements with a single instruction. The goal is the same as with Duff’s Device, more work per loop iteration, but the gain comes from wider hardware operations rather than from fewer branches. On x86 with SSE2, a single MOVDQA instruction moves 16 aligned bytes between memory and a register, so one load and one store copy 16 bytes at once. That’s a fundamentally different level of optimization than reducing branch overhead.
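Written by hand, a 16-bytes-at-a-time copy might look like the sketch below. It assumes an x86 target with SSE2, 16-byte-aligned buffers and a byte count that is a multiple of 16; copy_16_aligned is a hypothetical name, and on other targets the code falls back to memcpy. The aligned load and store intrinsics typically compile to MOVDQA, though the exact instruction is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: both pointers must be 16-byte aligned and
   bytes must be a multiple of 16. */
void copy_16_aligned(void *output, const void *input, size_t bytes)
{
    assert(bytes % 16 == 0);
    assert((uintptr_t)output % 16 == 0 && (uintptr_t)input % 16 == 0);

#if defined(__SSE2__)
    __m128i *dst = (__m128i *)output;
    const __m128i *src = (const __m128i *)input;

    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(src + i);  /* aligned 16-byte load  */
        _mm_store_si128(dst + i, v);          /* aligned 16-byte store */
    }
#else
    memcpy(output, input, bytes);             /* portable fallback */
#endif
}
```

In most programs the compiler or memcpy will produce code like this on its own, which is the point of this section.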

Clever tricks compilers have replaced

The Duff Device is not alone. Many hand-crafted optimizations that were once essential have been absorbed by compilers.

What it comes down to: as compilers got smarter, hand-written optimizations went from helpful to harmful. Code that looks clever to a human can look opaque to an optimizer.

When to optimize by hand

There are still cases where manual optimization matters. Performance-critical inner loops in game engines, signal processing, and scientific computing sometimes benefit from hand-tuned SIMD intrinsics or carefully structured memory access patterns. But these are exceptions, not the rule.

For most code, the best optimization strategy is to write clear, straightforward loops and let the compiler do its job. Measure first, optimize second - and check the compiler output before rewriting anything by hand.

The Duff Device remains a wonderful piece of programming history. It teaches us about the dark corners of C, the evolution of compilers, and, perhaps most importantly, that the clever hack you’re proud of today might be the thing slowing you down tomorrow.
