Random Lines from Large Files

When working with big data, taking samples is the only road to quick answers. Unfortunately that already poses a bigger hurdle than it should. Ask around how to get random lines from a file and you’ll most likely hear something like this:

shuf -n 10 file.txt
cat file.txt | sort --random-sort | head -n 10

Note that passing the file as an argument instead of piping through cat would avoid the useless use of cat.

Why these don’t scale

The sort approach is the worst offender. It reads the entire file, shuffles all lines, and only then lets you pick from the top. For a 10 GB file that means sorting billions of lines just to grab 10.

shuf is smarter but still not fast. Looking at the source code, it reads the entire input sequentially. Even with -n 1 it processes every single line. Recent versions use reservoir sampling internally when combined with -n, which at least keeps memory usage bounded. But it still has to scan through the whole file from start to finish.

For a 10 GB file, that means waiting minutes for what should take milliseconds.

Seeking instead of reading

If the file is on disk (not stdin), we can do something much faster. Instead of reading every line, we pick random byte positions, seek directly to them, skip to the next newline, and read the following complete line.

import os
import random

def random_lines(filename, n=10):
    size = os.path.getsize(filename)
    positions = sorted(random.randint(0, size - 1) for _ in range(n))

    lines = []
    with open(filename, 'rb') as f:
        for pos in positions:
            f.seek(max(0, pos - 1))
            f.readline()          # skip partial line
            line = f.readline()   # read full line
            if line:
                lines.append(line.decode().strip())
    return lines

for line in random_lines("data.txt"):
    print(line)

Sorting the positions before seeking means we sweep through the file in one direction, which is friendlier to the disk cache. The whole thing runs in O(n) seeks regardless of file size.

The catch - the line length bias

This approach picks random byte positions, not random line positions. That means longer lines occupy more bytes and are more likely to be “hit” by a random position. If all lines are roughly the same length, the bias is negligible. If your file has a mix of 10-character and 10,000-character lines, the long lines will be heavily overrepresented.

For quick exploratory sampling this usually doesn’t matter. But if statistical correctness is important, you need a different approach.

Reservoir sampling

Reservoir sampling (Algorithm R, published by Jeffrey Vitter in 1985) is the theoretically correct way to get a uniform random sample from data of unknown size. It works in a single pass and uses O(k) memory, where k is the sample size.

The idea is simple. Keep the first k lines in a “reservoir”. For every subsequent line j, generate a random number between 1 and j. If it falls within k, replace that slot in the reservoir. When you reach the end of the file, the reservoir contains a perfectly uniform sample.

import random

def reservoir_sample(filename, k=10):
    reservoir = []
    with open(filename) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line.strip())
            else:
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = line.strip()
    return reservoir

The downside is obvious: it still reads every line in the file, just like shuf. The advantage is that it’s guaranteed to produce a uniform sample regardless of line lengths, and it works on streams where you can’t seek.

Which one to pick

It depends on what you need.

For quick sampling from large files on disk where line lengths are reasonably uniform, the seek-based approach is hard to beat. It runs in milliseconds on files of any size.

For statistically correct uniform sampling, or when reading from a pipe or stream, reservoir sampling is the right choice. It’s slower but mathematically sound.

For small files, just use shuf -n 10. It’s simple and good enough.