Random Lines from Large Files
Reading, or even sorting, whole text files just to extract a few lines becomes infeasible beyond a certain size.
When working with big data, taking samples is the only road to quick answers. Unfortunately, that already poses a bigger hurdle than it should. Ask people how to get a random sample of lines from a file and you will most likely get answers like these:
cat file.txt | shuf | head -n 10
cat file.txt | sort --random-sort | head -n 10
Unsurprisingly, sort and big data do not mix all that well. And even shuf reads the whole input file into memory first.
But if reading from stdin is not a requirement, working with an actual file allows seeking and reading the file size. That makes it possible to pick random positions based on byte offsets rather than line numbers: a quick seek, skip the remainder of the line we landed in, and output the next full line as the random sample.
Simple and fast - even on big files.
lines = 10
filename = "filename.txt"
filesize = File.size(filename)
# Draw random byte offsets and sort them so the file is read front to back
positions = lines.times.map { rand(filesize) }.sort

File.open(filename) do |file|
  positions.each do |pos|
    file.pos = [0, pos - 1].max # seek to just before the random offset
    file.gets                   # skip the remainder of the line we landed in
    puts file.gets              # print the next full line
  end
end
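Here is a minimal sketch of the same seek-and-skip idea wrapped in a reusable method. The name sample_lines and the rewind fallback are additions for illustration, not part of the snippet above; the fallback covers the case where the random offset lands inside the file's last line, so the second gets returns nil.

def sample_lines(filename, count = 10)
  filesize = File.size(filename)
  File.open(filename) do |file|
    Array.new(count) do
      file.pos = rand(filesize)             # jump to a random byte offset
      file.gets                             # skip the partial line we landed in
      file.gets || (file.rewind; file.gets) # fall back to the first line at EOF
    end
  end
end

puts sample_lines("filename.txt")

One trade-off of sampling by byte offset: a line's chance of being picked is roughly proportional to the length of the line before it, so the result is not perfectly uniform. For getting quick answers out of huge files, that bias is usually a fair price for never reading the whole file.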