When working with big data, taking samples is the only road to quick answers. Unfortunately, that already poses a bigger hurdle than it should. Ask people how to get a random sample of lines from a file, and you will most likely get answers like these:
cat file.txt | shuf | head -n 10
cat file.txt | sort --random-sort | head -n 10
sort and big data do not mix all that well: it has to read the entire input before it can emit a single line. And even shuf reads the whole input file into memory first.
But if reading from stdin is not a requirement, operating on a file allows seeking and reading the file size. That makes it possible to pick random positions based on byte offsets rather than line numbers: a quick seek, a scan to the next line break, and the random line is ready to output.
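Here is a minimal sketch of the idea in Python; the file name, the function name, and the sample size are placeholders, not part of any existing tool. One caveat of this approach: a random byte offset lands in long lines more often than in short ones, so the sample is slightly biased toward longer lines.

    import os
    import random

    def sample_lines(path, n=10):
        """Pick n random lines by seeking to random byte offsets."""
        size = os.path.getsize(path)
        picked = []
        with open(path, "rb") as f:
            for _ in range(n):
                f.seek(random.randrange(size))
                f.readline()          # skip the rest of the (partial) current line
                line = f.readline()   # the next full line is our random pick
                if not line:          # landed inside the last line: wrap around
                    f.seek(0)
                    line = f.readline()
                picked.append(line.rstrip(b"\n").decode("utf-8", "replace"))
        return picked

    if __name__ == "__main__":
        for line in sample_lines("file.txt"):
            print(line)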
Simple and fast - even on big files.