I have a function that is supposed to return a random line from a massive file (~16MB/1_000_000 lines), something like shuf
command in Linux.
I tried to use BufReader
instead of reading everything into memory, but I'm not sure if this is the most efficient way as whenever it's reading from file the CPU consumption goes up. I'm also using 2 BufReader, one to use the count method on iterator and the other one to get a line from file. How can I make this better?
let buf_reader_content = BufReader::new(&f);
let num_lines = get_num_lines_in_file(&file_name);
let random_index = get_random_index(num_lines.unwrap_or(0));
let random_domain_name = buf_reader_content.lines().nth(random_index);
fn get_num_lines_in_file(file_name: &str) -> Option<usize> {
let may_file_handle = File::open(file_name);
match may_file_handle {
Ok(f) => {
let buf_reader_count = BufReader::new(&f);
let num_lines = buf_reader_count.lines().count();
return Some(num_lines);
},
Err(e) => {
println!("Cannot open file to get the count {:?}", file_name);
return None;
}
}
}
fn get_random_index(max: usize) -> usize {
if max > 0 {
let step = Range::new(0, max);
let mut rng = rand::thread_rng();
let choice = step.ind_sample(&mut rng);
return choice as usize;
} else {
return 0 as usize;
}
}
Take a look at reservoir sampling: Reservoir sampling - Wikipedia. Using that technique you can make a single pass over the file.
1 Like
There's no need to count the lines. If the file is 16MB, select a random value [0, 16M] and read +- some bytes at that location. Scan backwards and forwards for newline characters and you got your random line. Read more if you don't find a newline character.
Edit: I assume there is a function to read from an offset without reading the whole file into memory.
Edit: This works fine, I was running a debug build - I switched to release build and I barely notice when it runs! Thanks @vitalyd @johalun
Interesting, I tried to use https://doc.rust-lang.org/rand/src/rand/lib.rs.html#992-1008 as shown below, but still CPU consumption spikes when it's running,
extern crate rand;
use std::fs::File;
use std::io::{BufRead, BufReader};
use rand::{thread_rng, sample};
fn main() {
let f = File::open("bigfile.list");
match f {
Ok(h) => {
let b = BufReader::new(h);
let lines_iter = b.lines();
let mut rng = thread_rng();
let l = sample(&mut rng, lines_iter, 1);
println!("{:?}", l);
},
Err(e) => {
println!("Error - {:?}", e);
}
}
println!("Done");
}
That won't give uniform distribution if the lines are of different length.
1 Like
If you need accurate distribution, you have to do a full read of the file to determine position of each line. After that you can quickly seek and read any line:
use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};
use rand::distributions::{IndependentSample, Range};
use std::path::Path;
extern crate rand;
struct RandomLiner {
reader: BufReader<File>,
positions: Vec<usize>,
}
impl RandomLiner {
fn new<P: AsRef<Path>>(path: P) -> RandomLiner {
let mut reader = BufReader::new(File::open(path).unwrap());
let mut positions = vec![0];
let mut current_position = 0;
let mut _buffer = Vec::new();
loop {
let count = reader.read_until(b'\n', &mut _buffer).unwrap();
if count == 0 {
positions.pop().unwrap();
break;
}
current_position += count;
positions.push(current_position);
}
RandomLiner {
reader: reader,
positions: positions,
}
}
fn get(&mut self, buf: &mut Vec<u8>) {
if self.positions.is_empty() {
panic!("no lines");
}
let mut rng = rand::thread_rng();
let index = Range::new(0, self.positions.len()).ind_sample(&mut rng);
self.reader.seek(SeekFrom::Start(self.positions[index] as u64)).unwrap();
self.reader.read_until(b'\n', buf).unwrap();
}
}
fn main() {
let args: Vec<_> = std::env::args_os().collect();
if args.len() < 2 {
panic!("not enough arguments");
}
let mut r = RandomLiner::new(&args[1]);
let mut buf = Default::default();
r.get(&mut buf);
print!("{}", String::from_utf8(buf).unwrap());
}
This can be further optimized by using fixed-length read in get()
instead of read_until
, as you already know length of each line (the last line would need special handling though).