My goal is to create a vector of all words in a very large text file. Because of its size, I'm loading it line by line, doing some preprocessing that significantly reduces the size, splitting each line on whitespace, and then appending the words to a pre-existing vector.
The specific code that's giving me trouble:
// Loop over each line
while let Some(line) = lines.next() {
    if line.as_ref().unwrap().len() > 1000 {
        // Creates a vector of words, hypothetically
        line_vocab = re.replace_all(line.unwrap().as_str(), "")
            // Type conversion hell
            .to_string()
            .as_str()
            .split(" ")
            .collect::<Vec<&str>>();
        // Adds the words to the larger vocab list
        vocab.extend(line_vocab)
    }
}
My entire method so far:
use bzip2::read::MultiBzDecoder;
use regex::Regex;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // Open the bzip2 file using a buffered reader
    let file = File::open("file").unwrap();
    let reader = BufReader::new(file);
    let mut bz = MultiBzDecoder::new(reader);

    let mut vocab = Vec::<&str>::new();
    let re = Regex::new(r"(\[\[[^\]]*\|)|((\[)|(\]))|(\{\{[^\}]*\}\})|(<[^>]*>)|(<[^;]*;)").unwrap();

    // Just 1k lines right now for testing
    let mut lines = BufReader::new(&mut bz).lines().take(1000);
    let mut line_vocab: Vec<&str>;

    // Loop over each line
    while let Some(line) = lines.next() {
        if line.as_ref().unwrap().len() > 1000 {
            // Creates a vector of words, hypothetically
            line_vocab = re.replace_all(line.unwrap().as_str(), "")
                // Type conversion hell
                .to_string()
                .as_str()
                .split(" ")
                .collect::<Vec<&str>>();
            // Adds the words to the larger vocab list
            vocab.extend(line_vocab)
        }
    }
    println!("{:?}", vocab)
}
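For contrast, here's a stripped-down, std-only sketch of the same per-line pipeline that does compile, under the assumption that storing owned `String`s (rather than `&str` borrows into a temporary `String`) would be acceptable. The bracket filter is a hypothetical stand-in for the real regex cleanup, just so the sketch runs on its own:

```rust
// Clean one line and return its words as owned Strings.
// The bracket filter is a hypothetical stand-in for the regex cleanup.
fn line_words(line: &str) -> Vec<String> {
    let cleaned: String = line.chars().filter(|c| !"[]".contains(*c)).collect();
    // Owned Strings don't borrow from the temporary `cleaned`
    cleaned.split_whitespace().map(|w| w.to_string()).collect()
}

fn main() {
    let mut vocab: Vec<String> = Vec::new();
    for line in ["hello [[markup]] world", "another example line"] {
        vocab.extend(line_words(line));
    }
    // prints ["hello", "markup", "world", "another", "example", "line"]
    println!("{:?}", vocab);
}
```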
I've been stumped on this for about an hour. I'm fairly new to Rust, so any help is appreciated ^^