Trying to split str by whitespace, getting temporary value

pixieboy · April 12, 2023, 12:01am

My goal is to create a vector of all the words in a very large text file. Because of the size, I'm loading it line by line and doing some preprocessing that significantly reduces the size before splitting each line on whitespace and appending the words to a pre-existing vector.

The specific code that's giving me trouble:

// Add each line to a loop
while let Some(line) = lines.next() {
    if line.as_ref().unwrap().len() > 1000 {
        // Creates a vector of words, hypothetically
        line_vocab = re.replace_all(line.unwrap().as_str(), "")
            // Type conversion hell
            .to_string()
            .as_str()
            .split(" ")
            .collect::<Vec<&str>>();
        // Adds the words to the larger vocab list
        vocab.extend(line_vocab)
    }
}

My entire method so far:

use std::fs::File;
use std::io::{BufRead, BufReader};

use regex::Regex;

fn main() {
    // Open the bzip2 file using a buffered reader
    let file = File::open("file").unwrap();
    let reader = BufReader::new(file);
    let mut bz = bzip2::read::MultiBzDecoder::new(reader);
    let mut vocab = Vec::<&str>::new();
    let re = Regex::new(r"(\[\[[^\]]*\|)|((\[)|(\]))|(\{\{[^\}]*\}\})|(<[^>]*>)|(&lt;[^;]*;)").unwrap();
    // Just 1k right now for testing
    let mut lines = BufReader::new(&mut bz).lines().take(1000);
    let mut line_vocab: Vec<&str>;

    // Add each line to a loop
    while let Some(line) = lines.next() {
        if line.as_ref().unwrap().len() > 1000 {
            // Creates a vector of words, hypothetically
            line_vocab = re.replace_all(line.unwrap().as_str(), "")
                // Type conversion hell
                .to_string()
                .as_str()
                .split(" ")
                .collect::<Vec<&str>>();
            // Adds the words to the larger vocab list
            vocab.extend(line_vocab)
        }
    }
    println!("{:?}", vocab)
}

I've been stumped on this for about an hour. I'm fairly new to Rust, so any help is appreciated ^^

quinedot · April 12, 2023, 12:28am

A &T means you're borrowing from a T somewhere, and that T has to stay alive as long as the reference (&T). You may be used to &str corresponding to "literal strings"; those are borrowing static data which stays alive forever. But you're not dealing with literal strings in this case, you're dealing with borrows of locals.

The errors are because you're throwing out those locals (line and the results of .to_string()) at the end of the loop. You can't keep references to them around any longer than that -- the memory got freed.

The solution is to store Strings and not &str.
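
As a rough illustration of that point, here is a minimal std-only sketch. The `collect_words` helper and the sample lines are made up for this example, and `to_uppercase` merely stands in for the regex preprocessing so that each iteration creates a fresh local `String`.

```rust
/// Hypothetical helper: collects every word of every line as an owned `String`.
fn collect_words(lines: &[&str]) -> Vec<String> {
    let mut vocab: Vec<String> = Vec::new();
    for raw in lines {
        // `line` is a local `String` that is dropped at the end of each
        // iteration, much like the locals in the original loop.
        let line = raw.to_uppercase();
        // Each word is copied into its own `String`, so nothing stored in
        // `vocab` borrows from `line` once the iteration ends.
        vocab.extend(line.split_whitespace().map(|w| w.to_string()));
    }
    vocab
}

fn main() {
    let vocab = collect_words(&["hello world", "foo bar"]);
    println!("{:?}", vocab);
}
```

Changing `vocab` to a `Vec<&str>` and dropping the `.map(...)` should bring back a "does not live long enough" style of error, because the words would then borrow from `line`.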

Let's walk through your code a little more to point out areas that could be improved.


    while let Some(line) = lines.next() {
        // No need to leave the line in the `Result` and unwrap all the time
        if line.as_ref().unwrap().len() > 1000 {
            line_vocab = re.replace_all(line.unwrap().as_str(), "")
                // You can treat a `Cow<'_, str>` as a `&str` for the
                // most part [*1], so you don't need this
                .to_string()
                // (or this)
                .as_str()
                .split(" ")
                // You don't need to collect into an intermediate
                // `Vec` in order to extend another `Vec`
                .collect::<Vec<&str>>();

            vocab.extend(line_vocab)
        }
    }

[*1]: You may have added this because at some point along the way, the compiler complained a temporary wasn't lasting long enough. If that was the case, you'd be better off following its advice and storing the temporary in a variable.

Here's what the loop may look like after addressing those points.

    while let Some(line) = lines.next() {
        // Even better -- make your function return a `Result` too!
        // But you can tackle that another day...
        let line = line.unwrap();
        if line.len() > 1000 {
            vocab.extend(
                re.replace_all(&line, "")
                    .split(" ")
                    .map(|s| s.to_string())
            )
        }
    }
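
The comment about returning a `Result` could look roughly like the sketch below. It is only one possible shape: `long_line_words` and its `min_len` parameter are hypothetical names, an in-memory `Cursor` stands in for the decompressed file, and the regex step is omitted to keep the example std-only.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical helper: reads lines from any `BufRead` and returns the words
/// of the long ones, handing I/O errors to the caller instead of panicking.
fn long_line_words<R: BufRead>(reader: R, min_len: usize) -> io::Result<Vec<String>> {
    let mut vocab = Vec::new();
    for line in reader.lines() {
        // `?` returns early with the error if reading this line failed.
        let line = line?;
        if line.len() > min_len {
            vocab.extend(line.split(' ').map(|s| s.to_string()));
        }
    }
    Ok(vocab)
}

fn main() {
    // An in-memory reader stands in for the real file here.
    let input = Cursor::new("short\na much longer line\n");
    match long_line_words(input, 10) {
        Ok(vocab) => println!("{:?}", vocab),
        Err(e) => eprintln!("read failed: {e}"),
    }
}
```

The caller decides what to do with a failed read, which is usually easier to live with than an `unwrap` buried inside the loop.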

Or alternatively,

    while let Some(line) = lines.next() {
        let line = line.unwrap();
        if line.len() > 1000 {
            let tmp = re.replace_all(&line, "");
            let iter = tmp.split(" ").map(|s| s.to_string());
            vocab.extend( iter );
        }
    }

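.split_whitespace() may be better than .split(" "): it splits on any run of whitespace, so consecutive spaces don't produce empty strings.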
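
A small comparison of the two methods on the same input, as a sketch. The helper names `by_space` and `by_whitespace` and the sample text are invented for this example.

```rust
/// Splits on every single space character.
fn by_space(text: &str) -> Vec<&str> {
    text.split(" ").collect()
}

/// Splits on runs of any Unicode whitespace and skips empty pieces.
fn by_whitespace(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Two spaces after "two", and a tab between "spaces" and "and".
    let text = "two  spaces\tand a tab";
    println!("{:?}", by_space(text));
    println!("{:?}", by_whitespace(text));
}
```

With `split(" ")`, the double space yields an empty string and the tab is not treated as a separator, which is probably not what a vocabulary list wants.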

If you can create a Regex that captures the parts you want to keep individually, that may be even better (but I'm going to choose not to try and unravel it now).

kornel · April 12, 2023, 12:32am

&str is not a string itself, but a temporary permission to view into some String that must already exist and be stored for long enough somewhere else (this is a simplification, but I think it's a helpful mental model to have).

Therefore to_string().as_str() here doesn't make sense, because the String is not stored anywhere (it's just a temporary value destroyed at the end of the expression), and as_str() becomes a view into that soon-to-be-destroyed value.

You can't make Vec<&str> from newly-created strings. This is impossible, because these &strs have to point to Strings that are stored somewhere for longer than the existence of Vec<&str>, and temporary variables inside a loop don't do that. Variables inside the loop end their lifetime on every loop iteration, and Rust doesn't have a garbage collector, so it can't make them live longer when needed.

Make Vec<String> instead. This is the type for newly-created stand-alone strings.
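
For contrast, here is a sketch of the one situation where a `Vec<&str>` does work: when the text it borrows from is stored somewhere that outlives the vector. The `words` helper and the sample text are hypothetical, and holding the whole text in one `String` is exactly what a line-by-line reader of a very large file is trying to avoid.

```rust
/// Hypothetical helper: the returned `&str`s borrow from `text`, so the
/// compiler requires `text` to stay alive as long as the returned `Vec`.
fn words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // The whole text lives in one `String` until the end of `main`.
    let text = String::from("alpha beta\ngamma");
    let borrowed: Vec<&str> = words(&text);
    // Converting to owned `String`s removes the dependency on `text`.
    let owned: Vec<String> = borrowed.iter().map(|s| s.to_string()).collect();
    println!("{:?} {:?}", borrowed, owned);
}
```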

quinedot · April 12, 2023, 12:45am

Actually since you're just throwing all matches out, I believe this will do it.

            let iter = re
                .split(&line)
                .flat_map(|s| s.split_whitespace())
                .map(str::to_string);

            vocab.extend( iter );

pixieboy · April 12, 2023, 2:38am

Thank you so much, it works now! You have saved me much time and much advil

I'm new to Rust but not new to programming in general, and as I was typing out that weird string (pun unintended) of type conversions I knew there was something I wasn't doing right, but considering that my first Python project used substringing to get a value from a dict...

pixieboy · April 12, 2023, 2:41am

Thank you ^^

This inspired me to Google the difference between String and &str, which I definitely needed to do because I was just sort of throwing out whichever worked, but this is definitely more effective...

pixieboy · April 12, 2023, 2:45am

Sorry if this is a dumb question, but would you mind explaining what that's doing? I get the first part, I assume re.split(&line) is splitting the line whenever there's a match? But I don't understand the map part

If you don't feel like responding then no pressure, I would just appreciate it

(Also, the forum is telling me I should reply to several posts at once, oops. I don't use forums much)

quinedot · April 12, 2023, 4:03am

Sure. Here's the documentation for the split method (Regex::split), which notes:

each element of the iterator corresponds to text that isn’t matched by the regular expression.

So here:

re.split(&line) // iterator over `&str`

We have an iterator that returns the parts of the line that don't match your regex.

Next, for each part, we want to split it up by whitespace. map is what you use to turn an iterator over X into an iterator over Y by transforming each item. But if we just did this:

re.split(&line)
  // Turns each `&str` into `SplitWhitespace<'_>`
  // ... which is an iterator over `&str`
  .map(|line| line.split_whitespace())

We'd have an iterator that returns other iterators. However, we really want to flatten this down into an iterator over the inner elements. That's what flat_map does, in one step.

re.split(&line) // Iterates over `&str`
  .flat_map(|line| line.split_whitespace()) // Also iterates over `&str`
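
To see the difference between the two shapes, here is a std-only sketch. It uses `str::split` on a literal `|` as a stand-in for `re.split(&line)`, and the `nested` and `flattened` helpers are invented names; the nested version collects the inner iterators into `Vec`s only so they can be printed.

```rust
/// `map` keeps one inner collection per part: a list of lists.
fn nested(line: &str) -> Vec<Vec<&str>> {
    line.split('|')
        .map(|part| part.split_whitespace().collect::<Vec<&str>>())
        .collect()
}

/// `flat_map` chains the inner iterators into one flat sequence of words.
fn flattened(line: &str) -> Vec<&str> {
    line.split('|')
        .flat_map(|part| part.split_whitespace())
        .collect()
}

fn main() {
    let line = "a b|c|d e f";
    println!("{:?}", nested(line));
    println!("{:?}", flattened(line));
}
```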

Finally we want to map our borrowed &strs into Strings.

re.split(&line) // iterator over `&str`
  .flat_map(|s| s.split_whitespace()) // ...over `&str`
  .map(str::to_string); // ...over `String`

That notation is just another way to write this:

  .map(|s| s.to_string())

Because both the closure form and the function take a &str and return a String. I could have written the whole thing like so:

re.split(&line)
  .flat_map(str::split_whitespace)
  .map(str::to_string);
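
A quick way to convince yourself that the two spellings are interchangeable is the sketch below. The function names `with_closure` and `with_path` are made up for this example.

```rust
/// Closure form: each `&str` is passed to the closure, which calls `to_string`.
fn with_closure(text: &str) -> Vec<String> {
    text.split_whitespace().map(|s| s.to_string()).collect()
}

/// Path form: the method is named directly and `map` calls it on each item.
fn with_path(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

fn main() {
    let text = "a b c";
    // Both produce the same owned strings.
    assert_eq!(with_closure(text), with_path(text));
    println!("{:?}", with_path(text));
}
```

The path form only works when the item type matches the function's parameter exactly, so the closure form is the more forgiving of the two.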

If you search for flatmap python, you'll find some takes on how to do similar things in Python using itertools or double list comprehensions.

pixieboy · April 12, 2023, 1:34pm

Thank you so much for helping me!! ^^

I was worried this was going to be like StackOverflow but this place seems much nicer :3

system · July 11, 2023, 1:35pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
