Searching through text to find characters in a supplied pattern

I am trying to learn Rust, and have decided to try to make a very whimsical tool. Essentially the idea is that it would take text input from a .txt file, search through the text for the first instance of each character in "Gday mate", convert these characters to capitals, and print the full text to the terminal with the found characters in a new colour. For now I am working on the text-searching portion of it. I have come up with the following, and have commented it as best I can. I am very new to compiled and statically typed languages (coming from R and Julia mostly), and am very open to any and all help that you can give me. I have thought about trying to rewrite the find_aussies function to use an iterator rather than the way I am doing it now, but haven't been able to figure that out. Currently I am running into issues with the way I am trying to pull out pieces of the supplied text: it panics due to subtraction with overflow, and I am not sure how to fix this.

fn check_aussies(lslocs: Vec<Option<usize>>) -> bool {
    let res:bool;
    if lslocs.contains(&None){
        res = true
    } else {
        res = false
    }
res
}

fn find_aussies(s: &String) -> Vec<String> {
    let s_lower = &s.to_lowercase(); //convert to lowercase
    let pattern:&str = "gday mate"; //pattern to search for
    //let inbytes = s_lower.as_bytes(); 
    //let patbytes = pattern.as_bytes();
    let mut locchecks:Vec<Option<usize>> = Vec::new();
    let mut snippets:Vec<String> = Vec::new(); //where the pieces of the original string are stored
    let mut start:usize = 0; //starting point of the search
    let end:usize = s_lower.len(); //end of the search
    for i in pattern.chars() {
        let mut loc:usize = s_lower[start..end].find(i).unwrap(); //find location of first character from pattern in text
        locchecks.push(s[start..end].find(i)); //write the Option<> to the check vector
        snippets.push(s_lower[start..(loc-1)].to_string()); //push the beginning of the string up to the found char
        snippets.push(s_lower.chars().nth(loc).unwrap().to_uppercase().to_string()); //push the found char
        start = loc + 1; //advance start to character after the found match
    }
   //when the for loop is exhausted, push the remaining text from the input string to the output vector
    snippets.push(s_lower[start..end].to_string());
    let notfound = check_aussies(locchecks);
    if notfound {
        println!("No hidden aussies found!")
    } else {
        return snippets;
    }
    snippets
}

fn main() {
    //let args: Vec<String> = env::args().collect();
    //let filename = &args[1];
    //println!("Searching for the hidden aussie in {}", filename);
    /*let contents = fs::read_to_string(filename)
        .expect("Something went wrong reading the file");
    */
    let fileread = "good dolphins must always take massive enteroscopic treatments";
    let contents = fileread.to_string();
    let matching = find_aussies(&contents);
    println!("{:?}",matching); //print vector of string pieces for now
}

(Playground)

Errors:

   Compiling playground v0.0.1 (/playground)
warning: variable does not need to be mutable
  --> src/main.rs:21:13
   |
21 |         let mut loc:usize = s_lower[start..end].find(i).unwrap();
   |             ----^^^
   |             |
   |             help: remove this `mut`
   |
   = note: `#[warn(unused_mut)]` on by default

warning: `playground` (bin "playground") generated 1 warning
    Finished dev [unoptimized + debuginfo] target(s) in 2.06s
     Running `target/debug/playground`
thread 'main' panicked at 'attempt to subtract with overflow', src/main.rs:23:38
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Currently, I think your find_aussies function is more complicated than it needs to be, since it's both finding the characters and replacing them at the same time. If I were writing the program, I'd separate it into two different functions, one to list the positions of the characters, and another to create a new string based on this list. This simplifies the implementation (Rust Playground):

fn find_aussies(mut s: &str) -> Option<Vec<usize>> {
    const PATTERN: &str = "gday mate";
    let mut matches = Vec::new();
    for target in PATTERN.chars() {
        // `s` is the current tail of the string.
        // `pos` is the number of bytes to the start of the matched character.
        // `len` is the length of the matched character in bytes.
        let found = s
            .char_indices()
            .find(|(_, c)| c.to_lowercase().eq([target]));
        let (pos, len) = match found {
            Some((pos, c)) => (pos, c.len_utf8()),
            None => return None,
        };
        matches.push(pos);
        s = &s[pos + len..];
    }
    // Each "match" is the number of bytes from the end of the last matching
    // character to the start of the next.
    Some(matches)
}

fn highlight_aussies(mut s: &str, matches: &[usize]) -> String {
    let mut result = String::new();
    for &pos in matches {
        result.push_str(&s[..pos]);
        let c = s[pos..].chars().next().unwrap();
        let len = c.len_utf8();
        result.extend(c.to_uppercase());
        s = &s[pos + len..];
    }
    result.push_str(s);
    result
}

fn main() {
    let fileread = "good dolphins must always take massive enteroscopic treatments";
    let contents = fileread.to_string();
    if let Some(matching) = find_aussies(&contents) {
        println!("{matching:?}");
        let result = highlight_aussies(&contents, &matching);
        println!("{result:?}");
    } else {
        println!("No hidden aussies found!");
    }
}

Notice how find_aussies returns an Option<Vec<usize>>. If it finds all of the characters in the pattern, it returns Some(matches); otherwise, it returns None. This allows the caller to handle the error however it sees fit.

Also, since you mention iterators, find_aussies can equivalently be written with a map (Rust Playground):

fn find_aussies(mut s: &str) -> Option<Vec<usize>> {
    const PATTERN: &str = "gday mate";
    PATTERN
        .chars()
        .map(|target| {
            let (pos, c) = s
                .char_indices()
                .find(|(_, c)| c.to_lowercase().eq([target]))?;
            let len = c.len_utf8();
            s = &s[pos + len..];
            Some(pos)
        })
        .collect()
}

The closure returns Some(pos) if it matches the character, and None otherwise. When we collect the iterator into an Option<Vec<usize>>, it automatically stops iterating if a character is not found.
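To see the early exit in isolation, here is a minimal sketch that is separate from the aussie code. The helper name parse_all_counting and the call counter are made up for illustration; the only thing it relies on is that collecting an iterator of Option<T> into Option<Vec<T>> stops at the first None.

```rust
// Hypothetical helper (not from the thread): parse every item as a u32,
// and also report how many times the closure actually ran.
fn parse_all_counting(items: &[&str]) -> (Option<Vec<u32>>, usize) {
    let mut calls = 0;
    let parsed: Option<Vec<u32>> = items
        .iter()
        .map(|item| {
            calls += 1;
            // `ok()` turns the parse Result into an Option.
            item.parse::<u32>().ok()
        })
        // Collecting into Option<Vec<_>> stops pulling items at the first None.
        .collect();
    (parsed, calls)
}

fn main() {
    println!("{:?}", parse_all_counting(&["1", "2", "3"]));
    // "x" fails to parse, so "3" is never looked at.
    println!("{:?}", parse_all_counting(&["1", "x", "3"]));
}
```

The same thing happens in find_aussies: once one pattern character is missing, the remaining characters are never searched for.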

Feel free to ask any questions; it can be difficult to trace the control flow in this kind of code.


Without going into the algorithm at all, here is some general feedback about the code.

// fn find_aussies(s: &String) -> Vec<String> {
fn find_aussies(s: &str) -> Vec<String> {

Prefer using &str to &String and &[T] to &Vec<T>. They are both more general and less indirect.
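As a small illustration of "more general", the sketch below uses two made-up functions, count_vowels and total. Because they take &str and &[i32], the same functions accept owned values, literals, and sub-slices without any extra conversion.

```rust
// Hypothetical examples: both functions borrow the most general slice type.
fn count_vowels(s: &str) -> usize {
    s.chars().filter(|c| "aeiou".contains(*c)).count()
}

fn total(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

fn main() {
    let owned = String::from("good dolphins");
    let numbers = vec![1, 2, 3];

    println!("{}", count_vowels(&owned)); // &String coerces to &str
    println!("{}", count_vowels("gday mate")); // string literal
    println!("{}", count_vowels(&owned[..4])); // sub-slice of a String

    println!("{}", total(&numbers)); // &Vec<i32> coerces to &[i32]
    println!("{}", total(&[4, 5])); // array reference
}
```

Had the signatures been &String and &Vec<i32>, the literal, sub-slice, and array calls would not compile without first allocating a new String or Vec.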

// let s_lower = &s.to_lowercase();
let s_lower = s.to_lowercase();

I'm not sure why the borrow was there, but it was not needed.

let end:usize = s_lower.len(); //end of the search
// ...
{
    let mut loc:usize = s_lower[start..end].find(i).unwrap();
    locchecks.push(s[start..end].find(i));

Since you never update end, you can just use open ranges like s_lower[start..]. They'll contain the tail of the string.

        let loc: usize = s_lower[start..end].find(i).unwrap();
        locchecks.push(s[start..end].find(i));
        // ...
    }
    let notfound = check_aussies(locchecks);

The unwrap means you would panic before you got a chance to examine whether or not the find returned None. You would need to check within the loop.

snippets.push(s_lower[start..(loc - 1)].to_string());
snippets.push(s_lower.chars().nth(loc).unwrap().to_uppercase().to_string());

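Rust string data is UTF-8 encoded, so each character takes between 1 and 4 bytes. Use methods like len_utf8 instead of assuming 1 byte per character. Note also that find returns a byte offset, while chars().nth(n) counts characters, so the two only line up for ASCII-only text.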

(If your programs become even more sophisticated, you will need a Unicode library beyond stdlib to handle things like combining characters, etc. But you should be aware of the length issue in all of your code, because doing things like indexing into a &str at a non-character boundary will panic.)
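Here is a short sketch of the byte-versus-character difference. The helper char_at_byte is a made-up name, and the sample word is just an arbitrary string with one non-ASCII character in it.

```rust
/// Hypothetical helper: the character starting at byte offset `byte_pos`,
/// or None if the offset is out of range or falls inside a character.
fn char_at_byte(s: &str, byte_pos: usize) -> Option<char> {
    // `get` returns None where plain indexing (`&s[byte_pos..]`) would panic.
    s.get(byte_pos..)?.chars().next()
}

fn main() {
    // "\u{fc}" is 'ü', which takes 2 bytes in UTF-8.
    let s = "\u{fc}nder";
    println!("bytes: {}, chars: {}", s.len(), s.chars().count());

    // `find` reports a byte offset...
    let byte_pos = s.find('d').unwrap();
    // ...so passing it to `chars().nth` picks a different character here.
    println!("chars().nth: {:?}", s.chars().nth(byte_pos));
    println!("char_at_byte: {:?}", char_at_byte(s, byte_pos));

    // `char_indices` pairs every character with its byte offset.
    for (pos, c) in s.char_indices() {
        println!("{} {:?} ({} bytes)", pos, c, c.len_utf8());
    }
}
```

This is the same mismatch as in the original snippet, where loc comes from find (bytes) but is then used with chars().nth (characters).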

    let notfound = check_aussies(locchecks);
    if notfound {
        println!("No hidden aussies found!")
    // } else {
    //     return snippets;
    }
    snippets

The code has the same meaning without the else block, since the following line returns snippets anyway. You also don't need to store the result in an intermediate variable (notfound).

(If you move the check into the loop, you don't need this block, or locchecks, at all.)

fn check_aussies(lslocs: Vec<Option<usize>>) -> bool {
    /*
    let res: bool;
    if lslocs.contains(&None) {
        res = true
    } else {
        res = false
    }
    res
    */
    lslocs.contains(&None)
}

This function could be simplified in the same way, as shown. (But you may not need it at all if you change where you are performing the check.)
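If you do keep a check like this, one possible variant (a sketch only, with the made-up name any_missing) borrows a slice instead of taking the Vec by value, so the caller can still use the vector afterwards:

```rust
// Hypothetical variant of check_aussies: borrows the data instead of consuming it.
fn any_missing(locs: &[Option<usize>]) -> bool {
    // True as soon as one entry is None.
    locs.iter().any(Option::is_none)
}

fn main() {
    let locchecks = vec![Some(0), None, Some(4)];
    println!("{}", any_missing(&locchecks));
    // `locchecks` is still usable here because it was only borrowed.
    println!("{:?}", locchecks);
}
```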

In case you weren't sure where the panic was coming from:

        let loc: usize = s_lower[start..end].find(i).unwrap();
        locchecks.push(s[start..end].find(i));
        snippets.push(s_lower[start..(loc - 1)].to_string());

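find gives the 0-based index of the beginning of the search term. So if loc is 0 because i is found at the very start of the searched slice (as happens with the 'g' in "good"), then loc - 1 would be negative. A usize cannot hold a negative value, so in a debug build the subtraction panics with "attempt to subtract with overflow".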

Ranges in Rust are start-inclusive but end-exclusive:

    let list = [0, 1, 2, 3, 4, 5];
    println!("{:?}", &list[1..4]);
    // Prints [1, 2, 3]

So you don't need to subtract 1 to exclude that character anyway.

Also note that loc is relative to start, which you keep adjusting to be greater. So most places you use loc, you really needed start + loc. That includes within the indexing here, and also:

start = loc + 1;

Since loc was found relative to start, you wanted start += loc + 1 (really loc + the character's length in bytes) here as well.
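Putting the points about the in-loop check, the end-exclusive range, and the offset arithmetic together, the original loop might end up looking something like this. It is only a sketch under my own assumptions: I changed the return type to Option<Vec<String>> so a missing character can be reported, and I convert loc to an absolute position as soon as it is found.

```rust
// A sketch of the original approach with the fixes applied. The return type
// is an assumption: None means some pattern character was not found.
fn find_aussies(s: &str) -> Option<Vec<String>> {
    let s_lower = s.to_lowercase();
    let mut snippets = Vec::new();
    let mut start = 0; // byte offset into s_lower where the next search begins
    for target in "gday mate".chars() {
        // `find` is relative to the slice searched, so add `start` back on.
        // The `?` returns None right here if the character is missing.
        let loc = start + s_lower[start..].find(target)?;
        // The range is end-exclusive, so this stops just before the match.
        snippets.push(s_lower[start..loc].to_string());
        snippets.push(target.to_uppercase().to_string());
        // Skip past the matched character, however many bytes it takes.
        start = loc + target.len_utf8();
    }
    // Whatever is left after the last match.
    snippets.push(s_lower[start..].to_string());
    Some(snippets)
}

fn main() {
    let text = "good dolphins must always take massive enteroscopic treatments";
    match find_aussies(text) {
        Some(snippets) => println!("{}", snippets.concat()),
        None => println!("No hidden aussies found!"),
    }
}
```

Note that all of the indexing is done on s_lower, never on s. Lowercasing can change the byte length of some non-ASCII characters, so offsets found in one string are not guaranteed to be valid in the other.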


When it comes to the algorithm, you should refer to @LegionMammal978's reply. But for the sake of seeing what your original code might look like after applying the above suggestions, here's another playground.


Thank you so much for the feedback. This is so useful.


Thank you so much for your help. This solution works great in the Rust Playground, but when I move it to my computer, the compiler reports an error at lines 40 and 42, saying "there is no argument named matching/error". I am not sure why this issue comes up. I am using Rust 1.56.0 in VS Code on openSUSE, if that adds any context.

My apologies. In Rust 1.58.0, println!() and related macros gained the ability to print out variables with println!("{variable:?}"). In older versions, you have to write println!("{:?}", variable), which does the same. To fix the code, you just have to change println!("{matching:?}") and println!("{result:?}") to use the extra argument.
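For reference, here is a minimal side-by-side sketch. The function names positional and named are made up for the comparison; both of these spellings compile on versions older than 1.58.0 as well as on newer ones.

```rust
// Positional argument: works on every Rust version.
fn positional(matching: &[usize]) -> String {
    format!("{:?}", matching)
}

// Explicit named argument: also works before 1.58.0, because the name
// is bound in the argument list rather than captured from scope.
fn named(matching: &[usize]) -> String {
    format!("{m:?}", m = matching)
}

fn main() {
    let matching = vec![0, 2, 11];
    println!("{}", positional(&matching));
    println!("{}", named(&matching));
    // On Rust 1.58.0 or later, this shorter form is equivalent:
    // println!("{matching:?}");
}
```

The same applies to format!, print!, eprintln!, and the other formatting macros.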

Oh great thank you so much!
