Search text file in both directions repetitively in one lookup


#1

Hi everyone,

I’m just a couple of days into Rust, and I’m struggling to understand how it is possible to implement the following scenario:

Scenario

  • There is a large text file, on the scale of gigabytes.
  1. Search this file forwards for some pattern (pattern1).
  2. If pattern1 is found, search backwards from that point for another pattern (pattern2).
  3. If pattern2 is found, do something, then go back to step 1 and proceed forward from the line after the previous pattern1 match. Repeat until EOF.

What have I tried

In Perl, such a task is very easy to implement with ‘Tie::File’, which maps file lines to an array. However, it is too slow for me, as I have many such files to search (iterating over them with ‘glob’), and I have to do it regularly.

What I have now is a forward-only lookup; it works, but I’m not sure this is the right way to proceed:

extern crate glob;

use glob::glob;

use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;

fn main() {
    let path_glob = "/some/*/path/*.txt".to_string();

    for entry in glob(&path_glob).expect("Failed to read glob pattern") {
        match entry {
            Ok(path) => {
                println!("{}", path.display());

                let f = File::open(path).expect("Can't read file");
                let file = BufReader::new(f);

                for line in file.lines() {
                    match line {
                        Ok(valid_utf8) => {
                            if valid_utf8.contains("pattern1") {
                                println!("{}", valid_utf8);
                            }
                        }
                        // lines() yields an io::Error on read failures,
                        // including bytes that are not valid UTF-8
                        Err(e) => println!("Read error: {}", e),
                    }
                }
            }
            Err(e) => println!("{:?}", e),
        }
    }
}

I honestly tried to Google it, but it seems I lack the fundamental understanding needed to formulate effective search keywords.

Any help would be greatly appreciated!

Some additional info

There is a reason why I’m looking to perform the search in the sequence described above.

Let’s say the text file has 10 million lines. In that case pattern2 will be found every 5–10 lines on average, while pattern1 may be found just a handful of times, or not at all. Since I have a lot of such big files, I’m looking for every possibility to make the search as fast as possible.

The text files are application logs. pattern1 represents some data that needs to be found in the context of a specific application action; pattern2 represents the log record header, which contains the action uid and the action step number.


#2

I would probably do a forward search for pattern1 and pattern2 at the same time, then keep track of the position of the latest occurrence of pattern2. When pattern1 shows up, I can store this position, seek to pattern2, do whatever, seek to pattern1 and continue. No searching backwards, just seeking to known positions.
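A rough sketch of that single-pass idea, assuming hypothetical literal patterns and a seekable reader (`lines()` is avoided here because it discards byte offsets):

```rust
use std::io::{BufRead, Cursor, Seek, SeekFrom};

/// One forward pass: remember the byte offset of the most recent
/// pattern2 (header) line; on a pattern1 hit, seek back to that header,
/// read it, then seek to the line after the pattern1 match and continue.
fn find_with_context<R: BufRead + Seek>(mut reader: R) -> std::io::Result<Vec<String>> {
    let mut headers = Vec::new();
    let mut offset: u64 = 0; // byte offset of the line currently being read
    let mut last_header: Option<u64> = None;
    let mut line = String::new();

    loop {
        line.clear();
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // EOF
        }
        if line.contains("pattern2") {
            last_header = Some(offset); // just record the position, no parsing yet
        } else if line.contains("pattern1") {
            if let Some(h) = last_header {
                let resume = offset + n as u64; // line after the pattern1 match
                reader.seek(SeekFrom::Start(h))?;
                let mut header = String::new();
                reader.read_line(&mut header)?;
                headers.push(header.trim_end().to_string());
                reader.seek(SeekFrom::Start(resume))?; // continue forward
            }
        }
        offset += n as u64;
    }
    Ok(headers)
}

fn main() {
    let log = "[t0] <uid|pattern2\nnoise line\npayload pattern1\ntail\n";
    let headers = find_with_context(Cursor::new(log)).unwrap();
    println!("{:?}", headers); // the header preceding each pattern1 hit
}
```

With a real file you’d pass `BufReader::new(File::open(path)?)` instead of the `Cursor`; seeking on a `BufReader` discards its buffer, which is exactly why the seeks are kept rare (only on pattern1 hits).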


#3

That sounds technically reasonable, but I wonder about the performance impact.

Let’s say the text file has 10 million lines. In that case pattern2 will be found every 5–10 lines on average, while pattern1 may be found just a handful of times, or not at all. Since I have a lot of such big files, I’m looking for every possibility to make the search as fast as possible.


#4

I’d say it really depends on the cost of matching pattern2. If that is costly, then yes, searching backwards might be better. If the log record header is easily identifiable with a small pattern2, maybe the overhead of storing a position isn’t that big of a deal. You don’t need to parse pattern2 at that point anyway, just identify it.


#5

pattern2 is described by the following regex: ^\[.*?\]\s<\w+\|

I’ll have to try this approach I think.

Just out of curiosity: is there anything low-level about the OS/filesystem that makes backward searches harder, costlier, or impossible compared to forward ones?

If there is anything good to read about it, I’d really like to.


#6

If most (or many) lines start with “[”, then I’d say that’s likely a costly pattern. I can’t really see why searching backwards would be harder, but OS-level “read-ahead” logic in the filesystem may well make forward searches more efficient.

I have no suggestions for reading material, though. I’d say writing a few variants and seeing how they perform would be a reasonable step. The results could very well differ depending on the OS, or on HDD/SSD behaviour, so actual measurements might be the only practical way to find out.
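In that spirit, a minimal timing harness; `forward_scan` and the synthetic data are invented stand-ins for whatever variants you want to compare:

```rust
use std::time::Instant;

// Hypothetical stand-in for one search strategy: count lines containing pattern1.
fn forward_scan(data: &str) -> usize {
    data.lines().filter(|l| l.contains("pattern1")).count()
}

fn main() {
    // Synthetic input: 300k lines, one pattern1 hit per three lines.
    let data = "noise\npattern1 here\nnoise\n".repeat(100_000);

    let start = Instant::now();
    let hits = forward_scan(&data);
    let elapsed = start.elapsed();
    println!("{} hits in {:?}", hits, elapsed);
}
```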


#7

Would something like this work?

extern crate regex;
use regex::Regex;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    let pattern1 = Regex::new("^pattern1").unwrap();
    let pattern2 = Regex::new("^pattern2").unwrap();
    let file = File::open("foo.txt").unwrap();
    bufread_search(BufReader::new(file), &pattern1, &pattern2).expect("Error searching file");
}

fn bufread_search<T: BufRead>(
    buf_read: T,
    pattern1: &Regex,
    pattern2: &Regex,
) -> std::io::Result<()> {
    // Remember the most recent pattern2 (header) line instead of
    // searching backwards for it later.
    let mut pattern2_record = None;
    for line in buf_read.lines() {
        let line = line?;
        if pattern2.is_match(&line) {
            pattern2_record = Some(line);
        } else if pattern1.is_match(&line) {
            // pattern1 hit: act on the header we saw most recently.
            if let Some(ref prev_pattern2) = pattern2_record {
                do_something(prev_pattern2);
            }
        }
    }
    }
    Ok(())
}

fn do_something(line: &str) {
    println!("do something with {}", line);
}

#8

Yep, thanks, works well!