Finding a set of substrings in a long string

I’m trying to find which strings from a given set are present in a *.docx file, so I used the following crates:

  • dotext to read the contents of the *.docx file

  • regex to work with regular expressions

And ended up with the code below:

extern crate dotext;
extern crate regex;

use dotext::*;

use std::io::Read;
use regex::RegexSet;

fn main(){
    let set = RegexSet::new(&[
        r"factorial",
        r"Hasan Yousef",
        r"carborundum",
    ]).unwrap();

    let mut file = Docx::open("samples/sample.docx").unwrap();
    let mut isi = String::new();
    let _ = file.read_to_string(&mut isi);

    if set.is_match(&isi){
        let matches: Vec<_> = set.matches(&isi).into_iter().collect();
        println!("Matches found: {} / {}, \nMatches Vector: {:?} of {:?}",
                matches.len(), set.len(), matches, set);
    } else {
        println!("No matches found");
    }
}

The above works fine and gives the output below:

Matches found: 2 / 3,
Matches Vector: [0, 2] of RegexSet(["factorial", "Hasan Yousef", "carborundum"])

The issue I have is that I want the output to tell me the exact matches that were found in the file, like:

Out of 3 subsets, the below 2 had been found:
factorial,
carborundum

I tried to iterate over the RegexSet but could not. I also tried to find something like the code below, but could not either:

if isi.contains(|s| where s is in (mysubsets)) {
    println!(s);
}
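For purely literal substrings (no regex syntax), the idea in the pseudocode above can be sketched with `str::contains` and a filter. This is only an illustration: the helper name `find_present` and the sample text are made up here, and in the real program the text would come from the *.docx file.

```rust
/// Returns the candidates that occur somewhere in `text`, in their original order.
/// `find_present` is a hypothetical helper name, not part of any crate.
fn find_present<'a>(text: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|candidate| text.contains(*candidate))
        .collect()
}

fn main() {
    // Stand-in for the text read from the .docx file.
    let text = "The factorial of n grows quickly; carborundum is an abrasive.";
    let candidates = ["factorial", "Hasan Yousef", "carborundum"];

    let found = find_present(text, &candidates);
    println!(
        "Out of {} subsets, the below {} had been found:",
        candidates.len(),
        found.len()
    );
    for s in &found {
        println!("{}", s);
    }
}
```

This does one scan of the text per candidate, which is usually fine for a short list; a RegexSet is the better fit once the entries are real patterns.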

Looks like a bit of a hack, but works. Check this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=c09af9876ee4dfadf5bbfe98ad01431d

It is based on two general ideas:

  • List of search strings will be needed twice - for RegexSet generation and for output, so we store it into the variable and feed this variable to RegexSet::new.
  • matches can be iterated over, so it can be mapped into corresponding items (thanks to the fact that RegexSet preserves order).
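The two ideas above can be sketched without any dependencies as follows. Note the assumptions: `matched_indices` is a hypothetical stand-in for `set.matches(&text).into_iter()` that uses plain substring search instead of a real RegexSet, and `patterns_for` is a made-up helper name. Only the index-to-pattern mapping is the point here.

```rust
/// Stand-in for `set.matches(&text).into_iter()`: yields the index of each
/// pattern found in `text`, in pattern order. Plain substring search is used
/// here (an assumption for this sketch) so that it runs without the regex crate.
fn matched_indices(patterns: &[&str], text: &str) -> Vec<usize> {
    patterns
        .iter()
        .enumerate()
        .filter(|(_, p)| text.contains(**p))
        .map(|(i, _)| i)
        .collect()
}

/// Maps matched indices back to the pattern strings they refer to.
/// This works because the indices follow the order of the original list.
fn patterns_for<'a>(patterns: &[&'a str], indices: &[usize]) -> Vec<&'a str> {
    indices.iter().map(|&i| patterns[i]).collect()
}

fn main() {
    // The search strings are kept in a variable so they can be used twice:
    // once to build the set, once to print the matching entries.
    let searches = ["factorial", "Hasan Yousef", "carborundum"];
    let text = "factorial and carborundum appear here";

    let indices = matched_indices(&searches, text);
    let found = patterns_for(&searches, &indices);

    println!("Out of {} subsets, the below {} had been found:", searches.len(), found.len());
    for s in found {
        println!("{}", s);
    }
}
```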

(Some other code changes were made just to make it run in the playground, along the lines of mocks.)

Matches is a slice (or other kind of ordered collection) of indices into set… Something like matches.map(|m| set[m]).collect() might work.

Unfortunately, the set isn’t indexable. That was the first thing I checked, too.

Oh, that’s a pity then, and in my opinion a design flaw in the API… I was unable to check as I’m on mobile only.

Thanks, I now have the code below, after adding argparse to parse the input arguments:

use std::io::Read;

extern crate argparse;
use argparse::{ArgumentParser, Store};

extern crate dotext;
use dotext::*;

extern crate regex;
use regex::RegexSet;

fn main(){
    let mut skills = String::new();
    {
        let mut ap = ArgumentParser::new();
        ap.refer(&mut skills)
            .add_option(&["-s", "--skills"], Store,
            "Skills / Experience required");
        ap.parse_args_or_exit();
    }
    let searches: Vec<&str> = skills.split(",").collect();
    //let searches = skills.split(",").collect::<Vec<&str>>();

    let set = RegexSet::new(&searches).unwrap();

    let mut file = Docx::open("samples/sample.docx").unwrap();
    let mut isi = String::new();
    let _ = file.read_to_string(&mut isi);

    if set.is_match(&isi){
        let matches: Vec<_> = set.matches(&isi).into_iter().collect();
        let exact_matches = matches.iter()
            .map(|num| searches[*num])
            .collect::<Vec<&str>>();
        println!("Out of {} requirements, the below {} matches had been found:",
                set.len(), matches.len());
        for x in exact_matches {
            println!("{}", x);
        }
    } else {
        println!("No matches found");
    }
}

And can run it as:

Hasans-Air:debug hasan$ ./ufg -s "factorial, Hasan Yousef, carborundum"

Out of 3 requirements, the below 2 matches had been found:
factorial
carborundum

I just noticed there is a leading space in some of the entries: it comes from the input parameter that is read into skills, and it is carried over to searches when splitting. How can I get rid of it?
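One possible way to drop the surrounding whitespace is to trim each piece right after splitting. The sketch below uses only the standard library; the helper name `parse_skills` is made up for illustration, and skipping empty entries is an extra assumption rather than something the original code does.

```rust
/// Splits a comma-separated argument into trimmed, non-empty search terms.
/// `parse_skills` is a hypothetical helper name.
fn parse_skills(raw: &str) -> Vec<&str> {
    raw.split(',')
        .map(str::trim)             // remove leading and trailing whitespace
        .filter(|s| !s.is_empty())  // skip empty entries such as a trailing comma
        .collect()
}

fn main() {
    // Same value as passed on the command line with -s.
    let skills = String::from("factorial, Hasan Yousef, carborundum");
    let searches = parse_skills(&skills);
    println!("{:?}", searches); // prints ["factorial", "Hasan Yousef", "carborundum"]
}
```

In the program above, this would replace the `skills.split(",").collect()` line, so both the RegexSet and the printed matches see the trimmed strings.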

If all you need is a slice of the pattern strings originally given to the regex, then that should be easy to expose as a new API item on regex sets. I’d be happy to take a PR for that. But yeah, otherwise you would need to create copies of the pattern strings and manage them yourself.

Done :heart:

Or did I misunderstand, and all you want is a function that returns a slice of the stored strings (instead of indexing the RegexSet directly)?