Regex example/best practice needed

#1

Hi,

Could anyone suggest a best-practice way to parse the example file below?

file.txt

#AB     CA:this     EA:533224
@PD   OP:"this is the thing"     HG: v2.10-r784-dusty     KJ:tar -xzvf test.tar.gz ./test
#AC     CB:this     EB:55224
#AD     CC:that     EC:533
@PT   OP:"what is this"     HG: v5.10-r784-bla     KJ:tar -xzvf  t..tar.gz ./testA

main.rs

    for line in BufReader::new(file).lines() {
        let r = line.unwrap();
        ...
    }

How can I extract the information from each line so that I can print it in the following way?

AB=this(CA), 533224(EA)
PD=this is the thing(OP), v2.10-r784-dusty(HG), cmd: tar -xzvf test.tar.gz ./test
AC=this(CB), 55224(EB)
AD=that(CC), 533(EC)
PT=what is this(OP), v5.10-r784-bla(HG),  cmd: tar -xzvf  t..tar.gz ./testA

Note that there are two different match patterns. Could I define these up front and then just pass each line through match()? I haven’t managed to find an example.

Thank you!

#2

Then perhaps you want a RegexSet?
https://docs.rs/regex/1.1.0/regex/#example-match-multiple-regular-expressions-simultaneously

#3

Looks promising… do you have any examples of how to extract the elements from the regex in order to reformat them? Is this the best approach when processing large files?
Thanks

#4

I don’t have my own examples, but there are more here:
https://docs.rs/regex/1.1.0/regex/struct.RegexSet.html#example

I think to extract elements you need Captures, which you can’t get from RegexSet AFAICS. You might be better off just testing each line manually against your two expressions, where you can try Regex::captures for each.

#5

I guess I’m just too dumb for regex… any help?

    let re = Regex::new(r"#(?P<a>[^\t])C(?P<b>[^\t])A:(?P<c>\d+)").unwrap();


    for line in BufReader::new(file).lines() {
        let r = line.unwrap();

        match re.captures(&r) {
            Some(caps) => {
                println!("{} ", &caps["a"]);
            }
            None => {
                println!("Line starting with @ ");
            }
        }
    }
        
#6

For this input:

#AB     CA:this     EA:533224
#AC     CB:this     EB:55224
#AD     CC:that     EC:533

You’re using this regex:

#(?P<a>[^\t])C(?P<b>[^\t])A:(?P<c>\d+)

?

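It won’t match. I assume the tabs are the separators, but [^\t] matches exactly one non-tab character, so a and b can each capture only a single character. The literal C and A in the pattern then don’t line up with the input, and \d+ can’t match a value like "this".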

I’ve simplified a bit (using positional rather than named captures, just to remove verbosity) and allowed spaces-or-tabs between parts.

Try this (not tested):

#(A[BCD])\s+(C[ABC]):(\w+)\s+(E[ABC]):(\d+)

I like named captures, but I use them with the x (extended) flag, which lets you add whitespace and comments to lay the regex out nicely. First, just get a basic capture working.

#7

Sorry for the trivial questions, I am still learning 🙂 This worked! As far as I can tell, Rust is a mix of many languages with a lot of wrappers. I just hope the wrappers do not slow down execution. I haven’t seen that so far, but I also haven’t gotten to real-world processing yet.

Thanks
