String pattern match with multiple conditions

Hi!

I am new to the Rust programming language and at the moment I am facing the following problem:

I receive data from a csv file that is structured in multiple columns like (please note that I have no control over the structure of the csv file):

date; description; value_1; value_2;...; value_n

My goal is to group the rows based on keywords in the description field and perform different calculations on the values depending on which group they fall into.

At the moment I have something like this (deserialized with csv using serde):

if description.contains("keyword #1") | description.contains("keyword #2") {
    do_something(all_the_values);
} else if description.contains("keyword #3") | description.contains("keyword #4") | description.contains("keyword #5") {
    do_something_else(all_the_values);
} else {
    do_fallback(all_the_values);
}

To me, this seems like a working but rather inelegant way to do this. My general question is:

What would be the most rust-idiomatic way to handle this?

Thank you!

Best,
Ted

This seems like a fine use case for regular expressions? Although if this is as simple as "string contains literal string", it might simply not be worth the hassle.

I don't think your code is necessarily unidiomatic. Short of more advanced string matching such as regular expressions, you could use a match statement with guards instead of your if-else chain to make it look a little bit more self-contained, maybe:

match description {
    _ if description.contains("keyword #1") | description.contains("keyword #2") => {},
    _ if description.contains("keyword #3") | description.contains("keyword #4") | description.contains("keyword #5") => {}
    _ => {}
}
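
If the list of keywords keeps growing, another option that needs only the standard library is to move the keywords into a table and search it with iterators. This is only a rough sketch: the Group enum, the RULES table and the classify function are hypothetical names, and the keywords are the placeholders from the question.

```rust
// Hypothetical group labels; the keywords mirror the placeholders in the question.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Group {
    First,
    Second,
    Fallback,
}

// Each entry pairs a list of keywords with the group they select.
// Entries are checked in order, so earlier entries win.
const RULES: &[(&[&str], Group)] = &[
    (&["keyword #1", "keyword #2"], Group::First),
    (&["keyword #3", "keyword #4", "keyword #5"], Group::Second),
];

fn classify(description: &str) -> Group {
    RULES
        .iter()
        // `any` stops at the first keyword that matches (it short-circuits).
        .find(|(keywords, _)| keywords.iter().any(|k| description.contains(*k)))
        .map(|&(_, group)| group)
        .unwrap_or(Group::Fallback)
}

fn main() {
    for d in ["has keyword #2 inside", "keyword #5", "nothing relevant"] {
        println!("{d:?} -> {:?}", classify(d));
    }
}
```

The caller can then match on the returned Group to pick the calculation, which keeps the keyword lists in one place.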

Yeah it looks okay to me honestly.

Now if you're saying you want this to run as fast as possible, that's a different story. The approach you're taking can likely be improved on quite a bit. I'd be happy to help with that, but would insist on a reproduction that I can actually run. (It doesn't have to be your full or actual data set, but something that approximates your use case is fine.) Of course, this assumes that the string matching here is a bottleneck. If the actual work you're doing in each function per row is expensive, then the string matching probably doesn't matter much.

It seems to me that the single | (bitwise or) is not optimal from a short-circuit standpoint. Maybe this can be optimised, but I am not sure.

Playground demo of the lack of short circuiting in the or-patterns.

I don't know how much of a problem it is in practice. Not often will you need to perform expensive operations to return a possible match, I would have thought.
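
For readers without the playground link, here is a minimal sketch of the difference between | and || on bool operands. The check helper and count_evaluations are made-up names used only to count how many operands get evaluated.

```rust
use std::cell::Cell;

// Hypothetical helper: counts how often it is called and returns `result`.
fn check(calls: &Cell<u32>, result: bool) -> bool {
    calls.set(calls.get() + 1);
    result
}

// Returns how many operands were evaluated with `|` and with `||`.
fn count_evaluations() -> (u32, u32) {
    // `|` on bools always evaluates both sides.
    let eager = Cell::new(0);
    let _ = check(&eager, true) | check(&eager, false);

    // `||` skips the right-hand side once the left-hand side is true.
    let lazy = Cell::new(0);
    let _ = check(&lazy, true) || check(&lazy, false);

    (eager.get(), lazy.get())
}

fn main() {
    let (eager, lazy) = count_evaluations();
    println!("`|` evaluated {eager} operand(s), `||` evaluated {lazy}");
}
```

With str::contains on short strings the extra call is cheap, so this matters mostly when the right-hand side is expensive or has side effects.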

Thank you all for your help :slight_smile:

I also think that regex might be an overkill in this case^^

@jofas The match statement looks good to me!

Regarding short circuiting: You guys are right, that's a bug. I've replaced it with ||

@BurntSushi

Thank you very much for the offer! I do not think that the performance loss will be an issue, but for the sake of my education, I've crafted a minimal example that covers the gist of my code.

Disclaimer: I know that the data structures and function signatures are far from optimal. I've tried to keep this example as minimal as possible. And I am new to Rust :wink: Also, error handling is out of scope.

A little bit of context: The csv data contains sensor data, and the sensors are identified by their name in the description field. For each sensor (or group of sensors) an appropriate calibration is performed, and then the average value is stored together with its date. The real application is a bit more complex, but there is no computation-heavy stuff. Therefore, I think this should suffice as an example.

Thanks again!

Regards,
Ted

use serde::Deserialize;
use chrono::NaiveDate;
use chrono::Datelike;

#[derive(Debug, Deserialize)]
struct Record {
    date: String,
    description: String,
    value_1: f64,
    value_2: f64,
    value_3: f64,
    value_4: f64,
}

struct Data {
    date_t: Vec<String>,
    date_p: Vec<String>,
    temp: Vec<f64>,
    pressure: Vec<f64>,
}

fn process_temp(v1: f64, v2: f64, v3: f64, v4: f64) -> f64 {
    (v1 * 2. + v2 * 2. + v3 * 2. + v4 * 2.)/4.
}

fn process_pressure_1(v1: f64, v2: f64, v3: f64, v4: f64) -> f64 {
    (v1 * 5. + v2 * 5. + v3 * 5. + v4 * 5.)/4.
}

fn process_pressure_2(v1: f64, v2: f64, v3: f64, v4: f64) -> f64 {
    (v1 * 3. + v2 * 3. + v3 * 3. + v4 * 3.)/4.
}

fn main() {

    let csv_data = "date;description;value_1;value_2;value_3;value_4
        13.04.2023;T_1;23.5;24.8;23.7;24.0
        13.04.2023;T_2;22.5;22.8;22.7;22.0
        13.04.2023;p_1;4.0;4.1;4.1;3.9
        13.04.2023;p_2;4.4;4.2;4.3;4.2";

    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(true)
        .delimiter(b';')
        .from_reader(csv_data.as_bytes());

    let mut data = Data {
        date_t: vec![],
        date_p: vec![],
        temp: vec![],
        pressure: vec![],
    };

    for result in rdr.deserialize() {

        let record: Record = result.unwrap();

        let naive_date = NaiveDate::parse_from_str(&record.date, "%d.%m.%Y").unwrap();
        let month = naive_date.month();
        let year = naive_date.year();

        let string = record.description.to_lowercase();


        if string.contains("t_1") || string.contains("t_2") {
            data.temp.push(process_temp(record.value_1, record.value_2, record.value_3, record.value_4));
            data.date_t.push(format!("{}.{}", month, year));
        } else if string.contains("p_1") {
            data.pressure.push(process_pressure_1(record.value_1, record.value_2, record.value_3, record.value_4));
            data.date_p.push(format!("{}.{}", month, year));
        } else if string.contains("p_2") {
            data.pressure.push(process_pressure_2(record.value_1, record.value_2, record.value_3, record.value_4));
            data.date_p.push(format!("{}.{}", month, year));
        } else {
            println!("Unexpected value");
        }
    }
    for i in 0..data.date_t.len() {
        println!("{} - {}", data.date_t[i], data.temp[i]);
    }
    for i in 0..data.date_p.len() {    
        println!("{} - {}", data.date_p[i], data.pressure[i]);
    }
}

Thanks for the code! So the idea behind providing a reproduction is to give someone else enough information that they can reproduce the same thing you're seeing. Sometimes just the code is enough, but in a lot of cases, it isn't. For example, in this case, I would really need the input to the program. That is, the CSV data. It can be made up. As long as it's representative somehow of your real data, that's good enough.

The Cargo.toml would also be good to include.

I just took OP's snippet above and put it into the playground; it runs just fine.

I think you've just overlooked this variable right here:

or what do you mean by input?

Here's the playground:

In your example, the entire description is apparently what you are checking for "containment". If it is instead equality that you are looking for, then that's pretty easy to refactor using an enum and a match. (It can even be extended to use dynamic dispatch, via e.g. a HashMap of processing functions.)

Your code can be improved quite a bit in other regards as well, see this updated playground.
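
For illustration, the enum-and-match idea could look roughly like the sketch below. It is not the code from the linked playground: the Sensor enum and its from_description, factor and process methods are hypothetical names, and the sketch assumes the description is exactly the sensor name, which is true for the example data only.

```rust
// Sensor groups from the example data; the enum itself is a hypothetical name.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Sensor {
    Temp,
    Pressure1,
    Pressure2,
}

impl Sensor {
    // Classifies by equality on the trimmed, lowercased description.
    fn from_description(description: &str) -> Option<Sensor> {
        match description.trim().to_lowercase().as_str() {
            // Here `|` is an or-pattern, not the bitwise operator.
            "t_1" | "t_2" => Some(Sensor::Temp),
            "p_1" => Some(Sensor::Pressure1),
            "p_2" => Some(Sensor::Pressure2),
            _ => None,
        }
    }

    // Calibration factors taken from the example's process_* functions.
    fn factor(self) -> f64 {
        match self {
            Sensor::Temp => 2.0,
            Sensor::Pressure1 => 5.0,
            Sensor::Pressure2 => 3.0,
        }
    }

    // Scales each value by the factor and averages the four results.
    fn process(self, values: [f64; 4]) -> f64 {
        values.iter().map(|v| v * self.factor()).sum::<f64>() / 4.0
    }
}

fn main() {
    match Sensor::from_description("T_1") {
        Some(sensor) => println!("{:?}: {}", sensor, sensor.process([1.0, 2.0, 3.0, 4.0])),
        None => println!("Unexpected value"),
    }
}
```

If the real descriptions contain more text around the sensor name, from_description would have to go back to contains or a similar substring check.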

Ah yes I did, thank you! I was expecting bigger input haha. Fair enough. Thanks for pointing that out.

Here is the missing Cargo.toml:

[package]
name = "example"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
csv = "1.1"
serde = { version = "1", features = ["derive"] }
chrono = { version = "^0", features = ["serde"] }

@BurntSushi the real input is much bigger but since this is just an example, only the structure is shown.

@H2CO3 Ah, thanks for the remark! This is only the case in the example. In the real description, there is more text. Sorry, I should have included this in the example data. Thanks for the improvements, I will have a look :slight_smile:

OK, so I expanded your fake data. When looking at perf, it's really important to look at representative data. In this case, your data has only 4 rows. If that's what your real data has, then indeed, perf is not really relevant because pretty much anything you do will be extremely fast given the small amount of data.

The expanded data can be found here (I'll remove this eventually): https://burntsushi.net/stuff/urlo-string-contains-perf-test.csv

I then modified your program so that main begins like so:

fn main() {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(true)
        .delimiter(b';')
        .from_path("urlo-string-contains-perf-test.csv")
        .unwrap();

and ends with

    println!("{}, {}", data.date_t.len(), data.date_p.len());
}

So as to avoid printing everything out. (Which, I assume, is also not representative of your real use case. I don't know though for sure.)

I compiled with cargo build --release and ran it to make sure it runs and takes a "decent" amount of time:

$ time ./target/release/urlo-string-contains
2768480, 2768480

real    1.510
user    1.432
sys     0.077
maxmem  340 MB
faults  0

And now it's time to attach a profiler. I'm on Linux, so I can use perf. If you're on a different OS, you'll need to find a different tool since perf is Linux-only.

$ perf record -g --call-graph dwarf ./target/release/urlo-string-contains

And then run perf report to look at the breakdown:

  • Somewhere around 24-37% of the total time is spent in Serde deserializing. Especially parsing the floats.
  • Somewhere around 22% of the total time is spent parsing a date.
  • Only around 10% of the total time is actual CSV parsing.
  • I can't even find your str::contains calls in a profile, so the perf of your substring searching appears mostly irrelevant here.
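
If no profiler is at hand, a much coarser check is to time individual stages with std::time::Instant. The sketch below is only an approximation of that idea and not what was measured above: the rows are synthetic stand-ins rather than the CSV file, and timed, synthetic_rows and count_matches are hypothetical helper names. It should be built with --release, and the timings will vary from run to run.

```rust
use std::time::{Duration, Instant};

// Runs `f` once and returns its result together with the elapsed wall time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

// Synthetic (description, value) rows; made up, not the thread's data.
fn synthetic_rows(n: usize) -> Vec<(String, String)> {
    (0..n)
        .map(|i| (format!("sensor p_{} hall", i % 3), format!("{}.5", i % 40)))
        .collect()
}

// Counts rows whose description mentions one of the pressure sensors.
fn count_matches(rows: &[(String, String)]) -> usize {
    rows.iter()
        .filter(|(d, _)| d.contains("p_1") || d.contains("p_2"))
        .count()
}

fn main() {
    let rows = synthetic_rows(200_000);

    let (matches, t_contains) = timed(|| count_matches(&rows));
    let (sum, t_parse) = timed(|| {
        rows.iter()
            .map(|(_, v)| v.parse::<f64>().unwrap())
            .sum::<f64>()
    });

    println!("contains: {matches} matches in {t_contains:?}");
    println!("parse:    sum {sum} in {t_parse:?}");
}
```

This only compares two stages in isolation, so a real profiler is still the better tool for finding where the time goes in the full program.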

@H2CO3 I've read your refactoring and there were a lot of valuable suggestions. Thank you! But I do not understand the following syntax of the match statement, although I do understand what the code does:

 let (proc_fn, values, dates): (fn(_, _, _, _) -> _, _, _) [...]

What does (fn(_, _, _, _) -> _, _, _) do/mean?

@BurntSushi Sorry, I misunderstood your requirement. I assumed you were asking for my general code structure. Nevertheless, thank you! Basically, I can go with the if-statement (or better an equivalent match statement)

If we imagine for a moment that type ascription works two-dimensionally and line up the variable declarations with the type ascription, we have:

 let             :
(                (
    proc_fn           fn(_, _, _, _) -> _ 
,                ,
    values            _
,                ,
    dates             _
)                )

That is,

Variable    Type ascription
proc_fn     fn(_, _, _, _) -> _
values      _
dates       _

And _ means "infer this type for me".


Why is it there? The compiler needs some help to know you want to get a function pointer to something that takes four arguments and returns something: a fn(_, _, _, _) -> _. The reason it needs some help is that each function, like process_temp and process_pressure_1, has its own distinct zero-sized type, despite having the same signature. But proc_fn can only have one type. So the function types need to be coerced into function pointer types.

Without the ascription, the compiler doesn't know it needs to do some coercion and gives a type mismatch error instead (try removing : (fn(_, _, _, _) -> _, _, _)).
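
A small sketch of the same effect, simplified to one-argument functions (these process_temp and process_pressure are cut-down stand-ins, and select is a hypothetical helper, not the code from the playground):

```rust
use std::mem::size_of_val;

// Simplified one-argument stand-ins for the thread's process_* functions.
fn process_temp(v: f64) -> f64 {
    v * 2.0
}

fn process_pressure(v: f64) -> f64 {
    v * 5.0
}

fn select(description: &str) -> (fn(f64) -> f64, &'static str) {
    // The annotation asks for a function pointer in the first tuple slot;
    // every `_` is left for the compiler to infer.
    let (proc_fn, label): (fn(_) -> _, _) = match description {
        "t_1" | "t_2" => (process_temp, "temp"),
        _ => (process_pressure, "pressure"),
    };
    (proc_fn, label)
}

fn main() {
    // A function item has its own zero-sized type...
    println!("size of fn item: {}", size_of_val(&process_temp));
    // ...while a function pointer is pointer-sized.
    println!("size of fn pointer: {}", std::mem::size_of::<fn(f64) -> f64>());

    let (f, label) = select("t_1");
    println!("{label}: {}", f(1.5));
}
```

An alternative to the annotation is to cast one arm explicitly, for example process_temp as fn(f64) -> f64, which also gives the compiler the common pointer type to aim for.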

It's a function pointer type taking 4 arguments and returning a type, all of them to be inferred. (I.e., it's the minimal information required to coerce all fn items to a common fn pointer type.)

Ahh thank you! It looked like the last two underscores belong to the proc_fn function. Now it makes sense!

No, that's just a tuple.
