Expressing "If all four of these options are Some, then..."

Here is some code I wrote to scan the header of a CSV file and determine whether all the columns I care about are available.

struct LandmarkFileColumns {
    addr: usize,
    longitude: usize,
    latitude: usize,
    distance: usize,
}

#[derive(Debug, Fail)]
#[fail(display="missing columns: {}", _0)]
struct LandmarkFileMissingColumnsError(String);

fn select_columns(header: &csv::StringRecord, location: &str)
    -> Result<LandmarkFileColumns, LandmarkFileMissingColumnsError> {

    let mut addr: Option<usize> = None;
    let mut longitude: Option<usize> = None;
    let mut latitude: Option<usize> = None;
    let mut distance: Option<usize> = None;
    let mut wanted = 4;

    for (index, field) in header.iter().enumerate() {
        match field {
            "addr"             => { addr      = Some(index); wanted -= 1; },
            "longitude"        => { longitude = Some(index); wanted -= 1; },
            "latitude"         => { latitude  = Some(index); wanted -= 1; },
            f if f == location => { distance  = Some(index); wanted -= 1; },
            _                  => {},
        }
        if wanted == 0 { break }
    }

    if wanted == 0 {
        Ok(LandmarkFileColumns {
            addr: addr.unwrap(),
            longitude: longitude.unwrap(),
            latitude: latitude.unwrap(),
            distance: distance.unwrap(),
        })

    } else {
        let mut missing: Vec<&str> = Vec::with_capacity(4);
        if addr.is_none()      { missing.push("addr"); }
        if longitude.is_none() { missing.push("longitude"); }
        if latitude.is_none()  { missing.push("latitude"); }
        if distance.is_none()  { missing.push(location); }

        Err(LandmarkFileMissingColumnsError(missing.join(", ")))
    }
}

This works but it feels redundant and not as provably correct as it should be. Is there a tighter way to write it? Particularly, is there a better way to express "if all four of these Options are Some then [break out of the loop / unwrap all of them]" ? "Better" means "easier for human readers and/or the compiler to see that this function cannot panic."

1 Like

You can pattern match on a tuple:

    if let (
        Some(addr),
        Some(longitude),
        Some(latitude),
        Some(distance)
    ) = (addr, longitude, latitude, distance) {
        Ok(LandmarkFileColumns {
            addr, longitude, latitude, distance
        })
    }

Edit: If you're ok with only reporting one missing column instead of them all, you can fold everything into a single match:

    match (addr, longitude, latitude, distance) {
        (None, _, _, _) => Err(LandmarkFileMissingColumnsError("addr".into())),
        (_, None, _, _) => Err(LandmarkFileMissingColumnsError("longitude".into())),
        (_, _, None, _) => Err(LandmarkFileMissingColumnsError("latitude".into())),
        (_, _, _, None) => Err(LandmarkFileMissingColumnsError("distance".into())),

        (Some(addr), Some(longitude), Some(latitude), Some(distance))
            => Ok(LandmarkFileColumns { addr, longitude, latitude, distance })
    }
19 Likes

In fact this technique can be used more generally: to match on arbitrarily complex state (but it still has to be precisely match-able, so anything behind a trait or trait object won't work, at least not directly).
It sounds like a "no duh" moment given the fragment above, but when I discovered it some years back, it was a bit of a revelation. It also immediately skewed my code style in the direction of favoring enums over traits when I can get away with it.

5 Likes

I would do it this way:

fn find_column(
    header: &csv::StringRecord,
    name: &str,
) -> Result<usize, LandmarkFileMissingColumnsError> {
    header
        .iter()
        .position(|field| field == name)
        .ok_or_else(|| LandmarkFileMissingColumnsError(name.into()))
}

fn select_columns(
    header: &csv::StringRecord,
    location: &str,
) -> Result<LandmarkFileColumns, LandmarkFileMissingColumnsError> {
    Ok(LandmarkFileColumns {
        addr: find_column(header, "addr")?,
        longitude: find_column(header, "longitude")?,
        latitude: find_column(header, "latitude")?,
        distance: find_column(header, location)?,
    })
}

Unfortunately it only reports the first missing column, but it's much simpler.

4 Likes

Note that CSV header may contains duplicated names which can beat assumptions of this loop. Pretty bad csv file, many tools may not handle them correctly, but still valid format. It would be better to not do anything weird on such case especially if it doesn't increase code complexity.

fn select_columns(header: &csv::StringRecord, location: &str)
    -> Result<LandmarkFileColumns, LandmarkFileMissingColumnsError> {

    let mut addr: Option<usize> = None;
    let mut longitude: Option<usize> = None;
    let mut latitude: Option<usize> = None;
    let mut distance: Option<usize> = None;

    for (index, field) in header.iter().enumerate() {
        match field {
            "addr" => addr = Some(index),
            "longitude" => longitude = Some(index),
            "latitude" => latitude  = Some(index),
            f if f == location => distance  = Some(index),
            _ => {},
        }
    }

    let mut missing = Vec::new();
    if addr.is_none() { missing.push("addr"); }
    if longitude.is_none() { missing.push("longitude"); }
    if latitude.is_none()  { missing.push("latitude"); }
    if distance.is_none()  { missing.push(location); }

    if missing.is_empty() {
        Ok(LandmarkFileColumns {
            addr: addr.unwrap(),
            longitude: longitude.unwrap(),
            latitude: latitude.unwrap(),
            distance: distance.unwrap(),
        })
    } else {
        Err(LandmarkFileMissingColumnsError(missing.join(", ")))
    }
}
1 Like

FWIW that's technically a quadratic algorithm, which is not too bad for 4 fields but for a couple dozen fields, it might be suprising that something is slow. OTOH pattern matching on the field names as literal strings is optimized very well by the compiler. (Not sure if it can see through the iterator magic and prove these are equivalent, but I wouldn't count on that.)

1 Like

I think the core of the question is how much you care about reporting all the problems at once.

By far the simplest is to use Option::ok_or and ? to error out as soon as something is missing.

When you need control flow to continue going along after the "error", though, I'm not sure it's possible for there to be a particularly nice flow in terms of compiler checks. Like if you imagine a function to do it, what's the perfect way to hold all the errors? It'd want to be something like -> Result<[usize; 4], NonEmptyArrayVec<_, 4>>? Rust is short on type vocabulary for such things.

It's worth to also compare what the serde deserialize derive emits, though it has the advantage of being machine generated and disadvantage of having to go through serde's generic layer. Manually cleaned up some, we get:

#[inline]
fn visit_map<__A>(self, mut __map: __A) -> Result<Self::Value, __A::Error>
where
    __A: _serde::de::MapAccess<'de>,
{
    let mut __field0: Option<usize> = None;
    let mut __field1: Option<usize> = None;
    let mut __field2: Option<usize> = None;
    let mut __field3: Option<usize> = None;
    while let Some(__key) = _serde::de::MapAccess::next_key::<__Field>(&mut __map)? {
        match __key {
            __Field::__field0 => {
                if Option::is_some(&__field0) {
                    return Err(__A::Error::duplicate_field("addr"));
                }
                __field0 = Some(_serde::de::MapAccess::next_value::<usize>(&mut __map)?);
            }
            __Field::__field1 => {
                if Option::is_some(&__field1) {
                    return Err(__A::Error::duplicate_field("longitude"));
                }
                __field1 = Some(_serde::de::MapAccess::next_value::<usize>(&mut __map)?);
            }
            __Field::__field2 => {
                if Option::is_some(&__field2) {
                    return Err(__A::Error::duplicate_field("latitude"));
                }
                __field2 = Some(_serde::de::MapAccess::next_value::<usize>(&mut __map)?);
            }
            __Field::__field3 => {
                if Option::is_some(&__field3) {
                    return Err(__A::Error::duplicate_field("distance"));
                }
                __field3 = Some(_serde::de::MapAccess::next_value::<usize>(&mut __map)?);
            }
            _ => {
                let _ = _serde::de::MapAccess::next_value::<_serde::de::IgnoredAny>(&mut __map)?;
            }
        }
    }
    let __field0 = match __field0 {
        Some(__field0) => __field0,
        None => _serde::__private::de::missing_field("addr")?,
    };
    let __field1 = match __field1 {
        Some(__field1) => __field1,
        None => _serde::__private::de::missing_field("longitude")?,
    };
    let __field2 = match __field2 {
        Some(__field2) => __field2,
        None => _serde::__private::de::missing_field("latitude")?,
    };
    let __field3 = match __field3 {
        Some(__field3) => __field3,
        None => _serde::__private::de::missing_field("distance")?,
    };
    Ok(LandmarkFileColumns {
        addr: __field0,
        longitude: __field1,
        latitude: __field2,
        distance: __field3,
    })
}

It's worth noting that serde's generated deserialization (which is the most widely used deserializer in the Rust ecosystem, and widely praised for generating useful errors in a sea of "error on line 128" deserializers) doesn't even attempt to report multiple missing fields. It could've built that into the design -- missing_fields(&["addr", "longitude", "latitude", "distance"] -- and even could add it post facto -- default to just reporting the first, as is current behavior -- but opted to just display a single missing field, I guess because that's good enough to explain the error to the user, who's going to go look at the file and (probably) see the other missing fields as well.


If it's just one (or maybe a few) structures that need to be loaded, I'd just write out something like serde's code (or find a way to use serde, because I'm lazy). If I'm deserializing more things, I'd write a simple macro_rules! macro to expand to this simple pattern (or spend more effort trying to cram my model into serde, because I want to just use serde).

3 Likes

The main reason I'm not using serde here is, the name of one of the columns to be decoded varies at runtime. That's the point of the "f if f == location" arm of the match. serde-csv insists on knowing the names of all the columns you want at compile time (or it used to, anyway; this is code I put down for over a year).

In context, I don't have to worry about duplicate column names, but I do want to diagnose all missing columns in one pass because this is step three of a slow process that has to be started over from scratch if it fails (it's a long story).

Applying @2e71828's suggestion plus a little code shuffling I got this, which I'm happy with as far as clarity and panic avoidance:

fn select_columns(header: &csv::StringRecord, location: &str)
    -> Result<LandmarkFileColumns, LandmarkFileMissingColumnsError> {

    let mut addr: Option<usize> = None;
    let mut lon:  Option<usize> = None;
    let mut lat:  Option<usize> = None;
    let mut dist: Option<usize> = None;

    for (index, field) in header.iter().enumerate() {
        match field {
            "addr"             => { addr = Some(index); },
            "longitude"        => { lon  = Some(index); },
            "latitude"         => { lat  = Some(index); },
            f if f == location => { dist = Some(index); },
            _                  => {},
        }
        if let (Some(addr), Some(lon), Some(lat), Some(dist)) =
            (addr, lon, lat, dist) {
                return Ok(LandmarkFileColumns { addr, lon, lat, dist });
            }
    }

    let mut missing: Vec<&str> = Vec::with_capacity(4);
    if addr.is_none() { missing.push("addr");      }
    if lon.is_none()  { missing.push("longitude"); }
    if lat.is_none()  { missing.push("latitude");  }
    if dist.is_none() { missing.push(location);    }

    Err(LandmarkFileMissingColumnsError(missing.join(", ")))
}

Code generation is reasonable. It could be better; Option<usize> has no niche values, so those four Options consume eight registers and (on x86) we run out of registers. If I were writing the assembly language by hand I'd combine the four 'is some?' booleans into one flag word, but I don't blame the compiler for not knowing how to do that. This isn't on the critical path, so a few register spills in the loop are not a big deal.

Yeah, this is something I've struggled with in other contexts. I write a lot of programs that process files of data, and I want them to diagnose all the problems with the input, not stop after the first error. ? makes it easy to write code that detects and propagates failures, but the code that traps the error, reports it, and recovers into degraded "keep parsing to detect more errors" mode still comes out butt ugly. I don't have any clever ideas.

The only times I've seen this come out really well is when someone finds a way to rephrase the problem such that the errors can become part of the model.

Like FSM operations never "fail", but they might cause a transition to the error state -- but that's a real state, with operations that can get you out of it.

Or the way the type checker handlers error, by having an error type, and not reporting new errors against the error type, but still allowing it to disappear naturally -- because drop(x) + 1 is always a new error, even if it couldn't figure out what the type of x is.

5 Likes

I just wanted to mention, as a single data point, that I usually want the exact opposite. I have a hard time dealing with multiple errors in a focused manner. I wish compilers only reported one error at a time only, or at
most as many as fit on a screen.

Here's another alternative (not tested): Rust Playground

fn select_columns(
    header: &csv::StringRecord,
    location: &str,
) -> Result<LandmarkFileColumns, LandmarkFileMissingColumnsError> {
    let columns = ["addr", "longitude", "latitude", location];
    let mut values = [None; 4];

    for (index, field) in header.iter().enumerate() {
        for (col_idx, col) in columns.iter().enumerate() {
            if &field == col {
                values[col_idx] = Some(index);
            }
        }
    }

    match values {
        [Some(addr), Some(longitude), Some(latitude), Some(distance)] => Ok(LandmarkFileColumns {
            addr,
            longitude,
            latitude,
            distance,
        }),
        _ => {
            let missing: Vec<_> = values
                .iter()
                .enumerate()
                .filter_map(|(n, f)| if f.is_none() { Some(columns[n]) } else { None })
                .collect();
            Err(LandmarkFileMissingColumnsError(missing.join(", ")))
        }
    }
}

It lists all the missing columns. There's a nested for loop, but it has the same complexity as the other solutions - it's still O(n) with number of cells because the number of columns is fixed at 4. It's likely(?) not as fast as a match statement would be however.

It would be quite easy to modify this to give a duplicate column error as well if that's something you need to worry about.

And here's yet another option (untested): Rust Playground

fn select_columns(
    header: &csv::StringRecord,
    location: &str,
) -> Result<LandmarkFileColumns, LandmarkFileMissingColumnsError> {
    let columns = ["addr", "longitude", "latitude", location];
    let mut values = [None; 4];
    let mut missing = vec![];

    for (col, val) in columns.iter().zip(&mut values) {
        match header.iter().position(|field| field == *col) {
            Some(x) => *val = Some(x),
            None => missing.push(*col),
        }
    }

    if missing.is_empty() {
        Ok(LandmarkFileColumns {
            addr: values[0].unwrap(),
            longitude: values[1].unwrap(),
            latitude: values[2].unwrap(),
            distance: values[3].unwrap(),
        })
    } else {
        Err(LandmarkFileMissingColumnsError(missing.join(", ")))
    }
}

It's like the previous one, but the for loops are switched round - a bit like @Riateche's solution above. We miss out on using the nice destructuring match, but it's short and easy to read. It may even be quite fast if we expect addr, longitude and latitude to usually be the first three columns.

It would be harder to add duplicate column detection to this one as we couldn't use the short-circuiting .position method.

Having done this I realize that it would have probably been good to use zip in my previous example too.