Tighten up this parser function?

The function below parses a byte string, which is expected to be all-ASCII and to take the form of a decimal fraction, into a std::time::Duration. It cannot use a floating-point intermediate for reasons explained in the comments (basically, loss of precision) and so there's a bunch of string bashing to compute the nanoseconds part of the value, and I feel like there's probably a better way to write it. In particular I don't love the chain of manipulations done to fraction or the while places < 9 loop at the bottom.

How would you write this function? Note: the format is pinned by external compatibility constraints, so make sure to accept exactly the same set of strings.

use std::time::Duration;
/// Parse a string (as read directly from disk, so a [u8]) as decimal
/// seconds and nanoseconds since the Unix epoch, into a Duration object.
pub fn parse_decimal_timestamp(data: &[u8]) -> Option<Duration> {
    // An f64 can only represent Unix timestamps with full nanosecond
    // precision if they are within ±2**53 nanoseconds of
    // 1970-01-01T00:00:00.000000000Z.  This is a smaller range than
    // you might expect: only from 6pm on September 18, 1969 until 6am
    // on April 15, 1970.  So, we cannot use f64 as an intermediary here.
    let ts = std::str::from_utf8(data).ok()?;
    let (seconds, fraction) = ts.split_once(".").unwrap_or((ts, ""));
    let fraction = fraction.trim_end_matches("0");
    let fraction = if fraction == "" {
        "0"
    } else if fraction.len() <= 9 {
        fraction
    } else {
        // Truncate to 9 digits, but reject the whole timestamp if any
        // of the discarded characters are not digits.  (Non-digits in
        // the preserved part of the fraction will be rejected by
        // .parse::<u32> below.)
        if (&fraction[9..]).chars().any(|c| !c.is_ascii_digit()) {
            return None;
        }
        &fraction[..9]
    };

    let secs = seconds.parse::<u64>().ok()?;
    let mut nanos = fraction.parse::<u32>().ok()?;
    if nanos > 0 {
        let mut places = fraction.len();
        while places < 9 {
            places += 1;
            nanos *= 10;
        }
    }
    Some(Duration::new(secs, nanos))
}

The while loop can be replaced by nanos *= 10_u32.pow(9 - places as u32);.
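
For illustration, here is a minimal sketch of that replacement as a standalone helper. The name scale_to_nanos is hypothetical and not from the original post; it assumes the caller has already limited the fraction to at most 9 digits, as the original function does.

```rust
/// Scale a fraction that was parsed from `digits` decimal digits up to
/// nanoseconds. `scale_to_nanos` is a hypothetical helper name.
fn scale_to_nanos(value: u32, digits: usize) -> u32 {
    // Assumes 1 <= digits <= 9 and value < 10^digits, so the product
    // stays below 1_000_000_000 and cannot overflow a u32.
    // The literal needs a type suffix (a bare `10.pow(..)` has an
    // ambiguous numeric type), and `pow` takes a `u32` exponent,
    // hence the cast from `usize`.
    value * 10_u32.pow(9 - digits as u32)
}
```

Since multiplying zero by any power of ten is still zero, the `if nanos > 0` guard is probably no longer needed with this form.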

I would do something like this.

If you need maximum performance, one issue is that you are parsing the input as UTF8 and then re-parsing as a highly restricted ASCII subset. You could consider validating the input &[u8] directly.
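
A rough sketch of what validating the bytes directly could look like, using only the standard library. The helper name split_decimal is made up for this example. Note that it is stricter than the original in at least one way (it rejects a leading +, which parse::<u64> accepts), so it is a starting point rather than a drop-in replacement. Working on bytes also sidesteps the &fraction[9..] slice in the original, which panics if byte 9 falls in the middle of a multi-byte character.

```rust
/// Split `data` at the first `.` and check that both halves consist only
/// of ASCII digits. Hypothetical helper; returns (seconds, fraction).
fn split_decimal(data: &[u8]) -> Option<(&[u8], &[u8])> {
    let (seconds, fraction) = match data.iter().position(|&b| b == b'.') {
        Some(i) => (&data[..i], &data[i + 1..]),
        // No `.` present: the fraction is an empty slice.
        None => (data, &data[data.len()..]),
    };
    // Require at least one digit before the point, digits only.
    if seconds.is_empty() || !seconds.iter().all(u8::is_ascii_digit) {
        return None;
    }
    // The fraction may be empty, but must not contain non-digits.
    if !fraction.iter().all(u8::is_ascii_digit) {
        return None;
    }
    Some((seconds, fraction))
}
```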

Oh wow. It would never have occurred to me to look for pow as a function from integers to integers. That's what 30 years of C does to your brain, I guess.

Misses an error case -- that any(|c| !c.is_ascii_digit()) clause is there for a reason.

Are there guidelines anywhere for how to do that in a readable manner? There are several other places in this program where I need to validate input that's expected to be a restricted subset of ASCII, but u8 slices and OsStr have such a limited API compared to str that it's really awkward to work with them.

The bstr crate is often recommended, there is probably other helpful stuff on crates.io as well. Unfortunately, I personally don't have much experience in this department.

I think it's easiest to express your "max 9 digits" rules by working with an integer:

let nanos = fraction.parse::<u32>().ok()?;
if nanos > 999_999_999 {
    return None;
}

That doesn't work. Suppose the input is 1.123456789123456789, the desired return value is Some(Duration::new(1, 123456789)) but parsing 123456789123456789 into a u32 will overflow so your code will return None.
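
To make the failure concrete, here is the suggested check wrapped in a small function (parse_fraction_digits is a hypothetical name used only for this illustration):

```rust
/// The "parse, then range-check" suggestion, wrapped in a function.
/// `parse_fraction_digits` is a hypothetical name for illustration.
fn parse_fraction_digits(fraction: &str) -> Option<u32> {
    // `parse::<u32>` fails on overflow, so any fraction whose value
    // exceeds u32::MAX (4_294_967_295) is rejected before the range
    // check is ever reached.
    let nanos = fraction.parse::<u32>().ok()?;
    if nanos > 999_999_999 {
        return None;
    }
    Some(nanos)
}
```

With this, "123456789" yields Some(123456789), but "123456789123456789" overflows and yields None. A 10-digit fraction such as "1234567891" fits in a u32 but is still rejected by the range check, whereas the original function truncates it to 123456789.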

One potentially-undesirable detail of the current behavior is that decimal digits beyond the 9th are truncated, so the timestamp is always rounded down. You may want a different form of rounding.

"the timestamp is always rounded down": that is intentional (again, dictated by compatibility with the data source).

Manual parsing implementation:

use std::time::Duration;
/// Parse a string (as read directly from disk, so a [u8]) as decimal
/// seconds and nanoseconds since the Unix epoch, into a Duration object.
pub fn parse_decimal_timestamp(data: &[u8]) -> Option<Duration> {
    // An f64 can only represent Unix timestamps with full nanosecond
    // precision if they are within ±2**53 nanoseconds of
    // 1970-01-01T00:00:00.000000000Z.  This is a smaller range than
    // you might expect: only from 6pm on September 18, 1969 until 6am
    // on April 15, 1970.  So, we cannot use f64 as an intermediary here.

    let mut secs: u64 = 0;
    let mut nanos: u32 = 0;

    'parse: {
        let mut chars_iter = data.iter().copied();

        // Parse full seconds
        'seconds: {
            for c in chars_iter.by_ref() {
                match c {
                    digit @ b'0'..=b'9' => {
                        secs = secs.checked_mul(10)?.checked_add((digit - b'0').into())?
                    }
                    b'.' => break 'seconds, // Jump to nanoseconds parse code.
                    _ => return None,
                }
            }

            // Skip nanosecond parse code, because no `.` was encountered.
            break 'parse;
        }

        // Parse nanoseconds
        let mut decimals_left: u32 = 9;
        while decimals_left > 0 {
            match chars_iter.next() {
                Some(digit @ b'0'..=b'9') => {
                    nanos = nanos * 10 + u32::from(digit - b'0');
                    decimals_left -= 1;
                }
                Some(_) => return None,
                None => {
                    nanos *= 10_u32.pow(decimals_left);
                    break 'parse;
                }
            }
        }

        // Validate remaining nanoseconds
        if !chars_iter.all(|c| c.is_ascii_digit()) {
            return None;
        }
    }

    Some(Duration::new(secs, nanos))
}

Godbolt
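
One thing worth checking against the "exactly the same set of strings" requirement: the original delegates the seconds part to parse::<u64>, which accepts an optional leading + and rejects an empty string. A hand-written digit loop like the one above rejects + and treats an empty seconds part (inputs such as an empty slice or .5) as zero. A small sketch of the standard-library behavior, with parse_secs as a hypothetical wrapper name:

```rust
/// Thin wrapper around the seconds parse used by the original function.
/// `parse_secs` is a hypothetical name for illustration.
fn parse_secs(s: &str) -> Option<u64> {
    // Integer `FromStr` accepts an optional `+` sign followed by digits;
    // an empty string, a `-` sign, or surrounding whitespace are errors.
    s.parse::<u64>().ok()
}
```

If the external format really does admit a leading +, a manual parser would need an explicit branch for it, and it would need to reject an empty seconds part to match the original.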

You should use char ('.' and '0' instead of "." and "0") here. It improves codegen, there is even a Clippy lint for it.
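
Applied to the first two lines of the original function, that would look roughly like this (split_fraction is a hypothetical wrapper so the snippet stands alone; the lint in question is, as far as I know, Clippy's single_char_pattern):

```rust
/// Split a timestamp string into its seconds part and its fraction with
/// trailing zeros removed, using `char` patterns instead of `&str` ones.
/// `split_fraction` is a hypothetical name for illustration.
fn split_fraction(ts: &str) -> (&str, &str) {
    // '.' and '0' are `char` patterns; behavior is identical to the
    // "." and "0" string patterns in the original.
    let (seconds, fraction) = ts.split_once('.').unwrap_or((ts, ""));
    (seconds, fraction.trim_end_matches('0'))
}
```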

Then put it back in before truncating.