Parse number from &[u8]

Is there any function in the Rust standard library to parse a number from an ASCII string which I have as &[u8] (or Vec<u8>) directly, without going through a UTF-8 string (&str or String)?

Going through String is inefficient and complicates error handling.

    let my_bytes: &[u8] = "1234".as_bytes(); // this will be read from a file or something
    let my_str = String::from_utf8(Vec::from(my_bytes)).expect("Not UTF-8");
    let my_number: i64 = my_str.parse::<i64>().expect("Not a number");

The standard library does not offer number parsing for &[u8]. Its parsing implementation on &str operates on &[u8] internally, but that’s not exposed to the user.

Going through &str is cheaper than going through String.

let my_str = std::str::from_utf8(my_bytes).expect("Not UTF-8");
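On the error-handling side, both failure modes can be folded into a single error type so the caller deals with one Result. This is only a minimal sketch: the names ParseBytesError and parse_i64_via_str are made up for illustration and are not part of std.

```rust
use std::num::ParseIntError;
use std::str::Utf8Error;

/// Hypothetical error type (not from std) combining both failure modes.
#[derive(Debug)]
enum ParseBytesError {
    /// The bytes were not valid UTF-8.
    NotUtf8(Utf8Error),
    /// The text was valid UTF-8 but not a valid i64.
    NotANumber(ParseIntError),
}

/// Parses an i64 from bytes by borrowing them as a &str (no allocation).
fn parse_i64_via_str(bytes: &[u8]) -> Result<i64, ParseBytesError> {
    let s = std::str::from_utf8(bytes).map_err(ParseBytesError::NotUtf8)?;
    s.parse::<i64>().map_err(ParseBytesError::NotANumber)
}
```

Calling parse_i64_via_str(my_bytes) then replaces the two expect calls, and the input is borrowed rather than copied into a new String.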

Possibly, validating it as ASCII is even cheaper than validating it as UTF-8. I believe that on stable Rust you can do so (and end up with a &str) without resorting to unsafe only by using an additional crate, such as the ascii crate, even though the is_ascii check it uses internally comes from std.

use ascii::AsAsciiStr;
let my_str = my_bytes.as_ascii_str().expect("Not Ascii").as_str();

There is not. I have written an extension trait to perform this behavior for the integer types I needed. It does not handle negative numbers, but adding that shouldn't be too difficult.

This code does not seem to perform any validation of the input besides whether the integer is too large, is that correct?

I suggest using the atoi crate. It is specifically designed for the use case of directly decoding integers from byte strings.

See also: ACP: Add `FromByteStr` trait with blanket impl `FromStr` · Issue #287 · rust-lang/libs-team · GitHub

Personally, if it were me and all I needed to do was parse a simple positive integer from a &[u8], I'd just write it myself:

fn parse_u64(bytes: &[u8]) -> Result<u64, ParseU64Error> {
    let mut n: u64 = 0;
    for &byte in bytes {
        let digit = match byte.checked_sub(b'0') {
            None => return Err(ParseU64Error::InvalidDigit { got: byte }),
            Some(digit) if digit > 9 => return Err(ParseU64Error::InvalidDigit { got: byte }),
            Some(digit) => {
                debug_assert!((0..=9).contains(&digit));
                u64::from(digit)
            }
        };
        n = n
            .checked_mul(10)
            .and_then(|n| n.checked_add(digit))
            .ok_or_else(|| ParseU64Error::NumberTooBig {
                bytes: bytes.to_vec(),
            })?;
    }
    Ok(n)
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=77b62ee35310333280a7db4b4f1d9ff3

This is the kind of function where it's simple and small enough that I don't usually bother with bringing in another crate for it (unless my needs are more complex). This is also partially why I'm a fan of the ACP linked above for adding it to std.
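If negative numbers are needed, the same hand-written approach extends to signed integers fairly directly. The following is only a sketch under my own assumptions: ParseI64Error and parse_i64 are hypothetical names, only a leading '-' is accepted (no '+'), and, unlike the function above, empty input is rejected rather than parsed as 0.

```rust
/// Hypothetical error type for this sketch (not from the thread).
#[derive(Debug, PartialEq)]
enum ParseI64Error {
    /// No digits were present (empty input or a lone '-').
    Empty,
    /// A byte outside b'0'..=b'9' was found.
    InvalidDigit { got: u8 },
    /// The value does not fit in an i64.
    OutOfRange,
}

/// Parses an optionally '-'-prefixed decimal integer from raw bytes.
fn parse_i64(bytes: &[u8]) -> Result<i64, ParseI64Error> {
    // Split off a leading minus sign, if there is one.
    let (negative, digits) = match bytes.split_first() {
        Some((&b'-', rest)) => (true, rest),
        _ => (false, bytes),
    };
    if digits.is_empty() {
        return Err(ParseI64Error::Empty);
    }
    let mut n: i64 = 0;
    for &byte in digits {
        if !byte.is_ascii_digit() {
            return Err(ParseI64Error::InvalidDigit { got: byte });
        }
        let digit = i64::from(byte - b'0');
        // Accumulate in the direction of the sign so that i64::MIN,
        // whose magnitude exceeds i64::MAX, can still be parsed.
        n = n
            .checked_mul(10)
            .and_then(|n| {
                if negative {
                    n.checked_sub(digit)
                } else {
                    n.checked_add(digit)
                }
            })
            .ok_or(ParseI64Error::OutOfRange)?;
    }
    Ok(n)
}
```

Accumulating downward for negative input avoids parsing the magnitude first and negating afterwards, which would overflow for i64::MIN.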

I'd have to check what I wrote (it's been a while), but it wouldn't surprise me. The methods were written solely for my use case, which involves quite a bit of validation elsewhere. I'm fairly certain the methods are only called if it's already known to be digits.

I was expecting atoi to use some optimized SIMD tricks, but no. It turns out it's just a pile of generic code, with basically no specialization. In the end, my benchmark shows it to be mostly equivalent in performance to the simple function that @BurntSushi wrote above. There can be a difference of 10-20%, but which implementation comes out on top depends on the size of the integers, and I don't see any clear pattern (e.g. it doesn't simply depend monotonically on the input size). Without more specific benchmarks, I'd say the simple code is good enough.

Benchmark:

pub enum ParseU64Error {
    InvalidDigit { got: u8 },
    NumberTooBig { bytes: Vec<u8> },
}

pub fn parse_u64(bytes: &[u8]) -> Result<u64, ParseU64Error> {
    let mut n: u64 = 0;
    for &byte in bytes {
        let digit = match byte.checked_sub(b'0') {
            None => return Err(ParseU64Error::InvalidDigit { got: byte }),
            Some(digit) if digit > 9 => return Err(ParseU64Error::InvalidDigit { got: byte }),
            Some(digit) => {
                debug_assert!((0..=9).contains(&digit));
                u64::from(digit)
            }
        };
        n = n
            .checked_mul(10)
            .and_then(|n| n.checked_add(digit))
            .ok_or_else(|| ParseU64Error::NumberTooBig {
                bytes: bytes.to_vec(),
            })?;
    }
    Ok(n)
}

use atoi::FromRadix10SignedChecked;
use std::fmt::Display;

use criterion::{criterion_main, BatchSize, Bencher, Criterion};
use rand::{distributions::uniform::SampleUniform, Rng};

fn bench<T, R>(low: T, high: T, f: fn(&[u8]) -> R) -> impl FnMut(&mut Bencher)
where
    T: SampleUniform + Display + Ord + Copy,
{
    move |b| {
        b.iter_batched(
            || rand::thread_rng().gen_range::<T, _>(low..=high).to_string(),
            |s| f(s.as_bytes()),
            BatchSize::SmallInput,
        )
    }
}

fn bench_group<T>(c: &mut Criterion, group_name: &str, low: T, high: T)
where
    T: SampleUniform + Display + Ord + Copy + FromRadix10SignedChecked,
{
    c.benchmark_group(group_name)
        .bench_function("atoi", bench(low, high, ::atoi::atoi::<u64>))
        .bench_function(
            format!("atoi/{}", std::any::type_name::<T>()),
            bench(low, high, atoi::atoi::<T>),
        )
        .bench_function("parse_num", bench(low, high, parse_u64));
}

pub fn benches() {
    let mut criterion = Criterion::default().configure_from_args();
    bench_group(&mut criterion, "bench_digit", 0u8, 9u8);
    bench_group(&mut criterion, "bench_u8", u8::MIN, u8::MAX);
    bench_group(&mut criterion, "bench_u16", u16::MIN, u16::MAX);
    bench_group(&mut criterion, "bench_u32", u32::MIN, u32::MAX);
    bench_group(&mut criterion, "bench_u64", u64::MIN, u64::MAX);
}

criterion_main!(benches);

A bit of testing shows that for performance, one should use the atoi_simd crate. It is consistently faster than either atoi or parse_u64 (or btoi and cluatoi, which are other integer parsing crates) for both signed and unsigned integers, and it is significantly faster for either long or single-digit integers.

Unfortunately, it can't really be used to parse integers in generic code, since the relevant traits are in a private module, and it doesn't seem possible to name them in your own trait bounds.
