What is the fastest way to convert bytes to numbers?

I have a binary file that stores m * n numbers; in effect, it is a table. Every number is 4 bytes. What is the fastest way to read the file and push the numbers into a Vec?
I have an implementation, but it is slow. The pseudocode looks like this:

let mut f = File::open("path").unwrap();
let mut buf = vec![0; 100000];
f.read_exact(&mut buf).unwrap();
let mut idx = 0;
let mut data = vec![0;25000];
while idx < buf.len(){
    let mut cur = std::io::Cursor::new(&buf[idx..(idx+4)]);
    let n = cur.read_f32::<byteorder::ByteOrder>().unwrap();
    data[idx/4] = n;
    idx += 4;
}

Obligatory question: are you running your code with optimizations turned on (e.g. cargo run --release)?

If there are still performance problems, decreasing the number of index checks might help. E.g. you could try to iterate over (chunk, dest) in buf.chunks_exact(4).zip(&mut data) with a for loop that calls …byteorder….read_f32(chunk) and writes the result to *dest.
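A minimal sketch of that loop, using only the standard library: it swaps in f32::from_le_bytes for the byteorder call and assumes the file is little-endian. The function name decode_f32_le is made up for illustration.

```rust
// TryInto is already in the prelude from edition 2021 on; the import
// keeps the sketch compiling on older editions too.
use std::convert::TryInto;

/// Decode little-endian f32 values from `buf` into `data`.
/// Hypothetical helper; little-endian input is an assumption.
fn decode_f32_le(buf: &[u8], data: &mut [f32]) {
    // chunks_exact yields only full 4-byte chunks, and zip stops at the
    // shorter side, so the loop body needs no manual index arithmetic.
    for (chunk, dest) in buf.chunks_exact(4).zip(data.iter_mut()) {
        // Every chunk is exactly 4 bytes long, so this conversion cannot fail.
        let bytes: [u8; 4] = chunk.try_into().unwrap();
        *dest = f32::from_le_bytes(bytes);
    }
}
```

With the buffers from the question, this would be called as decode_f32_le(&buf, &mut data), with data declared as vec![0f32; 25000].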


I am not running in release mode, but I added [profile.test] opt-level = 3 to Cargo.toml.
I also compared against numpy.fromfile, and it is faster than the code above.

Hm… core::mem::transmute?


Will try this method.

You can do this instead. I'm not sure it will actually improve performance, but it should be at least as fast as your code.

let mut f = File::open("path").unwrap();
let mut buf = vec![0; 100000];
f.read_exact(&mut buf).unwrap();

let data: Vec<_> = buf.chunks(4).map(|s| f32::from_be_bytes(s.try_into().unwrap())).collect();

Just a few notes. What's the endianness of your file? byteorder::ByteOrder is not a valid generic parameter to the .read_f32 method. If it's marginally slower than the numpy approach, it's possible that numpy is cheating a bit more, e.g. by mmapping the file and using it as an f32 array. In that case the actual file loading is done lazily by the OS and the copy is skipped.


Using the zerocopy crate, you can do this:

fn as_byte_array(floats: &mut [f32]) -> &mut [u8] {
    zerocopy::AsBytes::as_bytes_mut(floats)
}

Then you can read data directly into the buffer like this:

let mut buf = vec![0f32; 25000];
f.read_exact(as_byte_array(&mut buf)).unwrap();
// buf now contains f32 values from the file

You can also implement it without using the zerocopy crate, but then you need unsafe:

fn as_byte_array(floats: &mut [f32]) -> &mut [u8] {
    let len = floats.len();
    let ptr = floats.as_mut_ptr();
    // Safety: The pointer is valid for 4*len bytes since a f32 is four bytes,
    // and the alignment is also okay since u8 has a smaller alignment than
    // f32.
    unsafe {
        std::slice::from_raw_parts_mut(ptr.cast(), 4*len)
    }
}

...and neither f32 nor u8 have any validity invariants nor any padding bytes.
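One thing worth spelling out: reading straight into the f32 buffer copies the bytes verbatim, so the values end up interpreted in the host's byte order. Below is a rough sketch that combines the unsafe cast with an explicit little-endian fix-up. The helper name read_f32_le and the little-endian assumption are mine, not from the posts above.

```rust
use std::io::{self, Read};
use std::mem::size_of;

/// Read `count` little-endian f32 values from `reader`.
/// Hypothetical helper; little-endian input is an assumption.
fn read_f32_le<R: Read>(reader: &mut R, count: usize) -> io::Result<Vec<f32>> {
    let mut floats = vec![0f32; count];
    {
        let ptr = floats.as_mut_ptr();
        // Safety: the allocation holds count * 4 initialized bytes, u8 has
        // alignment 1, and every bit pattern is a valid f32 and a valid u8.
        let bytes = unsafe {
            std::slice::from_raw_parts_mut(ptr.cast::<u8>(), count * size_of::<f32>())
        };
        reader.read_exact(bytes)?;
    }
    // Reinterpret the raw bits as little-endian. On a little-endian host
    // u32::from_le returns its argument unchanged.
    for x in floats.iter_mut() {
        *x = f32::from_bits(u32::from_le(x.to_bits()));
    }
    Ok(floats)
}
```

Since File implements Read, something like read_f32_le(&mut f, 25000) should work for the file in the question.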


The endianness is LittleEndian.

I believe this method is more flexible.
If each number is 3 bytes, do I need to change 4*len to 3*len in from_raw_parts_mut? Would the result still be correct?
And thanks for your answer!

I am also wondering: if the numbers in one row have different sizes, do all of the above methods still work?
For example, the sizes in the table might look like:

3bytes - 4bytes - 6bytes - 2bytes
3bytes - 4bytes - 6bytes - 2bytes
3bytes - 4bytes - 6bytes - 2bytes
.................................
.................................

At least buf.chunks will not work.
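That said, chunks_exact can still split the buffer by whole rows as long as every row has the same layout. Here is a rough sketch under a big assumption: each field is an unsigned little-endian integer (the thread does not say what the 3- and 6-byte fields actually encode). FIELD_WIDTHS, read_uint_le and parse_rows are made-up names.

```rust
/// Byte widths of the fields in one row, taken from the example layout.
const FIELD_WIDTHS: [usize; 4] = [3, 4, 6, 2];

/// Decode an unsigned little-endian integer of 1 to 8 bytes.
/// Assumption: the fields are unsigned integers; the thread does not say.
fn read_uint_le(bytes: &[u8]) -> u64 {
    let mut padded = [0u8; 8];
    // Low-order bytes come first in little-endian, so the zero padding
    // lands in the high-order positions.
    padded[..bytes.len()].copy_from_slice(bytes);
    u64::from_le_bytes(padded)
}

/// Split `buf` into fixed-size rows and decode every field.
/// A trailing partial row, if any, is ignored.
fn parse_rows(buf: &[u8]) -> Vec<[u64; 4]> {
    let row_len: usize = FIELD_WIDTHS.iter().sum();
    buf.chunks_exact(row_len)
        .map(|mut row| {
            let mut fields = [0u64; 4];
            for (field, &width) in fields.iter_mut().zip(FIELD_WIDTHS.iter()) {
                // Peel the next field off the front of the row.
                let (head, tail) = row.split_at(width);
                *field = read_uint_le(head);
                row = tail;
            }
            fields
        })
        .collect()
}
```

If some of the fields are floats or signed values, only the per-field decoding step would need to change; the row splitting stays the same.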

Your parsing code is way too complicated. There is no point in using a Cursor if you're not doing random-access operations on the buffer, and you're just reading a single number. You can read it directly. Moreover, there is no point in manually slicing the buffer and potentially incurring indexing costs when a slice can be directly used as a reader. Here is how I would write your code:

let mut buf = &std::fs::read("path").expect("could not read path");
let mut data = Vec::with_capacity(buf.len() / std::mem::size_of::<f32>());
while let Ok(n) = buf.read_f32() {
    data.push(n);
}

Please don't spam pings.

Okay, sorry!

It shows the error:

no method named `read_f32` found for reference `&Vec<u8>` in the current scope
method not found in `&Vec<u8>`

You can't just change it to 3*len in my method. Each float must consist of four bytes when you are done manipulating the bytes. So it would require additional modification to the data after reading it.


Right, my bad.

use byteorder::{ReadBytesExt, LE};
let buf = std::fs::read("path").expect("could not read path");
let mut data = Vec::with_capacity(buf.len() / std::mem::size_of::<f32>());
let mut reader = buf.as_slice();
while let Ok(n) = reader.read_f32::<LE>() {
    data.push(n);
}


Or simply use transmute.


What I posted would be the preferred way of transmuting the bytes to floats.

