How to read u32, f64 etc. from a file?

Hi experts,

This is how I read one u32

fn read_one_u32<R: BufRead>(reader: &mut R, result: &mut u32) -> std::io::Result<()> {
    let mut buffer = [0_u8; std::mem::size_of::<u32>()];
    reader.read_exact(&mut buffer)?;
    *result = u32::from_le_bytes(buffer);
    Ok(())
}
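
(An equivalent version that returns the value instead of writing through an out-parameter; I'm assuming little-endian data, as in the snippet above:)

```rust
use std::io::Read;

// Same logic, but returning the value directly; assumes the file
// stores the integer in little-endian byte order.
fn read_one_u32_ret<R: Read>(reader: &mut R) -> std::io::Result<u32> {
    let mut buffer = [0_u8; std::mem::size_of::<u32>()];
    reader.read_exact(&mut buffer)?;
    Ok(u32::from_le_bytes(buffer))
}

fn main() -> std::io::Result<()> {
    // 9994 (0x270A) in little-endian byte order
    let bytes = [0x0A_u8, 0x27, 0x00, 0x00];
    let value = read_one_u32_ret(&mut &bytes[..])?;
    assert_eq!(9994, value);
    Ok(())
}
```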

Can I just read 4 bytes directly into result, getting rid of buffer?

Thanks in advance.

It depends on what format the data is in.

If it's a string then you would use my_string.parse() (powered by std::str::FromStr).

Alternatively, if you know whether it'll be big-endian or little-endian binary data you can use the ReadBytesExt extension trait from the byteorder crate.

It lets you write code like this:

use std::io::Cursor;
use byteorder::{BigEndian, ReadBytesExt};

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().unwrap());
assert_eq!(768, rdr.read_u16::<BigEndian>().unwrap());
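
For comparison, the same two reads can be done with std alone (no external crate), using read_exact plus u16::from_be_bytes:

```rust
use std::io::Read;

fn main() {
    let data = [2_u8, 5, 3, 0];
    let mut rdr = &data[..]; // &[u8] implements Read
    let mut buf = [0_u8; 2];

    rdr.read_exact(&mut buf).unwrap();
    assert_eq!(517, u16::from_be_bytes(buf)); // 0x0205

    rdr.read_exact(&mut buf).unwrap();
    assert_eq!(768, u16::from_be_bytes(buf)); // 0x0300
}
```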

It is not a string. It is binary data. For example, 00 00 27 0A should be read as 9994_u32.

I changed my code as follows. Do you think it is correct?

let mut f = File::open(r#"path/to/my/file"#)?;
{
    let r = &mut f;
    let mut buffer = vec![0_u8; 100];
    // read() may fill only part of the buffer; read_exact() fills all 100 bytes or errors
    r.read_exact(&mut buffer)?;
    let mut rdr = Cursor::new(buffer);
    let file_code = rdr.read_u32::<BigEndian>()?;
    assert_eq!(9994_u32, file_code);
}

Thanks

Your original snippet interprets the bytes as a little-endian integer (from_le_bytes), but this one interprets them as big-endian. One of them will produce incorrect results, but it's not possible for me to tell which; you'll need to refer to the file format documentation.

Thanks for your reply. Both little endian and big endian are used in the file. I have to be careful with that.

You could use unsafe to transmute &mut u32 into &mut [u8; 4], but then your code will read incorrect numbers whenever the host's endianness differs from the endianness used in the file.
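
A safe way to see that portability problem without any unsafe: std exposes native-endian conversion via from_ne_bytes, which is exactly what a transmute-style read amounts to, and it only agrees with the file's byte order on matching hosts. A small sketch:

```rust
fn main() {
    let bytes = [0x00_u8, 0x00, 0x27, 0x0A]; // 9994 stored big-endian

    let be = u32::from_be_bytes(bytes); // always 9994, on any host
    let ne = u32::from_ne_bytes(bytes); // what a transmute-style read would see

    assert_eq!(9994, be);
    if cfg!(target_endian = "little") {
        // On little-endian hosts (x86, most ARM) the native read disagrees:
        assert_eq!(0x0A27_0000, ne);
    } else {
        assert_eq!(9994, ne);
    }
}
```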

I wouldn't worry about the buffer — it's just 4 bytes on the stack. It may get optimized away. It's nothing compared to the relatively big amount of work happening on every use of Read and BufRead.

  • If your goal is to write less boilerplate code, and you control the file format, then use bincode to write and read data.

  • If your goal is to make reading of files faster and/or avoid bloating binary with unnecessary code, then try reading in the whole file into memory first, so that you make fewer syscalls and I/O error checks. Parsing from a simple in-memory slice will optimize better, because bounds checks are simpler and have fewer side effects than I/O and handling of io::Error.

  • In extreme cases, you could try memory-mapping the file, but it's not worth using mmap unless you're really squeezing out the last 1% of perf or dealing with files larger than available RAM.
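
The second bullet can be sketched with std alone: read the whole file (or just the header block) into a Vec<u8> once, then parse integers out of the slice with from_be_bytes / from_le_bytes. The hypothetical file below is created by the example itself so it's self-contained; the 9994 file code is the big-endian value from earlier in the thread:

```rust
use std::convert::TryInto;
use std::fs;
use std::io::Write;

// Parse a big-endian u32 at `offset` from an in-memory slice.
// A simple bounds check replaces per-read I/O error handling.
fn be_u32_at(data: &[u8], offset: usize) -> Option<u32> {
    let bytes: [u8; 4] = data.get(offset..offset + 4)?.try_into().ok()?;
    Some(u32::from_be_bytes(bytes))
}

fn main() -> std::io::Result<()> {
    // Hypothetical file, written here just to make the sketch runnable.
    let path = std::env::temp_dir().join("header_demo.bin");
    fs::File::create(&path)?.write_all(&[0x00, 0x00, 0x27, 0x0A])?;

    let data = fs::read(&path)?; // one read of the whole file into memory
    assert_eq!(Some(9994), be_u32_at(&data, 0));
    assert_eq!(None, be_u32_at(&data, 2)); // out of bounds: no panic, no io::Error

    fs::remove_file(&path)?;
    Ok(())
}
```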


The struct includes both little-endian u32 and big-endian u32 fields. Can bincode handle that while still saving boilerplate?

This sounds promising. The file could be extremely large. I'm going to read the file header and load the required data block into memory.

How can I get the size of available RAM? Or how can I call a WIN32 api from Rust?

Thanks.

I wouldn't check the amount of available RAM in your program to decide whether to use mmap. Just pick one: always use file IO, or always use mmap.


I want to make the program as robust as possible. If there is not enough RAM, mmap is preferred. If the file is on a network disk, file IO is preferred.

Anyway, this post is not about mmap vs file IO. That can be later.

Thanks.

bincode is its own format, so it decides itself how the data is stored (typically in native endian). If you need to parse an existing format that you can't change, then you can't use bincode.

One possibly unappreciated option here is nom. I've seen it most often used as a text parsing library, but it actually started as and is better suited to being a binary format parser.


That library sounds great. I should spend some time reading the source code, I suppose.

Do you know any library that is capable of parsing IBM hexadecimal floating-point numbers? And VAX floating point numbers?

Thanks in advance.
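
Regarding IBM hexadecimal floats: I don't know of a crate offhand, but the single-precision layout (1 sign bit, a 7-bit base-16 exponent biased by 64, and a 24-bit fraction) is simple enough to decode by hand. A sketch, not checked against any reference implementation:

```rust
// IBM single-precision hex float: seeeeeee ffffffff ffffffff ffffffff
// value = (-1)^s * 0.f (24-bit fraction) * 16^(e - 64)
fn ibm32_to_f64(bits: u32) -> f64 {
    let sign = if bits & 0x8000_0000 != 0 { -1.0 } else { 1.0 };
    let exponent = ((bits >> 24) & 0x7F) as i32 - 64;
    let fraction = (bits & 0x00FF_FFFF) as f64 / (1u64 << 24) as f64;
    sign * fraction * 16_f64.powi(exponent)
}

fn main() {
    // 0x42640000: exponent 66 - 64 = 2, fraction 0x64/0x100 = 0.390625,
    // so 0.390625 * 16^2 = 100.0
    assert_eq!(100.0, ibm32_to_f64(0x4264_0000));
    assert_eq!(-100.0, ibm32_to_f64(0xC264_0000));
    assert_eq!(0.0, ibm32_to_f64(0x0000_0000));
}
```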

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.