Reading an Array of Ints from a File? [solved]


#1

How can I read N*2 bytes from a file and end up with an array of N short ints?

This feels like a job for transmute, but that’s scary.

Note: this is a large file, and I don’t want to do this byte-by-byte. The endianness of the file matches my machine, so I should be able to do this for the cost of the file read alone.

C could do this…

Thanks,

-kb


#2

What C does is effectively equivalent to a transmute :slight_smile:

But in Rust, the preferred way to handle this kind of endianness-dependent IO is to use the byteorder crate. In particular, I think you will find read_u16_into() or its i16 cousin useful.

To use that method…

// Add extern crate declaration (at the root of your crate)
extern crate byteorder;

// Bring ReadBytesExt trait and endianness marker in scope (in your module)
use byteorder::{ReadBytesExt, NativeEndian};

// Setup a buffer to hold the bytes
let mut buffer = [0u16; 4096];

// Call read_u16_into() on your favorite Reader (file, socket...)
reader.read_u16_into::<NativeEndian>(&mut buffer[..])?;

#3

Hi, this thread discussed IO using memory mapping: [Solved] Polymorphic IO: file (memmap) or stdin - read overlapping chunks


#4

I like the look of that, but this newbie needs to know: what is “reader”? I tried both a file and a BufReader:

#![feature(rustc_private)]
extern crate byteorder;
    
fn main() {
    let mut image_bytes: [u8; 17523200] = [0; 17523200];
 
    use std::fs::File;
    //use std::io::BufReader;
    use byteorder::{ReadBytesExt, NativeEndian};

    let mut buffer = [0u16; 17523200/2];
    let file = File::open("image_out");
    //let mut buf_reader = BufReader::new(file);
    file.read_u16_into::<NativeEndian>(&mut buffer[..]).unwrap();
}

Complains:

   |
12 |     file.read_u16_into::<NativeEndian>(&mut buffer[..]).unwrap();
   |          ^^^^^^^^^^^^^
   |
   = note: the method `read_u16_into` exists but the following trait bounds were not satisfied:
           `std::result::Result<std::fs::File, std::io::Error> : byteorder::ReadBytesExt`

What simple thing am I missing?

Thanks,

-kb


#5

File::open returns a Result<File, io::Error>, which is Ok() on success or Err() on a file error. You have to unwrap it first, for example using expect(…).


#6

Yes that was it. Thanks!

I’ve made that mistake multiple times before but previously the compiler complaint made more sense to me.

Next problem: I can’t create such a large array. Even if I put it in a Box, it seems to want to allocate on the stack first. (And then copy it to the heap? Silly.)

Getting somewhat off the original topic… I guess I need a Vec. If I allocate it with with_capacity(), will that still be C-fast? No extra copies?

This version runs without exploding:

#![feature(rustc_private)]
extern crate byteorder;

fn main() {
    use std::fs::File;
    use std::io::BufReader;
    use byteorder::{ReadBytesExt, NativeEndian};

    let mut buffer = Vec::with_capacity(17523200/2);
    let file = File::open("image_out").expect("failed to open file");
    let mut buf_reader = BufReader::new(file);
    buf_reader.read_u16_into::<NativeEndian>(&mut buffer[..]).expect("failed to read");
    println!("buffer.len(): {}\n", buffer.len());
}

But it doesn’t read anything. Len is 0.

-kb


#7

The Docs state:

fn read_u16_into<T: ByteOrder>(&mut self, dst: &mut [u16]) -> Result<()>

Reads a sequence of unsigned 16 bit integers from the underlying reader.

The given buffer is either filled completely or an error is returned. If an error is returned, the contents of dst are unspecified.

read_u16_into() operates on the raw [u16] slice (obtained by slicing buffer[..]); it does not modify the Vec’s metadata.

After a successful read (no error!), you have to adjust the Vec’s length metadata manually:

unsafe { buffer.set_len(17523200/2); }

Talking about the next step after reading from the file: even when using a Vec to allocate the array in heap memory, it is good practice to hand the data buffer over as a slice.

let result = useful_function(&buffer[..]);
// where
// fn useful_function(buf: &[u16]) -> WhatEver;

#8

Thanks for the prompt answer, frustrating it took until today for me to find the time to look at it carefully.

But it isn’t looking clear to me.

My current code:

#![feature(rustc_private)]
extern crate byteorder;

fn main() {
    use std::fs;
    use std::fs::File;
    use std::io::BufReader;
    use byteorder::{ReadBytesExt, NativeEndian};

    let file_name = "./image_out";

    let file_size = fs::metadata(file_name).expect("could not get file meta data").len() as usize;
    if file_size % 2 != 0 {
        panic!("file was not even number of bytes long");
    }
    let short_array_size = file_size/2;

    let file = File::open(file_name).expect("failed to open file");
    let mut buf_reader = BufReader::new(file);

    let mut buffer = Vec::with_capacity(short_array_size);
    buf_reader.read_u16_into::<NativeEndian>(&mut buffer[..]).expect("failed to read");
    unsafe { buffer.set_len(short_array_size); }

    println!("{}\n", buffer[0]);
    println!("{}\n", buffer[short_array_size-1]);
}

The prints at the end both give me zero, which is not what is in the file. The read isn’t happening?

-kb


#9

Try calling set_len before read_u16_into. As written, &mut buffer[..] is a zero-length slice (the Vec’s length is still 0), so nothing can be read into it.


#10

That’s it!

And when I do a release build and run timed Rust and C versions back to back repeatedly from the command line, the result is…

…both are the same speed! I am not doing any extra buffer copies. Rust is fast.

Final code:

#![feature(rustc_private)]
extern crate byteorder;

fn main() {
    use std::fs;
    use std::fs::File;
    use std::io::BufReader;
    use byteorder::{ReadBytesExt, NativeEndian};

    let file_name = "./image_out";

    let file_size = fs::metadata(file_name).expect("could not get file meta data").len() as usize;
    if file_size % 2 != 0 {
        panic!("file was not even number of bytes long");
    }
    let short_array_size = file_size/2;

    let file = File::open(file_name).expect("failed to open file");
    let mut buf_reader = BufReader::new(file);

    let mut buffer: Vec<u16> = Vec::with_capacity(short_array_size);
    unsafe { buffer.set_len(short_array_size); }
    buf_reader.read_u16_into::<NativeEndian>(&mut buffer[..]).expect("failed to read");

    println!("{}\n", buffer[0]);
    println!("{}\n", buffer[short_array_size-1]);
}

Very cool, thanks to all.

-kb, the Kent who is slowly learning this stuff.


#11

A related question: what if I had a composite file with more stuff in it and didn’t want to read to the end, but just wanted the next N bytes read into an array of N/2 shorts? How do I make the read cover less than the whole file?

Seems I might mess with the slice part of buffer[..], but I tried putting an upper bound on the slice and didn’t get that to work. I admit I don’t really understand what that slice means.

Thanks,

-kb


#12

Side note: you should set up a Cargo.toml with the byteorder crate as a dependency instead of using #![feature(rustc_private)]. See https://doc.rust-lang.org/cargo/guide/dependencies.html


#13

I was wondering what the #![feature(rustc_private)] was all about; it was in an example I was lifting from and seemed the least of my questions.

Thanks,

-kb


#14

You do that by passing a slice whose length equals the amount of data you want to read to either the read or read_exact method (both part of the std::io::Read trait).

The above methods need a &mut [u8] but you have a Vec<u16>, so an adjustment needs to be made to convert a &mut [u16] to a &mut [u8]. One way to do that is:

let s: &mut [u8] = unsafe {
        // get a mut pointer to the Vec's data (not the Vec struct itself), offset past
        // the 2 bytes already consumed at the start of the buffer for the u16
        let ptr = (buffer.as_mut_ptr() as *mut u8).offset(2);
        // form a u8 slice of the desired length, starting at `ptr`
        std::slice::from_raw_parts_mut(ptr, <desired_length>)
    };
buf_reader.read_exact(s).expect("failed to read");

You need to ensure the slice is pointing at valid memory, of course, so buffer has to be sized appropriately.

You can also consider using std::io::Cursor to wrap the underlying Vec - it’ll maintain a position in the buffer and adjust it as bytes are written to it.


#15

Cool. I got a read_exact() version to work, even with a transmute (important to know how to do a transmute).

I also got the original version reading the amount of data I wanted by getting the slice notation right (important to know how to keep as much code as possible out of unsafe).

Thanks.

Still seems odd that I need an unsafe statement to read an even number of bytes from a file as a series of shorts.

-kb


#16

You can do without unsafe if you use a Vec<u8> buffer to read into, and just write the u16s to it manually (i.e. without going via read_u16_into).


#17

If I understand what you mean, that wouldn’t be as fast, right?

What I am doing now is going at IO speed, no extra data copies, no visiting of the data until some consumer algorithm wants to look at the data, no shifting and adding or anything like that.

-kb


#18

I meant instead of

buf_reader.read_u16_into::<NativeEndian>(&mut buffer[..]).expect("failed to read");

you’d write (buffer is now a Vec<u8>, and byteorder’s WriteBytesExt is also in scope)

buffer.write_u16::<NativeEndian>(
      buf_reader.read_u16::<NativeEndian>().expect("failed to read")).unwrap();

For reading raw bytes (i.e. your “read N bytes” example), you’d use the normal bulk read buf_reader.read_exact(&mut buffer[start..end]).


#19

Consider what you said:

“Still seems odd that I need an unsafe statement to read an even number of bytes from a file as a series of shorts.”

Your response indicates that what you actually meant was this:

“Still seems odd that I need an unsafe statement to reinterpret a region of memory with an alignment of 1 byte as a sequence of shorts with no additional cost.”

Fundamentally, at some level, this requires unsafe because it requires the programmer to make an assertion about two types and the alignment of said types. The compiler doesn’t know what you mean.

The good news here is that you can build a safe abstraction that encodes these assumptions while maintaining the zero cost requirement. That’s what byteorder is. (Although, byteorder punts on the alignment question by doing unaligned loads/stores. You could build a different, and still safe, API that doesn’t punt on alignment.)

On the flip side, you can do it without maintaining the zero cost requirement, but in a way that is completely safe.

There is nuance here!