Count trailing zero bytes of a binary file

The answer to this question suggests using this shell line to count the trailing zeroes of a binary file:
zerobytes=$(( $( ggrep -aPo '\0*$' $file | wc -c ) - 1 ))
But this is very slow. I am trying to write a Rust program to accomplish this, but I don't know how to iterate over the file in reverse. Here is my current script, which reads the whole file into memory. I would prefer not to read the whole file into memory.

#!/bin/sh
#![allow()] /*
            exec cargo-play $0 -- "$@"
                        */

use std::env;
use std::fs::File;
// use std::io::BufReader;
use std::io::Read;

fn main() {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];
    let mut f = File::open(filename).unwrap();
    // let f = BufReader::new(f);
    // let f = f.collect().rev();

    let mut buffer = Vec::new();
    // read the whole file
    f.read_to_end(&mut buffer).unwrap();

    let mut count = 0;
    for b in (&buffer).iter().rev() {
        if b == &0 {
            count += 1
        } else {
            // println!("{}", b);
            break;
        }
    }
    println!("{}", count);
}

What you have looks fine. .rev() goes over an iterator backwards.

It reads the whole file into memory. I want to avoid that. I want something more like BufReader::new(f).rev().

Rust doesn't provide a way to read files backwards because OSes don't provide that capability. You can only read a file forwards, but you can jump to an arbitrary position in the file (at least if it's a normal file). You can use seek with SeekFrom::End to jump to a position relative to the end of the file, e.g. SeekFrom::End(-1024), and read the last N bytes of the file. If you need to examine more bytes, you can seek further back using seek with SeekFrom::Current in a loop (for example, to read the file in 1 KiB chunks, seek 2 KiB back after each read).
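A minimal sketch of that approach (`count_trailing_zeros` and `CHUNK` are made-up names, and this glosses over concurrent modification of the file):

```rust
use std::env;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const CHUNK: u64 = 1024;

fn count_trailing_zeros(path: &str) -> std::io::Result<u64> {
    let mut f = File::open(path)?;
    // total length, found by seeking to the end
    let mut remaining = f.seek(SeekFrom::End(0))?;
    let mut count = 0u64;
    let mut buf = vec![0u8; CHUNK as usize];

    while remaining > 0 {
        // the first chunk read (the file's tail) may be shorter than CHUNK
        let chunk = remaining.min(CHUNK);
        // jump to the start of the chunk just before the unscanned tail
        f.seek(SeekFrom::Start(remaining - chunk))?;
        let slice = &mut buf[..chunk as usize];
        f.read_exact(slice)?;
        // scan this chunk back to front
        for &b in slice.iter().rev() {
            if b != 0 {
                return Ok(count);
            }
            count += 1;
        }
        remaining -= chunk;
    }
    Ok(count)
}

fn main() -> std::io::Result<()> {
    for path in env::args().skip(1) {
        println!("{}: {}", path, count_trailing_zeros(&path)?);
    }
    Ok(())
}
```

Only one CHUNK-sized buffer is ever held in memory, however large the file is.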


By the way, here is a shorter version of your original code, which is also slightly faster (because fs::read pre-allocates a buffer based on the size of the input file):

use std::env;
use std::fs;

fn main() {
    let filename = env::args().nth(1).unwrap();
    let buffer = fs::read(filename).unwrap();
    let count = buffer.iter().rev().take_while(|b| **b == 0).count();
    println!("{}", count);
}

Still looking for an answer? I've written one - as a learning exercise - that processes a file from end to beginning, counting the trailing zeros. It's pretty fast with a 1024-byte buffer over a 1 MB file of all zeros:

endee@penguin:~/rust/trailz$ time trailz test.zs
test.zs: 1048576

real    0m0.010s
user    0m0.003s
sys     0m0.008s

It reads the file one BUFF_SIZE at a time. Change BUFF_SIZE to suit yourself. I'm a Rust noob so there are probably more idiomatic ways to do what it does. Useful tips welcomed ; )

//
// trailz -- utility to count number of trailing zeros in files
//
// usage: trailz file1 file2 ...
//
// outputs filename and number of trailing zeros
//
// Notes: 
//
//     Empty files and files with no trailing zeros both return 0
//
//     It skips itself as long as its path ends with 'trailz'
//
// Developer Notes:
//
//     If it is run with 'cargo test' then it will create several test files
//     
//     You can then run it with 'cargo run *.test' and it will process the
//     created files as input
//
//     If it cannot open a file, e.g. the file does not exist, then it will 
//     report "failed to open <filename> : <error message>" and continue
//
//     BUFF_SIZE defines the number of bytes in the buffer. It is set to 4 for
//     testing.  Change it to something larger when you make a release version.

use std::io::prelude::*;
use std::fs::File;
use std::fs;
use std::env;

const BUFF_SIZE : usize = 4;    //  CHANGE ME!!!!!!

fn main() {
    for arg in env::args() {
        if arg.ends_with("trailz") { continue; }
        match File::open(&arg) {
            Ok(_) => (),
            Err(err) => {
                println!("failed to open {} : {}", &arg, err);
                continue
            },
        }
        let t = tail_size(&arg).unwrap();
        println!("{}: {}", arg, t);
    } 
}

fn tail_size(name : &str) -> Option<u32> {
    let mut f = File::open(name).unwrap();
    let mut b = [0; BUFF_SIZE];
    let f_len = fs::metadata(&name).unwrap().len();
    if f_len == 0 {
        return Some(0); 
    }
    let mut pages = f_len / BUFF_SIZE as u64;
    if f_len % BUFF_SIZE as u64 != 0 { pages += 1 };
    let mut page : u64 = pages - 1;
    let mut t = 0;

    while page >= 0 as u64 {
        match f.seek(std::io::SeekFrom::Start(page * BUFF_SIZE as u64)) {
            Ok(_) => (),
            Err(err) => {println!("seek error in {} : error {}", name, err); return None; },
        };
        let n = f.read(&mut b).unwrap();

        let mut pos = n - 1 as usize;
        while pos >= 0 as usize {
            if b[pos] == 0 {
                t += 1;
            } else {
                return Some(t);
            }
            if pos == 0 {break;}
            else { pos -= 1; }
        }
        if page == 0 { break; }
        else { page -= 1; }
    }

    return Some(t);
}

// utility used to load test files
#[allow(dead_code)]
fn load_file(name : &str, data : Vec<u8>) -> std::io::Result<()> {
    let mut pos = 0;
    let mut b = File::create(name)?;

    while pos < data.len() {
        let bytes_written = b.write(&data[pos..])?;
        pos += bytes_written;
    }
    return Ok(());
}

#[test]
fn first_test() {
    let data = vec![0,0,0,0, 0,0,0,0, 0,1,0,0, 0,0,0,];
    load_file("first_test.test", data);
    let t = tail_size("first_test.test").unwrap();
    assert_eq!(t,5);
}

#[test]
fn no_trailing_zeros() {
    let data = vec![1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,1,];
    load_file("no_trailing_zeros.test", data);
    let t = tail_size("no_trailing_zeros.test").unwrap();
    assert_eq!(t,0);
}

#[test]
fn empty_file() {
    let data = vec![];
    load_file("empty_file.test", data);
    let t = tail_size("empty_file.test").unwrap();
    assert_eq!(t,0);
}

#[test]
fn last_byte_is_zero() {
    let data = vec![0,0,0,0, 0,0,0,0, 0,1,0,0, 0,0,1,0,];
    load_file("last_byte_is_zero.test", data);
    let t = tail_size("last_byte_is_zero.test").unwrap();
    assert_eq!(t,1);
}

I think f.read(&mut b) here should probably be f.read_exact(&mut b)? Either that, or there needs to be some logic to call read again without seeking when n < BUFF_SIZE.

If I'm reading it correctly, your program will skip all bytes within each BUFF_SIZE segment which happen to not be read by the read call.

As for making things more idiomatic, I think the only thing I'd recommend improving is the error handling. The tail_size function returns an Option, but never returns None. And errors are unwrapped, while it's generally better to push them up for things above to handle. What would you think of making tail_size return a Result<u32, std::io::Error> instead, and bubbling up the errors to the caller?


Ok, I've watched a video on rust error handling (me.degrees_of_ignorance -= 1;) and converted it all to use Result and '?'

Here's the revised code (with a note on read_exact() added to the comments):

//
// trailz -- utility to count number of trailing zeros in files
//
// usage: trailz file1 file2 ...
//
// outputs filename and number of trailing zeros
//
// Notes: 
//
//     Empty files and files with no trailing zeros both return 0
//
//     It skips itself as long as its path ends with 'trailz'
//
// Developer Notes:
//
//     If it is run with 'cargo test' then it will create several test files
//     
//     You can then run it with 'cargo run *.test' and it will process the
//     created files as input
//
//     If it cannot open a file, e.g. the file does not exist, then it will 
//     report "failed to open <filename> : <error message>" and continue
//
//     BUFF_SIZE defines the number of bytes in the buffer. It is set to 4 for
//     testing.  Change it to something larger when you make a release version.

use std::io::prelude::*;
use std::fs::File;
use std::fs;
use std::env;
use std::io::Error;

const BUFF_SIZE : usize = 4;    //  CHANGE ME!!!!!!

fn main() -> Result<(), Error>{
    for arg in env::args() {
        if arg.ends_with("trailz") { continue; }
        match File::open(&arg) {
            Ok(_) => (),
            Err(err) => {
                println!("failed to open {} : {}", &arg, err);
                continue
            },
        }
        let t = tail_size(&arg)?;
        println!("{}: {}", arg, t);
    } 
    Ok(())
}

fn tail_size(name : &str) -> Result<u32, Error> {
    let mut f = File::open(name).unwrap();
    let mut b = [0; BUFF_SIZE];
    
    // Is the file empty?
    let f_len = fs::metadata(&name).unwrap().len();
    if f_len == 0 {
        return Ok(0); 
    }

    // calculate number of pages to process; last page in file may not be a full page
    let mut pages = f_len / BUFF_SIZE as u64;
    if f_len % BUFF_SIZE as u64 != 0 { pages += 1 };
    
    let mut t = 0;     // the count of _t_railing zeros

    // process pages, starting with the last one
    // Note that indexing has to be managed in longhand
    // because rust doesn't have a 'for pages..0 by -1' 
    // form for ranges
    let mut page : u64 = pages - 1;
    while page >= 0 as u64 {
        f.seek(std::io::SeekFrom::Start(page * BUFF_SIZE as u64))?;

        // read into the page buffer. Note that the last page in the file/first page read
        // is extremely likely to have less than BUFF_SIZE bytes in it => using read_exact()
        // is almost guaranteed to cause an error on the first read_exact() and leave the 
        // buffer in an undefined state.
        let n = f.read(&mut b)?;

        let mut pos = n - 1 as usize;
        while pos >= 0 as usize {
            if b[pos] == 0 {
                t += 1;
            } else {
                return Ok(t);
            }
            if pos == 0 {break;}
            else { pos -= 1; }
        }
        if page == 0 { break; }
        else { page -= 1; }
    }

    Ok(t)
}

// utility used to load test files
#[allow(dead_code)]
fn load_file(name : &str, data : Vec<u8>) -> std::io::Result<()> {
    let mut pos = 0;
    let mut b = File::create(name)?;

    while pos < data.len() {
        let bytes_written = b.write(&data[pos..])?;
        pos += bytes_written;
    }
    Ok(())
}

#[test]
fn first_test() {
    let data = vec![0,0,0,0, 0,0,0,0, 0,1,0,0, 0,0,0,];
    load_file("first_test.test", data);
    let t = tail_size("first_test.test").unwrap();
    assert_eq!(t,5);
}

#[test]
fn no_trailing_zeros() {
    let data = vec![1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,1,];
    load_file("no_trailing_zeros.test", data);
    let t = tail_size("no_trailing_zeros.test").unwrap();
    assert_eq!(t,0);
}

#[test]
fn empty_file() {
    let data = vec![];
    load_file("empty_file.test", data);
    let t = tail_size("empty_file.test").unwrap();
    assert_eq!(t,0);
}

#[test]
fn last_byte_is_zero() {
    let data = vec![0,0,0,0, 0,0,0,0, 0,1,0,0, 0,0,1,0,];
    load_file("last_byte_is_zero.test", data);
    let t = tail_size("last_byte_is_zero.test").unwrap();
    assert_eq!(t,1);
}



Awesome! The new version's error handling is great!

I appreciate your attention to read_exact throwing an error. I think, though, that this version still has the same problem as the original one: what happens if read reads fewer bytes than it should?

I had thought "the right amount" should always be BUFF_SIZE, but you're right that it's different for the last slice. But I think there's still a problem. What if the following happens?

Say you're reading a file

[0,0,0,0, 1,0,0,0, 0,0,0,1, 0,0,0]

The first read will be at position 12. It can read 1-3 bytes. If it reads 3 bytes, it's all good. If it reads, say, 2 bytes, then you'll miscount the zeros.

The second read is at position 8. It can read 1-4 bytes. If it reads 4 bytes, then we're good. But what if it just happens to only read 2 or 3? Then we're not going to see that 1 and we're going to keep counting zeros.

I hope this helps show the problem! If the buff size is 4, it's very unlikely to happen. read will only partially read if it has read some data and would need to wait for disk I/O, and files that small aren't likely to run into that. It's also a lot less likely to happen to a file you've just written, as the OS will still be caching that file in memory.

But I think if the files start being big enough (many MBs), and there's a big enough delay between when the file is created and when it's read, you'd start getting incorrect results. Read::read's documentation gives the exact contract here: read does not, in general, read the maximum number of bytes possible.
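For illustration, the retry loop that read_exact performs internally looks roughly like this (`fill` is a hypothetical helper name; Read::chain is used in the demo only because reading across a chain boundary forces a partial read):

```rust
use std::io::{Cursor, Read};

// Keep calling `read` until the buffer is full -- essentially what
// `Read::read_exact` does for you.
fn fill<R: Read>(r: &mut R, buf: &mut [u8]) -> std::io::Result<()> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = r.read(&mut buf[filled..])?;
        if n == 0 {
            // the reader ran out of data before the buffer was full
            return Err(std::io::Error::new(
                std::io::ErrorKind::UnexpectedEof,
                "file ended early",
            ));
        }
        filled += n;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // the first `read` on this chained reader can return at most 2 bytes,
    // so a single bare `read` into a 4-byte buffer would come up short
    let mut r = Cursor::new(vec![1u8, 2]).chain(Cursor::new(vec![3u8, 4]));
    let mut buf = [0u8; 4];
    fill(&mut r, &mut buf)?;
    assert_eq!(buf, [1, 2, 3, 4]);
    Ok(())
}
```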

You're right that the solution isn't read_exact with BUFF_SIZE unconditionally. But I think to be fully correct the code does need to ensure that read has actually read the right number of bytes. Maybe read f_len % BUFF_SIZE bytes (when f_len % BUFF_SIZE != 0) for the first segment?

Ah, now I understand what the problem is. It didn't occur to me that File::read was an inherently unreliable operation! Looking at the source for File shows that a File is just a wrapper for the local file system's file type, so results may vary depending on which platform you're running on. (I'm on a Chromebook.) Heh, this exercise is teaching me so much.

If we assume that the call to fs::metadata() gives us an accurate file length, then we can calculate the size of each page and use read_exact() instead of read. All pages should be BUFF_SIZE, except for the last one in the file/first one read. The read for that end page would be f_len % BUFF_SIZE long. I'll try that out in the morning.

Of course, all bets are off if something else is modifying the file while it is being scanned. I discovered this in testing when I used the same filename for all the test cases. cargo test seems to spread the tests across multiple threads so one test could clobber another test's data, leading to random test failures.

Picking up help requests from this forum is a much better way to improve my Rust than working through Hacker Rank problems. : )


Nope. read_exact() only reads into a buffer, and the number of bytes it reads is determined by the buffer size. Buffer sizes have to be declared with constant values at compile time. The size of the incomplete page at the end of the file is not known at compile time. Therefore you can't declare a buffer for it and so you can't read_exact() into it.

Any suggestions, anyone?

That's not true.
You can use any mutable slice, not just a static buffer.
For example, you could use a fixed size buffer and for the last chunk you can "slice it down" to the required size:

f.read_exact(&mut b[..last_chunk_size])

You could also use a Vec and resize it accordingly.
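As a small illustration of the slicing approach, with a Cursor standing in for a File:

```rust
use std::io::{Cursor, Read};

fn main() -> std::io::Result<()> {
    // a 10-byte "file": two full 4-byte chunks plus a 2-byte remainder
    let mut f = Cursor::new(vec![7u8; 10]);
    let mut b = [0u8; 4];

    // full chunks fill the whole buffer
    f.read_exact(&mut b)?;
    f.read_exact(&mut b)?;

    // the short final chunk (10 % 4 == 2): slice the buffer down,
    // so read_exact only has to fill two bytes
    let last = 10 % 4;
    f.read_exact(&mut b[..last])?;
    assert_eq!(&b[..last], &[7u8, 7][..]);
    Ok(())
}
```

Note that read_exact fails with UnexpectedEof if the reader ends before the slice is full, which is exactly the safety net a bare read lacks.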


Interesting. I tried something along the lines of this:

let mut slice = b[0..BUFF_SIZE];
f.read_exact[&slice];

And it wouldn't compile.

Here it is with f.read_exact(&mut b[..read_size]); which does compile:

// usage: trailz file1 file2 ...
//
// outputs filename and number of trailing zeros
//
// Notes: 
//
//     Empty file and files with no trailing zeros both return 0
//
//     It skips itself as long as its path ends with 'trailz'
//
// Developer Notes:
//
//     If it is run with 'cargo test' then it will create several test files
//     
//     You can then run it with 'cargo run *.test' and it will process the
//     created files as input
//
//     If it cannot open a file, e.g. the file does not exist, then it will 
//     report "failed to open <filename> : <error message>" and continue
//
//     BUFF_SIZE defines the number of bytes in the buffer. It is set to 4 for
//     testing.  Change it to something larger when you make a release version.

use std::io::prelude::*;
use std::fs::File;
use std::fs;
use std::env;
use std::io::Error;

const BUFF_SIZE : usize = 4;    //  CHANGE ME!!!!!!

fn main() -> Result<(), Error>{
    for arg in env::args() {
        if arg.ends_with("trailz") { continue; }
        match File::open(&arg) {
            Ok(_) => (),
            Err(err) => {
                println!("failed to open {} : {}", &arg, err);
                continue
            },
        }
        let t = tail_size(&arg)?;
        println!("{}: {}", arg, t);
    } 
    Ok(())
}

fn tail_size(name : &str) -> Result<u32, Error> {
    let mut f = File::open(name)?;
    let mut b = [0; BUFF_SIZE];
    
    // Is the file empty?
    let f_len = fs::metadata(&name)?.len();
    if f_len == 0 {
        return Ok(0); 
    }

    // calculate number of pages to process; last page in file may not be a full page
    let mut pages = f_len / BUFF_SIZE as u64;
    let rem =  f_len % BUFF_SIZE as u64;
    let mut rem = rem as usize;  // safe because rem < BUFF_SIZE, which is a usize

    if rem != 0 { pages += 1 };
    
    let mut t = 0;     // the count of _t_railing zeros

    // process pages, starting with the last one
    // Note that indexing has to be managed in longhand
    // because rust doesn't have a 'for pages..0 by -1' 
    // form for ranges
    let mut page : u64 = pages - 1;
    while page >= 0 as u64 {
        
        f.seek(std::io::SeekFrom::Start(page * BUFF_SIZE as u64))?;

        // read into the page buffer. Note that the last page in the file/first page read
        // is extremely likely to have less than BUFF_SIZE bytes in it.
        // rem holds the length of the remainder after the last whole page.

        let mut read_size = BUFF_SIZE;
        if rem != 0 {
            read_size = rem;
            rem = 0;
        }

        f.read_exact(&mut b[..read_size])?;

        let mut pos = read_size - 1 as usize;
        while pos >= 0 as usize {
            if b[pos] == 0 {
                t += 1;
            } else {
                return Ok(t);
            }
            if pos == 0 {break;}
            else { pos -= 1; }
        }
        if page == 0 { break; }
        else { page -= 1; }
    }

    Ok(t)
}

// utility used to load test files
#[allow(dead_code)]
fn load_file(name : &str, data : Vec<u8>) -> std::io::Result<()> {
    let mut pos = 0;
    let mut b = File::create(name)?;

    while pos < data.len() {
        let bytes_written = b.write(&data[pos..])?;
        pos += bytes_written;
    }
    Ok(())
}

#[test]
fn first_test() {
    let data = vec![0,0,0,0, 0,0,0,0, 0,1,0,0, 0,0,0,];
    load_file("first_test.test", data).unwrap();
    let t = tail_size("first_test.test").unwrap();
    assert_eq!(t,5);
}

#[test]
fn no_trailing_zeros() {
    let data = vec![1,1,1,0, 0,0,1,1, 1,0,1,0, 0,0,0,1,];
    load_file("no_trailing_zeros.test", data).unwrap();
    let t = tail_size("no_trailing_zeros.test").unwrap();
    assert_eq!(t,0);
}

#[test]
fn empty_file() {
    let data = vec![];
    load_file("empty_file.test", data).unwrap();
    let t = tail_size("empty_file.test").unwrap();
    assert_eq!(t,0);
}

#[test]
fn last_byte_is_zero() {
    let data = vec![0,0,0,0, 0,0,0,0, 0,1,0,0, 0,0,1,0,];
    load_file("last_byte_is_zero.test", data).unwrap();
    let t = tail_size("last_byte_is_zero.test").unwrap();
    assert_eq!(t,1);
}
