Is this the most efficient way to perform a highly-iterative bitwise de/encoding task?


#1

I’m by no means a veteran programmer. Let’s say, for practice, that I’m making a library to more efficiently store a genome by cramming four 2-bit nucleotides into a byte. I have a rudimentary library file:

// dna/src/lib.rs
use std::mem;

#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(u8)]
pub enum Nucleotide {
	Adenine = 0b00,
	Cytosine = 0b01,
	Guanine = 0b10,
	Thymine = 0b11,
}

pub fn encode(ns: &[Nucleotide]) -> u8 {
	ns.iter().fold(0u8, |byte, &n| byte << 2 | n as u8) //bitwise functional wizardry! nah, we just write the two bits to the
}                                                           //least significant end of the byte and then shift it left to make room for the next two

pub fn decode(byte: u8) -> Vec<Nucleotide> {
	(0u8..4).rev().map(|i| {
		unsafe {mem::transmute(byte >> i*2 & 0b11)} //and just do the reverse here.
	}).collect()                                        //i wish clike enums were less of a PITA to convert between...
}

pub fn from_byte(c: u8) -> Nucleotide {
	match c {
		b'A' => Nucleotide::Adenine,
		b'C' => Nucleotide::Cytosine,
		b'G' => Nucleotide::Guanine,
		b'T' => Nucleotide::Thymine,
		_ => panic!("char does not resolve to a Nucleotide"),
	}
}

pub fn from_nclt(n: Nucleotide) -> u8 {
	match n {
		Nucleotide::Adenine => b'A',
		Nucleotide::Cytosine => b'C',
		Nucleotide::Guanine => b'G',
		Nucleotide::Thymine => b'T',
	}
}

And an even more rudimentary main program that reads a file of ACGT’s and encodes it in my format:

// dna/src/main.rs
extern crate dna;

use std::io;
use std::io::{Read, Write};

fn main() {
    let stdin = io::BufReader::new(io::stdin()); //Q: does Read::bytes() already buffer for you?
    let mut stdout = io::stdout();
    let encoded: Vec<u8> = stdin.bytes()
        .map(|b| dna::from_byte(b.unwrap())) //convert the ASCII letters into my enum format
        .collect::<Vec<_>>() //Q: would it be better to make an iterator of length-4 arrays instead of collecting them all...
        .chunks(4)           //...and then dividing them up afterwards?
        .map(dna::encode) //take those enums and jam 4 of 'em at a time into a single byte...
        .collect();
    stdout.write(&encoded).ok(); //and write those compressed bytes.
}

I’m very happy about the highly functional style in which Rust allows me to write this, but I’m wondering if all these abstractions are truly zero-cost as advertised. As an amateur, are there any performance pitfalls I’m overlooking that I need to consider as I develop my coding skills?


#2

So you don’t need to wrap stdin() in a buffer, since it’s already buffered. However, that buffer is global and each call to .read() must lock a mutex. Read::bytes() makes a single-byte read on each call to .next() so that could get expensive (but not on BufReader since it’s just copying out of its internal, unlocked buffer).

If you do io::stdin().lock(), that will give you exclusive access to the buffered stdin. You don’t have to worry about the details of this locking since you’re not doing any threading.

I’ll add more later, I’m getting volun-told to drive for a beer run.

Edit-addendum:

For the most part, this looks pretty good. decode() returning Vec could be better; since you know it’s always going to be 4 nucleotides to a byte, it could return [Nucleotide; 4] instead. It’s more of an imperative style but it stays entirely on the stack:


fn decode(byte: u8) -> [Nucleotide; 4] {
    let mut ret = [Nucleotide::Adenine; 4];
    
    for i in (0u8 .. 4).rev() {
        ret[i] = unsafe { mem::transmute(byte >> i * 2 & 0b11) };
    }

    ret
}

I agree with converting an enum back from its discriminator, but you could wrap this in a safe interface:

impl Nucleotide {
    pub fn from_discriminator(dis: u8) -> Nucleotide {
        if dis > 3 {
            // `{:b}` says to write out this argument in binary
            panic!{"Invalid Nucleotide discriminator: {:b}", dis);
        }

        unsafe { mem::transmute(dis) }
    }
}

As for not allocating intermediate buffers, you can totally do this. You can thread the [Nucleotide; 4] directly to encode() and that directly to stdout(), though if you’re writing one byte at a time you probably want to use stdout().lock() (for the same reason you want to use stdin().lock()):

fn main() {
    let mut stdout = io::stdout().lock();

    io::stdin().lock().bytes()
        .map(Result::unwrap)
        .map(dna::from_byte)
        .map(|nucleotides| dna::encode(&nucleotides))
        .map(|byte| stdout.write(&[byte]).unwrap())
        .last(); // Exhaust the iterator without collecting to anything.
}

To me, this is kind of abusing functional style though. It looks uglier than it would in a mixed functional/imperative:

fn main() {
    let mut stdout = io::stdout().lock();

    for byte in io::stdin().lock().bytes().map(Result::unwrap) {
        let nucleotides = dna::from_byte(byte);
        let encoded = dna::encode(&nucleotides);
        stdout.write(&[encoded]).unwrap();
    }
}

#3

Nobody could promise you that every abstraction in rust is zero cost, and it’s not the case. Zero cost abstractions are possible.


#4

Precisely why I want to know which abstractions have cost, and which don’t!


#5

Edited my original reply.


#6

I like the ideas but your changes but there are a few issues:

  1. from_byte() semantically takes an ASCII byte (A, C, G, or T) and gives the corresponding variant; it doesn’t return any type of list of Nucleotides, but only one. It can’t be fed to encode(), so the unbuffered plan would probably involve nested for-loops.
  2. Stdin::lock() doesn’t seem to be designed for unthreaded programs. Trying to use it here makes the borrow checker complain about the lock’s scope not surrounding the entirety of main(), despite the lock only being able to be taken from within main().

main.rs:7:14: 7:25 error: borrowed value does not live long enough
main.rs:7     let stdin = io::stdin().lock().bytes();
                          ^~~~~~~~~~~
main.rs:6:11: 18:2 note: reference must be valid for the block at 6:10...
main.rs: 6 fn main() {
main.rs: 7     let stdin = io::stdin().lock().bytes();
main.rs: 8     {
main.rs: 9         let mut stdout = io::stdout();
main.rs:10         let encoded: Vec<u8> = stdin
main.rs:11             .map(|b| dna::from_byte(b.unwrap()))
           ...
main.rs:7:2: 7:41 note: ...but borrowed value is only valid for the statement at 7:1
main.rs:7     let stdin = io::stdin().lock().bytes();
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
main.rs:7:2: 7:41 help: consider using a `let` binding to increase its lifetime
main.rs:7     let stdin = io::stdin().lock().bytes();
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#7

Okay, so I was looking at it slightly wrong. You do need an intermediate buffer but you can do it still on the stack, since you only have to accumulate 4 nucleotides before you can output a byte.

As for stdin().lock(), since the lifetime of .lock() is tied to the value of stdin(), it probably doesn’t like being chained. You just need to lift the stdin() value to a let-binding before locking it.

I’ll show both of these below:

fn main() {
    // Bind this to the local scope before locking it to satisfy lifetime constraints.
    let stdin = io::stdin();
    // Shadow `stdin` with the locked version. Only this one needs `mut`.
    let mut stdin = stdin.lock();

    // Do the same with stdout.
    let stdout = io::stdout();
    let mut stdout = stdout.lock();

    // Our intermediate buffer and the accumulation index.
    let i = 0;
    let mut nucleotides = [Nucleotide::Adenosine; 4];

    // Accumulate nucleotides until you have 4, then encode them as a byte and write them to stdout.
    for byte in stdin.bytes().map(Result::unwrap) {
        nucleotides[i] = dna::from_byte(byte);

        // We've accumulated 4 nucleotides, encode and write to stdout.
        if i == 3 {
            let out_byte = dna::encode(&nucleotides);
            stdout.write(&[out_byte]).unwrap();
            i = 0;
        }
    }
    
    // Forgot this before but it's probably a good idea.
    // Flushes the stdout buffer.
    stdout.flush().unwrap();
}

Obviously this might not work too well unless you ensure that the length of the input genome is divisible by 4 nucleotides. Your version would fill in the rest with Nucleotide::Adenosine (since .chunks(4) will yield the remainder in the last iteration if the length isn’t evenly divisible by 4) but this version will just not output the last length % 4 of the base pairs. It’s up to you which is preferable, or if you want to decide on a more consistent default behavior, like emitting an error.