We have a bioinformatics library that outputs a large ndarray to a binary format. It works column by column. I can't figure out how to make it run faster by, for example, converting the next column to binary while writing the current column. Maybe this is something async can help with.
The single-threaded example below is simplified but captures the gist of the problem.
I'd love to see how to make this multithreaded with something like async.
Thanks to anyone who can help with this. (I'd think it would be a common pattern.)
- Carl
use ndarray as nd;
use std::{
    fs::File,
    io::{BufWriter, Write},
};
use thiserror::Error;

#[derive(Error, Debug)]
pub enum BedError {
    #[error("Attempt to write illegal value to BED file. Only 0,1,2,missing allowed. '{0}'")]
    BadValue(String),
    #[error(transparent)]
    IOError(#[from] std::io::Error),
}

pub fn write(
    filename: &str,
    iid_count: usize,
    sid_count: usize,
    high: f64,
) -> Result<(), BedError> {
    assert!(iid_count % 4 == 0, "iid_count must be a multiple of 4");
    let iid_count_div4 = iid_count / 4;
    // Stand-in for the real data: every element is just below `high`
    let val = nd::Array::from_elem((iid_count, sid_count), high - 0.01);
    let mut writer = BufWriter::new(File::create(filename)?);
    for column in val.axis_iter(nd::Axis(1)) {
        // Convert each column into a bytes_vector (four 2-bit values per byte)
        let mut bytes_vector: Vec<u8> = vec![0; iid_count_div4]; // inits to 0
        for (iid_i, &v0) in column.iter().enumerate() {
            let byte = if v0 < 4.0 {
                (v0 / 4.0f64).floor() as u8
            } else {
                return Err(BedError::BadValue(filename.to_string()));
            };
            let i_div_4 = iid_i / 4;
            let i_mod_4 = iid_i % 4;
            bytes_vector[i_div_4] |= byte << (i_mod_4 * 2);
        }
        // Write the bytes vector
        writer.write_all(&bytes_vector)?;
    }
    Ok(())
}

#[test]
fn test1() {
    write("test1.bed", 12, 10, 4.0).unwrap();
}

#[test]
fn test2() {
    let result = write("test2.bed", 12, 10, 5.0);
    assert!(result.is_err());
}
Output:
running 2 tests
test test1 ... ok
test test2 ... ok
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Errors:
Compiling playground v0.0.1 (/playground)
Finished test [unoptimized + debuginfo] target(s) in 2.88s
Running unittests (target/debug/deps/playground-942e3b5cb80c1398)
Doc-tests playground
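
To make the question more concrete, here is a rough sketch of the kind of overlap I have in mind, using a worker thread and a bounded channel from the standard library rather than async. It is only an illustration under several assumptions: the names `pack_column` and `write_overlapped` are hypothetical, the columns are plain `Vec<u8>` slices of 2-bit codes instead of an ndarray of `f64`, and errors are reported as `String`/`io::Error` instead of `BedError`. It also assumes a Rust version with `std::thread::scope` (1.63 or later).

```rust
use std::io::{self, Write};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical helper: pack one column of 2-bit codes (0..=3), four per byte.
fn pack_column(column: &[u8]) -> Result<Vec<u8>, String> {
    let mut bytes = vec![0u8; (column.len() + 3) / 4];
    for (i, &code) in column.iter().enumerate() {
        if code > 3 {
            return Err(format!("illegal value {code} at row {i}"));
        }
        bytes[i / 4] |= code << ((i % 4) * 2);
    }
    Ok(bytes)
}

/// Hypothetical sketch: a worker thread packs columns while the calling
/// thread writes the columns that are already packed.
fn write_overlapped<W: Write>(columns: &[Vec<u8>], mut writer: W) -> io::Result<()> {
    // Bounded channel: the worker can get at most 2 columns ahead of the
    // writer, so memory use stays small even for very large arrays.
    let (tx, rx) = sync_channel::<Result<Vec<u8>, String>>(2);
    thread::scope(|s| {
        // Worker: packs columns in order and sends each result.
        s.spawn(move || {
            for column in columns {
                let packed = pack_column(column);
                let failed = packed.is_err();
                // Stop after an error, or if the writer side has gone away.
                if tx.send(packed).is_err() || failed {
                    break;
                }
            }
        });
        // Writer: receives packed columns in the order they were sent.
        for packed in rx {
            let bytes = packed.map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            writer.write_all(&bytes)?;
        }
        writer.flush()
    })
}
```

Calling `write_overlapped(&columns, BufWriter::new(File::create(filename)?))` would be the analogue of the `write` function above. Column order is preserved because there is a single worker and the channel is first-in, first-out. Whether this is actually faster presumably depends on how the conversion cost compares with the write cost, which I have not measured.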