I want to use Rust to work with variable-length substrings of biological sequence data, such as "GATA", "CTGACTAGCTGG" or "CATGCATGCATGCATGCATGCATGCATG", or something much longer still.
Could someone help me understand what I'm trying to do in terms of encoding, which types are involved, and how I should work with those types so that my program behaves correctly?
I'd like to read such substrings as bytes (&[u8]), then encode them as unique integers to use as the keys in a hashmap, and ultimately be able to make these keys readable again as strings for output.
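To make the intended round trip concrete (bytes in, hashmap key, readable string out), here is a minimal sketch that sidesteps integer encoding entirely and uses the byte slices themselves as keys. This is a different technique from the integer encoding attempted in the snippets further down; `count_kmers` and the example sequence are hypothetical names chosen for illustration, and it assumes the input is plain ASCII.

```rust
use std::collections::HashMap;

/// Hypothetical helper: count every window of length `k` in `seq`,
/// keyed by the raw bytes of the window (no integer encoding involved).
fn count_kmers(seq: &[u8], k: usize) -> HashMap<&[u8], usize> {
    let mut counts = HashMap::new();
    if k == 0 {
        return counts; // `windows(0)` would panic, so bail out early
    }
    for window in seq.windows(k) {
        *counts.entry(window).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let seq = b"GATAGATA";
    let counts = count_kmers(seq, 4);
    for (kmer, n) in &counts {
        // ASCII sequence data is valid UTF-8, so this conversion succeeds.
        println!("{}\t{}", std::str::from_utf8(kmer).unwrap(), n);
    }
}
```

Because `&[u8]` already implements `Hash` and `Eq`, this works for any substring length; whether it is fast or compact enough depends on the data.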
Strings of different lengths seem to need integers of different sizes, but I want to write a program that can process substrings of any length. The code examples below are very repetitive: each one just swaps in a different integer type, with the output I get shown in comments.
use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
use std::mem;
fn main() {
    // This compiles and works as hoped:
    let mut bs = "NNNN".as_bytes();
    eprintln!("{:?}", bs); // [78, 78, 78, 78]
    let ui = bs.read_u32::<BigEndian>().unwrap();
    eprintln!("{}", ui); // 1313754702
    let mut bs = [0u8; mem::size_of::<u32>()];
    bs.as_mut().write_u32::<BigEndian>(ui).expect("Unable to write");
    println!("{:?}", bs); // [78, 78, 78, 78]
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNN
If I try to use u64 in place of u32 I get the following output:
[78, 78, 78, 78]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:23:41
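The panic is consistent with `read_u64` needing eight bytes while "NNNN" supplies only four, so the reader runs out of input. As a rough sketch of one way around that, using only the standard library rather than byteorder, a short input can be left-padded with zero bytes before conversion. `pack_be` is a hypothetical helper, and the padding choice is an assumption: it means the original length has to be tracked separately to recover the string.

```rust
/// Hypothetical helper: pack up to 8 bytes into a u64 (big-endian),
/// left-padding shorter inputs with zero bytes.
fn pack_be(bytes: &[u8]) -> Option<u64> {
    if bytes.len() > 8 {
        return None; // more than 64 bits of input does not fit
    }
    let mut buf = [0u8; 8];
    // Copy the input into the low-order (rightmost) end of the buffer.
    buf[8 - bytes.len()..].copy_from_slice(bytes);
    Some(u64::from_be_bytes(buf))
}

fn main() {
    println!("{:?}", pack_be(b"NNNN")); // Some(1313754702)
    println!("{:?}", pack_be(b"NNNNNNNNN")); // None (9 bytes)
}
```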
Trying with a longer string, over 50 characters, I get the following:
    // First trying with u32
    // This compiles but the output is misleading
    let mut bs = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".as_bytes();
    let u_face = bs.read_u32::<BigEndian>().unwrap();
    eprintln!("{}", u_face); // 1313754702
    let mut bs = [0u8; mem::size_of::<u32>()];
    bs.as_mut().write_u32::<BigEndian>(u_face).expect("Unable to write");
    println!("{:?}", &bs); // [78, 78, 78, 78]
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNN

    // Now with u64
    // This also compiles but the output is different
    let mut bs = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".as_bytes();
    let u_face = bs.read_u64::<BigEndian>().unwrap();
    eprintln!("{}", u_face); // 5642533481369980494
    let mut bs = [0u8; mem::size_of::<u64>()];
    bs.as_mut().write_u64::<BigEndian>(u_face).expect("Unable to write");
    println!("{:?}", &bs); // [78, 78, 78, 78, 78, 78, 78, 78] (Note, this is not representing all the Ns we started with)
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNNNNNN (Same again, less than what we started with)

    // Now with u128
    // This also compiles but, again, output is different
    let mut bs = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".as_bytes();
    let u_face = bs.read_u128::<BigEndian>().unwrap();
    eprintln!("{}", u_face); // 104086371058169412353502821096776158798
    let mut bs = [0u8; mem::size_of::<u128>()];
    bs.as_mut().write_u128::<BigEndian>(u_face).expect("Unable to write");
    println!("{:?}", &bs); // [78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78] (Still not all the Ns we started with)
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNNNNNNNNNNNNNN (Same again, less than what we started with)
}
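The truncated outputs above match what `read_u32`, `read_u64` and `read_u128` do: they consume only the first 4, 8 and 16 bytes of the slice, so a fixed-width integer built from raw ASCII can never hold more than 16 characters. One common way to stretch this, sketched here purely as an assumption about what might suit the data, is to spend 2 bits per base instead of 8. The `encode` and `decode` helpers are hypothetical, they assume the alphabet is restricted to A, C, G and T (so "N" is rejected), and the length `k` must be carried alongside the integer to decode it.

```rust
/// Hypothetical 2-bit encoding (A=0, C=1, G=2, T=3) of up to 32 bases.
/// Returns None for longer inputs or for any byte outside ACGT.
fn encode(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None; // 32 bases * 2 bits fills a u64
    }
    let mut value = 0u64;
    for &base in kmer {
        let bits = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        value = (value << 2) | bits;
    }
    Some(value)
}

/// Inverse of `encode`; `k` is the original length, stored separately.
fn decode(mut value: u64, k: usize) -> String {
    let mut bases = vec![b'A'; k];
    // Fill from the last base backwards, reading 2 bits at a time.
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(value & 3) as usize];
        value >>= 2;
    }
    String::from_utf8(bases).expect("only ASCII bases are written")
}

fn main() {
    let key = encode(b"GATA").unwrap();
    println!("{} -> {}", key, decode(key, 4)); // 140 -> GATA
}
```

Even with 2 bits per base, a `u128` tops out at 64 bases, so truly unbounded lengths would still need a growable key such as `Vec<u8>` or a big-integer type.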