Understanding different outcomes using u32, u64, and u128 when encoding a string to &[u8] to an integer and back to a string

I want to be able to use Rust to work with variable-length substrings of biological sequence data, something like "GATA" or "CTGACTAGCTGG" or "CATGCATGCATGCATGCATGCATGCATG", or something much longer.

Could someone help me understand what I'm trying to do in terms of encoding, the different types involved, and how to work with those types so that my program works correctly?

I'd like to read such substrings as bytes (&[u8]), then encode them as unique integers to use as keys in a HashMap, and ultimately be able to turn those keys back into readable strings for output.

Strings of different sizes seem to want to be encoded into integers of different sizes, but I want to write a program that can process substrings of any length. The code examples below are very repetitive: they just switch in different parameters for the integer type I'm trying to work with, and show in comments the output I get.

use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
use std::mem;
fn main() {
    //  This compiles and works as hoped:
    let mut bs = "NNNN".as_bytes();
    eprintln!("{:?}", bs); // [78, 78, 78, 78]
    let ui = bs.read_u32::<BigEndian>().unwrap();
    eprintln!("{}", ui); // 1313754702
    let mut bs = [0u8; mem::size_of::<u32>()];
    bs.as_mut().write_u32::<BigEndian>(ui).expect("Unable to write");
    println!("{:?}", bs); //  [78, 78, 78, 78]
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNN

If I try to use u64 in place of u32 I get the following output:

[78, 78, 78, 78]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:23:41

Trying with a longer string, over 50 characters, I get the following:

// First trying with u32
// This compiles but the output is misleading
    let mut bs = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".as_bytes();
    let u_face = bs.read_u32::<BigEndian>().unwrap();
    eprintln!("{}", u_face); // 1313754702
    let mut bs = [0u8; mem::size_of::<u32>()];
    bs.as_mut().write_u32::<BigEndian>(u_face).expect("Unable to write");
    println!("{:?}", &bs); // [78, 78, 78, 78]
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNN

    // Now with u64
    // This also compiles but the output is different
    let mut bs = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".as_bytes();
    let u_face = bs.read_u64::<BigEndian>().unwrap();
    eprintln!("{}", u_face); // 5642533481369980494
    let mut bs = [0u8; mem::size_of::<u64>()];
    bs.as_mut().write_u64::<BigEndian>(u_face).expect("Unable to write");
    println!("{:?}", &bs); // [78, 78, 78, 78, 78, 78, 78, 78] (Note, this is not representing all the Ns we started with)
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNNNNNN (Same again, less than what we started with)

    // Now with u128
    // This also compiles but, again, output is different
    let mut bs = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".as_bytes();
    let u_face = bs.read_u128::<BigEndian>().unwrap();
    eprintln!("{}", u_face); // 104086371058169412353502821096776158798
    let mut bs = [0u8; mem::size_of::<u128>()];
    bs.as_mut().write_u128::<BigEndian>(u_face).expect("Unable to write");
    println!("{:?}", &bs); // [78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78] (Still not all the Ns we started with)
    println!("{}", std::str::from_utf8(&bs).unwrap()); // NNNNNNNNNNNNNNNN (Same again, less than what we started with)
}

u32 means an integer that is stored in 32 bits. Since there are eight bits per byte, this means that a u32 is four bytes long, and you can't store more than four bytes of data in a u32. When you call read_u32, you are reading only the first four bytes of the string and ignoring the rest. Conversely, read_u64 needs eight bytes of input, which is why calling it on the four-byte "NNNN" fails with an UnexpectedEof error.
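As a minimal sketch using the same byteorder calls as in your code, reading a u32 from an eight-byte slice consumes only the first four bytes and leaves the rest behind:

use byteorder::{BigEndian, ReadBytesExt};

fn main() {
    // read_u32 consumes exactly four bytes from the front of the slice;
    // anything after them is simply left unread.
    let mut bs = "NNNNNNNN".as_bytes(); // 8 bytes
    let ui = bs.read_u32::<BigEndian>().unwrap();
    assert_eq!(ui, 1_313_754_702); // only the first b"NNNN" was encoded
    assert_eq!(bs, b"NNNN");       // the remaining four bytes are still here
}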

Computer processors can only do arithmetic on numbers with specific, fixed sizes like u32 and u64. Rust's built-in integer types roughly correspond to the types that computer hardware supports directly. There is no built-in type for "integer of arbitrary size."
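For instance, the three types you tried always occupy exactly 4, 8, and 16 bytes, no matter how long the input string is:

use std::mem;

fn main() {
    // Fixed-width integer types always occupy the same number of bytes.
    assert_eq!(mem::size_of::<u32>(), 4);   // can hold at most a 4-byte substring
    assert_eq!(mem::size_of::<u64>(), 8);   // at most 8 bytes
    assert_eq!(mem::size_of::<u128>(), 16); // at most 16 bytes
}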

You can just keep the substrings as &[u8] or Vec<u8> values, and use these as the keys of your hashmap. There's no need to encode them as integers. Hashing works just as well on byte strings as it does on integers.
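Here's a minimal sketch of that idea (the sequence, window length, and variable names are just for illustration): byte slices of any length work directly as HashMap keys, and they can be printed back out as strings at the end.

use std::collections::HashMap;

fn main() {
    // Count every 4-byte window of a sequence, keyed by the raw bytes.
    let seq = "GATACTGACTAGCTGGGATA";
    let mut counts: HashMap<&[u8], u32> = HashMap::new();

    for window in seq.as_bytes().windows(4) {
        *counts.entry(window).or_insert(0) += 1;
    }

    // Turn the byte-slice keys back into strings for output.
    for (key, count) in &counts {
        println!("{}\t{}", std::str::from_utf8(key).unwrap(), count);
    }
}

If the keys need to outlive the loaded sequence, an owned Vec<u8> key (e.g. window.to_vec()) works the same way.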


That's great to know, thanks again!
