How to parse Vec<u8> to i32

I have a file that contains nothing but integers. The first 21 characters in the file are:

000004795110719122020

I'm trying to read this file and print these characters to the terminal like this:

use std::fs;

fn main() {
    let file = fs::read("C:\\Users\\mthel\\.julia\\datadeps\\CPS 202012\\dec20pub.dat").expect("uh oh");
    for val in 0..21 {
        println!("{}", file[val]);
    }
}

This prints:

48
48
48
48
48
52
55
57
53
49
49
48
55
49
57
49
50
50
48
50
48

I see that fs::read returns a Result<Vec<u8>>, so how do I convert these values to "normal" integers and have them display correctly (and so I can ultimately do some math with them)?

1 Like

The Vec<u8> is representing the bytes of the file. If you want to read something other than bytes, you have to parse the file, and how you do that depends on how the data in the file is encoded. If it's just a list of ASCII digits, then the simplest thing to do is use fs::read_to_string, and then use str::parse on each byte:

use std::fs;

fn main() {
    let text = fs::read_to_string("C:\\Users\\mthel\\.julia\\datadeps\\CPS 202012\\dec20pub.dat").expect("uh oh");
    for i in 0..text.len() {
        println!("{}", str::parse::<i32>(&text[i]));
    }
}

This will probably break if you have a more complex encoding.

3 Likes

@skysch I'm getting an error because it won't let me index into a string

Ah, you have to do something like &text[i..i+1]. Strings are sliced by byte ranges.
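Putting that together, here is a minimal sketch of the slicing approach. It assumes the text is pure ASCII digits (a one-byte slice of a multi-byte character would panic), the helper name `parse_digits` is made up for illustration, and a string literal stands in for the file contents. Note that `parse` returns a `Result`, which has to be unwrapped before it can be printed with `{}`.

```rust
// Hypothetical helper: parse every one-byte slice of an ASCII digit string.
fn parse_digits(text: &str) -> Vec<i32> {
    (0..text.len())
        .map(|i| {
            // text[i..i + 1] is a one-byte string slice; slicing by byte
            // range is only safe here because the input is assumed ASCII.
            text[i..i + 1].parse::<i32>().expect("not a digit")
        })
        .collect()
}

fn main() {
    // Stand-in for the first 21 characters of the file from the question.
    let text = "000004795110719122020";
    for n in parse_digits(text) {
        println!("{}", n);
    }
}
```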

2 Likes

use std::fs;

fn main() {
    let text = fs::read_to_string("C:\\Users\\mthel\\.julia\\datadeps\\CPS 202012\\dec20pub.dat").expect("uh oh");
    for ch in text.chars() {
        // to_digit takes the radix as an argument; 10 means decimal.
        println!("{}", ch.to_digit(10).expect("not a digit"));
    }
}

Seems more plausible

8 Likes

So can someone help me understand what's going on in my original example? I understand that the Vec<u8> represents the bytes of the file and that each byte is made up of 8 bits (and that there's one byte for each character in the file). So the read function loads a byte for each character into a vector that I can index into. However, when I call println!, what gets printed out appears to be ASCII decimal values. Is there some conversion going on under the hood? Does this conversion only take place when you call println! or other functions that are intended to display the bytes?

This explanation of std::mem::transmute (and why NOT to use it, and what to use instead) provides some interesting alternatives. They might not be exactly what you want, but they might inspire you.

let raw_bytes = [0x78, 0x56, 0x34, 0x12];

// use `u32::from_ne_bytes` instead
let num = u32::from_ne_bytes(raw_bytes);
// or use `u32::from_le_bytes` or `u32::from_be_bytes` to specify the endianness
let num = u32::from_le_bytes(raw_bytes);
assert_eq!(num, 0x12345678);
let num = u32::from_be_bytes(raw_bytes);
assert_eq!(num, 0x78563412);

1 Like

println is just printing the u8 values you read from the file in the manner you requested. You can print them in other ways if you prefer, by using the format specifiers accepted by println.
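As a rough illustration, here are a few of the ways one byte can be formatted. The helper name `render` is made up for this sketch; the format specifiers themselves are the standard ones from std::fmt.

```rust
// Hypothetical helper: format one byte four different ways.
fn render(byte: u8) -> [String; 4] {
    [
        format!("{}", byte),         // decimal integer (what the question saw)
        format!("{:#x}", byte),      // hexadecimal with a 0x prefix
        format!("{:08b}", byte),     // binary, zero-padded to 8 digits
        format!("{}", byte as char), // the character this byte encodes
    ]
}

fn main() {
    // 48 is the byte that encodes the character '0' in ASCII.
    for s in render(48) {
        println!("{}", s);
    }
}
```

For the byte 48 this prints 48, 0x30, 00110000, and 0, one per line.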

1 Like

So can someone help me understand what's going on in my original example? I understand that the Vec<u8> represents the bytes of the file and that each byte is made up of 8 bits (and that there's one byte for each character in the file). So the read function loads a byte for each character into a vector that I can index into.

Correct

However, when I call println! , what gets printed out appears to be ASCII decimal values.

Yes

Is there some conversion going on under the hood?

Yeah, basically. These bytes (u8) represent each character in your file, and they are encoded as ASCII. But the Rust data type is a u8, an unsigned 8-bit integer. So even though the file intends them to represent ASCII characters, your code will print them as it would print any other integer: by converting each one into a decimal string.

Just like if you said, let i: i32 = 314; println!("{}", i);, it would take the 32-bit binary representation of i, and convert it to a base-10 string.

Does this conversion only take place when you call println! or other functions that are intended to display the bytes?

Hmm, well, the conversion from u8 to text is happening in u8's implementation of std::fmt::Display. The various formatting macros in the standard library, and in other libraries such as logging libraries for example, use the various traits in std::fmt to convert from data types to text, mainly Display and Debug.
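To make the role of Display concrete, here is a small sketch. The wrapper type `AsciiByte` is hypothetical and exists only for this example: it holds the same u8 but implements Display differently, so the same `{}` placeholder produces a character instead of a number.

```rust
use std::fmt;

// Hypothetical newtype: a byte that should be displayed as a character.
struct AsciiByte(u8);

impl fmt::Display for AsciiByte {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Reinterpret the byte as a char before writing it out.
        write!(f, "{}", self.0 as char)
    }
}

fn main() {
    let raw: u8 = 52;
    println!("{}", raw);            // u8's Display: prints 52
    println!("{}", AsciiByte(raw)); // our Display: prints 4
}
```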

The following formatting macros exist in the standard library, and they can format text:

  • print! (prints to stdout)
  • println! (prints to stdout with newline character)
  • eprint! (prints to stderr)
  • eprintln! (prints to stderr with newline character)
  • write! (writes to whatever output stream you give it)
  • writeln! (writes to whatever output stream you give it)
  • format! (creates a String)

They use the "{}" syntax to format with the std::fmt::Display trait, the "{:?}" and "{:#?}" syntax to format with the std::fmt::Debug trait, and other lesser-known syntaxes to format with various other flags and traits, like hex formatting.

The way they achieve this cool and consistent formatting syntax is that they all delegate to the format_args! macro, which produces a value which borrows its args and implements Display, representing the formatted string.

For example, these would all do the same thing:

  • println!("{} + 5 == {}", a, a + 5)
  • println!("{}", format_args!("{} + 5 == {}", a, a + 5))
  • println!("{}", format_args!("{}", format_args!("{} + 5 == {}", a, a + 5)))
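A quick way to check that equivalence is to build the three strings with format! instead of printing them. This is only a sketch; the function name `three_ways` is made up for illustration.

```rust
// Hypothetical helper: build the same string via direct formatting and
// via one and two levels of format_args! nesting.
fn three_ways(a: i32) -> [String; 3] {
    [
        format!("{} + 5 == {}", a, a + 5),
        format!("{}", format_args!("{} + 5 == {}", a, a + 5)),
        format!("{}", format_args!("{}", format_args!("{} + 5 == {}", a, a + 5))),
    ]
}

fn main() {
    let results = three_ways(2);
    // All three entries are the same string.
    assert_eq!(results[0], results[1]);
    assert_eq!(results[1], results[2]);
    println!("{}", results[0]);
}
```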

But I worry that I'm getting carried away. Hopefully this clarifies.

2 Likes

You're definitely not getting carried away, this is what I needed :slightly_smiling_face:. As someone who has spent the vast majority of their programming life working in high-level languages, I've never had to concern myself much with how data are actually represented on a computer, so this is all very helpful. I have another question though, if I may:

If I need to do some mathematical operations with these numbers, should I just read the file, do the math, and then wait to do the conversion to an ASCII character when I'm at the point of needing to display something to a user? Or, is there some reason that I should go ahead and do the conversion up front prior to performing any mathematical operations?

EDIT: Never mind, I think I have to parse it before doing any math. The number 1 is represented by the byte that's represented by the u8 value 49 (right?). If I want to do 1 + 1, I need to represent the 49 as 1 first. I can't do 49 + 49 and then convert 98, because 98 is the letter b, right?
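That reasoning can be checked with a few lines. This is a sketch only; the function name `add_digit_bytes` is made up, and it assumes both inputs are ASCII digit bytes.

```rust
// Hypothetical helper: add two ASCII digit bytes as numbers.
// Assumes both arguments are in the range b'0'..=b'9'.
fn add_digit_bytes(a: u8, b: u8) -> u8 {
    // Convert each byte to its numeric value first, then add.
    (a - b'0') + (b - b'0')
}

fn main() {
    let one = b'1'; // the byte 49
    // Adding the raw bytes gives 98, which encodes the letter 'b'.
    println!("{}", (one + one) as char);
    // Converting first gives the expected arithmetic result, 2.
    println!("{}", add_digit_bytes(one, one));
}
```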

2 Likes

The way I understand it, you would like to do something like this:

use std::{fs, io};

fn main() -> io::Result<()> {
    let digits: Vec<i32> = fs::read("input.dat")?.into_iter()
        .filter(|&x| x.is_ascii_digit())
        .map(|x| i32::from(x - b'0')).collect();
    println!("{:?}", digits);
    Ok(())
}

2 Likes

What does the - b'0' mean? I understand that b'0' is a ByteString (or a ByteStr?), but I don't understand why we are subtracting it from x

The expression b'0' has the type u8 and produces the ASCII value of the zero character. By subtracting the ASCII value of zero from an ASCII digit, you get back the digit as an integer.

2 Likes

b'0' is the same as 48u8. It's a byte character (note the single quotes) representing the ASCII value of '0'.

2 Likes

x - b'0' basically takes advantage of the fact that historically, text encodings have pretty much always put the digits together, in order, from 0 to 9, and, particularly for Rust, that UTF-8 does this. So if text represents the digit "0" with the byte (u8 value) 48, then "1" is represented by the value 49, "2" by 50, and so on. So b'0' - b'0' == 48 - 48 == 0, and b'1' - b'0' == 49 - 48 == 1, and so on.
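Here is a small sketch that walks the whole digit range to show the pattern. The helper name `digit_value` is made up; it adds a guard so that bytes outside b'0'..=b'9' yield None instead of a meaningless or underflowing subtraction.

```rust
// Hypothetical helper: numeric value of an ASCII digit byte, if it is one.
fn digit_value(byte: u8) -> Option<u8> {
    if byte.is_ascii_digit() {
        Some(byte - b'0')
    } else {
        None
    }
}

fn main() {
    // The ten digit bytes are contiguous, so the offset from b'0'
    // is the digit's value.
    for byte in b'0'..=b'9' {
        println!("{} -> {:?}", byte, digit_value(byte));
    }
}
```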

2 Likes

As @cliff said, if you look at an ASCII table, you see that the ASCII values 48-57 represent the digit characters '0' through '9'.

So, assuming an ASCII value is in that range, you can easily convert it to the numerical value by subtracting 48. b'0' is just syntactic sugar for "the u8 ASCII value of the character '0'", which is 48.

Although, instead of doing this manually, you could just take advantage of the char::to_digit method, called as ch.to_digit(10), which already exists and does essentially the same thing internally.
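A minimal sketch of that approach, using a string literal in place of the file contents. The function name `digits_of` is made up for illustration; `to_digit(10)` returns None for anything that is not a decimal digit, so collecting into an Option makes the whole conversion fail cleanly on bad input instead of panicking.

```rust
// Hypothetical helper: convert every char to its decimal digit value,
// or return None if any char is not a decimal digit.
fn digits_of(text: &str) -> Option<Vec<u32>> {
    text.chars().map(|c| c.to_digit(10)).collect()
}

fn main() {
    // Stand-in for the start of the file from the question.
    match digits_of("000004795110719122020") {
        Some(digits) => println!("{:?}", digits),
        None => println!("input contained a non-digit character"),
    }
}
```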

2 Likes
