My config file lib SIGSEGVs when running example, also, is there a better way to do binary config format parsing without raw pointers?

This is both a code review and help post.
I have this project I am making, it's a binary configuration file format made for my hobby OS project, hence why it's no_std.
I did some code restructuring for array, and after I'd had made dictionary support, but now it SIGSEGVs at line 115 of src/lib.rs.
To reproduce the behaviour, run cargo run --example ftool examples/test.formulae
It used to work before I changed up the code to be able to parse arrays. also, it shouldn't even be reaching that line since I'm not yet using the Dictionary node anywhere in my code.
Here comes the code review part, is there there a better way to do this lib without raw pointers? I'm blaming the raw pointers for this headache.

EDIT: Forgot to link the project, silly me: https://github.com/Firework-OS/Formulae

Using raw pointers here feels unnecessary. Is there any reason why you don't want to use normal methods on &[u8] like slicing or methods like u32::from_le_bytes()?

I ran your program under gdb and it gave the following backtrace:

#0  0x00005555b6ec3cc0 in ?? ()
#1  0x000000000000000b in ?? ()
#2  0x00007fffffffb738 in ?? ()
#3  0x0000000000000070 in ?? ()
#4  0x000055555558954d in core::intrinsics::copy_nonoverlapping<u8> () at /rustc/bfe15646761a75f0259e204cab071565eed2b1e5/library/core/src/intrinsics.rs:2083
#5  alloc::vec::Vec::append_elements<u8, alloc::alloc::Global> () at library/alloc/src/vec/mod.rs:1798
#6  alloc::vec::spec_extend::{impl#4}::spec_extend<u8, alloc::alloc::Global> () at library/alloc/src/vec/spec_extend.rs:85
#7  alloc::vec::Vec::extend_from_slice<u8, alloc::alloc::Global> () at library/alloc/src/vec/mod.rs:2229
#8  alloc::string::String::push_str () at library/alloc/src/string.rs:850
#9  alloc::string::{impl#63}::write_str () at library/alloc/src/string.rs:2727
#10 core::fmt::{impl#0}::write_str<alloc::string::String> () at /rustc/bfe15646761a75f0259e204cab071565eed2b1e5/library/core/src/fmt/mod.rs:193
#11 0x00005555556b69c2 in core::fmt::run () at library/core/src/fmt/mod.rs:1216
#12 core::fmt::write () at library/core/src/fmt/mod.rs:1184
#13 0x00005555555884bc in core::fmt::Write::write_fmt<alloc::string::String> ()
    at /rustc/bfe15646761a75f0259e204cab071565eed2b1e5/library/core/src/fmt/mod.rs:186
#14 alloc::fmt::format () at library/alloc/src/fmt.rs:597
#15 0x00005555556c9807 in formulae::Node::from_bytes (node_type=0x5555558194b1, next=0x7fffffffcb78, data_end=0x5555558194eb) at src/lib.rs:115
#16 0x00005555556c996a in formulae::Node::from_bytes (node_type=0x555555819488, next=0x7fffffffcb78, data_end=0x5555558194eb) at src/lib.rs:106
#17 0x00005555556caac8 in formulae::Root::from_bytes (data=...) at src/lib.rs:182
#18 0x00005555555816a6 in ftool::main () at examples/ftool.rs:60

Looks like the node_type variable on line 103 (stack frame #16) doesn't contain what you think it does because we pass it into Node::from_bytes() which then blows up when it reads the value while printing the "Unimplemented Node type" error message.

1 Like

I have no idea why I used raw pointers, I guess my C programmer thought process took over. I think if I rewrite my code like how you suggested it'll fix this issue, and future issues similar to this.

Yeah, I'm guessing you've run into an index-out-of-bounds error or interpreted a pointer as an integer because all the pointer casting meant you had one too few/many levels of indirection.

If it were me, I'd phrase the from_bytes() function as something which reads the bytes and returns a tuple of the parsed Node and the remaining bytes.

That way you can parse multiple nodes something like this:

pub fn parse_nodes(mut input: &[u8]) -> Option<Vec<Node>> {
    let mut nodes = Vec::new();

    while input.len() > 0 {
        let (node, rest) = parse_node(input)?;
        nodes.push(node);
        input = rest;
    }

    Some(nodes)
}

Where I implemented parse_node() as something which peels off the first byte, does a match against constants (not an enum - Rust enums must always contain a valid variant, so blindly casting a u8 to NodeType is UB) then peels off the rest of the Node.

This "peel off some bytes then return the rest" approach composes really well and leads to some pretty clean code when you combine it with ? and shadowing the previous byte slice.

pub fn parse_node(input: &[u8]) -> Option<(Node, &[u8])> {
    let ([node_type], input) = read(input)?;

    match node_type {
        node_types::BOOL => {
            let ([byte], input) = read(input)?;
            Some((Node::Bool(byte != 0), input))
        }
        node_types::INT32 => {
            let (bytes, input) = read(input)?;
            let i = u32::from_le_bytes(bytes);
            Some((Node::Int32(i), input))
        }
        node_types::STRING => {
            let (s, input) = read_string(input)?;
            Some((Node::String(s), input))
        }
        _ => None,
    }
}

(playground)

2 Likes

Thank you!

1 Like

Any time you are working on something that can be described as a “parser” or “deserializer” or similar, it's a good idea to stick to not using any unsafe at all (at least as a starting point until you have a clear performance problem and a rigorous plan to fix it while maintaining soundness). Bugs in code that interpret externally provided data are one of the top sources of vulnerabilities, and the kind of thing that Rust's safe/unsafe distinction is meant to help prevent.

3 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.