My First Project - hexetta

I've been working on a project, called hexetta, similar to hexdump but more flexible and with more features. It can view data in binary, quaternay, octal, hexadecimal, and ascii. It is my first big project I've ever created, in any language. I would love any feedback.

2 Likes

Some observations

Command line parsing.

You're reinventing the wheel with your flags and config modules. Consider using clap.

Useless comments

Comments like

/***********
* `struct` *
***********/

and

/*********
* `impl` *
*********/

are utterly useless noise. The reader already knows this by reading the respective first keyword of the line.

Derive useful traits

#[derive(Clone, PartialEq)]
pub enum FlagName {

For a plain enum without associated data, deriving Copy may be more ergonomic. Also consider deriving Debug.

Used types

The type Vec<String> is very wasteful, since it may require a lot of allocations. Why not use a plain String, or, even more suitable, a Vec<u8>?

Unsafe unsafe

Why do you assume, that this is safe? Consider adding an appropriate safety comment or, even better, don't use unsafe code and just store the plain bytes.

            let s = unsafe { String::from_utf8_unchecked(buf.to_vec()) };
5 Likes

Yes, I know I reinvented the wheel with my command line parser. I wanted to do it for a learning experience, although my code is extremely messy, so I'm thinking about migrating to clap.

The comments are there to organize my code, when it was scattered around before, but now I don't need them. Thanks for the suggestion, I'll remove them.

Yes, but the reason I used them is because I needed a string representation of data in different bases. I could use String instead, that requires the functions in src/byte_format.rs taking an extra argument, but it might be faster.

I've checked that buf is an array with completely valid Unicode. I could change it to this, but I think it reduces performance:

let s = String::from_utf8(buf.to_vec()).unwrap();

Use String::with_capacity and push IMO.

1 Like

This is fine in theory, but:

  • do check if there's any actual statistically significant difference (eg with hyperfine or criterion), it's very easy to assume performance differences where the aren't any
  • it's a really good idea to provide a // SAFETY: comment above each unsafe block explaining how you're meeting the requirements of the unsafe method (which should be explicitly listed in the method docs): in this specific case, you can probably just use a simple comment like "// SAFETY: buf contains only ASCII"
  • ideally, isolate the unsafe code together with all the code upholding the requirements of the unsafe code such that it can't be misused "externally" - generally by extracting a module and using visibility and type states (using types to represent the state of data)

In this case, I'd not be surprised if the optimizer could completely eliminate the checks anyways. (I should give it a try on Compiler Explorer)

2 Likes

Like this?

for &byte in self {
    // old code:
    // let buf = [HEX[(byte >> 4) as usize], HEX[(byte & 0xF) as usize]];
    // let s = unsafe { String::from_utf8_unchecked(buf.to_vec()) };

    // new code:
    let mut s = String::with_capacity(2);
    s.push(HEX[(byte >> 4) as usize] as char);
    s.push(HEX[(byte & 0xF) as usize] as char);
    result.push(s);
}

I tried it on Compiler Explorer, and the two had the same functions, but the unsafe one had 987 lines of assembly, and the safe one had 1521 lines of assembly.

I ran hyperfine --warmup 3 -N "target/release/hexetta big", where big is a 16 MiB file.
unsafe String::from_utf8_unchecked: 0.999 seconds on average
safe String::from_utf8(...).unwrap(): 1.156 seconds on average
String::with_capacity and push: 1.026 seconds on average

With a 32 MiB file:
unsafe String::from_utf8_unchecked: 2.002 seconds on average
safe String::from_utf8(...).unwrap(): 2.304 seconds on average
String::with_capacity and push: 2.048 seconds on average

If I should avoid unsafe, String::with_capacity and push is the best option.

1 Like

With those results, I'd say something like this is pretty reasonable?

// Roughly 15% overall perf improvement compared to String::from_utf8().unwrap()
// SAFETY: buf contains only ASCII
let s = String::from_utf8_unchecked(buf);
1 Like

Update: I changed my app to use clap now

Update: I added a position indicator

About 1 second to format 16 megabytes into hex is slow.

BUT: In this case, it's OK! See below.

I wouldn’t recommend that approach.

This is a beginner project and apparently the author’s first one. Suggesting unsafe for an approximate 15% performance improvement is unlikely to be helpful in this context.

The performance issues stem mainly from the chosen data structures. Using a Vec<String> with two character strings introduces a lot of overhead: 26 bytes of overhead per input byte. For a 16 MiB input, that translates to over 16 million single allocations and about half a gigabyte of gross RAM usage (alignment, metadata). Under those conditions, poor performance is expected.

That said, this is entirely reasonable for a learning project.

If the original author is explicitly looking for performance advice, there are other options such as appending directly to a single String, or using a lightweight view like struct EncodedSlice<'a>(Encoding, &'a [u8]) and defer formatting until output.

2 Likes

Yes, but the kind of advice I'm trying for here is general advice for how to think about unsafe, not necessarily what's the best option for this specific project: a real measurable performance improvement is a legitimate case for unsafe, and thinking about the motivations and justifications and required effort needed for the legitimate uses is exactly the right approach in the longer term.

Simply giving a blanket "never use unsafe" is just silly, since the features and APIs obviously exist for a reason. Anything like "you should avoid unsafe" until you're experienced enough to know when you should" implies that that experience is somehow magically accumulated by doing random other things. Starting with "hey, using unsafe is probably bad, you should try to avoid the use of it unless you can show it's a correct use and it makes a difference over the safe alternatives" is how you get that experience to know when it's worth reaching for (at least for performance reasons - to work around the borrow checker etc. is a whole different kettle of fish)

Now, yes, there are more things to try to optimize this particular project, and it's quite likely to end up not using a String at all in a better performing approach, but several options there have been suggested already and really it's up to the OP how they want to approach trying improvements for their own project. Honestly at this point having fun is more important than anything!

1 Like

So yes, this is a fair argument I agree in general.

There are quite a few unsafe methods in Rust labeled _unchecked(), such as new_unchecked(), get_unchecked(), unwrap_unchecked(), and others. These methods are generally more performant than their safe counterparts.

Nevertheless, most Rust code prefers the safe versions, even in situations where the unsafe alternatives would technically be correct and could not go wrong at that specific point in the code.

Real world experience has shown, however, that something will eventually go wrong in the future. To me, it seems that the use of Rust (and other safe languages) often follows the idea that paying a small runtime overhead is worth it in exchange for preventing potential disasters (in the future).

Because of that, I would argue that a small performance gain from using unsafe code is rarely justified, especially when a measurable larger improvement could be achieved through a manageable algorithmic change while still staying entirely within safe code.

1 Like

Yes, I think we're basically in agreement in principle, and it's only a question of degree: that is, how much performance is "a small improvement" and how high the bar is for trying alternatives.

You definitely shouldn't just toss in unsafe "because it's faster" - but equally you shouldn't take it off the table when you're looking for performance improvements for too long: at some point for "library unsafe" methods like from_utf8_unchecked() this just means you're losing out on the design and ergonomics of existing library code like String and creating clumsy duplicates just to avoid a keyword. It's quite easy (in general, I don't think this project is currently an example) to end up with the same or worse performance by just spreading out the checks you were trying to avoid elsewhere in the code implicitly.

On the other hand, especially for library code, it can be valuable to be able to claim zero unsafe usage, and you shouldn't get overly comfortable going for unsafe, and you'll definitely need to build up experience in how to safely use unsafe that you could be spending elsewhere, etc.... There's very few simple answers in programming, I here just wanted to avoid presenting a too dogmatic position on unsafe of "don't ever use it" that doesn't really reflect the complexity of the reasoning behind that, especially when the OP has demonstrated a locally fairly reasonable use of it in practice.

1 Like

Hi! It's great to see a project like this implemented in Rust. Coincidentally, I needed a hex/binary "reader" because I'm working on a project that creates binary files, and sometimes I need to check the output bit by bit. So, here is some feedback from a user's perspective, not about the code implementation:

  • First, there is a better tool to compare against this that is xxd. In fact I couldn't find an option on hexdump to output the content in binary format but it's available on xxd with the -b argument. So I started like you being an hexdump user that later realized that xxd is not just better (it also has colored output), but also more widely available in Linux distributions without requiring it to install, it's just there, and the same I think applies to MacOS.
  • The output coloring is useful because it highlights what are the "unprintable" characters better, or the 0x00 bytes that are usually specials in binary format as well.
  • The --layout argument is hard to understand, you either need to change how to set this option or document better how it works:
      -l, --layout <LAYOUT>
              Control the layout of the output
              LAYOUT is a sequence of a number followed by an encoding
              Available encodings are:
                  - b        binary
                  - q        quaternary
                  - o        octal
                  - h        hex
                  - a        ascii
               [default: 16h16a]
    
    The first part is to set how many bytes to print in what encoding, like 16h would be: print 16 bytes in a row in hexadecimal format, but what is the second? it took me a while to realize is to setup the second section at the right with a different encoding, that by default would be otherwise 16a, but what happens when you set something like 16h08a? I see the output and it doesn't make sense for me, the number of columns should be both equals for both sections, and what you want is to setup what encoding to use on each one, and how many columns but the number of columns should be the same for both sections. I would split this into more arguments, one to set the numbers of columns for either sections -c COLS as it works in xxd, one to setup the encoding in the first section that is by default hexadecimal, and the other to set the encoding in the second section that by default is ASCII, plus an option to enable/disable the second section (in hexdump the option is -C).
  • Once you have an acceptable version, publish it on crates.io , so it's easier to install, it would be something like cargo install hexetta and that's it, no need to clone, build, and install all with different commands.
1 Like

Update: I refactored the code and added a lot of features (such as color).

When should I release this as 1.0.0?

Nice! will check soon.

When should I release this as 1.0.0?

I would first release 0.1.0 somehow, like publish it on Crates.io and/or binary releases available for some platforms so it's easier to install for no savy rust developers, then after more feedback 0.1.x or 0.2.x until you know you are not going to introduce breaking changes for a while. Then is the moment for 1.0.0 I think.

1 Like

Oh I see you are in version 0.6.x already, that's fine, but I would wait a bit more for 1.0.

Also I would recommend to explain the project a little bit more in the README, that's is going to be the REAME in Crates.io as well, so you make your point there why to choose your CLI over hexdump or xxd. I would even include an example of the output in the README too.

2 Likes

Did I up the versions too fast? I upped the minor version every time I added a new feature, but a lot of crates on crates.io have more conservative version numbers.

Roughly there's not any rules for 0.x - that's kind of a blessing and a curse: you can do whatever you like but that's why people feel a bit nervous about depending on them. You can use it as 0.major.minor but that's just a best effort default behavior from the resolver that a lot of semver implementations have settled on, it's not strictly in the standard nor has any strong motivation for the "not ready for public use" semantics 0.x is supposed to have.

In practice it doesn't really mean anything for binaries though, since nobody is pulling them in as dependencies with valid version ranges.

1 Like