[Review][CLI][IO] hck, a sharp cut clone with some specific issues

I've written up a description of the project here: There must be a better way! by sstadick · Pull Request #6 · sstadick/hck · GitHub

tl;dr - Go to here and help me figure out how to make that ideal world work.

There are only a couple hundred lines of code, with big chunk of that duplicated (see why below).

hck is a clone of cut, but faster and with more features. I hit two major roadblocks, one with lifetimes and one with types. Both are currently solved by copy-pasting a bunch of code.

Happy to have comments as reviews on that PR, or comments here. If anyone is willing / wants to chat in real time (text or voice) lmk.

Here's the PR comment in case you don't want to go to github:


These changes are already on master, this PR is meant to inform future iterations.

It might be worth starting with the very bottom In an ideal world.

What is hck?

hck is a near clone of cut, but faster and with more features, such as rearranging output columns regex delimiters and selection by headers. A few more features are planned as well such as regex filters on columns and possibly some kind of nice hookup to bat or fzf.

As it stands in my own benchmarking (in other words, probably flawed and biased benchmarks), hck is on par with tsv-utils from ebay, which seems to be about the fastest toolkit out there. I would like to keep or exceed this performance.

How does it work?

Setup

hck takes a list of files or stdin and some subset of fields as options. Fields are parsed by the FieldRange module/struct. The code for that was adapted from the rust coreutils version of cut.

A writer opened on stdout (from grep_cli to avoid the rustc not-buffered-stdout issue) is created.

For each input the run function in main.rs is called. It takes a reader, writer, and the parsed options.

The field ranges are re-parsed for each input source since the headers for each input could be in a different order. This is not ideal and could be changed in the future to do it all right at the start, or at least only parse the index based fields once.

In the run function the reader and writer are wrapped in BufWriter / BufReader. A core::Core object is created to manage the actual parsing and writing of data.

At this point we hit the first limitation imposed by the current design, if we are splitting the line based on a regex, we call the core.process_reader_regex, and if using a substr delim we call core.process_reader_substr. These methods differ only in how the string is split.

Parsing lines

Looking at core::Core::process_reader_*, the main loop there was taken from bstr's for_line_with* code. It is a nearly zero copy line reader and was the fastest of the many ways I tried parsing lines. The function itself in bstr required a closure, more on that in a second.

The cached Vec<Vec<&str>> issue

The "process a line" code is repeated twice in the loop. This is bad and I'd love to change it. For each line, because we are reordering the outputs, we need to store references to parsed fields in the "new" order since we must (or at least it seems easiest / fastest) iterate over the fields their input order. This is done by storing groups of fields on the shuffler Vec as Vec<Vec<&str>>. We want to avoid allocating and deallocating that vec over and over for each line. This turns out to be really really hard and annoying.

See my question here and here to see how I ended up with what you see now.

The short explanation is that empty_shuffler is created outside the loop with a &'static str inner, then coerced inside the loop to give &str a shorter lifetime so that the compiler will allow references from the current buffer to be stored on a container that lives longer than that data. When the fields are written in their new order the inner Vec<&'str>'s are drained so the vec is effectively recycled, but rustc doesn't know that yet so we transmute the shuffler back into the empty_shuffler (again, see above questions for more details on how / why this works).

let mut empty_shuffler: Vec<Vec<&'static [u8]>> = vec![vec![]; self.fields.len()];
loop {
    // coerce the lifetime to match that of the loop
    let mut shuffler = empty_shuffler;
    // ... fill the shuffler with &str from the current line
    // write the fields in the new order, draining each inner vec
    self.join_appender(shuffler.iter_mut().flat_map(|values| values.drain(..)))?;
    // "reset" the empty_shuffler for the next loop
    empty_shuffler = unsafe { core::mem::transmute(shuffler) };
}

HELPPPPP In src/lib/line_parser.rs you will see an attempt at moving this logic elsewhere to compartmentalize things. Every attempt I've made at this falls apart with lifetime errors around the empty_shuffler and its inners living longer than the data that was just read. The lifetime coercion seems to fall apart across a function boundary. This especially does not work with a closure.

The split issue

This leads directly into the next issue. We can parse on two delimiter types: &[u8] or regex::byte::Regex. I've gone down the road of a splitter function that returns Box<dyn Iterator<Item=&[u8>> and it works, but there is an appreciable performance hit since that is literally the hottest part of the code.

HELPPPP Again, src/lib/line_parser.rs makes an attempt at splitting this out into a trait with different implementations. Sadly things feel apart due to The cached vec issue before really getting to play with this design. I would love feedback on this though. Specifically, should the return type from split be elevated up to an associated type for the trait? If so this moves a lifetime onto the trait, how does that work out?

In an ideal world

In an ideal world I would be able to use:

// method on `core::Core`
pub fn process_reader<R: Read>(&mut self, reader: &mut BufReader<R>) -> Result<(), io::Error> {
    let mut shuffler: Vec<Vec<&'static [u8]>> = vec![vec![]; self.fields.len()];
    reader.for_byte_line(|line| {
        self.line_parser.parse_line(line, &mut shuffler)?;
         self.join_appender(shuffler.iter_mut().flat_map(|values| values.drain(..)))?;
        Ok(true)
    })?;
}

To the best of my knowledge there is no way to reuse a vec like that, which is very sad.

If there is a better place in the rust community for requesting long form review / advice I'd love to repost there!