First crate code review request: CSV log cleaner

I created this little crate as a first project to teach myself Rust: https://crates.io/crates/csv_log_cleaner (My background's mainly in Python). It's a CLI tool which cleans CSV files to conform to a type schema by streaming them through small memory buffers using multiple threads and logs data loss.

Docs are here: csv_log_cleaner - Rust

Github repo is here: GitHub - ambidextrous/csv_log_cleaner: Clean CSV files to conform to a type schema by streaming them through small memory buffers using multiple threads and logging data loss.

I would very much appreciate any comments or code reviews (or even PRs) that people might be willing to offer.

In particular, I'd appreciate it if someone could take look at the Rayon threadpool handling around here: csv_log_cleaner/lib.rs at 07ca52da3b8a82037877e031258a057e02e726ea · ambidextrous/csv_log_cleaner · GitHub All of my tests are passing, but I've had enough experience with multi-threading in other languages to know that might just be blind luck and I may have actually messed up the multi-threading disastrously.

A couple of things stand out already in connection with process_rows's interface:

  • argument-position impl Trait shouldn't be used in public functions unless you specifically want to restrict the caller from explicitly instantiating the corresponding type parameters (reader and writer). I don't see any good reason for that, here, so you should probably be using explicit type parameters instead.
  • Taking a &String (or a &Vec, etc.) parameter is actively harmful. It adds no value over &str (because a &String can't grow as it's behind an immutable reference), but it requires the caller to have a concrete String allocated. This would incur a superfluous allocation when they have something that would coerce to a &str but is not a String. You should just take an &str, or even better, a type implementing AsRef<Path> if you are trying to describe a filesystem path. Better yet, you should take Readers and Writers, so that people can use your function with arbitrary readers and writers, not only with files.
  • Returning Box<dyn Error> in a library is not very useful. It conveys no information. Create a semantic error type instead.

Moving on to the implementation:

  • Your ColumnType enum could be Copy but it isn't.
  • CSVCleaningError::new() unconditionally copies the string, allocating even if you already have a String. For sink arguments (e.g. typical constructor arguments), just take them by value. Callers can always clone explicitly if they need to, but with the current API, then can never avoid an unnecessary clone.
  • You shouldn't implement Error::description(). It's deprecated.
  • This and many other mistakes would have been caught by the use of Rust's official linter, Clippy. Use it.
  • write!(f, "{}", self.message) is superfluous since self.message is already a string. Use the macro-less, more direct f.write_str() method instead.
  • process_rows() is very long-winded. More than 100 rows, densely written. It should likely be split up and refactored.
  • In the same function, you are printing to the console. Don't. Library code shouldn't randomly print progress information. It should preferably be pure computation. You can always print progress information before and after you call it from the executable.
  • On line 208, you are doing fs::write(…).expect(&log_error_message). This is bad practice – there is a huge variety of reasons why writing to a file might fail; you shouldn't count on I/O working as an invariant that should be .expect()ed away. This is doubly true because your function already returns an error. People expect their Result-returning functions to fail gracefully, by returning an Err, not by panicking and blowing up the whole process/thread.
  • process_row_buffer() has 7 arguments. process_row_buffer_errors() has 8. Those should have been structs .
  • You are doing a lot of push-in-a-loop and insert-in-a-loop operations (e.g., in process_row_buffer() and copy_to_std_hashmap().). Those should be Iterator::collect() instead.
  • are_equal_spec_and_csv_columns() allocates, clones, and sorts superfluously a lot. Its implementation should instead be:
     csv_columns_record.len() == spec.len() &&
     csv_columns_record
         .iter()
         .all(|field| spec.contains_key(field))
    
  • If you anticipate many small maps (as the comment on FxHashMap indicates), you might be better off not using any hashing at all. For small (dozens of elements) maps, comparison-only linear or binary search in a flat array might actually be faster. You'd have to benchmark, though, since this depends on so many factors.
2 Likes

@H2CO3 Thank you very much! Will implement.