Seeking code feedback: Tdms file parsing library

Hello,

I've just managed to get a Tdms parsing utility to load and parse data and I felt like I had to hack around a lot of things to get it to work. In particular I think my use of enums to handle datatypes could be better? Also it feels like there's a lot of boiler plate that could possibly be generic, but I'm not sure how.

I'm hoping to get some answers to questions I've put in the comments next to the functions/structures in question. I would also appreciate any general advice on better constructs to use, ways to approach "genericising" and so forth.

Keen for any kind of feedback!

The repository: https://github.com/AJAnderson/rstdms

2 Likes
#[derive(Debug)]
pub enum DataType { ... }
#[derive(Debug)]
pub enum DataTypeVec { ... }
#[derive(Debug)]
pub enum RawDataType { ... }

I've used this sort of pattern, and to me it seems not so bad. It's a solution that has very high potential for the type checker to help you, leaving little room for mistakes. But the cost of making the code easier to write is that you need to write more of it. :stuck_out_tongue: I believe that when HKTs (higher kinded types) are implemented, they'll be able to help do things like this with less boilerplate.

If you change DataType::Void to DataType::Void(()) (and DataTypeVec::Void(Vec<()>)), then you may be able to use macros to reduce some of the boilerplate.

An common (but tricky!) alternative

Will the user often know the type that they expect to be deserializing into? In this case, one possible solution is to have custom Serialize/Deserialize traits, like serde. These can be provided with automatic derive macros:

#[derive(tdms::Serialize, tdms::Deserialize)]
struct Point {
    x: f64,
    y: f64,
}

DataType could implement these traits for use by people who don't know what type is in the file (serving a role like serde_json::Value). Then f64 can implement the traits by deferring to the DataType impl and e.g. throwing an error if it isn't a FloatDouble.


impl fmt::Display for DataTypeVec { ... }
impl fmt::Display for TdmsSegment { ... }

These feel to me like odd things to implement. Is it needed for anything in particular? I'd save fmt::Display for types that have a clearly meaningful string representation, e.g., that represent an error message, or a portion of a file or a syntax tree.

If you want these for debugging purposes, you could have a fn debug(&self) or fn pretty(&self) that returns some wrapper type Pretty<'_> which implements Display.

Awesome, thank you! I'll have a look into your suggestions.

To answer a few more of the questions in the code (I was interrupted yesterday and had to stop early):

// QUESTION: I struggled to make this a one liner, something in the background kept
// wrapping Option around the result, regardless of whehter I called unwrap
// QUESTION: Is there a better way to map raw values to enum than the approach I have taken?
let prop_datatype = num::FromPrimitive::from_u32(file.match_read_u32()?);
let prop_datatype = prop_datatype.unwrap();

The following:

let prop_datatype = num::FromPrimitive::from_u32(file.match_read_u32()?).unwrap();

should work. However, I have to wonder whether this can legimitately fail? (Is it known at this type that the u32 in the file must correspond to a valid datatype?) Returning an error that signals a bad datatype id would be more robust.


// QUESTION: How to I convert this to &str? Do I want to?
impl From<String> for TdmsError {
    fn from(msg: String) -> TdmsError {
        TdmsError::Custom(msg)
    }
}

Sorry, I wasn't sure what this was asking.

(I also unfortunately wasn't able to figure out what is happening in read_object / didn't want to spend too much energy on it)


.or(Err("Unable to convert buffer to string".to_string()))?;

This allocates an error string even on success. I normally write:

.map_err(|_| "Unable to convert buffer to string".to_string())?;

On a recent version you improved this by adding a variant to your error type. I would go a bit further and impl std::fmt::Display and std::error::Error for this type:

impl fmt::Display for TdmsError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            TdmsError::Io(e) => write!(f, "an IO error occurred: {}", e),
            TdmsError::FromUtf8(e) => write!(f, "unable to convert buffer to string: {}", e),
            TdmsError::Custom(msg) => write!(f, "{}", msg),
        }
    }
}

impl std::error::Error for TdmsError { }

Also, beware that enum variants are always public, meaning that you can't add a new one backwards compatibly. There are two ways to address this:

// Option 1: A nonexhaustive enum.
//
// (#[nonexaustive] isn't stable yet but for now you can add
//   an undocumented variant)
pub enum TdmsError {
    Io(io::Error),
    FromUtf8(string::FromUtf8Error),
    Custom(String),

    // Force users to add a `_` branch to their matches.
    #[doc(hidden)]
    _NoCompleteMatch,
}

// Option 2: A struct to add a layer of privacy
pub struct TdmsError(ErrorKind);

enum ErrorKind {
    Io(io::Error),
    FromUtf8(string::FromUtf8Error),
    Custom(String),
}

Just as a quick complement, most of your naming is pretty idiomatic! (e.g. Tdms instead of TDMS; a Custom error variant). Just a couple of things:

  • It may be even more idiomatic to trim some of the Tdms prefixes to remove stuttering (e.g. FileHandle, so that people can use it as tdms::FileHandle).
  • TdmsFile::new_file sounds like it should be TdmsFile::open. (you did this already for TdmsFileHandle, so maybe there's a reason you chose this name?)
  • I think the correct spelling is "endianness"

pub fn objects(&self) -> Vec<&String> {

This is an unusual return type. &str is usually better than &String because it gives you the freedom to switch to other ways of storing the strings internally (like e.g. Rc<str>, or using some kind of string interning crate).

Also, it may not be necessary to return a Vec; now that return position impl Trait is available, you can write something like:

-> impl Iterator<Item=&str> + '_;

although this can sometimes be too conservative, as most of the time you'll have no problem returning iterators that further satisfy the properties of ExactSizeIterator and DoubleEndedIterator. (e.g. self.some_vec.iter().map(|x| x.thing))

My personal pet pattern is to do something like this:

-> impl VeclikeIterator<Item=&str> + '_;

/// Trait alias for an iterator that has all of the same, nice properties
/// that [`std::vec::IntoIter`] has.
pub trait VeclikeIterator
    : ExactSizeIterator + DoubleEndedIterator + iter::FusedIterator
{ }

impl<I> VeclikeIterator for I
where
    I: ExactSizeIterator + DoubleEndedIterator + iter::FusedIterator,
{ }

When using this, beware that .filter() and .flat_map() prevent you from satisfying ExactSizeIterator, as does iter::repeat(x).take(n).


#[derive(Debug)]
pub struct TdmsFileHandle {
    handle: io::BufReader<std::fs::File>,
    endianess: Endianess,
}

This reminds me of a mistake I saw recently in the npy crate, which is that it was impossible to test it without accessing the filesystem. Try the following:

#[derive(Debug)]
pub struct TdmsFileHandle<R> {
    handle: R,
    endianess: Endianess,
    // Allows the Tdms file to be embedded somewhere in
    // the middle of the file.
    start_pos: u64,
}

impl<R: Seek + BufRead> TdmsFileHandle<R> {
    pub fn new(mut handle: R) -> Self {
        let start_pos = handle.seek(SeekFrom::current(0))?;
        /* ... */
    }
}

This allows it to be used on a io::Cursor<Vec<u8>> so that you can construct files in memory for unit tests, and as a bonus it allows TDMS files to appear embedded somewhere in the middle of a file.

Wow thank you so much, I'll digest those over the next few days and try and implement them.

I really appreciate you taking the time to provide such comprehensive feedback!

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.