First Rust project: Command line tool

Hello all,
I would highly appreciate some constructive criticism on my first project. It is a command line tool to handle PDB files (a special file format used for storing information on proteins, DNA and the like) and manipulate them in a way that makes them useful as input for another program (ORCA quantum chemistry package).
Since it is a very niche use case, I have not published it on crates.io but rather on my own repo:
theDoctor/pdbman: Utility tool to handle PDB files and use them as input for QM/MM calculations in ORCA - pdbman - Codeberg.org

It is not quite "finished" (at least from my current point of view) but basically all of the features I wanted are implemented and by now I mostly fuss about its structure, clarity, efficiency and the like so I figured a few pointers by people who are better at this than I am might be helpful before I get bogged down by needless small edits.

For trying it out I included a couple of PDB files in the assets folder.

I haven't put in the time to do a full review, but I do note that you're doing a lot of string comparisons instead of types, and have the same list of strings in a lot of places. I suggesting using enums instead. E.g.

use std::str::FromStr;
enum Residue {
    None,
    Sidechain,
    Backbone,
}

impl FromStr for Residue {
    // There are better `Err` approaches, but for illustration
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "None" => Ok(Residue::None),
            "Sidechain" => Ok(Residue::Sidechain),
            "Backbone" => Ok(Residue::Backbone),
            _ => Err(format!("Residue '{}' is not recognized", s))
        }
    }
} 

And so on for aminos, etc. Then you can either convert everything to internal types upfront and benefit from the stronger typing. Or if you want to migrate graduallly or just be less invasive to the PDB type or something, you could attempt the conversion in each method.

use std::error::Error;
fn elsewhere(partial: &str) -> Result<(), Box<dyn Error>> {
    match Residue::from_str(partial)? {
        Residue::None => {},
        Residue::Sidechain => {},
        Residue::Backbone => {},
    }
    Ok(())
}

There are also crates that can reduce the boilerplate. I've found strum useful in cases like this.

1 Like

Thanks for the suggestion! I actually have implemented this for most of my pattern matching. I used to do all of the pattern matching in run() with strings but later converted all of that to enums. The Partial enum you suggested already exists I just forgot to implement it in the functions you pointed out.

In the case of the aminos (and other such lists) however I don't quite see the benefit. These are arrays only used to check if they contain strings that are parsed from the PDB input files. Granted the use case is somewhat similar to the Partial enums above but i created the enums mainly to simplify pattern matching which is not done with the amino arrays.
The code repetition for these arrays could be eliminated by making them consts so am I missing something here?

Yeah, I missed the fact that those are only used to see if it's part of a recognized set. (There's some code that associates some strings with floats that I thought was one of the same list of strings; I looked again and it's not the same list.) So no, I don't think you're missing anything.

You could just put the "is this a recognized X" logic into a function to get rid of the duplication. Or if you wanted to use const, you could use a slice of sorted values and use Vec::binary_search. Or you could use a lazy_static and have a HashSet. Lots of options.

For now I have changed the arrays from inside the functions to const arrays like so:

const BACKBONE_ATOMS: [&str; 8] = ["C", "O", "N", "H", "CA", "HA", "HA2", "HA3",];

This lets me perform the same logic using something like BACKBONE_ATOMS.contains(some_string). Is there a downside to this or rather a reason to do it with a binary search or other method?

Not really. Mostly a matter of preference or habit. Performance is an argument for some of the alternatives, but is very unlikely to matter in this situation IMHO. Simplicity and non-reliance on human sorting are counter-arguments.

Just gonna bump this again. The source code has been updated somewhat but I'm still interested in other people reviewing my approach of dealing with command line options and structuring a run function.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.