Hello from a rustacean nauplius

Greetings from Italy! I'm a fresh beginner in Rust and a former molecular biologist with a passion for programming and learning new languages. I've been working in web backend/API development for the last ten years (mainly in Ruby and recently in Elixir). I'm thinking of learning and using Rust to develop a bioinformatics tool that I'll have to build to maintain some sequence databases (very big text files with hundreds of thousands of DNA sequences) for species identification purposes, which require frequent taxonomic reviews of their entries. I'd like this tool to be fast and executable on Windows, Mac, and Linux boxes, without the overhead of interpreters and/or virtual machines. I know I'm going to learn a lot from you (lurking, to begin with), so I want to thank you all in advance.

Here's the nauplius of a crab, the first of the series of its development stages:
[Image: screenshot, 2020-05-26 at 23.31.37]

I hope one day I could be a proud ...

Ouch! New users can only post one image... :no_mouth:
So, here you have to imagine a nice Ferris the Crab picture :slight_smile:


Welcome! With that background post, I'm certain that you'll find many members of this forum who will gladly contribute to your education in Rust and, when asked about specifics, to your bioinformatics tool project.

What do you want for the overall structure of that tool? Some parts will be easier to do as a new Rustacean; others will require both a lot of learning and, probably, significant help.

Do you have a specific database technology that you want to use? (Some databases have better existing support in Rust than others.) How big do you expect your eventual database to be? 10 GB? 100 GB? larger? Focused requirements like these will better enable others on this forum to volunteer to help.


Hello @TomP and thank you for the warm welcome.
I'm still in the pre-analysis phase, collecting requirements from the future users (mainly a biotech company I'm collaborating with) and trying to keep my mind as open as possible to any solution. The sequence database input files are large plain-text files, about 600 MB, in FASTA format, i.e. around 400,000 consecutive DNA sequences, each identified by a header like:

> Kingdom;Phylum;Class;Order;Family;Genus;

The goal is to update specific headers with the corresponding revised taxonomy, which is encoded in a reference table listing the old and the new nomenclature (it could be supplied as a plain CSV file).
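
As a rough illustration of the reference table idea, here is a minimal sketch that loads such a table into a lookup map using only the standard library. The two-column `old,new` layout, the absence of a header row and of quoted fields, and the name `parse_mapping` are all assumptions made for the example; a real CSV file with quoting would be better served by a dedicated CSV parser.

```rust
use std::collections::HashMap;

/// Parse a two-column table (`old,new` on each line) into a lookup map.
/// Assumptions (not from the thread): no header row, no quoted fields,
/// and no commas inside the taxonomy strings themselves.
fn parse_mapping(table: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in table.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue; // skip blank lines
        }
        // `split_once` splits at the first comma only.
        if let Some((old, new)) = line.split_once(',') {
            map.insert(old.trim().to_string(), new.trim().to_string());
        }
        // Lines without a comma are silently ignored in this sketch.
    }
    map
}
```

Since the taxonomy strings use `;` as their internal separator, splitting each line on the first comma is enough under these assumptions.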

If this were a one-time job, I'd have been tempted to import all the entries into a PostgreSQL database, along with the conversion reference table, run some simple SQL update queries, and encode the output back into a FASTA text file. But it seems this is going to be a recurring task, performed whenever the taxonomy is updated.
Also, I'd like to build a stand-alone command-line executable, avoiding any need to install a database server for intermediate processing, for ease of distribution and use.
I imagine invoking the CLI tool with the two input files as arguments and having it produce the updated output.
I have also thought that this could be a job for AWK, but I don't really know it either. Or I could indeed use Ruby or Elixir (I have plenty of experience with them), but I'm convinced it would be much better to build compiled executables that can be used on any major OS with no dependencies. So I'm exploring Rust. And I find this a very appealing excuse/opportunity to finally learn some of its rudiments!
I know almost nothing, though, so any criticism, suggestion, or hint is very welcome. Thanks again.
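
The invocation described above (two input files passed as arguments) could be sketched with the standard library alone. The argument order, the usage message, and the function name `parse_args` are hypothetical choices for illustration, not a settled interface.

```rust
use std::path::PathBuf;

/// Hypothetical argument layout: <input.fasta> <mapping.csv>
/// Returns the two paths, or a usage message if the argument count is wrong.
fn parse_args<I: Iterator<Item = String>>(mut args: I) -> Result<(PathBuf, PathBuf), String> {
    // The first argument is conventionally the program name.
    let program = args.next().unwrap_or_else(|| "tool".to_string());
    match (args.next(), args.next(), args.next()) {
        // Exactly two arguments after the program name.
        (Some(fasta), Some(table), None) => Ok((PathBuf::from(fasta), PathBuf::from(table))),
        _ => Err(format!("usage: {} <input.fasta> <mapping.csv>", program)),
    }
}
```

In a real binary this would be called as `parse_args(std::env::args())` from `main`, with the updated FASTA written to standard output or to a third path.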


Do I understand correctly that you only want to update FASTA headers, and don't have to decode and match on FASTA sequence information? That's a comparatively trivial file-processing task: read lines, parse them when they are FASTA headers (which start with '>'), then check whether the specified taxonomy needs to be replaced in the header, and output the updated header followed by a copy of the rest of that input sequence.
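
The line-by-line approach just described might look roughly like the sketch below. It assumes the old-to-new taxonomy table has already been loaded into a `HashMap`, that a header is replaced only when its whole text matches an old entry exactly, and that the function name `update_headers` is just a placeholder.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

/// Copy a FASTA stream from `input` to `output`, replacing any header whose
/// text (after the leading '>') matches a key in `mapping`.
/// Returns the number of headers that were replaced.
/// Assumption: headers are matched as whole strings, not field by field.
fn update_headers<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    mapping: &HashMap<String, String>,
) -> io::Result<usize> {
    let mut updated = 0;
    // `lines()` strips the line ending; output always uses '\n'.
    for line in input.lines() {
        let line = line?;
        match line.strip_prefix('>') {
            // Header line: look up the taxonomy, ignoring surrounding spaces.
            Some(header) => match mapping.get(header.trim()) {
                Some(new) => {
                    writeln!(output, ">{}", new)?;
                    updated += 1;
                }
                None => writeln!(output, "{}", line)?,
            },
            // Sequence line: copy through unchanged.
            None => writeln!(output, "{}", line)?,
        }
    }
    Ok(updated)
}
```

Because only one line is held in memory at a time, the same function should work for a 600 MB file when `input` is a `BufReader` around a `File` and `output` is a `BufWriter`.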


Yes, you understand correctly.

A subsequent processing task is also needed (a different CLI command, I'd say): it should take the taxonomically updated output of the previous one plus a different FASTA file as inputs; this time it should match on the sequence data (not the header) to infer the new taxonomic status of its entries. It still seems like the same algorithm to me, with a slight variation.
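
One possible reading of this second task, sketched under strong assumptions: sequences are compared for exact identity (the thread does not say how matching should work), both files fit in memory, and the names `Record`, `read_records`, and `infer_headers` are invented for the example.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// One FASTA entry: header text (without '>') and its sequence with
/// line breaks removed.
#[derive(Debug, PartialEq)]
struct Record {
    header: String,
    sequence: String,
}

/// Read all records from a FASTA stream into memory.
fn read_records<R: BufRead>(input: R) -> io::Result<Vec<Record>> {
    let mut records: Vec<Record> = Vec::new();
    for line in input.lines() {
        let line = line?;
        let line = line.trim();
        if let Some(header) = line.strip_prefix('>') {
            records.push(Record {
                header: header.trim().to_string(),
                sequence: String::new(),
            });
        } else if let Some(current) = records.last_mut() {
            // Multi-line sequences are joined into one string.
            current.sequence.push_str(line);
        }
        // Lines before the first header are ignored.
    }
    Ok(records)
}

/// Give each query record the header of the reference record that has an
/// identical sequence, if any. Returns how many headers changed.
/// Assumption: exact string identity; if the reference contains duplicate
/// sequences, the last one wins.
fn infer_headers(reference: &[Record], queries: &mut [Record]) -> usize {
    let by_sequence: HashMap<&str, &str> = reference
        .iter()
        .map(|r| (r.sequence.as_str(), r.header.as_str()))
        .collect();
    let mut changed = 0;
    for q in queries.iter_mut() {
        if let Some(&header) = by_sequence.get(q.sequence.as_str()) {
            if q.header != header {
                q.header = header.to_string();
                changed += 1;
            }
        }
    }
    changed
}
```

Only the reference needs to be indexed in memory; the query file could instead be streamed record by record if its size becomes a concern.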

So I think I'll have to learn how to stream-process files, check each entry against a reference CSV table or the updated FASTA database, make any required changes to the header, and write the results to a new file.

It shouldn't be too complex indeed. That's why I think it could be within my capabilities even as a total noob. I have also learned that no battle plan survives the encounter with the enemy, but the journey is so intriguing that I don't care about that too much.
