Bioinformatics tools that could use speeding up


#1

Although it looks like a lot of pipelines are I/O-bound, there are some really interesting tools that would be more helpful if they were faster. Those speedups could be achieved with great algorithms and more C code, but I’m hoping it will be done using Rust.

In the interest of helping that to happen, here’s a link to a growing list of tools that could use attention:
Survey/Vote: if you could double the speed of any three commandline tools…


#2

I don’t want to discourage you, but most of the tools I have seen or written (I worked mostly in genome assembly, alignment, and metagenomics) were already written in C/C++, with some helper scripts in Python/Perl; the performance-critical code was in C/C++.

So you won’t get a free lunch by rewriting things in Rust. You can get speedups from better algorithms and better engineering (SIMD instructions, cache locality, better parallelism), but that is not easy.
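To make the parallelism point a bit more concrete, here is a minimal sketch of what chunked, multi-threaded processing can look like in Rust using only the standard library. The task (counting G/C bases) and the function names `gc_count` and `gc_count_parallel` are made up for illustration and do not come from any particular tool; a real tool would likely also need to stream its input rather than hold it all in memory.

```rust
use std::thread;

/// Count G and C bases in a byte slice (case-insensitive).
/// Hypothetical helper, for illustration only.
fn gc_count(seq: &[u8]) -> usize {
    seq.iter()
        .filter(|&&b| matches!(b, b'G' | b'C' | b'g' | b'c'))
        .count()
}

/// Split the sequence into roughly equal chunks and count each chunk
/// on its own scoped thread, then sum the partial counts.
fn gc_count_parallel(seq: &[u8], n_threads: usize) -> usize {
    let n_threads = n_threads.max(1);
    // Ceiling division; clamp to 1 because `chunks(0)` would panic.
    let chunk_len = ((seq.len() + n_threads - 1) / n_threads).max(1);

    thread::scope(|s| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = seq
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || gc_count(chunk)))
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Scoped threads let the workers borrow the input slice directly, so nothing is copied. Whether this is actually faster than the single-threaded loop depends on input size and on whether the pipeline is I/O-bound in the first place, so it would need measuring.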

Also, I have seen very few buffer overflows and null pointer exceptions, but I have seen far too much unreadable C code. So it would be helpful to build tools that not only work and are fast, but that can also be extended, and whose code can be read by someone other than the author.


#3

This isn’t discouraging at all. The fact that a lot of the code is in C/C++ shows that people value speed, which suggests that Rust can make a contribution.

Rust is one of the few languages that doesn’t have to be slower than C/C++.

But it’s more fun to write (subjective, I know), and can be nicer to read (not guaranteed, of course). So a Rust project may see more iteration and activity than a C/C++ one, and if that iteration is focused on speed then the code can end up faster than a comparable C/C++ repo.


#4

I agree with you.

Now the question is:
Do you want to invent a new algorithm, build a tool, and write a research paper about it, or do you want to pick an existing tool and try rewriting it (and maybe speeding it up) in Rust?

Edit (more info):
Just to be clear, if I find some time in the future to write bioinformatics tools (which I did a lot during my PhD), I would definitely start writing them in Rust.

And BTW if you really want to be helpful, take this crate: https://crates.io/crates/hdf5 and make it work :slight_smile:


#5

And BTW if you really want to be helpful, take this crate: https://crates.io/crates/hdf5 and make it work :slight_smile:

That’s really interesting. Can you give me more context on that last bit? Have you used hdf5 or seen it used? What impact would it have to get that crate working?


#6

HDF5 is a fairly standard data format, and not only in bioinformatics (although, for example, the raw data from the Nanopore MinION sequencer is produced as HDF5).
The problem with the hdf5 crate is that it can open an HDF5 file, but read functionality is not implemented yet, so it is basically useless.

The last time I wanted to use it, I just called the relevant FFI functions from hdf5-sys directly, and was not amused.


#7

HDF is the de facto standard for data sharing in the sciences. Many disciplines have actual standards that are really HDF under the hood, i.e. tables with specific names and columns stored in HDF. The crate is being developed by a very knowledgeable user of the C API, but it is in analysis paralysis at the moment. I think they are hoping to get a completely zero-cost, completely safe, and ergonomic library, and they just don’t see how to get there yet. Figuring out how to get something usable while the design work stalls would be huge. I don’t know if that means sweat equity and encouraging an unstable, "the API will break" release, or a friendly hdf5-prototype fork, or something else.

At the moment I end up using Python to process our data. So I don’t need zero-cost, just something usable enough to get me off Python.