Bioinformatics tools that could use speeding up

dhbradshaw · November 10, 2017, 1:02am

Although it looks like a lot of pipelines are I/O-bound, there are some really interesting tools that would be more helpful if they were faster. Those speedups could be achieved with great algorithms and more C code, but I'm hoping it will be done using Rust.

In the interest of helping that to happen, here's a link to a growing list of tools that could use attention:
Survey/Vote: if you could double the speed of any three commandline tools...

usamec · November 10, 2017, 9:34am

I don't want to discourage you, but most of the tools I have seen or written (I worked mostly in genome assembly, alignment, and metagenomics) were already written in C/C++ with some helper scripts in Python/Perl (the performance-critical code was in C/C++).

So you won't get a free lunch by rewriting things in Rust. You can get speedups from better algorithms and better engineering (SIMD instructions, cache locality, or better parallelism), but that is not easy.
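
As a minimal sketch of the "better parallelism" point, assuming a sequence that is already in memory as bytes: the code below counts G/C bases by splitting the input into contiguous chunks and counting each chunk on its own thread, using only the standard library (`std::thread::scope`, Rust 1.63+). The function names and the `n_threads` parameter are hypothetical and chosen for illustration; they do not come from any tool mentioned in this thread.

```rust
use std::thread;

/// Count G and C bases (either case) in one slice of a sequence.
fn gc_count(seq: &[u8]) -> usize {
    seq.iter()
        .filter(|&&b| matches!(b, b'G' | b'C' | b'g' | b'c'))
        .count()
}

/// Count G and C bases using up to `n_threads` scoped threads.
/// `n_threads` is a hypothetical tuning knob; 0 is treated as 1.
fn gc_count_parallel(seq: &[u8], n_threads: usize) -> usize {
    if seq.is_empty() {
        return 0;
    }
    let n = n_threads.max(1);
    // Ceiling division, so at most `n` chunks are produced.
    let chunk_len = (seq.len() + n - 1) / n;

    thread::scope(|s| {
        // Each chunk is a contiguous slice, which keeps memory access sequential.
        let handles: Vec<_> = seq
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || gc_count(chunk)))
            .collect();

        // Join every worker and add up the partial counts.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Calling `gc_count_parallel(b"GATTACA", 2)` gives the same result as `gc_count(b"GATTACA")`. For inputs this small the threads cost more than they save, so any real gain would have to be measured on realistic data.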

Also, I have seen very few buffer overflows and null pointer exceptions, but I have seen too much unreadable C code. So it might be helpful to build tools that not only work and are fast, but can also be extended and whose code can be read by someone other than the author.

dhbradshaw · November 10, 2017, 12:50pm

This isn't discouraging at all. The fact that a lot of the code is in C/C++ shows that people value speed, which suggests that Rust can make a contribution.

Rust is one of the few languages that doesn't have to be slower than C/C++.

But it's more fun to write (subjective, I know), and can be nicer to read (not guaranteed, of course). So a Rust project may see more iteration and activity than a C/C++ one, and if that iteration is focused on speed then the code can end up faster than a comparable C/C++ repo.

usamec · November 10, 2017, 12:54pm

I agree with you.

Now the question is:
Do you want to invent a new algorithm, build a tool, and write a research paper about it, or do you want to pick an existing tool and try rewriting it in Rust (and maybe speeding it up)?

Edit (more info):
Just to be clear, if in the future I find some time to write bioinformatics tools (which I did a lot during my PhD), I would definitely start writing them in Rust.

And BTW if you really want to be helpful, take this crate: https://crates.io/crates/hdf5 and make it work

dhbradshaw · November 10, 2017, 1:11pm

And BTW if you really want to be helpful, take this crate: https://crates.io/crates/hdf5 and make it work

That's really interesting. Can you give me more context on that last bit? Have you used hdf5 or seen it used? What impact would it have to get that crate working?

usamec · November 10, 2017, 1:19pm

HDF5 is a fairly standard data format, and not only in bioinformatics (although in bioinformatics, for example, the raw data from the Nanopore MinION sequencer are produced as HDF5).
The problem with the hdf5 crate is that it can open an HDF5 file, but read functionality is not implemented yet, so it is basically useless.

Last time I wanted to use it, I just called the relevant FFI functions from hdf5-sys and was not amused.

Eh2406 · November 10, 2017, 1:36pm

HDF is the de facto standard for data sharing in the sciences. Many disciplines have actual standards that are really HDF under the hood, i.e., tables with specific names and columns stored in HDF. The crate is being developed by a very knowledgeable user of the C API, but it is in analysis paralysis at the moment. I think they are hoping to get a completely zero-cost, completely safe, and ergonomic library, and they just don't see how to get there yet. Figuring out how to get something usable while the design work stalls would be huge. I don't know if that means sweat equity and encouraging an unstable, "the API will break" release, a friendly hdf5-prototype fork, or something else.

At the moment I end up using Python to process our data. So I don't need zero-cost, just something usable to get me off Python.
