Interest for NLP in Rust?

Hey all!

I was wondering whether the wider Rust community would be interested in the development of a natural language processing library (à la NLTK, but obviously smaller in scope). I’ve seen prior art in this field, but previous efforts seem to be mostly small experiments, and aren’t well maintained at the moment. Machine learning in Rust in general seems to be pretty immature, so is there any interest in developing in this field?

ML is a topic that gets mentioned periodically, but most of the time people have too hard a time justifying to themselves the burden of going against the momentum of TensorFlow or Keras.

It is a shame, but until someone sufficiently strong-willed and with a lot of time on her hands builds a killer app that showcases Rust’s strengths in this area, the situation is unlikely to change…

6 Likes

Well. For NLP you mostly need (maybe not all of these):

  • stemmers, tokenizers, …: these exist here and there, but not in one consistent crate, especially for languages other than English
  • something equivalent to CountVectorizer in Python (you can do it yourself in Rust, it is not that hard)
  • softmax regression over sparse data: we do not have it right now
  • some deep-learning library that can work with text: we have https://github.com/usamec/cntk-rs/blob/master/examples/sparse_ops_and_word_embeddings.rs, but it is hard to use, since it is just a low-level wrapper
  • HMM and CRF models: not sure
  • something like the Chainer deep-learning library, to work with tree-structured models: again, not sure
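The CountVectorizer point above is indeed not that hard to do yourself. Here is a minimal sketch of what a bag-of-words vectorizer in plain Rust could look like (hypothetical API, stdlib only, not an existing crate):

```rust
use std::collections::HashMap;

/// A minimal bag-of-words vectorizer, sketching what a Rust
/// CountVectorizer equivalent could look like.
struct CountVectorizer {
    vocabulary: HashMap<String, usize>, // token -> column index
}

impl CountVectorizer {
    /// Build the vocabulary from a corpus of documents.
    fn fit(docs: &[&str]) -> Self {
        let mut vocabulary = HashMap::new();
        for doc in docs {
            for tok in doc.split_whitespace() {
                let tok = tok.to_lowercase();
                let next = vocabulary.len();
                vocabulary.entry(tok).or_insert(next);
            }
        }
        CountVectorizer { vocabulary }
    }

    /// Turn one document into a dense count vector over the vocabulary.
    fn transform(&self, doc: &str) -> Vec<u32> {
        let mut counts = vec![0u32; self.vocabulary.len()];
        for tok in doc.split_whitespace() {
            if let Some(&i) = self.vocabulary.get(&tok.to_lowercase()) {
                counts[i] += 1;
            }
        }
        counts
    }
}

fn main() {
    let corpus = ["the cat sat", "the dog sat down"];
    let v = CountVectorizer::fit(&corpus);
    let row = v.transform("the cat and the dog");
    // "the" appears twice; out-of-vocabulary tokens ("and") are ignored.
    assert_eq!(row[v.vocabulary["the"]], 2);
    assert_eq!(row[v.vocabulary["cat"]], 1);
    println!("{:?}", row);
}
```

A real implementation would add n-gram support, a proper tokenizer, and sparse output, but the core is just this.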

And as HadrienG said, it is quite hard to justify not using Python. I am interested in this, but not enough to put a lot of time into it :frowning:

1 Like

For CRF models I wrote a binding to crfsuite recently.

A great Rust ecosystem could be made accessible from many host languages and boost both Python and other languages - including even C/C++. That would be a very long game, though, until you are competitive. But it can be done!

1 Like

In case it could be useful: I’ve implemented whatlang, a library for natural language detection based on trigrams. The library supports 83 languages, does not require any external databases, and is extremely fast.

You can play with online demo (built with WASM) here: https://www.greyblake.com/whatlang/
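For readers unfamiliar with the technique: trigram-based detection boils down to comparing a text’s character-trigram frequency profile against per-language reference profiles. A simplified sketch of the profile-extraction half (illustrative only, not whatlang’s actual code):

```rust
use std::collections::HashMap;

/// Count character trigrams of a text. Profiles like this, compared
/// against reference profiles built from known-language corpora, are
/// the core idea behind trigram-based language detection.
fn trigram_counts(text: &str) -> HashMap<String, u32> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    let mut counts = HashMap::new();
    // Slide a window of three code points over the text.
    for w in chars.windows(3) {
        let tri: String = w.iter().collect();
        *counts.entry(tri).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = trigram_counts("banana");
    // "banana" yields: ban, ana, nan, ana
    assert_eq!(counts["ana"], 2);
    assert_eq!(counts.len(), 3);
    println!("{:?}", counts);
}
```

Detection then scores each candidate language by how well the observed profile matches its reference profile and picks the best one.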

3 Likes

I am working on the vtext project aiming to implement a number of features mentioned by @usamec. So far, it includes tokenizers, a partial implementation of scikit-learn’s CountVectorizer / HashingVectorizer and string distance metrics. Feedback or suggestions would be very much appreciated.

A great Rust ecosystem could be made accessible from many host languages and boost both Python and other languages - including even C/C++. That would be a very long game, though, until you are competitive.

Indeed, my main motivation so far is to have fast functionality available from Python, and vtext is also exposed as a Python module built from the Rust code.

On the second point: at least for the above-mentioned functionality, I would say it is somewhat competitive already; it is often faster than existing Python/Cython implementations. Good Unicode support in Rust is also a significant advantage for text processing.
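The Unicode point can be illustrated with the standard library alone: Rust strings are guaranteed valid UTF-8, and the API keeps bytes and code points clearly apart. A small self-contained example:

```rust
/// A small illustration of why Rust's Unicode handling helps text
/// processing: strings are always UTF-8, and the standard library
/// distinguishes bytes from code points.
fn main() {
    let s = "naïve café";
    // Byte length differs from the number of code points.
    assert_eq!(s.len(), 12);           // bytes
    assert_eq!(s.chars().count(), 10); // code points
    // Whitespace tokenization is Unicode-aware out of the box.
    let tokens: Vec<&str> = s.split_whitespace().collect();
    assert_eq!(tokens, ["naïve", "café"]);
}
```

Indexing into the middle of a multi-byte character panics rather than silently corrupting the text, which catches a whole class of bugs that plague C-style string handling.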

1 Like

It’s frequently the case that training the model is a tiny fraction of the code that has to be written and often it’s the least interesting part of the app.

Since Rust is a fantastic application development language, writing apps that use ML in Rust is an easy sell to me. I would gladly write a production application that incorporates ML in Rust; it would be (and is) my first choice. That applies to NLP applications like search as well.

The hard part is often cleaning, parsing, and/or generating the data and then deploying the inferences where they’re useful and on the time scale and resource budget that’s available - e.g. on mobile devices, embedded device, offline, on the GPU, interacting with the camera driver, etc. When the other option is C or C++, Rust bindings to the established ecosystem is a great alternative for inference.

There are two parts to ML problems: training and inference. I could be wrong, but I don’t see how a training library natively written in Rust is going to supplant something like TensorFlow until const generics goes in, and then enough time passes for a nice linear algebra library based on const generics to mature to the level of Eigen, which is a colossal tower of awesomeness. Even when that happens, there’s still CUDA and OpenCL, which aren’t going to be Rusty in the foreseeable future, so being completely Rust-only until you hit the metal has to wait on that as well. But the good news, again, is the cost-free interop with C ABIs and the fact that the API to a model is typically small and well-defined. Training in $language and deploying in Rust is a great solution. (I’m happy for a coterie of intrepid geniuses to prove me wrong and develop a full-blown TensorFlow alternative; I would happily never read another line of C++ again.)

For inference, exporting a model trained in some other language to ONNX and then using, for instance, the tract inference engine in Rust is a possibility.

There are two parts to ML problems: training and inference. I could be wrong, but I don’t see how a training library natively written in Rust is going to supplant something like TensorFlow

It’s not necessarily about supplanting; it is more a question of what such projects use internally, even if the user API is in a different language. For instance, TensorFlow went with Swift recently, but Rust was high on the list of possible choices.

I’ve got a string distances package I’ve been working on for my company, so I’ll give an example of a real-world project here. The general goal is product/hotel matching (just the names).

As far as the string distances go, I looked around extensively and did not find what I wanted, so I decided to roll my own. But I’m in a weird spot: there are some open source projects that have about 50% of what I need, while the other half requires a fundamental shift in the design of the API. So I’ve been trying to figure out how best to consolidate the work that’s been done, give credit, and attribute previous work, yet still move the needle, all without offending people or coming off as “stealing” someone’s work. Suggesting PRs on projects is often not that productive in this case due to the amount of changes I’m making; the results are virtually different projects. Nobody wants a PR to their project where you’ve changed over half of the code and overhauled the API, because that seems kind of rude too.
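For context, the kind of building block such a string-distance package starts from is the classic dynamic-programming Levenshtein distance, computed over Unicode code points so that accented hotel names compare sanely. The sketch below is illustrative only, not the author’s actual library:

```rust
/// Levenshtein edit distance over Unicode code points, using the
/// standard two-row dynamic-programming formulation.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between a[..i] and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push(
                (prev[j] + cost)      // substitution (or match)
                    .min(prev[j + 1] + 1) // deletion
                    .min(cur[j] + 1),     // insertion
            );
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    // Code-point comparison handles accents without byte-level surprises.
    assert_eq!(levenshtein("Hôtel", "Hotel"), 1);
    println!("ok");
}
```

Real matching pipelines typically layer normalization (casefolding, accent stripping) and cheaper filters (length, token overlap) on top, so the quadratic distance is only computed for plausible candidate pairs.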

For me personally, Rust fits well for data engineering and productionizing trained models. So basically, the bookends of a typical data science project, which in reality I think is a fantastic thing. That is, a team often has a data engineering or machine learning engineer whose job it is to serve data on a silver platter to the research or data scientist, and who then gets handed a trained model or architecture to deploy into a production environment where high reliability and low latency are key.

My current project, which requires good string distance algorithms, is an outgrowth of something I wrote for another company. I need to be able to process crap tons of data and build a custom data object that contains various string metrics, then hand those metrics off for training/matching/etc. (which I was going to use Julia for), and finally get the results of that modeling back so I can do classification. In my previous job I used Go for all of it, but ran into some issues I suspect were related to GC, so I’ve shifted the heavy lifting to Rust.

Worth pointing out for anyone who does not know: NLTK was started by the linguist Steven Bird, and over the years had significant work contributed by graduate students. Much of this happened in an academic setting and was funded by grants.
As far as I know there are not many academics using Rust. Unless they are computer science academics, the typical target language is something perceived as “easy” to pick up (R, Python, MATLAB, Julia).

In fact, the only academic I know of working on NLP/computational linguistics in Rust is @danieldk (https://elaml.danieldk.eu/schedule/).

1 Like

@jbowles Thanks for @-ing me, I don’t follow this forum actively.

Several members of our research group use Rust. We never set out to build a Rust NLP framework; it has been largely a pragmatic choice. We machine-annotate large amounts of data (decades of newspaper text, web-scale corpora). Even though we also use Python, the overhead of the Python parts is typically too large for large corpora, unless you use Cython.

I used Go before, but it had some downsides for us, especially when combined with Tensorflow, which requires aligned memory for tensors. There are workarounds, but they are annoying (mostly due to expensive cgo calls and non-deterministic destruction).

Over the last two years or so, we have built various NLP-related tools, such as:

And a smattering of tools that are really project-specific.

Unfortunately, there are still a lot of gaps for doing NLP in Rust, from basic preprocessing tools (robust tokenization, sentence splitting), numeric/ML libraries, to actual language processing tools.

I think it would be worthwhile to set up a Rust NLP working group that aims to work towards creating something like spaCy in Rust. I am not interested in working through the bureaucratic hurdles of setting up a working group (if there are any), but I would definitely be willing to be part of such a group and contribute code and ideas.

3 Likes

Training in $language and deploying in Rust is a great solution.

Actually, you don’t have to. You can define a TensorFlow graph in Python, save it, and do both training and prediction in Rust. This is how we use TensorFlow.

(Theoretically, you can also define the graph directly in Rust, but then you’ll miss out on a lot of useful abstractions that the TensorFlow Python library provides.)

In my own project (AlphaZero-style game player), I started with this approach, and eventually backed off and started running my training in Python. I liked the Python dataset API, which makes batching and shuffling easy. Do you have a public example of how you’re using the TF bindings for training? I’d love to read it.

Do you have a public example of how you’re using the TF bindings for training? I’d love to read it.

Tensorflow graph runs:

Vectorization of inputs + batching:

4 Likes

Thanks! That’s very helpful.

While this thread is still somewhat alive: Luca Palmieri has opened a discussion on areas related to this:

There is also a science-and-ai-dev channel on the Rust Programming Language Community Discord server.