Interest in NLP in Rust?


Hey all!

I was wondering if the wider Rust community would be interested in the development of a natural language processing library (à la NLTK, but obviously smaller in scope). I’ve seen prior art in this field, but previous efforts seem to be mostly small experiments that aren’t well maintained at the moment. In general, machine learning in Rust seems to be pretty immature, so I was wondering if there is any interest in development in this field.


ML is a topic that gets mentioned periodically, but most of the time people have a hard time justifying to themselves the burden of going against the momentum of TensorFlow or Keras.

It is a shame, but until someone sufficiently strong-willed and with a lot of time on her hands builds a killer app that showcases Rust’s strengths in this area, the situation is unlikely to change…


Well. For NLP you mostly need (maybe not all of these):

  • stemmers, tokenizers, … these exist here and there, but not in one consistent crate, especially for languages other than English
  • something equivalent to CountVectorizer in Python (you can do it yourself in Rust, it is not that hard)
  • softmax regression over sparse data, which we do not have right now
  • some deep learning library which can work with text; we have that, but it is hard to use, since it is just a low-level wrapper
  • HMM, CRF models - not sure
  • something like the Chainer deep learning library to work with tree models - again not sure
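
To give a feel for the "do it yourself" CountVectorizer point, here is a minimal sketch using only the standard library. The function names (`tokenize`, `fit_vocabulary`, `transform`) are made up for illustration, and the tokenizer is a deliberate simplification (lowercase, split on non-alphanumeric characters), so it will not match scikit-learn's defaults exactly.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Lowercase the text and split it on anything that is not alphanumeric.
/// (Simplified tokenizer, assumed for this sketch only.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Build a vocabulary that maps each distinct token to a column index.
/// Tokens are sorted, so the indices follow alphabetical order.
fn fit_vocabulary(docs: &[&str]) -> BTreeMap<String, usize> {
    let tokens: BTreeSet<String> = docs.iter().flat_map(|d| tokenize(d)).collect();
    tokens.into_iter().enumerate().map(|(i, t)| (t, i)).collect()
}

/// Turn one document into a sparse count vector: column index -> count.
/// Tokens missing from the vocabulary are ignored.
fn transform(vocab: &BTreeMap<String, usize>, doc: &str) -> BTreeMap<usize, u32> {
    let mut counts = BTreeMap::new();
    for token in tokenize(doc) {
        if let Some(&col) = vocab.get(&token) {
            *counts.entry(col).or_insert(0) += 1;
        }
    }
    counts
}
```

Calling `fit_vocabulary` on the training documents and then `transform` on each document yields sparse rows, which is the kind of input a softmax regression over sparse data would consume.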

And as HadrienG said, it is quite hard to justify not using Python. I am interested in this, but not enough to put a lot of time into it :frowning:


For CRF models I wrote a binding to crfsuite recently.


A great Rust ecosystem could be made accessible from many host languages and boost Python as well as other languages, including even C/C++. It would be a very long game until it is competitive, though. But it can be done!


If this could be useful, I’ve implemented whatlang, a library for natural language detection based on trigrams. The library supports 83 languages, does not require any databases and works extremely fast.
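
For anyone curious what "based on trigrams" means in practice, here is a rough sketch of the general idea using only the standard library. This is not whatlang's actual implementation or API: the function names, the overlap score, and the tiny profiles built from a single sentence are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Count overlapping character trigrams in the lowercased text.
fn trigram_counts(text: &str) -> HashMap<String, usize> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    let mut counts = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    counts
}

/// Sum the counts of the text's trigrams that also occur in the profile.
/// (A naive score chosen for this sketch; real detectors use better ones.)
fn overlap(text: &HashMap<String, usize>, profile: &HashMap<String, usize>) -> usize {
    text.iter()
        .filter(|(t, _)| profile.contains_key(*t))
        .map(|(_, n)| *n)
        .sum()
}

/// Return the name of the profile with the highest overlap,
/// or None if no profile shares a trigram with the text.
fn guess_language<'a>(
    text: &str,
    profiles: &'a [(&'a str, HashMap<String, usize>)],
) -> Option<&'a str> {
    let counts = trigram_counts(text);
    profiles
        .iter()
        .map(|(name, profile)| (*name, overlap(&counts, profile)))
        .filter(|(_, score)| *score > 0)
        .max_by_key(|(_, score)| *score)
        .map(|(name, _)| name)
}
```

A profile here is just the trigram counts of some sample text in a language, e.g. `("english", trigram_counts("the quick brown fox jumps over the lazy dog"))`. Profiles built from larger samples would be needed for anything beyond a toy.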