I was wondering if the wider Rust community would be interested in the development of a natural language processing library (à la NLTK, but obviously smaller in scope). I’ve seen prior art in this field, but previous efforts seem to be mostly small experiments, and aren’t maintained that well at the moment. In general, machine learning in Rust seems to be pretty immature, so I was curious whether there was any interest in developing this field.
A great Rust ecosystem could be made accessible from many host languages, benefiting Python and others - even C/C++. That would be a very long game, though, until you are competitive. But it can be done!
In case it’s useful, I’ve implemented whatlang, a library for natural language detection based on trigrams. The library supports 83 languages, does not require any databases, and is extremely fast.
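For readers unfamiliar with the trigram approach, the core primitive is building a profile of character trigram frequencies for a text and comparing it against per-language profiles. Here is a generic sketch of the profile-building step (illustrative only, not whatlang’s actual internals or API):

```rust
use std::collections::HashMap;

// Count character trigrams in a text, the kind of profile a
// trigram-based language detector compares against per-language
// reference profiles.
fn trigram_counts(text: &str) -> HashMap<String, usize> {
    // Normalize: lowercase and keep only letters and spaces.
    let chars: Vec<char> = text
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphabetic() || *c == ' ')
        .collect();
    let mut counts = HashMap::new();
    for window in chars.windows(3) {
        let tri: String = window.iter().collect();
        *counts.entry(tri).or_insert(0) += 1;
    }
    counts
}
```

A detector would then rank languages by how closely each language’s reference profile matches these counts.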
I am working on the vtext project aiming to implement a number of features mentioned by @usamec. So far, it includes tokenizers, a partial implementation of scikit-learn’s CountVectorizer / HashingVectorizer and string distance metrics. Feedback or suggestions would be very much appreciated.
A great Rust ecosystem could be made accessible from many host languages, benefiting Python and others - even C/C++. That would be a very long game, though, until you are competitive.
Indeed, my main motivation so far is to have fast functionality available in Python, and vtext is also exposed as a Python module built from Rust.
About the second point, at least for the above-mentioned functionality, I would say it should be somewhat competitive already. It is often faster than existing Python/Cython implementations. Rust’s good Unicode support is also a significant advantage for text processing.
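For anyone unfamiliar with the CountVectorizer idea mentioned above, a minimal sketch of the fit/transform steps might look like this (hypothetical code for illustration, not vtext’s actual API):

```rust
use std::collections::HashMap;

// Fit: assign each distinct (lowercased, whitespace-split) token a
// column index, in order of first appearance.
fn fit_vocabulary(docs: &[&str]) -> HashMap<String, usize> {
    let mut vocab = HashMap::new();
    for doc in docs {
        for tok in doc.split_whitespace() {
            let next = vocab.len();
            vocab.entry(tok.to_lowercase()).or_insert(next);
        }
    }
    vocab
}

// Transform: count token occurrences in one document as a row of the
// document-term matrix; tokens outside the vocabulary are dropped.
fn transform(doc: &str, vocab: &HashMap<String, usize>) -> Vec<usize> {
    let mut row = vec![0; vocab.len()];
    for tok in doc.split_whitespace() {
        if let Some(&idx) = vocab.get(&tok.to_lowercase()) {
            row[idx] += 1;
        }
    }
    row
}
```

A HashingVectorizer differs mainly in skipping the fit step: it hashes each token directly to a column index.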
It’s frequently the case that training the model is a tiny fraction of the code that has to be written and often it’s the least interesting part of the app.
Since Rust is a fantastic application development language, writing apps that use ML in Rust is an easy sell to me. I would gladly write a production application that incorporates ML in Rust; it would be (and is) my first choice. That would apply to NLP applications like search as well.
The hard part is often cleaning, parsing, and/or generating the data, and then deploying the inferences where they’re useful, on the time scale and within the resource budget that’s available: e.g. on mobile devices, embedded devices, offline, on the GPU, interacting with the camera driver, etc. When the other option is C or C++, Rust bindings to the established ecosystem are a great alternative for inference.
There are two parts to ML problems: training and inference. I could be wrong, but I don’t see how a training library natively written in Rust is going to supplant something like TensorFlow until const generics lands and then enough time passes for a nice linear algebra library based on const generics to mature to the level of Eigen, which is a colossal tower of awesomeness. Even when that happens, there’s still CUDA and OpenCL, which aren’t going to be Rusty in the foreseeable future, so going completely Rust-only until you hit the metal has to wait on that as well. But the good news, again, is the cost-free interop with C ABIs and the fact that the API to a model is typically small and well-defined. Training in $language and deploying in Rust is a great solution. (I’m happy for a coterie of intrepid geniuses to prove me wrong and develop a full-blown TensorFlow alternative; I would happily never read another line of C++ again.)
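The "train elsewhere, deploy in Rust" interop point can be sketched as a function exposed over the C ABI, callable from any host language (Python via ctypes, C, C++). The linear scorer and its hard-coded weights below are stand-ins, not a real framework binding:

```rust
// A model's predict function exposed over the C ABI. In a real
// deployment the weights would be loaded from a file exported by
// whatever framework did the training.
#[no_mangle]
pub extern "C" fn predict(features: *const f32, len: usize) -> f32 {
    // Hypothetical fixed weights for a 3-feature linear model.
    const WEIGHTS: [f32; 3] = [0.5, -0.25, 1.0];
    // Safety: the caller must pass a valid pointer to `len` floats.
    let xs = unsafe { std::slice::from_raw_parts(features, len) };
    xs.iter().zip(WEIGHTS.iter()).map(|(x, w)| x * w).sum()
}
```

Built as a `cdylib`, this is exactly the kind of small, well-defined model API that makes the interop cheap.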
I’ve got a string distances package I’ve been working on for my company. I’ll give an example here of a real-world project. The general goal is product/hotel matching (just the names).
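As a concrete illustration of the kind of primitive such a package builds on, here is a generic edit (Levenshtein) distance, often the baseline metric for name matching. This is textbook code, not the poster’s package:

```rust
// Levenshtein distance via the classic two-row dynamic program:
// the minimum number of single-character insertions, deletions,
// and substitutions turning one string into the other.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // prev[j]: diagonal, prev[j+1]: above, cur[j]: left.
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}
```

Real matching pipelines typically combine several such metrics rather than relying on one.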
As far as string distances go, I looked around extensively and did not find what I wanted, so I decided to spin my own. But I’m in a weird spot: there are some open source projects that have maybe 50% of what I need, but the other half requires a fundamental shift in the design of the API. So I’ve been trying to figure out how to best consolidate work that’s been done, give credit, and attribute previous work, but still move the needle, without offending people or coming off as “stealing” someone’s work. Suggesting PRs on those projects is not often productive in this case due to the amount of changes I’m making; they are virtually different projects. Nobody wants a PR to their project where you’ve changed over half of the code and overhauled the API, because that seems kind of rude too.
For me personally, Rust fits well for data engineering and productionizing trained models: basically, the bookends of a typical data science project. In reality, I think that is a fantastic thing. On a typical team you’ve got a data engineer or machine learning engineer whose job it is to serve data on a silver platter to the researcher or data scientist, and who then gets handed a trained model or architecture to deploy into a production environment where high reliability and low latency are key.
My current project, which requires good string distance algos, is an outgrowth of something I wrote for another company. I need to be able to process crap tons of data and build a custom data object that contains various string metrics. Then I hand those metrics off for training/matching/etc. (which I was going to use Julia for), and then I want the results of that modeling so I can do classification. In my previous job I used Go for all of it but ran into some issues I suspect were related to GC, so I’ve shifted the heavy lifting to Rust.
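The "custom data object of string metrics" step could be sketched like this; the struct name and the particular metrics (length difference, common prefix length, character-bigram Jaccard) are hypothetical choices for illustration, not the poster’s actual design:

```rust
use std::collections::HashSet;

// Features computed for one candidate pair of names, to be handed
// off to downstream matching/training code.
#[derive(Debug)]
struct PairMetrics {
    len_diff: usize,
    common_prefix: usize,
    bigram_jaccard: f64,
}

// Set of adjacent character pairs in a string.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let cs: Vec<char> = s.chars().collect();
    cs.windows(2).map(|w| (w[0], w[1])).collect()
}

fn pair_metrics(a: &str, b: &str) -> PairMetrics {
    let (ba, bb) = (bigrams(a), bigrams(b));
    let inter = ba.intersection(&bb).count();
    let union = ba.union(&bb).count();
    PairMetrics {
        len_diff: a.chars().count().abs_diff(b.chars().count()),
        common_prefix: a.chars().zip(b.chars()).take_while(|(x, y)| x == y).count(),
        bigram_jaccard: if union == 0 { 1.0 } else { inter as f64 / union as f64 },
    }
}
```

Batch-computing such structs over millions of pairs is exactly the "heavy lifting" where Rust’s lack of GC pauses pays off.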
Worth pointing out for anyone who does not know: NLTK was started by the linguist Steven Bird, and over the years significant work was contributed by graduate students. Much of this happened in an academic setting and was funded by grants.
As far as I know, there are not many academics using Rust. Unless they are computer science academics, their typical language of choice is something perceived as “easy” to pick up (R, Python, MATLAB, Julia).
@jbowles Thanks for @-ing me, I don’t follow this forum actively.
Several members of our research group use Rust. We never set out to build a Rust NLP framework; it has been largely a pragmatic choice. We machine-annotate large amounts of data (decades of newspaper text, web-scale corpora). Even though we also use Python, the overhead of the Python parts is typically too large for corpora of that size, unless you use Cython.
I used Go before, but it had some downsides for us, especially when combined with Tensorflow, which requires aligned memory for tensors. There are workarounds, but they are annoying (mostly due to expensive cgo calls and non-deterministic destruction).
Over the last two years or so, we have built various NLP-related tools, such as:
And a smattering of tools that are really project-specific.
Unfortunately, there are still a lot of gaps for doing NLP in Rust, from basic preprocessing tools (robust tokenization, sentence splitting) and numeric/ML libraries to actual language processing tools.
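To make the tokenization gap concrete, here is the naive baseline that is easy to write today; "robust" tokenization is everything this sketch gets wrong (abbreviations, clitics, hyphenation, URLs, language-specific rules):

```rust
// Naive tokenizer: split on whitespace, then strip leading/trailing
// non-alphanumeric characters. Internal punctuation survives, so
// "e.g." becomes "e.g" rather than being handled as an abbreviation.
fn naive_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|w| !w.is_empty())
        .map(|w| w.to_string())
        .collect()
}
```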
I think it would be worthwhile to set up a Rust NLP working group that aims to work towards creating something like spaCy in Rust. I am not interested in working through the bureaucratic hurdles of setting up a working group (if there are any), but I would definitely be willing to be part of such a group and contribute code and ideas.
In my own project (AlphaZero-style game player), I started with this approach, and eventually backed off and started running my training in Python. I liked the Python dataset API, which makes batching and shuffling easy. Do you have a public example of how you’re using the TF bindings for training? I’d love to read it.