Tantivy 0.4.0 released + blog post


#1

Tantivy is a search engine library written in Rust, strongly inspired by Lucene.

Here is the blog post about how tantivy’s indexing works.

The changelog for 0.4.0 is available here.


#2

Awesome article. It shows that for some kinds of software, Rust and C++ are among the very few options. I’d like your article to show some code and perhaps a diagram of the data structures’ memory layout.

From your Reddit post:

I spent a fair bit of time over compiler and library bugs.

Have you reported all those bugs? Compared to other languages (even languages considerably older than Rust), I find the Rust compiler and stdlib to have very few bugs.


#3

The Rust compiler and stdlib are incredibly solid. Unfortunately, tantivy requires nightly, and I hit a bug in the compiler (my code would segfault sporadically). I wasn’t sure the bug was in the compiler, so I shared my problem on Reddit. Someone else had experienced a similar issue and ended up filing the issue. It was fixed in about a week.

The second issue was a performance regression that made the code 10x slower. I didn’t report it, because I suspect it is due to inlining, and I was asking too much of the compiler to begin with. Still, a 10x performance change is unexpected.

Apart from that, I hit a major issue in crossbeam (a memory leak) that had already been reported and was fixed recently, and an issue in rust-protobuf (a segfault when decoding specific values), for which I sent a patch. I also sent a patch for a very trivial bug in the fst crate. I think that’s all.

To be clear, it is rare for me to find bugs in Java’s famous libraries (Guava, protobuf, etc.). That does not mean that the Rust compiler or Rust crates have more bugs, simply that they have fewer users.

The productivity gain I got from using Rust instead of C++ largely (veeeerrrry largely) compensates for the time spent investigating bugs.


#4

Awesome article. It shows that for some kinds of software, Rust and C++ are among the very few options. I’d like your article to show some code and perhaps a diagram of the data structures’ memory layout.

I feel more comfortable talking about search than about Rust. I’ll try to get out of my comfort zone next time and write a bit about some interesting Rust patterns I used in tantivy.


#5

Performance/inlining issues are valid issues, and sometimes they are worth reporting.


#6

Probably… There is still a ticket open on tantivy’s issue list. I’ll eventually investigate the root cause and possibly submit a bug report if necessary.


#7

Hello, I’m currently using Sphinx search in my project, mainly for gathering and matching book metadata (authors, publishers, etc.). I fetched and indexed all the authors, and I search over them in a loop when needed. What matters to me is speed and morphology. Does your engine support morphological search?


#9

Tantivy does not support any form of morphological search in the latest version (0.4.0). A proper configurable text analysis pipeline with a snowball stemmer is scheduled for 0.5.0.

It is actually already sitting in a pull request.

I would not recommend that you use tantivy just yet. I think 0.5.0 should be released in September.

Are you happy with Sphinx? Why did you prefer it over Elasticsearch or Solr?
Do you use it as a library, or with the server?


#10

Thanks for the reply.
Sphinx is used both in the data parsers and as the search engine for the website. I have a site with audiobooks. The data parsers query the available voiceovers on different sites and fill in missing metadata. So the input is not strongly structured data. That’s why I need to find at least the book’s author name and title.
This is where Sphinx and its morphological search come in. It has very fast indexing, but its API and design are a bit outdated, and it is no longer being developed. I wanted to migrate to Elastic, but all the parsers have been rewritten from Python to Rust. That’s why I’m looking forward to tantivy.

Btw, there is a library, https://github.com/irbis-labs/rsmorphy, for morphological analysis of Russian and Ukrainian. Maybe it will help you in the future to adapt search to languages other than English.


#11

It should not be too difficult to integrate rsmorphy with tantivy 0.5.0.

I see it sometimes offers a list of forms with different probabilities. Do you usually index all of them or only the one with the highest probability? Do you have a payload per token today, to use the probability for scoring your documents?
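To make the two options in that question concrete, here is a rough sketch in Rust. The `Candidate` type and both function names are hypothetical and are not taken from rsmorphy or tantivy; they only stand in for "a list of forms with different probabilities".

```rust
/// Hypothetical candidate main form with the analyzer's probability.
/// This type is an assumption for illustration, not a real library type.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    lemma: String,
    probability: f64,
}

/// Option 1: keep only the candidate with the highest probability.
/// Returns `None` when the analyzer proposed nothing.
fn best_form(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by(|a, b| a.probability.total_cmp(&b.probability))
}

/// Option 2: keep every candidate at or above a threshold, so the
/// probability can later be stored next to the token and used for scoring.
fn forms_above(candidates: &[Candidate], min_probability: f64) -> Vec<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.probability >= min_probability)
        .collect()
}
```

The first option keeps the index small. The second only pays off if the engine can store a per-token payload and use it at scoring time.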


#12

There are two parts to data verification and analysis.
The first one is Sphinx morphological search. For example, say I found an audiobook on a site with the title string:
“Stephen King — The dark tower”. I take this string and search the Sphinx index of all book authors. “Stephen King” is in the index, so it comes first in the result list. But Russian and Ukrainian are more complex in this respect: words take different endings depending on case, gender, number, etc. That’s why I now split the string on spaces and get the main form of each word using the morphy library. This matters less for book authors and more for book titles, where there are many possible variations in naming.
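The split-and-normalize step described above could look roughly like this in Rust. This is only a sketch: the `lemmas` lookup table is a hypothetical stand-in for a real morphological analyzer (it does not use the rsmorphy API), and `normalize_words` is a made-up name.

```rust
use std::collections::HashMap;

/// Splits `text` on whitespace, strips surrounding punctuation, lowercases
/// each word, and replaces it with its main form when the table knows one.
/// `lemmas` is a hypothetical stand-in for a morphological analyzer.
fn normalize_words(text: &str, lemmas: &HashMap<String, String>) -> Vec<String> {
    text.split_whitespace()
        // Drop leading and trailing non-alphanumeric characters (dashes, quotes).
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        // Tokens that were pure punctuation are now empty; skip them.
        .filter(|w| !w.is_empty())
        // Fall back to the word itself when no main form is known.
        .map(|w| lemmas.get(w.as_str()).cloned().unwrap_or(w))
        .collect()
}
```

With an empty table, the title string from the example is reduced to `stephen`, `king`, `the`, `dark`, `tower`. With entries for inflected Russian or Ukrainian forms, each matching word is replaced by its main form before it is sent to the search index.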


#13

Ok, so you only retain the main form.