Looking for an ElasticSearch-like minimal search engine as a standalone library (no server)

Hi there!

I'm currently building a media streaming server mostly designed to handle large music collections (think tens of thousands of files taking up several hundred gigabytes).

Everything's working great so far, I have a Rocket + Async-GraphQL server which works amazingly well.

The indexing is done by scanning all the tracks (on multiple threads to speed up the process) and building a cache that allows quickly listing the albums, the tracks of each album, the artists, etc.
With 15k files it is small enough to take less than 10 MB on disk as a JSON file, and even less in RAM.

Now, I need to make a search query where you input a search text and a few filters (like who the artist is, what genre should be selected, etc.) and it returns the results. The search text should be applied on several fields, meaning the search should be performed on the track's title but also on the artists' name, the album's name, etc.

The problem is that I can't simply iterate over each entry of the index and perform the search, as it would take quite a bit of time, especially on lower-end machines. What I'd need is an indexing library similar to what ElasticSearch or MeiliSearch do, but a lot more minimalist - I don't want a separate server, as all my data is contained in a simple index, and I just want to perform basic searches, nothing too fancy.

What solutions are there in the Rust ecosystem for this purpose? I haven't found anything so far.

Thanks in advance for your help!

Unless you need a pure-Rust implementation for some reason, I'd look at SQLite's full-text-search.


I'm using a JSON index, so I don't want to deal with a SQLite database to perform the search; I'd like to build the search index directly from my existing one.

Ideas:

  • Tantivy is a search engine library. I wouldn't call it "minimal," but it is a library.
  • Are you sure an exhaustive search isn't good enough? Have you tried it? It sounds like you have a tiny amount of data that should be able to be searched exhaustively pretty quickly.
  • Otherwise, you could always build your own index. I did just that for all of the data in IMDb.

For tracks only (which are the main data source to search in), there are 5 fields to search, and the server should ideally handle up to 100k files in under 500 ms. That means at least 100k * 5 * 2 (to scale up to a full second) = 1M string comparisons per second, and since it shouldn't be an exact match (one word can match one field and another word a different field), the search text also has to be split into words.

So that's over 1M comparisons per second, which won't be possible on a modest machine. Of course it's a worst-case scenario, but it can still happen. Even with more modest libraries it can be quite hard to reach.

Also, I'd like at least a bit of natural query syntax, such as AND / OR and parentheses, which is already a lot harder to support.

I'm going to try Tantivy and see what I get :slight_smile:

If you want to write this yourself, you can probably get a long way by:

  1. Reduce the query to Conjunctive normal form
  2. Use a Word ↦ [Tracks] mapping to find a candidate set for each AND clause
  3. Iterate over the entirety of these sets to check for the full given condition
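A minimal sketch of steps 2 and 3, assuming a hypothetical `Track` type with an `id` and a `title` (the field names here are illustrative, not from the original code): build a word → track-IDs inverted index, union candidates within each OR clause, then intersect across the AND clauses of the CNF query.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical track type; field names are assumptions for illustration.
struct Track {
    id: usize,
    title: String,
}

// Step 2: build a word -> {track ids} inverted index.
fn build_index(tracks: &[Track]) -> HashMap<String, HashSet<usize>> {
    let mut index: HashMap<String, HashSet<usize>> = HashMap::new();
    for track in tracks {
        for word in track.title.split_whitespace() {
            index.entry(word.to_lowercase()).or_default().insert(track.id);
        }
    }
    index
}

// Steps 2-3: `cnf` is a conjunction of OR clauses (each inner Vec is one clause).
// Union candidates within a clause, then intersect across clauses.
fn search(index: &HashMap<String, HashSet<usize>>, cnf: &[Vec<&str>]) -> HashSet<usize> {
    let mut result: Option<HashSet<usize>> = None;
    for clause in cnf {
        let mut candidates = HashSet::new();
        for word in clause {
            if let Some(ids) = index.get(&word.to_lowercase()) {
                candidates.extend(ids);
            }
        }
        result = Some(match result {
            None => candidates,
            Some(acc) => acc.intersection(&candidates).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

The resulting candidate set can then be scanned exhaustively to verify the exact condition, which is step 3.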

As I was curious, I still made a little benchmark on my computer. Here is the code:
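(The original snippet isn't reproduced here; judging by the memchr-based version quoted later in the thread, the brute-force benchmark presumably looked roughly like this sketch, with the `Track` field names inferred from that later code:)

```rust
use std::collections::HashMap;
use std::time::SystemTime;

// Field names inferred from the memchr version later in the thread.
struct Track {
    title: String,
    artists: Vec<String>,
    albums: Vec<String>,
}

// Brute-force: every whitespace-separated word of the query must appear
// as a substring of the title, one of the artists, or one of the albums.
fn test_search(name: &str, tracks: &HashMap<String, Track>, search: &str) {
    println!("> Running search test '{name}'...");
    let now = SystemTime::now();

    let words: Vec<&str> = search.split_whitespace().collect();

    let results: Vec<_> = tracks
        .values()
        .filter(|track| {
            words.iter().all(|word| {
                track.title.contains(*word)
                    || track.artists.iter().any(|artist| artist.contains(*word))
                    || track.albums.iter().any(|album| album.contains(*word))
            })
        })
        .collect();

    let elapsed = now.elapsed().unwrap();
    println!(
        "> Found {} result(s) in {} ms\n----------------",
        results.len(),
        elapsed.as_millis()
    );
}
```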

Here are the results on my i7-8750H CPU (6 cores, 12 threads):

> Running search test 'empty string'...
> Found 100000 result(s) in 0 ms
----------------
> Running search test 'super'...
> Found 100000 result(s) in 36 ms
----------------
> Running search test 'super album'...
> Found 100000 result(s) in 94 ms
----------------
> Running search test 'super artist'...
> Found 100000 result(s) in 54 ms
----------------
> Running search test 'long prefix'...
> Found 100000 result(s) in 72 ms
----------------
> Running search test 'long suffix'...
> Found 100000 result(s) in 77 ms
----------------
> Running search test 'no-result'...
> Found 0 result(s) in 185 ms
----------------
> Running search test 'long string'...
> Found 0 result(s) in 177 ms
----------------

To my surprise, performing a word-per-word comparison instead of the whole string at once produces better results, even though it performs more comparisons.

So it could be viable. The comparison here is not multi-threaded, so I'll see what it gives when multi-threaded (if it's any better).

EDIT: With Rayon it's a lot faster:

> Running search test 'empty string'...
> Found 100000 result(s) in 23 ms
----------------
> Running search test 'super'...
> Found 100000 result(s) in 23 ms
----------------
> Running search test 'super album'...
> Found 100000 result(s) in 27 ms
----------------
> Running search test 'super artist'...
> Found 100000 result(s) in 25 ms
----------------
> Running search test 'long prefix'...
> Found 100000 result(s) in 20 ms
----------------
> Running search test 'long suffix'...
> Found 100000 result(s) in 20 ms
----------------
> Running search test 'no-result'...
> Found 0 result(s) in 40 ms
----------------
> Running search test 'long string'...
> Found 0 result(s) in 39 ms
----------------

Aye, indeed, "when in doubt, use brute force." Or at least try it to see if it's good enough, because it's often simpler.

While there are a variety of different ways to speed that code up, here's a very small change that gives you a nice boost (approaching rayon speeds without any parallelism, so with rayon, this should go even faster):

use std::collections::HashMap;
use std::time::SystemTime;

// `Track` is assumed to have `title: String`, `artists: Vec<String>`
// and `albums: Vec<String>` fields, as used below.
fn test_search(name: &'static str, tracks: &HashMap<String, Track>, search: &str) {
    println!("> Running search test '{name}'...");
    let now = SystemTime::now();

    let finders: Vec<_> = search
        .split_whitespace()
        .map(|word| memchr::memmem::Finder::new(word))
        .collect();

    let results: Vec<_> = tracks
        .values()
        .filter(|track| {
            finders.iter().all(|finder| {
                finder.find(track.title.as_bytes()).is_some()
                || track.artists.iter().any(|artist| finder.find(artist.as_bytes()).is_some())
                || track.albums.iter().any(|album| finder.find(album.as_bytes()).is_some())
            })
        })
        .collect();

    let elapsed = now.elapsed().unwrap();

    println!(
        "> Found {} result(s) in {} ms\n----------------",
        results.len(),
        elapsed.as_millis()
    );
}

And timings:

[andrew@frink benchmark-track-search]$ ./target/release/benchmark-track-search-original
> Running search test 'empty string'...
> Found 100000 result(s) in 0 ms
----------------
> Running search test 'super'...
> Found 100000 result(s) in 45 ms
----------------
> Running search test 'super album'...
> Found 100000 result(s) in 148 ms
----------------
> Running search test 'super artist'...
> Found 100000 result(s) in 64 ms
----------------
> Running search test 'long prefix'...
> Found 100000 result(s) in 67 ms
----------------
> Running search test 'long suffix'...
> Found 100000 result(s) in 70 ms
----------------
> Running search test 'no-result'...
> Found 0 result(s) in 180 ms
----------------
> Running search test 'long string'...
> Found 0 result(s) in 156 ms
----------------
[andrew@frink benchmark-track-search]$ ./target/release/benchmark-track-search
> Running search test 'empty string'...
> Found 100000 result(s) in 0 ms
----------------
> Running search test 'super'...
> Found 100000 result(s) in 12 ms
----------------
> Running search test 'super album'...
> Found 100000 result(s) in 54 ms
----------------
> Running search test 'super artist'...
> Found 100000 result(s) in 30 ms
----------------
> Running search test 'long prefix'...
> Found 100000 result(s) in 37 ms
----------------
> Running search test 'long suffix'...
> Found 100000 result(s) in 44 ms
----------------
> Running search test 'no-result'...
> Found 0 result(s) in 51 ms
----------------
> Running search test 'long string'...
> Found 0 result(s) in 56 ms
----------------

In effect, this uses the memchr crate's implementation of substring search, and in particular takes advantage of memchr::memmem::Finder to build the searchers once for each word in your query. Since you're doing a lot of searches, this saves quite a bit of work that is invariant for every substring search. std doesn't have APIs to do this, so it's forced to rebuild its substring searcher every time. (core::str::pattern::StrSearcher::new shows up in the profile as taking up a non-trivial fraction of execution time.) It's typically negligible for a single search, but it adds up in cases like these.

IMO, your next vector for making this faster (if you even need/want to) would be to reduce the number of searches somehow. That might require changes to how you store your data though.


Thanks!

I tested it, and it only gives a slight (< 10%) performance improvement over my original script.

I think it's because string comparisons are already so fast that the multi-threading overhead is high enough that the gain from your version becomes negligible, even though it performs a lot better when run sequentially.

So thanks a lot! I'll go with the "bruteforce" way as it seems to perform really well :laughing:

And I'll take your version, as on machines with few cores the performance gain is huge (up to 3x) compared to mine (for anyone wondering, you can limit Rayon's thread count by setting the RAYON_NUM_THREADS environment variable to the desired number of threads).

Aye. And just one small clarification here: the fundamental operation in your program is substring search. A string comparison is more like a memcmp: you're just looking to see whether two strings are equivalent (or if one is less than the other). A substring search is looking to see if one string is contained inside another.

(I'm sure you know all of this, but the phrasing did confuse me briefly until I saw your code.)


Yes, that's the main problem. If I were doing full-word comparison, I would have just hashed each word of each field and put them into a HashSet. That way, for a given track, I'd just have to check whether the track's HashSet contains the value, which would have been a hell of a lot faster ^^
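The full-word variant described above could be sketched like this: tokenize each field once at index time into a per-track `HashSet`, so each query word becomes an O(1) set lookup instead of a substring scan (field names assumed for illustration):

```rust
use std::collections::HashSet;

// Index time: collect every lowercase word of every searchable field
// into one set per track.
fn index_track(title: &str, artists: &[String], albums: &[String]) -> HashSet<String> {
    let mut words = HashSet::new();
    for text in std::iter::once(title)
        .chain(artists.iter().map(String::as_str))
        .chain(albums.iter().map(String::as_str))
    {
        words.extend(text.split_whitespace().map(str::to_lowercase));
    }
    words
}

// Query time: every query word must be present as an exact word,
// each check being an O(1) hash lookup.
fn matches(track_words: &HashSet<String>, query: &str) -> bool {
    query
        .split_whitespace()
        .all(|word| track_words.contains(&word.to_lowercase()))
}
```

The trade-off, as noted above, is that this only supports exact word matches: prefixes and other substrings like "supe" no longer hit.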

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.