[Call for advice] What should be the next step for tantivy?

I lead the search engine project called tantivy.

My objective is to have some businesses adopt tantivy in production
in a year or two. What do you think is the right way to spend my time and energy to see that happen?

Currently, I am considering:

  • spending my time on better examples/documentation
  • writing blog posts and giving talks about how tantivy works
  • building a log search engine
  • building an all-in-one, standalone, easy-to-use search server on top of tantivy
  • making tantivy WebAssembly-ready
  • adding bindings for Node.js and/or Python
  • building a public search API for the web (well, Common Crawl) (I estimate the running cost to be in the thousands of USD a month… it’s quite low, but I need to find a way to monetize that.)

So… what do you guys think?


I’ve just found out about it from your question. Looks interesting! I make fairly extensive use of ES and Kibana, primarily for log analysis, so my impressions are coloured by that use case.

The fixed and strict schema thing is going to be interesting. In practice, this is more or less true for ES too, despite all the marketing claims of being schema-free and flexible. You don’t really get what you want out of it without setting at least some amount of schema - but you can pretty much throw data at it to start, investigate its content and structure, and iterate on schema development. For tantivy, it seems like you won’t get far without at least a reasonable start at a schema.

So my first impression of what might help would be a schema-discovery mechanism. It could be a separate or loosely-coupled tool; feed it a corpus of stuff and let it pick up some field names and types and maybe some statistics that can help make guesses about terms and storage options. Even if it just picks fully-tokenised and stored text for all the fields it finds that aren’t always numeric. Enough to get far enough that the task is more tuning than just trying to get data ingested.
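To sketch what I mean (a made-up heuristic with made-up type labels, nothing to do with tantivy’s actual schema types), a first discovery pass could be as dumb as: tag a field numeric only if every non-null value parses as a number, and fall back to tokenised, stored text otherwise.

```python
def discover_schema(records):
    """Guess a field -> type mapping from a corpus of dict records.

    A field is tagged numeric only if every non-null value parses as a
    number; everything else falls back to tokenised, stored text.
    """
    numeric = {}
    for record in records:
        for field, value in record.items():
            numeric.setdefault(field, True)
            if value is None:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                numeric[field] = False
    return {
        field: "numeric" if is_num else "text (tokenised, stored)"
        for field, is_num in numeric.items()
    }

corpus = [
    {"host": "web-1", "status": "200", "bytes": 512},
    {"host": "web-2", "status": "404", "bytes": None},
]
print(discover_schema(corpus))
```

A real tool would also want cardinality and length statistics to choose between stored-only, indexed, and fast-field options, but even this level of guessing gets a new user past the blank-schema hurdle.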

My second suggestion is support for more data types; IP addresses and mask matches are a common next example, but this should be driven by the target use cases you choose (like your option of building a log search engine). I’m not sure I’d recommend picking any particular target too soon.
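For concreteness, “mask matches” here means CIDR containment checks, which Python’s stdlib ipaddress module can illustrate; a native engine would want this as an indexed column type rather than a per-document check like this sketch.

```python
import ipaddress

def matches_mask(ip_str, cidr_str):
    """True if the address falls inside the CIDR block."""
    return ipaddress.ip_address(ip_str) in ipaddress.ip_network(cidr_str)

# A log-search engine would index addresses as fixed-width integers so a
# mask query becomes a cheap range scan instead of per-document parsing.
print(matches_mask("10.1.2.3", "10.1.0.0/16"))     # True
print(matches_mask("192.168.0.9", "10.1.0.0/16"))  # False
```

The same containment test works for IPv6 networks, which is part of why a dedicated type beats storing addresses as plain text.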

Certainly, better intro materials (tutorials, samples, blogs, etc) are always useful, but they’re for showing off the features you have and attracting users, who will hopefully bring their own needs to guide further development.


Hi. I am a long-time Solr and Elasticsearch user. Both provide 95% of what I need. What they lack is the ability to be embedded in non-Java code, and that would be the main use case of tantivy for me. I think Python (as well as Ruby/JS) bindings would make tantivy the go-to tool for search, similar to SQLite.

So the order of things I’d like to have:

  1. Python bindings
  2. Simple solution to experiment with (schema definition + data loading + getting query results in 5 lines of Python).
  3. Spatial search is a must.

You can definitely beat Solr on the ease-of-use and documentation fronts.
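To make point 2 concrete, here is the kind of ergonomics I have in mind, using a toy pure-Python inverted index as a stand-in. This is emphatically not tantivy’s API; the class and its methods are invented just to show the target feel of “schema + loading + querying in about 5 lines.”

```python
class ToyIndex:
    """A stand-in for what Python bindings could feel like (hypothetical API)."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.docs, self.postings = [], {}

    def add(self, doc):
        # Tokenise each declared text field and record doc ids per token.
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field in self.text_fields:
            for token in doc.get(field, "").lower().split():
                self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting sets of all query tokens.
        ids = set.intersection(
            *(self.postings.get(t, set()) for t in query.lower().split())
        )
        return [self.docs[i] for i in sorted(ids)]

# Schema definition + data loading + querying in ~5 lines:
index = ToyIndex(text_fields=["title", "body"])
index.add({"title": "Of Mice and Men", "body": "a novel by John Steinbeck"})
index.add({"title": "Frankenstein", "body": "a novel by Mary Shelley"})
print(index.search("novel steinbeck"))
```

If the real bindings can stay this close to “import, define, add, search,” the SQLite comparison starts to make sense.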

Some random notes:

  • Having a query language is important, but I see pros and cons in every solution (Solr DSL, SQL, ES JSON), so this should probably be left up to the user to decide which one they want to use.
  • There are numerous log search engines already; I don’t see what another one could add.
  • Some simple-to-use service on top would probably be useful. Perhaps with a GUI for defining schemas.

Do you have any clients? What are they asking about?

I could come up with loads of ideas but as I’m not giving you money I don’t think my opinion should matter much. That said, I agree with @Fiedzia on all points.

  1. Python bindings would be great. Then you could scrape in Python, use the Python nlp libraries (nltk, spacy), and then submit the document using an API from Python. This lets tantivy walk in the door to places where people want to index documents without going full JVM on ES.
  2. Making sure that the entry level solution is simple is a wonderful way of gaining adoption.
  3. Spatial search is also really cool. I’m not sure if @Fiedzia has specifically geospatial in mind or if he means vector space search (which is searching in high dimensional space).

As for log search, an interesting side project of maybe an afternoon for a particularly clever person, or 2 years for me: we know logs have the form %Y%m%d [classname] <logmsg>, and logmsg is a format string with interpolated values such as %s and %d. Wouldn’t it be possible to reverse-engineer a binary format on the fly that figures out what the logmsg format strings are and then indexes them, compresses them, etc.?
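A crude, purely illustrative first pass at that idea (nowhere near the on-the-fly binary format, but it shows the template-mining core) is to mask the likely-interpolated tokens and group messages by what is left:

```python
import re
from collections import Counter

def template_of(logmsg):
    """Mask likely-interpolated values: hex literals first, then numbers."""
    masked = re.sub(r"\b0x[0-9a-fA-F]+\b", "<hex>", logmsg)
    masked = re.sub(r"\b\d+(\.\d+)?\b", "<num>", masked)
    return masked

lines = [
    "served 200 bytes in 3 ms",
    "served 4096 bytes in 17 ms",
    "connection reset by peer",
]
templates = Counter(template_of(line) for line in lines)
print(templates.most_common())
```

A serious version would cluster tokens positionally instead of relying on regexes, store one dictionary entry per recovered format string, and keep only the interpolated values per line, which is where the indexing and compression wins would come from.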

Your example is beautiful, and helpful in getting started with a basic search. However, I'm missing information/examples for more advanced uses.

I'm planning to use tantivy for the crate search on lib.rs. I have more data to include in the search, and I'm not sure whether tantivy handles it or whether I have to use pre/post-processing to include it (I'll post a specific technical question on GitHub to avoid changing the topic of the thread).


One thing that’s a bit unclear to me is whether tantivy is “lucene in Rust” or whether you intend to support other unique features/algorithms. Mind you I’m not a lucene/ES/solr user myself. What was/is the motivation for building tantivy?

I do know folks using ES (or the entire ELK combo) and I think there’s room for a more efficient, faster, lower-resource-utilization replacement/competitor. In some ways, the business model would/could be similar to what ScyllaDB has done for Cassandra users. But if one were to take this road, the performance advantage would have to be significant (e.g. ScyllaDB touts a 10x improvement over Cassandra).

Doing the ScyllaDB thing will take a lot of work. There are benchmarks vs Lucene at https://tantivy-search.github.io/bench/ , but getting to 10x is a different matter (look at the ScyllaDB blog for the kinds of optimizations they do). There’s a big patch for Cassandra to use RocksDB which fixes many things, but doing what ScyllaDB did takes a lot of work.

Definitely; I was using ScyllaDB as an example of a company that's making a living off rewriting an existing platform, keeping feature parity (as much as possible) but offering a more performant solution (rather than any extra features … which I think they plan on adding at some point as well).

But if performance isn't the differentiator here, what is? That would be a crucial question to answer for @fulmicoton so that suggestions and his (and others) tantivy effort is well-guided. IMO.

I think performance really can be. Blog posts from a Lucene committer have shown that doing things in C can be 1.5x+ faster even with the JNI overhead (not counting SIMD, etc.).

So if performance/resource util is the answer, then I think going after the ES space is what I’d suggest off-the-cuff. I don’t know what improvement would sway the users into Tantivy’s direction, but I suspect if it was “even” 1.5x faster with drastically reduced footprint and easier deployment model, it might be enough. But Tantivy (or a new system using it) would need to gain the distributed system aspects that ES has.

In fact, I think there might be a business model for a company to RIIR (or in C++) the various Java/Scala based “big data” solutions and offer perf as the selling point (possibly extending into unique features later). I think right now the time is somewhat ripe for this because the Java ecosystem is perturbed by Java 9+ breakage (i.e. jigsaw) and the somewhat radical shift in the future release model of Java - it’s not enterprise friendly in a lot of ways.

This is a lot of work but @fulmicoton did mention a horizon of a year or two.

I agree but there are a lot of tooling issues that also need to be addressed. e.g. binary crates, crates.io mirroring, application resource deployment (~war files), etc.

Anyway, the shortcut (ha!) method to getting Rust into big data as an application is to implement cassandra, hbase, hive, and hdfs in Rust. These are the applications that there’s no hype for because they just work and are taken for granted. Spark needs constant hype because it’s not the default choice for data analysis. Cassandra’s performance gains are met by scylladb so that’s probably not interesting, but the others surely are. ES is probably seen as becoming infra rather than product so maybe it is a good target to chase.

But there kinda is:

HDFS -> QFS, Cassandra -> ScyllaDB, Elasticsearch -> vespa.ai, Hive -> Apache Kudu? (not really sure how Hive works), HBase -> TiKV/YugabyteDB; there’s no native Spark alternative, though.

https://datafusion.rs/ is a WIP for Spark-like stuff.

Also, vespa.ai seems to be a Java product?

vespa.ai is C++ underneath (content nodes) with Java query nodes (kind of like VoltDB, or TiKV (Rust)/TiDB (Go)).

Ah, I see (I’d not heard of vespa.ai before). Don’t want to hijack this thread, but seems like an odd choice to stick java query nodes into the mix while having storage nodes in c++.

Great! I wanted to suggest exactly this collaboration, but didn’t want to speak for you.

I’m still curious what you had in mind here. I’m not really seeing the fit, other than perhaps for indexing some heavy use of offline storage in some hypothetical app. It could certainly be a distinguishing feature, but in terms of reward for effort I suspect it falls squarely in the category of waiting to see if there is user demand. The effort required is likely to diminish sharply as the tooling rapidly improves, too.

WebAssembly is interesting! If the index can be compressed well, it can be shipped with the page. That gives instant results and works offline. Rustdoc documents already have such search built-in.
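To get a rough feel for why that can work, here is a quick stdlib experiment with a fake posting-list payload, using gzip as a stand-in for whatever codec (or the browser’s transparent Content-Encoding) would actually be used; the data and numbers are invented for illustration only.

```python
import gzip
import json

# Fake posting lists: term -> sorted doc ids. Sorted integer runs are
# exactly the kind of repetitive structure that compresses well.
postings = {f"term{t}": list(range(0, 5000, t + 1)) for t in range(20)}
raw = json.dumps(postings).encode("utf-8")
packed = gzip.compress(raw, compresslevel=9)
print(len(raw), len(packed), f"ratio {len(packed) / len(raw):.2f}")
```

Real index formats do much better still by delta-encoding the doc ids before entropy coding, which is why a shippable in-page index is plausible.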


Thanks! I see you posted on github. I’ll reply there.

The index compresses quite well, much better than the JS-based solutions.
The WASM itself, on the other hand, is much bigger. I think I can bring it down to somewhere around 3MB.

Good search in the browser could be sweet for Chrome extensions. Quite a few people have tried to make search extensions that index everything you browse using JS, and they are having trouble with performance.