[Call for advice] What should be the next step for tantivy?

I lead the search engine project called tantivy.

My objective is to have some businesses adopt tantivy in production
in a year or two. What do you think is the right way to spend my time and energy to see that happen?

Currently, I am considering:

  • spending my time on better examples/documentation
  • writing blog posts and giving talks about how tantivy works
  • building a log search engine
  • building an all-in-one, standalone, easy-to-use search server on top of tantivy
  • making tantivy WebAssembly-ready
  • adding bindings for Node.js and/or Python
  • building a public search API for the web (well, Common Crawl) (I estimate the running cost to be in the thousands of USD a month… it’s quite low, but I need to find a way to monetize that.)

So… what do you guys think?


I’ve just found out about it from your question. Looks interesting! I make fairly extensive use of ES and Kibana, primarily for log analysis, so my impressions are coloured by that use case.

The fixed and strict schema thing is going to be interesting. In practice, this is more or less true for ES too, despite all the marketing claims of being schema-free and flexible. You don’t really get what you want out of it without setting at least some amount of schema - but you can pretty much throw data at it to start, investigate its content and structure, and iterate on schema development. For tantivy, it seems like you won’t get far without at least a reasonable start at a schema.

So my first impression of what might help would be a schema-discovery mechanism. It could be a separate or loosely-coupled tool; feed it a corpus of stuff and let it pick up some field names and types and maybe some statistics that can help make guesses about terms and storage options. Even if it just picks fully-tokenised and stored text for all the fields it finds that aren’t always numeric. Enough to get far enough that the task is more tuning than just trying to get data ingested.
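To sketch what I mean (a made-up heuristic with made-up type labels, nothing to do with tantivy’s actual schema types), a first discovery pass could be as dumb as: tag a field numeric only if every non-null value parses as a number, and fall back to tokenised, stored text otherwise.

```python
def discover_schema(records):
    """Guess a field -> type mapping from a corpus of dict records.

    A field is tagged numeric only if every non-null value parses as a
    number; everything else falls back to tokenised, stored text.
    """
    numeric = {}
    for record in records:
        for field, value in record.items():
            numeric.setdefault(field, True)
            if value is None:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                numeric[field] = False
    return {
        field: "numeric" if is_num else "text (tokenised, stored)"
        for field, is_num in numeric.items()
    }

corpus = [
    {"host": "web-1", "status": "200", "bytes": 512},
    {"host": "web-2", "status": "404", "bytes": None},
]
print(discover_schema(corpus))
```

A real tool would also want cardinality and length statistics to choose between stored-only, indexed, and fast-field options, but even this level of guessing gets a new user past the blank-schema hurdle.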

My second suggestion is support for more data types; IP addresses and mask matches are a common next example, but this should be driven by the target use cases you choose (like your option of building a log search engine). I’m not sure I’d recommend picking any particular target too soon.
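For concreteness, “mask matches” here means CIDR containment checks, which Python’s stdlib ipaddress module can illustrate; a native engine would want this as an indexed column type rather than a per-document check like this sketch.

```python
import ipaddress

def matches_mask(ip_str, cidr_str):
    """True if the address falls inside the CIDR block."""
    return ipaddress.ip_address(ip_str) in ipaddress.ip_network(cidr_str)

# A log-search engine would index addresses as fixed-width integers so a
# mask query becomes a cheap range scan instead of per-document parsing.
print(matches_mask("10.1.2.3", "10.1.0.0/16"))     # True
print(matches_mask("192.168.0.9", "10.1.0.0/16"))  # False
```

The same containment test works for IPv6 networks, which is part of why a dedicated type beats storing addresses as plain text.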

Certainly, better intro materials (tutorials, samples, blogs, etc) are always useful, but they’re for showing off the features you have and attracting users, who will hopefully bring their own needs to guide further development.


Hi. I am a long-time Solr and Elasticsearch user. Both provide 95% of what I need. What they lack is the ability to be embedded in non-Java code, and that would be the main use case of tantivy for me. I think Python (as well as Ruby/JS) bindings would make tantivy the go-to tool for search, similar to SQLite.

So the order of things I’d like to have:

  1. Python bindings
  2. Simple solution to experiment with (schema definition + data loading + getting query results in 5 lines of Python).
  3. Spatial search is a must.

You can definitely beat Solr on the ease-of-use and documentation fronts.
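To make point 2 concrete, here is the kind of ergonomics I have in mind, using a toy pure-Python inverted index as a stand-in. This is emphatically not tantivy’s API; the class and its methods are invented just to show the target feel of “schema + loading + querying in about 5 lines.”

```python
class ToyIndex:
    """A stand-in for what Python bindings could feel like (hypothetical API)."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.docs, self.postings = [], {}

    def add(self, doc):
        # Tokenise each declared text field and record doc ids per token.
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field in self.text_fields:
            for token in doc.get(field, "").lower().split():
                self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting sets of all query tokens.
        ids = set.intersection(
            *(self.postings.get(t, set()) for t in query.lower().split())
        )
        return [self.docs[i] for i in sorted(ids)]

# Schema definition + data loading + querying in ~5 lines:
index = ToyIndex(text_fields=["title", "body"])
index.add({"title": "Of Mice and Men", "body": "a novel by John Steinbeck"})
index.add({"title": "Frankenstein", "body": "a novel by Mary Shelley"})
print(index.search("novel steinbeck"))
```

If the real bindings can stay this close to “import, define, add, search,” the SQLite comparison starts to make sense.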

Some random notes:

  • Having a query language is important, but I see pros and cons in every solution (Solr DSL, SQL, ES JSON), so this should probably be left up to the user to decide which one they want to use.
  • There are numerous log search engines already; I don’t see what another one could add.
  • Some simple-to-use service on top would probably be useful. Perhaps with a GUI for defining schemas.

Do you have any clients? What are they asking about?

I could come up with loads of ideas but as I’m not giving you money I don’t think my opinion should matter much. That said, I agree with @Fiedzia on all points.

  1. Python bindings would be great. Then you could scrape in Python, use the Python nlp libraries (nltk, spacy), and then submit the document using an API from Python. This lets tantivy walk in the door to places where people want to index documents without going full JVM on ES.
  2. Making sure that the entry level solution is simple is a wonderful way of gaining adoption.
  3. Spatial search is also really cool. I’m not sure if @Fiedzia has specifically geospatial in mind or if he means vector space search (which is searching in high dimensional space).

As for log search, an interesting side project of maybe an afternoon for a particularly clever person, or 2 years for me: we know logs have the form %Y%m%d [classname] <logmsg>, and logmsg is a format string with interpolated values such as %s and %d. Wouldn’t it be possible to reverse-engineer a binary format on the fly that figures out what the logmsg format strings are and then indexes them, compresses them, etc.?
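A crude, purely illustrative first pass at that idea (nowhere near the on-the-fly binary format, but it shows the template-mining core) is to mask the likely-interpolated tokens and group messages by what is left:

```python
import re
from collections import Counter

def template_of(logmsg):
    """Mask likely-interpolated values: hex literals first, then numbers."""
    masked = re.sub(r"\b0x[0-9a-fA-F]+\b", "<hex>", logmsg)
    masked = re.sub(r"\b\d+(\.\d+)?\b", "<num>", masked)
    return masked

lines = [
    "served 200 bytes in 3 ms",
    "served 4096 bytes in 17 ms",
    "connection reset by peer",
]
templates = Counter(template_of(line) for line in lines)
print(templates.most_common())
```

A serious version would cluster tokens positionally instead of relying on regexes, store one dictionary entry per recovered format string, and keep only the interpolated values per line, which is where the indexing and compression wins would come from.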

Your example is beautiful, and helpful in getting started with a basic search. However, I'm missing information/examples for more advanced uses.

I'm planning to use tantivy for the crate search on lib.rs. I have more data to include in the search, and I'm not sure whether tantivy handles it or whether I have to use pre/post-processing to include it (I'll post a specific technical question on GitHub to avoid changing the topic of the thread).


One thing that’s a bit unclear to me is whether tantivy is “lucene in Rust” or whether you intend to support other unique features/algorithms. Mind you I’m not a lucene/ES/solr user myself. What was/is the motivation for building tantivy?

I do know folks using ES (or the entire ELK combo) and I think there’s room for a more efficient, faster, lower-resource-utilization replacement/competitor. In some ways, the business model would/could be similar to what ScyllaDB has done for Cassandra users. But if one were to take this road, the performance advantage would have to be significant (e.g. ScyllaDB touts a 10x improvement over Cassandra).

Doing the ScyllaDB thing will take a lot of work. There are benchmarks vs Lucene at https://tantivy-search.github.io/bench/ , but getting to 10x is a different matter (look at the ScyllaDB blog for the kinds of optimizations they do). There’s a big patch for Cassandra to use RocksDB which fixes many things, but doing what ScyllaDB did takes a lot of work.

Definitely; I was using ScyllaDB as an example of a company that's making a living off rewriting an existing platform, keeping feature parity (as much as possible) but offering a more performant solution (rather than any extra features … which I think they plan on adding at some point as well).

But if performance isn't the differentiator here, what is? That would be a crucial question to answer for @fulmicoton so that suggestions and his (and others) tantivy effort is well-guided. IMO.

I think performance really can be. Blog posts from a Lucene committer have shown that doing things in C can be 1.5x+ faster even with the JNI overhead (not counting SIMD, etc.).

So if performance/resource util is the answer, then I think going after the ES space is what I’d suggest off-the-cuff. I don’t know what improvement would sway the users into Tantivy’s direction, but I suspect if it was “even” 1.5x faster with drastically reduced footprint and easier deployment model, it might be enough. But Tantivy (or a new system using it) would need to gain the distributed system aspects that ES has.

In fact, I think there might be a business model for a company to RIIR (or in C++) the various Java/Scala based “big data” solutions and offer perf as the selling point (possibly extending into unique features later). I think right now the time is somewhat ripe for this because the Java ecosystem is perturbed by Java 9+ breakage (i.e. jigsaw) and the somewhat radical shift in the future release model of Java - it’s not enterprise friendly in a lot of ways.

This is a lot of work but @fulmicoton did mention a horizon of a year or two.

I agree but there are a lot of tooling issues that also need to be addressed. e.g. binary crates, crates.io mirroring, application resource deployment (~war files), etc.

Anyway, the shortcut (ha!) method to getting Rust into big data as an application is to implement cassandra, hbase, hive, and hdfs in Rust. These are the applications that there’s no hype for because they just work and are taken for granted. Spark needs constant hype because it’s not the default choice for data analysis. Cassandra’s performance gains are met by scylladb so that’s probably not interesting, but the others surely are. ES is probably seen as becoming infra rather than product so maybe it is a good target to chase.

But there kinda is:

HDFS -> QFS, Cassandra -> ScyllaDB, Elasticsearch -> vespa.ai, Hive -> Apache Kudu? (not really sure how Hive works), HBase -> TiKV/YugabyteDB; there’s no native Spark alternative, though.

https://datafusion.rs/ is a WIP for Spark-like stuff.

Also, vespa.ai seems to be a Java product?

vespa.ai is C++ underneath (content nodes) with Java query nodes (kind of like VoltDB, or TiKV (Rust)/TiDB (Go)).

Ah, I see (I’d not heard of vespa.ai before). Don’t want to hijack this thread, but seems like an odd choice to stick java query nodes into the mix while having storage nodes in c++.

Great! I wanted to suggest exactly this collaboration, but didn’t want to speak for you.

I’m still curious what you had in mind here. I’m not really seeing the fit, other than perhaps for indexing some heavy use of offline storage in some hypothetical app. It could certainly be a distinguishing feature, but in terms of reward for effort I suspect it falls squarely in the category of waiting to see if there is user demand. The effort required is likely to diminish sharply as the tooling rapidly improves, too.

WebAssembly is interesting! If the index can be compressed well, it can be shipped with the page. That gives instant results and works offline. Rustdoc documents already have such search built-in.
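To get a rough feel for why that can work, here is a quick stdlib experiment with a fake posting-list payload, using gzip as a stand-in for whatever codec (or the browser’s transparent Content-Encoding) would actually be used; the data and numbers are invented for illustration only.

```python
import gzip
import json

# Fake posting lists: term -> sorted doc ids. Sorted integer runs are
# exactly the kind of repetitive structure that compresses well.
postings = {f"term{t}": list(range(0, 5000, t + 1)) for t in range(20)}
raw = json.dumps(postings).encode("utf-8")
packed = gzip.compress(raw, compresslevel=9)
print(len(raw), len(packed), f"ratio {len(packed) / len(raw):.2f}")
```

Real index formats do much better still by delta-encoding the doc ids before entropy coding, which is why a shippable in-page index is plausible.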


Thanks! I see you posted on github. I’ll reply there.

The index compresses quite well, much better than the JS-based solutions.
The WASM itself, on the other hand, is much bigger. I think I can bring it down to somewhere around 3MB.

Good search in the browser could be sweet for Chrome extensions. Quite a few people have tried to make search extensions that index everything you browse using JS, and they are having trouble with performance.