VTPL: Fused text + vector search in a single index pass

I just published VTPL - a library that combines text search and vector search into one index structure, eliminating the need for separate indexes and a merge step.

The core idea

Traditional hybrid search queries two indexes (inverted index + HNSW graph) and merges results. VTPL embeds PQ-compressed vectors directly inside posting list entries:
  Traditional: trigram "con" → [doc_1, doc_7, doc_42]
  VTPL:        trigram "con" → [(doc_1, pq₁), (doc_7, pq₇), (doc_42, pq₄₂)]
                                        └── 32-byte PQ-compressed embedding

One pass over the posting lists gives you both pattern match scores and semantic similarity scores simultaneously.
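
To make the single-pass idea concrete, here is a minimal sketch. It is not VTPL's actual code: the `Entry` field layout (a 4-byte doc id followed by the 32-byte PQ code, which happens to add up to 36 bytes), the byte-trigram tokenizer, and the `fused_scan` and `Scores` names are all assumptions made for illustration. The semantic scorer is passed in as a closure so the sketch stays independent of how PQ distances are computed.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Hypothetical entry layout: 4-byte doc id + 32-byte PQ code.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct Entry {
    pub doc_id: u32,
    pub pq: [u8; 32],
}

// Compile-time check that the assumed layout has no padding.
const _: () = assert!(size_of::<Entry>() == 36);

/// Trigram -> posting list, with the PQ code stored inline in each entry.
pub type Postings = HashMap<[u8; 3], Vec<Entry>>;

/// Both scores for one document, produced by a single pass.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Scores {
    pub trigram_hits: u32,
    pub semantic: f32,
}

/// Byte trigrams of the query (illustrative tokenizer, no normalization).
pub fn trigrams(text: &str) -> Vec<[u8; 3]> {
    text.as_bytes().windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Walk the posting lists of every query trigram once. Each entry already
/// carries its PQ code, so the semantic score needs no second index lookup.
pub fn fused_scan(
    postings: &Postings,
    query: &str,
    semantic: impl Fn(&[u8; 32]) -> f32,
) -> HashMap<u32, Scores> {
    let mut out: HashMap<u32, Scores> = HashMap::new();
    for tri in trigrams(query) {
        if let Some(list) = postings.get(&tri) {
            for e in list {
                let s = out.entry(e.doc_id).or_insert_with(|| Scores {
                    trigram_hits: 0,
                    semantic: semantic(&e.pq), // computed once per document
                });
                s.trigram_hits += 1;
            }
        }
    }
    out
}
```

In this sketch only documents that share at least one trigram with the query are ever scored semantically, so the scan touches far fewer vectors than a full vector-only pass would.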

Benchmarks (real data: AG News + MiniLM-L6-v2, 384-dim)

Query speed at 10k documents:

  • Fused (text + vector): 1.9 ms
  • Text-only: 1.6 ms
  • Vector-only scan: 7.8 ms

Parallel indexing (rayon + AtomicU32):

  • 10k docs: 352 ms parallel vs 706 ms sequential (2x)
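
For readers who want to see the shape of the parallel build, here is a rough sketch using only the standard library. It deliberately swaps in `std::thread::scope` and a `Mutex<HashMap>` for rayon and DashMap so that it compiles without external crates. Using the `AtomicU32` as a document-id counter is an assumption, and `build_parallel` is a hypothetical name, not part of VTPL.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;
use std::thread;

/// Build a trigram -> doc-id index with `workers` threads.
/// Document ids come from a shared atomic counter, so no lock is needed
/// to allocate them; only the map insert takes the mutex.
pub fn build_parallel(docs: &[&str], workers: usize) -> HashMap<[u8; 3], Vec<u32>> {
    let next_id = AtomicU32::new(0);
    let index: Mutex<HashMap<[u8; 3], Vec<u32>>> = Mutex::new(HashMap::new());
    let w = workers.max(1);
    let chunk = ((docs.len() + w - 1) / w).max(1);

    thread::scope(|s| {
        for part in docs.chunks(chunk) {
            let (next_id, index) = (&next_id, &index);
            s.spawn(move || {
                for doc in part {
                    // Ids are unique but not assigned in document order.
                    let id = next_id.fetch_add(1, Ordering::Relaxed);
                    // Tokenize and deduplicate outside the lock.
                    let mut tris: Vec<[u8; 3]> = doc
                        .as_bytes()
                        .windows(3)
                        .map(|w| [w[0], w[1], w[2]])
                        .collect();
                    tris.sort_unstable();
                    tris.dedup();
                    let mut guard = index.lock().unwrap();
                    for t in tris {
                        guard.entry(t).or_default().push(id);
                    }
                }
            });
        }
    });
    index.into_inner().unwrap()
}
```

A sharded concurrent map would replace the single mutex with per-shard locking, which is where most of the remaining contention in this sketch would go away.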

Smart 3-level cache (steady state):

  • Trigram: 95% hit rate
  • Word: 95% hit rate
  • Semantic: 91% hit rate

The cache works on overlapping queries, not just identical ones - "concurrent hash" and "concurrent programming" share cached trigram posting list scans.
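
The reason overlapping queries hit the cache is that entries are keyed by trigram rather than by whole query. The sketch below shows only that one level, under stated assumptions: `TrigramCache`, `get_or_scan` and `run_query` are hypothetical names, the scan itself is a stand-in that returns an empty list, and the locking and confidence-based eviction of the real cache are left out.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// One cache level, keyed by trigram instead of by full query string.
#[derive(Default)]
pub struct TrigramCache {
    scans: HashMap<[u8; 3], Arc<Vec<u32>>>,
    pub hits: u64,
    pub misses: u64,
}

impl TrigramCache {
    /// Return the cached scan for `tri`, or run `scan` and remember the result.
    pub fn get_or_scan(
        &mut self,
        tri: [u8; 3],
        scan: impl FnOnce() -> Vec<u32>,
    ) -> Arc<Vec<u32>> {
        if let Some(hit) = self.scans.get(&tri) {
            self.hits += 1;
            return Arc::clone(hit);
        }
        self.misses += 1;
        let fresh = Arc::new(scan());
        self.scans.insert(tri, Arc::clone(&fresh));
        fresh
    }

    /// Look up every trigram of the query; `Vec::new` stands in for the
    /// real posting-list scan.
    pub fn run_query(&mut self, query: &str) {
        for w in query.as_bytes().windows(3) {
            self.get_or_scan([w[0], w[1], w[2]], Vec::new);
        }
    }
}
```

Running "concurrent hash" and then "concurrent programming" through this sketch, the second query reuses the nine trigrams spanning "concurrent " (from "con" through "nt ") and only scans the eleven that are new.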

Architecture highlights

  • VtplEntry is #[repr(C)] and 36 bytes, for cache-friendly scans
  • Asymmetric distance tables: 32 table lookups per candidate and no float multiplies in the scan loop (the tables are built once per query)
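  • ParallelBuilder uses rayon + DashMap + AtomicU32 for parallel indexing without a global lock (DashMap shards its locks internally, so this is low-contention rather than strictly lock-free)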
  • CachedIndex uses Arc-backed entries with parking_lot::RwLock and confidence-based eviction
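
The asymmetric distance step can be sketched as follows. This is textbook product-quantization ADC rather than VTPL's confirmed implementation: 256 centroids per subquantizer (one code byte each), squared L2 as the metric, and the `build_table` and `adc_distance` names are assumptions. With 384 dimensions and 32 code bytes, each subspace covers 12 dimensions.

```rust
const M: usize = 32; // subquantizers, one code byte each
const K: usize = 256; // centroids per subquantizer (assumed)
const DIM: usize = 384; // MiniLM-L6-v2 embedding width
const SUB: usize = DIM / M; // 12 dimensions per subspace

/// codebooks[m][k] is centroid k of subspace m.
pub type Codebooks = Vec<Vec<[f32; SUB]>>;

/// Built once per query: squared L2 distance from each query sub-vector
/// to every centroid of the matching subspace. All multiplies happen here.
pub fn build_table(query: &[f32], codebooks: &Codebooks) -> Vec<[f32; K]> {
    assert_eq!(query.len(), DIM);
    assert_eq!(codebooks.len(), M);
    let mut table = vec![[0.0f32; K]; M];
    for m in 0..M {
        assert_eq!(codebooks[m].len(), K);
        let q = &query[m * SUB..(m + 1) * SUB];
        for (k, c) in codebooks[m].iter().enumerate() {
            table[m][k] = q
                .iter()
                .zip(c)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>();
        }
    }
    table
}

/// Per candidate: 32 lookups and 31 adds, no multiplies.
pub fn adc_distance(table: &[[f32; K]], code: &[u8; M]) -> f32 {
    code.iter()
        .enumerate()
        .map(|(m, &c)| table[m][c as usize])
        .sum()
}
```

The table holds 32 × 256 floats (32 KiB), so it can stay resident in cache while the posting lists stream past it.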

Links

Would love feedback on the approach, especially from anyone working on search or IR systems. The fused scoring currently uses a simple linear combination - curious if anyone has ideas for better fusion strategies.