I just published VTPL - a library that combines text search and vector search into one index structure, eliminating the need for separate indexes and a merge step.
The core idea
Traditional hybrid search queries two indexes (inverted index + HNSW graph) and merges results. VTPL embeds PQ-compressed vectors directly inside posting list entries:
```
Traditional: trigram "con" → [doc_1, doc_7, doc_42]

VTPL:        trigram "con" → [(doc_1, pq₁), (doc_7, pq₇), (doc_42, pq₄₂)]
                                         └── 32-byte PQ-compressed embedding
```
One pass over the posting lists gives you both pattern match scores and semantic similarity scores simultaneously.
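To make the single-pass idea concrete, here is a minimal sketch of what a fused scan over such a posting list could look like. The struct name, fields, and scoring are hypothetical illustrations, not VTPL's actual API; it assumes 32 one-byte subquantizer codes and a precomputed asymmetric-distance table with 256 centroids per subquantizer.

```rust
// Hypothetical fused posting-list entry: a document id paired with a
// 32-byte product-quantization (PQ) code, so one scan yields both the
// text match and an approximate semantic distance.
#[derive(Clone, Copy)]
struct Entry {
    doc_id: u32,
    pq_code: [u8; 32], // 32 subquantizer indices, one byte each
}

/// One pass over a posting list: every entry is a text match by
/// construction, and its PQ code gives an approximate distance via
/// table lookups (no float multiplies at query time).
fn fused_scan(postings: &[Entry], adc_table: &[[f32; 256]; 32], alpha: f32) -> Vec<(u32, f32)> {
    postings
        .iter()
        .map(|e| {
            // Semantic distance: 32 table lookups, one per subquantizer.
            let dist: f32 = e
                .pq_code
                .iter()
                .zip(adc_table.iter())
                .map(|(&code, table)| table[code as usize])
                .sum();
            // Linear fusion: the text hit contributes a constant 1.0,
            // the semantic side contributes negated distance.
            let score = alpha * 1.0 + (1.0 - alpha) * (-dist);
            (e.doc_id, score)
        })
        .collect()
}

fn main() {
    let adc = [[0.5f32; 256]; 32]; // toy table: every lookup costs 0.5
    let postings = [Entry { doc_id: 1, pq_code: [0; 32] }];
    // 32 lookups * 0.5 = 16.0 distance → score = 0.5 - 8.0 = -7.5
    println!("{:?}", fused_scan(&postings, &adc, 0.5));
}
```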
Benchmarks (real data: AG News + MiniLM-L6-v2, 384-dim)
Query speed at 10k documents:
- Fused (text + vector): 1.9 ms
- Text-only: 1.6 ms
- Vector-only scan: 7.8 ms
Parallel indexing (rayon + AtomicU32):
- 10k docs: 352 ms parallel vs 706 ms sequential (2x)
Smart 3-level cache (steady state):
- Trigram: 95% hit rate
- Word: 95% hit rate
- Semantic: 91% hit rate
The cache matches overlapping queries, not just identical ones: "concurrent hash" and "concurrent programming" share cached trigram posting-list scans.
Architecture highlights
- `VtplEntry` is `#[repr(C)]`, 36 bytes, for cache-friendly scans
- Asymmetric distance tables: 32 table lookups per candidate, zero float multiplies at query time
- `ParallelBuilder` uses `rayon` + `DashMap` + `AtomicU32` for lock-free parallel indexing
- `CachedIndex` uses `Arc`-backed entries with `parking_lot::RwLock` and confidence-based eviction
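The 36-byte figure is consistent with a 4-byte document id followed by a 32-byte PQ code. A sketch of such a layout (field names are my guess, not VTPL's actual definition), with a size check showing `#[repr(C)]` introduces no padding here:

```rust
use std::mem;

// Hypothetical layout consistent with a 36-byte entry: a 4-byte doc id
// plus a 32-byte PQ code. #[repr(C)] fixes field order so the size and
// layout stay predictable for cache-friendly sequential scans.
#[repr(C)]
struct VtplEntry {
    doc_id: u32,       // 4 bytes, align 4
    pq_code: [u8; 32], // 32 bytes, align 1 → no padding required
}

fn main() {
    assert_eq!(mem::size_of::<VtplEntry>(), 36);
    assert_eq!(mem::align_of::<VtplEntry>(), 4);
    println!("VtplEntry is {} bytes", mem::size_of::<VtplEntry>());
}
```

Because the array field has alignment 1 and the total size (36) is already a multiple of the struct's alignment (4), the compiler adds no padding, so a slice of entries scans as one contiguous 36-byte-stride run.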
Links
- GitHub: github.com/Razshy/vtpl
- crates.io: crates.io/crates/vtpl
cargo add vtpl
Would love feedback on the approach, especially from anyone working on search or IR systems. The fused scoring currently uses a simple linear combination of the text and semantic scores; I'm curious whether anyone has ideas for better fusion strategies.
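For comparison, one widely used alternative to a linear combination is reciprocal rank fusion (RRF), which combines ranked lists without needing the two score scales to be comparable. A minimal sketch (not part of VTPL):

```rust
use std::collections::HashMap;

// Reciprocal rank fusion: each document scores sum(1 / (k + rank))
// over the lists it appears in. The constant k (commonly 60) damps
// the dominance of top ranks, so documents that appear in *both*
// lists tend to beat documents that top only one.
fn rrf(text_ranked: &[u32], vec_ranked: &[u32], k: f32) -> Vec<(u32, f32)> {
    let mut scores: HashMap<u32, f32> = HashMap::new();
    for list in [text_ranked, vec_ranked] {
        for (rank, &doc) in list.iter().enumerate() {
            // ranks are 1-based in the usual RRF formulation
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + rank as f32 + 1.0);
        }
    }
    let mut out: Vec<_> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    // doc 7 is mid-ranked in both lists; docs 1 and 9 each top one list.
    let fused = rrf(&[1, 7, 3], &[9, 7, 4], 60.0);
    println!("{:?}", fused); // doc 7 ranks first
}
```

The appeal for a fused index like this is that RRF only needs ranks, so the pattern-match and PQ-distance scores never have to be normalized onto a common scale.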