Most AI knowledge bases follow the same pattern: split content into chunks, embed each chunk, store the vectors in a database with vector search (often pgvector on PostgreSQL), and at query time embed the query, retrieve the nearest chunks, and feed them to the LLM.
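The loop above can be sketched in a few lines of NumPy. The `embed` and `llm` callables are illustrative placeholders, not any particular library's API:

```python
import numpy as np

def build_index(chunks, embed):
    # Embed every chunk once at ingestion time; rows are unit-normalized
    # so a dot product equals cosine similarity.
    vecs = np.stack([embed(c) for c in chunks]).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs

def answer(query, chunks, vecs, embed, llm, k=3):
    # Embed the query the same way, retrieve the k nearest chunks,
    # and hand them to the LLM as context.
    q = embed(query).astype(np.float32)
    q /= np.linalg.norm(q)
    top = np.argsort(vecs @ q)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in top)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

A production system swaps the in-memory array for a database table, but the shape of the loop is the same.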
Each 1024-dimensional vector is 4 KB of float32 values. With 50,000 chunks, that is about 200 MB of embeddings before index overhead.
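The arithmetic is easy to verify:

```python
dim = 1024
bytes_per_vector = dim * 4          # float32 = 4 bytes per coordinate
n_chunks = 50_000
total_mb = n_chunks * bytes_per_vector / 1_000_000

print(bytes_per_vector)  # 4096 bytes, i.e. 4 KB per vector
print(total_mb)          # 204.8 MB of raw embeddings, before index overhead
```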
Quantization changes the tradeoff
Instead of storing 1024 raw floats, TurboQuant first randomly rotates the vector, then quantizes each coordinate independently with a precomputed scalar grid. It stores packed quantized values plus a small amount of metadata. In TurboQuant Lite, a 768-dimensional float32 vector shrinks from 3,072 bytes to 388 bytes at 4-bit quantization, with reported distortion under 1.1% MSE.
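This is not TurboQuant's actual codec, but its two ingredients as described here, a random rotation followed by independent scalar quantization of each coordinate, can be sketched like this (grid construction and metadata layout are simplified assumptions):

```python
import numpy as np

def make_rotation(dim, seed=0):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q.astype(np.float32)

def quantize(v, rot, bits=4):
    # Rotate, then snap each coordinate onto a uniform scalar grid.
    r = rot @ v
    lo, hi = float(r.min()), float(r.max())
    levels = 2**bits - 1
    codes = np.round((r - lo) / (hi - lo) * levels).astype(np.uint8)
    return codes, (lo, hi)          # packed codes plus small per-vector metadata

def dequantize(codes, meta, rot, bits=4):
    lo, hi = meta
    levels = 2**bits - 1
    r = codes.astype(np.float32) / levels * (hi - lo) + lo
    return rot.T @ r                # orthogonal matrix: inverse == transpose
```

At 4 bits the codes occupy half a byte per coordinate, plus the `(lo, hi)` scale metadata, which is where the roughly 8x size reduction comes from.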
The catch: pgvector cannot search TurboQuant's 4-bit format directly. It expects supported types such as float32 vectors or halfvec, plus its own index structures. So there are two paths.
Path A: Keep both. Use the vector index for search, and store compressed vectors as a low-cost recovery layer to avoid re-embedding.
Path B: Drop the index. Load compressed vectors into application memory and brute-force scan.
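Path B amounts to a full scan over the reconstructed vectors, which is a few lines of NumPy rather than a database feature (illustrative sketch, not a specific library's API):

```python
import numpy as np

def brute_force_search(query, vectors, k=5):
    # vectors: (n, dim) float32 array decompressed from the quantized store.
    # One matrix-vector product scores every row; no index to build or maintain.
    scores = vectors @ query
    top = np.argpartition(scores, -k)[-k:]        # top-k, unordered
    return top[np.argsort(scores[top])[::-1]]     # sort those k by score, descending
```

For a few hundred to a few tens of thousands of rows, this is typically a sub-millisecond to low-millisecond operation on modern hardware.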
Where brute-force compression wins
Customer support routing. With 500 response templates, brute-force on compressed vectors can still be fast enough, while simple appends avoid index maintenance.
Where indexed search wins
Large-scale retrieval. A system indexing millions of documents cannot afford full scans per query. At that scale, ANN indexing justifies its overhead.
The principle
Compression is not universally better or worse than indexed vector search. It depends on three things:
- Dataset size. Small datasets favor brute-force scan. Large datasets favor ANN indexes.
- Write frequency. Frequently updated datasets pay index maintenance costs on every write. Compressed brute-force is append-only.
- Latency budget. Sub-millisecond requirements need indexes. A 10 ms budget can afford compressed scans over moderate datasets.
Small, frequently updated datasets can favor brute-force on compressed vectors. Large, stable datasets favor indexed float32 vectors. Many real systems need both.
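As a rough illustration, the three factors might be folded into a heuristic like this. The thresholds are invented for the sketch, not benchmarks; real cutoffs depend on hardware, dimensionality, and the quantization scheme:

```python
def choose_strategy(n_vectors, writes_per_hour, latency_budget_ms):
    # Illustrative decision rule combining dataset size, write frequency,
    # and latency budget, per the three factors above.
    if latency_budget_ms < 1:
        return "ann_index"                  # sub-millisecond: index required
    if n_vectors < 10_000 or writes_per_hour > 1_000:
        return "compressed_brute_force"     # small or write-heavy: scan wins
    return "ann_index"                      # large and stable: index wins

choose_strategy(500, 50, 10)        # the support-routing case above
choose_strategy(5_000_000, 1, 10)   # the large document corpus above
```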