Vector database migration
Migrated knowledge and coaching retrieval to turbopuffer and switched to Cohere Embed v4 embeddings, improving latency and recall, giving us headroom to scale per-tenant namespaces, and cutting hot-shard costs for rarely queried knowledge bases.
Experiments & Findings
Background
Ada's AI agents ground their answers in each client's knowledge base (KB). The knowledge retrieval layer is on the hot path of every conversation, so latency and recall directly shape agent quality. We migrated knowledge search and coaching retrieval from our previous vector database to turbopuffer, and re-evaluated our embedding stack from scratch along the way, moving to Cohere Embed v4.
Vector database: turbopuffer
Operating our previous vector database at multi-tenant scale was a constant tuning exercise: shard counts, node sizing, noisy neighbors. Beyond that, there was a deeper architectural issue: knowledge bases ingested years ago and rarely queried still sat in live shards alongside today's traffic, paying the same hot-shard cost. Turbopuffer's compute/storage separation fixes this: cold namespaces (indexes) sit cheaply in object storage and only incur compute cost when queried. Its serverless model also makes per-feature, per-tenant namespaces straightforward to support, giving us low latency and far more scaling headroom than our use cases need.
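As a rough illustration of the per-tenant layout: one namespace per (tenant, feature), so a cold tenant's index costs only object storage until someone queries it. This is a sketch, not our production code; the namespace naming convention is invented here, and the turbopuffer client calls follow the Python SDK's documented shape but should be treated as assumptions and checked against the current SDK.

```python
# Sketch: one turbopuffer namespace per (tenant, feature), so rarely queried
# knowledge bases sit in object storage and only incur compute when hit.
# Method names/parameters below are assumptions based on the turbopuffer
# Python SDK docs; verify against the version you run.
import turbopuffer as tpuf

tpuf.api_key = "TURBOPUFFER_API_KEY"  # placeholder

def namespace_for(tenant_id: str, feature: str) -> tpuf.Namespace:
    # e.g. "kb-search-acme-corp" or "coaching-acme-corp"
    return tpuf.Namespace(f"{feature}-{tenant_id}")

def index_chunks(tenant_id: str, chunks: list[dict], vectors: list[list[int]]) -> None:
    # Offline ingest path: write chunk vectors into the tenant's namespace.
    ns = namespace_for(tenant_id, "kb-search")
    ns.upsert(
        ids=[c["chunk_id"] for c in chunks],
        vectors=vectors,  # int8 (or float) vectors from the embedding model
        attributes={"article_id": [c["article_id"] for c in chunks]},
    )

def search(tenant_id: str, query_vector: list[float], top_k: int = 10):
    # Online retrieval path: top-k nearest chunks for one tenant only.
    ns = namespace_for(tenant_id, "kb-search")
    return ns.query(vector=query_vector, top_k=top_k, distance_metric="cosine_distance")
```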
Model: Cohere embed-v4
We chose Cohere embed-v4 after a four-way bakeoff on 15K production queries across 10 customers. Top-10 recall results (scored as sketched below the list):
- Cohere embed-v4 (float32): 90.7%
- Cohere embed-v4 (int8): 90.6%
- OpenAI text-embedding-3-large (incumbent): 88.0%
- Qwen3-0.6B: 87.6%
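For reference, recall@10 here means the fraction of queries whose labeled relevant chunk shows up in the top 10 retrieved results. A minimal version of that scoring (hypothetical names, not our eval harness):

```python
# Sketch of recall@k scoring: for each production query we have one or more
# labeled relevant chunk ids plus the ids a given embedding model retrieved;
# recall@k is the share of queries with at least one relevant chunk in the top k.
def recall_at_k(labeled: dict[str, set[str]], retrieved: dict[str, list[str]], k: int = 10) -> float:
    hits = 0
    for query_id, relevant_ids in labeled.items():
        top_k = retrieved.get(query_id, [])[:k]
        if relevant_ids.intersection(top_k):
            hits += 1
    return hits / len(labeled)

# Toy example: 2 of 3 queries have a relevant chunk in their top-10 -> 0.667
labeled = {"q1": {"c9"}, "q2": {"c4"}, "q3": {"c7"}}
retrieved = {"q1": ["c1", "c9"], "q2": ["c2"], "q3": ["c7", "c3"]}
print(round(recall_at_k(labeled, retrieved, k=10), 3))  # 0.667
```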
Hosting: different answer per workload. Embeddings serve two workloads with very different SLOs (service-level objectives). Online retrieval (user query → top-k chunks of KB content) is latency-bound, with a sub-200ms P95 target. Offline KB indexing (article → vectors at ingest) is throughput-bound; ≤500ms P50 is fine, and what matters is requests-per-dollar and absorbing spikes of up to 10M tokens/minute when a customer republishes their knowledge base. One provider for both was the first assumption to falsify.
For the online path, we tested six options against sub-100ms P50 and sub-200ms P95 targets. These were first-pass measurements in our multi-tenant configuration; further tuning with each vendor would likely have moved their numbers, but we needed to pick a path that cleared SLO out of the box and move on.
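A percentile harness along these lines is enough for that kind of first pass (a sketch only; `embed_query` is a stand-in for whichever provider client is under test):

```python
# Sketch: measure P50/P95/P99 latency of a single online embedding call over a
# fixed query set. `embed_query` is a placeholder for the provider client being
# benchmarked (OpenAI, Cohere public API, Bedrock, a SageMaker endpoint, ...).
import time
import statistics

def measure_latency(embed_query, queries: list[str]) -> dict[str, float]:
    samples_ms = []
    for q in queries:
        start = time.perf_counter()
        embed_query(q)  # one online embedding request
        samples_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```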
In our setup, OpenAI text-embedding-3-large measured outside SLO (P50 188–200ms, P99 493–932ms). We then tested the same Cohere embed-v4 model across several inference providers and configurations. The Cohere public API hit ~108ms P50 but tailed past 1s under our multi-tenant rate-limit profile. Managed offerings on AWS Bedrock and Azure AI Foundry also showed long P99 tails in our configuration. Cohere Vault, their dedicated single-tenant deployment, landed at ~100ms P50, with ~45ms of that being cross-cloud overhead from our setup that would shrink in a co-located arrangement. Self-hosting on dedicated instances via SageMaker came in at ~77ms P50 and cleared SLO without further tuning, so we went with it for the online path.
For the offline path, the question was mostly requests-per-dollar, with P50 latency of ≤500ms acceptable. In our testing, Cohere Vault delivered ~2.7× the throughput-per-dollar of our self-hosted setup, and its managed autoscaler absorbed KB-republish spikes faster than our self-managed scaling. Vault was the right fit for this workload.
int8 quantization: embed-v4 supports int8 natively. Same latency as float32, retrieval-quality delta within ±0.001 across every top-k metric, vectors 4× smaller (1KB vs 4KB at 1024 dims). At Ada's RAG footprint — thousands of customer KBs with up to tens of thousands of articles each — a 4× reduction in vector size compounds into proportionally lower turbopuffer storage and faster retrieval for AI agents.
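Requesting int8 vectors is a one-parameter change at embed time rather than a post-hoc quantization step. A sketch against the Cohere Python SDK; the parameter names follow Cohere's embed API as we understand it, so treat them as assumptions and verify against the current SDK:

```python
# Sketch: ask embed-v4 for int8 embeddings directly instead of quantizing
# float32 vectors ourselves. Parameter names are assumptions based on the
# Cohere Python SDK's embed() call; verify locally before relying on them.
import cohere

co = cohere.ClientV2(api_key="COHERE_API_KEY")  # placeholder key

response = co.embed(
    texts=["How do I reset my password?"],
    model="embed-v4.0",
    input_type="search_query",     # "search_document" at KB-indexing time
    embedding_types=["int8"],
    output_dimension=1024,
)

vec = response.embeddings.int8[0]
# 1024 dims * 1 byte = 1 KB per vector, vs 1024 * 4 bytes = 4 KB for float32.
print(len(vec), "dims ->", len(vec), "bytes as int8")
```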
Market implications
Three takeaways generalize. First, "one inference configuration for all embedding workloads" is a bad default — online and offline have different SLOs and different cost shapes, and in our case, different winners. Second, int8 quantization on a modern embedding model is a free 4× storage win on multi-tenant RAG with no measurable retrieval cost — a strong default for anyone storing more than a single customer's worth of vectors. Third, vector-store architecture matters as much as raw latency: compute/storage separation plus cheap per-tenant namespaces is what makes multi-tenant RAG scale. The database choice was an enabling decision, not just a destination.