Your RAG problem is an eval problem, not a vector-store problem.
Teams burn months optimizing chunk sizes, hybrid retrieval, and reranker stacks. Then they ship, and only then discover they never had a reliable way to know whether the answers were right.
Most RAG systems we are asked to rescue have a sophisticated retrieval pipeline and a non-existent eval harness. The team can describe their reranker but cannot tell you the F1 on a held-out set of expert-graded answers.
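To make that concrete, here is roughly the score that sentence asks for: a minimal sketch of SQuAD-style token-overlap F1 between a pipeline's answer and an expert-graded reference. The function name and the whitespace tokenization are illustrative assumptions, not anyone's production metric.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a system answer and an
    expert-graded reference answer. Whitespace tokenization is an
    illustrative simplification."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: tokens the answer and reference share.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```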
This is backwards. The eval harness is the spec. Without it, every change is a vibes-based experiment — chunk size up, chunk size down, embedding model swap, hybrid weight knob — with no way to know if you are getting better or worse.
Build the eval first. Hand-grade fifty representative queries with expert ground truth. Score every change against that set. The retrieval architecture you end up with will be different — and shippable.
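A minimal sketch of what that harness can look like, reusing `token_f1` from above and assuming a hypothetical `answer_fn` entry point plus a hand-graded `golden_set.jsonl`; every name here is an assumption, not a prescribed interface.

```python
import json
import statistics

def run_eval(answer_fn, golden_path: str = "golden_set.jsonl") -> float:
    """Score a RAG pipeline against the hand-graded golden set.

    `answer_fn` is whatever callable maps a query string to an answer
    string in your pipeline; `golden_set.jsonl` holds one
    {"query": ..., "reference": ...} object per line. Both names are
    illustrative, not a prescribed layout.
    """
    scores = []
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)
            prediction = answer_fn(case["query"])
            scores.append(token_f1(prediction, case["reference"]))
    mean_f1 = statistics.mean(scores)
    print(f"{len(scores)} queries, mean F1 = {mean_f1:.3f}")
    return mean_f1
```

Run this before and after every chunk-size, embedding, or reranker change; if the number does not move, the change was a vibe.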
Want to scope an engagement around this?