Tags: rag, pgvector, embeddings, langchain, retrieval, ai-engineering

Building a RAG Pipeline That Actually Works: Chunking, Embeddings, and Retrieval Tuning

June 4, 2026 · 6 min · Simple Agent Team

Why Most RAG Pipelines Fail in Production

The failure isn’t in the LLM. It’s in the retrieval. Naive implementations split text every 1,000 characters, dump the chunks into an in-memory FAISS index using text-embedding-ada-002, and return the top-3 cosine matches. In production, this collapses for three predictable reasons.

First, the chunk boundary problem: fixed-size cuts slice through code blocks, procedures, or conditional logic. The LLM receives half a webhook rotation step and fabricates the rest. Second, embedding model mismatch: a general-purpose model trained on Reddit and Wikipedia will retrieve a semantically similar “user management” blog post instead of your exact API schema for user provisioning. Third, cosine similarity is not enough: it measures vector proximity, not factual relevance. A chunk about “deprecated v1 keys” is close in embedding space to a query about “current signing keys,” but feeding it to the model produces confident hallucinations. Most “bad RAG” is actually bad retrieval dressed up with a good generator.

Chunking: The Decision That Makes or Breaks Retrieval

Measure chunk size in tokens, not characters. For a typical SaaS docs site written in Markdown, start with RecursiveCharacterTextSplitter from LangChain (or MarkdownNodeParser from LlamaIndex). Despite its name, the splitter counts length in tokens when you build it with from_tiktoken_encoder and the tiktoken encoding cl100k_base.

# chunk_size and chunk_overlap below are counted in cl100k_base tokens (requires tiktoken).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)

The order of the separators matters: the splitter tries paragraph breaks first, then line breaks, then sentence ends, and only then falls back to word-level and character-level splits. A 512-token chunk with 50-token overlap keeps local context without bloating the index. For structured docs, pair this with parent-document retrieval: embed small, precise child chunks, but retrieve the full parent section (e.g., 2,048 tokens) to feed the LLM. This mitigates the boundary problem while keeping the embedding search granular.
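Parent-document retrieval can be sketched in a few lines of plain Python. This is a minimal illustration rather than LangChain's or LlamaIndex's implementation; the `Child` class, the `parent_id` field, and `parents_for_hits` are hypothetical names, and the sample texts are made up.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: str  # hypothetical key linking a child chunk to its parent section


def parents_for_hits(hits, parents, max_parents=3):
    """Map ranked child hits to unique parent sections, preserving rank order."""
    seen, out = set(), []
    for child in hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            out.append(parents[child.parent_id])
        if len(out) == max_parents:
            break
    return out


# Made-up data: parent sections keyed by id, and child hits in ranked order.
parents = {
    "webhooks": "## Webhooks\nFull section text covering setup and secret rotation.",
    "api-keys": "## API keys\nFull section text covering API key management.",
}
hits = [
    Child("Rotate the signing secret from the dashboard.", "webhooks"),
    Child("Verify the signature header on each delivery.", "webhooks"),
    Child("API keys can be revoked at any time.", "api-keys"),
]
sections = parents_for_hits(hits, parents)
```

Two child hits from the same section collapse into one parent, so the LLM sees each section once and in rank order.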

If your docs contain OpenAPI YAML or JSON tables, add a custom separator for code fences. A split inside a required array breaks schema validation logic and causes the generator to invent fields.
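One way to keep code fences intact is to pull them out before token-based splitting. The sketch below is a plain-Python pre-pass, not a LangChain separator setting; `split_keeping_fences` is a hypothetical helper, and it assumes fences open and close with three backticks at the start of a line.

```python
import re

TICKS = "`" * 3  # a Markdown code fence marker

# Matches a whole fenced block, from an opening fence line to its closing fence line.
FENCE = re.compile(rf"^{TICKS}.*?^{TICKS}[ \t]*$", re.MULTILINE | re.DOTALL)


def split_keeping_fences(markdown):
    """Return (kind, text) segments; fenced code blocks are never cut."""
    segments, pos = [], 0
    for match in FENCE.finditer(markdown):
        prose = markdown[pos:match.start()].strip()
        if prose:
            segments.append(("prose", prose))
        segments.append(("code", match.group(0)))
        pos = match.end()
    tail = markdown[pos:].strip()
    if tail:
        segments.append(("prose", tail))
    return segments


# Made-up document: a YAML schema with a blank line inside the fence.
doc = "\n".join([
    "Request schema:",
    TICKS + "yaml",
    "required:",
    "  - email",
    "",
    "  - role",
    TICKS,
    "Send the payload to the provisioning endpoint.",
])
segments = split_keeping_fences(doc)
```

Prose segments can then go through the token splitter, while each code segment is stored as a single chunk even when it contains blank lines that `"\n\n"` would otherwise split on.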

Choosing Your Embedding Model (with numbers)

Do not default to the cheapest option. On the MTEB retrieval benchmark, small generalist models (e.g., sentence-transformers/all-MiniLM-L6-v2) score roughly in the low 40s NDCG@10 (approximate), while larger dedicated models like Cohere embed-multilingual-v3 or OpenAI text-embedding-3-large land in the mid 50s (approximate). That gap translates directly to production recall: if your embedding model misses the correct chunk in the top-8, the reranker never sees it.

For a SaaS docs pipeline, pick a model aligned with your content language and domain. If your docs are English-only technical prose, OpenAI text-embedding-3-large at 256 dimensions (using the dimensions parameter) gives strong retrieval with lower pgvector storage cost than the full 3,072-dimension output. The reduced size also stays under pgvector's 2,000-dimension limit for indexing the vector type, which the full output exceeds. If you have mixed-language support tickets or multilingual API references, Cohere embed-multilingual-v3 at 1,024 dimensions is the safer baseline.
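The storage difference is easy to estimate. The sketch below uses pgvector's documented size for a `vector` value (4 bytes per dimension plus 8 bytes of overhead) and ignores row headers, other columns, and the HNSW index itself; the 50,000-chunk count is just the upper end of the range this article discusses.

```python
def vector_bytes(dimensions):
    """Approximate size of one pgvector `vector` value: 4 bytes per float plus 8 bytes overhead."""
    return 4 * dimensions + 8


def table_megabytes(n_chunks, dimensions):
    """Raw vector storage in MB (10^6 bytes), excluding indexes and row overhead."""
    return n_chunks * vector_bytes(dimensions) / 1_000_000


for dims in (256, 1024, 3072):
    print(f"{dims:>5} dims: {table_megabytes(50_000, dims):7.1f} MB for 50,000 chunks")
```

At 50,000 chunks this works out to about 52 MB of raw vectors at 256 dimensions, about 205 MB at 1,024, and about 615 MB at 3,072.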

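Store vectors in pgvector with an HNSW index rather than relying on an exact sequential scan. For 5,000–50,000 chunks, the difference between an exact scan and HNSW can be the difference between roughly 200 ms and 20 ms per query (estimated; it depends on hardware and vector dimensions). Build the index with m=16, ef_construction=64, which are also pgvector's defaults, in SQL: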

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
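To actually use that index, the query has to order by the same distance operator the index was built for. The sketch below only builds the SQL and parameters, so it runs without a database; the `id` and `content` columns and the `build_query` helper are assumptions, and the `%(name)s` placeholders assume a psycopg-style driver.

```python
def to_vector_literal(embedding):
    """Format floats in pgvector's text input form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# `<=>` is pgvector's cosine-distance operator, matching vector_cosine_ops in the index.
# The `id` and `content` columns are hypothetical; adjust to your schema.
TOP_K_SQL = """
SELECT id, content, embedding <=> %(q)s::vector AS distance
FROM docs
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s
"""


def build_query(embedding, k=8):
    """Return (sql, params) for a top-k cosine-distance search."""
    return TOP_K_SQL, {"q": to_vector_literal(embedding), "k": k}


sql, params = build_query([0.1, 0.2, 0.3])
```

With a real connection you would pass `sql` and `params` to `cursor.execute`. Search breadth is controlled separately, per session, with `SET hnsw.ef_search = ...`.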

Retrieval Tuning: top-k, reranking, and MMR

Cosine similarity from the ANN index is your first filter, not your last. Run top-k=8 from the HNSW layer, then rerank. A cross-encoder or API reranker (e.g., Cohere Rerank v3.5) scores query-chunk pairs with full cross-attention, catching nuances that cosine similarity misses, like the difference between “rotate keys” (cryptography) and “rotate logs” (infrastructure).

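# Pipeline sketch. Assumes `index` is a LangChain vector store backed by pgvector
# and `query` is the user's question.
import cohere

co = cohere.ClientV2()  # reads the API key from the CO_API_KEY environment variable

# ef_search is a pgvector session setting (SET hnsw.ef_search = 32;),
# not an argument to similarity_search.
candidates = index.similarity_search(query, k=8)
reranked = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[c.page_content for c in candidates],
    top_n=3,
)
# Map each reranked result back to its original Document.
context = [candidates[r.index] for r in reranked.results]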

If you skip reranking, expect roughly 10–20 percentage points lower recall@3 (estimated from public reranker benchmarks). Those missing points surface in production as answers built from the wrong documentation version.
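Measuring that gap on your own data takes very little code. The sketch below assumes each evaluation query has exactly one correct chunk; the chunk ids and rankings are made up for illustration, and `recall_at_k` and `mean_recall_at_k` are hypothetical helper names.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the single relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_id) pairs."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Made-up rankings for three queries: ANN order versus reranked order.
ann_runs = [
    (["c7", "c2", "c9", "c1"], "c2"),
    (["c4", "c5", "c6", "c3"], "c3"),  # correct chunk sits at rank 4
    (["c8", "c1", "c0", "c2"], "c8"),
]
reranked_runs = [
    (["c2", "c7", "c9", "c1"], "c2"),
    (["c3", "c4", "c5", "c6"], "c3"),  # reranker promotes it to rank 1
    (["c8", "c1", "c0", "c2"], "c8"),
]
recall_before = mean_recall_at_k(ann_runs, k=3)
recall_after = mean_recall_at_k(reranked_runs, k=3)
```

Run the same function over the ANN ordering and the reranked ordering of the same candidates to see how much the reranker contributes on your query set.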

Maximal Marginal Relevance (MMR) adds diversity. In LangChain, call max_marginal_relevance_search (or use search_type="mmr" on a retriever) with fetch_k=20 and lambda_mult=0.5 to trade off similarity against redundancy. MMR helps when a query maps to multiple features (“How do permissions work?” could hit RBAC, OAuth scopes, or workspace roles). The cost is extra latency and, in some cases, precision loss. Measure it on your held-out query set; if it drags down top-3 accuracy, drop it.
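To make the trade-off concrete, here is a small pure-Python sketch of the MMR selection rule. It is an illustration of the technique, not LangChain's code; the 2-D vectors are toy data, and `lambda_mult` plays the same role as the LangChain parameter (1.0 means pure relevance, lower values favor diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # most relevant
    [1.0, 0.12],   # near-duplicate of the first
    [0.7, -0.7],   # less relevant, but covers a different direction
]
diverse = mmr(query, docs, k=2, lambda_mult=0.5)
greedy = mmr(query, docs, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the third document; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.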

A Worked Example: 50-page SaaS Docs to Production in 4 Hours

Scenario: a Docusaurus site with 50 pages of mixed Markdown and OpenAPI specs. Total raw text: ~120,000 tokens.

Ingestion: Use LlamaIndex MarkdownNodeParser to preserve header hierarchy, falling back to RecursiveCharacterTextSplitter (512 tokens, 50-token overlap, cl100k_base) for leaf nodes. Result: ~1,200 nodes.

Embedding: Encode with Cohere embed-multilingual-v3 (1,024 dims). Batch insert into PostgreSQL 15 with pgvector. Build HNSW index with m=16, ef_construction=64. Index time: under 5 minutes for this volume.

Query: “How do I rotate webhook signing keys?”

ANN retrieval: HNSW search with hnsw.ef_search=32, top-k=8. Latency: ~15–25 ms estimated on a standard pgvector instance. Without reranking, the top-3 includes one chunk about API key rotation (wrong entity) and two about webhook configuration (relevant but not the exact procedure). Estimated recall@3 over a held-out query set: ~61%.
Reranking: Run Cohere Rerank v3.5 over the 8 candidates, returning top-3. The reranker penalizes the generic API-key chunk and promotes the specific “Rotate signing secret” procedure. Estimated recall@3 rises to ~79–83%. Reranker latency: ~150–300 ms estimated.
Generation: Feed the top-3 chunks to gpt-4o-mini with a citation constraint (“Answer using only the provided context, cite the source header”).
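The generation step mostly comes down to how the prompt is assembled. The sketch below builds the prompt string only and leaves the model call out; `build_prompt`, the numbered `[n] Source:` layout, and the sample chunk are assumptions, while the instruction sentence is the citation constraint quoted above.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt from (source_header, text) pairs, already reranked."""
    blocks = [
        f"[{i}] Source: {header}\n{text}"
        for i, (header, text) in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer using only the provided context, cite the source header.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up chunk text for illustration.
prompt = build_prompt(
    "How do I rotate webhook signing keys?",
    [("Rotate signing secret", "Open the webhook settings and generate a new secret.")],
)
```

Keeping the source header next to each chunk gives the model something concrete to cite and makes it easy to check citations against the retrieved set afterwards.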

Total retrieval-side p99 latency, before generation: ~400–500 ms estimated. If your SLA demands <300 ms, switch to a local cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 via sentence-transformers and reduce top-k to 5.
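Checking a latency budget like this needs logged samples and a percentile function. The sketch below uses the nearest-rank definition with integer arithmetic; the sample latencies are made up, and `percentile` and `within_budget` are hypothetical helpers rather than part of any monitoring library.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it.

    `pct` is an integer from 1 to 100.
    """
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) without floats
    return ordered[max(rank, 1) - 1]


def within_budget(latencies_ms, budget_ms, pct=99):
    """True if the chosen percentile of the logged latencies fits the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Made-up reranker latencies in milliseconds: mostly fast, with a slow tail.
samples = [180] * 97 + [290, 310, 450]
p99 = percentile(samples, 99)
fits_300 = within_budget(samples, 300)
```

In this toy sample the median is comfortably low, but the tail alone pushes p99 past a 300 ms budget, which is the case the SLA check is meant to catch.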

Checklist Before You Go Live

Chunk audit: Verify no code fences, tables, or numbered procedures are split at boundaries. Inspect the 20 longest chunks manually.
Embedding validation: Curate 50 question–answer pairs from your docs. Measure top-5 recall. If it is below ~70% (estimated target for production viability), switch models or add domain fine-tuning with MultipleNegativesRankingLoss.
Index verification: Run EXPLAIN ANALYZE on your pgvector query. Confirm the plan shows an Index Scan using the HNSW index, not a Seq Scan.
Reranker budget: Log p99 reranker latency. If it exceeds your API gateway timeout, downsize the cross-encoder or cache frequent queries.
MMR ablation: Test MMR on ambiguous queries. If precision drops, remove it; diversity is worthless if the correct chunk is pushed to rank 4.
Guardrail: If the reranker score for your top result is below a threshold (0.3 is a reasonable starting point; calibrate it on your own query set), return a “No relevant documentation found” response instead of letting the LLM hallucinate.
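The guardrail item can be sketched as a thin wrapper around the generation call. This is a minimal illustration under stated assumptions: `select_context` and `answer_or_refuse` are hypothetical names, `generate` stands in for whatever function calls your LLM, and the scores are made-up reranker relevance scores.

```python
NO_ANSWER = "No relevant documentation found"


def select_context(scored_chunks, min_top_score=0.3, top_n=3):
    """Return the top_n chunk texts, or None when the best score is below the threshold.

    scored_chunks: list of (relevance_score, text) pairs from the reranker.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    if not ranked or ranked[0][0] < min_top_score:
        return None
    return [text for _, text in ranked[:top_n]]


def answer_or_refuse(scored_chunks, generate):
    """Call generate(context) only when retrieval cleared the threshold."""
    context = select_context(scored_chunks)
    return NO_ANSWER if context is None else generate(context)


# `generate` is a stand-in for the real LLM call.
def fake_generate(context):
    return f"answer from {len(context)} chunks"


weak = answer_or_refuse([(0.12, "unrelated chunk"), (0.05, "another")], fake_generate)
strong = answer_or_refuse([(0.81, "Rotate signing secret"), (0.4, "Webhooks")], fake_generate)
```

Refusing before generation keeps the check cheap and deterministic; the threshold itself should come from the score distribution on your own evaluation queries.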
