Technical Deep Dive

Building Production RAG: How We Achieved 87% Recall with Hybrid Search + pgvector

When we A/B tested pure vector search vs. hybrid search (vector + keyword), hybrid won by 16 percentage points on recall for business documents (87% vs. 71%). Here's how we built a production RAG system that costs $0 incremental on PostgreSQL + pgvector vs. $675/month for Pinecone.

16 min read · January 2025

Key Takeaways

  • Hybrid search (vector + BM25) achieves 87% recall vs 71% vector-only
  • pgvector on PostgreSQL costs $0 incremental vs $675/month for Pinecone
  • Section-aware chunking preserves context that fixed-size misses
  • Parallel query execution saves 30% latency (50ms vs 75ms)
  • PostgreSQL RLS provides database-level multi-tenant security

Why RAG for Document Search?

The business problem: Searching thousands of invoices, purchase orders, and contracts by natural language. When a CFO asks "Show me all shipments from Acme Corp delayed in Q4," traditional keyword search fails.

Why traditional search fails: Keyword matching demands exact terms. A query for "delayed" misses documents that say "late" or "pushed out," and keyword engines can't combine vendor, status, and time-range constraints the way the CFO's question does.

Why pure LLM context stuffing fails: Feeding every document into the prompt blows past context windows for any non-trivial corpus, and it costs roughly 8x more per query ($0.40 vs. $0.05, see below).

RAG as the middle ground: Retrieve relevant chunks (hybrid search), then generate responses with source citations. Cost: $0.05 per query vs $0.40 for full context [2].

Production metrics after 50K+ queries:

  • 94% query success rate (user gets relevant results)
  • 87% recall@10 (hybrid search) vs 71% (vector-only) vs 65% (BM25-only)
  • P95 latency: 2.8s (retrieval + LLM generation)

Architecture Overview: The 30-Second Version

Here's the technical stack before the deep dive:

  1. PostgreSQL + pgvector as the vector store (not Pinecone/Weaviate)
  2. Hybrid search: Vector similarity (semantic) + BM25 full-text (keyword precision)
  3. Parallel execution: Vector and BM25 queries run concurrently, fused via RRF
  4. Multi-tenant isolation: Row-Level Security (RLS) for secure multi-company data
  5. Cost optimization: ~$0 incremental vs $675/month for Pinecone (10M vectors) [3]
# Simplified hybrid search (Python). Helpers (openai_embed, vector_search,
# fulltext_search, rrf_fusion, recency_boost) are defined elsewhere.
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def hybrid_search(query: str, company_id: str, limit: int = 20):
    # Generate embedding for the query (~150ms OpenAI call, cacheable)
    query_embedding = await openai_embed(query)

    # Run vector + BM25 queries in parallel threads without blocking the
    # event loop (~50ms total vs ~75ms sequential)
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as executor:
        vector_task = loop.run_in_executor(
            executor, lambda: vector_search(query_embedding, company_id, limit=50)
        )
        bm25_task = loop.run_in_executor(
            executor, lambda: fulltext_search(query, company_id, limit=50)
        )
        # Vector query ~40ms, BM25 query ~35ms; total bounded by the slower one
        vector_results, bm25_results = await asyncio.gather(vector_task, bm25_task)

    # Reciprocal Rank Fusion (RRF): merge rankings without score normalization
    fused_results = rrf_fusion(vector_results, bm25_results, k=60)

    # Recency boost (business logic): prefer documents from the last 30 days
    final_results = recency_boost(fused_results, days_threshold=30)
    return final_results[:limit]

Why this architecture? PostgreSQL + pgvector eliminates a separate vector DB, reduces infrastructure complexity, and costs $0 incremental (already paying for PostgreSQL). Hybrid search catches both semantic similarity ("urgent payment" = "expedited invoice") and exact matches (PO-68755).

Deep Dive: Document Chunking Strategy

The chunking challenge: Invoices and POs have mixed structure: headers (vendor, PO number, date), line items (part numbers, quantities, prices), and terms. Naive fixed-size chunking breaks this context.

Our Chunking Approach

Target chunk size: ~600 tokens with 100-token overlap (based on embedding model limits)

Industry best practices recommend 400-800 tokens with 10-20% overlap [4]. We chose 600 tokens because it holds a typical invoice section (a header block or a run of line items) as a single unit, stays comfortably under the embedding model's input limit, and the 100-token overlap (~17%) preserves continuity across section boundaries.

Section-aware chunking: Preserve document structure

from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    text: str
    page: int
    metadata: dict = field(default_factory=dict)

def chunk_invoice(text: str, metadata: dict) -> List[Chunk]:
    """
    Section-aware chunking for procurement documents.
    Preserves headers, line items, and terms as logical units.
    """
    sections = split_by_headers(text)  # Detect: header, line_items, terms

    chunks = []
    for section in sections:
        # Add business context prefix for better retrieval
        po_number = metadata.get('po_number', 'Unknown')
        vendor = metadata.get('vendor_name', 'Unknown')
        chunk_text = f"PO {po_number} | Vendor: {vendor}\n\n{section}"

        # Track page numbers for citations
        page_num = extract_page_number(section)

        chunks.append(Chunk(text=chunk_text, page=page_num, metadata=metadata))
    return chunks

Why section-aware chunking? Retrieval quality depends on each chunk being self-contained: a line item separated from its header loses the vendor and PO number that make it findable.

Common mistake

Using fixed 512-token chunks splits tables mid-row. Result: "Show me quantity for part XYZ-123" retrieves chunk with part number but not quantity (in next chunk). Section-aware chunking prevents this.
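For reference, split_by_headers is left abstract in the code above. Here is a minimal sketch of how section detection could work, assuming invoices carry recognizable header / line-item / terms markers; the regex patterns are illustrative, not the production rules, and would need tuning per document format:

import re
from typing import List

# Illustrative boundary markers; real invoice layouts vary widely.
SECTION_PATTERNS = [
    re.compile(r"^(PURCHASE ORDER|INVOICE|Vendor:|Bill To:)", re.M),  # header
    re.compile(r"^(Item|Part No\.|Qty|Description)\b", re.M),         # line items
    re.compile(r"^(Terms|Payment Terms|Net \d+)", re.M),              # terms
]

def split_by_headers(text: str) -> List[str]:
    """Split on detected section boundaries so tables stay whole."""
    boundaries = sorted(
        {m.start() for pattern in SECTION_PATTERNS for m in pattern.finditer(text)}
    )
    if not boundaries:
        return [text]  # no markers found: fall back to a single section
    starts = ([0] if boundaries[0] != 0 else []) + boundaries
    slices = zip(starts, starts[1:] + [len(text)])
    return [text[s:e].strip() for s, e in slices if text[s:e].strip()]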

The key insight: Vector search misses exact matches (part numbers, PO IDs, entity names). BM25 keyword search handles precision, vectors handle semantics.

Why Hybrid Search Outperforms Either Method Alone

  • Vector-only recall: 71%
  • BM25-only recall: 65%
  • Hybrid recall: 87% (+16 points over vector-only, +22 points over BM25-only)

Real-world example where vector search fails:

Query: "Find PO-68755" # Vector search result: Chunks: [ "Purchase Order 68755 from Acme Corp...", # MISS (cosine: 0.72) "PO acknowledgment for order #68755...", # MISS (cosine: 0.68) "Quote 68750 for industrial parts..." # False positive (cosine: 0.81) ] # BM25 search result: Chunks: [ "Purchase Order 68755 from Acme Corp...", # HIT (exact text match) "PO acknowledgment for order #68755...", # HIT (exact match) ] # Hybrid (RRF fusion) result: Chunks: [ "Purchase Order 68755 from Acme Corp...", # TOP (both methods rank high) "PO acknowledgment for order #68755...", # 2nd (both methods agree) "Quote 68750 for industrial parts..." # Filtered (low BM25 score) ]

Reciprocal Rank Fusion (RRF) Algorithm

RRF combines ranked results without tuned weights [5]. Each chunk's score is the sum of 1 / (k + rank) over every result list it appears in, with k = 60 (the standard default).

from collections import defaultdict

def rrf_fusion(vector_results, bm25_results, k=60):
    """
    Reciprocal Rank Fusion: combines rankings without score normalization.
    Advantage: no hyperparameter tuning, robust to score scale mismatch.
    """
    chunk_scores = defaultdict(float)

    # Score vector results (rank 1 = highest contribution)
    for rank, result in enumerate(vector_results, start=1):
        chunk_scores[result['chunk_id']] += 1.0 / (k + rank)

    # Score BM25 results (scores add when a chunk appears in both lists)
    for rank, result in enumerate(bm25_results, start=1):
        chunk_scores[result['chunk_id']] += 1.0 / (k + rank)

    # Sort by combined score, descending
    return sorted(chunk_scores.items(), key=lambda x: x[1], reverse=True)

Why RRF over weighted averaging? Cosine distances and BM25 scores live on incompatible scales, so weighted averaging requires score normalization and per-corpus weight tuning. RRF uses only ranks, needs no tuning, and beat our weighted-average baseline by 12% on NDCG@3 (see A/B results below).

Parallel Query Execution

Running vector + BM25 queries sequentially costs 75ms (40ms + 35ms). Parallel execution: 50ms total (30% faster).

-- Vector search (HNSW index, approximate nearest neighbor)
SELECT chunk_id,
       chunk_text,
       embedding <=> %s AS distance   -- <=> is pgvector's cosine distance operator
FROM document_chunks
WHERE company_id = %s
ORDER BY embedding <=> %s             -- %s = query_embedding
LIMIT 50;                             -- over-fetch; RRF fusion trims later

-- Keyword full-text search (GIN index). Postgres's ts_rank_cd is a
-- cover-density ranking, used here as the BM25-style keyword signal.
SELECT chunk_id,
       chunk_text,
       ts_rank_cd(tsvector, plainto_tsquery('english', %s)) AS rank
FROM document_chunks
WHERE company_id = %s
  AND tsvector @@ plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT 50;

Database indexing is critical for performance: the vector query relies on an HNSW index over embedding (approximate nearest neighbor), the full-text query on a GIN index over the tsvector column, and tenant filtering benefits from a plain index on company_id. A sketch of the DDL follows.
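A minimal sketch of that DDL, wrapped in Python to match the rest of the post; the index names and the psycopg2 driver are assumptions, not the production setup:

import psycopg2  # assumed driver; any PostgreSQL client works the same way

INDEX_DDL = [
    # HNSW index: approximate nearest-neighbor search over embeddings (pgvector)
    """CREATE INDEX IF NOT EXISTS idx_chunks_embedding_hnsw
       ON document_chunks USING hnsw (embedding vector_cosine_ops)""",
    # GIN index: accelerates @@ full-text matches on the tsvector column
    """CREATE INDEX IF NOT EXISTS idx_chunks_tsvector_gin
       ON document_chunks USING gin (tsvector)""",
    # Plain B-tree: keeps the per-tenant company_id filter cheap
    """CREATE INDEX IF NOT EXISTS idx_chunks_company_id
       ON document_chunks (company_id)""",
]

def create_indexes(dsn: str) -> None:
    """Create the retrieval indexes once, at deploy time."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for ddl in INDEX_DDL:
            cur.execute(ddl)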

Production Considerations

Multi-Tenant Isolation with Row-Level Security

The challenge: Multiple companies, one database—zero tolerance for data leakage.

Solution: PostgreSQL Row-Level Security (RLS) + session variables. Even SQL injection can't cross tenant boundaries.

-- Enable RLS on the document_chunks table
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;

-- Policy: users only see their own company's rows
CREATE POLICY company_isolation ON document_chunks
    FOR ALL
    USING (company_id = current_setting('app.current_company_id')::uuid);

# Python: set the RLS context per transaction (SET LOCAL scopes it)
cur.execute("SET LOCAL app.current_company_id = %s", [company_id])

# All subsequent queries in this transaction auto-filter via the RLS policy
cur.execute("SELECT * FROM document_chunks WHERE ...")
results = cur.fetchall()

Why RLS over application-level filtering? Application-level WHERE clauses must be remembered on every query; one forgotten filter leaks data. RLS enforces isolation at the database level, so a buggy query, or even SQL injection, still can't cross tenant boundaries.
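The guarantee is easy to verify with a negative test. A hypothetical sketch (the conn fixture and company IDs are assumed): with company A's session context set, an explicit query for company B's rows must come back empty, because the RLS predicate is ANDed onto every query.

def test_rls_blocks_cross_tenant_reads(conn, company_a_id, company_b_id):
    """Even an explicit cross-tenant query returns nothing under RLS."""
    with conn.cursor() as cur:
        # Bind the session to company A (SET LOCAL scopes it to this transaction)
        cur.execute("SET LOCAL app.current_company_id = %s", [company_a_id])
        # Deliberately ask for company B's chunks
        cur.execute(
            "SELECT count(*) FROM document_chunks WHERE company_id = %s",
            [company_b_id],
        )
        assert cur.fetchone()[0] == 0  # the policy filters B's rows out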

Cost Optimization: pgvector vs. External Vector DBs

Vector DB                  | 10M vectors + 1M queries/month | Notes
---------------------------|--------------------------------|------------------------------
Pinecone                   | ~$675                          | Managed, simple, expensive
Weaviate (self-hosted)     | ~$200 + ops overhead           | Open source, complex ops
pgvector (on existing RDS) | $0 incremental                 | Already paying for PostgreSQL

Key insight: For <=100M vectors, pgvector on existing PostgreSQL is cost-optimal [3][8]. Beyond ~100M vectors, or when you need distributed search, migrate to a specialized vector DB.

Latency Budget Breakdown

Latency budget (components sum to ~2s; measured production P95 is 2.8s):

|-- Embedding generation: ~150ms (OpenAI API call, cacheable)
|-- Vector query:          ~40ms (HNSW index on 1M+ chunks)
|-- BM25 query:            ~35ms (GIN index, runs in parallel with vector)
|-- RRF fusion:             ~5ms (in-memory sorting)
+-- LLM response:          ~1.8s (Grok streaming)

Parallel execution saves ~25ms (vector + BM25 concurrent vs. sequential).

pgvector performance at scale: with the HNSW index, our vector queries hold at ~40ms across 1M+ chunks (see the latency budget above); see [3][7] for third-party benchmarks.

Common RAG Pitfalls and How We Avoided Them

Pitfall 1: Poor Chunking Breaks Context

Symptom: Line items split across chunks, losing vendor context

Fix: Section-aware chunking with business context prefix (see chunking section)

Pitfall 2: Vector Search Misses Exact Matches

Symptom: Query "PO-68755" doesn't match "Purchase Order 68755"

Fix: Hybrid search. BM25 catches exact matches, vector handles fuzzy semantics

Pitfall 3: No Relevance Filtering

Symptom: Weak matches pollute results, confuse LLM

Fix: Minimum RRF score threshold (0.005) filters noise

# Apply a relevance threshold before returning results
MIN_RRF_SCORE = 0.005

final_results = [
    chunk for chunk, score in ranked_chunks
    if score >= MIN_RRF_SCORE
][:limit]

Pitfall 4: Slow Sequential Queries

Symptom: Vector + BM25 queries take 75ms sequentially

Fix: Parallel execution with ThreadPoolExecutor (50ms total)

Pitfall 5: Context Window Overflow

Symptom: Top 50 chunks exceed LLM context limit

Fix: Limit to top 20 chunks (~12K tokens), use reranker if needed (future work)
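A minimal sketch of that budget enforcement, assuming tiktoken's cl100k_base tokenizer (matching OpenAI models); the limits and function name are illustrative:

from typing import List

import tiktoken  # assumed tokenizer library for OpenAI-compatible models

ENC = tiktoken.get_encoding("cl100k_base")
MAX_CHUNKS = 20
MAX_TOKENS = 12_000

def fit_context(chunks: List[str]) -> List[str]:
    """Keep at most MAX_CHUNKS chunks, stopping before the token budget overflows."""
    selected, budget = [], MAX_TOKENS
    for text in chunks[:MAX_CHUNKS]:
        cost = len(ENC.encode(text))
        if cost > budget:
            break  # chunks are relevance-ordered, so truncate the tail
        selected.append(text)
        budget -= cost
    return selected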

Performance Metrics and Results

Business Metrics

  • 94% query success rate: users get relevant results (50K+ production queries)

Technical Metrics

  • P50 latency: 1.2s
  • P95 latency: 2.8s
  • Recall@10 (hybrid): 87%

A/B Test Results (3-week test, 5K queries):

  • Hybrid vs. pure vector: +16 points recall (87% vs. 71%)
  • RRF fusion vs. weighted average: +12% relevance (NDCG@3)
  • Recency boost: +8% user satisfaction for time-sensitive queries

Future Enhancements

1. Reranker Layer

Cross-encoder model to re-score top-K results (10-20% accuracy gain). Example: Cohere Rerank API or open-source BGE reranker [9].
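A minimal sketch of what that layer could look like, using the open-source BGE reranker via sentence-transformers [9]; the model name and top_k are illustrative:

from typing import List

from sentence_transformers import CrossEncoder  # assumed dependency

# Open-source BGE reranker [9]; the exact model choice is illustrative.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: List[str], top_k: int = 5) -> List[str]:
    """Re-score (query, chunk) pairs with a cross-encoder and keep the best."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]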

2. Query Expansion

Generate multiple query variations, especially for acronyms/abbreviations. Example: "PO" = ["Purchase Order", "PO", "P.O."]
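A hedged sketch of the idea; the expansion table is illustrative (it could be LLM-generated in practice), and each variant would run through hybrid_search with the results fused by the same RRF step:

from typing import List

# Hypothetical expansion table, keyed by abbreviation.
EXPANSIONS = {"PO": ["Purchase Order", "P.O."], "INV": ["Invoice"]}

def expand_query(query: str) -> List[str]:
    """Return the original query plus abbreviation-expanded variants."""
    variants = [query]
    for token in query.split():
        for expansion in EXPANSIONS.get(token, []):
            variants.append(query.replace(token, expansion))
    return variants

# expand_query("PO status for Acme")
# -> ["PO status for Acme", "Purchase Order status for Acme", "P.O. status for Acme"]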

3. Semantic Caching

Cache embeddings for common queries (reduce OpenAI costs by 40-60%).
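A minimal sketch of the exact-match variant (true semantic caching would also match near-duplicate queries by embedding similarity); the in-process dict stands in for Redis or similar, and openai_embed is the post's embedding helper:

import hashlib

# Illustrative in-process cache; production would use a shared store.
_EMBED_CACHE: dict = {}

async def cached_embed(query: str) -> list:
    """Exact-match embedding cache keyed on the normalized query text."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _EMBED_CACHE:
        _EMBED_CACHE[key] = await openai_embed(query)
    return _EMBED_CACHE[key]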

4. GraphRAG

Model relationships between documents (PO -> Invoice -> Receipt) for supply chain queries.
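A toy sketch of the underlying idea; the graph, the invoice/receipt IDs, and the traversal are illustrative placeholders for a real relationship store:

from typing import Dict, List

# Toy relationship graph: edges model PO -> Invoice -> Receipt links.
DOC_GRAPH: Dict[str, List[str]] = {
    "PO-68755": ["INV-1041"],    # purchase order -> invoice (IDs hypothetical)
    "INV-1041": ["RCPT-2207"],   # invoice -> goods receipt
}

def document_chain(doc_id: str) -> List[str]:
    """Follow outgoing edges to gather every document linked to doc_id."""
    chain, stack, seen = [], [doc_id], {doc_id}
    while stack:
        for linked in DOC_GRAPH.get(stack.pop(), []):
            if linked not in seen:
                seen.add(linked)
                chain.append(linked)
                stack.append(linked)
    return chain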

5. Long-Context Models

When GPT-5/Claude 4 support 1M+ tokens, revisit chunking strategy (may eliminate chunking for single-doc queries).

Conclusion and Key Takeaways

Production-grade RAG requires:

  1. Hybrid search is non-negotiable: Vector + BM25 outperforms either alone for structured documents (+16 points recall over vector-only in our tests)
  2. pgvector is underrated: For <=100M vectors, it's cost-optimal and performant ($0 vs $675/month)
  3. Chunking matters more than embeddings: Section-aware chunking preserves business context that fixed-size misses
  4. Parallel queries save 30% latency: Run vector + BM25 concurrently (50ms vs 75ms sequential)
  5. Multi-tenancy is critical: PostgreSQL RLS ensures secure data isolation at database level

When to use Kynthar's approach:

  • You already run PostgreSQL and store <=100M vectors
  • You want $0 incremental infrastructure cost
  • You need database-level multi-tenant isolation (RLS)

When to use an external vector DB (Pinecone/Weaviate):

  • Corpus beyond ~100M vectors
  • You need distributed search across nodes or a fully managed service

Try Production RAG in Action

Process 25 documents free, then ask: "Show me all documents from [your vendor]". See hybrid search + RRF fusion in action.

Start Free Trial

No credit card required | 5-minute setup | Cancel anytime

Sources and References

  1. Internal Kynthar A/B Test Data. (2024). 3-week test with 5,000 queries comparing hybrid search (87% recall) vs pure vector (71%) vs BM25-only (65%). Measured on procurement document corpus.
  2. OpenAI Pricing. (2025). API Pricing - text-embedding-3-small: $0.02/1M tokens. GPT-4 context: $0.002/1K tokens input.
  3. Medium. (2024). "Postgres Vector Search with pgvector: Benchmarks, Costs, and Reality Check"
  4. Firecrawl. (2025). "Best Chunking Strategies for RAG in 2025"
  5. Microsoft Azure. (2025). "Hybrid search scoring (RRF)"
  6. Azure AI Search Blog. (2024). "Outperforming vector search with hybrid retrieval and reranking"
  7. Supabase. (2024). "pgvector 0.4.0 performance"
  8. TigerData. (2024). "Pgvector vs. Qdrant: Open-Source Vector Database Comparison"
  9. Unstructured. (2024). "Improving Retrieval in RAG with Reranking"
  10. Orkes. (2024). "Best Practices for Production-Scale RAG Systems"
  11. Assembled. (2024). "Better RAG results with Reciprocal Rank Fusion and Hybrid Search"

About this article: Technical architecture based on production Kynthar system processing 50,000+ documents/month across 100+ companies. Performance metrics measured over 3-month period (Oct-Dec 2024). A/B test data from controlled 3-week experiment with 5,000 queries. All cost estimates current as of January 2025.