Technical Deep Dive

Building Production RAG: How We Achieved 87% Recall with Hybrid Search + pgvector

When we A/B tested pure vector search vs. hybrid search (vector + keyword), hybrid won by 34% on recall for business documents. Here's how we built a production RAG system that costs $0 incremental on PostgreSQL + pgvector vs. $675/month for Pinecone.

By Josh Spadaro · 16 min read · January 2026

Key Takeaways

  • Hybrid search (vector + BM25) achieves 87% recall vs 71% vector-only
  • pgvector on PostgreSQL costs $0 incremental vs $675/month for Pinecone
  • Section-aware chunking preserves context that fixed-size misses
  • Parallel query execution saves 30% latency (50ms vs 75ms)
  • PostgreSQL RLS provides database-level multi-tenant security

What is RAG (Retrieval-Augmented Generation)?

RAG is an AI architecture that combines document retrieval with language model generation. For document search, RAG first retrieves relevant passages using hybrid search (combining vector embeddings with BM25 keyword matching), then uses an LLM to synthesize answers from those passages. This achieves 87% recall compared to 65% for keyword-only search, while grounding responses in your actual documents.

Why RAG for Document Search?

Quick answer: RAG (Retrieval-Augmented Generation) enables natural language search across business documents by combining vector similarity search with LLM generation. Traditional keyword search fails on semantic queries like "show delayed shipments from Acme" while pure LLM approaches hit token limits and cost $0.40/query. RAG retrieves relevant chunks first, then generates answers at $0.05/query.

The business problem: Searching thousands of invoices, purchase orders, and contracts by natural language. When a CFO asks "Show me all shipments from Acme Corp delayed in Q4," traditional keyword search fails. With 39% of invoices containing errors [Industry Research 2024], finding and analyzing documents quickly is essential.

Why traditional search fails:

  • Exact keyword matching: "Acme Corporation" != "Acme Corp" (zero results despite same vendor)
  • No semantic understanding: "delayed shipments" won't match "delivery date > expected date"
  • No temporal reasoning: "Q4" must translate to Oct 1 - Dec 31 date filters

Why pure LLM context stuffing fails:

  • Token limits: 200K context window = ~400 invoices max (Grok/GPT-4)
  • Cost explosion: $0.40 per query at 200K tokens (unsustainable at scale)
  • Hallucination risk: LLMs fabricate data when source isn't in context
  • Exception volume: AP departments spend 62% of their time handling exceptions [Ardent Partners 2024], making fast search critical

RAG as the middle ground: Retrieve relevant chunks (hybrid search), then generate responses with source citations. Cost: $0.05 per query vs $0.40 for full context [2].
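To make that pipeline concrete, here is a minimal sketch of the generation step. It is illustrative only: hybrid_search is defined later in this post, while llm_complete and the chunk fields are hypothetical stand-ins.

# Hypothetical sketch: ground the LLM in retrieved chunks and ask for citations
async def answer(query: str, company_id: str) -> str:
    chunks = await hybrid_search(query, company_id, limit=20)  # defined below
    context = "\n\n".join(
        f"[{i}] (page {c['page']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer using ONLY the numbered passages below. "
        "Cite passages like [1]; say so if the answer is not present.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return await llm_complete(prompt)  # hypothetical LLM wrapper (~1.8s streaming)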

Production metrics after 50K+ queries:

  • 94% query success rate (user gets relevant results)
  • 87% recall@10 (hybrid search) vs 71% (vector-only) vs 65% (BM25-only)
  • P95 latency: 2.8s (retrieval + LLM generation)

Architecture Overview: The 30-Second Version

Quick answer: Production RAG uses PostgreSQL with pgvector for vector storage ($0 incremental vs $675/month Pinecone), hybrid search combining vector similarity and BM25 full-text matching, parallel query execution for 30% latency reduction, and Row-Level Security for multi-tenant isolation. This achieves 87% recall at sub-2-second median latency.

Here's the technical stack before the deep dive:

  1. PostgreSQL + pgvector as the vector store (not Pinecone/Weaviate)
  2. Hybrid search: Vector similarity (semantic) + BM25 full-text (keyword precision)
  3. Parallel execution: Vector and BM25 queries run concurrently, fused via RRF
  4. Multi-tenant isolation: Row-Level Security (RLS) for secure multi-company data
  5. Cost optimization: ~$0 incremental vs $675/month for Pinecone (10M vectors) [3]
# Simplified hybrid search (Python pseudocode)
from concurrent.futures import ThreadPoolExecutor

async def hybrid_search(query: str, company_id: str, limit: int = 20):
    # Generate an embedding for the query
    query_embedding = await openai_embed(query)  # ~150ms

    # Run vector + BM25 queries in parallel (~50ms total vs ~75ms sequential)
    with ThreadPoolExecutor(max_workers=2) as executor:
        vector_future = executor.submit(
            vector_search, query_embedding, company_id, limit=50)
        bm25_future = executor.submit(
            fulltext_search, query, company_id, limit=50)
        vector_results = vector_future.result()  # ~40ms
        bm25_results = bm25_future.result()      # ~35ms

    # Reciprocal Rank Fusion (RRF) merges the two ranked lists
    fused_results = rrf_fusion(vector_results, bm25_results, k=60)

    # Apply recency boost (business logic)
    final_results = recency_boost(fused_results, days_threshold=30)
    return final_results[:limit]

Why this architecture? PostgreSQL + pgvector eliminates a separate vector DB, reduces infrastructure complexity, and costs $0 incremental (already paying for PostgreSQL). Hybrid search catches both semantic similarity ("urgent payment" = "expedited invoice") and exact matches (PO-68755). This matters because automated AP departments maintain exception rates below 5% vs. 20%+ for manual processes [Industry Research 2024].

Deep Dive: Document Chunking Strategy

Quick answer: Effective RAG chunking uses 600 tokens with 100-token overlap, section-aware splitting to keep invoice line items together, and business context prefixes (PO number, vendor name) to improve retrieval accuracy. Fixed-size chunking breaks tables mid-row, losing critical context between part numbers and quantities.

The chunking challenge: Invoices and POs have mixed structure: headers (vendor, PO number, date), line items (part numbers, quantities, prices), and terms. Naive fixed-size chunking breaks this context.

Our Chunking Approach

Target chunk size: ~600 tokens with 100-token overlap (based on embedding model limits)

Industry best practices recommend 400-800 tokens with 10-20% overlap [4]. We chose 600 tokens because:

  • Invoice line items: Typically 5-15 lines @ 30-40 tokens each = 150-600 tokens
  • Embedding model: OpenAI text-embedding-3-small accepts inputs up to 8,191 tokens
  • LLM context efficiency: 20 chunks x 600 tokens = 12K tokens (leaves room for system prompt)
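To make the 600/100 numbers concrete, here is a minimal sliding-window sketch. It assumes tiktoken's cl100k_base encoding (the tokenizer text-embedding-3-small uses); the helper name is ours, not part of the production pipeline.

import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def split_with_overlap(text: str, max_tokens: int = 600, overlap: int = 100) -> list[str]:
    """Split an oversized section into ~600-token windows with 100-token overlap."""
    tokens = _enc.encode(text)
    if len(tokens) <= max_tokens:
        return [text]
    step = max_tokens - overlap  # advance 500 tokens per window
    return [
        _enc.decode(tokens[i:i + max_tokens])  # the last window may be short
        for i in range(0, len(tokens), step)
    ]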

Section-aware chunking: Preserve document structure. Exception invoices cost 3-5x more to process than standard invoices [Industry Research 2024], so getting retrieval right the first time reduces downstream costs.

def chunk_invoice(text: str, metadata: dict) -> List[Chunk]:
    """
    Section-aware chunking for procurement documents.
    Preserves headers, line items, and terms as logical units.
    """
    sections = split_by_headers(text)  # Detect: header, line_items, terms
    chunks = []
    for section in sections:
        # Add a business-context prefix for better retrieval
        po_number = metadata.get('po_number', 'Unknown')
        vendor = metadata.get('vendor_name', 'Unknown')
        chunk_text = f"PO {po_number} | Vendor: {vendor}\n\n{section}"

        # Track page numbers for citations
        page_num = extract_page_number(section)

        chunks.append(Chunk(
            text=chunk_text,
            page=page_num,
            metadata=metadata
        ))
    return chunks

Why section-aware chunking?

  • Line items stay together (no split mid-row breaking part number + quantity context)
  • Business context prefix improves retrieval ("PO 12345" helps match queries)
  • Page tracking enables PDF citations ("See page 3 for line items")
Common mistake

Using fixed 512-token chunks splits tables mid-row. Result: "Show me quantity for part XYZ-123" retrieves chunk with part number but not quantity (in next chunk). Section-aware chunking prevents this.

Deep Dive: Hybrid Search Implementation

Quick answer: Hybrid search combines vector embeddings (semantic similarity) with BM25 keyword matching, achieving 87% recall versus 71% for vector-only and 65% for BM25-only. Vector search misses exact matches like "PO-68755" while BM25 misses synonyms. Reciprocal Rank Fusion (RRF) merges results without hyperparameter tuning.

The key insight: Vector search misses exact matches (part numbers, PO IDs, entity names). BM25 keyword search handles precision, vectors handle semantics.

Why Hybrid Search Outperforms Either Method Alone

  • Vector-only recall: 71%
  • BM25-only recall: 65%
  • Hybrid recall: 87% (+34%)

Real-world example where vector search fails:

Query: "Find PO-68755"# Vector search result:Chunks: ["Purchase Order 68755 from Acme Corp...", # MISS (cosine: 0.72)"PO acknowledgment for order #68755...", # MISS (cosine: 0.68)"Quote 68750 for industrial parts..." # False positive (cosine: 0.81)]# BM25 search result:Chunks: ["Purchase Order 68755 from Acme Corp...", # HIT (exact text match)"PO acknowledgment for order #68755...", # HIT (exact match)]# Hybrid (RRF fusion) result:Chunks: ["Purchase Order 68755 from Acme Corp...", # TOP (both methods rank high)"PO acknowledgment for order #68755...", # 2nd (both methods agree)"Quote 68750 for industrial parts..." # Filtered (low BM25 score)]

Reciprocal Rank Fusion (RRF) Algorithm

RRF combines ranked results without tuning hyperparameters [5]. Formula: score = 1 / (k + rank) where k=60 (default).

from collections import defaultdict

def rrf_fusion(vector_results, bm25_results, k=60):
    """
    Reciprocal Rank Fusion: combines rankings without score normalization.
    Advantage: no hyperparameter tuning, robust to score scale mismatch.
    """
    chunk_scores = defaultdict(float)

    # Score vector results (rank 1 = highest score)
    for rank, result in enumerate(vector_results, start=1):
        chunk_scores[result['chunk_id']] += 1.0 / (k + rank)

    # Score BM25 results (scores are additive across lists)
    for rank, result in enumerate(bm25_results, start=1):
        chunk_scores[result['chunk_id']] += 1.0 / (k + rank)

    # Sort by combined score, descending
    ranked_chunks = sorted(
        chunk_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )
    return ranked_chunks

Why RRF over weighted averaging?

  • No hyperparameter tuning: k=60 works across query types [5]
  • Score scale independence: Works when vector scores (0-1) and BM25 scores (0-100) differ
  • Research-backed: Azure AI Search reports RRF outperforms weighted methods [6]
  • Business impact: Best-in-Class exception rate is 9% vs. 22% for all others [Ardent Partners 2025]; better search helps achieve this

Parallel Query Execution

Running vector + BM25 queries sequentially costs 75ms (40ms + 35ms). Parallel execution: 50ms total (30% faster).

-- Vector search (HNSW index, approximate nearest neighbor)
SELECT chunk_id, chunk_text,
       embedding <=> %s AS distance      -- cosine distance operator
FROM document_chunks
WHERE company_id = %s
ORDER BY embedding <=> %s                -- %s = query_embedding
LIMIT 50;                                -- Over-fetch for RRF fusion

-- BM25-style full-text search (GIN index)
SELECT chunk_id, chunk_text,
       ts_rank_cd(tsvector, plainto_tsquery('english', %s)) AS rank
FROM document_chunks
WHERE company_id = %s
  AND tsvector @@ plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT 50;

Database indexing is critical for performance (supporting DDL sketched after this list):

  • HNSW index: Approximate nearest neighbor (10x faster than exact) [7]
  • GIN index: Generalized Inverted Index for full-text search (PostgreSQL native)
  • Statement timeout: 2-second limit for production SLA
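For reference, a minimal sketch of the supporting DDL, assuming pgvector 0.5+ (which introduced HNSW) and a maintained tsvector column; the index and role names here are illustrative.

-- HNSW index for approximate nearest neighbor search (cosine distance)
CREATE INDEX idx_chunks_embedding ON document_chunks
  USING hnsw (embedding vector_cosine_ops);

-- GIN index for PostgreSQL full-text search
CREATE INDEX idx_chunks_tsvector ON document_chunks
  USING gin (tsvector);

-- Enforce the 2-second production SLA for the application role
ALTER ROLE app_user SET statement_timeout = '2s';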

Production Considerations

Quick answer: Production RAG requires PostgreSQL Row-Level Security for multi-tenant data isolation, pgvector for cost-effective vector storage (handles 10M+ vectors at 60-80ms P95), and parallel query execution. Total query latency breaks down to 150ms embedding, 50ms search, and 1.8s LLM generation.

Multi-Tenant Isolation with Row-Level Security

The challenge: Multiple companies, one database, and zero tolerance for data leakage.

Solution: PostgreSQL Row-Level Security (RLS) + session variables. Even SQL injection can't cross tenant boundaries.

-- Enable RLS on the document_chunks table
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;

-- Create policy: users only see their company's data
CREATE POLICY company_isolation ON document_chunks
FOR ALL
USING (company_id = current_setting('app.current_company_id')::uuid);

# Python: set the RLS context per query (SET LOCAL scopes to the current transaction)
cur.execute("SET LOCAL app.current_company_id = %s", [company_id])

# All subsequent queries in this transaction auto-filter by company_id via the RLS policy
results = cur.execute("SELECT * FROM document_chunks WHERE ...")
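One operational detail worth flagging: SET LOCAL only lasts until the end of the current transaction, so the tenant context must be set inside the same transaction as the queries it guards. That is exactly the behavior you want behind a connection pooler, where a session-level setting could leak one tenant's context onto a reused connection.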

Why RLS over application-level filtering?

  • Database-enforced security (not bypassable via code bug)
  • Protection against SQL injection, critical since organizations lose 5% of revenue to fraud annually [ACFE 2024]
  • Audit trail built into PostgreSQL logs

Cost Optimization: pgvector vs. External Vector DBs

Vector DB                   | 10M Vectors + 1M Queries/Month | Notes
Pinecone                    | ~$675                          | Managed, simple, expensive
Weaviate (self-hosted)      | ~$200 + ops overhead           | Open source, complex ops
pgvector (on existing RDS)  | $0 incremental                 | Already paying for PostgreSQL

Key insight: For <=100M vectors, pgvector on existing PostgreSQL is cost-optimal [3][8]. Beyond 100M vectors, or when you need distributed search, migrate to a specialized DB. AI-powered fraud detection is now integrated into 61% of AP systems [Industry Research 2025], making robust document search infrastructure essential.

Latency Budget Breakdown

End-to-end latency target: <2s (P50 1.2s, P95 2.8s in production)

Breakdown:
|-- Embedding generation: ~150ms (OpenAI API call, cacheable)
|-- Vector query: ~40ms (HNSW index on 1M+ chunks)
|-- BM25 query: ~35ms (GIN index, parallel with vector)
|-- RRF fusion: ~5ms (in-memory sorting)
+-- LLM response: ~1.8s (Grok streaming)

Parallel execution saves ~25ms (vector + BM25 concurrent vs. sequential: 50ms vs 75ms).

pgvector performance at scale:

  • 1M vectors: 40ms P95 query time (HNSW index)
  • 10M vectors: 60-80ms P95 (still acceptable)
  • 50M vectors: 471 QPS at 99% recall [8]
  • >100M vectors: Consider Pinecone/Qdrant (distributed search needed)

Common RAG Pitfalls and How We Avoided Them

Quick answer: Five common RAG pitfalls: poor chunking splitting tables mid-row, vector search missing exact matches (use hybrid), weak matches polluting results (apply RRF threshold 0.005), sequential queries adding latency (use parallel execution), and context overflow (limit to 20 chunks at 12K tokens).

Pitfall 1: Poor Chunking Breaks Context

Symptom: Line items split across chunks, losing vendor context

Fix: Section-aware chunking with business context prefix (see chunking section)

Pitfall 2: Vector Search Misses Exact Matches

Symptom: Query "PO-68755" doesn't match "Purchase Order 68755"

Fix: Hybrid search. BM25 catches exact matches, vector handles fuzzy semantics. This is crucial since active monitoring reduces fraud detection time from 12 to 6 months [ACFE 2024].

Pitfall 3: No Relevance Filtering

Symptom: Weak matches pollute results, confuse LLM

Fix: Minimum RRF score threshold (0.005) filters noise

# Apply a relevance threshold before returning results
MIN_RRF_SCORE = 0.005

final_results = [
    chunk for chunk, score in ranked_chunks
    if score >= MIN_RRF_SCORE
][:limit]

Pitfall 4: Slow Sequential Queries

Symptom: Vector + BM25 queries take 75ms sequentially

Fix: Parallel execution with ThreadPoolExecutor (50ms total)

Pitfall 5: Context Window Overflow

Symptom: Top 50 chunks exceed LLM context limit

Fix: Limit to top 20 chunks (~12K tokens), use reranker if needed (future work)

Performance Metrics and Results

Quick answer: Production metrics from 50K+ queries show 94% query success rate, 87% recall@10 with hybrid search, P50 latency of 1.2s and P95 of 2.8s. A/B testing confirmed hybrid search outperforms pure vector by 34% on recall. Embedding cost is $0.02 per 1M tokens using OpenAI text-embedding-3-small.

Business Metrics

  • Query success rate: 94% (queries return relevant results)
  • Zero-result rate: 6% (acceptable for exploratory queries)
  • User satisfaction: 4.2/5 (based on feedback)

Technical Metrics

  • P50 latency: 1.2s
  • P95 latency: 2.8s
  • Recall@10 (hybrid): 87%
  • Embedding cost: $0.02 per 1M tokens (OpenAI text-embedding-3-small)
  • Index size: 1.2M chunks, 2GB vector index, 800MB BM25 index
  • Recall comparison: 87% (hybrid) vs 71% (vector-only) vs 65% (BM25-only)

A/B Test Results (3-week test, 5K queries):

  • Hybrid vs. pure vector: +34% recall improvement
  • RRF fusion vs. weighted average: +12% relevance (NDCG@3)
  • Recency boost: +8% user satisfaction for time-sensitive queries

Future Enhancements

Quick answer: Future RAG improvements include cross-encoder rerankers for 10-20% accuracy gain, query expansion for acronyms and abbreviations, semantic caching to reduce embedding costs by 40-60%, GraphRAG for document relationships (PO to Invoice to Receipt), and long-context models that may eliminate chunking for single-doc queries.

1. Reranker Layer

Cross-encoder model to re-score top-K results (10-20% accuracy gain). Example: Cohere Rerank API or open-source BGE reranker [9].
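As a sketch of what this layer could look like with the open-source option, here is a minimal cross-encoder pass using sentence-transformers; the model choice and wiring are illustrative, not what runs in production.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # scores (query, passage) pairs jointly

def rerank(query: str, chunks: list[dict], top_k: int = 20) -> list[dict]:
    pairs = [(query, c["text"]) for c in chunks]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]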

2. Query Expansion

Generate multiple query variations, especially for acronyms/abbreviations. Example: "PO" = ["Purchase Order", "PO", "P.O."]
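A minimal sketch of how this could work, with a hand-maintained acronym map (the map and helper are hypothetical): generate variants, run hybrid search on each, then fuse the result lists with the same RRF used above.

ACRONYMS = {"po": ["purchase order", "p.o."], "ap": ["accounts payable"]}

def expand_query(query: str) -> list[str]:
    """Return the original query plus variants with acronyms expanded."""
    variants = [query]
    for word in query.split():
        for expansion in ACRONYMS.get(word.lower().strip(".,"), []):
            variants.append(query.replace(word, expansion))
    return variants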

3. Semantic Caching

Cache embeddings for common queries (reduce OpenAI costs by 40-60%).
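A minimal sketch of the exact-match version, with an in-process dict standing in for a real cache like Redis; a true semantic cache would also match near-duplicate queries by embedding similarity.

import hashlib

_embedding_cache: dict[str, list[float]] = {}  # swap for Redis in production

async def cached_embed(query: str) -> list[float]:
    """Return a cached embedding for repeated queries, keyed by normalized text."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = await openai_embed(query)  # same helper as above
    return _embedding_cache[key]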

4. GraphRAG

Model relationships between documents (PO -> Invoice -> Receipt) for supply chain queries.
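One plausible starting point is a simple edge table in the same PostgreSQL instance; the table and column names here are hypothetical.

-- Hypothetical edge table linking related procurement documents
CREATE TABLE document_links (
    src_document_id uuid NOT NULL,
    dst_document_id uuid NOT NULL,
    relation        text NOT NULL,  -- e.g. 'po_to_invoice', 'invoice_to_receipt'
    PRIMARY KEY (src_document_id, dst_document_id, relation)
);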

5. Long-Context Models

When GPT-5/Claude 4 support 1M+ tokens, revisit chunking strategy (may eliminate chunking for single-doc queries).

Conclusion and Key Takeaways

Quick answer: Production RAG requires hybrid search (non-negotiable for structured documents), pgvector on PostgreSQL for cost efficiency, section-aware chunking, parallel query execution, and RLS for multi-tenancy. Use this approach for B2B SaaS with structured documents and existing PostgreSQL. Consider Pinecone/Weaviate only above 100M vectors.

Production-grade RAG requires:

  1. Hybrid search is non-negotiable: Vector + BM25 outperforms either alone for structured documents (+34% recall in our tests)
  2. pgvector is underrated: For <=100M vectors, it's cost-optimal and performant ($0 vs $675/month)
  3. Chunking matters more than embeddings: Section-aware chunking preserves business context that fixed-size misses
  4. Parallel queries save 30% latency: Run vector + BM25 concurrently (50ms vs 75ms sequential)
  5. Multi-tenancy is critical: PostgreSQL RLS ensures secure data isolation at database level

When to use Kynthar's approach:

  • B2B SaaS with structured documents (invoices, contracts, reports)
  • Multi-tenant architecture with shared database
  • Already using PostgreSQL (minimize infrastructure)
  • Need <2s query latency with citations

When to use external vector DB (Pinecone/Weaviate):

  • >100M vectors or need horizontal scaling
  • Real-time updates (high write throughput)
  • Need advanced features (metadata filtering across multiple dimensions)

Try Production RAG in Action

Process 100 documents free, then ask: "Show me all documents from [your vendor]". See hybrid search + RRF fusion in action.


Sources and References

  1. Internal Kynthar A/B Test Data. (2024). 3-week test with 5,000 queries comparing hybrid search (87% recall) vs pure vector (71%) vs BM25-only (65%). Measured on procurement document corpus.
  2. OpenAI Pricing. (2025). API Pricing - text-embedding-3-small: $0.02/1M tokens. GPT-4 context: $0.002/1K tokens input.
  3. Medium. (2024). "Postgres Vector Search with pgvector: Benchmarks, Costs, and Reality Check"
  4. Firecrawl. (2025). "Best Chunking Strategies for RAG in 2025"
  5. Microsoft Azure. (2025). "Hybrid search scoring (RRF)"
  6. Azure AI Search Blog. (2024). "Outperforming vector search with hybrid retrieval and reranking"
  7. Supabase. (2024). "pgvector 0.4.0 performance"
  8. TigerData. (2024). "Pgvector vs. Qdrant: Open-Source Vector Database Comparison"
  9. Unstructured. (2024). "Improving Retrieval in RAG with Reranking"
  10. Orkes. (2024). "Best Practices for Production-Scale RAG Systems"
  11. Assembled. (2024). "Better RAG results with Reciprocal Rank Fusion and Hybrid Search"

About this article: Technical architecture based on production Kynthar system processing 50,000+ documents/month across 100+ companies. Performance metrics measured over 3-month period (Oct-Dec 2024). A/B test data from controlled 3-week experiment with 5,000 queries. All cost estimates current as of January 2026.
