Key Takeaways
- Hybrid search (vector + BM25) achieves 87% recall vs 71% vector-only
- pgvector on PostgreSQL costs $0 incremental vs $675/month for Pinecone
- Section-aware chunking preserves context that fixed-size misses
- Parallel query execution cuts latency by about a third (50ms vs 75ms)
- PostgreSQL RLS provides database-level multi-tenant security
What is RAG (Retrieval-Augmented Generation)?
RAG is an AI architecture that combines document retrieval with language model generation. For document search, RAG first retrieves relevant passages using hybrid search (combining vector embeddings with BM25 keyword matching), then uses an LLM to synthesize answers from those passages. This achieves 87% recall compared to 65% for keyword-only search, while grounding responses in your actual documents.
Why RAG for Document Search?
The business problem: Searching thousands of invoices, purchase orders, and contracts by natural language. When a CFO asks "Show me all shipments from Acme Corp delayed in Q4," traditional keyword search fails. With 39% of invoices containing errors [Industry Research 2024], finding and analyzing documents quickly is essential.
Why traditional search fails:
- Exact keyword matching: "Acme Corporation" != "Acme Corp" (zero results despite same vendor)
- No semantic understanding: "delayed shipments" won't match "delivery date > expected date"
- No temporal reasoning: "Q4" must translate to Oct 1 - Dec 31 date filters
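The temporal-reasoning gap is mechanical to close once the query is parsed. A minimal sketch (the helper name is illustrative, not from the production system) that translates a quarter reference like "Q4" into the explicit date range a SQL filter needs:

```python
import calendar
from datetime import date

def quarter_to_range(quarter: str, year: int) -> tuple[date, date]:
    """Translate 'Q4' into the explicit start/end dates a WHERE clause needs."""
    q = int(quarter.lstrip("Qq"))
    start_month = 3 * (q - 1) + 1            # Q1->1, Q2->4, Q3->7, Q4->10
    end_month = start_month + 2
    last_day = calendar.monthrange(year, end_month)[1]
    return date(year, start_month, 1), date(year, end_month, last_day)

start, end = quarter_to_range("Q4", 2024)    # 2024-10-01 .. 2024-12-31
```

The range then becomes a structured filter applied before or alongside retrieval, rather than a string the embedding model has to guess about.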
Why pure LLM context stuffing fails:
- Token limits: a 200K-token context window (Claude-class models) fits only ~400 invoices
- Cost explosion: $0.40 per query at 200K tokens (unsustainable at scale)
- Hallucination risk: LLMs fabricate data when source isn't in context
- Exception volume: AP departments spend 62% of their time handling exceptions [Ardent Partners 2024], making fast search critical
RAG as the middle ground: Retrieve relevant chunks (hybrid search), then generate responses with source citations. Cost: $0.05 per query vs $0.40 for full context [2].
Production metrics after 50K+ queries:
- 94% query success rate (user gets relevant results)
- 87% recall@10 (hybrid search) vs 71% (vector-only) vs 65% (BM25-only)
- P95 latency: 2.8s (retrieval + LLM generation)
Architecture Overview: The 30-Second Version
Here's the technical stack before the deep dive:
- PostgreSQL + pgvector as the vector store (not Pinecone/Weaviate)
- Hybrid search: Vector similarity (semantic) + BM25 full-text (keyword precision)
- Parallel execution: Vector and BM25 queries run concurrently, fused via RRF
- Multi-tenant isolation: Row-Level Security (RLS) for secure multi-company data
- Cost optimization: ~$0 incremental vs $675/month for Pinecone (10M vectors) [3]
Why this architecture? PostgreSQL + pgvector eliminates a separate vector DB, reduces infrastructure complexity, and costs $0 incremental (already paying for PostgreSQL). Hybrid search catches both semantic similarity ("urgent payment" = "expedited invoice") and exact matches (PO-68755). This matters because automated AP departments maintain exception rates below 5% vs. 20%+ for manual processes [Industry Research 2024].
Deep Dive: Document Chunking Strategy
The chunking challenge: Invoices and POs combine headers (vendor, PO number, date), line items (part numbers, quantities, prices), and terms. Naive fixed-size chunking breaks this context.
Our Chunking Approach
Target chunk size: ~600 tokens with 100-token overlap (based on embedding model limits)
Industry best practices recommend 400-800 tokens with 10-20% overlap [4]. We chose 600 tokens because:
- Invoice line items: Typically 5-15 lines @ 30-40 tokens each = 150-600 tokens
- Embedding model: OpenAI text-embedding-3-small optimized for <8191 tokens
- LLM context efficiency: 20 chunks x 600 tokens = 12K tokens (leaves room for system prompt)
Section-aware chunking: Preserve document structure. Exception invoices cost 3-5x more to process than standard invoices [Industry Research 2024], so getting retrieval right the first time reduces downstream costs.
Why section-aware chunking?
- Line items stay together (no split mid-row breaking part number + quantity context)
- Business context prefix improves retrieval ("PO 12345" helps match queries)
- Page tracking enables PDF citations ("See page 3 for line items")
Using fixed 512-token chunks splits tables mid-row. Result: "Show me quantity for part XYZ-123" retrieves chunk with part number but not quantity (in next chunk). Section-aware chunking prevents this.
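To make the contrast concrete, here is a minimal sketch of section-aware packing, assuming documents are already split into logical sections and tokens are estimated at ~4 characters each. The names and the heuristic are illustrative, not the production implementation:

```python
TARGET_TOKENS = 600

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token
    return max(1, len(text) // 4)

def chunk_sections(sections: list[str], context_prefix: str) -> list[str]:
    """Pack whole sections into chunks near the token budget; never split mid-row."""
    chunks: list[str] = []
    current: list[str] = []
    budget = TARGET_TOKENS - estimate_tokens(context_prefix)
    used = 0
    for section in sections:
        cost = estimate_tokens(section)
        if current and used + cost > budget:
            chunks.append(context_prefix + "\n".join(current))
            current, used = [], 0
        current.append(section)   # a section (e.g. a line-item row) stays whole
        used += cost
    if current:
        chunks.append(context_prefix + "\n".join(current))
    return chunks

sections = [
    "Header: Acme Corp, PO 12345, 2024-11-02",
    "Line 1: XYZ-123 qty 10 @ $4.00",
    "Terms: Net 30",
]
chunks = chunk_sections(sections, "PO 12345 | Acme Corp\n")
```

Because each emitted chunk carries the business-context prefix, a retrieved line-item chunk still identifies its vendor and PO.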
Deep Dive: Hybrid Search Implementation
The key insight: Vector search misses exact matches (part numbers, PO IDs, entity names). BM25 keyword search handles precision, vectors handle semantics.
Why Hybrid Search Outperforms Either Method Alone
Real-world example where vector search fails: a query for "PO-68755" embeds close to generic purchase-order language, so vector search returns semantically similar chunks while missing the exact document; BM25 matches the literal token immediately.
Reciprocal Rank Fusion (RRF) Algorithm
RRF combines ranked results without tuning hyperparameters [5]. Formula: score(d) = sum over result lists of 1 / (k + rank(d)), where k = 60 (the conventional default) and rank(d) is the document's 1-based position in each list.
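The formula fits in a few lines. A self-contained sketch (the document IDs are illustrative):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: each document scores 1/(k + rank) per list it
    appears in; contributions are summed and results sorted by fused score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits   = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
# doc_b (ranks 2 and 1) edges out doc_a (ranks 1 and 3) at the top
```

Note that no raw score from either retriever appears in the fusion, which is exactly why scale differences between cosine similarity and BM25 don't matter.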
Why RRF over weighted averaging?
- No hyperparameter tuning: k=60 works across query types [5]
- Score scale independence: Works when vector scores (0-1) and BM25 scores (0-100) differ
- Research-backed: Azure AI Search reports RRF outperforms weighted methods [6]
- Business impact: Best-in-Class exception rate is 9% vs. 22% for all others [Ardent Partners 2025]—better search helps achieve this
Parallel Query Execution
Running the vector and BM25 queries sequentially costs 75ms (40ms + 35ms). Running them in parallel takes ~50ms total, cutting latency by about a third.
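A sketch of the parallel fan-out with ThreadPoolExecutor. The two retrieval calls are stubbed here with sleeps matching the latencies quoted above so the wall-clock win is visible; real calls would hit pgvector and PostgreSQL full-text search:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def vector_search(query: str) -> list[str]:
    time.sleep(0.040)          # stand-in for a ~40ms pgvector HNSW query
    return ["doc_a", "doc_b"]

def bm25_search(query: str) -> list[str]:
    time.sleep(0.035)          # stand-in for a ~35ms GIN full-text query
    return ["doc_b", "doc_c"]

def hybrid_search(query: str) -> tuple[list[str], list[str]]:
    # Fan out both retrievals; total wall time ~= the slower of the two
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, query)
        bm25_future = pool.submit(bm25_search, query)
        return vec_future.result(), bm25_future.result()

start = time.perf_counter()
vec_hits, bm25_hits = hybrid_search("delayed shipments from Acme Corp")
elapsed = time.perf_counter() - start   # well under the 75ms sequential cost
```

Threads suffice here because both calls are I/O-bound database round trips; the GIL is released while waiting.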
Database indexing critical for performance:
- HNSW index: Approximate nearest neighbor (10x faster than exact) [7]
- GIN index: Generalized Inverted Index for full-text search (PostgreSQL native)
- Statement timeout: 2-second limit for production SLA
Production Considerations
Multi-Tenant Isolation with Row-Level Security
The challenge: Multiple companies, one database—zero tolerance for data leakage.
Solution: PostgreSQL Row-Level Security (RLS) + session variables. Even SQL injection can't cross tenant boundaries.
Why RLS over application-level filtering?
- Database-enforced security (not bypassable via code bug)
- Protection against SQL injection—critical since organizations lose 5% of revenue to fraud annually [ACFE 2024]
- Audit trail built into PostgreSQL logs
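A hedged sketch of the RLS setup; the table, column, and policy names are illustrative, not the production schema. The policy compares each row's tenant column against a session variable the application sets per request, so even injected raw SQL stays inside the tenant:

```python
# DDL run once at setup time (e.g. via a migration):
ENABLE_RLS = """
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
"""

TENANT_POLICY = """
CREATE POLICY tenant_isolation ON document_chunks
USING (company_id = current_setting('app.current_company_id')::uuid);
"""

# Per-request: the application sets the tenant before any query runs.
# With the policy in place, every SELECT/UPDATE/DELETE is filtered
# to rows matching the session's company_id.
SET_TENANT = "SET app.current_company_id = %s;"
```

Because the filter is evaluated by PostgreSQL itself, forgetting a `WHERE company_id = ...` in application code no longer leaks data across tenants.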
Cost Optimization: pgvector vs. External Vector DBs
| Vector DB | 10M Vectors + 1M Queries/Month | Notes |
|---|---|---|
| Pinecone | ~$675 | Managed, simple, expensive |
| Weaviate (self-hosted) | ~$200 + ops overhead | Open source, complex ops |
| pgvector (on existing RDS) | $0 incremental | Already paying for PostgreSQL |
Key insight: For <=100M vectors, pgvector on existing PostgreSQL is cost-optimal [3][8]. Beyond 100M vectors, or when you need distributed search, migrate to a specialized DB. AI-powered fraud detection is now integrated into 61% of AP systems [Industry Research 2025], making robust document search infrastructure essential.
Latency Budget Breakdown
pgvector performance at scale:
- 1M vectors: 40ms P95 query time (HNSW index)
- 10M vectors: 60-80ms P95 (still acceptable)
- 50M vectors: 471 QPS at 99% recall [8]
- >100M vectors: Consider Pinecone/Qdrant (distributed search needed)
Common RAG Pitfalls and How We Avoided Them
Pitfall 1: Poor Chunking Breaks Context
Symptom: Line items split across chunks, losing vendor context
Fix: Section-aware chunking with business context prefix (see chunking section)
Pitfall 2: Vector Search Misses Exact Matches
Symptom: Query "PO-68755" doesn't match "Purchase Order 68755"
Fix: Hybrid search. BM25 catches exact matches, vector handles fuzzy semantics. This is crucial since active monitoring reduces fraud detection time from 12 to 6 months [ACFE 2024].
Pitfall 3: No Relevance Filtering
Symptom: Weak matches pollute results, confuse LLM
Fix: Minimum RRF score threshold (0.005) filters noise
Pitfall 4: Slow Sequential Queries
Symptom: Vector + BM25 queries take 75ms sequentially
Fix: Parallel execution with ThreadPoolExecutor (50ms total)
Pitfall 5: Context Window Overflow
Symptom: Top 50 chunks exceed LLM context limit
Fix: Limit to top 20 chunks (~12K tokens), use reranker if needed (future work)
Performance Metrics and Results
Business Metrics
- Query success rate: 94% (queries return relevant results)
- Zero-result rate: 6% (acceptable for exploratory queries)
- User satisfaction: 4.2/5 (based on feedback)
Technical Metrics
- Embedding cost: $0.02 per 1M tokens (OpenAI text-embedding-3-small)
- Index size: 1.2M chunks, 2GB vector index, 800MB BM25 index
- Recall comparison: 87% (hybrid) vs 71% (vector-only) vs 65% (BM25-only)
A/B Test Results (3-week test, 5K queries):
- Hybrid vs. pure vector: +34% recall improvement
- RRF fusion vs. weighted average: +12% relevance (NDCG@3)
- Recency boost: +8% user satisfaction for time-sensitive queries
Future Enhancements
1. Reranker Layer
Cross-encoder model to re-score top-K results (10-20% accuracy gain). Example: Cohere Rerank API or open-source BGE reranker [9].
2. Query Expansion
Generate multiple query variations, especially for acronyms/abbreviations. Example: "PO" = ["Purchase Order", "PO", "P.O."]
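A minimal dictionary-based sketch of this idea, assuming a hand-curated abbreviation table; the table contents and normalizer are illustrative:

```python
# Hand-curated expansions for common procurement abbreviations
EXPANSIONS = {
    "po": ["PO", "P.O.", "Purchase Order"],
    "inv": ["INV", "Invoice"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus variants with abbreviations expanded."""
    variants = {query}
    for token in query.split():
        for form in EXPANSIONS.get(token.lower().rstrip("."), []):
            variants.add(query.replace(token, form))
    return sorted(variants)

variants = expand_query("PO 68755 from Acme")
# original plus "P.O. ..." and "Purchase Order ..." forms
```

Each variant can be issued as a separate retrieval and the results fused with the same RRF step used for hybrid search.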
3. Semantic Caching
Cache embeddings for common queries (reduce OpenAI costs by 40-60%).
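As a simpler starting point than true semantic (similarity-based) matching, an exact-match cache over normalized queries already removes repeat embedding calls. The `embed_uncached` stub below stands in for the real OpenAI client and is an assumption, not the production code:

```python
import hashlib

_cache: dict[str, list[float]] = {}
api_calls = 0

def embed_uncached(text: str) -> list[float]:
    """Stand-in for the real embedding API: deterministic fake vector."""
    global api_calls
    api_calls += 1
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def embed(query: str) -> list[float]:
    # Normalize case and whitespace so trivially different phrasings share a key
    key = " ".join(query.lower().split())
    if key not in _cache:
        _cache[key] = embed_uncached(key)
    return _cache[key]

embed("Show me delayed shipments")
embed("show me  delayed shipments")   # cache hit: normalizes to the same key
```

A full semantic cache would additionally match new queries against cached ones by embedding similarity above a threshold, trading a little precision for a higher hit rate.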
4. GraphRAG
Model relationships between documents (PO -> Invoice -> Receipt) for supply chain queries.
5. Long-Context Models
When GPT-5/Claude 4 support 1M+ tokens, revisit chunking strategy (may eliminate chunking for single-doc queries).
Conclusion and Key Takeaways
Production-grade RAG comes down to five lessons:
- Hybrid search is non-negotiable: Vector + BM25 outperforms either alone for structured documents (+34% recall in our tests)
- pgvector is underrated: For <=100M vectors, it's cost-optimal and performant ($0 vs $675/month)
- Chunking matters more than embeddings: Section-aware chunking preserves business context that fixed-size misses
- Parallel queries cut latency by a third: Run vector + BM25 concurrently (50ms vs 75ms sequential)
- Multi-tenancy is critical: PostgreSQL RLS ensures secure data isolation at database level
When to use Kynthar's approach:
- B2B SaaS with structured documents (invoices, contracts, reports)
- Multi-tenant architecture with shared database
- Already using PostgreSQL (minimize infrastructure)
- Need <2s query latency with citations
When to use external vector DB (Pinecone/Weaviate):
- >100M vectors or need horizontal scaling
- Real-time updates (high write throughput)
- Need advanced features (metadata filtering across multiple dimensions)
Try Production RAG in Action
Process 25 documents free, then ask: "Show me all documents from [your vendor]". See hybrid search + RRF fusion in action.
Start Free Trial. No credit card required | 5-minute setup | Cancel anytime
Sources and References
1. Internal Kynthar A/B Test Data. (2024). Three-week test with 5,000 queries comparing hybrid search (87% recall) vs. pure vector (71%) vs. BM25-only (65%), measured on a procurement document corpus.
2. OpenAI. (2025). "API Pricing." text-embedding-3-small: $0.02/1M tokens; GPT-4 input: $0.002/1K tokens.
3. Medium. (2024). "Postgres Vector Search with pgvector: Benchmarks, Costs, and Reality Check."
4. Firecrawl. (2025). "Best Chunking Strategies for RAG in 2025."
5. Microsoft Azure. (2025). "Hybrid search scoring (RRF)."
6. Azure AI Search Blog. (2024). "Outperforming vector search with hybrid retrieval and reranking."
7. Supabase. (2024). "pgvector 0.4.0 performance."
8. TigerData. (2024). "Pgvector vs. Qdrant: Open-Source Vector Database Comparison."
9. Unstructured. (2024). "Improving Retrieval in RAG with Reranking."
10. Orkes. (2024). "Best Practices for Production-Scale RAG Systems."
11. Assembled. (2024). "Better RAG results with Reciprocal Rank Fusion and Hybrid Search."
About this article: Technical architecture based on production Kynthar system processing 50,000+ documents/month across 100+ companies. Performance metrics measured over 3-month period (Oct-Dec 2024). A/B test data from controlled 3-week experiment with 5,000 queries. All cost estimates current as of January 2026.