Key Takeaways
- Hybrid search (vector + BM25) achieves 87% recall vs 71% vector-only
- pgvector on PostgreSQL costs $0 incremental vs $675/month for Pinecone
- Section-aware chunking preserves context that fixed-size misses
- Parallel query execution cuts latency by ~33% (50ms vs 75ms)
- PostgreSQL RLS provides database-level multi-tenant security
Why RAG for Document Search?
The business problem: Searching thousands of invoices, purchase orders, and contracts with natural language. When a CFO asks "Show me all shipments from Acme Corp delayed in Q4," traditional keyword search fails.
Why traditional search fails:
- Exact keyword matching: "Acme Corporation" != "Acme Corp" (zero results despite same vendor)
- No semantic understanding: "delayed shipments" won't match "delivery date > expected date"
- No temporal reasoning: "Q4" must translate to Oct 1 - Dec 31 date filters
Why pure LLM context stuffing fails:
- Token limits: a 200K-token context window fits only ~400 invoices at ~500 tokens each (Grok/GPT-4)
- Cost explosion: ~$0.40 per query at 200K tokens (roughly $2 per 1M input tokens [2]), unsustainable at scale
- Hallucination risk: LLMs fabricate data when source isn't in context
RAG as the middle ground: Retrieve relevant chunks (hybrid search), then generate responses with source citations. Cost: $0.05 per query vs $0.40 for full context [2].
Production metrics after 50K+ queries:
- 94% query success rate (user gets relevant results)
- 87% recall@10 (hybrid search) vs 71% (vector-only) vs 65% (BM25-only)
- P95 latency: 2.8s (retrieval + LLM generation)
Architecture Overview: The 30-Second Version
Here's the technical stack before the deep dive:
- PostgreSQL + pgvector as the vector store (not Pinecone/Weaviate)
- Hybrid search: Vector similarity (semantic) + BM25 full-text (keyword precision)
- Parallel execution: Vector and BM25 queries run concurrently, fused via RRF
- Multi-tenant isolation: Row-Level Security (RLS) for secure multi-company data
- Cost optimization: ~$0 incremental vs $675/month for Pinecone (10M vectors) [3]
Why this architecture? PostgreSQL + pgvector eliminates a separate vector DB, reduces infrastructure complexity, and costs $0 incremental (already paying for PostgreSQL). Hybrid search catches both semantic similarity ("urgent payment" = "expedited invoice") and exact matches (PO-68755).
Deep Dive: Document Chunking Strategy
The chunking challenge: Invoices and POs have mixed structure: headers (vendor, PO number, date), line items (part numbers, quantities, prices), and terms. Naive fixed-size chunking breaks this context.
Our Chunking Approach
Target chunk size: ~600 tokens with 100-token overlap (a ~17% overlap)
Industry best practices recommend 400-800 tokens with 10-20% overlap [4]. We chose 600 tokens because:
- Invoice line items: Typically 5-15 lines @ 30-40 tokens each = 150-600 tokens
- Embedding model: OpenAI text-embedding-3-small caps input at 8,191 tokens, so 600-token chunks leave ample headroom
- LLM context efficiency: 20 chunks x 600 tokens = 12K tokens (leaves room for system prompt)
Section-aware chunking: Preserve document structure
Why section-aware chunking?
- Line items stay together (no split mid-row breaking part number + quantity context)
- Business context prefix improves retrieval ("PO 12345" helps match queries)
- Page tracking enables PDF citations ("See page 3 for line items")
Fixed 512-token chunks split tables mid-row. Result: "Show me the quantity for part XYZ-123" retrieves the chunk containing the part number but not the quantity, which lands in the next chunk. Section-aware chunking prevents this, as sketched below.
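A minimal sketch of this approach in Python, assuming blank lines stand in for real section boundaries and whitespace-split words approximate token counts (a production version would parse document layout and count tokens with the embedding model's tokenizer):

```python
import re

def chunk_document(text: str, meta: dict, max_tokens: int = 600, overlap: int = 100) -> list[str]:
    """Split on section boundaries first, then window oversized sections
    to ~600 'tokens' with 100-token overlap so line items stay together."""
    # Blank lines as a stand-in for real section detection (headers,
    # line-item tables, payment terms).
    sections = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]
    # Business-context prefix so queries like "PO 12345" match every chunk.
    prefix = f"{meta.get('doc_type', 'Document')} {meta.get('number', '')} from {meta.get('vendor', '')}: "
    chunks = []
    for section in sections:
        words = section.split()
        step = max_tokens - overlap  # 500-word stride -> 100-word overlap
        for start in range(0, len(words), step):
            chunks.append(prefix + " ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break  # last window already covers the section tail
    return chunks
```

Because every chunk carries the vendor and document-number prefix, retrieval can match header context even when a chunk contains only line items.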
Deep Dive: Hybrid Search Implementation
The key insight: Vector search misses exact matches (part numbers, PO IDs, entity names). BM25 keyword search handles precision, vectors handle semantics.
Why Hybrid Search Outperforms Either Method Alone
Real-world example where vector search fails: a user searches "PO-68755". The embedding treats the identifier as a low-information token, so vector search returns semantically similar purchase orders with the wrong number, while BM25 matches the literal string exactly.
Reciprocal Rank Fusion (RRF) Algorithm
RRF combines ranked result lists without tuned hyperparameters [5]. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in, with k=60 as the standard default.
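A minimal fusion sketch over ranked lists of chunk IDs; the 0.005 floor mirrors the relevance filter described under Pitfall 3 below:

```python
def rrf_fuse(*ranked_lists: list[str], k: int = 60, min_score: float = 0.005) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d).
    Operates on ranks only, so the 0-1 vector scores and 0-100 BM25 scores
    never need to share a scale."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Drop weak matches so they can't pollute the LLM context.
    fused = [(d, s) for d, s in scores.items() if s >= min_score]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)
```

With k=60, a chunk ranked first in one list scores 1/61 ≈ 0.016, while a chunk appearing only once beyond rank ~140 falls under the 0.005 floor and is dropped.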
Why RRF over weighted averaging?
- No hyperparameter tuning: k=60 works across query types [5]
- Score scale independence: Works when vector scores (0-1) and BM25 scores (0-100) differ
- Research-backed: Azure AI Search reports RRF outperforms weighted methods [6]
Parallel Query Execution
Running the vector and BM25 queries sequentially costs 75ms (40ms + 35ms). Running them in parallel cuts the total to 50ms, roughly 33% faster.
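A sketch of the pattern with Python's ThreadPoolExecutor (the approach named under Pitfall 4), reusing the rrf_fuse sketch above; vector_search and bm25_search stand for whatever callables wrap your two SQL queries:

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_retrieve(query: str, vector_search, bm25_search, top_k: int = 20):
    """Fire both retrievals concurrently; wall-clock cost is close to the
    slower of the two (~40ms) plus fusion, instead of their 75ms sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, query)   # ~40ms against HNSW index
        bm25_future = pool.submit(bm25_search, query)    # ~35ms against GIN index
        vec_hits, bm25_hits = vec_future.result(), bm25_future.result()
    # Each callable should hold its own DB connection: a single PostgreSQL
    # session cannot run two queries concurrently.
    return rrf_fuse(vec_hits, bm25_hits)[:top_k]
```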
Database indexing is critical for performance:
- HNSW index: Approximate nearest neighbor (10x faster than exact) [7]
- GIN index: Generalized Inverted Index for full-text search (PostgreSQL native)
- Statement timeout: 2-second limit for production SLA
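The corresponding DDL, sketched via psycopg2 with illustrative table and column names (chunks, embedding, content):

```python
import psycopg2

# Illustrative schema; adapt names to your tables.
DDL = """
-- HNSW: graph-based approximate nearest neighbor over pgvector embeddings
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
-- GIN over a tsvector expression: PostgreSQL-native full-text search
CREATE INDEX IF NOT EXISTS chunks_content_fts
    ON chunks USING gin (to_tsvector('english', content));
"""

with psycopg2.connect("postgresql://localhost/ragdb") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # index builds can take minutes on millions of rows
        # Query sessions (not this DDL session) get the SLA guard:
        # SET statement_timeout = '2s';
```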
Production Considerations
Multi-Tenant Isolation with Row-Level Security
The challenge: Multiple companies share one database, with zero tolerance for data leakage.
Solution: PostgreSQL Row-Level Security (RLS) plus session variables. Even SQL injected through application code stays confined to the current tenant's rows.
Why RLS over application-level filtering?
- Database-enforced security (not bypassable via code bug)
- Protection against SQL injection
- Audit trail built into PostgreSQL logs
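A sketch of the policy and the session-variable handoff, assuming a chunks table with a company_id column and a custom app.current_company setting; all names are illustrative, and note that RLS does not constrain superusers or roles with BYPASSRLS:

```python
import psycopg2

RLS_SETUP = """
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
    USING (company_id = current_setting('app.current_company')::uuid);
"""

def query_as_tenant(conn, company_id: str, sql: str, params=()):
    """Every statement on this connection sees only one tenant's rows."""
    with conn.cursor() as cur:
        # set_config(..., is_local=true) scopes the setting to this
        # transaction, so a pooled connection can't leak it to the next tenant.
        cur.execute("SELECT set_config('app.current_company', %s, true)", (company_id,))
        cur.execute(sql, params)  # even injected SQL is filtered by the policy
        return cur.fetchall()
```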
Cost Optimization: pgvector vs. External Vector DBs
| Vector DB | 10M Vectors + 1M Queries/Month | Notes |
|---|---|---|
| Pinecone | ~$675 | Managed, simple, expensive |
| Weaviate (self-hosted) | ~$200 + ops overhead | Open source, complex ops |
| pgvector (on existing RDS) | $0 incremental | Already paying for PostgreSQL |
Key insight: up to ~100M vectors, pgvector on existing PostgreSQL is cost-optimal [3][8]. Beyond that scale, or when you need distributed search, migrate to a specialized vector DB.
Latency Budget Breakdown
pgvector performance at scale:
- 1M vectors: 40ms P95 query time (HNSW index)
- 10M vectors: 60-80ms P95 (still acceptable)
- 50M vectors: 471 QPS at 99% recall [8]
- >100M vectors: Consider Pinecone/Qdrant (distributed search needed)
Common RAG Pitfalls and How We Avoided Them
Pitfall 1: Poor Chunking Breaks Context
Symptom: Line items split across chunks, losing vendor context
Fix: Section-aware chunking with business context prefix (see chunking section)
Pitfall 2: Vector Search Misses Exact Matches
Symptom: Query "PO-68755" doesn't match "Purchase Order 68755"
Fix: Hybrid search. BM25 catches exact matches, vector handles fuzzy semantics
Pitfall 3: No Relevance Filtering
Symptom: Weak matches pollute results, confuse LLM
Fix: Minimum RRF score threshold (0.005) filters noise
Pitfall 4: Slow Sequential Queries
Symptom: Vector + BM25 queries take 75ms sequentially
Fix: Parallel execution with ThreadPoolExecutor (50ms total)
Pitfall 5: Context Window Overflow
Symptom: Top 50 chunks exceed LLM context limit
Fix: Limit to top 20 chunks (~12K tokens), use reranker if needed (future work)
Performance Metrics and Results
Business Metrics
- Query success rate: 94% (queries return relevant results)
- Zero-result rate: 6% (acceptable for exploratory queries)
- User satisfaction: 4.2/5 (based on feedback)
Technical Metrics
- Embedding cost: $0.02 per 1M tokens (OpenAI text-embedding-3-small)
- Index size: 1.2M chunks, 2GB vector index, 800MB BM25 index
- Recall comparison: 87% (hybrid) vs 71% (vector-only) vs 65% (BM25-only)
A/B Test Results (3-week test, 5K queries):
- Hybrid vs. pure vector: +34% recall improvement
- RRF fusion vs. weighted average: +12% relevance (NDCG@3)
- Recency boost: +8% user satisfaction for time-sensitive queries
Future Enhancements
1. Reranker Layer
Cross-encoder model to re-score top-K results (10-20% accuracy gain). Example: Cohere Rerank API or open-source BGE reranker [9].
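A sketch of such a layer using the open-source BGE reranker through the sentence-transformers library (model choice and parameters are illustrative, not a production configuration):

```python
from sentence_transformers import CrossEncoder

# BAAI/bge-reranker-base: open-source cross-encoder for relevance scoring.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: list[str], top_k: int = 20) -> list[str]:
    """Re-score retrieved chunks against the query and keep the best."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```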
2. Query Expansion
Generate multiple query variations, especially for acronyms/abbreviations. Example: "PO" = ["Purchase Order", "PO", "P.O."]
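A minimal sketch with a hand-maintained expansion table (hypothetical; variants could instead come from an LLM or a domain glossary). Each variant can be retrieved independently and fused with the same RRF routine:

```python
# Hypothetical expansion table; extend per domain.
EXPANSIONS = {"PO": ["Purchase Order", "P.O."], "INV": ["Invoice"]}

def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per known expansion."""
    variants = {query}
    for token in query.split():
        for expansion in EXPANSIONS.get(token.upper().strip(".,"), []):
            variants.add(query.replace(token, expansion))
    return sorted(variants)

# expand_query("PO 68755") -> ["P.O. 68755", "PO 68755", "Purchase Order 68755"]
```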
3. Semantic Caching
Cache embeddings for common queries (reduce OpenAI costs by 40-60%).
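A minimal in-process sketch of the exact-match half, using the OpenAI Python client; a full semantic cache would also serve near-duplicate queries by embedding similarity, typically backed by Redis or similar:

```python
import functools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@functools.lru_cache(maxsize=10_000)
def cached_embedding(text: str) -> tuple[float, ...]:
    """Repeated queries skip the embeddings API entirely; savings scale
    with the share of queries that are exact repeats."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return tuple(resp.data[0].embedding)
```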
4. GraphRAG
Model relationships between documents (PO -> Invoice -> Receipt) for supply chain queries.
5. Long-Context Models
When GPT-5/Claude 4 support 1M+ tokens, revisit chunking strategy (may eliminate chunking for single-doc queries).
Conclusion and Key Takeaways
Production-grade RAG requires:
- Hybrid search is non-negotiable: Vector + BM25 outperforms either alone for structured documents (+34% recall in our tests)
- pgvector is underrated: For <=100M vectors, it's cost-optimal and performant ($0 vs $675/month)
- Chunking matters more than embeddings: Section-aware chunking preserves business context that fixed-size misses
- Parallel queries cut latency ~33%: Run vector + BM25 concurrently (50ms vs 75ms sequential)
- Multi-tenancy is critical: PostgreSQL RLS ensures secure data isolation at database level
When to use Kynthar's approach:
- B2B SaaS with structured documents (invoices, contracts, reports)
- Multi-tenant architecture with shared database
- Already using PostgreSQL (minimize infrastructure)
- Need <2s query latency with citations
When to use external vector DB (Pinecone/Weaviate):
- >100M vectors or need horizontal scaling
- Real-time updates (high write throughput)
- Need advanced features (metadata filtering across multiple dimensions)
Try Production RAG in Action
Process 25 documents free, then ask: "Show me all documents from [your vendor]". See hybrid search + RRF fusion in action.
Start Free Trial. No credit card required | 5-minute setup | Cancel anytime
Sources and References
1. Internal Kynthar A/B Test Data. (2024). Three-week test with 5,000 queries comparing hybrid search (87% recall) vs. pure vector (71%) vs. BM25-only (65%), measured on a procurement document corpus.
2. OpenAI. (2025). "API Pricing." text-embedding-3-small: $0.02/1M tokens; GPT-4 context: $0.002/1K input tokens.
3. Medium. (2024). "Postgres Vector Search with pgvector: Benchmarks, Costs, and Reality Check."
4. Firecrawl. (2025). "Best Chunking Strategies for RAG in 2025."
5. Microsoft Azure. (2025). "Hybrid search scoring (RRF)."
6. Azure AI Search Blog. (2024). "Outperforming vector search with hybrid retrieval and reranking."
7. Supabase. (2024). "pgvector 0.4.0 performance."
8. TigerData. (2024). "Pgvector vs. Qdrant: Open-Source Vector Database Comparison."
9. Unstructured. (2024). "Improving Retrieval in RAG with Reranking."
10. Orkes. (2024). "Best Practices for Production-Scale RAG Systems."
11. Assembled. (2024). "Better RAG results with Reciprocal Rank Fusion and Hybrid Search."
About this article: Technical architecture based on production Kynthar system processing 50,000+ documents/month across 100+ companies. Performance metrics measured over 3-month period (Oct-Dec 2024). A/B test data from controlled 3-week experiment with 5,000 queries. All cost estimates current as of January 2025.