Key Takeaways
- Hybrid search (vector + BM25) achieves 87% recall vs 71% vector-only
- pgvector on PostgreSQL costs $0 incremental vs $675/month for Pinecone
- Section-aware chunking preserves context that fixed-size misses
- Parallel query execution cuts latency by ~33% (50ms vs 75ms)
- PostgreSQL RLS provides database-level multi-tenant security
Why RAG for Document Search?
The business problem: Searching thousands of invoices, purchase orders, and contracts with natural language. When a CFO asks "Show me all shipments from Acme Corp delayed in Q4," traditional keyword search fails.
Why traditional search fails:
- Exact keyword matching: "Acme Corporation" != "Acme Corp" (zero results despite same vendor)
- No semantic understanding: "delayed shipments" won't match "delivery date > expected date"
- No temporal reasoning: "Q4" must translate to Oct 1 - Dec 31 date filters
Why pure LLM context stuffing fails:
- Token limits: a 200K-token context window fits only ~400 invoices at ~500 tokens each (Grok/GPT-4)
- Cost explosion: ~$0.40 per query at 200K tokens (roughly $2 per 1M input tokens [2]), unsustainable at scale
- Hallucination risk: LLMs fabricate data when source isn't in context
RAG as the middle ground: Retrieve relevant chunks (hybrid search), then generate responses with source citations. Cost: $0.05 per query vs $0.40 for full context [2].
Production metrics after 50K+ queries:
- 94% query success rate (user gets relevant results)
- 87% recall@10 (hybrid search) vs 71% (vector-only) vs 65% (BM25-only)
- P95 latency: 2.8s (retrieval + LLM generation)
Architecture Overview: The 30-Second Version
Here's the technical stack before the deep dive:
- PostgreSQL + pgvector as the vector store (not Pinecone/Weaviate)
- Hybrid search: Vector similarity (semantic) + BM25 full-text (keyword precision)
- Parallel execution: Vector and BM25 queries run concurrently, fused via RRF
- Multi-tenant isolation: Row-Level Security (RLS) for secure multi-company data
- Cost optimization: ~$0 incremental vs $675/month for Pinecone (10M vectors) [3]
Why this architecture? PostgreSQL + pgvector eliminates a separate vector DB, reduces infrastructure complexity, and costs $0 incremental (already paying for PostgreSQL). Hybrid search catches both semantic similarity ("urgent payment" = "expedited invoice") and exact matches (PO-68755).
Deep Dive: Document Chunking Strategy
The chunking challenge: Invoices and POs have mixed structure: headers (vendor, PO number, date), line items (part numbers, quantities, prices), and terms. Naive fixed-size chunking breaks this context.
Our Chunking Approach
Target chunk size: ~600 tokens with 100-token overlap (a ~17% overlap)
Industry best practices recommend 400-800 tokens with 10-20% overlap [4]. We chose 600 tokens because:
- Invoice line items: Typically 5-15 lines @ 30-40 tokens each = 150-600 tokens
- Embedding model: OpenAI text-embedding-3-small caps input at 8,191 tokens, so 600-token chunks leave ample headroom
- LLM context efficiency: 20 chunks x 600 tokens = 12K tokens (leaves room for system prompt)
Section-aware chunking: Preserve document structure
Why section-aware chunking?
- Line items stay together (no split mid-row breaking part number + quantity context)
- Business context prefix improves retrieval ("PO 12345" helps match queries)
- Page tracking enables PDF citations ("See page 3 for line items")
Fixed 512-token chunks split tables mid-row. Result: "Show me the quantity for part XYZ-123" retrieves the chunk containing the part number but not the quantity, which lands in the next chunk. Section-aware chunking prevents this, as sketched below.
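A minimal sketch of this approach in Python, assuming blank lines stand in for real section boundaries and whitespace-split words approximate token counts (a production version would parse document layout and count tokens with the embedding model's tokenizer):

```python
import re

def chunk_document(text: str, meta: dict, max_tokens: int = 600, overlap: int = 100) -> list[str]:
    """Split on section boundaries first, then window oversized sections
    to ~600 'tokens' with 100-token overlap so line items stay together."""
    # Blank lines as a stand-in for real section detection (headers,
    # line-item tables, payment terms).
    sections = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]
    # Business-context prefix so queries like "PO 12345" match every chunk.
    prefix = f"{meta.get('doc_type', 'Document')} {meta.get('number', '')} from {meta.get('vendor', '')}: "
    chunks = []
    for section in sections:
        words = section.split()
        step = max_tokens - overlap  # 500-word stride -> 100-word overlap
        for start in range(0, len(words), step):
            chunks.append(prefix + " ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break  # last window already covers the section tail
    return chunks
```

Because every chunk carries the vendor and document-number prefix, retrieval can match header context even when a chunk contains only line items.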
Deep Dive: Hybrid Search Implementation
The key insight: Vector search misses exact matches (part numbers, PO IDs, entity names). BM25 keyword search handles precision, vectors handle semantics.
Why Hybrid Search Outperforms Either Method Alone
Real-world example where vector search fails: a user searches "PO-68755". The embedding treats the identifier as a low-information token, so vector search returns semantically similar purchase orders with the wrong number, while BM25 matches the literal string exactly.
Reciprocal Rank Fusion (RRF) Algorithm
RRF combines ranked result lists without tuned hyperparameters [5]. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in, with k=60 as the standard default.
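A minimal fusion sketch over ranked lists of chunk IDs; the 0.005 floor mirrors the relevance filter described under Pitfall 3 below:

```python
def rrf_fuse(*ranked_lists: list[str], k: int = 60, min_score: float = 0.005) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d).
    Operates on ranks only, so the 0-1 vector scores and 0-100 BM25 scores
    never need to share a scale."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Drop weak matches so they can't pollute the LLM context.
    fused = [(d, s) for d, s in scores.items() if s >= min_score]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)
```

With k=60, a chunk ranked first in one list scores 1/61 ≈ 0.016, while a chunk appearing only once beyond rank ~140 falls under the 0.005 floor and is dropped.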
Why RRF over weighted averaging?
- No hyperparameter tuning: k=60 works across query types [5]
- Score scale independence: Works when vector scores (0-1) and BM25 scores (0-100) differ
- Research-backed: Azure AI Search reports RRF outperforms weighted methods [6]
Parallel Query Execution
Running the vector and BM25 queries sequentially costs 75ms (40ms + 35ms). Running them in parallel cuts the total to 50ms, roughly 33% faster.
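A sketch of the pattern with Python's ThreadPoolExecutor (the approach named under Pitfall 4), reusing the rrf_fuse sketch above; vector_search and bm25_search stand for whatever callables wrap your two SQL queries:

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_retrieve(query: str, vector_search, bm25_search, top_k: int = 20):
    """Fire both retrievals concurrently; wall-clock cost is close to the
    slower of the two (~40ms) plus fusion, instead of their 75ms sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, query)   # ~40ms against HNSW index
        bm25_future = pool.submit(bm25_search, query)    # ~35ms against GIN index
        vec_hits, bm25_hits = vec_future.result(), bm25_future.result()
    # Each callable should hold its own DB connection: a single PostgreSQL
    # session cannot run two queries concurrently.
    return rrf_fuse(vec_hits, bm25_hits)[:top_k]
```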
Database indexing is critical for performance:
- HNSW index: Approximate nearest neighbor (10x faster than exact) [7]
- GIN index: Generalized Inverted Index for full-text search (PostgreSQL native)
- Statement timeout: 2-second limit for production SLA
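The corresponding DDL, sketched via psycopg2 with illustrative table and column names (chunks, embedding, content):

```python
import psycopg2

# Illustrative schema; adapt names to your tables.
DDL = """
-- HNSW: graph-based approximate nearest neighbor over pgvector embeddings
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
-- GIN over a tsvector expression: PostgreSQL-native full-text search
CREATE INDEX IF NOT EXISTS chunks_content_fts
    ON chunks USING gin (to_tsvector('english', content));
"""

with psycopg2.connect("postgresql://localhost/ragdb") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # index builds can take minutes on millions of rows
        # Query sessions (not this DDL session) get the SLA guard:
        # SET statement_timeout = '2s';
```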
Production Considerations
Multi-Tenant Isolation with Row-Level Security
The challenge: Multiple companies share one database, with zero tolerance for data leakage.
Solution: PostgreSQL Row-Level Security (RLS) plus session variables. Even SQL injected through application code stays confined to the current tenant's rows.
Why RLS over application-level filtering?
- Database-enforced security (not bypassable via code bug)
- Protection against SQL injection
- Audit trail built into PostgreSQL logs
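A sketch of the policy and the session-variable handoff, assuming a chunks table with a company_id column and a custom app.current_company setting; all names are illustrative, and note that RLS does not constrain superusers or roles with BYPASSRLS:

```python
import psycopg2

RLS_SETUP = """
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
    USING (company_id = current_setting('app.current_company')::uuid);
"""

def query_as_tenant(conn, company_id: str, sql: str, params=()):
    """Every statement on this connection sees only one tenant's rows."""
    with conn.cursor() as cur:
        # set_config(..., is_local=true) scopes the setting to this
        # transaction, so a pooled connection can't leak it to the next tenant.
        cur.execute("SELECT set_config('app.current_company', %s, true)", (company_id,))
        cur.execute(sql, params)  # even injected SQL is filtered by the policy
        return cur.fetchall()
```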
Cost Optimization: pgvector vs. External Vector DBs
| Vector DB | 10M Vectors + 1M Queries/Month | Notes |
|---|---|---|
| Pinecone | ~$675 | Managed, simple, expensive |
| Weaviate (self-hosted) | ~$200 + ops overhead | Open source, complex ops |
| pgvector (on existing RDS) | $0 incremental | Already paying for PostgreSQL |
Key insight: up to ~100M vectors, pgvector on existing PostgreSQL is cost-optimal [3][8]. Beyond that scale, or when you need distributed search, migrate to a specialized vector DB.
Latency Budget Breakdown
pgvector performance at scale:
- 1M vectors: 40ms P95 query time (HNSW index)
- 10M vectors: 60-80ms P95 (still acceptable)
- 50M vectors: 471 QPS at 99% recall [8]
- >100M vectors: Consider Pinecone/Qdrant (distributed search needed)
Common RAG Pitfalls and How We Avoided Them
Pitfall 1: Poor Chunking Breaks Context
Symptom: Line items split across chunks, losing vendor context
Fix: Section-aware chunking with business context prefix (see chunking section)
Pitfall 2: Vector Search Misses Exact Matches
Symptom: Query "PO-68755" doesn't match "Purchase Order 68755"
Fix: Hybrid search. BM25 catches exact matches, vector handles fuzzy semantics
Pitfall 3: No Relevance Filtering
Symptom: Weak matches pollute results, confuse LLM
Fix: Minimum RRF score threshold (0.005) filters noise
Pitfall 4: Slow Sequential Queries
Symptom: Vector + BM25 queries take 75ms sequentially
Fix: Parallel execution with ThreadPoolExecutor (50ms total)
Pitfall 5: Context Window Overflow
Symptom: Top 50 chunks exceed LLM context limit
Fix: Limit to top 20 chunks (~12K tokens), use reranker if needed (future work)
Performance Metrics and Results
Business Metrics
- Query success rate: 94% (queries return relevant results)
- Zero-result rate: 6% (acceptable for exploratory queries)
- User satisfaction: 4.2/5 (based on feedback)
Technical Metrics
- Embedding cost: $0.02 per 1M tokens (OpenAI text-embedding-3-small)
- Index size: 1.2M chunks, 2GB vector index, 800MB BM25 index
- Recall comparison: 87% (hybrid) vs 71% (vector-only) vs 65% (BM25-only)
A/B Test Results (3-week test, 5K queries):
- Hybrid vs. pure vector: +34% recall improvement
- RRF fusion vs. weighted average: +12% relevance (NDCG@3)
- Recency boost: +8% user satisfaction for time-sensitive queries
Future Enhancements
1. Reranker Layer
Cross-encoder model to re-score top-K results (10-20% accuracy gain). Example: Cohere Rerank API or open-source BGE reranker [9].
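A sketch of such a layer using the open-source BGE reranker through the sentence-transformers library (model choice and parameters are illustrative, not a production configuration):

```python
from sentence_transformers import CrossEncoder

# BAAI/bge-reranker-base: open-source cross-encoder for relevance scoring.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: list[str], top_k: int = 20) -> list[str]:
    """Re-score retrieved chunks against the query and keep the best."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```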
2. Query Expansion
Generate multiple query variations, especially for acronyms/abbreviations. Example: "PO" = ["Purchase Order", "PO", "P.O."]
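A minimal sketch with a hand-maintained expansion table (hypothetical; variants could instead come from an LLM or a domain glossary). Each variant can be retrieved independently and fused with the same RRF routine:

```python
# Hypothetical expansion table; extend per domain.
EXPANSIONS = {"PO": ["Purchase Order", "P.O."], "INV": ["Invoice"]}

def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per known expansion."""
    variants = {query}
    for token in query.split():
        for expansion in EXPANSIONS.get(token.upper().strip(".,"), []):
            variants.add(query.replace(token, expansion))
    return sorted(variants)

# expand_query("PO 68755") -> ["P.O. 68755", "PO 68755", "Purchase Order 68755"]
```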
3. Semantic Caching
Cache embeddings for common queries (reduce OpenAI costs by 40-60%).
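A minimal in-process sketch of the exact-match half, using the OpenAI Python client; a full semantic cache would also serve near-duplicate queries by embedding similarity, typically backed by Redis or similar:

```python
import functools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@functools.lru_cache(maxsize=10_000)
def cached_embedding(text: str) -> tuple[float, ...]:
    """Repeated queries skip the embeddings API entirely; savings scale
    with the share of queries that are exact repeats."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return tuple(resp.data[0].embedding)
```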
4. GraphRAG
Model relationships between documents (PO -> Invoice -> Receipt) for supply chain queries.
5. Long-Context Models
When GPT-5/Claude 4 support 1M+ tokens, revisit chunking strategy (may eliminate chunking for single-doc queries).
Conclusion and Key Takeaways
Production-grade RAG requires:
- Hybrid search is non-negotiable: Vector + BM25 outperforms either alone for structured documents (+34% recall in our tests)
- pgvector is underrated: For <=100M vectors, it's cost-optimal and performant ($0 vs $675/month)
- Chunking matters more than embeddings: Section-aware chunking preserves business context that fixed-size misses
- Parallel queries cut latency ~33%: Run vector + BM25 concurrently (50ms vs 75ms sequential)
- Multi-tenancy is critical: PostgreSQL RLS ensures secure data isolation at database level
When to use Kynthar's approach:
- B2B SaaS with structured documents (invoices, contracts, reports)
- Multi-tenant architecture with shared database
- Already using PostgreSQL (minimize infrastructure)
- Need <2s query latency with citations
When to use external vector DB (Pinecone/Weaviate):
- >100M vectors or need horizontal scaling
- Real-time updates (high write throughput)
- Need advanced features (metadata filtering across multiple dimensions)
Try Production RAG in Action
Process 25 documents free, then ask: "Show me all documents from [your vendor]". See hybrid search + RRF fusion in action.
Start Free Trial. No credit card required | 5-minute setup | Cancel anytime
Sources and References
1. Internal Kynthar A/B Test Data. (2024). Three-week test with 5,000 queries comparing hybrid search (87% recall) vs. pure vector (71%) vs. BM25-only (65%), measured on a procurement document corpus.
2. OpenAI. (2025). "API Pricing." text-embedding-3-small: $0.02/1M tokens; GPT-4 context: $0.002/1K input tokens.
3. Medium. (2024). "Postgres Vector Search with pgvector: Benchmarks, Costs, and Reality Check."
4. Firecrawl. (2025). "Best Chunking Strategies for RAG in 2025."
5. Microsoft Azure. (2025). "Hybrid search scoring (RRF)."
6. Azure AI Search Blog. (2024). "Outperforming vector search with hybrid retrieval and reranking."
7. Supabase. (2024). "pgvector 0.4.0 performance."
8. TigerData. (2024). "Pgvector vs. Qdrant: Open-Source Vector Database Comparison."
9. Unstructured. (2024). "Improving Retrieval in RAG with Reranking."
10. Orkes. (2024). "Best Practices for Production-Scale RAG Systems."
11. Assembled. (2024). "Better RAG results with Reciprocal Rank Fusion and Hybrid Search."
About this article: Technical architecture based on production Kynthar system processing 50,000+ documents/month across 100+ companies. Performance metrics measured over 3-month period (Oct-Dec 2024). A/B test data from controlled 3-week experiment with 5,000 queries. All cost estimates current as of January 2025.