Technical Deep Dive

The $0 Anomaly Detection System

Pure SQL. Sub-100ms. Zero ML infrastructure. Deterministic, auditable fraud detection that runs on your existing database without external API costs.

18 min read January 2025

Key Takeaways

  • Pure SQL catches 95% of fraud patterns ML-based systems detect
  • Sub-100ms latency by batching all line items into single queries
  • Deterministic results auditors can verify and reproduce
  • Adaptive sigma thresholds handle low sample sizes gracefully
  • Zero marginal cost vs $0.02+ per ML API call

Most modern fraud detection pitches involve machine learning, neural networks, and expensive API calls. You'll hear about "AI-powered anomaly detection" that costs $0.01-0.10 per transaction, requires ML infrastructure, and produces black-box results that auditors can't verify.

There's a simpler approach: pure SQL statistical analysis. It catches 95% of the same fraud patterns, runs in under 100ms, costs nothing per transaction, and produces deterministic results you can explain to any auditor.

  • $0 per-transaction cost
  • <100ms detection latency
  • 12+ detection patterns
  • 100% deterministic

Why Not Machine Learning?

Machine learning has its place, but invoice fraud detection isn't it.

When ML Makes Sense

ML excels at unstructured problems: image classification, natural language understanding, pattern discovery in unknown domains. Invoice fraud has well-defined patterns that statistical methods catch equally well at a fraction of the cost.

The Architecture

Pure SQL Detection Pipeline
Invoice Data (input) → PostgreSQL (engine) → 12 SQL queries (analysis) → Anomaly Flags (output)

The system runs entirely within your PostgreSQL database. No external services, no API calls, no network latency. Each detection pattern is a single optimized query that compares the incoming invoice against historical data.

Core Detection Patterns

1. Statistical Price Outliers

The foundation of anomaly detection is z-score analysis: how many standard deviations is this price from the historical mean?

Price Outlier Detection Query
SELECT
    li.vendor_part_number AS part_number,
    AVG(li.unit_price) AS avg_price,
    STDDEV(li.unit_price) AS stddev_price,
    COUNT(*) AS sample_size
FROM invoice_line_items li
JOIN procurement_headers ph ON li.doc_id = ph.doc_id
WHERE ph.vendor_id = $vendor_id
  AND ph.company_id = $company_id
  AND li.vendor_part_number = ANY($part_numbers)
  AND li.unit_price IS NOT NULL
  AND ph.doc_id != $current_doc_id  -- Exclude current invoice
GROUP BY li.vendor_part_number

The key insight: batch all line items into a single query. An invoice with 100 line items requires 1 query, not 100. This is what makes sub-100ms possible.

Adaptive Thresholds

A fixed 3-sigma threshold doesn't work when you have limited historical data. With only 2 prior purchases, normal variance makes the third purchase look like an outlier.

The solution: adaptive thresholds based on sample size:

Threshold Selection Logic
# Python logic after fetching SQL stats
if sample_size == 1:
    threshold = 5.0   # Very conservative
elif sample_size == 2:
    threshold = 4.0   # Conservative
else:
    threshold = 3.0   # Normal statistical threshold

# Calculate z-score
if stddev_price > 0:
    deviation = (current_price - avg_price) / stddev_price
else:
    # Zero variance: use percentage-based fallback
    pct_diff = abs(current_price - avg_price) / avg_price
    deviation = pct_diff * 20  # 5% = 1 sigma equivalent

The zero-variance case is important: if you've bought Widget-X at exactly $100.00 for 20 consecutive purchases, the standard deviation is 0. A naive implementation would divide by zero or flag $100.01 as an infinite-sigma outlier. The percentage fallback handles this.
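Combining the adaptive threshold with the zero-variance fallback gives a small pure function for the scoring step. This is a sketch; the function name and return shape are illustrative, not from the original system:

```python
def price_deviation(current_price, avg_price, stddev_price, sample_size):
    """Return (deviation, threshold, flagged) for one line item."""
    # Adaptive threshold: be more conservative when history is thin
    if sample_size == 1:
        threshold = 5.0
    elif sample_size == 2:
        threshold = 4.0
    else:
        threshold = 3.0

    if stddev_price > 0:
        # Standard z-score (absolute value, since both directions matter)
        deviation = abs(current_price - avg_price) / stddev_price
    else:
        # Zero variance: percentage-based fallback, 5% ~ 1 sigma equivalent
        pct_diff = abs(current_price - avg_price) / avg_price
        deviation = pct_diff * 20

    return deviation, threshold, deviation > threshold
```

With 10 prior purchases averaging $100 (stddev $10), a $200 price scores 10 sigma and is flagged; the same price after a single prior purchase must clear the stricter 5-sigma bar.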

2. Duplicate Detection: Three Methods

Duplicate payments are the most common AP fraud. But "duplicate" isn't simple—you need three detection methods:

1. Exact Match: Same invoice number from the same vendor. Case-insensitive. Checks against ALL invoices, including paid.

2. Amount Match: Same amount (within 0.5%) from the same vendor within 30 days, with different invoice numbers. Catches number-changers.

3. Fuzzy Match: Similar invoice number (85%+ trigram similarity) AND similar amount. Catches typos and OCR errors.
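Method 2 can be expressed as a pure predicate. This is a minimal sketch, assuming invoices are dicts; the field names and function name are illustrative:

```python
from datetime import date

def is_amount_duplicate(new_inv, prior_inv, pct_tol=0.005, window_days=30):
    """Method 2: same vendor, same amount within 0.5%, within 30 days,
    but a different invoice number (catches resubmissions under a new number)."""
    if new_inv["vendor_id"] != prior_inv["vendor_id"]:
        return False
    if new_inv["number"].strip().lower() == prior_inv["number"].strip().lower():
        return False  # same number is the exact-match check's job
    days_apart = abs((new_inv["date"] - prior_inv["date"]).days)
    if days_apart > window_days:
        return False
    amount_diff = abs(new_inv["amount"] - prior_inv["amount"])
    return amount_diff / prior_inv["amount"] <= pct_tol
```

In production this runs as a single SQL query against the vendor's recent invoices, like the other checks; the predicate just makes the rule explicit.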

Fuzzy Duplicate Detection with pg_trgm
SELECT
    ph.doc_id,
    ph.document_number,
    ph.total_amount,
    similarity(ph.document_number, $invoice_number) AS sim_score
FROM procurement_headers ph
WHERE ph.vendor_id = $vendor_id
  AND ph.company_id = $company_id
  AND ph.doc_id != $current_doc_id
  AND ph.document_type = 'invoice'
  AND ABS(ph.total_amount - $total_amount) / $total_amount < 0.01
  AND similarity(ph.document_number, $invoice_number) > 0.85
  AND LOWER(TRIM(ph.document_number)) != LOWER(TRIM($invoice_number))
ORDER BY sim_score DESC
LIMIT 5

The pg_trgm extension provides trigram-based similarity scoring: "INV-I234" shares most of its trigrams with "INV-1234", so it scores far higher than an unrelated invoice number would. The query requires BOTH a similar number AND a similar amount to reduce false positives.
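To build intuition for the scores, here is a rough pure-Python approximation of pg_trgm's similarity (lowercase, split on non-alphanumerics, pad each word, Jaccard over trigram sets). The real extension's tokenization differs in details, so treat this as illustrative only:

```python
import re

def trigrams(text):
    """Approximate pg_trgm tokenization: lowercase, split on
    non-alphanumeric characters, pad each word with two leading
    spaces and one trailing space, then take 3-character windows."""
    grams = set()
    for word in re.split(r"[^a-z0-9]+", text.lower()):
        if not word:
            continue
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams over total distinct trigrams, mirroring the
    shape of pg_trgm's similarity() score."""
    ta, tb = trigrams(a), trigrams(b)
    if not (ta or tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

A one-character OCR error leaves most trigrams intact, while a different invoice number shares almost none, which is what makes the similarity cutoff usable.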

Critical: Include Paid Invoices

Many AP systems exclude paid invoices from duplicate checks. This is wrong. The most common fraud is resubmitting already-paid invoices months later. Always check against the full invoice history.

3. Cross-Vendor Price Comparison

Are you paying more than you should? Cross-vendor comparison requires item normalization: mapping different vendor part numbers to canonical item IDs.

Cross-Vendor Price Analysis
WITH best_price AS (
    SELECT MIN(ili.unit_price) AS min_price
    FROM invoice_line_items ili
    JOIN procurement_headers ph ON ili.doc_id = ph.doc_id
    WHERE ili.item_id = $item_id
      AND ph.company_id = $company_id
      AND ili.unit_price > 0
      AND ph.doc_id != $current_doc_id
)
SELECT
    bp.min_price,
    vm.canonical_name AS best_vendor,
    im.canonical_description
FROM best_price bp
CROSS JOIN (
    SELECT ili.doc_id
    FROM invoice_line_items ili
    JOIN procurement_headers ph ON ili.doc_id = ph.doc_id
    WHERE ili.item_id = $item_id
      AND ili.unit_price = (SELECT min_price FROM best_price)
    LIMIT 1
) best_doc
JOIN procurement_headers ph ON best_doc.doc_id = ph.doc_id
LEFT JOIN vendor_master vm ON ph.vendor_id = vm.vendor_id
JOIN item_master im ON im.item_id = $item_id

This finds the lowest price ever paid for this item from ANY vendor, then compares the current price. If you're paying 20%+ more than the best available price, something's wrong.
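The final comparison is simple arithmetic. A minimal sketch using the 20% tolerance from the text (the function name and None-for-clean convention are illustrative):

```python
def overpayment_flag(current_price, best_price, tolerance=0.20):
    """Return the premium over the best historical price if it
    exceeds the tolerance, otherwise None (no anomaly)."""
    if best_price <= 0:
        return None  # no usable baseline price
    premium = (current_price - best_price) / best_price
    return premium if premium > tolerance else None
```

Paying $130 against a $100 best price returns a 30% premium and gets flagged; $110 stays within tolerance.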

Performance Optimization

Batch Queries

The #1 performance killer is N+1 queries. An invoice with 50 line items should NOT require 50 separate database queries. Batch everything:

# BAD: Query per line item (50 queries for 50 items)
for item in line_items:
    stats = query_historical_prices(item.part_number)

# GOOD: Single query for all items
part_numbers = [item.part_number for item in line_items]
stats_map = query_all_historical_prices(part_numbers)  # Uses ANY($part_numbers)

Index Strategy

These compound indexes are critical for sub-100ms performance:

Required Indexes
-- Invoice line item lookups
CREATE INDEX idx_line_items_vendor_company_part
    ON invoice_line_items(vendor_id, company_id, vendor_part_number);

-- Duplicate detection by invoice number
CREATE INDEX idx_headers_vendor_docnum
    ON procurement_headers(vendor_id, company_id, document_number);

-- Amount-based duplicate detection
CREATE INDEX idx_headers_vendor_amount_date
    ON procurement_headers(vendor_id, company_id, total_amount, document_date);

-- Trigram index for fuzzy matching
CREATE INDEX idx_headers_docnum_trgm
    ON procurement_headers USING gin(document_number gin_trgm_ops);

-- Item normalization lookups
CREATE INDEX idx_line_items_item_id
    ON invoice_line_items(item_id);

Query Execution Time

Benchmark results on a mid-size dataset (500K invoices, 5M line items):

Detection Latency by Pattern (500K invoice dataset)

| Pattern | Latency |
|---|---|
| Price outliers | 12ms |
| Quantity spikes | 8ms |
| Exact duplicate | 3ms |
| Amount duplicate | 6ms |
| Fuzzy duplicate | 18ms |
| Cumulative fraud | 10ms |
| Cross-vendor price | 22ms |
| All 12 checks | ~85ms |

Determinism: The Audit Advantage

Pure SQL detection is deterministic: same input always produces same output. This matters for audits and compliance:

| Property | ML-Based Detection | SQL-Based Detection |
|---|---|---|
| Same input, same output? | Not guaranteed (model updates) | Yes, always |
| Explainable to auditors? | Difficult (black box) | "4.2 sigma above mean" |
| Reproducible results? | Only with versioned models | Yes, the query is the record |
| Historical queries? | Need model snapshots | Re-run query on data snapshot |

When an auditor asks "why was this flagged?", you can show them the exact SQL query and the specific values that triggered the detection. No model interpretation required.

Cost Comparison

At scale, the cost difference is substantial:

| Monthly Volume | ML API ($0.02/check) | Pure SQL | Annual Savings |
|---|---|---|---|
| 1,000 invoices | $20 | $0 | $240 |
| 10,000 invoices | $200 | $0 | $2,400 |
| 100,000 invoices | $2,000 | $0 | $24,000 |
| 1,000,000 invoices | $20,000 | $0 | $240,000 |

The "cost" of SQL-based detection is the database compute you're already paying for. The queries run in milliseconds and don't require additional infrastructure.

Item Normalization: The Secret Sauce

Cross-vendor and historical price analysis only work if you can identify when two different part numbers refer to the same item. This requires an item master:

Item Master Schema
CREATE TABLE item_master (
    item_id UUID PRIMARY KEY,
    company_id UUID NOT NULL,
    canonical_description TEXT NOT NULL,
    category TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Maps vendor-specific part numbers to canonical items
CREATE TABLE item_vendor_mapping (
    mapping_id UUID PRIMARY KEY,
    item_id UUID REFERENCES item_master(item_id),
    vendor_id UUID REFERENCES vendor_master(vendor_id),
    vendor_part_number TEXT NOT NULL,
    UNIQUE(vendor_id, vendor_part_number)
);

When an invoice arrives with "HP-CF294A-BLK", the system looks up the mapping to find canonical item_id "ITEM-TONER-001". Now it can compare prices across all vendors who sell that item, regardless of what they call it.

Building the Item Master

Item normalization can be automated using fuzzy matching on descriptions, UPC codes, and manufacturer part numbers. Start with exact matches, then use similarity scoring for candidates, with human review for ambiguous cases.
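That tiered pipeline can be sketched with a stand-in similarity function. Python's difflib is used here purely for illustration; a production system would use pg_trgm or a dedicated matcher, and the function name and return convention are assumptions:

```python
from difflib import SequenceMatcher

def match_item(part_number, description, mappings, items, threshold=0.85):
    """Tier 1: exact part-number match. Tier 2: fuzzy description
    match above a threshold. Tier 3: queue for human review."""
    # Tier 1: exact match on vendor part number
    if part_number in mappings:
        return mappings[part_number], "exact"

    # Tier 2: best fuzzy match against canonical descriptions
    best_id, best_score = None, 0.0
    for item_id, canonical_desc in items.items():
        score = SequenceMatcher(None, description.lower(),
                                canonical_desc.lower()).ratio()
        if score > best_score:
            best_id, best_score = item_id, score
    if best_score >= threshold:
        return best_id, "fuzzy"

    # Tier 3: ambiguous, send to a human reviewer
    return None, "review"
```

The "review" outcomes are what keep the item master trustworthy: every fuzzy decision a human confirms becomes a new exact mapping.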

Handling Edge Cases

Zero Variance

When you've bought an item at exactly the same price for every historical purchase, standard deviation is zero. Division by zero would flag any price difference as infinite sigma. The fix:

if stddev_price > 0:
    deviation = (current_price - avg_price) / stddev_price
else:
    # Zero variance: convert percentage difference to sigma-like score
    # 5% difference = ~1 sigma equivalent
    pct_diff = abs(current_price - avg_price) / avg_price
    deviation = pct_diff * 20

New Items / New Vendors

No historical data means no statistical baseline. Instead of skipping these invoices, flag them explicitly as first-time purchases so a human can review them.
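One possible shape for that first-time logic, as a sketch (the flag tuples and severity label are illustrative, not the original system's format):

```python
def first_time_flags(vendor_seen_before, known_item_ids, line_items):
    """Flag first-time vendors and first-time items for review instead
    of running statistical checks that have no baseline."""
    flags = []
    if not vendor_seen_before:
        flags.append(("new_vendor", "info", "First invoice from this vendor"))
    for li in line_items:
        if li["item_id"] not in known_item_ids:
            flags.append(("new_item", "info",
                          f"No purchase history for {li['item_id']}"))
    return flags
```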

Credit Memos

Cumulative fraud detection must account for credit memos. Without them, a $50K PO with $60K in invoices and $15K in credits would look like a $10K overage when it's actually $5K under:

# Net cumulative = prior invoices + current invoice - credit memos
cumulative_total = invoiced_total + current_invoice - credit_memos
if cumulative_total > po_total * 1.01:  # 1% tolerance
    flag_cumulative_fraud()

Implementation Checklist

Database Setup
  • Enable pg_trgm extension
  • Create compound indexes
  • Set up item_master tables
Detection Queries
  • Price outlier (adaptive sigma)
  • 3x duplicate detection methods
  • Cumulative fraud check
Edge Case Handling
  • Zero variance fallback
  • New vendor/item detection
  • Credit memo accounting
Output Storage
  • Anomaly flags table
  • Severity classification
  • Audit trail with details

When to Consider ML

Pure SQL handles the known patterns. ML can add value for genuinely unstructured problems, such as discovering novel fraud patterns that no predefined rule yet describes.

But these are additions to the statistical foundation, not replacements. The SQL-based detection catches 95% of fraud at zero marginal cost; ML can help with the remaining 5%.

See Zero-Cost Detection in Action

Upload an invoice and watch real-time statistical analysis flag anomalies in milliseconds. No ML inference costs. No black box.

Try Free Demo


References

  1. PostgreSQL Documentation, "pg_trgm: Trigram Matching"
  2. ACFE, "Report to the Nations: 2024 Global Study on Occupational Fraud"
  3. Aberdeen Group, "The Cost of Duplicate Payments in Accounts Payable"
  4. Institute of Finance & Management, "AP Fraud Detection Best Practices"