Real-world benchmark · Insurance corpus

Benchmarks

ReasonDB was benchmarked against a production insurance document corpus — the same workload that drives demand for explainable RAG in regulated industries.

// Summary

100%    · Pass rate        · 12/12 queries passed
90%     · Context recall   · avg across all queries
6.1s    · Median latency   · end-to-end, complex tier
55–70%  · Standard RAG     · same benchmark, for reference

// Methodology

The benchmark was run against a corpus of real-world insurance Policy Disclosure Statements (PDS) — including scanned legacy documents, multi-section contracts with cross-references, and policies using evolved terminology (e.g. "death cover" → "life cover" across different cohorts).
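The evolved-terminology handling described above amounts to normalizing colloquial or legacy terms to a canonical form before retrieval. A minimal sketch (`VOCAB` and `normalize_query` are illustrative names, not ReasonDB's actual API; real translation is cohort-aware and auto-extracted):

```python
# Illustrative synonym map for evolved policy terminology.
# Keys are lowercase; queries are lowercased before matching.
VOCAB = {
    "death cover": "life cover",
    "big c": "critical illness",
}

def normalize_query(query: str, vocab: dict[str, str] = VOCAB) -> str:
    """Replace evolved/colloquial terms with their canonical policy terms.

    Note: the whole query is lowercased for simple substring matching;
    a production system would preserve case and match on token spans.
    """
    out = query.lower()
    for term, canonical in vocab.items():
        out = out.replace(term, canonical)
    return out

print(normalize_query("If I have a big C what benefits am I eligible for?"))
```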

Corpus characteristics

Documents            4 PDS (expandable to 1,000+)
Document types       Scanned PDF, digital PDF
Cross-references     Multiple chains (§3.3 → §3.1 → Appendix)
Cohorts              2 (disability_2023, home_2024)
Domain vocab terms   ~60 auto-extracted
Query tiers          Simple, Medium, Complex (12 total)

Evaluation criteria

Each query was evaluated across three dimensions:

P · Pass / Fail: binary evaluation of whether the answer correctly addresses the question, grounded in the right document section.
R · Context Recall: the percentage of required source sections that were retrieved and included in the answer context.
C · Confidence: ReasonDB's internal confidence score (0–1) at the conclusion of Phase 4 beam search.
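The pass/fail and recall criteria can be sketched in a few lines (illustrative helper names, not the benchmark harness itself; section IDs below are invented for the example):

```python
def context_recall(required: set[str], retrieved: set[str]) -> float:
    """Fraction of required source sections present in the answer context."""
    if not required:
        return 1.0
    return len(required & retrieved) / len(required)

def passed(answer_section: str, expected_section: str, grounded: bool) -> bool:
    """Binary pass/fail: grounded answer drawn from the right section."""
    return grounded and answer_section == expected_section

# Example: 2 of 3 required sections retrieved.
recall = context_recall({"3.1", "3.3", "appendix_b"}, {"3.1", "3.3", "4.2"})
print(f"{recall:.0%}")  # → 67%
```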

// Query-level results

12 queries across 3 complexity tiers run against the insurance corpus.

Each row: query · tier · pass · recall · confidence · latency (all 12 queries passed).

What is the waiting period for income protection?
    Simple · pass · recall 95% · confidence 0.96 · latency 3.2s
Am I covered if I get sick while traveling overseas?
    Simple · pass · recall 91% · confidence 0.94 · latency 3.8s
How long is the total disability benefit paid?
    Simple · pass · recall 93% · confidence 0.97 · latency 3.5s
What happens if I can only work part-time after an injury? (vocabulary: partial disability)
    Medium · pass · recall 89% · confidence 0.92 · latency 5.4s
If I have a big C what benefits am I eligible for? (vocab translation: big C → critical illness)
    Medium · pass · recall 87% · confidence 0.91 · latency 5.9s
Is death cover different to life cover in my policy? (cohort: disability_2023 vs home_2024)
    Medium · pass · recall 90% · confidence 0.93 · latency 5.7s
What are all conditions for total disability benefit under section 4?
    Medium · pass · recall 92% · confidence 0.95 · latency 5.1s
How is the benefit amount calculated if I become permanently disabled? (cross-ref + formula tool-call)
    Complex · pass · recall 86% · confidence 0.89 · latency 7.8s
What are all cross-referenced conditions for the income care plus policy? (3-level cross-ref chain)
    Complex · pass · recall 85% · confidence 0.88 · latency 8.3s
List all exclusions that apply to this policy and their exceptions (multi-section aggregation)
    Complex · pass · recall 83% · confidence 0.87 · latency 8.9s
What are the waiting period differences between all policies? (cross-cohort comparison)
    Complex · pass · recall 88% · confidence 0.90 · latency 7.2s
If section 3.3 refers to appendix B, what does that appendix say about claim eligibility? (deep cross-ref resolution)
    Complex · pass · recall 84% · confidence 0.88 · latency 9.1s

Tiers: Simple = single-section, no cross-refs · Medium = multi-section, vocabulary translation · Complex = cross-refs + formula tool-call
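The complex-tier queries hinge on following cross-reference chains like §3.3 → §3.1 → Appendix B, which amounts to a bounded breadth-first traversal over a section-reference graph. A minimal sketch with a toy graph (`REFS` and `resolve_chain` are illustrative, not ReasonDB internals):

```python
# Toy section-reference graph: section ID -> sections it cites.
REFS = {
    "3.3": ["3.1"],
    "3.1": ["appendix_b"],
    "appendix_b": [],
}

def resolve_chain(start: str, refs: dict[str, list[str]], max_depth: int = 3) -> list[str]:
    """Follow cross-references breadth-first, up to max_depth levels deep.

    Returns sections in discovery order; a cycle cannot loop because
    each section is visited at most once.
    """
    seen, frontier, order = {start}, [start], [start]
    for _ in range(max_depth):
        nxt = []
        for section in frontier:
            for ref in refs.get(section, []):
                if ref not in seen:
                    seen.add(ref)
                    order.append(ref)
                    nxt.append(ref)
        frontier = nxt
    return order

print(resolve_chain("3.3", REFS))  # → ['3.3', '3.1', 'appendix_b']
```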

// System comparison

Same corpus, same queries, same evaluation criteria.

Each row: system · pass rate · context recall.

ReasonDB HRR (Hierarchical Reasoning Retrieval, 4-phase pipeline)
    pass rate 100% · context recall 90%
Standard RAG, OpenAI (text-embedding-3-large + GPT-4o)
    pass rate 58% · context recall 62%
Standard RAG, Anthropic (Voyage-3 embeddings + Claude Sonnet)
    pass rate 67% · context recall 71%
Azure AI Knowledge Base (semantic chunking + Azure OpenAI)
    pass rate 54% · context recall 60%
Weaviate + LlamaIndex (hybrid search + ReAct agent)
    pass rate 63% · context recall 68%

// Latency breakdown

End-to-end latency by phase for a typical complex query (6.1s median).

Phase 1 · BM25           0.2s          0 LLM calls
Phase 2 · tree_grep      0.8s          0 LLM calls
Phase 3 · Ranking        1.4s          1 LLM call
Phase 4 · Beam search    3.7s          3+ LLM calls
Total                    6.1s median   5 LLM calls
Note on latency: roughly 60% of end-to-end time is spent in Phase 4 beam search, which is dominated by LLM calls, so latency scales with model speed. Using smaller language models (e.g. Claude Haiku, Gemini Flash) can bring median latency to 2–3s at ~92–93% accuracy.
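The phase-share arithmetic behind that note follows directly from the published timings (beam search comes out at roughly 60% of the 6.1s total):

```python
# Phase timings from the latency table above, in seconds.
phases = {"BM25": 0.2, "tree_grep": 0.8, "Ranking": 1.4, "Beam search": 3.7}
total = sum(phases.values())

for name, secs in phases.items():
    print(f"{name:<12} {secs:.1f}s  ({secs / total:.0%} of total)")
print(f"{'Total':<12} {total:.1f}s")
```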

Run the benchmark yourself

The benchmark scripts are open source. Reproduce these results, or run the same evaluation against your own document corpus.