Real-world benchmark · Insurance corpus

Benchmarks

ReasonDB was benchmarked against a production insurance document corpus — the same workload that drives demand for explainable RAG in regulated industries.

// Summary

100%    · Pass rate        · 12/12 queries passed
90%     · Context recall   · avg across all queries
6.1s    · Median latency   · end-to-end, complex tier
55–70%  · Standard RAG     · same benchmark, for reference

// Methodology

The benchmark was run against a corpus of real-world insurance Policy Disclosure Statements (PDS) — including scanned legacy documents, multi-section contracts with cross-references, and policies using evolved terminology (e.g. "death cover" → "life cover" across different cohorts).
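The evolved-terminology handling described above amounts to normalizing colloquial or legacy terms to a canonical form before retrieval. A minimal sketch (`VOCAB` and `normalize_query` are illustrative names, not ReasonDB's actual API; real translation is cohort-aware and auto-extracted):

```python
# Illustrative synonym map for evolved policy terminology.
# Keys are lowercase; queries are lowercased before matching.
VOCAB = {
    "death cover": "life cover",
    "big c": "critical illness",
}

def normalize_query(query: str, vocab: dict[str, str] = VOCAB) -> str:
    """Replace evolved/colloquial terms with their canonical policy terms.

    Note: the whole query is lowercased for simple substring matching;
    a production system would preserve case and match on token spans.
    """
    out = query.lower()
    for term, canonical in vocab.items():
        out = out.replace(term, canonical)
    return out

print(normalize_query("If I have a big C what benefits am I eligible for?"))
```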

Corpus characteristics

Documents            4 PDS (expandable to 1,000+)
Document types       Scanned PDF, digital PDF
Cross-references     Multiple chains (§3.3 → §3.1 → Appendix)
Cohorts              2 (disability_2023, home_2024)
Domain vocab terms   ~60 auto-extracted
Query tiers          Simple, Medium, Complex (12 total)

Evaluation criteria

Each query was evaluated across three dimensions:

P · Pass / Fail: binary evaluation of whether the answer correctly addresses the question, grounded in the right document section.
R · Context Recall: the percentage of required source sections that were retrieved and included in the answer context.
C · Confidence: ReasonDB's internal confidence score (0–1) at the conclusion of Phase 4 beam search.
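The pass/fail and recall criteria can be sketched in a few lines (illustrative helper names, not the benchmark harness itself; section IDs below are invented for the example):

```python
def context_recall(required: set[str], retrieved: set[str]) -> float:
    """Fraction of required source sections present in the answer context."""
    if not required:
        return 1.0
    return len(required & retrieved) / len(required)

def passed(answer_section: str, expected_section: str, grounded: bool) -> bool:
    """Binary pass/fail: grounded answer drawn from the right section."""
    return grounded and answer_section == expected_section

# Example: 2 of 3 required sections retrieved.
recall = context_recall({"3.1", "3.3", "appendix_b"}, {"3.1", "3.3", "4.2"})
print(f"{recall:.0%}")  # → 67%
```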

// Query-level results

12 queries across 3 complexity tiers run against the insurance corpus.

Each row: query · tier · pass · recall · confidence · latency (all 12 queries passed).

What is the waiting period for income protection?
    Simple · pass · recall 95% · confidence 0.96 · latency 3.2s
Am I covered if I get sick while traveling overseas?
    Simple · pass · recall 91% · confidence 0.94 · latency 3.8s
How long is the total disability benefit paid?
    Simple · pass · recall 93% · confidence 0.97 · latency 3.5s
What happens if I can only work part-time after an injury? (vocabulary: partial disability)
    Medium · pass · recall 89% · confidence 0.92 · latency 5.4s
If I have a big C what benefits am I eligible for? (vocab translation: big C → critical illness)
    Medium · pass · recall 87% · confidence 0.91 · latency 5.9s
Is death cover different to life cover in my policy? (cohort: disability_2023 vs home_2024)
    Medium · pass · recall 90% · confidence 0.93 · latency 5.7s
What are all conditions for total disability benefit under section 4?
    Medium · pass · recall 92% · confidence 0.95 · latency 5.1s
How is the benefit amount calculated if I become permanently disabled? (cross-ref + formula tool-call)
    Complex · pass · recall 86% · confidence 0.89 · latency 7.8s
What are all cross-referenced conditions for the income care plus policy? (3-level cross-ref chain)
    Complex · pass · recall 85% · confidence 0.88 · latency 8.3s
List all exclusions that apply to this policy and their exceptions (multi-section aggregation)
    Complex · pass · recall 83% · confidence 0.87 · latency 8.9s
What are the waiting period differences between all policies? (cross-cohort comparison)
    Complex · pass · recall 88% · confidence 0.90 · latency 7.2s
If section 3.3 refers to appendix B, what does that appendix say about claim eligibility? (deep cross-ref resolution)
    Complex · pass · recall 84% · confidence 0.88 · latency 9.1s

Tiers: Simple = single-section, no cross-refs · Medium = multi-section, vocabulary translation · Complex = cross-refs + formula tool-call
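The complex-tier queries hinge on following cross-reference chains like §3.3 → §3.1 → Appendix B, which amounts to a bounded breadth-first traversal over a section-reference graph. A minimal sketch with a toy graph (`REFS` and `resolve_chain` are illustrative, not ReasonDB internals):

```python
# Toy section-reference graph: section ID -> sections it cites.
REFS = {
    "3.3": ["3.1"],
    "3.1": ["appendix_b"],
    "appendix_b": [],
}

def resolve_chain(start: str, refs: dict[str, list[str]], max_depth: int = 3) -> list[str]:
    """Follow cross-references breadth-first, up to max_depth levels deep.

    Returns sections in discovery order; a cycle cannot loop because
    each section is visited at most once.
    """
    seen, frontier, order = {start}, [start], [start]
    for _ in range(max_depth):
        nxt = []
        for section in frontier:
            for ref in refs.get(section, []):
                if ref not in seen:
                    seen.add(ref)
                    order.append(ref)
                    nxt.append(ref)
        frontier = nxt
    return order

print(resolve_chain("3.3", REFS))  # → ['3.3', '3.1', 'appendix_b']
```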

// System comparison

Same corpus, same queries, same evaluation criteria.

Each row: system · pass rate · context recall.

ReasonDB HRR (Hierarchical Reasoning Retrieval, 4-phase pipeline)
    pass rate 100% · context recall 90%
Standard RAG, OpenAI (text-embedding-3-large + GPT-4o)
    pass rate 58% · context recall 62%
Standard RAG, Anthropic (Voyage-3 embeddings + Claude Sonnet)
    pass rate 67% · context recall 71%
Azure AI Knowledge Base (semantic chunking + Azure OpenAI)
    pass rate 54% · context recall 60%
Weaviate + LlamaIndex (hybrid search + ReAct agent)
    pass rate 63% · context recall 68%

// Latency breakdown

End-to-end latency by phase for a typical complex query (6.1s median).

Phase 1 · BM25           0.2s          0 LLM calls
Phase 2 · tree_grep      0.8s          0 LLM calls
Phase 3 · Ranking        1.4s          1 LLM call
Phase 4 · Beam search    3.7s          3+ LLM calls
Total                    6.1s median   5 LLM calls
Note on latency: roughly 60% of end-to-end time is spent in Phase 4 beam search, which is dominated by LLM calls, so latency scales with model speed. Using smaller language models (e.g. Claude Haiku, Gemini Flash) can bring median latency to 2–3s at ~92–93% accuracy.
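The phase-share arithmetic behind that note follows directly from the published timings (beam search comes out at roughly 60% of the 6.1s total):

```python
# Phase timings from the latency table above, in seconds.
phases = {"BM25": 0.2, "tree_grep": 0.8, "Ranking": 1.4, "Beam search": 3.7}
total = sum(phases.values())

for name, secs in phases.items():
    print(f"{name:<12} {secs:.1f}s  ({secs / total:.0%} of total)")
print(f"{'Total':<12} {total:.1f}s")
```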

Run the benchmark yourself

The benchmark scripts are open source. Reproduce these results, or run the same evaluation against your own document corpus.