
# Benchmark Results — Technical Documentation Q&A

**Provider:** openai (gpt-4o-mini) | **Corpus:** 16 markdown files

## Aggregate Metrics

| Metric | Value |
| --- | --- |
| Retrieval P@5 | 0.70 |
| Retrieval R@5 | 0.83 |
| Keyword Hit Rate | 0.89 |
| Source Citation Rate | 20/22 |
| Citation Accuracy | 1.00 |
| Grounded Refusal Rate | 0/5 |
| Calculator Accuracy | 2/3 |
| Latency p50 | 4,690 ms |
| Latency p95 | 14,991 ms |
| Cost per query | $0.0004 |

## By Category

| Category | Count | P@5 | R@5 | Keyword Hit | Refusal |
| --- | --- | --- | --- | --- | --- |
| retrieval | 19 | 0.76 | 0.91 | 0.88 | n/a |
| calculation | 3 | 0.33 | 0.33 | 0.92 | n/a |
| out_of_scope | 5 | n/a | n/a | n/a | 0/5 |

## By Difficulty

| Difficulty | Count | P@5 | R@5 | Keyword Hit |
| --- | --- | --- | --- | --- |
| easy | 13 | 0.50 | 0.88 | 0.84 |
| medium | 10 | 0.78 | 0.90 | 0.90 |
| hard | 4 | 0.90 | 0.58 | 0.94 |

## Chunking Strategy Comparison

| Strategy | Note |
| --- | --- |
| Recursive (default) | Used for this benchmark run |
| Fixed-size | Available via `--chunk-strategy fixed` in ingest. Re-run evaluation to compare. |

To generate a comparison, run `make ingest` with each strategy, run `make evaluate-fast` against each resulting index, and then compare the two results JSON files; a sketch of that comparison step follows below.
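
A minimal sketch of that comparison, assuming each evaluation run writes a flat results JSON; the file paths and metric keys below are hypothetical placeholders for whatever the evaluator actually emits:

```python
import json

# Hypothetical paths: one results file per chunking-strategy run.
RUNS = {
    "recursive": "results_recursive.json",
    "fixed": "results_fixed.json",
}

# Hypothetical metric keys; align these with the real results schema.
METRICS = ["retrieval_p5", "retrieval_r5", "keyword_hit_rate"]

for strategy, path in RUNS.items():
    with open(path) as f:
        results = json.load(f)
    summary = " | ".join(f"{m}={results.get(m, 'n/a')}" for m in METRICS)
    print(f"{strategy:>10}: {summary}")
```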

## Failure Analysis (3 worst queries)

### q007: "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?"

- Retrieval P@5: 0.00 | R@5: 0.00 | KHR: 0.75
- Retrieved: `[]`
- Root cause: the LLM answered the calculation from its parametric knowledge without calling `search_documents` first. The answer was correct (KHR 0.75) but no retrieval occurred, so P@5/R@5 are zero. Fix: a stronger system prompt forcing search before calculation, or a retrieval-first orchestrator policy (see the sketch below).
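
A minimal sketch of such a retrieval-first policy. The `search_documents` and `llm_answer` callables are hypothetical stand-ins for the agent's tool and model interfaces; this is one way to enforce grounding, not the repo's actual orchestrator:

```python
def retrieval_first_answer(query, search_documents, llm_answer, top_k=5):
    """Always retrieve before answering, even for pure calculations.

    `search_documents` and `llm_answer` are hypothetical callables standing
    in for the agent's tool and model interfaces.
    """
    # Step 1: retrieval is mandatory, not left to the model's tool choice.
    chunks = search_documents(query, top_k=top_k)

    # Step 2: the model answers only from the retrieved context and must
    # cite the source file for any number it uses.
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using ONLY the context below, and cite the source file "
        f"for any number you use.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    return llm_answer(prompt), [c["source"] for c in chunks]
```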

### q021: "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?"

- Retrieval P@5: 0.00 | R@5: 0.00 | KHR: 1.00
- Retrieved: `[]`
- Root cause: same pattern as q007 — the LLM computed 600 / 60 = 10 from parametric knowledge, skipping retrieval entirely. The answer was fully correct (KHR 1.00) but ungrounded. Fix: same as above — enforce tool use before answering.

### q001: "How do you define a path parameter in FastAPI?"

- Retrieval P@5: 0.20 | R@5: 1.00 | KHR: 0.75
- Retrieved: `['fastapi_query_params.md', 'fastapi_path_params.md', 'fastapi_query_params.md']`
- Root cause: BM25 ranked fastapi_query_params.md above fastapi_path_params.md due to shared vocabulary ("parameters", "FastAPI"). The correct source was retrieved (R@5 = 1.00) but at rank 2, diluting precision. Fix: cross-encoder reranking or query-specific term weighting would help disambiguate (see the sketch below).
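
A minimal sketch of the reranking fix using the sentence-transformers `CrossEncoder` class. The model name is a widely used public reranker, not something this repo ships, and the chunk dict shape is assumed:

```python
from sentence_transformers import CrossEncoder

# A common public MS MARCO reranker; any cross-encoder checkpoint works here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=5):
    """Re-score retrieval candidates jointly with the query.

    Unlike BM25's bag-of-words overlap, the cross-encoder reads query and
    chunk together, so "path parameter" vs. "query parameter" is separable.
    """
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The config snapshot below already carries a `reranker.enabled: false` switch, which would be the natural place to wire a step like this in.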

## Per-Question Results

| ID | Cat | Diff | P@5 | R@5 | KHR | Citation | Refusal | Calc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q001 | retrieval | easy | 0.20 | 1.00 | 0.75 | 1.00 | PASS | PASS |
| q002 | retrieval | easy | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q003 | retrieval | easy | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q004 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q005 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q006 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q007 | calculation | medium | 0.00 | 0.00 | 0.75 | 1.00 | PASS | PASS |
| q008 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
| q009 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
| q010 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
| q011 | retrieval | easy | 0.80 | 1.00 | 0.67 | 1.00 | PASS | PASS |
| q012 | retrieval | easy | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q013 | retrieval | easy | 0.20 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q014 | retrieval | easy | 0.40 | 1.00 | 0.33 | 1.00 | PASS | PASS |
| q015 | retrieval | medium | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q016 | retrieval | medium | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q017 | retrieval | medium | 0.80 | 1.00 | 0.50 | 1.00 | PASS | PASS |
| q018 | retrieval | medium | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q019 | retrieval | medium | 1.00 | 1.00 | 0.75 | 1.00 | PASS | PASS |
| q020 | calculation | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | FAIL |
| q021 | calculation | easy | 0.00 | 0.00 | 1.00 | 1.00 | PASS | PASS |
| q022 | retrieval | hard | 0.60 | 0.33 | 0.75 | 1.00 | PASS | PASS |
| q023 | retrieval | hard | 1.00 | 0.67 | 1.00 | 1.00 | PASS | PASS |
| q024 | retrieval | hard | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
| q025 | retrieval | hard | 1.00 | 0.33 | 1.00 | 1.00 | PASS | PASS |
| q026 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
| q027 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |

## Configuration Snapshot

```yaml
agent:
  max_iterations: 3
  temperature: 0.0
embedding:
  cache_dir: .cache/embeddings
  model: all-MiniLM-L6-v2
evaluation:
  golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json
  judge_provider: openai
provider:
  default: openai
  models:
    claude-sonnet-4-20250514:
      input_cost_per_mtok: 3.0
      output_cost_per_mtok: 15.0
    gpt-4o-mini:
      input_cost_per_mtok: 0.15
      output_cost_per_mtok: 0.6
rag:
  chunking:
    chunk_overlap: 64
    chunk_size: 512
    strategy: recursive
  reranker:
    enabled: false
  retrieval:
    candidates_per_system: 10
    rrf_k: 60
    strategy: hybrid
    top_k: 5
  store_path: .cache/store
serving:
  host: 0.0.0.0
  port: 8000
  request_timeout_seconds: 30
```
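
For context on the retrieval settings above: `strategy: hybrid` with `rrf_k: 60` implies reciprocal rank fusion of the lexical and vector candidate lists (10 candidates per system, top 5 kept). A minimal sketch of that fusion, with illustrative document IDs:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_k=5):
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Illustrative file names only. fastapi_path_params.md (ranks 2 and 1)
# edges out fastapi_query_params.md (ranks 1 and 3): 1/62 + 1/61 > 1/61 + 1/63.
bm25 = ["fastapi_query_params.md", "fastapi_path_params.md", "fastapi_testing.md"]
vector = ["fastapi_path_params.md", "fastapi_testing.md", "fastapi_query_params.md"]
print(rrf_fuse([bm25, vector], k=60, top_k=5))
```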