Spaces: Nomearod/agentbench


Nomearod, Claude Opus 4.6 (1M context) committed on Mar 27

Commit

81ac43f

1 Parent(s): 385bc4b

feat: langchain baseline evaluation results (OpenAI + Anthropic)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3)

docs/langchain_benchmark_anthropic.md +99 -0
docs/langchain_benchmark_openai.md +99 -0
results/comparison_custom_vs_langchain.md +49 -0

docs/langchain_benchmark_anthropic.md ADDED

	@@ -0,0 +1,99 @@

+# Benchmark Results — Technical Documentation Q&A
+**Provider:** langchain-anthropic | **Corpus:** 16 markdown files
+## Aggregate Metrics
+| Metric | Value |
+|--------|-------|
+| Retrieval P@5 | 0.75 |
+| Retrieval R@5 | 0.84 |
+| Keyword Hit Rate | 0.91 |
+| Source Citation Rate | 21/22 |
+| Citation Accuracy | 1.00 |
+| Grounded Refusal Rate | 0/5 |
+| Calculator Accuracy | 3/3 |
+| Latency p50 | 7,165 ms |
+| Latency p95 | 27,046 ms |
+| Cost per query | $0.0046 |
+## By Category
+| Category | Count | P@5 | R@5 | Keyword Hit | Refusal |
+|----------|-------|-----|-----|-------------|---------|
+| retrieval | 19 | 0.76 | 0.87 | 0.90 | n/a |
+| calculation | 3 | 0.67 | 0.67 | 0.92 | n/a |
+| out_of_scope | 5 | n/a | n/a | n/a | 0/5 |
+## By Difficulty
+| Difficulty | Count | P@5 | R@5 | Keyword Hit |
+|-----------|-------|-----|-----|-------------|
+| easy | 13 | 0.73 | 1.00 | 0.93 |
+| medium | 10 | 0.76 | 0.90 | 0.88 |
+| hard | 4 | 0.75 | 0.37 | 0.94 |
+## Chunking Strategy Comparison
+| Strategy | Note |
+|----------|------|
+| Recursive (default) | Used for this benchmark run |
+| Fixed-size | Available via `--chunk-strategy fixed` in ingest. Re-run evaluation to compare. |
+_To generate a comparison, run `make ingest` with each strategy and `make evaluate-fast` for each, then compare the results JSON files._
+## Failure Analysis (3 worst queries)
+**q007: "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?"**
+- Retrieval P@5: 0.00
+- Retrieval R@5: 0.00
+- Keyword Hit Rate: 0.75
+- Retrieved: []
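+- Root cause: the report generator's default note ("MockProvider returned canned answer") does not fit this run, which used a real provider and retrieved no sources (`Retrieved: []`), so P@5 and R@5 score 0.00 _(manual analysis needed for real provider runs)_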
+**q001: "How do you define a path parameter in FastAPI?"**
+- Retrieval P@5: 0.20
+- Retrieval R@5: 1.00
+- Keyword Hit Rate: 0.75
+- Retrieved: ['fastapi_path_params.md', 'fastapi_request_body.md', 'fastapi_query_params.md']
+- Root cause: _(manual analysis needed for real provider runs)_
+**q015: "How does FastAPI manage application configuration and environment variables?"**
+- Retrieval P@5: 0.20
+- Retrieval R@5: 1.00
+- Keyword Hit Rate: 1.00
+- Retrieved: ['fastapi_configuration.md', 'fastapi_openapi.md', 'fastapi_intro.md']
+- Root cause: _(manual analysis needed for real provider runs)_
+## Per-Question Results
+| ID | Cat | Diff | P@5 | R@5 | KHR | Citation | Refusal | Calc |
+|----|-----|------|-----|-----|-----|----------|---------|------|
+| q001 | retrieval | easy | 0.20 | 1.00 | 0.75 | 1.00 | PASS | PASS |
+| q002 | retrieval | easy | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q003 | retrieval | easy | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q004 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q005 | retrieval | medium | 1.00 | 1.00 | 0.00 | 1.00 | PASS | PASS |
+| q006 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q007 | calculation | medium | 0.00 | 0.00 | 0.75 | 1.00 | PASS | PASS |
+| q008 | out_of_scope | easy | n/a | n/a | 0.33 | n/a | FAIL | PASS |
+| q009 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
+| q010 | out_of_scope | easy | n/a | n/a | 0.33 | n/a | FAIL | PASS |
+| q011 | retrieval | easy | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q012 | retrieval | easy | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q013 | retrieval | easy | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q014 | retrieval | easy | 1.00 | 1.00 | 0.67 | 1.00 | PASS | PASS |
+| q015 | retrieval | medium | 0.20 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q016 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q017 | retrieval | medium | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q018 | retrieval | medium | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q019 | retrieval | medium | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q020 | calculation | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q021 | calculation | easy | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q022 | retrieval | hard | 0.60 | 0.33 | 0.75 | 1.00 | PASS | PASS |
+| q023 | retrieval | hard | 1.00 | 0.33 | 1.00 | 1.00 | PASS | PASS |
+| q024 | retrieval | hard | 0.40 | 0.50 | 1.00 | 1.00 | PASS | PASS |
+| q025 | retrieval | hard | 1.00 | 0.33 | 1.00 | 1.00 | PASS | PASS |
+| q026 | out_of_scope | easy | n/a | n/a | 0.33 | n/a | FAIL | PASS |
+| q027 | out_of_scope | easy | n/a | n/a | 0.33 | n/a | FAIL | PASS |
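
Note on the P@5 and R@5 columns above: the report does not show how they are computed. The following is a minimal sketch, assuming P@k divides the number of relevant hits by k (not by the number of results returned) and R@k divides the distinct relevant sources found by the number of expected sources. The function name `precision_recall_at_k` and the expected-source set for q001 are hypothetical; only the `Retrieved` list comes from the failure analysis.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int = 5
) -> Tuple[float, float]:
    """Return (P@k, R@k) for one query.

    Assumptions (not confirmed by the report): precision divides by k,
    recall counts distinct relevant sources, and an empty expected set
    yields a recall of 0.0.
    """
    top_k = list(retrieved)[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    found = set(top_k) & relevant
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


# Retrieved list copied from the q001 failure analysis.
retrieved_q001 = [
    "fastapi_path_params.md",
    "fastapi_request_body.md",
    "fastapi_query_params.md",
]
# Hypothetical golden sources for q001 (the dataset itself is not shown).
expected_q001 = {"fastapi_path_params.md"}

p_at_5, r_at_5 = precision_recall_at_k(retrieved_q001, expected_q001)
print(f"P@5={p_at_5:.2f} R@5={r_at_5:.2f}")
```

Under these assumptions, one relevant file among the retrieved results gives P@5 = 0.20 and R@5 = 1.00, which agrees with the q001 row. An empty `Retrieved` list, as in q007, scores 0.00 on both.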

docs/langchain_benchmark_openai.md ADDED

	@@ -0,0 +1,99 @@

+# Benchmark Results — Technical Documentation Q&A
+**Provider:** langchain-openai | **Corpus:** 16 markdown files
+## Aggregate Metrics
+| Metric | Value |
+|--------|-------|
+| Retrieval P@5 | 0.64 |
+| Retrieval R@5 | 0.86 |
+| Keyword Hit Rate | 0.85 |
+| Source Citation Rate | 20/22 |
+| Citation Accuracy | 1.00 |
+| Grounded Refusal Rate | 0/5 |
+| Calculator Accuracy | 3/3 |
+| Latency p50 | 10,118 ms |
+| Latency p95 | 18,084 ms |
+| Cost per query | $0.0003 |
+## By Category
+| Category | Count | P@5 | R@5 | Keyword Hit | Refusal |
+|----------|-------|-----|-----|-------------|---------|
+| retrieval | 19 | 0.68 | 0.95 | 0.89 | n/a |
+| calculation | 3 | 0.33 | 0.33 | 0.58 | n/a |
+| out_of_scope | 5 | n/a | n/a | n/a | 0/5 |
+## By Difficulty
+| Difficulty | Count | P@5 | R@5 | Keyword Hit |
+|-----------|-------|-----|-----|-------------|
+| easy | 13 | 0.55 | 0.88 | 0.84 |
+| medium | 10 | 0.66 | 0.90 | 0.82 |
+| hard | 4 | 0.75 | 0.75 | 0.94 |
+## Chunking Strategy Comparison
+| Strategy | Note |
+|----------|------|
+| Recursive (default) | Used for this benchmark run |
+| Fixed-size | Available via `--chunk-strategy fixed` in ingest. Re-run evaluation to compare. |
+_To generate a comparison, run `make ingest` with each strategy and `make evaluate-fast` for each, then compare the results JSON files._
+## Failure Analysis (3 worst queries)
+**q007: "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?"**
+- Retrieval P@5: 0.00
+- Retrieval R@5: 0.00
+- Keyword Hit Rate: 0.75
+- Retrieved: []
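+- Root cause: the report generator's default note ("MockProvider returned canned answer") does not fit this run, which used a real provider and retrieved no sources (`Retrieved: []`), so P@5 and R@5 score 0.00 _(manual analysis needed for real provider runs)_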
+**q021: "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?"**
+- Retrieval P@5: 0.00
+- Retrieval R@5: 0.00
+- Keyword Hit Rate: 1.00
+- Retrieved: []
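+- Root cause: the report generator's default note ("MockProvider returned canned answer") does not fit this run, which used a real provider and retrieved no sources (`Retrieved: []`), so P@5 and R@5 score 0.00 _(manual analysis needed for real provider runs)_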
+**q014: "What testing tools does FastAPI use, and what class provides the test client?"**
+- Retrieval P@5: 0.20
+- Retrieval R@5: 1.00
+- Keyword Hit Rate: 0.33
+- Retrieved: ['fastapi_testing.md', 'fastapi_openapi.md', 'fastapi_intro.md']
+- Root cause: _(manual analysis needed for real provider runs)_
+## Per-Question Results
+| ID | Cat | Diff | P@5 | R@5 | KHR | Citation | Refusal | Calc |
+|----|-----|------|-----|-----|-----|----------|---------|------|
+| q001 | retrieval | easy | 0.40 | 1.00 | 0.75 | 1.00 | PASS | PASS |
+| q002 | retrieval | easy | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q003 | retrieval | easy | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q004 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q005 | retrieval | medium | 1.00 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q006 | retrieval | medium | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q007 | calculation | medium | 0.00 | 0.00 | 0.75 | 1.00 | PASS | PASS |
+| q008 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
+| q009 | out_of_scope | easy | n/a | n/a | 0.00 | n/a | FAIL | PASS |
+| q010 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
+| q011 | retrieval | easy | 0.60 | 1.00 | 0.67 | 1.00 | PASS | PASS |
+| q012 | retrieval | easy | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q013 | retrieval | easy | 0.60 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q014 | retrieval | easy | 0.20 | 1.00 | 0.33 | 1.00 | PASS | PASS |
+| q015 | retrieval | medium | 0.20 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q016 | retrieval | medium | 0.40 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q017 | retrieval | medium | 0.80 | 1.00 | 0.75 | 1.00 | PASS | PASS |
+| q018 | retrieval | medium | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q019 | retrieval | medium | 0.80 | 1.00 | 0.75 | 1.00 | PASS | PASS |
+| q020 | calculation | medium | 1.00 | 1.00 | 0.00 | 1.00 | PASS | PASS |
+| q021 | calculation | easy | 0.00 | 0.00 | 1.00 | 1.00 | PASS | PASS |
+| q022 | retrieval | hard | 0.40 | 0.33 | 0.75 | 1.00 | PASS | PASS |
+| q023 | retrieval | hard | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q024 | retrieval | hard | 0.80 | 1.00 | 1.00 | 1.00 | PASS | PASS |
+| q025 | retrieval | hard | 1.00 | 0.67 | 1.00 | 1.00 | PASS | PASS |
+| q026 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
+| q027 | out_of_scope | easy | n/a | n/a | 0.67 | n/a | FAIL | PASS |
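
Note on the By Difficulty table for langchain-openai: its P@5 and R@5 values can be cross-checked against the per-question rows above. The sketch below assumes each value is a plain mean over the questions that have a numeric score, with `n/a` rows (the out_of_scope questions) skipped. The helper name `mean_by_difficulty` is hypothetical; the numbers are copied from the table.

```python
from collections import defaultdict
from statistics import mean

# (id, difficulty, P@5, R@5) from the langchain-openai per-question table.
# None stands for "n/a" (out_of_scope questions).
ROWS = [
    ("q001", "easy", 0.40, 1.00), ("q002", "easy", 0.80, 1.00),
    ("q003", "easy", 1.00, 1.00), ("q004", "medium", 1.00, 1.00),
    ("q005", "medium", 1.00, 1.00), ("q006", "medium", 0.60, 1.00),
    ("q007", "medium", 0.00, 0.00), ("q008", "easy", None, None),
    ("q009", "easy", None, None), ("q010", "easy", None, None),
    ("q011", "easy", 0.60, 1.00), ("q012", "easy", 0.80, 1.00),
    ("q013", "easy", 0.60, 1.00), ("q014", "easy", 0.20, 1.00),
    ("q015", "medium", 0.20, 1.00), ("q016", "medium", 0.40, 1.00),
    ("q017", "medium", 0.80, 1.00), ("q018", "medium", 0.80, 1.00),
    ("q019", "medium", 0.80, 1.00), ("q020", "medium", 1.00, 1.00),
    ("q021", "easy", 0.00, 0.00), ("q022", "hard", 0.40, 0.33),
    ("q023", "hard", 0.80, 1.00), ("q024", "hard", 0.80, 1.00),
    ("q025", "hard", 1.00, 0.67), ("q026", "easy", None, None),
    ("q027", "easy", None, None),
]


def mean_by_difficulty(rows):
    """Average P@5 and R@5 per difficulty, skipping rows marked n/a."""
    buckets = defaultdict(lambda: {"p": [], "r": []})
    for _qid, difficulty, p_at_5, r_at_5 in rows:
        if p_at_5 is None or r_at_5 is None:
            continue  # out_of_scope rows carry no retrieval score
        buckets[difficulty]["p"].append(p_at_5)
        buckets[difficulty]["r"].append(r_at_5)
    return {d: (mean(v["p"]), mean(v["r"])) for d, v in buckets.items()}


summary = mean_by_difficulty(ROWS)
for difficulty in ("easy", "medium", "hard"):
    p, r = summary[difficulty]
    print(f"{difficulty:<6} P@5={p:.2f} R@5={r:.2f}")
```

Under this assumption the means agree with the table to two decimals. One consequence worth keeping in mind when reading the table: the `easy` count of 13 includes the five out_of_scope questions, so the easy P@5 and R@5 are averages over 8 scored questions, not 13.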

results/comparison_custom_vs_langchain.md ADDED

	@@ -0,0 +1,49 @@

+# Custom Pipeline vs. LangChain Baseline
+Both pipelines use the same retrieval stack (FAISS + BM25 + RRF + cross-encoder reranker) and the same 27-question golden dataset. The only difference is the orchestration layer: custom tool-calling loop vs. LangChain's `AgentExecutor` with `create_tool_calling_agent`.
+## OpenAI gpt-4o-mini
+| Metric | Custom | LangChain | Delta |
+|--------|--------|-----------|-------|
+| Retrieval P@5 | **0.70** | 0.64 | -0.06 |
+| Retrieval R@5 | 0.83 | **0.86** | +0.03 |
+| Keyword Hit Rate | **0.89** | 0.85 | -0.04 |
+| Citation Accuracy | 1.00 | 1.00 | tied |
+| Calculator Accuracy | 2/3 | **3/3** | +1 |
+| Latency p50 | **4,690 ms** | 10,118 ms | +5,428 ms |
+| Cost per query | $0.0004 | **$0.0003** | -$0.0001 |
+## Anthropic claude-haiku-4-5
+| Metric | Custom | LangChain | Delta |
+|--------|--------|-----------|-------|
+| Retrieval P@5 | 0.74 | **0.75** | +0.01 |
+| Retrieval R@5 | **0.84** | **0.84** | tied |
+| Keyword Hit Rate | **0.92** | 0.91 | -0.01 |
+| Citation Accuracy | 1.00 | 1.00 | tied |
+| Calculator Accuracy | **3/3** | **3/3** | tied |
+| Latency p50 | **4,690 ms** | 7,165 ms | +2,475 ms |
+| Cost per query | **$0.0007** | $0.0046 | +$0.0039 |
+## Cross-Framework Summary (all 4 configurations)
+| Metric | Custom OpenAI | Custom Anthropic | LC OpenAI | LC Anthropic |
+|--------|--------------|-----------------|-----------|-------------|
+| P@5 | 0.70 | 0.74 | 0.64 | **0.75** |
+| R@5 | 0.83 | 0.84 | **0.86** | 0.84 |
+| KHR | 0.89 | **0.92** | 0.85 | 0.91 |
+| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
+| Calc | 2/3 | **3/3** | **3/3** | **3/3** |
+| Latency p50 | **4,690 ms** | **4,690 ms** | 10,118 ms | 7,165 ms |
+| Cost/query | $0.0004 | $0.0007 | **$0.0003** | $0.0046 |
+## Key Takeaways
+1. **Retrieval quality is comparable across all configurations.** The shared retrieval stack dominates: P@5 ranges from 0.64 to 0.75 and R@5 from 0.83 to 0.86, and the remaining differences likely come down to how each LLM formulates search queries.
+2. **Latency is the biggest differentiator.** At p50 the custom pipeline is about 2.2x faster than LangChain on OpenAI (4,690 ms vs. 10,118 ms) and about 1.5x faster on Anthropic (4,690 ms vs. 7,165 ms), likely due to framework overhead in prompt formatting, callback chains, and intermediate step serialization.
+3. **Zero hallucinated citations across all four configurations.** Citation accuracy is 1.00 everywhere — the retrieval-grounded approach works regardless of orchestration layer.
+4. **Anthropic slightly outperforms OpenAI on retrieval precision** in both custom (0.74 vs 0.70) and LangChain (0.75 vs 0.64), while OpenAI is cheaper per query.
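
Note on takeaways 2 and 4: the latency and cost gaps can be expressed as ratios computed directly from the tables above. This is a small sketch only; the `RESULTS` layout and the `ratios` helper are hypothetical names, and the figures are the p50 latency and cost-per-query values reported in the comparison.

```python
# (p50 latency in ms, cost per query in USD) copied from the comparison tables.
RESULTS = {
    "openai": {"custom": (4690, 0.0004), "langchain": (10118, 0.0003)},
    "anthropic": {"custom": (4690, 0.0007), "langchain": (7165, 0.0046)},
}


def ratios(provider: str) -> tuple:
    """Return (latency ratio, cost ratio) of LangChain relative to custom."""
    custom_ms, custom_usd = RESULTS[provider]["custom"]
    lc_ms, lc_usd = RESULTS[provider]["langchain"]
    return lc_ms / custom_ms, lc_usd / custom_usd


for provider in RESULTS:
    latency_x, cost_x = ratios(provider)
    print(f"{provider}: latency {latency_x:.2f}x, cost {cost_x:.2f}x")
```

A ratio above 1 means LangChain is slower or more expensive than the custom loop for that provider; a ratio below 1 means it is cheaper. Because these are single-run p50 and average figures, the ratios are indicative rather than statistically tested.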