Spaces:
Running
Running
docs: add provider comparison report (OpenAI vs Anthropic Haiku)
Browse filesSame 27-question eval, same RAG pipeline, different LLM. Haiku
outperforms gpt-4o-mini on retrieval metrics (P@5 0.74 vs 0.70)
at ~1.75x cost. Linked from README.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README.md +1 -1
- docs/provider_comparison.md +64 -0
README.md
CHANGED
|
@@ -32,7 +32,7 @@ Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Prov
|
|
| 32 |
| Grounded Refusal | 0/5 | **Active** | Score threshold gate |
|
| 33 |
| Cost per query | $0.0004 | $0.0004 | gpt-4o-mini baseline |
|
| 34 |
|
| 35 |
-
[Full benchmark report](docs/benchmark_report.md) | [Design decisions](DECISIONS.md)
|
| 36 |
|
| 37 |
## Live Demo
|
| 38 |
|
|
|
|
| 32 |
| Grounded Refusal | 0/5 | **Active** | Score threshold gate |
|
| 33 |
| Cost per query | $0.0004 | $0.0004 | gpt-4o-mini baseline |
|
| 34 |
|
| 35 |
+
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
|
| 36 |
|
| 37 |
## Live Demo
|
| 38 |
|
docs/provider_comparison.md
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Provider Comparison — OpenAI vs Anthropic
|
| 2 |
+
|
| 3 |
+
Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files.
|
| 4 |
+
Both providers use the same RAG pipeline: hybrid retrieval (FAISS + BM25 + RRF),
|
| 5 |
+
cross-encoder reranking, grounded refusal threshold, and identical system prompt.
|
| 6 |
+
|
| 7 |
+
**The only difference is the LLM provider.** Everything else is controlled.
|
| 8 |
+
|
| 9 |
+
## Models
|
| 10 |
+
|
| 11 |
+
| Provider | Model | Context | Pricing (input/output per 1M tokens) |
|
| 12 |
+
|----------|-------|---------|--------------------------------------|
|
| 13 |
+
| OpenAI | gpt-4o-mini | 128K | $0.15 / $0.60 |
|
| 14 |
+
| Anthropic | claude-haiku-4-5 | 200K | $0.80 / $4.00 |
|
| 15 |
+
|
| 16 |
+
## Retrieval Metrics
|
| 17 |
+
|
| 18 |
+
| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku | Delta |
|
| 19 |
+
|--------|-------------------|----------------------|-------|
|
| 20 |
+
| Retrieval P@5 | 0.70 | **0.74** | +0.04 |
|
| 21 |
+
| Retrieval R@5 | 0.83 | **0.84** | +0.01 |
|
| 22 |
+
| Keyword Hit Rate | 0.89 | **0.92** | +0.03 |
|
| 23 |
+
|
| 24 |
+
Haiku outperforms gpt-4o-mini on all retrieval metrics. The improvement
|
| 25 |
+
in P@5 (0.70 → 0.74) suggests Haiku generates more precise search queries,
|
| 26 |
+
which the cross-encoder reranker then amplifies.
|
| 27 |
+
|
| 28 |
+
## Cost
|
| 29 |
+
|
| 30 |
+
| Metric | OpenAI gpt-4o-mini | Anthropic claude-haiku |
|
| 31 |
+
|--------|-------------------|----------------------|
|
| 32 |
+
| Cost per query | **$0.0004** | $0.0007 |
|
| 33 |
+
| Full eval (27 questions) | **~$0.01** | ~$0.02 |
|
| 34 |
+
|
| 35 |
+
OpenAI is ~1.75x cheaper per query. Both are negligible for a demo.
|
| 36 |
+
|
| 37 |
+
## Qualitative Observations
|
| 38 |
+
|
| 39 |
+
- **Tool use**: Both providers correctly use the `search_documents` tool on retrieval
|
| 40 |
+
questions and the `calculator` tool on calculation questions.
|
| 41 |
+
- **Refusal**: Both providers follow the system prompt instruction to refuse when the
|
| 42 |
+
search tool returns "No relevant documents found." The refusal threshold gate fires
|
| 43 |
+
identically since it operates on retrieval scores before the LLM is invoked.
|
| 44 |
+
- **Citation format**: Both providers follow the `[source: filename.md]` citation format
|
| 45 |
+
specified in the system prompt.
|
| 46 |
+
- **Answer quality**: Haiku tends to produce more structured answers (numbered lists,
|
| 47 |
+
code examples) while gpt-4o-mini is more concise. Both are accurate.
|
| 48 |
+
|
| 49 |
+
## How to Reproduce
|
| 50 |
+
|
| 51 |
+
```bash
|
| 52 |
+
# OpenAI evaluation (default config)
|
| 53 |
+
OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic
|
| 54 |
+
|
| 55 |
+
# Anthropic evaluation
|
| 56 |
+
ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
## Takeaway
|
| 60 |
+
|
| 61 |
+
The provider abstraction works as designed — switching from OpenAI to Anthropic is a
|
| 62 |
+
single config change (`provider.default: anthropic`). The orchestrator, tools, evaluation
|
| 63 |
+
harness, and serving layer are completely unchanged. Both providers produce competitive
|
| 64 |
+
results on the same benchmark.
|