Spaces:
Running
Running
docs: add langchain baseline comparison to README
Browse filesCo-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md
CHANGED
|
@@ -6,7 +6,7 @@ Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAIS
|
|
| 6 |
|
| 7 |
Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
|
| 8 |
|
| 9 |
-
`
|
| 10 |
|
| 11 |
## Benchmark Results
|
| 12 |
|
|
@@ -34,6 +34,20 @@ Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Prov
|
|
| 34 |
|
| 35 |
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
|
| 36 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
## Live Demo
|
| 38 |
|
| 39 |
**https://nomearod-agentbench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
|
|
@@ -135,9 +149,10 @@ Response:
|
|
| 135 |
## Evaluation
|
| 136 |
|
| 137 |
```bash
|
| 138 |
-
make evaluate-fast
|
| 139 |
-
make evaluate-full
|
| 140 |
-
make benchmark
|
|
|
|
| 141 |
```
|
| 142 |
|
| 143 |
The golden dataset contains 27 hand-crafted questions:
|
|
@@ -148,7 +163,7 @@ The golden dataset contains 27 hand-crafted questions:
|
|
| 148 |
## Testing
|
| 149 |
|
| 150 |
```bash
|
| 151 |
-
make test #
|
| 152 |
make lint # ruff + mypy
|
| 153 |
```
|
| 154 |
|
|
|
|
| 6 |
|
| 7 |
Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
|
| 8 |
|
| 9 |
+
`169 tests` | `27-question benchmark` | `2 providers` | `Docker ready` | `CI green`
|
| 10 |
|
| 11 |
## Benchmark Results
|
| 12 |
|
|
|
|
| 34 |
|
| 35 |
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
|
| 36 |
|
| 37 |
+
### Framework Comparison: Custom vs. LangChain
|
| 38 |
+
|
| 39 |
+
To quantify what the custom orchestration layer buys over an off-the-shelf framework, the same retrieval stack is wired through a LangChain `AgentExecutor` baseline and evaluated on the identical 27-question golden dataset across both providers.
|
| 40 |
+
|
| 41 |
+
| Metric | Custom OpenAI | Custom Anthropic | LC OpenAI | LC Anthropic |
|
| 42 |
+
|--------|--------------|-----------------|-----------|-------------|
|
| 43 |
+
| P@5 | 0.70 | 0.74 | 0.64 | **0.75** |
|
| 44 |
+
| R@5 | 0.83 | **0.84** | **0.86** | **0.84** |
|
| 45 |
+
| KHR | 0.89 | **0.92** | 0.85 | 0.91 |
|
| 46 |
+
| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
|
| 47 |
+
| Latency p50 | **4,690 ms** | **4,690 ms** | 10,118 ms | 7,165 ms |
|
| 48 |
+
|
| 49 |
+
Retrieval quality is comparable across all four configurations (shared stack), but the custom pipeline runs at **~2x lower latency** by eliminating framework overhead. Zero hallucinated citations in all configurations. Full analysis: [comparison report](results/comparison_custom_vs_langchain.md).
|
| 50 |
+
|
| 51 |
## Live Demo
|
| 52 |
|
| 53 |
**https://nomearod-agentbench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
|
|
|
|
| 149 |
## Evaluation
|
| 150 |
|
| 151 |
```bash
|
| 152 |
+
make evaluate-fast # Deterministic metrics only (needs API key)
|
| 153 |
+
make evaluate-full # + LLM-judge metrics (costs more)
|
| 154 |
+
make benchmark # Generate markdown report from results
|
| 155 |
+
make evaluate-langchain # Run LangChain baseline comparison
|
| 156 |
```
|
| 157 |
|
| 158 |
The golden dataset contains 27 hand-crafted questions:
|
|
|
|
| 163 |
## Testing
|
| 164 |
|
| 165 |
```bash
|
| 166 |
+
make test # 169 deterministic tests, no API keys needed
|
| 167 |
make lint # ruff + mypy
|
| 168 |
```
|
| 169 |
|