Nomearod Claude Opus 4.6 (1M context) committed on
Commit b1863d1 · 1 Parent(s): 81ac43f

docs: add langchain baseline comparison to README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. README.md +20 -5
README.md CHANGED
@@ -6,7 +6,7 @@ Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAIS
 
 Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
 
-`145 tests` | `27-question benchmark` | `2 providers` | `Docker ready` | `CI green`
+`169 tests` | `27-question benchmark` | `2 providers` | `Docker ready` | `CI green`
 
 ## Benchmark Results
 
@@ -34,6 +34,20 @@ Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Prov
 
 [Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
 
+### Framework Comparison: Custom vs. LangChain
+
+To quantify what the custom orchestration layer buys over an off-the-shelf framework, the same retrieval stack is wired through a LangChain `AgentExecutor` baseline and evaluated on the identical 27-question golden dataset across both providers.
+
+| Metric | Custom OpenAI | Custom Anthropic | LC OpenAI | LC Anthropic |
+|--------|--------------|-----------------|-----------|-------------|
+| P@5 | 0.70 | 0.74 | 0.64 | **0.75** |
+| R@5 | 0.83 | **0.84** | **0.86** | **0.84** |
+| KHR | 0.89 | **0.92** | 0.85 | 0.91 |
+| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
+| Latency p50 | **4,690 ms** | **4,690 ms** | 10,118 ms | 7,165 ms |
+
+Retrieval quality is comparable across all four configurations (shared stack), but the custom pipeline runs at **~2x lower latency** by eliminating framework overhead. Zero hallucinated citations in all configurations. Full analysis: [comparison report](results/comparison_custom_vs_langchain.md).
+
 ## Live Demo
 
 **https://nomearod-agentbench.hf.space** (Hugging Face Spaces — first request after idle may take ~30s for cold start)
@@ -135,9 +149,10 @@ Response:
 ## Evaluation
 
 ```bash
-make evaluate-fast # Deterministic metrics only (needs API key)
-make evaluate-full # + LLM-judge metrics (costs more)
-make benchmark # Generate markdown report from results
+make evaluate-fast # Deterministic metrics only (needs API key)
+make evaluate-full # + LLM-judge metrics (costs more)
+make benchmark # Generate markdown report from results
+make evaluate-langchain # Run LangChain baseline comparison
 ```
 
 The golden dataset contains 27 hand-crafted questions:
@@ -148,7 +163,7 @@ The golden dataset contains 27 hand-crafted questions:
 ## Testing
 
 ```bash
-make test # 145 deterministic tests, no API keys needed
+make test # 169 deterministic tests, no API keys needed
 make lint # ruff + mypy
 ```
 
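
The "~2x lower latency" figure in the added README paragraph can be sanity-checked against the p50 values in the same table. The snippet below is a minimal sketch of that arithmetic only: the dictionary layout and the `slowdown` helper are illustrative names, not part of the repository, and the numbers are copied from the table in the diff.

```python
# p50 latencies in milliseconds, copied from the comparison table in the diff.
# The (pipeline, provider) key layout is an assumption made for this sketch.
P50_MS = {
    ("custom", "openai"): 4690,
    ("custom", "anthropic"): 4690,
    ("langchain", "openai"): 10118,
    ("langchain", "anthropic"): 7165,
}


def slowdown(provider: str) -> float:
    """Return LangChain p50 divided by custom p50 for one provider."""
    return P50_MS[("langchain", provider)] / P50_MS[("custom", provider)]


# Ratio per provider, rounded to two decimals for readability.
ratios = {p: round(slowdown(p), 2) for p in ("openai", "anthropic")}
print(ratios)  # {'openai': 2.16, 'anthropic': 1.53}
```

On these figures the LangChain baseline is roughly 2.2x slower on OpenAI and 1.5x slower on Anthropic, so "~2x" describes the OpenAI pair more closely than the Anthropic one.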