Nomearod Claude Opus 4.6 (1M context) commited on
Commit
f0224d3
Β·
1 Parent(s): a9d4375

docs: sharpen README narrative for clarity

Browse files

- Surface framework comparison insight as blockquote
- Reframe Mistral-7B result as architectural finding (model-size
floor for agentic workflows) rather than just poor numbers
- Rename "Skills Demonstrated" to "Engineering Scope"
- Uncollapse V1->V2 evolution table, update to include infra sprint
- Update GitHub repo description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1) hide show
  1. README.md +18 -24
README.md CHANGED
@@ -20,11 +20,9 @@ Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Both
20
  | Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
21
  | Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 |
22
 
23
- #### Key Findings
24
 
25
- Retrieval quality is dominated by the shared retrieval stack, not the orchestration layer β€” P@5 and R@5 vary by less than 0.12 across all four configurations. Citation accuracy is 1.00 everywhere, confirming the retrieval-grounded approach prevents hallucination regardless of framework choice.
26
-
27
- The main cost of framework abstraction is visible in the Anthropic configuration: LangChain's AgentExecutor makes additional intermediate LLM calls, resulting in 6.6x higher per-query cost ($0.0046 vs $0.0007) with no retrieval quality improvement.
28
 
29
  Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
30
 
@@ -39,7 +37,7 @@ Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
39
  | Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
40
  | Cost per query | **$0.0004** | $0.0007 | $0.0031 |
41
 
42
- API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window β€” not an apples-to-apples comparison, but reflects realistic 7B operating constraints. See [provider comparison](docs/provider_comparison.md) for full analysis.
43
 
44
  [Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
45
 
@@ -136,7 +134,7 @@ flowchart LR
136
  end
137
  ```
138
 
139
- ## Skills Demonstrated
140
 
141
  - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
142
  - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
@@ -213,21 +211,17 @@ All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads
213
 
214
  See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
215
 
216
- <details><summary>V1 β†’ V2 Evolution</summary>
217
-
218
- | Feature | V1 | V2 | Skill Demonstrated |
219
- |---------|----|----|-------------------|
220
- | Grounded refusal | 0/5 | Threshold gate | Trust & safety |
221
- | Retrieval P@5 | 0.70 | 0.74 | Cross-encoder reranking |
222
- | Provider support | OpenAI only | OpenAI + Anthropic | Multi-provider abstraction |
223
- | Provider resilience | None | Retry + backoff | Error handling |
224
- | Rate limiting | None | 10 RPM per IP | API hardening |
225
- | Streaming | None | SSE (`/ask/stream`) | Async Python, real-time UX |
226
- | Conversation memory | Stateless | SQLite sessions | State management |
227
- | Cloud deployment | None | HF Spaces (Docker) | Docker β†’ production |
228
- | CI/CD | None | GitHub Actions | Automated quality gates |
229
- | Tests | 97 | 205 | Comprehensive coverage |
230
-
231
- See [DECISIONS.md](DECISIONS.md) for the reasoning behind each design choice.
232
-
233
- </details>
 
20
  | Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
21
  | Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 |
22
 
23
+ > **Key insight:** Retrieval quality is dominated by the shared retrieval stack (FAISS + BM25 + RRF + cross-encoder), not the orchestration layer. P@5 and R@5 vary by less than 0.12 across all four configurations. The main cost of framework abstraction is visible in LangChain's Anthropic path: 6.6x higher per-query cost with no retrieval improvement.
24
 
25
+ Citation accuracy is 1.00 everywhere, confirming the retrieval-grounded approach prevents hallucination regardless of framework or provider choice.
 
 
26
 
27
  Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
28
 
 
37
  | Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
38
  | Cost per query | **$0.0004** | $0.0007 | $0.0031 |
39
 
40
+ API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window. Mistral-7B's context constraint forces single-iteration retrieval with fewer chunks, demonstrating that agentic tool-calling workflows have a practical model-size floor β€” a genuine architectural finding, not a system failure. See [provider comparison](docs/provider_comparison.md) for full analysis.
41
 
42
  [Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
43
 
 
134
  end
135
  ```
136
 
137
+ ## Engineering Scope
138
 
139
  - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
140
  - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
 
211
 
212
  See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
213
 
214
+ ### V1 β†’ V2 Evolution
215
+
216
+ | Feature | V1 | V2 |
217
+ |---------|----|----|
218
+ | Grounded refusal | 0/5 | Threshold gate |
219
+ | Retrieval P@5 | 0.70 | 0.74 (cross-encoder reranking) |
220
+ | Provider support | OpenAI only | OpenAI + Anthropic + self-hosted vLLM |
221
+ | Provider resilience | None | Retry + backoff |
222
+ | Rate limiting | None | 10 RPM per IP |
223
+ | Streaming | None | SSE (`/ask/stream`) |
224
+ | Conversation memory | Stateless | SQLite sessions |
225
+ | Infrastructure | Local only | Docker, K8s (Helm), Terraform (GKE), Modal |
226
+ | CI/CD | None | GitHub Actions |
227
+ | Tests | 97 | 205 |