docs: sharpen README narrative for clarity
- Surface framework comparison insight as blockquote
- Reframe Mistral-7B result as architectural finding (model-size
floor for agentic workflows) rather than just poor numbers
- Rename "Skills Demonstrated" to "Engineering Scope"
- Uncollapse V1->V2 evolution table, update to include infra sprint
- Update GitHub repo description
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md
CHANGED
|
@@ -20,11 +20,9 @@ Evaluated on 27 hand-crafted questions over 16 FastAPI documentation files. Both
|
|
| 20 |
| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
|
| 21 |
| Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 |
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
The main cost of framework abstraction is visible in the Anthropic configuration: LangChain's AgentExecutor makes additional intermediate LLM calls, resulting in 6.6x higher per-query cost ($0.0046 vs $0.0007) with no retrieval quality improvement.
|
| 28 |
|
| 29 |
Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
|
| 30 |
|
|
@@ -39,7 +37,7 @@ Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
|
|
| 39 |
| Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
|
| 40 |
| Cost per query | **$0.0004** | $0.0007 | $0.0031 |
|
| 41 |
|
| 42 |
-
API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window
|
| 43 |
|
| 44 |
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
|
| 45 |
|
|
@@ -136,7 +134,7 @@ flowchart LR
|
|
| 136 |
end
|
| 137 |
```
|
| 138 |
|
| 139 |
-
## Skills Demonstrated
|
| 140 |
|
| 141 |
- **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
|
| 142 |
- **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
|
|
@@ -213,21 +211,17 @@ All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads
|
|
| 213 |
|
| 214 |
See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
|
| 215 |
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
| Feature | V1 | V2 |
|
| 219 |
-
|---------|----|----|
|
| 220 |
-
| Grounded refusal | 0/5 | Threshold gate |
|
| 221 |
-
| Retrieval P@5 | 0.70 | 0.74 |
|
| 222 |
-
| Provider support | OpenAI only | OpenAI + Anthropic |
|
| 223 |
-
| Provider resilience | None | Retry + backoff |
|
| 224 |
-
| Rate limiting | None | 10 RPM per IP |
|
| 225 |
-
| Streaming | None | SSE (`/ask/stream`) |
|
| 226 |
-
| Conversation memory | Stateless | SQLite sessions |
|
| 227 |
-
|
|
| 228 |
-
| CI/CD | None | GitHub Actions |
|
| 229 |
-
| Tests | 97 | 205 |
|
| 230 |
-
|
| 231 |
-
See [DECISIONS.md](DECISIONS.md) for the reasoning behind each design choice.
|
| 232 |
-
|
| 233 |
-
</details>
|
|
|
|
| 20 |
| Citation Acc | 1.00 | 1.00 | 1.00 | 1.00 |
|
| 21 |
| Cost/query | **$0.0004** | $0.0007 | $0.0003 | $0.0046 |
|
| 22 |
|
| 23 |
+
> **Key insight:** Retrieval quality is dominated by the shared retrieval stack (FAISS + BM25 + RRF + cross-encoder), not the orchestration layer. P@5 and R@5 vary by less than 0.12 across all four configurations. The main cost of framework abstraction is visible in LangChain's Anthropic path: 6.6x higher per-query cost with no retrieval improvement.
|
| 24 |
|
| 25 |
+
Citation accuracy is 1.00 everywhere, confirming the retrieval-grounded approach prevents hallucination regardless of framework or provider choice.
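
As a quick sanity check on the 6.6x figure, the ratio can be recomputed from the Cost/query row. This is only an illustrative sketch: the variable names are hypothetical, and the mapping of $0.0046 to the LangChain Anthropic path and $0.0007 to the custom Anthropic path is taken from the comparison sentence in the previous revision of this README.

```python
# Per-query costs in USD, copied from the Cost/query row.
# Assumption: 0.0046 is LangChain + Anthropic, 0.0007 is custom + Anthropic.
langchain_anthropic_cost = 0.0046
custom_anthropic_cost = 0.0007

# Ratio of framework cost to custom-loop cost for the same provider.
cost_ratio = langchain_anthropic_cost / custom_anthropic_cost
print(f"LangChain / custom cost ratio: {cost_ratio:.1f}x")
```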
|
|
|
|
|
|
|
| 26 |
|
| 27 |
Full analysis: [comparison report](results/comparison_custom_vs_langchain.md)
|
| 28 |
|
|
|
|
| 37 |
| Latency p50 | 4,690 ms | 5,120 ms | 6,709 ms |
|
| 38 |
| Cost per query | **$0.0004** | $0.0007 | $0.0031 |
|
| 39 |
|
| 40 |
+
API providers are directly comparable (same config). The self-hosted row uses `max_iterations=1` and `top_k=3` (vs 3/5 for API) to fit Mistral-7B's 8K context window. Mistral-7B's context constraint forces single-iteration retrieval with fewer chunks, demonstrating that agentic tool-calling workflows have a practical model-size floor: a genuine architectural finding, not a system failure. See [provider comparison](docs/provider_comparison.md) for full analysis.
|
| 41 |
|
| 42 |
[Full benchmark report](docs/benchmark_report.md) | [Provider comparison](docs/provider_comparison.md) | [Design decisions](DECISIONS.md)
|
| 43 |
|
|
|
|
| 134 |
end
|
| 135 |
```
|
| 136 |
|
| 137 |
+
## Engineering Scope
|
| 138 |
|
| 139 |
- **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
|
| 140 |
- **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
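
For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the general technique of merging a dense ranking and a sparse ranking. It is not the project's actual implementation: the function name, the document IDs, and the constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in, with
    ranks starting at 1; the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to doc_id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists standing in for FAISS (dense) and BM25 (sparse).
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only rank positions, the dense and sparse scores never need to be put on a common scale, which is the usual argument for preferring it over score normalization.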
|
|
|
|
| 211 |
|
| 212 |
See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
|
| 213 |
|
| 214 |
+
### V1 → V2 Evolution
|
| 215 |
+
|
| 216 |
+
| Feature | V1 | V2 |
|
| 217 |
+
|---------|----|----|
|
| 218 |
+
| Grounded refusal | 0/5 | Threshold gate |
|
| 219 |
+
| Retrieval P@5 | 0.70 | 0.74 (cross-encoder reranking) |
|
| 220 |
+
| Provider support | OpenAI only | OpenAI + Anthropic + self-hosted vLLM |
|
| 221 |
+
| Provider resilience | None | Retry + backoff |
|
| 222 |
+
| Rate limiting | None | 10 RPM per IP |
|
| 223 |
+
| Streaming | None | SSE (`/ask/stream`) |
|
| 224 |
+
| Conversation memory | Stateless | SQLite sessions |
|
| 225 |
+
| Infrastructure | Local only | Docker, K8s (Helm), Terraform (GKE), Modal |
|
| 226 |
+
| CI/CD | None | GitHub Actions |
|
| 227 |
+
| Tests | 97 | 205 |
|
|
|
|
|
|
|
|
|
|
|
|