feat: rewrite README for recruiter impact
- Opening leads with results, not constraints
- One-line purpose statement (portfolio project, AI engineering depth)
- Key numbers one-liner (97 tests, 25 commits, $0.0004/query)
- Benchmark table reordered: citation accuracy 1.00 first
- Mermaid architecture diagram (renders on GitHub)
- Cut "What This Is Not" section, compressed to one line at bottom
- Cut "Project Structure" section (redundant with folder tree)
- V2 roadmap reordered: grounded refusal 0/5 first (honest priority)
- Quick start condensed to 3 lines
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md
CHANGED
|
@@ -1,8 +1,10 @@
|
|
| 1 |
# agent-bench
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
|
| 6 |
|
| 7 |
## Benchmark Results
|
| 8 |
|
|
@@ -10,12 +12,12 @@ Evaluated on 27 hand-crafted questions (19 retrieval, 3 calculation, 5 out-of-sc
|
|
| 10 |
|
| 11 |
| Metric | Value | Notes |
|
| 12 |
|--------|-------|-------|
|
| 13 |
-
| Retrieval P@5 | **0.70** | Hybrid RRF (FAISS + BM25) |
|
| 14 |
-
| Retrieval R@5 | **0.83** | Expected sources found in top 5 |
|
| 15 |
-
| Keyword Hit Rate | **0.89** | Expected facts present in answer |
|
| 16 |
| Citation Accuracy | **1.00** | Zero hallucinated citations |
|
| 17 |
-
|
| 18 |
| Calculator Accuracy | **2/3** | LLM sometimes skips tool use |
|
|
|
|
| 19 |
| Latency p50 | 4,690 ms | gpt-4o-mini, single iteration |
|
| 20 |
| Cost per query | $0.0004 | ~$0.01 for full 27-question eval |
|
| 21 |
|
|
@@ -24,16 +26,12 @@ Evaluated on 27 hand-crafted questions (19 retrieval, 3 calculation, 5 out-of-sc
|
|
| 24 |
## Quick Start
|
| 25 |
|
| 26 |
```bash
|
| 27 |
-
# Install
|
| 28 |
-
make
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
make ingest
|
| 32 |
-
|
| 33 |
-
# Start the API server
|
| 34 |
-
make serve
|
| 35 |
|
| 36 |
-
|
| 37 |
curl -X POST http://localhost:8000/ask \
|
| 38 |
-H "Content-Type: application/json" \
|
| 39 |
-d '{"question": "How do I define a path parameter in FastAPI?"}'
|
|
@@ -43,50 +41,30 @@ curl -X POST http://localhost:8000/ask \
|
|
| 43 |
|
| 44 |
```bash
|
| 45 |
OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
|
| 46 |
-
curl -X POST http://localhost:8000/ask \
|
| 47 |
-
-H "Content-Type: application/json" \
|
| 48 |
-
-d '{"question": "How do I define a path parameter in FastAPI?"}'
|
| 49 |
```
|
| 50 |
|
| 51 |
## Architecture
|
| 52 |
|
| 53 |
-
```
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
| LLM Provider (OpenAI gpt-4o-mini)
|
| 64 |
-
| |
|
| 65 |
-
| tool_calls? ──yes──> Tool Registry
|
| 66 |
-
| | |
|
| 67 |
-
| no search_documents ──> Retriever
|
| 68 |
-
| | | |
|
| 69 |
-
| v calculator Embedder + HybridStore
|
| 70 |
-
| Final answer (FAISS + BM25 + RRF)
|
| 71 |
-
|
|
| 72 |
-
v
|
| 73 |
-
AskResponse { answer, sources[], metadata }
|
| 74 |
```
|
| 75 |
|
| 76 |
## What This Demonstrates
|
| 77 |
|
| 78 |
-
- **Agentic architecture**: Iterative tool-use loop
|
| 79 |
- **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
|
| 80 |
- **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
|
| 81 |
-
- **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis
|
| 82 |
- **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
|
| 83 |
|
| 84 |
-
## What This Is Not
|
| 85 |
-
|
| 86 |
-
- Not a framework (it's a focused demonstration)
|
| 87 |
-
- Not cloud-deployed (Docker-local is the scope)
|
| 88 |
-
- Not GPU-dependent (runs on a CPU laptop)
|
| 89 |
-
|
| 90 |
## API Endpoints
|
| 91 |
|
| 92 |
| Endpoint | Method | Description |
|
|
@@ -126,23 +104,16 @@ Response:
|
|
| 126 |
## Evaluation
|
| 127 |
|
| 128 |
```bash
|
| 129 |
-
# Deterministic metrics only (
|
| 130 |
-
make evaluate-
|
| 131 |
-
|
| 132 |
-
# Deterministic + LLM-judge metrics (costs money)
|
| 133 |
-
make evaluate-full
|
| 134 |
-
|
| 135 |
-
# Generate benchmark report
|
| 136 |
-
make benchmark
|
| 137 |
```
|
| 138 |
|
| 139 |
-
The golden dataset
|
| 140 |
- 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
|
| 141 |
- 3 calculation: questions requiring the calculator tool
|
| 142 |
- 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
|
| 143 |
|
| 144 |
-
Metrics measured: Retrieval P@5, R@5, keyword hit rate, source citation rate, citation accuracy, grounded refusal rate, calculator accuracy, latency, cost.
|
| 145 |
-
|
| 146 |
## Testing
|
| 147 |
|
| 148 |
```bash
|
|
@@ -152,32 +123,16 @@ make lint # ruff + mypy
|
|
| 152 |
|
| 153 |
All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
|
| 154 |
|
| 155 |
-
## Project Structure
|
| 156 |
-
|
| 157 |
-
```
|
| 158 |
-
agent_bench/
|
| 159 |
-
core/ # Provider abstraction, config, types
|
| 160 |
-
agents/ # Orchestrator (tool-use loop, no persistent memory)
|
| 161 |
-
tools/ # Registry, search_documents, calculator
|
| 162 |
-
rag/ # Chunker, embedder, FAISS+BM25 store, retriever
|
| 163 |
-
evaluation/ # Harness, metrics, report generator, golden dataset
|
| 164 |
-
serving/ # FastAPI app, routes, schemas, middleware
|
| 165 |
-
```
|
| 166 |
-
|
| 167 |
## Design Decisions
|
| 168 |
|
| 169 |
-
See [DECISIONS.md](DECISIONS.md) for rationale on
|
| 170 |
-
- Building from primitives (no LangChain)
|
| 171 |
-
- Reciprocal Rank Fusion over score normalization
|
| 172 |
-
- One provider in V1 with interface for extensibility
|
| 173 |
-
- Negative evaluation cases for grounded refusal
|
| 174 |
-
- Deterministic eval + optional LLM judge
|
| 175 |
|
| 176 |
## V2 Roadmap
|
| 177 |
|
| 178 |
-
- [ ]
|
| 179 |
- [ ] Cross-encoder reranking (feature-flagged, config ready)
|
| 180 |
-
- [ ]
|
| 181 |
- [ ] Streaming responses
|
| 182 |
-
- [ ] Conversation sessions with SQLite persistence
|
| 183 |
-
|
|
|
|
|
|
| 1 |
# agent-bench
|
| 2 |
|
| 3 |
+
Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAISS + BM25 + RRF), tool use, and zero hallucinated citations — built from API primitives.
|
| 4 |
|
| 5 |
+
Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
|
| 6 |
+
|
| 7 |
+
`97 tests` | `25 commits` | `27-question benchmark` | `$0.0004/query` | `Docker ready`
|
| 8 |
|
| 9 |
## Benchmark Results
|
| 10 |
|
|
|
|
| 12 |
|
| 13 |
| Metric | Value | Notes |
|
| 14 |
|--------|-------|-------|
|
| 15 |
| Citation Accuracy | **1.00** | Zero hallucinated citations |
|
| 16 |
+
| Keyword Hit Rate | **0.89** | Expected facts present in answer |
|
| 17 |
+
| Retrieval R@5 | **0.83** | Expected sources found in top 5 |
|
| 18 |
+
| Retrieval P@5 | **0.70** | Hybrid RRF (FAISS + BM25) |
|
| 19 |
| Calculator Accuracy | **2/3** | LLM sometimes skips tool use |
|
| 20 |
+
| Grounded Refusal | **0/5** | LLM never refuses — top V2 priority |
|
| 21 |
| Latency p50 | 4,690 ms | gpt-4o-mini, single iteration |
|
| 22 |
| Cost per query | $0.0004 | ~$0.01 for full 27-question eval |
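For readers unfamiliar with the retrieval rows in the table, here is a minimal sketch of how P@5 and R@5 are commonly computed for a single question. It is not taken from the project's evaluation harness; the function names and the file names in the example are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved sources that are expected sources."""
    hits = sum(1 for source in retrieved[:k] if source in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the expected sources that appear in the top-k results."""
    if not relevant:
        return 0.0
    found = relevant.intersection(retrieved[:k])
    return len(found) / len(relevant)


# Hypothetical question with two expected sources (file names are made up).
retrieved = ["path-params.md", "query-params.md", "body.md", "security.md", "testing.md"]
relevant = {"path-params.md", "body.md"}

p_at_5 = precision_at_k(retrieved, relevant)  # 2 relevant hits in 5 slots
r_at_5 = recall_at_k(retrieved, relevant)     # both expected sources were found
```

A dataset-level score would then typically be the mean of these per-question values, although the averaging the project actually uses is not stated here.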
|
| 23 |
|
|
|
|
| 26 |
## Quick Start
|
| 27 |
|
| 28 |
```bash
|
| 29 |
+
make install # Install dependencies
|
| 30 |
+
make ingest # Chunk + embed 16 FastAPI docs into FAISS + BM25
|
| 31 |
+
make serve # Start FastAPI server on :8000
|
| 32 |
+
```
|
| 33 |
|
| 34 |
+
```bash
|
| 35 |
curl -X POST http://localhost:8000/ask \
|
| 36 |
-H "Content-Type: application/json" \
|
| 37 |
-d '{"question": "How do I define a path parameter in FastAPI?"}'
|
|
|
|
| 41 |
|
| 42 |
```bash
|
| 43 |
OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
|
| 44 |
```
|
| 45 |
|
| 46 |
## Architecture
|
| 47 |
|
| 48 |
+
```mermaid
|
| 49 |
+
flowchart LR
|
| 50 |
+
Client -->|POST /ask| MW[Middleware<br/>request_id, timing, errors]
|
| 51 |
+
MW --> Orch[Orchestrator<br/>max 3 iterations]
|
| 52 |
+
Orch --> LLM[OpenAI gpt-4o-mini]
|
| 53 |
+
LLM -->|tool_calls| Reg[Tool Registry]
|
| 54 |
+
Reg --> Search[search_documents]
|
| 55 |
+
Reg --> Calc[calculator]
|
| 56 |
+
Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
|
| 57 |
+
LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]
|
| 58 |
```
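The control flow in the diagram can be sketched roughly as below. This is an illustrative outline, not the project's orchestrator: `LLMReply`, `run_agent`, `scripted_llm` and `multiply` are hypothetical names, and the "toolless fallback" is assumed to mean one final LLM call with tools disabled once the iteration budget is spent.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LLMReply:
    """Hypothetical reply shape: final text, or a list of (tool name, argument)."""
    content: str = ""
    tool_calls: list[tuple[str, str]] = field(default_factory=list)


def run_agent(
    question: str,
    complete: Callable[[list[str], bool], LLMReply],
    tools: dict[str, Callable[[str], str]],
    max_iterations: int = 3,
) -> str:
    transcript = [f"user: {question}"]
    for _ in range(max_iterations):
        reply = complete(transcript, True)  # tools allowed
        if not reply.tool_calls:
            return reply.content  # no tool_calls: this is the final answer
        for name, argument in reply.tool_calls:
            transcript.append(f"tool {name}: {tools[name](argument)}")
    # Assumed fallback: budget exhausted, so ask once more with tools disabled.
    return complete(transcript, False).content


def scripted_llm(transcript: list[str], allow_tools: bool) -> LLMReply:
    """Stand-in for a real provider: calls the calculator once, then answers."""
    if allow_tools and not any(line.startswith("tool ") for line in transcript):
        return LLMReply(tool_calls=[("calculator", "6*7")])
    return LLMReply(content=f"Answer based on: {transcript[-1]}")


def multiply(expression: str) -> str:
    """Toy calculator that only handles 'a*b' with integer operands."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


answer = run_agent("What is 6*7?", scripted_llm, {"calculator": multiply})
```

Passing the provider in as a plain callable mirrors the idea of swapping in a mock provider for deterministic tests.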
|
| 59 |
|
| 60 |
## What This Demonstrates
|
| 61 |
|
| 62 |
+
- **Agentic architecture**: Iterative tool-use loop — max 3 iterations with toolless fallback, no LangChain or LlamaIndex
|
| 63 |
- **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
|
| 64 |
- **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
|
| 65 |
+
- **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis
|
| 66 |
- **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
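As a rough illustration of the Reciprocal Rank Fusion step named in the RAG bullet, the sketch below merges a dense ranking and a sparse ranking by summing `1 / (k + rank)` per document. The function name, the chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions; the project's own store may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID so the order is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical chunk IDs from a dense (FAISS-style) and a sparse (BM25-style) search.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the usual argument for RRF over score normalization.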
|
| 67 |
|
| 68 |
## API Endpoints
|
| 69 |
|
| 70 |
| Endpoint | Method | Description |
|
|
|
|
| 104 |
## Evaluation
|
| 105 |
|
| 106 |
```bash
|
| 107 |
+
make evaluate-fast # Deterministic metrics only (needs API key)
|
| 108 |
+
make evaluate-full # + LLM-judge metrics (costs more)
|
| 109 |
+
make benchmark # Generate markdown report from results
|
| 110 |
```
|
| 111 |
|
| 112 |
+
The golden dataset contains 27 hand-crafted questions:
|
| 113 |
- 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
|
| 114 |
- 3 calculation: questions requiring the calculator tool
|
| 115 |
- 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
|
| 116 |
|
| 117 |
## Testing
|
| 118 |
|
| 119 |
```bash
|
|
|
|
| 123 |
|
| 124 |
All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
|
| 125 |
|
| 126 |
## Design Decisions
|
| 127 |
|
| 128 |
+
See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
|
| 129 |
|
| 130 |
## V2 Roadmap
|
| 131 |
|
| 132 |
+
- [ ] Grounded refusal improvements (0/5 is the top priority)
|
| 133 |
- [ ] Cross-encoder reranking (feature-flagged, config ready)
|
| 134 |
+
- [ ] Second provider (Anthropic Claude)
|
| 135 |
- [ ] Streaming responses
|
| 136 |
+
- [ ] Conversation sessions with SQLite persistence
|
| 137 |
+
|
| 138 |
+
*Scope: Docker-local, CPU-only, single-domain V1.*
|