Nomearod Claude Opus 4.6 (1M context) commited on
Commit
9fd0c7a
·
1 Parent(s): 9b56692

feat: rewrite README for recruiter impact

Browse files

- Opening leads with results, not constraints
- One-line purpose statement (portfolio project, AI engineering depth)
- Key numbers one-liner (97 tests, 25 commits, $0.0004/query)
- Benchmark table reordered: citation accuracy 1.00 first
- Mermaid architecture diagram (renders on GitHub)
- Cut "What This Is Not" section, compressed to one line at bottom
- Cut "Project Structure" section (redundant with folder tree)
- V2 roadmap reordered: grounded refusal 0/5 first (honest priority)
- Quick start condensed to 3 lines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1) hide show
  1. README.md +35 -80
README.md CHANGED
@@ -1,8 +1,10 @@
1
  # agent-bench
2
 
3
- Evaluation-first agentic RAG system built from API primitives no LangChain, no LlamaIndex.
4
 
5
- **Stack:** FastAPI, OpenAI gpt-4o-mini, FAISS + BM25 (Reciprocal Rank Fusion), Pydantic v2, Docker, 97 deterministic tests
 
 
6
 
7
  ## Benchmark Results
8
 
@@ -10,12 +12,12 @@ Evaluated on 27 hand-crafted questions (19 retrieval, 3 calculation, 5 out-of-sc
10
 
11
  | Metric | Value | Notes |
12
  |--------|-------|-------|
13
- | Retrieval P@5 | **0.70** | Hybrid RRF (FAISS + BM25) |
14
- | Retrieval R@5 | **0.83** | Expected sources found in top 5 |
15
- | Keyword Hit Rate | **0.89** | Expected facts present in answer |
16
  | Citation Accuracy | **1.00** | Zero hallucinated citations |
17
- | Grounded Refusal | **0/5** | LLM never refuses top V2 priority |
 
 
18
  | Calculator Accuracy | **2/3** | LLM sometimes skips tool use |
 
19
  | Latency p50 | 4,690 ms | gpt-4o-mini, single iteration |
20
  | Cost per query | $0.0004 | ~$0.01 for full 27-question eval |
21
 
@@ -24,16 +26,12 @@ Evaluated on 27 hand-crafted questions (19 retrieval, 3 calculation, 5 out-of-sc
24
  ## Quick Start
25
 
26
  ```bash
27
- # Install (uses the pinned interpreter from Makefile)
28
- make install
29
-
30
- # Ingest the documentation corpus
31
- make ingest
32
-
33
- # Start the API server
34
- make serve
35
 
36
- # Ask a question
37
  curl -X POST http://localhost:8000/ask \
38
  -H "Content-Type: application/json" \
39
  -d '{"question": "How do I define a path parameter in FastAPI?"}'
@@ -43,50 +41,30 @@ curl -X POST http://localhost:8000/ask \
43
 
44
  ```bash
45
  OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
46
- curl -X POST http://localhost:8000/ask \
47
- -H "Content-Type: application/json" \
48
- -d '{"question": "How do I define a path parameter in FastAPI?"}'
49
  ```
50
 
51
  ## Architecture
52
 
53
- ```
54
- Client
55
- |
56
- v
57
- POST /ask ──> Middleware (request_id, timing, error handling)
58
- |
59
- v
60
- Orchestrator ──> Loop (max 3 iterations):
61
- | |
62
- | v
63
- | LLM Provider (OpenAI gpt-4o-mini)
64
- | |
65
- | tool_calls? ──yes──> Tool Registry
66
- | | |
67
- | no search_documents ──> Retriever
68
- | | | |
69
- | v calculator Embedder + HybridStore
70
- | Final answer (FAISS + BM25 + RRF)
71
- |
72
- v
73
- AskResponse { answer, sources[], metadata }
74
  ```
75
 
76
  ## What This Demonstrates
77
 
78
- - **Agentic architecture**: Iterative tool-use loop with plan, execute, verify — max 3 iterations with toolless fallback
79
  - **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
80
  - **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
81
- - **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis, benchmark report
82
  - **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
83
 
84
- ## What This Is Not
85
-
86
- - Not a framework (it's a focused demonstration)
87
- - Not cloud-deployed (Docker-local is the scope)
88
- - Not GPU-dependent (runs on a CPU laptop)
89
-
90
  ## API Endpoints
91
 
92
  | Endpoint | Method | Description |
@@ -126,23 +104,16 @@ Response:
126
  ## Evaluation
127
 
128
  ```bash
129
- # Deterministic metrics only (free, CI-safe)
130
- make evaluate-fast
131
-
132
- # Deterministic + LLM-judge metrics (costs money)
133
- make evaluate-full
134
-
135
- # Generate benchmark report
136
- make benchmark
137
  ```
138
 
139
- The golden dataset (`agent_bench/evaluation/datasets/tech_docs_golden.json`) contains 27 hand-crafted questions:
140
  - 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
141
  - 3 calculation: questions requiring the calculator tool
142
  - 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
143
 
144
- Metrics measured: Retrieval P@5, R@5, keyword hit rate, source citation rate, citation accuracy, grounded refusal rate, calculator accuracy, latency, cost.
145
-
146
  ## Testing
147
 
148
  ```bash
@@ -152,32 +123,16 @@ make lint # ruff + mypy
152
 
153
  All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
154
 
155
- ## Project Structure
156
-
157
- ```
158
- agent_bench/
159
- core/ # Provider abstraction, config, types
160
- agents/ # Orchestrator (tool-use loop, no persistent memory)
161
- tools/ # Registry, search_documents, calculator
162
- rag/ # Chunker, embedder, FAISS+BM25 store, retriever
163
- evaluation/ # Harness, metrics, report generator, golden dataset
164
- serving/ # FastAPI app, routes, schemas, middleware
165
- ```
166
-
167
  ## Design Decisions
168
 
169
- See [DECISIONS.md](DECISIONS.md) for rationale on:
170
- - Building from primitives (no LangChain)
171
- - Reciprocal Rank Fusion over score normalization
172
- - One provider in V1 with interface for extensibility
173
- - Negative evaluation cases for grounded refusal
174
- - Deterministic eval + optional LLM judge
175
 
176
  ## V2 Roadmap
177
 
178
- - [ ] Second provider (Anthropic Claude)
179
  - [ ] Cross-encoder reranking (feature-flagged, config ready)
180
- - [ ] Research paper domain (PDF ingestion)
181
  - [ ] Streaming responses
182
- - [ ] Conversation sessions with SQLite persistence + conversation_id
183
- - [ ] Provider comparison benchmark
 
 
1
  # agent-bench
2
 
3
+ Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAISS + BM25 + RRF), tool use, and zero hallucinated citations — built from API primitives.
4
 
5
+ Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
6
+
7
+ `97 tests` | `25 commits` | `27-question benchmark` | `$0.0004/query` | `Docker ready`
8
 
9
  ## Benchmark Results
10
 
 
12
 
13
  | Metric | Value | Notes |
14
  |--------|-------|-------|
 
 
 
15
  | Citation Accuracy | **1.00** | Zero hallucinated citations |
16
+ | Keyword Hit Rate | **0.89** | Expected facts present in answer |
17
+ | Retrieval R@5 | **0.83** | Expected sources found in top 5 |
18
+ | Retrieval P@5 | **0.70** | Hybrid RRF (FAISS + BM25) |
19
  | Calculator Accuracy | **2/3** | LLM sometimes skips tool use |
20
+ | Grounded Refusal | **0/5** | LLM never refuses — top V2 priority |
21
  | Latency p50 | 4,690 ms | gpt-4o-mini, single iteration |
22
  | Cost per query | $0.0004 | ~$0.01 for full 27-question eval |
23
 
 
26
  ## Quick Start
27
 
28
  ```bash
29
+ make install # Install dependencies
30
+ make ingest # Chunk + embed 16 FastAPI docs into FAISS + BM25
31
+ make serve # Start FastAPI server on :8000
32
+ ```
 
 
 
 
33
 
34
+ ```bash
35
  curl -X POST http://localhost:8000/ask \
36
  -H "Content-Type: application/json" \
37
  -d '{"question": "How do I define a path parameter in FastAPI?"}'
 
41
 
42
  ```bash
43
  OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
 
 
 
44
  ```
45
 
46
  ## Architecture
47
 
48
+ ```mermaid
49
+ flowchart LR
50
+ Client -->|POST /ask| MW[Middleware<br/>request_id, timing, errors]
51
+ MW --> Orch[Orchestrator<br/>max 3 iterations]
52
+ Orch --> LLM[OpenAI gpt-4o-mini]
53
+ LLM -->|tool_calls| Reg[Tool Registry]
54
+ Reg --> Search[search_documents]
55
+ Reg --> Calc[calculator]
56
+ Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
57
+ LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]
 
 
 
 
 
 
 
 
 
 
 
58
  ```
59
 
60
  ## What This Demonstrates
61
 
62
+ - **Agentic architecture**: Iterative tool-use loop — max 3 iterations with toolless fallback, no LangChain or LlamaIndex
63
  - **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
64
  - **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
65
+ - **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis
66
  - **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
67
 
 
 
 
 
 
 
68
  ## API Endpoints
69
 
70
  | Endpoint | Method | Description |
 
104
  ## Evaluation
105
 
106
  ```bash
107
+ make evaluate-fast # Deterministic metrics only (needs API key)
108
+ make evaluate-full # + LLM-judge metrics (costs more)
109
+ make benchmark # Generate markdown report from results
 
 
 
 
 
110
  ```
111
 
112
+ The golden dataset contains 27 hand-crafted questions:
113
  - 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
114
  - 3 calculation: questions requiring the calculator tool
115
  - 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
116
 
 
 
117
  ## Testing
118
 
119
  ```bash
 
123
 
124
  All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
125
 
 
 
 
 
 
 
 
 
 
 
 
 
126
  ## Design Decisions
127
 
128
+ See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
 
 
 
 
 
129
 
130
  ## V2 Roadmap
131
 
132
+ - [ ] Grounded refusal improvements (0/5 is the top priority)
133
  - [ ] Cross-encoder reranking (feature-flagged, config ready)
134
+ - [ ] Second provider (Anthropic Claude)
135
  - [ ] Streaming responses
136
+ - [ ] Conversation sessions with SQLite persistence
137
+
138
+ *Scope: Docker-local, CPU-only, single-domain V1.*