Nomearod Claude Opus 4.6 (1M context) committed on
Commit
dc97d8c
·
1 Parent(s): 3e490c9

fix: remove stale V1 docs, update DECISIONS.md for V2


Remove DESIGN.md from tracking (internal V1 design doc with outdated
non-goals that contradict shipped V2 features). Update DECISIONS.md
to reflect current state: Anthropic implemented, streaming added,
sessions added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. .gitignore +1 -0
  2. DECISIONS.md +16 -23
  3. docs/DESIGN.md +0 -416
.gitignore CHANGED
@@ -16,3 +16,4 @@ build/
16
  venv/
17
  .worktrees/
18
  *.db
19
+ docs/DESIGN.md
DECISIONS.md CHANGED
@@ -9,11 +9,10 @@ I know exactly where it plugs in — because I built every layer.
9
 
10
  ## Why one provider in V1?
11
 
12
- The interface supports multiple providers. Implementing one real (OpenAI)
13
- plus one mock proves the abstraction works without doubling edge-case
14
- work. The Anthropic stub raises `NotImplementedError("planned for V2")`;
15
- adding it is a matter of mapping tool formats. The orchestrator and
16
- tools don't change.
17
 
18
  ## Why one domain (technical docs)?
19
 
@@ -54,24 +53,18 @@ eliminates consistency bugs.
54
  ## Why async internals, sync user behavior?
55
 
56
  FastAPI and the OpenAI SDK are async-native. Using async for I/O
57
- avoids blocking the event loop. But the API is request-response —
58
- no streaming, no background jobs. Async is an implementation detail,
59
- not a user-facing feature.
60
-
61
- ## Why no conversation_id in V1?
62
-
63
- No persistent memory means a session ID is a contract you can't
64
- honor. The orchestrator's message list is local to each request
65
- and dies with the response. An in-process memory buffer would be
66
- a half-measure: it resets on restart and can't distinguish users.
67
- V2 adds SQLite-backed sessions with a real conversation_id.
68
-
69
- ## Why no memory.py in V1?
70
-
71
- Discussed during design review and cut. Without conversation_id,
72
- cross-request memory is state you can't persist or key. The
73
- orchestrator builds a `list[Message]` as a local variable during
74
- its 3-iteration loop — that's working context, not memory.
75
 
76
  ## Why negative evaluation cases?
77
 
 
9
 
10
  ## Why one provider in V1?
11
 
12
+ The interface supports multiple providers. V1 shipped OpenAI + Mock to
13
+ prove the abstraction. V2 added Anthropic (claude-haiku-4-5), confirming
14
+ that switching providers is a one-line config change. The orchestrator
15
+ and tools are completely unchanged between providers.
 
16
 
17
  ## Why one domain (technical docs)?
18
 
 
53
  ## Why async internals, sync user behavior?
54
 
55
  FastAPI and the OpenAI SDK are async-native. Using async for I/O
56
+ avoids blocking the event loop. V2 added SSE streaming (`/ask/stream`)
57
+ for the final synthesis step; tool calls remain non-streamed since
58
+ they complete in ~100ms.
59
+
60
+ ## Why SQLite-backed conversation sessions
61
+
62
+ V1 was stateless by design: no conversation_id, no cross-request
63
+ memory. V2 adds optional SQLite-backed sessions: pass `session_id`
64
+ on `/ask` to persist and load conversation history. When omitted,
65
+ behavior is identical to V1 (stateless). See the dedicated
66
+ DECISIONS.md entry under "Why SQLite for conversation persistence"
67
+ for the full rationale.
 
 
 
 
 
 
68
 
69
  ## Why negative evaluation cases?
70
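The optional-session behaviour described in the new DECISIONS.md entry can be sketched roughly as follows. This is a minimal illustration using the standard-library `sqlite3` module, not the project's actual code: the table layout and the helper names `open_store`, `append_message`, and `load_history` are assumptions made for the example. The only behaviour taken from the commit is that passing a `session_id` persists and loads history, and omitting it leaves the request stateless.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the session database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT NOT NULL, "
        "role TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    return conn


def append_message(
    conn: sqlite3.Connection, session_id: Optional[str], role: str, content: str
) -> None:
    """Persist one message; a missing session_id means nothing is stored."""
    if session_id is None:
        return
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )


def load_history(
    conn: sqlite3.Connection, session_id: Optional[str]
) -> list[tuple[str, str]]:
    """Return (role, content) pairs in insertion order; empty when stateless."""
    if session_id is None:
        return []
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return [(role, content) for role, content in rows]
```

Under these assumptions, a request handler would call `load_history` before building the message list and `append_message` after each turn, so a request without `session_id` never touches the database.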
 
docs/DESIGN.md DELETED
@@ -1,416 +0,0 @@
1
- # agent-bench — Design Document
2
-
3
- > Evaluation-first agentic RAG system with one provider, one domain, one API, and one benchmark report — built from API primitives on a CPU-only laptop.
4
-
5
- Based on V3 spec with 7 refinements from design review (2026-03-24).
6
-
7
- ---
8
-
9
- ## Scope Lock
10
-
11
- | Decision | Choice |
12
- |----------|--------|
13
- | LLM backend | OpenAI (`gpt-4o-mini`) + `MockProvider` for tests + `AnthropicProvider` stub |
14
- | Embedding model | `all-MiniLM-L6-v2` (sentence-transformers, CPU) |
15
- | Vector store | FAISS (CPU) + BM25, fused via Reciprocal Rank Fusion |
16
- | API framework | FastAPI |
17
- | Validation | Pydantic v2 |
18
- | Testing | pytest + httpx (async test client) |
19
- | CI | GitHub Actions — full deterministic test suite |
20
- | Containerization | Docker + docker-compose |
21
- | Logging | structlog (JSON structured logging) |
22
- | Domain | Technical documentation Q&A (markdown) |
23
- | Corpus | ~15-20 curated markdown files (e.g., FastAPI tutorial pages) |
24
- | Async strategy | Async provider internals, sync user-facing behavior |
25
- | Citation format | Structured `sources` list in JSON + `[source: filename.md]` inline |
26
-
27
- ### Non-goals (V1)
28
-
29
- No LangChain/LlamaIndex, no fine-tuning, no frontend UI, no cloud deploy, no third-party observability, no GPU, no streaming, no persistent memory/conversation DB, no `/upload` endpoint, no second domain, no second provider implementation, no conversation sessions.
30
-
31
- ---
32
-
33
- ## Repository Structure
34
-
35
- ```
36
- agent-bench/
37
- ├── pyproject.toml
38
- ├── Makefile
39
- ├── README.md
40
- ├── DECISIONS.md
41
- ├── .github/workflows/ci.yaml
42
- ├── configs/
43
- │ ├── default.yaml
44
- │ └── tasks/tech_docs.yaml
45
- ├── data/tech_docs/ # ~15-20 curated markdown files
46
- ├── agent_bench/
47
- │ ├── __init__.py
48
- │ ├── core/
49
- │ │ ├── __init__.py
50
- │ │ ├── provider.py # LLM provider abstraction
51
- │ │ ├── config.py # Pydantic settings
52
- │ │ └── types.py # Shared type definitions
53
- │ ├── agents/
54
- │ │ ├── __init__.py
55
- │ │ └── orchestrator.py # Tool-use loop (no memory.py in V1)
56
- │ ├── tools/
57
- │ │ ├── __init__.py
58
- │ │ ├── registry.py
59
- │ │ ├── base.py
60
- │ │ ├── search.py
61
- │ │ └── calculator.py
62
- │ ├── rag/
63
- │ │ ├── __init__.py
64
- │ │ ├── chunker.py
65
- │ │ ├── embedder.py
66
- │ │ ├── store.py
67
- │ │ └── retriever.py
68
- │ ├── evaluation/
69
- │ │ ├── __init__.py
70
- │ │ ├── harness.py
71
- │ │ ├── metrics.py
72
- │ │ ├── datasets/tech_docs_golden.json
73
- │ │ └── report.py
74
- │ └── serving/
75
- │ ├── __init__.py
76
- │ ├── app.py
77
- │ ├── routes.py
78
- │ ├── schemas.py
79
- │ └── middleware.py
80
- ├── scripts/
81
- │ ├── ingest.py
82
- │ ├── evaluate.py
83
- │ └── benchmark.py
84
- ├── tests/
85
- │ ├── __init__.py
86
- │ ├── conftest.py
87
- │ ├── test_provider.py
88
- │ ├── test_tools.py
89
- │ ├── test_rag.py
90
- │ ├── test_agent.py
91
- │ └── test_serving.py
92
- └── docker/
93
- ├── Dockerfile
94
- └── docker-compose.yaml
95
- ```
96
-
97
- ---
98
-
99
- ## Data Flow
100
-
101
- ```
102
- Client → FastAPI (/ask) → Middleware (request_id, timing)
103
- → Orchestrator.run(question, top_k, strategy)
104
- → messages = [system_prompt, user_question]
105
- → Loop (max 3 iterations):
106
- → OpenAI.complete(messages, tools=[search_documents, calculator])
107
- → If tool_calls: execute via ToolRegistry → append tool results to messages
108
- → If no tool_calls: break (final answer)
109
- → If max iterations hit: one final complete() without tools → force text answer
110
- → Return AgentResponse(answer, sources, metadata)
111
- → Serialize to AskResponse
112
- → Client
113
- ```
114
-
115
- Three endpoints: `POST /ask`, `GET /health`, `GET /metrics`. No CRUD, no sessions, no auth.
116
-
117
- Three singletons at startup: ToolRegistry, HybridStore (loaded from disk), OpenAI client.
118
-
119
- ---
120
-
121
- ## Provider Abstraction
122
-
123
- ```python
124
- class LLMProvider(ABC):
125
- @abstractmethod
126
- async def complete(self, messages, tools=None, temperature=0.0, max_tokens=1024) -> CompletionResponse: ...
127
- @abstractmethod
128
- def format_tools(self, tools: list[ToolDefinition]) -> list[dict]: ...
129
- ```
130
-
131
- Three implementations:
132
- 1. **OpenAIProvider** — full implementation, `gpt-4o-mini` default
133
- 2. **MockProvider** — deterministic responses for tests (returns tool_calls on first call, final answer when tool results present)
134
- 3. **AnthropicProvider** — raises `NotImplementedError("planned for V2")`
135
-
136
- ### OpenAI-specific details
137
-
138
- - Message mapping: internal Role enum → OpenAI role strings
139
- - Tool calls: `choice.message.tool_calls` → `list[ToolCall]`
140
- - Arguments parsing: `json.loads(tc.function.arguments)` with try/except for malformed JSON → empty dict fallback
141
- - Cost: `(input_tokens * input_cost_per_mtok + output_tokens * output_cost_per_mtok) / 1_000_000`, pricing from config YAML
142
- - Latency: `time.perf_counter()` around the API call
143
- - Errors: `openai.APITimeoutError` → domain exception. No retries in V1.
144
- - `tool_choice: "auto"` — let the model decide
145
-
146
- ### MockProvider keying
147
-
148
- Checks whether messages contain `Role.TOOL` entries:
149
- - No tool results present → return canned response with `tool_calls`
150
- - Tool results present → return canned final answer with no `tool_calls`
151
- - Returns realistic `TokenUsage` for cost-tracking tests
152
-
153
- ---
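The cost formula listed under the OpenAI-specific details of the removed design doc can be written out as a small function. This is a sketch only: the function name `estimate_cost_usd` is hypothetical, and the per-million-token prices in the usage line are placeholder values chosen for the arithmetic, not real pricing from the project's config.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_mtok: float,
    output_cost_per_mtok: float,
) -> float:
    """Apply the doc's formula: prices are quoted per one million tokens."""
    return (
        input_tokens * input_cost_per_mtok
        + output_tokens * output_cost_per_mtok
    ) / 1_000_000


# Placeholder prices: 0.15 USD per Mtok in, 0.60 USD per Mtok out.
cost = estimate_cost_usd(1000, 500, 0.15, 0.60)
print(f"{cost:.5f}")  # (1000 * 0.15 + 500 * 0.60) / 1e6 = 0.00045
```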
154
-
155
- ## RAG Pipeline
156
-
157
- ### Chunk model (flattened)
158
-
159
- ```python
160
- class Chunk(BaseModel):
161
- id: str # hash of content + source
162
- content: str
163
- source: str # bare filename, e.g. "fastapi_path_params.md"
164
- chunk_index: int
165
- metadata: dict
166
- ```
167
-
168
- `chunk.source` must match golden dataset `expected_sources` exactly (bare filename, no path prefix).
169
-
170
- ### Chunker
171
-
172
- Two strategies, configured via `chunk_size` (512) and `chunk_overlap` (64):
173
- - **Recursive:** splits on `\n\n` → `\n` → `. ` → space
174
- - **Fixed-size:** character-count splits with overlap
175
-
176
- ### Embedder
177
-
178
- - `SentenceTransformer('all-MiniLM-L6-v2')`, loaded once at init
179
- - Output: `np.ndarray` shape `(384,)` per chunk
180
- - Disk cache: `hash(content)` → `.cache/embeddings/{hash}.npy`
181
-
182
- ### Store (FAISS + BM25 + RRF)
183
-
184
- - FAISS `IndexFlatIP` on L2-normalized vectors (= cosine similarity)
185
- - BM25 via `rank_bm25.BM25Okapi`, tokenized with `re.findall(r'\w+', text.lower())`
186
- - `add(chunks)` writes to both indices
187
- - `search(query, top_k, strategy)` where strategy = "semantic" | "keyword" | "hybrid"
188
-
189
- **RRF fusion:**
190
- ```
191
- dense_results = faiss.search(query_embedding, k=candidates_per_system) # 10
192
- sparse_results = bm25.get_top_n(tokenized_query, k=candidates_per_system) # 10
193
- For each unique chunk: rrf_score = Σ 1/(60 + rank_in_system)
194
- Sort by rrf_score descending, return top_k (5)
195
- ```
196
-
197
- - `save()`/`load()`: FAISS via `faiss.write_index`/`read_index`, BM25 via pickle, chunks via JSON
198
- - No delete in V1. Rebuild on re-ingest.
199
-
200
- ### Retriever
201
-
202
- Thin glue: query string → embedder → store.search() → `list[SearchResult]`.
203
-
204
- ---
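The RRF fusion step in the removed RAG Pipeline section can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions rather than the project's store code: ranks are taken to be 1-based, ties are broken by chunk id, and the function name `rrf_fuse` is made up for the example. The constant 60 and the default `top_k` of 5 come from the doc.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in.
    Ranks start at 1 here, which is an assumption.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; chunk id as a deterministic tie-breaker.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_k]


dense = ["a", "b", "c"]   # e.g. ids from the FAISS search, best first
sparse = ["b", "d", "a"]  # e.g. ids from BM25, best first
print(rrf_fuse([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

Chunks found by both systems ("a" and "b") outrank chunks found by only one, which is the property that makes the fusion useful without having to calibrate FAISS and BM25 scores against each other.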
205
-
206
- ## Tool System
207
-
208
- ### Interface
209
-
210
- ```python
211
- class Tool(ABC):
212
- name: str
213
- description: str
214
- parameters: dict # JSON Schema
215
- @abstractmethod
216
- async def execute(self, **kwargs) -> ToolOutput: ...
217
- ```
218
-
219
- ### SearchTool
220
-
221
- - Input: `query: str`, optional `top_k: int = 5`
222
- - Calls `retriever.search(query, top_k)`
223
- - Formats results as numbered passages with filename attribution:
224
- ```
225
- [1] (fastapi_path_params.md): Path parameters are defined using curly braces...
226
- [2] (fastapi_query_params.md): Query parameters are automatically parsed...
227
- ```
228
- - Returns `ToolOutput(success=True, result=formatted, metadata={"sources": [filenames]})`
229
-
230
- ### CalculatorTool
231
-
232
- - Input: `expression: str`
233
- - Uses `simpleeval.simple_eval()` (blocks import, exec, eval, attribute access by default)
234
- - Wrapped in try/except:
235
- ```python
236
- try:
237
- result = simple_eval(expression)
238
- return ToolOutput(success=True, result=str(result))
239
- except Exception:
240
- return ToolOutput(success=False, result=f"Could not evaluate: {expression}")
241
- ```
242
-
243
- ### Registry
244
-
245
- - Dict-based. `register(tool)`, `execute(name, **kwargs)`, `get_definitions()`
246
- - Unknown tool name → `ToolOutput(success=False, result="Unknown tool: {name}")`
247
-
248
- ---
249
-
250
- ## Orchestrator
251
-
252
- ```python
253
- async def run(self, question, system_prompt, top_k, strategy) -> AgentResponse:
254
- messages = [system, user]
255
- tools = registry.get_definitions()
256
- all_sources, tools_used = [], []
257
- total_usage = TokenUsage(0, 0, 0.0)
258
-
259
- for iteration in range(max_iterations):
260
- response = await provider.complete(messages, tools=tools)
261
- # Manual accumulation (no operator overloading on Pydantic model)
262
- total_usage.input_tokens += response.usage.input_tokens
263
- total_usage.output_tokens += response.usage.output_tokens
264
- total_usage.estimated_cost_usd += response.usage.estimated_cost_usd
265
-
266
- if not response.tool_calls:
267
- return AgentResponse(answer=response.content, sources=dedup(all_sources), ...)
268
-
269
- messages.append(assistant_msg_with_tool_calls)
270
- for tc in response.tool_calls:
271
- result = await registry.execute(tc.name, **tc.arguments)
272
- messages.append(Message(role=TOOL, content=result.result, tool_call_id=tc.id))
273
- tools_used.append(tc.name)
274
- if "sources" in result.metadata:
275
- all_sources.extend(result.metadata["sources"])
276
-
277
- # Max iterations hit — force a text answer without tools
278
- response = await provider.complete(messages, tools=None)
279
- return AgentResponse(answer=response.content, sources=dedup(all_sources), ...)
280
- ```
281
-
282
- No `memory.py` in V1. The `messages` list is local to this function. Every `/ask` request is stateless.
283
-
284
- ---
285
-
286
- ## Serving Layer
287
-
288
- ### Schemas
289
-
290
- ```python
291
- class AskRequest(BaseModel):
292
- question: str
293
- top_k: int = 5
294
- retrieval_strategy: Literal["semantic", "keyword", "hybrid"] = "hybrid"
295
-
296
- class AskResponse(BaseModel):
297
- answer: str
298
- sources: list[SourceReference]
299
- metadata: ResponseMetadata # provider, model, iterations, tools_used, latency_ms, token_usage, request_id
300
- ```
301
-
302
- No `conversation_id`. No persistent sessions in V1.
303
-
304
- ### App factory
305
-
306
- Initializes singletons (embedder, store, retriever, registry, provider, orchestrator), attaches to `app.state`.
307
-
308
- ### Middleware
309
-
310
- - `X-Request-ID` (uuid4) on every response
311
- - structlog: method, path, status, latency_ms, request_id
312
- - Provider timeout → 504
313
- - Unexpected exceptions → 500 with request_id
314
-
315
- ### MetricsCollector
316
-
317
- In-process: `deque(maxlen=1000)` of latencies, request count, error count, total cost. Percentiles computed on demand. Resets on restart.
318
-
319
- ---
320
-
321
- ## Evaluation Harness
322
-
323
- ### Golden dataset (25 questions)
324
-
325
- - 20 positive: 8 easy (single chunk), 8 medium (2-3 chunks), 4 hard (multi-source)
326
- - 5 negative: out-of-scope, expects grounded refusal
327
- - 3+ calculator questions among the 20 positive
328
-
329
- Written in two passes: 10 on Day 4 (after seeing retrieval), 15 on Day 7 (after seeing actual system behavior).
330
-
331
- ### Deterministic metrics (free, CI-safe)
332
-
333
- - `retrieval_precision_at_k(retrieved, expected, k=5)`
334
- - `retrieval_recall_at_k(retrieved, expected, k=5)`
335
- - `keyword_hit_rate(answer, keywords)`
336
- - `source_presence_rate(response)` — has at least one source?
337
- - `grounded_refusal_rate(answer, category, expected_sources)` — out-of-scope → refuses + no sources
338
- - `citation_accuracy(answer, sources)` — regex `\[source: (.+?)\]`, check against structured sources list
339
- - `calculator_used_when_expected(response, requires_calculator)`
340
- - `tool_call_count(response)`
341
-
342
- ### LLM-judge metrics (costs money, manual)
343
-
344
- - `answer_faithfulness(answer, chunks, judge)` → 0.0-1.0
345
- - `answer_correctness(answer, reference, judge)` → 0.0-1.0
346
-
347
- Judge prompt ends with: `Respond with ONLY a JSON object: {"score": 0.8, "reasoning": "brief explanation"}`. Parse with `json.loads()`. If parsing fails, return `None`. Log reasoning for failure analysis.
348
-
349
- ### Benchmark report (`docs/benchmark_report.md`)
350
-
351
- Tables: aggregate, by category, by difficulty, chunking comparison.
352
- Failure analysis: 3 worst queries with root cause (manual, informed by judge reasoning).
353
- Config snapshot: full YAML dumped for reproducibility.
354
-
355
- ---
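The judge-output handling described under LLM-judge metrics (parse with `json.loads()`, return `None` on failure) might look like the sketch below. The function name `parse_judge_score` is hypothetical, and rejecting scores outside 0.0-1.0 is an added assumption based on the stated metric range, not something the doc specifies.

```python
import json
from typing import Optional


def parse_judge_score(raw: str) -> Optional[float]:
    """Extract the numeric score from a judge reply, or None if unusable."""
    try:
        payload = json.loads(raw)
        score = float(payload["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON, missing key, or a non-numeric score.
        return None
    # Assumption: treat out-of-range scores as a failed parse.
    return score if 0.0 <= score <= 1.0 else None


print(parse_judge_score('{"score": 0.8, "reasoning": "brief explanation"}'))  # 0.8
print(parse_judge_score("I think this is about 0.8"))  # None
```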
356
-
357
- ## Testing (31 tests, all deterministic)
358
-
359
- ### Fixtures (`conftest.py`)
360
-
361
- - `mock_provider`: MockProvider with realistic TokenUsage
362
- - `mock_embedder`: replaces SentenceTransformer with `np.random.RandomState(seed).randn(n, 384)` normalized to unit length. Deterministic, no model download.
363
- - `sample_chunks`: 5-10 Chunk objects with known content/sources
364
- - `test_store`: HybridStore populated with sample_chunks via mock_embedder
365
- - `test_registry`: SearchTool (backed by test_retriever) + CalculatorTool
366
-
367
- ### Test files
368
-
369
- | File | Tests | Coverage |
370
- |------|-------|----------|
371
- | `test_provider.py` | 6 | MockProvider responses, format_tools schema, cost calc, stub raises |
372
- | `test_tools.py` | 6 | Registry CRUD, search results, calculator valid/invalid, JSON Schema |
373
- | `test_rag.py` | 9 | Chunker strategies, embedder shape/cache, store search/RRF/empty/roundtrip |
374
- | `test_agent.py` | 4 | AgentResponse fields, max_iterations, source accumulation, deterministic output |
375
- | `test_serving.py` | 6 | /ask valid/invalid, /health, /metrics, request_id header, timeout 504 |
376
-
377
- All tests use MockProvider + mock_embedder. No API keys. No model downloads. CI runs full suite.
378
-
379
- ---
380
-
381
- ## Changes from V3 Spec
382
-
383
- | # | Change | Why |
384
- |---|--------|-----|
385
- | 1 | Drop `agents/memory.py` | No `conversation_id` → cross-request memory is a contract you can't honor |
386
- | 2 | Fix build-backend to `setuptools.build_meta` | Legacy backend breaks editable installs |
387
- | 3 | Wrap `simpleeval` errors in try/except | Prevents agent loop crash on malformed expressions |
388
- | 4 | Split golden dataset: 10 on Day 4, 15 on Day 7 | Better questions informed by real retrieval behavior |
389
- | 5 | `json.loads` safety net on tool_call arguments | Handles rare OpenAI malformed JSON |
390
- | 6 | Toolless final call on max-iterations fallback | Clean answer instead of raw tool result string |
391
- | 7 | JSON-structured LLM judge output with reasoning | Reliable parsing + free root-cause hints |
392
-
393
- ### Implementation details locked in
394
-
395
- - SearchTool formats: `[1] (filename.md): content...`
396
- - BM25 tokenizer: `re.findall(r'\w+', text.lower())`
397
- - `Chunk.source` = bare filename matching golden dataset `expected_sources`
398
- - Manual token accumulation (no Pydantic operator overloading)
399
- - `mock_embedder` fixture with seeded deterministic vectors
400
-
401
- ---
402
-
403
- ## Build Sequence
404
-
405
- | Day | Focus | Gate |
406
- |-----|-------|------|
407
- | 1 | Repo + provider + config | `make install && make test` green |
408
- | 2 | Tools + registry | Registration, dispatch, schema generation pass |
409
- | 3 | RAG core (chunker, embedder, store) | chunk → embed → store → retrieve works |
410
- | 4 | RAG e2e + ingest + 10 golden questions | **GATE: known query → right chunk. P@5 ≥ 0.5** |
411
- | 5 | Orchestrator wired to tools + RAG | Agent answers questions e2e using search + LLM |
412
- | 6 | Serving layer | `curl POST /ask` returns valid AskResponse |
413
- | 7 | Eval harness + 15 more golden questions + benchmark | **GATE: `docs/benchmark_report.md` with real numbers + failure analysis** |
414
- | 8 | README + DECISIONS.md | README can sell the project |
415
- | 9 | Docker | `docker-compose up → curl /ask` works |
416
- | 10 | Buffer | Everything green |