Spaces:

Nomearod
/

agentbench

Running

Nomearod Claude Opus 4.6 (1M context) commited on Mar 24

Commit

9fd0c7a

1 Parent(s): 9b56692

feat: rewrite README for recruiter impact

- Opening leads with results, not constraints
- One-line purpose statement (portfolio project, AI engineering depth)
- Key numbers one-liner (97 tests, 25 commits, $0.0004/query)
- Benchmark table reordered: citation accuracy 1.00 first
- Mermaid architecture diagram (renders on GitHub)
- Cut "What This Is Not" section, compressed to one line at bottom
- Cut "Project Structure" section (redundant with folder tree)
- V2 roadmap reordered: grounded refusal 0/5 first (honest priority)
- Quick start condensed to 3 lines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1) hide show

README.md +35 -80

README.md CHANGED Viewed

@@ -1,8 +1,10 @@
 # agent-bench
-Evaluation-first agentic RAG system built from API primitives — no LangChain, no LlamaIndex.
-**Stack:** FastAPI, OpenAI gpt-4o-mini, FAISS + BM25 (Reciprocal Rank Fusion), Pydantic v2, Docker, 97 deterministic tests
 ## Benchmark Results
@@ -10,12 +12,12 @@ Evaluated on 27 hand-crafted questions (19 retrieval, 3 calculation, 5 out-of-sc
 | Metric | Value | Notes |
 |--------|-------|-------|
-| Retrieval P@5 | **0.70** | Hybrid RRF (FAISS + BM25) |
-| Retrieval R@5 | **0.83** | Expected sources found in top 5 |
-| Keyword Hit Rate | **0.89** | Expected facts present in answer |
 | Citation Accuracy | **1.00** | Zero hallucinated citations |
-| Grounded Refusal | **0/5** | LLM never refuses — top V2 priority |
 | Calculator Accuracy | **2/3** | LLM sometimes skips tool use |
 | Latency p50 | 4,690 ms | gpt-4o-mini, single iteration |
 | Cost per query | $0.0004 | ~$0.01 for full 27-question eval |
@@ -24,16 +26,12 @@ Evaluated on 27 hand-crafted questions (19 retrieval, 3 calculation, 5 out-of-sc
 ## Quick Start
 ```bash
-# Install (uses the pinned interpreter from Makefile)
-make install
-# Ingest the documentation corpus
-make ingest
-# Start the API server
-make serve
-# Ask a question
 curl -X POST http://localhost:8000/ask \
   -H "Content-Type: application/json" \
   -d '{"question": "How do I define a path parameter in FastAPI?"}'
@@ -43,50 +41,30 @@ curl -X POST http://localhost:8000/ask \
 ```bash
 OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
-curl -X POST http://localhost:8000/ask \
-  -H "Content-Type: application/json" \
-  -d '{"question": "How do I define a path parameter in FastAPI?"}'
 ```
 ## Architecture
-```
-Client
-  |
-  v
-POST /ask  ──>  Middleware (request_id, timing, error handling)
-  |
-  v
-Orchestrator  ──>  Loop (max 3 iterations):
-  |                   |
-  |                   v
-  |              LLM Provider (OpenAI gpt-4o-mini)
-  |                   |
-  |              tool_calls?  ──yes──>  Tool Registry
-  |                   |                     |
-  |                   no               search_documents ──> Retriever
-  |                   |                     |                    |
-  |                   v                calculator         Embedder + HybridStore
-  |              Final answer                            (FAISS + BM25 + RRF)
-  |
-  v
-AskResponse { answer, sources[], metadata }
 ```
 ## What This Demonstrates
-- **Agentic architecture**: Iterative tool-use loop with plan, execute, verify — max 3 iterations with toolless fallback
 - **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
 - **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
-- **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis, benchmark report
 - **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
-## What This Is Not
-- Not a framework (it's a focused demonstration)
-- Not cloud-deployed (Docker-local is the scope)
-- Not GPU-dependent (runs on a CPU laptop)
 ## API Endpoints
 | Endpoint | Method | Description |
@@ -126,23 +104,16 @@ Response:
 ## Evaluation
 ```bash
-# Deterministic metrics only (free, CI-safe)
-make evaluate-fast
-# Deterministic + LLM-judge metrics (costs money)
-make evaluate-full
-# Generate benchmark report
-make benchmark
 ```
-The golden dataset (`agent_bench/evaluation/datasets/tech_docs_golden.json`) contains 27 hand-crafted questions:
 - 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
 - 3 calculation: questions requiring the calculator tool
 - 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
-Metrics measured: Retrieval P@5, R@5, keyword hit rate, source citation rate, citation accuracy, grounded refusal rate, calculator accuracy, latency, cost.
 ## Testing
 ```bash
@@ -152,32 +123,16 @@ make lint    # ruff + mypy
 All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
-## Project Structure
-```
-agent_bench/
-  core/       # Provider abstraction, config, types
-  agents/     # Orchestrator (tool-use loop, no persistent memory)
-  tools/      # Registry, search_documents, calculator
-  rag/        # Chunker, embedder, FAISS+BM25 store, retriever
-  evaluation/ # Harness, metrics, report generator, golden dataset
-  serving/    # FastAPI app, routes, schemas, middleware
-```
 ## Design Decisions
-See [DECISIONS.md](DECISIONS.md) for rationale on:
-- Building from primitives (no LangChain)
-- Reciprocal Rank Fusion over score normalization
-- One provider in V1 with interface for extensibility
-- Negative evaluation cases for grounded refusal
-- Deterministic eval + optional LLM judge
 ## V2 Roadmap
-- [ ] Second provider (Anthropic Claude)
 - [ ] Cross-encoder reranking (feature-flagged, config ready)
-- [ ] Research paper domain (PDF ingestion)
 - [ ] Streaming responses
-- [ ] Conversation sessions with SQLite persistence + conversation_id
-- [ ] Provider comparison benchmark

 # agent-bench
+Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAISS + BM25 + RRF), tool use, and zero hallucinated citations — built from API primitives.
+Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
+`97 tests` | `25 commits` | `27-question benchmark` | `$0.0004/query` | `Docker ready`
 ## Benchmark Results
 | Metric | Value | Notes |
 |--------|-------|-------|
 | Citation Accuracy | **1.00** | Zero hallucinated citations |
+| Keyword Hit Rate | **0.89** | Expected facts present in answer |
+| Retrieval R@5 | **0.83** | Expected sources found in top 5 |
+| Retrieval P@5 | **0.70** | Hybrid RRF (FAISS + BM25) |
 | Calculator Accuracy | **2/3** | LLM sometimes skips tool use |
+| Grounded Refusal | **0/5** | LLM never refuses — top V2 priority |
 | Latency p50 | 4,690 ms | gpt-4o-mini, single iteration |
 | Cost per query | $0.0004 | ~$0.01 for full 27-question eval |
 ## Quick Start
 ```bash
+make install    # Install dependencies
+make ingest     # Chunk + embed 16 FastAPI docs into FAISS + BM25
+make serve      # Start FastAPI server on :8000
+```
+```bash
 curl -X POST http://localhost:8000/ask \
   -H "Content-Type: application/json" \
   -d '{"question": "How do I define a path parameter in FastAPI?"}'
 ```bash
 OPENAI_API_KEY=sk-... docker-compose -f docker/docker-compose.yaml up --build
 ```
 ## Architecture
+```mermaid
+flowchart LR
+    Client -->|POST /ask| MW[Middleware<br/>request_id, timing, errors]
+    MW --> Orch[Orchestrator<br/>max 3 iterations]
+    Orch --> LLM[OpenAI gpt-4o-mini]
+    LLM -->|tool_calls| Reg[Tool Registry]
+    Reg --> Search[search_documents]
+    Reg --> Calc[calculator]
+    Search --> Store[Hybrid Store<br/>FAISS + BM25 + RRF]
+    LLM -->|no tool_calls| Resp[AskResponse<br/>answer + sources + metadata]
 ```
 ## What This Demonstrates
+- **Agentic architecture**: Iterative tool-use loop — max 3 iterations with toolless fallback, no LangChain or LlamaIndex
 - **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
 - **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
+- **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis
 - **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
 ## API Endpoints
 | Endpoint | Method | Description |
 ## Evaluation
 ```bash
+make evaluate-fast   # Deterministic metrics only (needs API key)
+make evaluate-full   # + LLM-judge metrics (costs more)
+make benchmark       # Generate markdown report from results
 ```
+The golden dataset contains 27 hand-crafted questions:
 - 19 retrieval: 8 easy (single chunk), 7 medium (multi-chunk), 4 hard (multi-source)
 - 3 calculation: questions requiring the calculator tool
 - 5 out-of-scope: questions testing grounded refusal (answer not in corpus)
 ## Testing
 ```bash
 All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
 ## Design Decisions
+See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
 ## V2 Roadmap
+- [ ] Grounded refusal improvements (0/5 is the top priority)
 - [ ] Cross-encoder reranking (feature-flagged, config ready)
+- [ ] Second provider (Anthropic Claude)
 - [ ] Streaming responses
+- [ ] Conversation sessions with SQLite persistence
+*Scope: Docker-local, CPU-only, single-domain V1.*