Nomearod Claude Opus 4.6 (1M context) committed on
Commit 55218a1 · 1 Parent(s): f7dd169

feat: Render deployment config, startup warmup, README update

- render.yaml: free tier, Frankfurt, Docker, health check path
- Startup warmup eager-loads embedder + reranker to reduce cold start
- README: live demo section with curl examples, V1→V2 table, updated
test count and production patterns
- Replaced fly.toml with render.yaml (free tier, no card required)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. README.md +41 -11
  2. agent_bench/serving/app.py +13 -0
  3. render.yaml +16 -0
README.md CHANGED
@@ -6,7 +6,7 @@ Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAIS

  Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).

- `97 tests` | `25 commits` | `27-question benchmark` | `$0.0004/query` | `Docker ready`
+ `120 tests` | `27-question benchmark` | `$0.0004/query` | `Docker ready` | `CI green`

  ## Benchmark Results

@@ -25,7 +25,26 @@ Evaluated on 27 hand-crafted questions using **gpt-4o-mini** ($0.0004/query) ove

  [Full benchmark report with failure analysis](docs/benchmark_report.md) | [Design decisions](DECISIONS.md)

- ## Quick Start
+ ## Live Demo
+
+ **https://agent-bench.onrender.com** (Frankfurt, free tier — first request after idle may take ~30-60s for cold start)
+
+ ```bash
+ # In-scope question (expect answer with sources)
+ curl -X POST https://agent-bench.onrender.com/ask \
+ -H "Content-Type: application/json" \
+ -d '{"question": "How do I define a path parameter in FastAPI?"}'
+
+ # Out-of-scope question (expect grounded refusal)
+ curl -X POST https://agent-bench.onrender.com/ask \
+ -H "Content-Type: application/json" \
+ -d '{"question": "How do I cook pasta?"}'
+
+ # Health check
+ curl https://agent-bench.onrender.com/health
+ ```
+
+ ## Quick Start (Local)

  ```bash
  make install # Install dependencies
@@ -65,7 +84,7 @@ flowchart LR
  - **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
  - **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
  - **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis
- - **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation, CI with 97 deterministic tests, request-level metrics
+ - **Production patterns**: FastAPI, Docker, CI/CD (GitHub Actions), Fly.io deployment, rate limiting, provider retry with backoff, structlog structured logging, Pydantic v2 validation, 120 deterministic tests

  ## API Endpoints

@@ -119,7 +138,7 @@ The golden dataset contains 27 hand-crafted questions:
  ## Testing

  ```bash
- make test # 97 deterministic tests, no API keys needed
+ make test # 120 deterministic tests, no API keys needed
  make lint # ruff + mypy
  ```

@@ -129,12 +148,23 @@ All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads

  See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.

- ## V2 Roadmap
+ ## V1 → V2 Improvements
+
+ | Feature | V1 | V2 | Skill Demonstrated |
+ |---------|----|----|-------------------|
+ | Grounded refusal | 0/5 | Threshold gate | Trust & safety |
+ | Retrieval precision | RRF only | RRF + cross-encoder | Reranking |
+ | Provider resilience | None | Retry + backoff | Error handling |
+ | Rate limiting | None | 10 RPM per IP | API hardening |
+ | Cloud deployment | None | Render (Frankfurt) | Docker → production |
+ | CI/CD | None | GitHub Actions | Automated quality gates |
+
+ See [DECISIONS.md](DECISIONS.md) for the reasoning behind each design choice.
+
+ ## Roadmap

- - [ ] Grounded refusal improvements (0/5 is the top priority)
- - [ ] Cross-encoder reranking (feature-flagged, config ready)
- - [ ] Second provider (Anthropic Claude)
- - [ ] Streaming responses
- - [ ] Conversation sessions with SQLite persistence
+ - [ ] Streaming responses (SSE for final synthesis)
+ - [ ] SQLite conversation sessions
+ - [ ] Anthropic provider (multi-provider comparison)

- *Scope: Docker-local, CPU-only, single-domain V1.*
+ *CPU-only, single-domain. Framework scales to larger corpora and additional providers.*
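The README table lists a rate limit of `10 RPM per IP`, but this commit does not show how `RateLimitMiddleware` enforces it. As a rough illustration only, a per-client sliding-window check might look like the sketch below. The class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` are hypothetical and are not taken from the repository.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per `window` seconds per client."""

    def __init__(self, limit: int = 10, window: float = 60.0, clock=time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so behaviour can be checked without sleeping
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        """Return True and record the request if the client is under its limit."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A middleware built on this idea would typically call `allow(request_ip)` once per request and answer with HTTP 429 when it returns `False`. State kept in process memory like this resets on every restart, which may be acceptable for a single free-tier instance.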
agent_bench/serving/app.py CHANGED
@@ -104,4 +104,17 @@ def create_app(config: AppConfig | None = None) -> FastAPI:
      app.add_middleware(RateLimitMiddleware, requests_per_minute=config.serving.rate_limit_rpm)
      app.include_router(router)

+     # Startup warmup: eager-load models to reduce cold start latency
+     @app.on_event("startup")
+     async def warmup() -> None:
+         import structlog
+
+         log = structlog.get_logger()
+         log.info("warmup_start")
+         # Trigger lazy model loads
+         _ = embedder.embed("warmup")
+         if reranker is not None:
+             _ = reranker.model  # noqa: F841
+         log.info("warmup_complete")
+
      return app
render.yaml ADDED
@@ -0,0 +1,16 @@
+ services:
+   - type: web
+     name: agent-bench
+     runtime: docker
+     dockerfilePath: docker/Dockerfile
+     region: frankfurt
+     plan: free
+     autoDeploy: true
+     envVars:
+       - key: OPENAI_API_KEY
+         sync: false
+       - key: AGENT_BENCH_ENV
+         value: production
+       - key: PYTHONUNBUFFERED
+         value: "1"
+     healthCheckPath: /health
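Because the service runs on the free plan, the README warns that the first request after idle may take roughly 30-60s. A client or smoke test could poll the `/health` path with exponential backoff before sending real queries. The sketch below is one way to do that with the standard library. The function name `wait_until_healthy` and its parameters are hypothetical, and it assumes the endpoint answers HTTP 200 once the service is up.

```python
import time
import urllib.request


def wait_until_healthy(url, attempts=6, base_delay=2.0, fetch=None, sleep=time.sleep):
    """Poll `url` until it returns HTTP 200, doubling the delay between tries."""

    def default_fetch(u):
        with urllib.request.urlopen(u, timeout=10) as resp:
            return resp.status

    fetch = fetch or default_fetch
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # URLError and timeouts are OSError subclasses; likely still cold-starting
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ... with the defaults
    return False
```

With the defaults the function waits up to 62 seconds in total, for example `wait_until_healthy("https://agent-bench.onrender.com/health")`. The `fetch` and `sleep` parameters exist only so the loop can be exercised without network access or real delays.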