feat: Render deployment config, startup warmup, README update
- render.yaml: free tier, Frankfurt, Docker, health check path
- Startup warmup eager-loads embedder + reranker to reduce cold start
- README: live demo section with curl examples, V1→V2 table, updated
test count and production patterns
- Replaced fly.toml with render.yaml (free tier, no card required)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README.md +41 -11
- agent_bench/serving/app.py +13 -0
- render.yaml +16 -0
README.md
CHANGED
@@ -6,7 +6,7 @@ Agentic RAG system with a 27-question evaluation harness, hybrid retrieval (FAIS
 
 Built as a portfolio project demonstrating AI engineering depth: provider abstraction, evaluation infrastructure, production patterns (FastAPI, Docker, CI, structured logging).
 
-`
+`120 tests` | `27-question benchmark` | `$0.0004/query` | `Docker ready` | `CI green`
 
 ## Benchmark Results
 
@@ -25,7 +25,26 @@ Evaluated on 27 hand-crafted questions using **gpt-4o-mini** ($0.0004/query) ove
 
 [Full benchmark report with failure analysis](docs/benchmark_report.md) | [Design decisions](DECISIONS.md)
 
-##
+## Live Demo
+
+**https://agent-bench.onrender.com** (Frankfurt, free tier — first request after idle may take ~30-60s for cold start)
+
+```bash
+# In-scope question (expect answer with sources)
+curl -X POST https://agent-bench.onrender.com/ask \
+  -H "Content-Type: application/json" \
+  -d '{"question": "How do I define a path parameter in FastAPI?"}'
+
+# Out-of-scope question (expect grounded refusal)
+curl -X POST https://agent-bench.onrender.com/ask \
+  -H "Content-Type: application/json" \
+  -d '{"question": "How do I cook pasta?"}'
+
+# Health check
+curl https://agent-bench.onrender.com/health
+```
+
+## Quick Start (Local)
 
 ```bash
 make install # Install dependencies
@@ -65,7 +84,7 @@ flowchart LR
 - **RAG pipeline**: Hybrid retrieval via Reciprocal Rank Fusion (FAISS dense + BM25 sparse), two chunking strategies (recursive + fixed-size)
 - **Provider abstraction**: Swap LLM backend via config. OpenAI implemented, Anthropic stubbed, MockProvider for deterministic tests
 - **Evaluation infrastructure**: 27-question golden dataset with negative/out-of-scope cases, 8 deterministic metrics + 2 LLM-judge metrics, failure analysis
-- **Production patterns**: FastAPI, Docker, structlog structured logging, Pydantic v2 validation,
+- **Production patterns**: FastAPI, Docker, CI/CD (GitHub Actions), Fly.io deployment, rate limiting, provider retry with backoff, structlog structured logging, Pydantic v2 validation, 120 deterministic tests
 
 ## API Endpoints
 
@@ -119,7 +138,7 @@ The golden dataset contains 27 hand-crafted questions:
 ## Testing
 
 ```bash
-make test #
+make test # 120 deterministic tests, no API keys needed
 make lint # ruff + mypy
 ```
 
@@ -129,12 +148,23 @@ All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads
 
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
 
-## V2
+## V1 → V2 Improvements
+
+| Feature | V1 | V2 | Skill Demonstrated |
+|---------|----|----|-------------------|
+| Grounded refusal | 0/5 | Threshold gate | Trust & safety |
+| Retrieval precision | RRF only | RRF + cross-encoder | Reranking |
+| Provider resilience | None | Retry + backoff | Error handling |
+| Rate limiting | None | 10 RPM per IP | API hardening |
+| Cloud deployment | None | Render (Frankfurt) | Docker → production |
+| CI/CD | None | GitHub Actions | Automated quality gates |
+
+See [DECISIONS.md](DECISIONS.md) for the reasoning behind each design choice.
+
+## Roadmap
 
-- [ ]
-- [ ]
-- [ ]
-- [ ] Streaming responses
-- [ ] Conversation sessions with SQLite persistence
+- [ ] Streaming responses (SSE for final synthesis)
+- [ ] SQLite conversation sessions
+- [ ] Anthropic provider (multi-provider comparison)
 
-*
+*CPU-only, single-domain. Framework scales to larger corpora and additional providers.*
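Reviewer note: the curl examples added to the README can also be driven from Python. The sketch below is a minimal, standard-library equivalent of the `POST /ask` call. The helper names `build_ask_request` and `ask` are hypothetical, the 90-second timeout is an assumption chosen to cover the cold start the README mentions, and the response is returned as raw text because the diff does not show the response schema.

```python
import json
import urllib.request

# Base URL taken from the README's Live Demo section.
BASE_URL = "https://agent-bench.onrender.com"


def build_ask_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /ask request shown in the curl examples."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, timeout: float = 90.0) -> str:
    """Send the request and return the raw response body as text.

    The generous default timeout is an assumption meant to tolerate
    a free-tier cold start; adjust as needed.
    """
    with urllib.request.urlopen(build_ask_request(question), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without touching the network.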
agent_bench/serving/app.py
CHANGED
@@ -104,4 +104,17 @@ def create_app(config: AppConfig | None = None) -> FastAPI:
     app.add_middleware(RateLimitMiddleware, requests_per_minute=config.serving.rate_limit_rpm)
     app.include_router(router)
 
+    # Startup warmup: eager-load models to reduce cold start latency
+    @app.on_event("startup")
+    async def warmup() -> None:
+        import structlog
+
+        log = structlog.get_logger()
+        log.info("warmup_start")
+        # Trigger lazy model loads
+        _ = embedder.embed("warmup")
+        if reranker is not None:
+            _ = reranker.model  # noqa: F841
+        log.info("warmup_complete")
+
     return app
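Reviewer note: the warmup hook works only if `embedder.embed(...)` and `reranker.model` load their models lazily on first use. The diff does not show how those attributes are implemented, so the sketch below is an assumption: a hypothetical `LazyReranker` whose `model` attribute is a `functools.cached_property`. It illustrates why a bare attribute access at startup is enough to move the load cost out of the first request.

```python
from functools import cached_property
from typing import Any, Callable, Optional


class LazyReranker:
    """Hypothetical stand-in for a reranker that loads its model on first access."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self.load_count = 0  # counts how many times the expensive load ran

    @cached_property
    def model(self) -> Any:
        # Runs once; the result is cached on the instance afterwards.
        self.load_count += 1
        return self._loader()


def warmup(reranker: Optional[LazyReranker]) -> None:
    """Touch the lazy attribute so the load happens at startup, not per request."""
    if reranker is not None:
        _ = reranker.model


reranker = LazyReranker(loader=lambda: object())
warmup(reranker)
first = reranker.model   # served from cache, no second load
second = reranker.model
```

If the real attribute were computed eagerly in `__init__`, the warmup access would be a harmless no-op.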
render.yaml
ADDED
@@ -0,0 +1,16 @@
+services:
+  - type: web
+    name: agent-bench
+    runtime: docker
+    dockerfilePath: docker/Dockerfile
+    region: frankfurt
+    plan: free
+    autoDeploy: true
+    envVars:
+      - key: OPENAI_API_KEY
+        sync: false
+      - key: AGENT_BENCH_ENV
+        value: production
+      - key: PYTHONUNBUFFERED
+        value: "1"
+    healthCheckPath: /health
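Reviewer note: because the free-tier instance sleeps when idle, a client or smoke test may want to poll `/health` before sending real traffic. The sketch below is one possible approach, not part of this commit. `wait_until_healthy` and `http_health_check` are hypothetical names, and the attempt count and doubling delay are assumptions. The `sleep` parameter is injectable so the retry logic can be exercised without real waiting.

```python
import time
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout: float = 10.0) -> bool:
    """Return True if GET `url` answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 6,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, doubling the delay between tries."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            if check():
                return True
        except OSError:
            # URLError, HTTPError, and socket timeouts all derive from OSError;
            # treat them as "not ready yet" while the instance wakes up.
            pass
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return False
```

A typical call would be `wait_until_healthy(lambda: http_health_check("https://agent-bench.onrender.com/health"))`, using the URL and path from the README and `render.yaml`.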