# Load Test Report: /chat endpoint
## Purpose
This report documents a benchmark run of the `/chat` pipeline under controlled
in-process conditions. It establishes a baseline for **framework overhead**
(FastAPI routing, LangGraph traversal, Pydantic serialization) with no real
external I/O. This is the T3-C Part 2 deliverable.
---
## Run conditions
| Parameter | Value |
|---|---|
| Date | 2026-06-26 |
| Tool | `scripts/bench_mocked.py` |
| Transport | `httpx.ASGITransport(app=app)` (in-process, no TCP) |
| Server | No real server process; ASGI interface called directly |
| Requests | 50 |
| Concurrency | 10 |
| Python | 3.11.15 |
| Platform | Windows 11, Intel/AMD x86-64 (GIL-bound) |
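The request and concurrency settings above amount to a closed loop with at most 10 requests in flight. The following is a minimal sketch of that pattern using only the standard library; it is not the contents of `scripts/bench_mocked.py`. `fake_request` is a hypothetical stand-in for the `httpx.AsyncClient` call over `ASGITransport`, and `run_bench` is an illustrative name.

```python
import asyncio
import time


async def fake_request(i: int) -> int:
    """Hypothetical stand-in for `await client.post("/chat", ...)`."""
    await asyncio.sleep(0.001)
    return 200


async def run_bench(total: int = 50, concurrency: int = 10):
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests
    latencies_ms: list[float] = []
    statuses: list[int] = []

    async def one(i: int) -> None:
        async with sem:
            start = time.perf_counter()
            status = await fake_request(i)
            latencies_ms.append((time.perf_counter() - start) * 1000)
            statuses.append(status)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(total)))
    wall_ms = (time.perf_counter() - wall_start) * 1000
    return latencies_ms, statuses, wall_ms


latencies_ms, statuses, wall_ms = asyncio.run(run_bench())
errors = sum(s != 200 for s in statuses)
print(f"Requests: {len(statuses)}  Errors: {errors}  Wall time: {wall_ms:.0f} ms")
```

Latency is timed inside the semaphore, so time spent waiting for a free slot is excluded from the per-request figure but still counts toward wall time.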
### What was mocked
| Boundary | Mock behaviour |
|---|---|
| Pinecone vector search | Instant return of 1 chunk (cosine 0.92) |
| Groq LLM (`generate_answer`) | `MagicMock.invoke()` returning a fake AIMessage |
| Groq LLM (`streaming`) | No-op async generator |
| Tavily web search | Disabled (`is_tavily_configured=False`) |
| FastAPI startup (Pinecone init) | `init_pinecone` no-op |
| Response cache | Disabled (`cache_enabled=False`) |
| slowapi rate limiter | `limiter.enabled=False` (prevents 30/min limit from firing across 50 requests from one IP) |
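As a rough illustration of the mock behaviours in the table, the stubs might look like the sketch below. The names `fake_llm`, `fake_stream`, and `fake_vector_search` are hypothetical, and the patch targets the real script uses to install its mocks are not shown in this report.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import MagicMock

# Stand-in for the LLM client: .invoke() returns an object exposing .content,
# which is the attribute a chat message is typically read through.
fake_llm = MagicMock()
fake_llm.invoke.return_value = SimpleNamespace(content="mocked answer [1]")


async def fake_stream(*args, **kwargs):
    """No-op async generator: finishes without yielding any tokens."""
    return
    yield  # unreachable; its presence makes this an async generator function


def fake_vector_search(query: str, top_k: int = 5) -> list[dict]:
    """Instant return of one chunk with a fixed cosine score."""
    return [{"id": "chunk-1", "text": "stub chunk", "score": 0.92}]


async def drain(agen) -> list:
    """Collect everything an async generator yields."""
    return [item async for item in agen]


reply = fake_llm.invoke("What is RAG?")
streamed = asyncio.run(drain(fake_stream()))
chunks = fake_vector_search("What is RAG?")
```

Because every stub returns immediately, any latency observed in the run comes from the framework path rather than from the boundaries being replaced.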
### What ran for real
The full in-process request path: ASGI receive/send, FastAPI middleware
(CORS, metrics collection, auth header check), `require_api_key` dependency,
the `run_in_threadpool` dispatch into the LangGraph pipeline, all 7 graph
nodes (`normalize_input` → `contextualize_query` → `retrieve_context` →
`corrective_retrieve` → `decide_next` → `generate_answer` →
`format_response`), prompt building, `filter_chunks_by_score`, citation
verification, `ChatResponse` Pydantic serialization, and JSON response
encoding.
---
## Results
```
=== /chat in-process bench (mocked externals) ===
Requests: 50
Concurrency: 10
Errors: 0 (0.0%)
Wall time: 1321 ms
Throughput: 37.9 req/s
Avg latency: 252.02 ms
p50 latency: 272.73 ms
p95 latency: 448.47 ms
```
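For reference, summary lines like these can be derived from the raw per-request latencies as sketched below. This assumes nearest-rank percentiles; the percentile method the bench script actually uses is not stated here, and `summarize` is an illustrative name.

```python
import math
import statistics


def summarize(latencies_ms: list[float], wall_ms: float) -> dict[str, float]:
    ordered = sorted(latencies_ms)

    def pct(q: int) -> float:
        # Nearest rank: smallest sample with at least q% of samples at or below it
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "throughput_rps": len(ordered) / (wall_ms / 1000),
        "avg_ms": statistics.fmean(ordered),
        "p50_ms": pct(50),
        "p95_ms": pct(95),
    }


# Throughput is requests divided by wall time: 50 requests in 1321 ms
reported_throughput = round(50 / (1321 / 1000), 1)

# Illustrative input only, not data from the run
example = summarize([float(x) for x in range(1, 101)], wall_ms=2000.0)
```

The throughput line in the output is consistent with this arithmetic: 50 requests over 1321 ms rounds to 37.9 req/s.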
---
## Interpretation
### What these numbers measure
The p50 of **273 ms** is the cost of routing, middleware, auth, LangGraph
node traversal, schema validation, and JSON serialization at 10-way
concurrency, with zero I/O latency. It is a floor, not a ceiling: in
production, Pinecone and Groq API latency dominate (typically 250–1000 ms
combined, summing the per-service estimates later in this report), and the
p50 would be 600–1500 ms end-to-end.
### Why p50 is ~270 ms with mocked externals
The primary bottleneck is Python's GIL combined with `run_in_threadpool`:
- The router dispatches `graph.invoke()` via `run_in_threadpool` (from
`starlette.concurrency`, re-exported by `fastapi.concurrency`), which
schedules the call on AnyIO's worker thread pool.
- With 10 concurrent requests, 10 threads compete for the GIL to execute
LangGraph's pure-Python node traversal.
- Each node call holds the GIL during its Python bytecode execution.
- Effective concurrency is constrained: threads execute interleaved, not
truly in parallel, under CPU-bound load.
The graph's self-reported `generate_ms ≈ 0.02 ms` (logged per request)
reflects only the mock's `.invoke()` call time, not the thread scheduling
overhead or GIL contention visible from the outside.
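The effect can be reproduced outside the app with a small standard-library experiment. The sketch below is not part of the bench script: `cpu_bound` is a hypothetical stand-in for pure-Python node traversal, and the loop size is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    """Pure-Python loop: the running thread needs the GIL for every bytecode."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(fn):
    """Run fn() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000


N, TASKS = 200_000, 10

# Same work done one call after another on a single thread
serial, serial_ms = timed(lambda: [cpu_bound(N) for _ in range(TASKS)])

# Same work spread over 10 threads, as with 10 concurrent requests
with ThreadPoolExecutor(max_workers=TASKS) as pool:
    threaded, threaded_ms = timed(lambda: list(pool.map(cpu_bound, [N] * TASKS)))

print(f"serial:   {serial_ms:.1f} ms")
print(f"threaded: {threaded_ms:.1f} ms")
```

On a standard CPython build with the GIL, the threaded run is not expected to finish meaningfully faster than the serial one, because only one thread executes Python bytecode at a time. Each individual call also takes longer wall-clock time when it shares the interpreter with nine others, which is the kind of latency inflation this bench observes from the outside.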
### Relationship to Prometheus latency (T2.6)
The T2.6 Prometheus histogram (`rag_request_duration_seconds`) records
**total time from request receipt to response dispatch**, matching what this
bench measures. The p95 of 448 ms under 10-concurrency simulated load sets
an expectation: with real Groq and Pinecone I/O, the Prometheus p95 bucket
should track at 600–1500 ms in nominal operation (1–2 concurrent users).
A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if
reproduced) would point to GIL starvation at higher concurrency, a signal
to consider either reducing LangGraph node count or offloading to a
subprocess pool.
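When comparing this bench against the histogram, note that a p95 read from Prometheus is an estimate interpolated from bucket counts, so its resolution depends on the bucket boundaries. The sketch below mirrors the linear interpolation that PromQL's `histogram_quantile()` applies within a bucket. The bucket boundaries and counts are hypothetical, since the actual buckets configured for `rag_request_duration_seconds` are not listed in this report.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound_seconds, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile lies in the +Inf bucket: report the highest finite bound
                return prev_bound
            span = bound - prev_bound
            return prev_bound + span * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for 50 requests
buckets = [(0.25, 20), (0.5, 48), (1.0, 50), (float("inf"), 50)]
p95_seconds = histogram_quantile(0.95, buckets)
```

With these illustrative buckets the estimate lands just under 0.5 s. A coarse bucket layout can shift the reported p95 by a wide margin, which matters when comparing it to the exact 448 ms measured here.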
### Throughput ceiling
**37.9 req/s** with 10 concurrent threads and zero I/O represents an
upper bound on single-process throughput with the current GIL-bound design.
Real throughput (with Groq + Pinecone) at 10 concurrency would be limited
by external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s.
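These figures can be sanity-checked with Little's law, which for a closed loop of N in-flight requests gives throughput as N divided by mean latency. The sketch below applies it to the numbers in this report. Treating the 600–1500 ms end-to-end estimate as a mean latency is an assumption made for illustration.

```python
def throughput_rps(concurrency: int, mean_latency_ms: float) -> float:
    """Little's law rearranged: X = N / R for N requests kept in flight."""
    return concurrency / (mean_latency_ms / 1000)


# Mocked run: 10 in flight at 252.02 ms average latency
mocked = throughput_rps(10, 252.02)

# Estimated real-I/O range at the same concurrency
slow = throughput_rps(10, 1500)
fast = throughput_rps(10, 600)

# Serialized work per request implied by the run: wall time / request count
serialized_ms_per_request = 1321 / 50

print(f"mocked: {mocked:.1f} req/s, real I/O: {slow:.1f} to {fast:.1f} req/s")
print(f"serialized work per request: {serialized_ms_per_request:.1f} ms")
```

The mocked prediction of roughly 39.7 req/s sits slightly above the measured 37.9 req/s, which is plausible because fewer than 10 requests are in flight while the run ramps up and drains. The real-I/O range of about 6.7 to 16.7 req/s is broadly in line with the 5–15 req/s estimate.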
### What this run does NOT measure
| Gap | Reason |
|---|---|
| Real Pinecone latency | Mocked; would add 50–200 ms per request |
| Real Groq latency | Mocked; would add 200–800 ms per request |
| LangSmith tracing overhead | Disabled (no real `LANGSMITH_API_KEY`) |
| Cold start (graph compilation) | First request compiles the graph; amortized here |
| GZip compression middleware | Not added to this app |
---
## How to reproduce
```bash
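cd backend
# Needs a Python env with dependencies installed.
# PYTHONPATH is resolved relative to the current directory, so after
# `cd backend` it points at `.` rather than `backend`.
PYTHONPATH=. python scripts/bench_mocked.py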
```
The script self-configures dummy credentials and disables all real external
calls. No Pinecone or Groq account is required.
---
## Next steps
If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users:
1. Profile with `py-spy` to identify which LangGraph node holds the GIL
longest.
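2. Consider converting I/O-bound graph nodes (retrieval, generation) to
`async def` with direct `await` on I/O (removing the `run_in_threadpool`
wrapper). CPU-bound nodes gain nothing from `async def` and would block the
event loop while they run.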
3. Evaluate LangGraph's async `astream` / `ainvoke` path for the `/chat`
endpoint.
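To illustrate steps 2 and 3, the sketch below shows how awaiting I/O lets many requests overlap on a single thread. It uses plain `asyncio` rather than LangGraph. The function bodies are hypothetical stand-ins that borrow two node names from this report, and `asyncio.sleep` takes the place of the Pinecone and Groq round trips.

```python
import asyncio
import time


async def retrieve_context(query: str) -> list[str]:
    """Hypothetical async node: awaits I/O instead of blocking a worker thread."""
    await asyncio.sleep(0.05)  # stands in for a vector search round trip
    return [f"chunk for {query!r}"]


async def generate_answer(query: str, chunks: list[str]) -> str:
    await asyncio.sleep(0.05)  # stands in for an LLM round trip
    return f"answer to {query!r} from {len(chunks)} chunk(s)"


async def handle_chat(query: str) -> str:
    chunks = await retrieve_context(query)
    return await generate_answer(query, chunks)


async def main(n: int = 10):
    start = time.perf_counter()
    answers = await asyncio.gather(*(handle_chat(f"q{i}") for i in range(n)))
    return answers, (time.perf_counter() - start) * 1000


answers, elapsed_ms = asyncio.run(main())
print(f"{len(answers)} requests in {elapsed_ms:.0f} ms")
```

Ten requests with about 100 ms of simulated I/O each complete in far less than the 1000 ms a strictly sequential run would take, because the waits overlap. This only helps with time spent waiting: any CPU-bound Python inside a node still runs one call at a time on the event loop, which is why profiling comes first in the list.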