| # Load Test Report: /chat endpoint |
|
|
| ## Purpose |
|
|
| This report documents a benchmark run of the `/chat` pipeline under controlled |
| in-process conditions. It establishes a baseline for **framework overhead** |
| (FastAPI routing, LangGraph traversal, Pydantic serialization) with no real |
| external I/O. This is the T3-C Part 2 deliverable. |
|
|
| --- |
|
|
| ## Run conditions |
|
|
| | Parameter | Value | |
| |---|---| |
| | Date | 2026-06-26 | |
| | Tool | `scripts/bench_mocked.py` | |
| | Transport | `httpx.ASGITransport(app=app)` (in-process, no TCP) | |
| | Server | No real server process; ASGI interface called directly | |
| | Requests | 50 | |
| | Concurrency | 10 | |
| | Python | 3.11.15 | |
| | Platform | Windows 11, Intel/AMD x86-64 (GIL-bound) | |
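
The request count and concurrency above imply a harness that caps in-flight requests and times each one. The following is a minimal sketch of that pattern, not the contents of `scripts/bench_mocked.py`: `fake_request` is a hypothetical stand-in for an `httpx.AsyncClient` call over `ASGITransport`, and timing each request inside the semaphore is an assumption about how latency was recorded.

```python
import asyncio
import time


async def fake_request() -> int:
    """Hypothetical stand-in for `await client.post("/chat", ...)`; returns a status code."""
    await asyncio.sleep(0.001)
    return 200


async def run_bench(total: int = 50, concurrency: int = 10) -> dict:
    """Fire `total` requests with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)
    latencies_ms: list[float] = []
    errors = 0

    async def one() -> None:
        nonlocal errors
        async with sem:
            # Assumption: the timer starts once the request is actually in flight.
            start = time.perf_counter()
            status = await fake_request()
            latencies_ms.append((time.perf_counter() - start) * 1000)
            if status != 200:
                errors += 1

    wall_start = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(total)))
    wall_s = time.perf_counter() - wall_start
    return {
        "requests": total,
        "errors": errors,
        "wall_ms": wall_s * 1000,
        "throughput_rps": total / wall_s,
        "latencies_ms": latencies_ms,
    }


if __name__ == "__main__":
    result = asyncio.run(run_bench())
    print(result["requests"], result["errors"], f"{result['throughput_rps']:.1f} req/s")
```

Swapping `fake_request` for a real client call would leave the semaphore and timing logic unchanged.
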
|
|
| ### What was mocked |
|
|
| | Boundary | Mock behaviour | |
| |---|---| |
| | Pinecone vector search | Instant return of 1 chunk (cosine 0.92) | |
| | Groq LLM (`generate_answer`) | `MagicMock.invoke()` returning a fake AIMessage | |
| | Groq LLM (`streaming`) | No-op async generator | |
| | Tavily web search | Disabled (`is_tavily_configured=False`) | |
| | FastAPI startup (Pinecone init) | `init_pinecone` no-op | |
| | Response cache | Disabled (`cache_enabled=False`) | |
| | slowapi rate limiter | `limiter.enabled=False` (prevents 30/min limit from firing across 50 requests from one IP) | |
|
|
| ### What ran for real |
|
|
| The full in-process request path: ASGI receive/send, FastAPI middleware |
| (CORS, metrics collection, auth header check), `require_api_key` dependency, |
| the `run_in_threadpool` dispatch into the LangGraph pipeline, all 7 graph |
| nodes (`normalize_input` → `contextualize_query` → `retrieve_context` → |
| `corrective_retrieve` → `decide_next` → `generate_answer` → |
| `format_response`), prompt building, `filter_chunks_by_score`, citation |
| verification, `ChatResponse` Pydantic serialization, and JSON response |
| encoding. |
|
|
| --- |
|
|
| ## Results |
|
|
| ``` |
| === /chat in-process bench (mocked externals) === |
| Requests: 50 |
| Concurrency: 10 |
| Errors: 0 (0.0%) |
| Wall time: 1321 ms |
| Throughput: 37.9 req/s |
| Avg latency: 252.02 ms |
| p50 latency: 272.73 ms |
| p95 latency: 448.47 ms |
| ``` |
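
The summary figures can be recomputed from raw latency samples and the wall time. The sketch below uses a nearest-rank percentile, which is an assumption: the report does not state which percentile method the script uses. The final line cross-checks the reported throughput from 50 requests and 1321 ms.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile (assumed method; the script's own is not stated)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, wall_ms):
    """Derive the headline numbers from per-request latencies and total wall time."""
    return {
        "throughput_rps": len(latencies_ms) / (wall_ms / 1000),
        "avg_ms": statistics.fmean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }


if __name__ == "__main__":
    # Cross-check of the reported throughput: 50 requests in 1321 ms.
    print(round(50 / (1321 / 1000), 1))  # 37.9
```
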
|
|
| --- |
|
|
| ## Interpretation |
|
|
| ### What these numbers measure |
|
|
| The p50 of **273 ms** is the cost of routing, middleware, auth, LangGraph |
| node traversal, schema validation, and JSON serialization with zero I/O |
| latency, measured at a concurrency of 10. It is a floor, not a ceiling: in |
| production, Pinecone and Groq API latency dominate (roughly 250–1000 ms |
| combined, summing the per-service estimates in the gap table below), and the |
| p50 would be 600–1500 ms end-to-end. |
|
|
| ### Why p50 is ~270 ms with mocked externals |
|
|
| The primary bottleneck is Python's GIL combined with `run_in_threadpool`: |
|
|
| - The router dispatches `graph.invoke()` via `run_in_threadpool` (from |
| `starlette.concurrency`, re-exported by `fastapi.concurrency`), which runs |
| the call on a worker thread from AnyIO's thread pool rather than on the |
| event loop. |
| - With 10 concurrent requests, 10 threads compete for the GIL to execute |
| LangGraph's pure-Python node traversal. |
| - Each node call holds the GIL during its Python bytecode execution. |
| - Effective concurrency is constrained: threads execute interleaved, not |
| truly in parallel, under CPU-bound load. |
|
|
| The graph's self-reported `generate_ms ≈ 0.02 ms` (logged per request) |
| reflects only the mock's `.invoke()` call time, not the thread scheduling |
| overhead or GIL contention visible from the outside. |
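
The contention effect described above can be reproduced in isolation. This is an illustrative sketch only: `cpu_bound_node` is a hypothetical stand-in for a pure-Python graph node, and the absolute timings depend on the machine and interpreter. On a standard GIL-enabled CPython build, per-task latency is expected to rise as the thread count grows; a free-threaded build would behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_node(iterations: int = 200_000) -> float:
    """Hypothetical pure-Python node; returns its own observed latency in ms."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i  # bytecode-only work, so the GIL is held throughout
    return (time.perf_counter() - start) * 1000


def mean_latency_ms(workers: int, tasks: int = 20) -> float:
    """Run `tasks` node calls on `workers` threads and average per-task latency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: cpu_bound_node(), range(tasks)))
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    print(f"1 thread:   {mean_latency_ms(1):.1f} ms per task")
    print(f"10 threads: {mean_latency_ms(10):.1f} ms per task")
```

The timer inside each task plays the role of the externally observed request latency: it includes time spent waiting for the GIL, which a per-node timer around a trivial mock call would not show.
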
|
|
| ### Relationship to Prometheus latency (T2.6) |
|
|
| The T2.6 Prometheus histogram (`rag_request_duration_seconds`) records |
| **total time from request receipt to response dispatch**, matching what this |
| bench measures. The p95 of 448 ms under 10-concurrency simulated load sets |
| an expectation: with real Groq and Pinecone I/O, the Prometheus p95 bucket |
| should track at 600–1500 ms in nominal operation (1–2 concurrent users). |
|
|
| A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if |
| reproduced) would point to GIL starvation at higher concurrency, a signal |
| to consider either reducing LangGraph node count or offloading to a |
| subprocess pool. |
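
A p95 read from a Prometheus histogram is an estimate interpolated within a bucket, not an exact sample, which matters when comparing it with the bench's p95. The sketch below mirrors the linear interpolation that `histogram_quantile` applies to cumulative buckets. The bucket boundaries and counts are hypothetical; the actual buckets configured for `rag_request_duration_seconds` are not given in this report.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound_s, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical bucket layout and counts for 50 observations.
    example = [(0.1, 2), (0.25, 20), (0.5, 49), (1.0, 50), (math.inf, 50)]
    print(f"estimated p95: {histogram_quantile(0.95, example) * 1000:.0f} ms")
```

With coarse buckets the estimate can sit tens of milliseconds away from the true percentile, so small gaps between the Prometheus p95 and a bench p95 are not necessarily meaningful.
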
|
|
| ### Throughput ceiling |
|
|
| **37.9 req/s** with 10 concurrent threads and zero I/O represents an |
| upper bound on single-process throughput with the current GIL-bound design. |
| Real throughput (with Groq + Pinecone) at 10 concurrency would be limited |
| by external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s. |
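
The throughput figures can be sanity-checked with Little's law, which relates steady-state throughput to concurrency and mean latency. This is a back-of-the-envelope sketch using only numbers from this report; the function names are illustrative.

```python
def throughput_rps(concurrency: int, mean_latency_s: float) -> float:
    """Little's law: steady-state throughput = in-flight requests / mean latency."""
    return concurrency / mean_latency_s


def implied_latency_s(concurrency: int, rps: float) -> float:
    """Inverse form: the mean latency that a given throughput implies."""
    return concurrency / rps


if __name__ == "__main__":
    # Measured run: concurrency 10, average latency 252.02 ms.
    print(f"{throughput_rps(10, 0.25202):.1f} req/s predicted")
    # Projected 5-15 req/s plateau at concurrency 10.
    print(f"{implied_latency_s(10, 15):.2f} s to {implied_latency_s(10, 5):.2f} s mean latency")
```

The prediction of about 39.7 req/s is close to the measured 37.9 req/s, with the gap attributable to ramp-up and drain at the ends of the run. Read the other way, a plateau of 5–15 req/s at concurrency 10 corresponds to a mean end-to-end latency of roughly 0.67–2.0 s.
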
|
|
| ### What this run does NOT measure |
|
|
| | Gap | Reason | |
| |---|---| |
| | Real Pinecone latency | Mocked; would add 50–200 ms per request | |
| | Real Groq latency | Mocked; would add 200–800 ms per request | |
| | LangSmith tracing overhead | Disabled (no real `LANGSMITH_API_KEY`) | |
| | Cold start (graph compilation) | First request compiles the graph; amortized here | |
| | GZip compression middleware | Not added to this app | |
|
|
| --- |
|
|
| ## How to reproduce |
|
|
| ```bash |
| cd backend |
| # Needs a Python env with dependencies installed |
| PYTHONPATH=backend python scripts/bench_mocked.py |
| ``` |
|
|
| The script self-configures dummy credentials and disables all real external |
| calls. No Pinecone or Groq account is required. |
|
|
| --- |
|
|
| ## Next steps |
|
|
| If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users: |
|
|
| 1. Profile with `py-spy` to identify which LangGraph node holds the GIL |
| longest. |
| 2. Consider converting I/O-bound graph nodes (retrieval, generation) to |
| `async def` with direct `await` on I/O (removing the `run_in_threadpool` |
| wrapper); CPU-bound nodes gain nothing from `async def`. |
| 3. Evaluate LangGraph's async `astream` / `ainvoke` path for the `/chat` |
| endpoint. |
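
The difference between threadpool dispatch and a natively async node can be sketched without LangGraph. Everything here is hypothetical: the node names are invented, `asyncio.to_thread` stands in for `run_in_threadpool`, and the sleeps simulate I/O waits. It does not show LangGraph's own `ainvoke` API.

```python
import asyncio
import time


def retrieve_blocking(query: str) -> str:
    """Hypothetical sync node: occupies a worker thread while waiting on I/O."""
    time.sleep(0.01)
    return f"context for {query}"


async def retrieve_async(query: str) -> str:
    """Hypothetical async node: yields to the event loop while waiting."""
    await asyncio.sleep(0.01)
    return f"context for {query}"


async def handle_via_threadpool(query: str) -> str:
    # asyncio.to_thread plays the role of run_in_threadpool in this sketch.
    return await asyncio.to_thread(retrieve_blocking, query)


async def handle_native(query: str) -> str:
    # No thread hop: the coroutine is awaited directly on the event loop.
    return await retrieve_async(query)


async def main() -> tuple[list[str], list[str]]:
    queries = [f"q{i}" for i in range(10)]
    threaded = await asyncio.gather(*(handle_via_threadpool(q) for q in queries))
    native = await asyncio.gather(*(handle_native(q) for q in queries))
    return threaded, native


if __name__ == "__main__":
    threaded, native = asyncio.run(main())
    print(threaded == native, len(native))
```

Both paths return the same results; the native path avoids thread scheduling for work that is purely waiting. Any node that still does heavy pure-Python computation would block the event loop if made `async def`, which is why profiling comes first.
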
|
|