# Load Test Report: /chat endpoint

## Purpose

This report documents a benchmark run of the `/chat` pipeline under controlled in-process conditions. It establishes a baseline for **framework overhead** (FastAPI routing, LangGraph traversal, Pydantic serialization) with no real external I/O.

This is the T3-C Part 2 deliverable.

---

## Run conditions

| Parameter | Value |
|---|---|
| Date | 2026-06-26 |
| Tool | `scripts/bench_mocked.py` |
| Transport | `httpx.ASGITransport(app=app)` (in-process, no TCP) |
| Server | No real server process; ASGI interface called directly |
| Requests | 50 |
| Concurrency | 10 |
| Python | 3.11.15 |
| Platform | Windows 11, Intel/AMD x86-64 (GIL-bound) |

### What was mocked

| Boundary | Mock behaviour |
|---|---|
| Pinecone vector search | Instant return of 1 chunk (cosine 0.92) |
| Groq LLM (`generate_answer`) | `MagicMock.invoke()` returning a fake AIMessage |
| Groq LLM (`streaming`) | No-op async generator |
| Tavily web search | Disabled (`is_tavily_configured=False`) |
| FastAPI startup (Pinecone init) | `init_pinecone` no-op |
| Response cache | Disabled (`cache_enabled=False`) |
| slowapi rate limiter | `limiter.enabled=False` (prevents the 30/min limit from firing on 50 requests from one IP) |

### What ran for real

The full in-process request path:

- ASGI receive/send
- FastAPI middleware (CORS, metrics collection, auth header check)
- the `require_api_key` dependency
- the `run_in_threadpool` dispatch into the LangGraph pipeline
- all 7 graph nodes (`normalize_input` → `contextualize_query` → `retrieve_context` → `corrective_retrieve` → `decide_next` → `generate_answer` → `format_response`)
- prompt building, `filter_chunks_by_score`, and citation verification
- `ChatResponse` Pydantic serialization and JSON response encoding

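
The run conditions above (50 requests, at most 10 in flight, one latency sample per request) correspond to a bounded-concurrency loop of roughly the following shape. This is a minimal sketch, not the contents of `scripts/bench_mocked.py`: `fake_chat_request` is a hypothetical stand-in for the real `httpx` call over `ASGITransport`, and both the helper names and the percentile method are assumptions.

```python
import asyncio
import statistics
import time


async def fake_chat_request() -> None:
    """Hypothetical stand-in for `await client.post("/chat", ...)`."""
    await asyncio.sleep(0.005)


async def run_bench(total: int, concurrency: int) -> dict:
    """Issue `total` requests with at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)
    latencies_ms: list[float] = []

    async def one_request() -> None:
        async with gate:
            start = time.perf_counter()
            await fake_chat_request()
            latencies_ms.append((time.perf_counter() - start) * 1000)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(total)))
    wall_s = time.perf_counter() - wall_start

    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "requests": total,
        "wall_ms": wall_s * 1000,
        "throughput_rps": total / wall_s,
        "avg_ms": statistics.fmean(latencies_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }


if __name__ == "__main__":
    print(asyncio.run(run_bench(total=50, concurrency=10)))
```

Because the latency clock starts only after a request acquires the semaphore, the samples cover time in flight and exclude time spent waiting for a slot. Whether the real script measures the same interval is not stated in this report.
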

---

## Results

```
=== /chat in-process bench (mocked externals) ===
Requests: 50
Concurrency: 10
Errors: 0 (0.0%)
Wall time: 1321 ms
Throughput: 37.9 req/s
Avg latency: 252.02 ms
p50 latency: 272.73 ms
p95 latency: 448.47 ms
```

---

## Interpretation

### What these numbers measure

The p50 of **273 ms**, measured at concurrency 10, is the cost of routing, middleware, auth, LangGraph node traversal, schema validation, and JSON serialization, with zero I/O latency. It is a floor, not a ceiling: in production, Pinecone and Groq API latency dominate (typically 100–800 ms combined), and the p50 would be 600–1500 ms end-to-end.

### Why p50 is ~270 ms with mocked externals

The primary bottleneck is Python's GIL combined with `run_in_threadpool`:

- The router dispatches `graph.invoke()` via `run_in_threadpool` (provided by `starlette.concurrency` and re-exported by `fastapi.concurrency`; it is not part of `asyncio`), which runs the call on a pooled worker thread.
- With 10 concurrent requests, 10 threads compete for the GIL to execute LangGraph's pure-Python node traversal.
- Each node call holds the GIL during its Python bytecode execution.
- Effective concurrency is constrained: under CPU-bound load the threads interleave rather than run in parallel.

The graph's self-reported `generate_ms ≈ 0.02 ms` (logged per request) reflects only the mock's `.invoke()` call time, not the thread scheduling overhead or GIL contention visible from the outside.

### Relationship to Prometheus latency (T2.6)

The T2.6 Prometheus histogram (`rag_request_duration_seconds`) records **total time from request receipt to response dispatch**, matching what this bench measures. The p95 of 448 ms under simulated load at concurrency 10 sets an expectation: with real Groq and Pinecone I/O, the Prometheus p95 bucket should track at 600–1500 ms in nominal operation (1–2 concurrent users).
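
Before comparing against Prometheus, the bench figures can be cross-checked against each other with Little's law (requests in flight = throughput × latency). The sketch below uses only numbers from the Results block and assumes all 10 slots stay occupied for the whole run and that execution is fully serialized; both are simplifications.

```python
# Figures copied from the Results block of this report.
requests = 50
concurrency = 10
wall_time_ms = 1321.0
reported_avg_ms = 252.02

# Throughput: completed requests per second of wall time.
throughput_rps = requests / (wall_time_ms / 1000)

# If execution were fully serialized, each request would cost about this much.
serialized_cost_ms = wall_time_ms / requests

# Little's law (L = lambda * W), rearranged to W = L / lambda.
# Assumption: all `concurrency` slots stay occupied for the whole run.
predicted_avg_ms = concurrency / throughput_rps * 1000

print(f"throughput       : {throughput_rps:.1f} req/s")
print(f"serialized cost  : {serialized_cost_ms:.1f} ms/request")
print(f"predicted latency: {predicted_avg_ms:.0f} ms (reported avg {reported_avg_ms} ms)")
```

The prediction of about 264 ms sits close to the reported 252 ms average. That is consistent with each request needing roughly 26 ms of serialized work and spending the rest of its latency waiting behind the other requests in flight, although this run alone does not prove that breakdown.
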

A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if reproduced) would point to GIL starvation at higher concurrency, a signal to consider either reducing LangGraph node count or offloading to a subprocess pool.

### Throughput ceiling

**37.9 req/s** with 10 concurrent threads and zero I/O represents an upper bound on single-process throughput with the current GIL-bound design. Real throughput (with Groq + Pinecone) at concurrency 10 would be limited by external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s.

### What this run does NOT measure

| Gap | Reason |
|---|---|
| Real Pinecone latency | Mocked; would add 50–200 ms per request |
| Real Groq latency | Mocked; would add 200–800 ms per request |
| LangSmith tracing overhead | Disabled (no real `LANGSMITH_API_KEY`) |
| Cold start (graph compilation) | First request compiles the graph; amortized here |
| GZip compression middleware | Not added to this app |

---

## How to reproduce

```bash
cd backend
# Needs a Python env with dependencies installed
PYTHONPATH=backend python scripts/bench_mocked.py
```

The script self-configures dummy credentials and disables all real external calls. No Pinecone or Groq account is required.

---

## Next steps

If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users:

1. Profile with `py-spy` to identify which LangGraph node holds the GIL longest.
2. Consider converting the I/O-bound graph nodes to `async def` with direct `await` on I/O (removing the `run_in_threadpool` wrapper). Nodes that are genuinely CPU-bound would block the event loop if run this way and are better left on a worker thread.
3. Evaluate LangGraph's async `astream` / `ainvoke` path for the `/chat` endpoint.
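
The difference between thread dispatch and direct `await` in step 2 can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the project's code: both `retrieve_context_*` functions are hypothetical stand-ins that simulate I/O with a sleep, and `asyncio.to_thread` is used as a stdlib analogue of `run_in_threadpool`.

```python
import asyncio
import time


def retrieve_context_sync(query: str) -> str:
    """Hypothetical sync node: blocks its worker thread during simulated I/O."""
    time.sleep(0.05)
    return f"context for {query!r}"


async def retrieve_context_async(query: str) -> str:
    """Hypothetical async node: awaits simulated I/O on the event loop."""
    await asyncio.sleep(0.05)
    return f"context for {query!r}"


async def via_thread(queries: list[str]) -> list[str]:
    # Current shape: each blocking call occupies one worker thread.
    calls = (asyncio.to_thread(retrieve_context_sync, q) for q in queries)
    return await asyncio.gather(*calls)


async def via_await(queries: list[str]) -> list[str]:
    # Proposed shape: every wait is multiplexed on the event loop, no threads.
    return await asyncio.gather(*(retrieve_context_async(q) for q in queries))


if __name__ == "__main__":
    sample = [f"q{i}" for i in range(10)]
    for runner in (via_thread, via_await):
        start = time.perf_counter()
        asyncio.run(runner(sample))
        print(f"{runner.__name__}: {(time.perf_counter() - start) * 1000:.0f} ms")
```

Both variants overlap their waits, so a timing gap only appears once the thread pool or the GIL becomes the constraint. The sketch shows the structural change, not an expected speed-up for this service.
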