Load Test Report: /chat endpoint
Purpose
This report documents a benchmark run of the /chat pipeline under controlled
in-process conditions. It establishes a baseline for framework overhead
(FastAPI routing, LangGraph traversal, Pydantic serialization) with no real
external I/O. This is the T3-C Part 2 deliverable.
Run conditions
| Parameter | Value |
|---|---|
| Date | 2026-06-26 |
| Tool | scripts/bench_mocked.py |
| Transport | httpx.ASGITransport(app=app) (in-process, no TCP) |
| Server | No real server process; ASGI interface called directly |
| Requests | 50 |
| Concurrency | 10 |
| Python | 3.11.15 |
| Platform | Windows 11, Intel/AMD x86-64 (GIL-bound) |
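The run conditions above (50 requests, at most 10 in flight) can be sketched as a small asyncio harness. This is an illustration only, not the contents of scripts/bench_mocked.py: `run_bench` and `fake_send` are hypothetical names, and `fake_send` stands in for what would presumably be a POST to /chat through an httpx.AsyncClient built on httpx.ASGITransport(app=app).

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_bench(
    send_request: Callable[[int], Awaitable[bool]],
    total_requests: int = 50,
    concurrency: int = 10,
) -> dict:
    """Issue total_requests calls with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies_ms: list[float] = []
    errors = 0

    async def one_call(index: int) -> None:
        nonlocal errors
        async with semaphore:  # blocks while `concurrency` calls are already running
            start = time.perf_counter()
            try:
                ok = await send_request(index)
            except Exception:
                ok = False
            latencies_ms.append((time.perf_counter() - start) * 1000)
            if not ok:
                errors += 1

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_call(i) for i in range(total_requests)))
    wall_ms = (time.perf_counter() - wall_start) * 1000
    return {"latencies_ms": latencies_ms, "errors": errors, "wall_ms": wall_ms}


async def fake_send(index: int) -> bool:
    # Hypothetical stand-in for the real in-process request; always succeeds.
    await asyncio.sleep(0.001)
    return True


result = asyncio.run(run_bench(fake_send))
print(len(result["latencies_ms"]), "requests,", result["errors"], "errors")
```

Latency is timed inside the semaphore, so time spent waiting for a free slot is not counted against a request; whether the real script does the same is not stated in this report.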
What was mocked
| Boundary | Mock behaviour |
|---|---|
| Pinecone vector search | Instant return of 1 chunk (cosine 0.92) |
| Groq LLM (generate_answer) | MagicMock.invoke() returning a fake AIMessage |
| Groq LLM (streaming) | No-op async generator |
| Tavily web search | Disabled (is_tavily_configured=False) |
| FastAPI startup (Pinecone init) | init_pinecone no-op |
| Response cache | Disabled (cache_enabled=False) |
| slowapi rate limiter | limiter.enabled=False (prevents 30/min limit from firing across 50 requests from one IP) |
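The mock behaviours in the table can be approximated with the standard library alone. The sketch below is an assumption about their shape, not the script's actual code: `FakeAIMessage`, `make_mock_llm`, `noop_stream`, and `mock_vector_search` are hypothetical names, and the dataclass stands in for the real AIMessage type so the example has no third-party dependencies.

```python
import asyncio
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for the AIMessage the real pipeline receives."""
    content: str


def make_mock_llm(answer: str = "mocked answer [1]") -> MagicMock:
    # .invoke() returns immediately, so the generate step sees no model latency.
    llm = MagicMock()
    llm.invoke.return_value = FakeAIMessage(content=answer)
    return llm


async def noop_stream(*_args, **_kwargs):
    # Async generator that yields nothing: the bare return ends iteration at once,
    # and the unreachable yield is what makes this function a generator.
    return
    yield


def mock_vector_search(_query: str) -> list[dict]:
    # One chunk with a fixed cosine score, mirroring the Pinecone row in the table.
    return [{"text": "mock chunk", "score": 0.92}]


async def collect(agen) -> list:
    # Drain an async generator into a list.
    return [item async for item in agen]


llm = make_mock_llm()
print(llm.invoke("hello").content)
print(asyncio.run(collect(noop_stream())))
print(mock_vector_search("hello"))
```

Because MagicMock records its calls, the same object can also confirm that the pipeline invoked the LLM exactly once per request.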
What ran for real
The full in-process request path: ASGI receive/send, FastAPI middleware
(CORS, metrics collection, auth header check), require_api_key dependency,
the run_in_threadpool dispatch into the LangGraph pipeline, all 7 graph
nodes (normalize_input → contextualize_query → retrieve_context →
corrective_retrieve → decide_next → generate_answer →
format_response), prompt building, filter_chunks_by_score, citation
verification, ChatResponse Pydantic serialization, and JSON response
encoding.
Results
=== /chat in-process bench (mocked externals) ===
Requests: 50
Concurrency: 10
Errors: 0 (0.0%)
Wall time: 1321 ms
Throughput: 37.9 req/s
Avg latency: 252.02 ms
p50 latency: 272.73 ms
p95 latency: 448.47 ms
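For readers who want to recompute these figures from raw per-request timings, here is a minimal sketch. It assumes a nearest-rank percentile; the report does not say which percentile method the script uses, so p50 and p95 may differ slightly under another method. `percentile` and `summarize` are hypothetical helper names.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], wall_ms: float, errors: int = 0) -> dict:
    total = len(latencies_ms)
    return {
        "requests": total,
        "error_rate_pct": 100 * errors / total,
        # Throughput is completed requests divided by wall-clock seconds.
        "throughput_rps": total / (wall_ms / 1000),
        "avg_ms": statistics.fmean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }


# Placeholder latencies; only the request count and wall time come from this report.
summary = summarize([100.0] * 50, wall_ms=1321)
print(round(summary["throughput_rps"], 1))
```

With 50 requests over 1321 ms of wall time, the throughput line reproduces the reported 37.9 req/s.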
Interpretation
What these numbers measure
The p50 of 273 ms is the cost of routing, middleware, auth, LangGraph node traversal, schema validation, and JSON serialization with zero I/O latency. It is a floor, not a ceiling: in production, Pinecone and Groq API latency dominate (typically 100–800 ms combined), and the p50 would be 600–1500 ms end-to-end.
Why p50 is ~270 ms with mocked externals
The primary bottleneck is Python's GIL combined with run_in_threadpool:
- The router dispatches graph.invoke() via run_in_threadpool (from starlette.concurrency, re-exported as fastapi.concurrency.run_in_threadpool), which schedules the call on a worker thread.
- With 10 concurrent requests, 10 threads compete for the GIL to execute LangGraph's pure-Python node traversal.
- Each node call holds the GIL during its Python bytecode execution.
- Effective concurrency is constrained: threads execute interleaved, not truly parallel, under CPU-bound load.
The graph's self-reported generate_ms ≈ 0.02 ms (logged per request)
reflects only the mock's .invoke() call time, not the thread scheduling
overhead or GIL contention visible from the outside.
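The GIL effect described in this section can be observed in isolation with a short experiment. This is a sketch under stated assumptions: `cpu_bound_node` is a hypothetical pure-Python loop standing in for node traversal, and the comparison only holds on a standard (GIL-enabled) CPython build such as the 3.11 interpreter used here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_node(iterations: int = 200_000) -> int:
    # Pure-Python arithmetic: only one thread at a time can execute this bytecode.
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def timed_run(workers: int, tasks: int = 10) -> tuple[float, list[int]]:
    """Run `tasks` copies of the node on a pool of `workers` threads and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: cpu_bound_node(), range(tasks)))
    return time.perf_counter() - start, results


serial_s, serial_results = timed_run(workers=1)
threaded_s, threaded_results = timed_run(workers=10)
print(f"1 worker: {serial_s * 1000:.0f} ms, 10 workers: {threaded_s * 1000:.0f} ms")
```

On a GIL-bound interpreter the two wall times are typically close, because adding threads does not add CPU parallelism for pure-Python work. Each individual task, however, takes longer from its own point of view in the threaded run, which is the per-request latency inflation this benchmark observes.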
Relationship to Prometheus latency (T2.6)
The T2.6 Prometheus histogram (rag_request_duration_seconds) records
total time from request receipt to response dispatch, matching what this
bench measures. The p95 of 448 ms under 10-concurrency simulated load sets
an expectation: with real Groq and Pinecone I/O, the Prometheus p95 bucket
should track at 600–1500 ms in nominal operation (1–2 concurrent users).
A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if reproduced) would point to GIL starvation at higher concurrency, a signal to consider either reducing LangGraph node count or offloading to a subprocess pool.
Throughput ceiling
37.9 req/s with 10 concurrent threads and zero I/O represents an upper bound on single-machine throughput with the current GIL-bound design. Real throughput (with Groq + Pinecone) at 10 concurrency would be limited by external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s.
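As a rough cross-check of these throughput figures, Little's law relates the number of in-flight requests to throughput and latency: throughput ≈ concurrency / latency. The sketch below applies it to the measured average latency and to the 600–1500 ms end-to-end range projected earlier in this report; `steady_state_throughput` is a hypothetical helper, and the result is an idealized steady-state estimate that ignores ramp-up and tail effects.

```python
def steady_state_throughput(concurrency: int, latency_ms: float) -> float:
    """Little's law rearranged: requests per second = in-flight requests / seconds per request."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return concurrency / (latency_ms / 1000)


for label, latency_ms in [
    ("mocked run, avg 252.02 ms", 252.02),
    ("projected end-to-end, 600 ms", 600.0),
    ("projected end-to-end, 1500 ms", 1500.0),
]:
    print(f"{label}: {steady_state_throughput(10, latency_ms):.1f} req/s")
```

This gives about 39.7 req/s for the mocked run, close to the measured 37.9 req/s, and roughly 6.7 to 16.7 req/s for the projected latencies, in line with the 5–15 req/s estimate above.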
What this run does NOT measure
| Gap | Reason |
|---|---|
| Real Pinecone latency | Mocked; would add 50–200 ms per request |
| Real Groq latency | Mocked; would add 200–800 ms per request |
| LangSmith tracing overhead | Disabled (no real LANGSMITH_API_KEY) |
| Cold start (graph compilation) | First request compiles the graph; amortized here |
| GZip compression middleware | Not added to this app |
How to reproduce
cd backend
# Needs a Python env with dependencies installed
PYTHONPATH=. python scripts/bench_mocked.py
The script self-configures dummy credentials and disables all real external calls. No Pinecone or Groq account is required.
Next steps
If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users:
- Profile with py-spy to identify which LangGraph node holds the GIL longest.
- Consider converting graph nodes that wait on external I/O to async def with direct await on that I/O (removing the run_in_threadpool wrapper). Nodes that are purely CPU-bound would block the event loop if made async, so they are not candidates for this change.
- Evaluate LangGraph's async astream / ainvoke path for the /chat endpoint.
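To illustrate the async direction suggested in these steps, here is a minimal sketch of a node that awaits its I/O directly instead of occupying a worker thread. Everything in it is hypothetical: `fake_llm_call` simulates network latency with asyncio.sleep, and `generate_answer_async` only borrows the name of the real node; it is not LangGraph code.

```python
import asyncio
import time


async def fake_llm_call(delay_s: float = 0.05) -> str:
    # Stand-in for an awaited network call; sleeping yields control to the event loop.
    await asyncio.sleep(delay_s)
    return "answer"


async def generate_answer_async(state: dict) -> dict:
    # Hypothetical async node: awaits I/O directly, with no worker thread involved.
    answer = await fake_llm_call()
    return {**state, "answer": answer}


async def serve_concurrently(n_requests: int = 10) -> tuple[float, list[dict]]:
    """Run n_requests node calls concurrently on one event loop and time them."""
    start = time.perf_counter()
    states = await asyncio.gather(
        *(generate_answer_async({"question": f"q{i}"}) for i in range(n_requests))
    )
    return time.perf_counter() - start, list(states)


elapsed_s, states = asyncio.run(serve_concurrently())
print(f"{len(states)} requests in {elapsed_s * 1000:.0f} ms")
```

Ten simulated 50 ms calls overlap on a single thread, so the total is close to one call's latency instead of ten. The benefit applies only to time spent waiting on I/O; CPU-bound traversal would still run serially on the event loop.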