	# Load Test Report — /chat endpoint

	## Purpose

	This report documents a benchmark run of the `/chat` pipeline under controlled
	in-process conditions. It establishes a baseline for framework overhead
	(FastAPI routing, LangGraph traversal, Pydantic serialization) with no real
	external I/O. This is the T3-C Part 2 deliverable.

	---

	## Run conditions

	| Parameter | Value |
	|---|---|
	| Date | 2026-06-26 |
	| Tool | `scripts/bench_mocked.py` |
	| Transport | `httpx.ASGITransport(app=app)` (in-process, no TCP) |
	| Server | No real server process; ASGI interface called directly |
	| Requests | 50 |
	| Concurrency | 10 |
	| Python | 3.11.15 |
	| Platform | Windows 11, Intel/AMD x86-64 (GIL-bound) |
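
The request and concurrency settings above (50 requests, at most 10 in flight) can be sketched with a semaphore-bounded asyncio driver. This is a minimal stdlib-only sketch under stated assumptions, not the contents of `scripts/bench_mocked.py`: `fake_chat_request` is a hypothetical stand-in for the in-process `httpx` call to `/chat`, and starting the timer after a slot is acquired is one possible choice, not necessarily the script's.

```python
import asyncio
import time


async def fake_chat_request(i: int) -> int:
    """Hypothetical stand-in for an in-process POST /chat call; returns a status code."""
    await asyncio.sleep(0.001)  # placeholder for the real ASGI round trip
    return 200


async def run_bench(total: int = 50, concurrency: int = 10):
    """Issue `total` requests with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)
    latencies_ms: list[float] = []
    errors = 0

    async def one(i: int) -> None:
        nonlocal errors
        async with sem:
            # Timer starts once a slot is held, so time spent waiting
            # for a slot is not counted as request latency.
            start = time.perf_counter()
            try:
                status = await fake_chat_request(i)
                if status >= 400:
                    errors += 1
            except Exception:
                errors += 1
            latencies_ms.append((time.perf_counter() - start) * 1000)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(total)))
    wall_ms = (time.perf_counter() - wall_start) * 1000
    return latencies_ms, errors, wall_ms


latencies_ms, errors, wall_ms = asyncio.run(run_bench())
print(f"requests={len(latencies_ms)} errors={errors} wall={wall_ms:.0f} ms")
```

Swapping `fake_chat_request` for a real client call would leave the driver unchanged; only the body of that one coroutine depends on the transport.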

	### What was mocked

	| Boundary | Mock behaviour |
	|---|---|
	| Pinecone vector search | Instant return of 1 chunk (cosine 0.92) |
	| Groq LLM (`generate_answer`) | `MagicMock.invoke()` returning a fake AIMessage |
	| Groq LLM (`streaming`) | No-op async generator |
	| Tavily web search | Disabled (`is_tavily_configured=False`) |
	| FastAPI startup (Pinecone init) | `init_pinecone` no-op |
	| Response cache | Disabled (`cache_enabled=False`) |
	| slowapi rate limiter | `limiter.enabled=False` (prevents 30/min limit from firing across 50 requests from one IP) |
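
The two Groq rows can be illustrated with `unittest.mock` from the standard library. This is a minimal sketch, assuming the mocks look roughly like this; `FakeAIMessage`, `fake_llm`, and `fake_stream` are hypothetical names, and where the real script patches them into the app is not shown in this report.

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for the LLM message object; only `.content` is modelled."""
    content: str


# A MagicMock whose .invoke() returns a canned message immediately.
fake_llm = MagicMock()
fake_llm.invoke.return_value = FakeAIMessage(content="Mocked answer [1].")


async def fake_stream(*args, **kwargs):
    """No-op async generator: yields nothing, like the mocked streaming path."""
    return
    yield  # unreachable, but its presence makes this an async generator


reply = fake_llm.invoke("What is CRAG?")
print(reply.content)
fake_llm.invoke.assert_called_once_with("What is CRAG?")
```

Because the mock returns instantly, any latency observed in the bench comes from the framework path rather than from the LLM boundary.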

	### What ran for real

	The full in-process request path: ASGI receive/send, FastAPI middleware
	(CORS, metrics collection, auth header check), `require_api_key` dependency,
	the `run_in_threadpool` dispatch into the LangGraph pipeline, all 7 graph
	nodes (`normalize_input` → `contextualize_query` → `retrieve_context` →
	`corrective_retrieve` → `decide_next` → `generate_answer` →
	`format_response`), prompt building, `filter_chunks_by_score`, citation
	verification, `ChatResponse` Pydantic serialization, and JSON response
	encoding.

	---

	## Results

	```
	=== /chat in-process bench (mocked externals) ===
	Requests: 50
	Concurrency: 10
	Errors: 0 (0.0%)
	Wall time: 1321 ms
	Throughput: 37.9 req/s
	Avg latency: 252.02 ms
	p50 latency: 272.73 ms
	p95 latency: 448.47 ms
	```
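
The summary figures above can be derived from raw per-request latencies along these lines. This is a sketch that assumes a nearest-rank percentile; the report does not state which percentile method `scripts/bench_mocked.py` uses, so small differences from its output are possible. `percentile` and `summarize` are hypothetical helper names.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], wall_ms: float) -> dict[str, float]:
    """Compute the headline numbers from latencies and total wall time."""
    return {
        "throughput_rps": len(latencies_ms) / (wall_ms / 1000),
        "avg_ms": statistics.fmean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }


# Cross-check of the reported throughput: 50 requests in 1321 ms of wall time.
reported_rps = round(50 / (1321 / 1000), 1)
print(reported_rps)  # 37.9
```

The cross-check confirms that the reported throughput is simply the request count divided by wall time.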

	---

	## Interpretation

	### What these numbers measure

	The p50 of 273 ms is the cost of routing, middleware, auth, LangGraph
	node traversal, schema validation, and JSON serialization under 10-way
	concurrency, with zero I/O latency. It is a floor, not a ceiling: in
	production, Pinecone and Groq API latency dominate (roughly 250–1000 ms
	combined, going by the per-service estimates later in this report), and
	the p50 would likely be 600–1500 ms end-to-end.

	### Why p50 is ~270 ms with mocked externals

	The primary bottleneck is Python's GIL combined with `run_in_threadpool`:

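	- The router dispatches `graph.invoke()` via `run_in_threadpool` (from
	`starlette.concurrency`, re-exported by `fastapi.concurrency`), which in
	current Starlette releases runs the call on an AnyIO worker thread rather
	than on the event loop.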
	- With 10 concurrent requests, 10 threads compete for the GIL to execute
	LangGraph's pure-Python node traversal.
	- Each node call holds the GIL during its Python bytecode execution.
	- Effective concurrency is constrained — threads execute interleaved, not
	truly parallel, under CPU-bound load.
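
The contention described in the list above can be reproduced with the standard library alone. This is a minimal sketch: `cpu_bound_node` is a hypothetical stand-in for pure-Python graph work, the absolute numbers depend on the machine, and a free-threaded (no-GIL) CPython build would behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_node(n: int = 200_000) -> int:
    """Hypothetical stand-in for pure-Python node work.

    The loop holds the GIL except at the interpreter's periodic switch points.
    """
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_request(_: int) -> float:
    """Return the latency of one simulated request in milliseconds."""
    start = time.perf_counter()
    cpu_bound_node()
    return (time.perf_counter() - start) * 1000


# Baseline: one request at a time, no contention.
sequential_ms = [timed_request(i) for i in range(10)]

# Ten threads doing the same work at once, as with 10 in-flight requests.
with ThreadPoolExecutor(max_workers=10) as pool:
    threaded_ms = list(pool.map(timed_request, range(10)))

print(f"sequential avg: {sum(sequential_ms) / 10:.1f} ms")
print(f"threaded avg:   {sum(threaded_ms) / 10:.1f} ms")
```

On a GIL-bound interpreter the threaded average is typically several times the sequential one, because each request spends most of its elapsed time waiting for the interpreter lock rather than executing.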

	The graph's self-reported `generate_ms ≈ 0.02 ms` (logged per request)
	reflects only the mock's `.invoke()` call time, not the thread scheduling
	overhead or GIL contention visible from the outside.

	### Relationship to Prometheus latency (T2.6)

	The T2.6 Prometheus histogram (`rag_request_duration_seconds`) records
	total time from request receipt to response dispatch, matching what this
	bench measures. The p95 of 448 ms under simulated 10-way concurrency sets
	an expectation: with real Groq and Pinecone I/O, the p95 derived from that
	histogram should sit at 600–1500 ms in nominal operation (1–2 concurrent
	users).

	A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if
	reproduced) would point to GIL starvation at higher concurrency — a signal
	to consider either reducing LangGraph node count or offloading to a
	subprocess pool.
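
When comparing against a histogram-derived p95, keep in mind that the value is interpolated from bucket counts and is only as precise as the bucket boundaries. The sketch below shows linear interpolation in the style of PromQL's `histogram_quantile`. The bucket bounds and counts are made up for illustration and are not the app's actual `rag_request_duration_seconds` configuration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound_seconds, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Overflow bucket (or empty bucket at rank 0):
                # fall back to the last known finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for 50 requests.
example = [(0.25, 20), (0.5, 48), (1.0, 50), (math.inf, 50)]
p95 = histogram_quantile(0.95, example)
print(f"estimated p95: {p95 * 1000:.0f} ms")
```

With these illustrative buckets the estimate lands just under the 0.5 s boundary; a coarser bucket layout would move the estimate even if the underlying latencies were identical.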

	### Throughput ceiling

	37.9 req/s with 10 concurrent threads and zero I/O represents an
	approximate upper bound on single-process throughput with the current
	GIL-bound design; additional worker processes could raise it. Real
	throughput (with Groq + Pinecone) at 10 concurrency would be limited by
	external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s.
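
These figures can be sanity-checked with Little's law, which for a closed loop gives throughput ≈ concurrency / mean latency. The sketch below uses only numbers from this report; the 600–1500 ms range is the report's own end-to-end estimate, not a measurement.

```python
def throughput_rps(concurrency: int, mean_latency_ms: float) -> float:
    """Little's law for a closed loop: in-flight requests divided by mean latency."""
    return concurrency / (mean_latency_ms / 1000)


# Mocked run: 10 in flight, 252.02 ms average latency.
mocked = round(throughput_rps(10, 252.02), 1)
print(mocked)  # 39.7

# Estimated end-to-end latency with real I/O: 600 to 1500 ms.
best = round(throughput_rps(10, 600), 1)
worst = round(throughput_rps(10, 1500), 1)
print(worst, best)  # 6.7 16.7
```

The predicted 39.7 req/s for the mocked run is slightly above the measured 37.9 req/s, which is expected because wall time also includes ramp-up and drain. The 6.7–16.7 req/s range for real I/O is broadly in line with the 5–15 req/s estimate.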

	### What this run does NOT measure

	| Gap | Reason |
	|---|---|
	| Real Pinecone latency | Mocked; would add 50–200 ms per request |
	| Real Groq latency | Mocked; would add 200–800 ms per request |
	| LangSmith tracing overhead | Disabled (no real `LANGSMITH_API_KEY`) |
	| Cold start (graph compilation) | First request compiles the graph; amortized here |
	| GZip compression middleware | Not added to this app |

	---

	## How to reproduce

	```bash
	cd backend
	# Needs a Python env with dependencies installed.
	# PYTHONPATH is resolved relative to the current directory, so after `cd backend` it is `.`
	PYTHONPATH=. python scripts/bench_mocked.py
	```

	The script self-configures dummy credentials and disables all real external
	calls. No Pinecone or Groq account is required.

	---

	## Next steps

	If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users:

	1. Profile with `py-spy` to identify which LangGraph node holds the GIL
	longest.
	2. Consider converting I/O-bound graph nodes (retrieval, generation) to
	`async def` with direct `await` on I/O, removing the `run_in_threadpool`
	wrapper. CPU-bound nodes gain nothing from `async def` alone.
	3. Evaluate LangGraph's async `astream` / `ainvoke` path for the `/chat`
	endpoint.
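
Steps 2 and 3 amount to letting I/O waits overlap on the event loop instead of parking one thread per request. The sketch below illustrates that idea with the standard library only. It does not use the LangGraph API: `retrieve`, `generate`, and `handle_chat` are hypothetical names, and `asyncio.sleep` stands in for network latency.

```python
import asyncio
import time


async def retrieve(query: str) -> list[str]:
    """Hypothetical async retrieval node; the sleep simulates a vector-search call."""
    await asyncio.sleep(0.05)
    return [f"chunk for {query!r}"]


async def generate(query: str, chunks: list[str]) -> str:
    """Hypothetical async generation node; the sleep simulates an LLM call."""
    await asyncio.sleep(0.05)
    return f"answer to {query!r} using {len(chunks)} chunk(s)"


async def handle_chat(query: str) -> str:
    """One request: nodes run in order, but the loop is free during each await."""
    chunks = await retrieve(query)
    return await generate(query, chunks)


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    answers = await asyncio.gather(*(handle_chat(f"q{i}") for i in range(10)))
    return list(answers), time.perf_counter() - start


answers, elapsed_s = asyncio.run(main())
print(f"{len(answers)} requests in {elapsed_s * 1000:.0f} ms")
```

Ten requests finish in roughly the time of one, because the simulated I/O waits overlap. The same does not hold for CPU-bound work, which would still serialize on the event loop, so profiling (step 1) should come first.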