Load Test Report — /chat endpoint

Purpose

This report documents a benchmark run of the /chat pipeline under controlled in-process conditions. It establishes a baseline for framework overhead (FastAPI routing, LangGraph traversal, Pydantic serialization) with no real external I/O. This is the T3-C Part 2 deliverable.

Run conditions

Parameter	Value
Date	2026-06-26
Tool	`scripts/bench_mocked.py`
Transport	`httpx.ASGITransport(app=app)` — in-process, no TCP
Server	No real server process; ASGI interface called directly
Requests	50
Concurrency	10
Python	3.11.15
Platform	Windows 11, Intel/AMD x86-64 (GIL-bound)
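The run shape in the table (50 requests, at most 10 in flight) can be sketched with a semaphore-bounded `asyncio.gather`. This is a minimal sketch, not the contents of `scripts/bench_mocked.py`: `fake_chat_request` is a hypothetical stand-in for the `httpx` call through `ASGITransport`, so the example runs with the standard library alone.

```python
import asyncio
import time


async def fake_chat_request(i: int) -> int:
    """Hypothetical stand-in for `await client.post("/chat", ...)`."""
    await asyncio.sleep(0.001)  # placeholder for the in-process ASGI call
    return 200


async def run_bench(total: int = 50, concurrency: int = 10):
    """Issue `total` requests with at most `concurrency` in flight at once."""
    gate = asyncio.Semaphore(concurrency)
    latencies_ms: list[float] = []
    statuses: list[int] = []

    async def one(i: int) -> None:
        async with gate:  # waits while `concurrency` requests are in flight
            start = time.perf_counter()
            status = await fake_chat_request(i)
            latencies_ms.append((time.perf_counter() - start) * 1000)
            statuses.append(status)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(total)))
    wall_ms = (time.perf_counter() - wall_start) * 1000
    return latencies_ms, statuses, wall_ms


latencies_ms, statuses, wall_ms = asyncio.run(run_bench())
errors = sum(1 for s in statuses if s >= 400)
print(f"requests={len(statuses)} errors={errors} wall={wall_ms:.0f} ms")
```

Latency is timed inside the semaphore, so time spent queued behind the concurrency limit is not counted as request latency. Whether the real script times it the same way is not stated in this report.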

What was mocked

Boundary	Mock behaviour
Pinecone vector search	Instant return of 1 chunk (cosine 0.92)
Groq LLM (`generate_answer`)	`MagicMock.invoke()` returning a fake AIMessage
Groq LLM (`streaming`)	No-op async generator
Tavily web search	Disabled (`is_tavily_configured=False`)
FastAPI startup (Pinecone init)	`init_pinecone` no-op
Response cache	Disabled (`cache_enabled=False`)
slowapi rate limiter	`limiter.enabled=False` (prevents 30/min limit from firing across 50 requests from one IP)
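Two of the mock behaviours in the table, a `MagicMock` whose `.invoke()` returns a canned message and a no-op async generator, can be built as in the sketch below. It assumes nothing about the real patch targets: `FakeAIMessage` is a hypothetical stand-in for the `AIMessage` type, and `fake_llm` and `fake_stream` are illustrative names.

```python
import asyncio
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for the AIMessage type the pipeline expects."""
    content: str


# Sync path: the mock's .invoke() returns a canned message immediately.
fake_llm = MagicMock()
fake_llm.invoke.return_value = FakeAIMessage(content="mocked answer [1]")


async def fake_stream(*args, **kwargs):
    """No-op async generator: yields nothing, so streaming ends at once."""
    return
    yield  # unreachable; its presence makes this an async generator


async def drain(agen):
    """Collect every item an async generator produces."""
    return [token async for token in agen]


answer = fake_llm.invoke("any prompt")
tokens = asyncio.run(drain(fake_stream()))
print(answer.content, tokens)
```

Because the mock returns instantly, any latency the bench observes comes from the surrounding framework path rather than from the LLM boundary.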

What ran for real

The full in-process request path: ASGI receive/send, FastAPI middleware (CORS, metrics collection, auth header check), require_api_key dependency, the run_in_threadpool dispatch into the LangGraph pipeline, all 7 graph nodes (normalize_input → contextualize_query → retrieve_context → corrective_retrieve → decide_next → generate_answer → format_response), prompt building, filter_chunks_by_score, citation verification, ChatResponse Pydantic serialization, and JSON response encoding.

Results

=== /chat in-process bench (mocked externals) ===
Requests:        50
Concurrency:     10
Errors:          0 (0.0%)
Wall time:       1321 ms
Throughput:      37.9 req/s
Avg latency:     252.02 ms
p50 latency:     272.73 ms
p95 latency:     448.47 ms
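The summary figures follow from the per-request latencies and the wall time. The sketch below shows one way to derive them. The nearest-rank percentile is an assumption (the script may interpolate instead), and `summarize` is an illustrative name.

```python
import math


def percentile(sorted_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumption; the script may interpolate)."""
    if not sorted_ms:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize(latencies_ms: list[float], wall_ms: float, errors: int = 0) -> dict:
    """Compute the same fields the bench output reports."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    return {
        "requests": n,
        "error_rate_pct": 100 * errors / n,
        "throughput_rps": n / (wall_ms / 1000),
        "avg_ms": sum(ordered) / n,
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
    }


# Cross-check of the reported throughput: 50 requests in 1321 ms.
print(round(50 / (1321 / 1000), 1))  # 37.9
```

The cross-check confirms that the reported throughput is simply requests divided by wall time, not a sum of per-request rates.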

Interpretation

What these numbers measure

The p50 of 273 ms is the cost of routing, middleware, auth, LangGraph node traversal, schema validation, and JSON serialization at concurrency 10, with zero I/O latency. It is a floor, not a ceiling: in production, Pinecone and Groq API latency dominate (roughly 250–1000 ms combined, summing the per-service estimates of 50–200 ms and 200–800 ms used elsewhere in this report), and the end-to-end p50 would be about 600–1500 ms.

Why p50 is ~270 ms with mocked externals

The primary bottleneck is Python's GIL combined with run_in_threadpool:

The router dispatches graph.invoke() via run_in_threadpool (from starlette.concurrency, re-exported as fastapi.concurrency.run_in_threadpool), which runs the call on an AnyIO worker thread instead of on the event loop.
With 10 concurrent requests, 10 threads compete for the GIL to execute LangGraph's pure-Python node traversal.
Each node call holds the GIL during its Python bytecode execution.
Effective concurrency is constrained — threads execute interleaved, not truly parallel, under CPU-bound load.

The graph's self-reported generate_ms ≈ 0.02 ms (logged per request) reflects only the mock's .invoke() call time, not the thread scheduling overhead or GIL contention visible from the outside.
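The effect described above can be observed in isolation by timing the same CPU-bound function serially and on a thread pool. This is a sketch under stated assumptions: `cpu_bound_node` is a hypothetical stand-in for a pure-Python graph node, and the timings depend on the machine and interpreter build, so no expected output is given.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_node(n: int = 200_000) -> int:
    """Hypothetical stand-in for a pure-Python graph node (no I/O)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_serial(calls: int):
    """Run the node `calls` times on the current thread."""
    start = time.perf_counter()
    results = [cpu_bound_node() for _ in range(calls)]
    return results, time.perf_counter() - start


def time_threaded(calls: int, workers: int):
    """Run the node `calls` times on a pool of `workers` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: cpu_bound_node(), range(calls)))
    return results, time.perf_counter() - start


serial_results, serial_s = time_serial(10)
threaded_results, threaded_s = time_threaded(10, workers=10)
print(f"serial:   {serial_s * 1000:.1f} ms")
print(f"threaded: {threaded_s * 1000:.1f} ms")
```

On a standard CPython build with the GIL, the threaded wall time is typically close to the serial one, because only one thread executes Python bytecode at a time. Each request's observed latency then includes the time it spent waiting for the other threads.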

Relationship to Prometheus latency (T2.6)

The T2.6 Prometheus histogram (rag_request_duration_seconds) records total time from request receipt to response dispatch, matching what this bench measures. The p95 of 448 ms under simulated load at concurrency 10 sets an expectation: with real Groq and Pinecone I/O, the p95 estimated from that histogram should sit at 600–1500 ms in nominal operation (1–2 concurrent users).

A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if reproduced) would point to GIL starvation at higher concurrency — a signal to consider either reducing LangGraph node count or offloading to a subprocess pool.
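When comparing the bench p95 with the Prometheus p95, note that a histogram only stores cumulative bucket counts, so the quantile is an estimate. The sketch below interpolates linearly inside the bucket that holds the target rank, which is the approach PromQL's `histogram_quantile()` takes. The bucket bounds and counts are hypothetical and are not taken from this run or from the app's configuration.

```python
import math


def quantile_from_buckets(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound_s, count) buckets.

    Buckets must be sorted by upper bound and end with an infinite bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls past the last finite bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for 50 requests; not taken from this run.
example = [(0.25, 20), (0.5, 48), (1.0, 50), (math.inf, 50)]
p95_s = quantile_from_buckets(0.95, example)
print(f"estimated p95: {p95_s * 1000:.0f} ms")
```

The estimate can only be as fine as the bucket boundaries, so a small gap between the bench's exact p95 and the Prometheus value does not by itself indicate a regression.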

Throughput ceiling

37.9 req/s with 10 concurrent threads and zero I/O represents an upper bound on single-process throughput with the current GIL-bound design. Real throughput (with Groq + Pinecone) at 10 concurrency would be limited by external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s.
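These throughput figures can be sanity-checked with Little's law: with a fixed number of requests in flight, throughput is roughly concurrency divided by mean latency. The sketch below applies it to the measured average and to the 600–1500 ms end-to-end range estimated in this report. Treating that range as a mean latency is an assumption.

```python
def throughput_rps(concurrency: int, latency_ms: float) -> float:
    """Little's law for a closed loop: throughput = in-flight / latency."""
    return concurrency / (latency_ms / 1000)


print(round(throughput_rps(10, 252.02), 1))  # 39.7
print(round(throughput_rps(10, 600), 1))     # 16.7
print(round(throughput_rps(10, 1500), 1))    # 6.7
```

The first value is slightly above the measured 37.9 req/s, which is expected: fewer than 10 requests are in flight while the run starts up and drains. The other two bracket a range that overlaps the 5–15 req/s estimate.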

What this run does NOT measure

Gap	Reason
Real Pinecone latency	Mocked — would add 50–200 ms per request
Real Groq latency	Mocked — would add 200–800 ms per request
LangSmith tracing overhead	Disabled (no real `LANGSMITH_API_KEY`)
Cold start (graph compilation)	First request compiles the graph; amortized here
GZip compression middleware	Not added to this app

How to reproduce

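cd backend
# Needs a Python env with the project dependencies installed.
# POSIX shell syntax (on Windows, use Git Bash or WSL).
# After `cd backend` the import root is the current directory, hence `.`
PYTHONPATH=. python scripts/bench_mocked.py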

The script self-configures dummy credentials and disables all real external calls. No Pinecone or Groq account is required.
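Self-configuring dummy credentials usually means setting placeholder environment variables before the app is imported, so that settings validation passes without real keys. A minimal sketch, assuming hypothetical variable names (the real script may use different ones):

```python
import os

# Hypothetical variable names; the real script may use different ones.
DUMMY_ENV = {
    "PINECONE_API_KEY": "dummy-pinecone-key",
    "GROQ_API_KEY": "dummy-groq-key",
    "API_KEY": "dummy-api-key",
}


def configure_dummy_env(env=os.environ):
    """Set placeholders without overwriting values already present."""
    for name, value in DUMMY_ENV.items():
        env.setdefault(name, value)
    return env


# Call this before importing the app, so settings read the placeholders.
```

Using `setdefault` keeps any real value that is already exported, which is why the mocks, not the dummy keys, are what prevent external calls.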

Next steps

If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users:

Profile with py-spy to identify which LangGraph node holds the GIL longest.
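Consider converting I/O-bound graph nodes (retrieval, generation) to async def with direct await on I/O, removing the run_in_threadpool wrapper. Nodes that stay CPU-bound would block the event loop if run this way, so they still need a worker thread or process.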
Evaluate LangGraph's async astream / ainvoke path for the /chat endpoint.
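The benefit of the async path can be illustrated without LangGraph: when a node awaits its I/O, ten requests overlap on one thread and the total wall time approaches a single call's latency. In this sketch `fake_io_call` is a hypothetical stand-in for an awaited Pinecone or Groq request, and `async_node` only mimics the shape of a graph node.

```python
import asyncio
import time


async def fake_io_call(delay_s: float = 0.05) -> str:
    """Hypothetical stand-in for an awaited Pinecone or Groq request."""
    await asyncio.sleep(delay_s)  # yields to the event loop while waiting
    return "ok"


async def async_node(state: dict) -> dict:
    """An I/O-bound node as a coroutine: no worker thread is needed."""
    return {**state, "answer": await fake_io_call()}


async def handle_many(n: int):
    """Run `n` node calls concurrently on the event loop."""
    start = time.perf_counter()
    results = await asyncio.gather(*(async_node({"id": i}) for i in range(n)))
    return results, time.perf_counter() - start


results, elapsed_s = asyncio.run(handle_many(10))
print(f"{len(results)} requests in {elapsed_s * 1000:.0f} ms")
```

This only helps for time spent waiting on I/O. The pure-Python traversal cost measured in this report stays on one thread either way, so profiling should come first.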