
Load Test Report: /chat endpoint

Purpose

This report documents a benchmark run of the /chat pipeline under controlled in-process conditions. It establishes a baseline for framework overhead (FastAPI routing, LangGraph traversal, Pydantic serialization) with no real external I/O. This is the T3-C Part 2 deliverable.


Run conditions

| Parameter | Value |
| --- | --- |
| Date | 2026-06-26 |
| Tool | scripts/bench_mocked.py |
| Transport | httpx.ASGITransport(app=app): in-process, no TCP |
| Server | No real server process; ASGI interface called directly |
| Requests | 50 |
| Concurrency | 10 |
| Python | 3.11.15 |
| Platform | Windows 11, Intel/AMD x86-64 (GIL-bound) |
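The run conditions above (50 requests, at most 10 in flight) can be sketched as a small asyncio harness. This is a minimal illustration, not the contents of scripts/bench_mocked.py: `run_bench` and `fake_send` are hypothetical names, and `fake_send` stands in for a POST to /chat through an httpx.AsyncClient built on httpx.ASGITransport(app=app). The percentile method used here (statistics.quantiles) is also an assumption; the real script may compute p50 and p95 differently.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable


async def run_bench(
    send: Callable[[], Awaitable[None]],
    requests: int = 50,
    concurrency: int = 10,
) -> dict[str, float]:
    """Fire `requests` calls to `send`, with at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)
    latencies_ms: list[float] = []

    async def one_call() -> None:
        async with gate:
            # Time only the call itself, not the wait for a free slot.
            start = time.perf_counter()
            await send()
            latencies_ms.append((time.perf_counter() - start) * 1000)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_call() for _ in range(requests)))
    wall_s = time.perf_counter() - wall_start

    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "requests": float(requests),
        "throughput_rps": requests / wall_s,
        "avg_ms": statistics.fmean(latencies_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }


async def fake_send() -> None:
    # Hypothetical stand-in for `await client.post("/chat", json=...)`.
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    print(asyncio.run(run_bench(fake_send)))
```

Swapping `fake_send` for a coroutine that posts to the in-process app would give figures comparable in shape to the Results section, though not necessarily identical in value.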

What was mocked

| Boundary | Mock behaviour |
| --- | --- |
| Pinecone vector search | Instant return of 1 chunk (cosine 0.92) |
| Groq LLM (generate_answer) | MagicMock.invoke() returning a fake AIMessage |
| Groq LLM (streaming) | No-op async generator |
| Tavily web search | Disabled (is_tavily_configured=False) |
| FastAPI startup (Pinecone init) | init_pinecone no-op |
| Response cache | Disabled (cache_enabled=False) |
| slowapi rate limiter | limiter.enabled=False (prevents 30/min limit from firing across 50 requests from one IP) |
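A rough sketch of the three main mock shapes listed above, using only the standard library. The names `fake_llm`, `noop_stream`, `fake_vector_search`, and `collect` are hypothetical, the chunk dictionary layout is invented for illustration, and SimpleNamespace stands in for a real AIMessage so the example needs no third-party imports. The actual patch targets inside the app are not shown in this report.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import MagicMock

# Hypothetical stand-in for the Groq LLM client used by generate_answer.
fake_llm = MagicMock()
# SimpleNamespace mimics only the `.content` attribute of an AIMessage.
fake_llm.invoke.return_value = SimpleNamespace(content="mocked answer [1]")


async def noop_stream(*_args, **_kwargs):
    """Async generator that yields nothing, like the streaming mock."""
    return
    yield  # unreachable, but its presence makes this an async generator


def fake_vector_search(_query: str) -> list[dict]:
    """Return a single chunk immediately, like the Pinecone mock."""
    return [{"text": "stub chunk", "score": 0.92}]


async def collect(stream) -> list:
    """Drain an async generator into a list."""
    return [item async for item in stream]
```

Because every mock returns without waiting, any latency observed in the benchmark comes from the in-process request path rather than from these boundaries.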

What ran for real

The full in-process request path: ASGI receive/send, FastAPI middleware (CORS, metrics collection, auth header check), require_api_key dependency, the run_in_threadpool dispatch into the LangGraph pipeline, all 7 graph nodes (normalize_input → contextualize_query → retrieve_context → corrective_retrieve → decide_next → generate_answer → format_response), prompt building, filter_chunks_by_score, citation verification, ChatResponse Pydantic serialization, and JSON response encoding.


Results

=== /chat in-process bench (mocked externals) ===
Requests:        50
Concurrency:     10
Errors:          0 (0.0%)
Wall time:       1321 ms
Throughput:      37.9 req/s
Avg latency:     252.02 ms
p50 latency:     272.73 ms
p95 latency:     448.47 ms

Interpretation

What these numbers measure

The p50 of 273 ms is the cost of routing, middleware, auth, LangGraph node traversal, schema validation, and JSON serialization at concurrency 10, with zero I/O latency. It is a floor, not a ceiling: in production, Pinecone and Groq API latency dominate (roughly 250–1000 ms combined, summing the per-service estimates in this report), and the p50 would be 600–1500 ms end-to-end.

Why p50 is ~270 ms with mocked externals

The primary bottleneck is Python's GIL combined with run_in_threadpool:

  • The router dispatches graph.invoke() via run_in_threadpool (from starlette.concurrency, re-exported as fastapi.concurrency), which schedules the call on a worker thread from AnyIO's thread pool.
  • With 10 concurrent requests, 10 threads compete for the GIL to execute LangGraph's pure-Python node traversal.
  • Each node call holds the GIL during its Python bytecode execution.
  • Effective concurrency is constrained: threads execute interleaved, not truly parallel, under CPU-bound load.
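The effect described in the bullets above can be observed with a small standard-library experiment. This is a sketch under stated assumptions: `cpu_bound` is a hypothetical stand-in for pure-Python node traversal, and the comparison only applies to a standard (GIL) CPython build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n: int = 200_000) -> int:
    """Pure-Python loop; the running thread needs the GIL for every step."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_run(workers: int, tasks: int = 10) -> tuple[float, list[int]]:
    """Run `tasks` CPU-bound calls on `workers` threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: cpu_bound(), range(tasks)))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial_s, _ = timed_run(workers=1)
    threaded_s, _ = timed_run(workers=10)
    # On a GIL build the two timings tend to be close, because only one
    # thread executes Python bytecode at any moment.
    print(f"1 worker: {serial_s:.3f}s   10 workers: {threaded_s:.3f}s")
```

If ten workers finish in roughly the same time as one, adding threads is not adding CPU parallelism, which is the behaviour the benchmark attributes to the /chat pipeline.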

The graph's self-reported generate_ms ≈ 0.02 ms (logged per request) reflects only the mock's .invoke() call time, not the thread scheduling overhead or GIL contention visible from the outside.

Relationship to Prometheus latency (T2.6)

The T2.6 Prometheus histogram (rag_request_duration_seconds) records total time from request receipt to response dispatch, matching what this bench measures. The p95 of 448 ms under 10-concurrency simulated load sets an expectation: with real Groq and Pinecone I/O, the Prometheus p95 bucket should track at 600–1500 ms in nominal operation (1–2 concurrent users).

A sharp rise in the Prometheus p95 above 2000 ms with mocked externals (if reproduced) would point to GIL starvation at higher concurrency, a signal to consider either reducing the LangGraph node count or offloading to a subprocess pool.

Throughput ceiling

37.9 req/s with 10 concurrent threads and zero I/O represents an upper bound on single-process throughput with the current GIL-bound design. Real throughput (with Groq + Pinecone) at concurrency 10 would be limited by external I/O (Groq: ~200–800 ms) and would likely plateau at 5–15 req/s.
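The throughput figures can be cross-checked with Little's law (throughput = requests in flight / time per request). The helper name `steady_state_rps` is hypothetical, and the sketch assumes all 10 slots stay busy for the whole run, which is only approximately true near the start and end of a 50-request burst.

```python
def steady_state_rps(concurrency: int, latency_ms: float) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    return concurrency / (latency_ms / 1000.0)


# Measured run: 50 requests in 1321 ms of wall time.
measured_rps = 50 / 1.321

# Cross-check from the reported average latency at concurrency 10.
implied_rps = steady_state_rps(10, 252.02)

# Projection from the report's 600 to 1500 ms end-to-end estimate.
best_case_rps = steady_state_rps(10, 600)
worst_case_rps = steady_state_rps(10, 1500)

print(f"measured={measured_rps:.1f} implied={implied_rps:.1f}")
print(f"projected={worst_case_rps:.1f}..{best_case_rps:.1f} req/s")
```

The measured and implied values agree to within about 2 req/s, and the projected range of roughly 7 to 17 req/s is broadly in line with the 5–15 req/s estimate above.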

What this run does NOT measure

| Gap | Reason |
| --- | --- |
| Real Pinecone latency | Mocked; would add 50–200 ms per request |
| Real Groq latency | Mocked; would add 200–800 ms per request |
| LangSmith tracing overhead | Disabled (no real LANGSMITH_API_KEY) |
| Cold start (graph compilation) | First request compiles the graph; amortized here |
| GZip compression middleware | Not added to this app |

How to reproduce

cd backend
# Needs a Python env with dependencies installed
PYTHONPATH=. python scripts/bench_mocked.py

The script self-configures dummy credentials and disables all real external calls. No Pinecone or Groq account is required.


Next steps

If real-traffic profiling shows p95 > 2000 ms under ≥ 5 concurrent users:

  1. Profile with py-spy to identify which LangGraph node holds the GIL longest.
  2. Consider converting graph nodes that wait on I/O to async def with a direct await on that I/O (removing the run_in_threadpool wrapper); CPU-bound nodes gain nothing from async def alone.
  3. Evaluate LangGraph's async astream / ainvoke path for the /chat endpoint.
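The difference between the thread-offload path and a native async path can be sketched with the standard library alone. All function names below are hypothetical, asyncio.to_thread stands in for run_in_threadpool, and the sleeps stand in for a real LLM or vector-store round trip; none of this is the app's actual node code.

```python
import asyncio
import time


def blocking_llm_call(prompt: str) -> str:
    """Hypothetical sync client call; blocks its thread while waiting."""
    time.sleep(0.05)
    return f"answer to {prompt}"


async def async_llm_call(prompt: str) -> str:
    """Hypothetical async client call; yields to the event loop while waiting."""
    await asyncio.sleep(0.05)
    return f"answer to {prompt}"


async def node_via_threadpool(prompt: str) -> str:
    # Current shape: a sync call pushed onto a worker thread.
    return await asyncio.to_thread(blocking_llm_call, prompt)


async def node_native_async(prompt: str) -> str:
    # Proposed shape: await the I/O directly, with no worker thread involved.
    return await async_llm_call(prompt)


async def main() -> list[str]:
    prompts = [f"q{i}" for i in range(10)]
    # Ten native-async nodes overlap their waits on a single thread.
    return await asyncio.gather(*(node_native_async(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both shapes return the same result; the native path simply avoids thread scheduling for time spent waiting on I/O. Any CPU-bound work inside a node would still run on the event loop thread and would need separate treatment.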