RAG Agent Workbench: Design Document
Audience: Engineers and recruiters reviewing this repo.
Purpose: Explain the decisions behind the system, not just what it does but why each choice was made and what the real tradeoffs are.
Exhaustive detail lives in docs/CONTEXT.md; this document curates the decisions that matter most.
What this is
A production-style RAG (Retrieval-Augmented Generation) backend built as a deliberate engineering exercise in decision-driven design. It ingests documents from Wikipedia, arXiv, and OpenAlex into a Pinecone vector index, then answers questions over that corpus via a 7-node LangGraph pipeline backed by Groq (LLaMA) and optional Tavily web search.
The headline capability: agentic RAG with corrective retrieval, cosine-gated abstention, two-layer faithfulness checking, honest token-level streaming, and per-request cost accounting, all wired to a Streamlit chat UI and a Prometheus metrics endpoint.
Every major feature was preceded by a retrieval evaluation harness. The rule: no parameter change without a measurement that justifies it.
Stack: FastAPI · LangGraph/LangChain · Pinecone (llama-text-embed-v2, 1024-dim, cosine) ·
Groq (LLaMA 3.1 8B) · Tavily (optional) · Streamlit · Prometheus · Docker
Architecture
See the pipeline diagram in the README for the full node flow.
A request to POST /chat passes through:
- FastAPI middleware: CORS, API key auth (X-API-Key), slowapi rate limit (30 req/min), Prometheus HTTP instrumentation, in-memory TTL cache check.
- run_in_threadpool: dispatches the LangGraph graph into a thread.
- LangGraph pipeline (7 nodes, synchronous): see diagram.
- Response serialization: Pydantic ChatResponse with grounding metadata, timings, token usage, and source citations.
POST /chat/stream runs phases 1 and 3 (pre-generation nodes + post-generation grounding)
in a thread pool, with phase 2 (token generation) streamed async via llm.astream for real
first-token latency improvement.
Key Design Decisions
1. Eval-first, anti-circular-validation
The evaluation harness (eval/) was built before any parameter was tuned. Golden-set
relevant_doc_ids are determined by reading document content, never by running the retriever
and labelling its own output. Labelling retriever output would make recall@k tautological
(the retriever would appear to have perfect recall because the labels were derived from its
own results).
Tradeoff: building the harness first added upfront cost with no immediate feature output. The payoff is that every subsequent decision (reranking, top_k, cosine floor) is backed by a number, not intuition.
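The circularity argument can be made concrete with a small sketch. The function name `recall_at_k` and the document IDs are illustrative assumptions, not the harness's actual code; only the `relevant_doc_ids` field name comes from this document.

```python
from typing import Sequence, Set


def recall_at_k(ranked_doc_ids: Sequence[str], relevant_doc_ids: Set[str], k: int) -> float:
    """Fraction of labelled-relevant docs that appear in the top-k results."""
    if not relevant_doc_ids:
        raise ValueError("golden entry has no relevant_doc_ids")
    return len(set(ranked_doc_ids[:k]) & relevant_doc_ids) / len(relevant_doc_ids)


# Hypothetical retriever output for one query.
ranked = ["doc-a", "doc-b", "doc-d", "doc-c"]

# Labels assigned by reading the documents, independent of any retriever run.
hand_labelled = {"doc-a", "doc-c"}
honest_recall = recall_at_k(ranked, hand_labelled, k=3)

# Labels derived from the retriever's own top-3: recall is perfect by construction.
circular_labels = set(ranked[:3])
circular_recall = recall_at_k(ranked, circular_labels, k=3)
```

With hand labels the metric can expose a miss (doc-c sits at rank 4); with circular labels it cannot, whatever the retriever returns.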
2. Two-threshold retrieval gate
Two independently configurable cosine thresholds serve different purposes:
| Setting | Default | Purpose |
|---|---|---|
| RAG_MIN_SCORE | 0.25 | Routing: if top_score < 0.25, route to Tavily web fallback |
| RAG_MIN_CHUNK_SCORE | 0.20 | Safety floor: drop individual Pinecone chunks below this cosine score before they enter the LLM context |
The floor at 0.20 is a data-derived safety bound: the minimum cosine score of any golden-relevant chunk across 30 evaluation queries was 0.2368. Setting the floor at 0.20 places it below this bound so no known-relevant chunk is dropped. It is not a tuned optimum: sharp floor calibration requires chunk-level graded relevance labels.
Tradeoff: two thresholds with different semantics create configuration surface. Keeping them distinct (even at different defaults) avoids the silent failure mode of a single threshold accidentally serving both routing and filtering purposes.
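A minimal sketch of how the two thresholds might be applied in sequence. The `Chunk` dataclass, the `gate` function, and the route labels `"rag"` / `"web_fallback"` are hypothetical names chosen for illustration; only the two setting names and their defaults come from the table above.

```python
from dataclasses import dataclass
from typing import List, Tuple

RAG_MIN_SCORE = 0.25        # routing threshold (default from the table above)
RAG_MIN_CHUNK_SCORE = 0.20  # per-chunk safety floor (default from the table above)


@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity reported by the vector index


def gate(chunks: List[Chunk]) -> Tuple[str, List[Chunk]]:
    """Route on the best score first, then filter individual chunks by the floor."""
    top_score = max((c.score for c in chunks), default=0.0)
    if top_score < RAG_MIN_SCORE:
        # Retrieval as a whole is too weak: hand off to the web fallback.
        return "web_fallback", []
    # Retrieval is usable: drop only the chunks below the safety floor.
    kept = [c for c in chunks if c.score >= RAG_MIN_CHUNK_SCORE]
    return "rag", kept


route, kept = gate([Chunk("a", 0.30), Chunk("b", 0.22), Chunk("c", 0.10)])
```

Here the 0.22 chunk survives even though it sits below the routing threshold, which is the case a single shared threshold would silently get wrong.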
3. Reranking: evaluated and disabled
A Pinecone hosted reranker (bge-reranker-v2-m3) was implemented, A/B tested against the
baseline, and disabled by default after measurement showed it was flat or negative on every
quality metric while more than doubling mean latency:
| Metric | Baseline | Rerank | Δ |
|---|---|---|---|
| nDCG@3 | 0.875 | 0.818 | -0.057 |
| nDCG@5 | 0.900 | 0.869 | -0.031 |
| Precision@1 | 0.966 | 0.966 | 0.000 |
| Mean latency | 360 ms | 795 ms | +435 ms |
Root cause: the corpus (34 chunks / 23 docs) is too small and well-separated for the
dense retriever to miscalibrate top-of-list order. The reranker cannot demonstrate headroom
it never had. RAG_RERANK_ENABLED=False is the empirically-validated default; enable it only
after the corpus grows to where dense retrieval misfires on precision.
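For readers unfamiliar with the metric in the table, here is a sketch of nDCG@k under the assumption of binary relevance labels (the harness may use graded relevance instead). The function name and document IDs are illustrative.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_doc_ids: Sequence[str], relevant_doc_ids: Set[str], k: int) -> float:
    """nDCG@k with binary gains: a relevant doc at 0-based rank i contributes 1/log2(i+2)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant_doc_ids
    )
    ideal_hits = min(len(relevant_doc_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


relevant = {"doc-a"}
baseline_order = ["doc-a", "doc-b", "doc-c"]  # relevant doc already at rank 1
reranked_order = ["doc-b", "doc-a", "doc-c"]  # a reranker that demotes it to rank 2

baseline_ndcg = ndcg_at_k(baseline_order, relevant, k=3)
reranked_ndcg = ndcg_at_k(reranked_order, relevant, k=3)
```

When the baseline order is already ideal, any reordering can only hold nDCG steady or lower it, which matches the flat-or-negative pattern in the table.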
4. top_k = 5: precision-first
The quality-vs-k curve (n=30 queries) shows:
| k | Recall@k | P@k |
|---|---|---|
| 5 | 0.914 | 0.360 |
| 8 | 0.969 | 0.242 |
| 10 | 0.981 | 0.197 |
The recall-margin knee is k=8 (both recall and nDCG within 0.02 of the k=10 ceiling).
Despite this, RAG_DEFAULT_TOP_K is kept at 5, a precision-first choice: k=5 delivers
higher-signal context (P@5=0.36 vs P@8=0.24) at the accepted cost of 6.7 recall points
relative to the k=10 ceiling (5.5 points relative to k=8).
Tradeoff: recall@k cannot settle this: it measures whether relevant docs appear in the ranked list, not whether a larger-but-noisier context improves LLM answer quality. The tiebreaker is a head-to-head answer-quality evaluation, which does not yet exist. Until it does, context signal quality is preferred over recall coverage.
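The knee and the recall cost can be re-derived from the table. This sketch uses only the recall and precision columns shown above (nDCG per k is not listed in this document, so it is left out); the variable names are illustrative.

```python
# k -> (Recall@k, P@k), copied from the table above.
curve = {5: (0.914, 0.360), 8: (0.969, 0.242), 10: (0.981, 0.197)}

ceiling_recall = curve[max(curve)][0]  # recall at the largest k measured

# Knee: smallest k whose recall is within 0.02 of the ceiling.
knee = min(k for k, (recall, _) in curve.items() if ceiling_recall - recall <= 0.02)

# Recall given up by staying at k=5, in percentage points.
cost_vs_ceiling = round((ceiling_recall - curve[5][0]) * 100, 1)
cost_vs_knee = round((curve[knee][0] - curve[5][0]) * 100, 1)

# Precision gained by staying at k=5 instead of moving to the knee.
precision_gain_vs_knee = round(curve[5][1] - curve[knee][1], 3)
```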
5. Bounded CRAG corrective loop
corrective_retrieve (between retrieve_context and decide_next) grades retrieval quality
by the cosine score already in state. If weak, it rewrites the query with Groq and re-queries
Pinecone, up to RAG_CRAG_MAX_ITERS=2 times (a hard, unconditional loop bound).
The bound is non-negotiable: without it, a query on a topic not in the knowledge base would spin indefinitely on weak retrieval, exhausting rate limits and blocking the response.
Disabled by default (RAG_CRAG_ENABLED=False): the corpus is saturated at recall@10=0.97,
so the corrective loop fires rarely on in-corpus queries. Enable it only after observing
out-of-corpus queries where initial retrieval fails and the rewrite demonstrably helps.
Circular-validation avoidance: the grader uses the cosine score already in state; it does not re-embed with the retrieval model. Re-embedding would assess the retriever's output with the retriever's own semantic space.
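A minimal sketch of the bounded loop, assuming the node receives the current top cosine score from state and is handed two callables: `rewrite` (standing in for the Groq query rewrite) and `requery` (standing in for the Pinecone query, returning the new top score). The signature is hypothetical; the two constants are the settings named in this document.

```python
from typing import Callable, Tuple

RAG_CRAG_MAX_ITERS = 2      # hard loop bound from this section
RAG_CRAG_GOOD_SCORE = 0.45  # placeholder threshold listed under Limitations


def corrective_retrieve(
    query: str,
    top_score: float,
    rewrite: Callable[[str], str],
    requery: Callable[[str], float],
) -> Tuple[str, float, int]:
    """Rewrite and re-query while retrieval is weak, at most RAG_CRAG_MAX_ITERS times."""
    iterations = 0
    while top_score < RAG_CRAG_GOOD_SCORE and iterations < RAG_CRAG_MAX_ITERS:
        query = rewrite(query)        # assumption: one LLM call per rewrite
        top_score = requery(query)    # assumption: one index query per rewrite
        iterations += 1
    return query, top_score, iterations


# An out-of-corpus query whose score never improves still terminates after two rewrites.
final_query, final_score, used = corrective_retrieve(
    "off-topic", 0.10, lambda q: q + " (rewritten)", lambda q: 0.10
)
```

The iteration counter is checked independently of the score, so termination does not depend on the rewrite ever helping.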
6. Two-layer faithfulness check
| Layer | When | Model calls | What it checks |
|---|---|---|---|
| verify_citations | Always | Zero | [n] citation markers that reference out-of-range chunk indices |
| judge_faithfulness | When RAG_FAITHFULNESS_ENABLED=True + not abstaining | 1 (reuses Groq client) | Whether answer claims are supported by the retrieved context |
The judge uses the existing Groq LLM, not the retrieval embedder. Re-embedding the answer with the same model used for retrieval would encode the embedder's biases into the faithfulness signal (circular validation).
Flag default OFF: every /chat request would otherwise pay for a second LLM call. On
Groq's free tier the cost is latency, not money, but it is still undesirable for interactive
use. When the flag is OFF, grounded and faithfulness_score in ChatResponse are null;
the UI renders this as "not evaluated" and never fabricates a value.
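The zero-model-call layer can be as small as a regular expression. This is a sketch, assuming citation markers are 1-based (`[1]` refers to the first retrieved chunk); the function name `out_of_range_citations` is hypothetical and is not the project's `verify_citations` implementation.

```python
import re
from typing import List


def out_of_range_citations(answer: str, num_chunks: int) -> List[int]:
    """Return cited indices that do not correspond to any retrieved chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Assumption: valid markers run from 1 to num_chunks inclusive.
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


answer = "Transformers use attention [1]. They scale well [4], per the survey [2]."
bad = out_of_range_citations(answer, num_chunks=3)
```

A non-empty result means the answer cites a chunk that was never in the context, which can be flagged without spending a second LLM call.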
7. Honest streaming
/chat/stream uses llm.astream for the generation phase only (the nodes where it matters for
TTFT). Pre-generation nodes (retrieval, CRAG, web search) are run synchronously in a thread
pool; making them async-native would add complexity with no meaningful latency improvement.
Non-streamable paths are honest:
- Cache hit: one token event with the full cached answer, done.cached=true
- Abstention: one token event with the deterministic abstention text
- Neither path calls the LLM or simulates token-by-token output
The previous implementation yielded whitespace-split words from a completed string. That misrepresented itself as streaming.
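The three paths can be sketched as one async generator. The event dictionaries, the `stream_events` name, and its parameters are assumptions for illustration; `token_source` stands in for whatever `llm.astream(...)` yields, reduced here to plain strings.

```python
import asyncio
from typing import AsyncIterator, Dict, Optional


async def stream_events(
    cached_answer: Optional[str],
    abstain_text: Optional[str],
    token_source: Optional[AsyncIterator[str]],
) -> AsyncIterator[Dict[str, object]]:
    """Emit one event for non-streamable paths; real per-token events otherwise."""
    if cached_answer is not None:
        # Cache hit: the whole answer in a single token event, flagged as cached.
        yield {"event": "token", "text": cached_answer}
        yield {"event": "done", "cached": True}
        return
    if abstain_text is not None:
        # Abstention: deterministic text, no LLM call, no simulated tokens.
        yield {"event": "token", "text": abstain_text}
        yield {"event": "done", "cached": False}
        return
    # Generation: forward tokens exactly as the model produces them.
    async for token in token_source:
        yield {"event": "token", "text": token}
    yield {"event": "done", "cached": False}
```

Nothing in the cached or abstention branches splits a finished string into pieces, so the number of token events reflects what actually happened upstream.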
8. Cost and token observability
Token counts come from the actual API response (response.usage_metadata), not a local
tokenizer estimate. All four LLM call types (generation, faithfulness judge, CRAG rewrite,
history contextualization) are tracked by call_type in ChatResponse.usage.by_call_type and
emitted as a Prometheus counter (llm_tokens_total{call_type=...}).
Dollar cost is an estimate from an as-of-date pricing table (2026-06-25) and is labeled
as such. Embedding token counts are not reported because the Pinecone SDK does not expose them.
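A sketch of per-call-type accounting, assuming each LLM response exposes a `usage_metadata` mapping with `input_tokens` and `output_tokens` keys (the shape LangChain uses for chat model responses). The helper names and the two prices are placeholders for illustration, not the project's pricing table.

```python
from collections import defaultdict
from typing import Dict, Mapping

# Placeholder prices for illustration only; not the project's as-of-date table.
USD_PER_MILLION_INPUT = 1.00
USD_PER_MILLION_OUTPUT = 2.00


def new_usage() -> Dict[str, Dict[str, int]]:
    """Accumulator keyed by call_type (generation, judge, rewrite, contextualization)."""
    return defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})


def record(usage: Dict[str, Dict[str, int]], call_type: str, usage_metadata: Mapping[str, int]) -> None:
    """Add the provider-reported token counts for one LLM call."""
    usage[call_type]["input_tokens"] += usage_metadata["input_tokens"]
    usage[call_type]["output_tokens"] += usage_metadata["output_tokens"]


def estimated_cost_usd(usage: Mapping[str, Mapping[str, int]]) -> float:
    """Estimate only: static prices applied to reported token counts."""
    total_in = sum(u["input_tokens"] for u in usage.values())
    total_out = sum(u["output_tokens"] for u in usage.values())
    return total_in / 1e6 * USD_PER_MILLION_INPUT + total_out / 1e6 * USD_PER_MILLION_OUTPUT


usage = new_usage()
record(usage, "generation", {"input_tokens": 100, "output_tokens": 50})
record(usage, "faithfulness_judge", {"input_tokens": 200, "output_tokens": 10})
```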
9. Reproducible corpus + pinned dimension
A corpus manifest (eval/corpus_manifest.py generate) snapshots vector IDs from the live
Pinecone index to eval/corpus_manifest.json. A validator (corpus_manifest.py validate)
compares the committed manifest against the live index and reports drift without auto-reconciling.
Both operations are read-only.
The embedding model (llama-text-embed-v2) and dimension (1024) are now explicit in Settings
(PINECONE_EMBED_MODEL, PINECONE_EMBED_DIMENSION) and logged at startup, removing the
implicit dependency on Pinecone's default dimension.
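The drift report reduces to two set differences. This is a sketch with hypothetical names (`diff_manifest`, the result keys, the vector IDs); it takes plain lists of IDs and makes no claim about how the real validator reads the manifest or the index.

```python
from typing import Dict, Iterable, List


def diff_manifest(committed_ids: Iterable[str], live_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Report drift between the committed manifest and the live index; changes nothing."""
    committed, live = set(committed_ids), set(live_ids)
    return {
        "missing_from_index": sorted(committed - live),   # in manifest, gone from index
        "unexpected_in_index": sorted(live - committed),  # in index, not in manifest
    }


drift = diff_manifest(
    committed_ids=["wiki:1#0", "wiki:1#1", "arxiv:7#0"],
    live_ids=["wiki:1#0", "arxiv:7#0", "openalex:3#0"],
)
has_drift = any(drift.values())
```

Reporting without reconciling keeps the decision with a human: a missing ID may mean a bad ingest or an intentional deletion, and the validator cannot tell which.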
Limitations & Tradeoffs
These are the real constraints. A design doc that only lists strengths reads as incomplete.
1. Saturated eval corpus. The evaluation golden set covers 34 chunks / 23 documents. At this scale, baseline dense retrieval is already at recall@10=0.97, so the metrics are ceiling-bound. Any apparent improvement (whether from reranking, CRAG, or parameter changes) may be noise rather than signal. No feature can be conclusively validated until the corpus is at least 10× larger.
2. Prompt injection mitigation, not elimination. The RAG system prompt instructs the LLM to use only the supplied context and cite inline. This reduces prompt injection risk but does not eliminate it: a sufficiently adversarial document can still attempt to override instructions via embedded directives in chunk text.
3. Same-model faithfulness judge. The faithfulness judge calls the same Groq LLM that generated the answer. A model grading its own output has a self-preference bias: it may rate its own claims as grounded even when they are not. A second independent model (e.g. a different provider) would give a less biased verdict but at higher cost and latency.
4. Cost is an estimate. estimated_cost_usd is computed from a static pricing table pinned to 2026-06-25. It does not account for free-tier credits, batch pricing, or promotional rates. Treat it as an order-of-magnitude indicator, not a billing source of truth.
5. Reranking and hybrid search deferred, not for lack of trying. Reranking was implemented and A/B tested; it is disabled because the measurement showed no improvement on this corpus size, not because the implementation is absent. Hybrid search (sparse + dense) is documented and designed but not implemented: the recall gap it would address (proper-noun queries) does not exist at current corpus size, where baseline recall@10=0.97.
6. Chunk size below recommended range. The RecursiveCharacterTextSplitter is configured to ~225 tokens per chunk (900 chars ÷ ~4 chars/token). Pinecone's guidance for llama-text-embed-v2 suggests 400 to 500 tokens for best retrieval quality. The current chunks are too short to exploit the model's full context window. Changing chunk_size requires re-ingestion and re-evaluation against the golden set.
7. CRAG threshold and faithfulness threshold are placeholders. RAG_CRAG_GOOD_SCORE=0.45 (the cosine threshold that triggers query rewriting) and RAG_FAITHFULNESS_THRESHOLD=0.5 (the faithfulness score below which grounded=False) are reasonable midpoints, not values calibrated against labeled data. Both require a held-out answer-quality evaluation to tune.
Testing & Observability
343 tests (321 unit + 22 integration) run in CI with zero network calls, zero credentials.
| Layer | What it tests |
|---|---|
| Unit (321) | Pure functions: metrics, chunking, normalization, dedup, prompt builders, retrieval gating, faithfulness, CRAG, streaming, Prometheus, cost accounting |
| Integration (22) | Real FastAPI app via TestClient: HTTP routing, auth dependency, LangGraph pipeline, SSE protocol, abstention path, faithfulness wiring; externals mocked at boundaries |
CI runs from the fully-pinned backend/requirements.txt lock (compiled with uv pip compile,
constrained to tested versions), so every CI run is a clean-environment reproducibility check.
Observability:
- /metrics (JSON, auth-gated): request counts, error counts, 20-sample timing ring buffer
- /metrics/prometheus (Prometheus text, public): http_requests_total (Counter), http_request_duration_seconds (Histogram), rag_phase_duration_seconds (Histogram), llm_tokens_total (Counter by call_type)
- LangSmith: optional trace collection via LANGCHAIN_TRACING_V2=true
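A fixed-size timing ring buffer like the one behind /metrics can be sketched with `collections.deque`. The `TimingRing` class and its `snapshot` fields are hypothetical; only the 20-sample size comes from this document.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, Optional


class TimingRing:
    """Keep only the most recent `size` request timings; older samples fall off."""

    def __init__(self, size: int = 20) -> None:
        self._samples: Deque[float] = deque(maxlen=size)

    def add(self, seconds: float) -> None:
        self._samples.append(seconds)  # deque(maxlen=...) evicts the oldest automatically

    def snapshot(self) -> Dict[str, Optional[float]]:
        if not self._samples:
            return {"count": 0, "mean_s": None}
        return {"count": len(self._samples), "mean_s": mean(self._samples)}


ring = TimingRing()
for i in range(25):
    ring.add(float(i))  # samples 0..24; only 5..24 are retained
stats = ring.snapshot()
```

Memory stays constant regardless of traffic, at the cost of the summary describing only the last 20 requests.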
How to Run
# Backend
cd backend
pip install -r requirements.txt
cp .env.example .env # fill in PINECONE_*, GROQ_API_KEY, optional API_KEY
uvicorn app.main:app --port 8000
# Frontend
pip install -r requirements.txt # root (Streamlit)
streamlit run frontend/app.py
# Run tests (zero credentials needed)
pytest tests/ -v
# Evaluate retrieval (requires live Pinecone; reads only)
make eval
# Load benchmark (in-process, mocked externals)
PYTHONPATH=backend python scripts/bench_mocked.py
Full configuration reference: backend/.env.example
Operational runbook (key rotation, rate-limit toggle, deployment): docs/CONTEXT.md