# RAG Agent Workbench: Design Document

> **Audience:** Engineers and recruiters reviewing this repo.
> **Purpose:** Explain the *decisions* behind the system: not just what it does, but why each
> choice was made and what the real tradeoffs are.
> Exhaustive detail lives in [`docs/CONTEXT.md`](CONTEXT.md); this document curates the decisions
> that matter most.

---

## What this is

A production-style RAG (Retrieval-Augmented Generation) backend built as a deliberate engineering exercise in decision-driven design. It ingests documents from Wikipedia, arXiv, and OpenAlex into a Pinecone vector index, then answers questions over that corpus via a **7-node LangGraph pipeline** backed by Groq (LLaMA) and optional Tavily web search.

The headline capability: agentic RAG with corrective retrieval, cosine-gated abstention, two-layer faithfulness checking, honest token-level streaming, and per-request cost accounting, all wired to a Streamlit chat UI and a Prometheus metrics endpoint.

Every major feature was preceded by a retrieval evaluation harness. The rule: no parameter change without a measurement that justifies it.

**Stack:** FastAPI · LangGraph/LangChain · Pinecone (`llama-text-embed-v2`, 1024-dim, cosine) · Groq (LLaMA 3.1 8B) · Tavily (optional) · Streamlit · Prometheus · Docker

---

## Architecture

See the [pipeline diagram in the README](../README.md#architecture) for the full node flow.

A request to `POST /chat` passes through:

1. **FastAPI middleware:** CORS, API key auth (`X-API-Key`), slowapi rate limit (30 req/min), Prometheus HTTP instrumentation, in-memory TTL cache check.
2. **`run_in_threadpool`:** dispatches the LangGraph graph into a thread.
3. **LangGraph pipeline** (7 nodes, synchronous): see diagram.
4. **Response serialization:** Pydantic `ChatResponse` with grounding metadata, timings, token usage, and source citations.

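
The cache check and thread dispatch in the request path can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repo's code: `TTLCache`, `run_pipeline`, and `handle_chat` are hypothetical names, and `asyncio.to_thread` stands in for `run_in_threadpool`.

```python
import asyncio
import time
from typing import Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (time.monotonic(), value)


def run_pipeline(question: str) -> dict:
    """Stand-in for the synchronous LangGraph pipeline (assumption, not the real graph)."""
    return {"answer": f"stub answer for: {question}", "cached": False}


async def handle_chat(question: str, cache: TTLCache) -> dict:
    """Check the cache first, then run the blocking pipeline in a worker thread."""
    hit = cache.get(question)
    if hit is not None:
        return {**hit, "cached": True}
    # The pipeline is synchronous, so it must not run on the event loop itself.
    result = await asyncio.to_thread(run_pipeline, question)
    cache.set(question, result)
    return result
```

Running the blocking graph in a worker thread keeps the event loop free to serve other requests, which is the reason for the dispatch step in the list above.
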
`POST /chat/stream` splits the pipeline into three phases of its own. Phases 1 and 3 (pre-generation nodes and post-generation grounding) run in a thread pool; phase 2 (token generation) is streamed asynchronously via `llm.astream`, which gives a real improvement in first-token latency.

---

## Key Design Decisions

### 1. Eval-first, anti-circular-validation

The evaluation harness (`eval/`) was built before any parameter was tuned. Golden-set `relevant_doc_ids` are determined by reading document content, never by running the retriever and labelling its own output. Labelling from retriever output would make recall@k tautological: the retriever would appear to have perfect recall because the labels were derived from its own results.

**Tradeoff:** building the harness first added upfront cost with no immediate feature output. The payoff is that every subsequent decision (reranking, top_k, cosine floor) is backed by a number, not intuition.

---

### 2. Two-threshold retrieval gate

Two independently configurable cosine thresholds serve different purposes:

| Setting | Default | Purpose |
|---|---|---|
| `RAG_MIN_SCORE` | 0.25 | **Routing:** if `top_score < 0.25`, route to Tavily web fallback |
| `RAG_MIN_CHUNK_SCORE` | **0.20** | **Safety floor:** drop individual Pinecone chunks below this cosine score before they enter the LLM context |

The floor at 0.20 is a **data-derived safety bound**: the minimum cosine score of any golden-relevant chunk across 30 evaluation queries was 0.2368. Setting the floor at 0.20 places it below this bound, so no known-relevant chunk is dropped. It is not a tuned optimum: calibrating the floor precisely would require chunk-level graded relevance labels.

**Tradeoff:** two thresholds with different semantics add configuration surface. Keeping them distinct, each with its own default, avoids the silent failure mode of a single threshold accidentally serving both routing and filtering purposes.

---

### 3. Reranking: evaluated and disabled

A Pinecone hosted reranker (`bge-reranker-v2-m3`) was implemented, A/B tested against the baseline, and **disabled by default** after measurement showed it was flat or negative on every metric:

| Metric | Baseline | Rerank | Δ |
|---|---|---|---|
| nDCG@3 | 0.875 | 0.818 | −0.057 |
| nDCG@5 | 0.900 | 0.869 | −0.031 |
| Precision@1 | 0.966 | 0.966 | 0.000 |
| Mean latency | 360 ms | 795 ms | +435 ms |

**Root cause:** the corpus (34 chunks / 23 docs) is too small and well-separated for the dense retriever to miscalibrate top-of-list order. The reranker cannot demonstrate headroom it never had.

`RAG_RERANK_ENABLED=False` is the empirically validated default; enable it only after the corpus grows to the point where dense retrieval misfires on precision.

---

### 4. top_k = 5: precision-first

The quality-vs-k curve (n=30 queries) shows:

| k | Recall@k | P@k |
|---|---|---|
| 5 | 0.914 | 0.360 |
| 8 | 0.969 | 0.242 |
| 10 | 0.981 | 0.197 |

The **recall-margin knee** is k=8 (both recall and nDCG within 0.02 of the k=10 ceiling). Despite this, `RAG_DEFAULT_TOP_K` is kept at **5**, a precision-first choice: k=5 delivers higher-signal context (P@5=0.36 vs P@8=0.24) at the accepted cost of 5.5 recall points relative to k=8 (6.7 relative to the k=10 ceiling).

**Tradeoff:** recall@k cannot settle this, because it measures whether relevant docs appear in the ranked list, not whether a larger-but-noisier context improves LLM answer quality. The tiebreaker is a head-to-head answer-quality evaluation, which does not yet exist. Until it does, context signal quality is preferred over recall coverage.

---

### 5. Bounded CRAG corrective loop

`corrective_retrieve` (between `retrieve_context` and `decide_next`) grades retrieval quality by the cosine score already in state. If the score is weak, it rewrites the query with Groq and re-queries Pinecone, up to `RAG_CRAG_MAX_ITERS=2` times (a hard, unconditional loop bound).

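
One way to make such a bound structural is to drive the rewrite step from a `for` over a fixed range rather than a `while` on the score. The sketch below is an assumption-laden illustration, not the repo's node: `retrieve` and `rewrite` are hypothetical callables standing in for the Pinecone query and the Groq rewrite, the `(text, score)` chunk shape is invented, and keeping the better of the two result sets is a choice made here for the example. The `max_iters=2` default mirrors `RAG_CRAG_MAX_ITERS`; the `good_score` default is a placeholder.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[str, float]  # (text, cosine score); hypothetical shape


def top_score(chunks: List[Chunk]) -> float:
    """Highest cosine score in a result set, or 0.0 when nothing was retrieved."""
    return max((score for _, score in chunks), default=0.0)


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], List[Chunk]],
    rewrite: Callable[[str], str],
    good_score: float = 0.45,
    max_iters: int = 2,
) -> Tuple[List[Chunk], int]:
    """Return the best chunks found and the number of rewrites actually used."""
    best = retrieve(query)
    rewrites = 0
    for _ in range(max_iters):  # hard bound: at most max_iters rewrites, whatever the scores
        if top_score(best) >= good_score:
            break  # retrieval is already good enough, so no rewrite is needed
        query = rewrite(query)
        rewrites += 1
        candidate = retrieve(query)
        if top_score(candidate) > top_score(best):
            best = candidate  # keep the stronger result set (assumption of this sketch)
    return best, rewrites
```

With a retriever that always returns weak scores, this performs exactly `max_iters` rewrites and then returns, so the worst case is `max_iters + 1` retrieval calls per request.
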
The bound is **non-negotiable**: without it, a query on a topic not in the knowledge base would spin indefinitely on weak retrieval, exhausting rate limits and blocking the response.

**Disabled by default** (`RAG_CRAG_ENABLED=False`): the corpus is saturated at recall@10=0.97, so the corrective loop fires rarely on in-corpus queries. Enable it only after observing out-of-corpus queries where initial retrieval fails and the rewrite demonstrably helps.

**Circular-validation avoidance:** the grader uses the cosine score already in state; it does not re-embed with the retrieval model. Re-embedding would assess the retriever's output with the retriever's own semantic space.

---

### 6. Two-layer faithfulness check

| Layer | When | Model calls | What it checks |
|---|---|---|---|
| `verify_citations` | Always | Zero | `[n]` citation markers that reference out-of-range chunk indices |
| `judge_faithfulness` | When `RAG_FAITHFULNESS_ENABLED=True` + not abstaining | 1 (reuses Groq client) | Whether answer claims are supported by the retrieved context |

The judge uses the **existing Groq LLM**, not the retrieval embedder. Re-embedding the answer with the same model used for retrieval would encode the embedder's biases into the faithfulness signal (circular validation).

**Flag default OFF:** every `/chat` request would otherwise pay for a second LLM call. On Groq's free tier the cost is latency, not money, but it is still undesirable for interactive use. When the flag is OFF, `grounded` and `faithfulness_score` in `ChatResponse` are `null`; the UI renders this as "not evaluated" and never fabricates a value.

---

### 7. Honest streaming

`/chat/stream` uses `llm.astream` for the generation phase only (the nodes where it matters for TTFT). Pre-generation nodes (retrieval, CRAG, web search) are run synchronously in a thread pool; making them async-native would add complexity with no meaningful latency improvement.

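
A minimal sketch of this split, using only the standard library. Everything named here is an assumption for illustration: `fake_astream` is a hypothetical stand-in for `llm.astream`, `pre_generation` stands in for the synchronous retrieval nodes, and the event dictionaries are an invented shape rather than the repo's SSE protocol.

```python
import asyncio
from typing import AsyncIterator, Dict, List


def pre_generation(question: str) -> List[str]:
    """Stand-in for the blocking pre-generation nodes (assumption)."""
    return [f"context for: {question}"]


async def fake_astream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for llm.astream: yields tokens one at a time."""
    for token in ["Paris", " is", " the", " capital", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


async def stream_chat(question: str) -> AsyncIterator[Dict[str, str]]:
    """Yield one event per generated token, then a final done event."""
    # Blocking work runs in a worker thread so the event loop stays free.
    context = await asyncio.to_thread(pre_generation, question)
    prompt = "\n".join(context) + "\n\n" + question
    # Each token is forwarded as soon as it arrives; nothing is buffered first.
    async for token in fake_astream(prompt):
        yield {"event": "token", "data": token}
    yield {"event": "done", "data": ""}


async def collect(question: str) -> List[Dict[str, str]]:
    """Drain the stream into a list (useful in tests)."""
    return [event async for event in stream_chat(question)]
```

The property that matters is that `stream_chat` yields inside the `async for`, so a client sees the first token before generation has finished.
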
**Non-streamable paths are honest:**

- Cache hit → one token event with the full cached answer, `done.cached=true`
- Abstention → one token event with the deterministic abstention text
- Neither path calls the LLM or simulates token-by-token output

The previous implementation yielded whitespace-split words from a completed string, which presented a finished answer as if it were being streamed.

---

### 8. Cost and token observability

Token counts come from the **actual API response** (`response.usage_metadata`), not a local tokenizer estimate. All four LLM call types (generation, faithfulness judge, CRAG rewrite, history contextualization) are tracked by `call_type` in `ChatResponse.usage.by_call_type` and emitted as a Prometheus counter (`llm_tokens_total{call_type=...}`).

Dollar cost is an **estimate** from an as-of-date pricing table (`2026-06-25`) and is labeled as such. Embedding token counts are not reported because the Pinecone SDK does not expose them.

---

### 9. Reproducible corpus + pinned dimension

A corpus manifest (`eval/corpus_manifest.py generate`) snapshots vector IDs from the live Pinecone index to `eval/corpus_manifest.json`. A validator (`corpus_manifest.py validate`) compares the committed manifest against the live index and reports drift without auto-reconciling. Both operations are read-only with respect to the index.

The embedding model (`llama-text-embed-v2`) and dimension (1024) are now explicit in `Settings` (`PINECONE_EMBED_MODEL`, `PINECONE_EMBED_DIMENSION`) and logged at startup, removing the implicit dependency on Pinecone's default dimension.

---

## Limitations & Tradeoffs

These are the real constraints. A design doc that only lists strengths reads as incomplete.

**1. Saturated eval corpus.** The evaluation golden set covers 34 chunks / 23 documents. At this scale, baseline dense retrieval is already at recall@10=0.97, so the metrics are ceiling-bound. Any apparent improvement (whether from reranking, CRAG, or parameter changes) may be noise rather than signal.
No feature can be conclusively validated until the corpus is at least 10× larger.

**2. Prompt injection mitigation, not elimination.** The RAG system prompt instructs the LLM to use only the supplied context and cite inline. This reduces prompt injection risk but does not eliminate it: a sufficiently adversarial document can still attempt to override instructions via embedded directives in chunk text.

**3. Same-model faithfulness judge.** The faithfulness judge calls the same Groq LLM that generated the answer. A model grading its own output has a self-preference bias: it may rate its own claims as grounded even when they are not. A second independent model (e.g. from a different provider) would give a less biased verdict, but at higher cost and latency.

**4. Cost is an estimate.** `estimated_cost_usd` is computed from a static pricing table pinned to 2026-06-25. It does not account for free-tier credits, batch pricing, or promotional rates. Treat it as an order-of-magnitude indicator, not a billing source of truth.

**5. Reranking and hybrid search deferred, not for lack of trying.** Reranking was implemented and A/B tested; it is disabled because the measurement showed no improvement at this corpus size, not because the implementation is absent. Hybrid search (sparse + dense) is documented and designed but not implemented: the recall gap it would address (proper-noun queries) does not exist at the current corpus size, where baseline recall@10=0.97.

**6. Chunk size below recommended range.** The `RecursiveCharacterTextSplitter` is configured to ~225 tokens per chunk (900 chars ÷ ~4 chars/token). Pinecone's guidance for `llama-text-embed-v2` suggests 400–500 tokens for best retrieval quality. The current chunks are too short to exploit the model's full context window. Changing `chunk_size` requires re-ingestion and re-evaluation against the golden set.

**7. CRAG threshold and faithfulness threshold are placeholders.** `RAG_CRAG_GOOD_SCORE=0.45` (the cosine score below which query rewriting is triggered) and `RAG_FAITHFULNESS_THRESHOLD=0.5` (the faithfulness score below which `grounded=False`) are reasonable midpoints, not values calibrated against labeled data. Both require a held-out answer-quality evaluation to tune.

---

## Testing & Observability

**343 tests** (321 unit + 22 integration) run in CI with zero network calls and zero credentials.

| Layer | What it tests |
|---|---|
| Unit (321) | Pure functions: metrics, chunking, normalization, dedup, prompt builders, retrieval gating, faithfulness, CRAG, streaming, Prometheus, cost accounting |
| Integration (22) | Real FastAPI app via `TestClient`: HTTP routing, auth dependency, LangGraph pipeline, SSE protocol, abstention path, faithfulness wiring; externals mocked at boundaries |

CI runs from the fully-pinned `backend/requirements.txt` lock (compiled with `uv pip compile`, constrained to tested versions), so every CI run is a clean-environment reproducibility check.

Observability:

- **`/metrics`** (JSON, auth-gated): request counts, error counts, 20-sample timing ring buffer
- **`/metrics/prometheus`** (Prometheus text, public): `http_requests_total` (Counter), `http_request_duration_seconds` (Histogram), `rag_phase_duration_seconds` (Histogram), `llm_tokens_total` (Counter by `call_type`)
- **LangSmith:** optional trace collection via `LANGCHAIN_TRACING_V2=true`

---

## How to Run

```bash
# Backend
cd backend
pip install -r requirements.txt
cp .env.example .env   # fill in PINECONE_*, GROQ_API_KEY, optional API_KEY
uvicorn app.main:app --port 8000

# Frontend
pip install -r requirements.txt   # root (Streamlit)
streamlit run frontend/app.py

# Run tests (zero credentials needed)
pytest tests/ -v

# Evaluate retrieval (requires live Pinecone; reads only)
make eval

# Load benchmark (in-process, mocked externals)
PYTHONPATH=backend python scripts/bench_mocked.py
```

Full configuration reference: [`backend/.env.example`](../backend/.env.example)

Operational runbook (key rotation, rate-limit toggle, deployment): [`docs/CONTEXT.md`](CONTEXT.md)