# RAG Agent Workbench: Design Document

> **Audience:** Engineers and recruiters reviewing this repo.
>
> **Purpose:** Explain the *decisions* behind the system: not just what it does, but why each
> choice was made and what the real tradeoffs are.
>
> Exhaustive detail lives in [`docs/CONTEXT.md`](CONTEXT.md); this document curates the decisions
> that matter most.

---
## What this is

A production-style RAG (Retrieval-Augmented Generation) backend built as a deliberate engineering
exercise in decision-driven design. It ingests documents from Wikipedia, arXiv, and OpenAlex
into a Pinecone vector index, then answers questions over that corpus via a **7-node LangGraph
pipeline** backed by Groq (LLaMA) and optional Tavily web search.

The headline capability: agentic RAG with corrective retrieval, cosine-gated abstention,
two-layer faithfulness checking, honest token-level streaming, and per-request cost accounting,
all wired to a Streamlit chat UI and a Prometheus metrics endpoint.

Every major feature was preceded by a retrieval evaluation harness. The rule: no parameter
change without a measurement that justifies it.

**Stack:** FastAPI · LangGraph/LangChain · Pinecone (`llama-text-embed-v2`, 1024-dim, cosine) ·
Groq (LLaMA 3.1 8B) · Tavily (optional) · Streamlit · Prometheus · Docker

---
## Architecture

See the [pipeline diagram in the README](../README.md#architecture) for the full node flow.

A request to `POST /chat` passes through:

1. **FastAPI middleware:** CORS, API key auth (`X-API-Key`), slowapi rate limit (30 req/min),
   Prometheus HTTP instrumentation, in-memory TTL cache check.
2. **`run_in_threadpool`:** dispatches the LangGraph graph into a thread.
3. **LangGraph pipeline** (7 nodes, synchronous): see diagram.
4. **Response serialization:** Pydantic `ChatResponse` with grounding metadata, timings,
   token usage, and source citations.

`POST /chat/stream` splits the same work into three phases of its own (these are not the
numbered steps above). Phases 1 and 3 (the pre-generation nodes and post-generation grounding)
run in a thread pool; phase 2 (token generation) is streamed asynchronously via `llm.astream`,
which is what gives a real first-token latency improvement.

---
## Key Design Decisions

### 1. Eval-first, anti-circular-validation

The evaluation harness (`eval/`) was built before any parameter was tuned. Golden-set
`relevant_doc_ids` are determined by reading document content, never by running the retriever
and labelling its own output. Labelling the retriever's own output would make recall@k
tautological: the retriever would appear to have perfect recall because the labels were derived
from its results.

**Tradeoff:** building the harness first added upfront cost with no immediate feature output.
The payoff is that every subsequent decision (reranking, top_k, cosine floor) is backed by a
number, not intuition.
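
As a minimal sketch of what such a harness measures, the two functions below compute recall@k and
precision@k against hand-labelled `relevant_doc_ids`. The function names, the sample query, and
the document IDs are illustrative assumptions, not the contents of `eval/`. The sketch also
assumes the ranked list contains each document ID at most once.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the golden-relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = set(ranked_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by golden-relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical golden-set entry: the labels were assigned by reading the documents,
# not by inspecting retriever output.
golden = {
    "query": "What is retrieval-augmented generation?",
    "relevant_doc_ids": {"doc-a", "doc-c"},
}

# Hypothetical retriever output for that query, best match first.
ranked = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]

recall_1 = recall_at_k(ranked, golden["relevant_doc_ids"], k=1)
recall_3 = recall_at_k(ranked, golden["relevant_doc_ids"], k=3)
precision_5 = precision_at_k(ranked, golden["relevant_doc_ids"], k=5)
```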

---

### 2. Two-threshold retrieval gate

Two independently configurable cosine thresholds serve different purposes:

| Setting | Default | Purpose |
|---|---|---|
| `RAG_MIN_SCORE` | 0.25 | **Routing:** if `top_score < 0.25`, route to Tavily web fallback |
| `RAG_MIN_CHUNK_SCORE` | **0.20** | **Safety floor:** drop individual Pinecone chunks below this cosine score before they enter the LLM context |

The floor at 0.20 is a **data-derived safety bound**: the minimum cosine score of any
golden-relevant chunk across 30 evaluation queries was 0.2368. Setting the floor at 0.20
places it below this bound so no known-relevant chunk is dropped. It is not a tuned optimum;
sharp floor calibration requires chunk-level graded relevance labels.

**Tradeoff:** two thresholds with different semantics add configuration surface. Keeping
them distinct, each with its own default, avoids the silent failure mode of a single threshold
accidentally serving both routing and filtering purposes.
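
The separation can be sketched as two independent functions, one for routing and one for
filtering. Only the two default values come from the table above; the `Chunk` type, the function
names, and the route labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

RAG_MIN_SCORE = 0.25        # routing threshold (default from the table above)
RAG_MIN_CHUNK_SCORE = 0.20  # per-chunk safety floor (default from the table above)


@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity reported by the vector index


def route(chunks: List[Chunk]) -> str:
    """Routing decision: looks only at the best score in the result set."""
    top_score = max((c.score for c in chunks), default=0.0)
    return "web_fallback" if top_score < RAG_MIN_SCORE else "rag"


def apply_floor(chunks: List[Chunk]) -> List[Chunk]:
    """Filtering decision: drops individual chunks, independent of routing."""
    return [c for c in chunks if c.score >= RAG_MIN_CHUNK_SCORE]


# Hypothetical result sets.
strong = [Chunk("a", 0.31), Chunk("b", 0.22), Chunk("c", 0.18)]
weak = [Chunk("d", 0.24), Chunk("e", 0.21)]

strong_route, strong_kept = route(strong), apply_floor(strong)
weak_route, weak_kept = route(weak), apply_floor(weak)
```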

---

### 3. Reranking: evaluated and disabled

A Pinecone hosted reranker (`bge-reranker-v2-m3`) was implemented, A/B tested against the
baseline, and **disabled by default** after measurement showed it was flat or negative on every
metric:

| Metric | Baseline | Rerank | Δ |
|---|---|---|---|
| nDCG@3 | 0.875 | 0.818 | −0.057 |
| nDCG@5 | 0.900 | 0.869 | −0.031 |
| Precision@1 | 0.966 | 0.966 | 0.000 |
| Mean latency | 360 ms | 795 ms | +435 ms |

**Root cause:** the corpus (34 chunks / 23 docs) is small and well-separated enough that the
dense retriever rarely misorders the top of the list, so the reranker has no headroom to
demonstrate. `RAG_RERANK_ENABLED=False` is the empirically validated default; enable it only
after the corpus grows to the point where dense retrieval misfires on precision.

---
### 4. top_k = 5: precision-first

The quality-vs-k curve (n=30 queries) shows:

| k | Recall@k | P@k |
|---|---|---|
| 5 | 0.914 | 0.360 |
| 8 | 0.969 | 0.242 |
| 10 | 0.981 | 0.197 |

The **recall-margin knee** is k=8 (both recall and nDCG within 0.02 of the k=10 ceiling).
Despite this, `RAG_DEFAULT_TOP_K` is kept at **5**, a precision-first choice: k=5 delivers
higher-signal context (P@5=0.36 vs P@8=0.24) at the accepted cost of 5.5 recall points relative
to k=8, or 6.7 relative to the k=10 ceiling.

**Tradeoff:** recall@k cannot settle this. It measures whether relevant docs appear in the
ranked list, not whether a larger-but-noisier context improves LLM answer quality. The
tiebreaker is a head-to-head answer-quality evaluation, which does not yet exist. Until it
does, context signal quality is preferred over recall coverage.
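
The knee and the recall cost of staying at k=5 can be reproduced from the recall column alone.
This is a sketch under stated assumptions: `recall_margin_knee` is a hypothetical helper, and it
checks recall only, because per-k nDCG values are not listed in the table above.

```python
from typing import Dict

# Recall@k values copied from the table above.
RECALL_CURVE: Dict[int, float] = {5: 0.914, 8: 0.969, 10: 0.981}


def recall_margin_knee(curve: Dict[int, float], margin: float = 0.02) -> int:
    """Smallest k whose recall is within `margin` of the recall at the largest k."""
    ceiling = curve[max(curve)]
    return min(k for k, recall in curve.items() if ceiling - recall <= margin)


knee = recall_margin_knee(RECALL_CURVE)

# Recall given up by serving k=5 instead of the knee or the ceiling.
cost_vs_knee = round(RECALL_CURVE[knee] - RECALL_CURVE[5], 3)
cost_vs_ceiling = round(RECALL_CURVE[10] - RECALL_CURVE[5], 3)
```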

---

### 5. Bounded CRAG corrective loop

`corrective_retrieve` (between `retrieve_context` and `decide_next`) grades retrieval quality
by the cosine score already in state. If the score is weak, it rewrites the query with Groq and
re-queries Pinecone, up to `RAG_CRAG_MAX_ITERS=2` times (a hard, unconditional loop bound).

The bound is **non-negotiable**: without it, a query on a topic not in the knowledge base would
spin indefinitely on weak retrieval, exhausting rate limits and blocking the response.

**Disabled by default** (`RAG_CRAG_ENABLED=False`): the corpus is saturated at recall@10=0.97,
so the corrective loop fires rarely on in-corpus queries. Enable it only after observing
out-of-corpus queries where initial retrieval fails and the rewrite demonstrably helps.

**Circular-validation avoidance:** the grader uses the cosine score already in state; it does
not re-embed with the retrieval model. Re-embedding would assess the retriever's output with
the retriever's own semantic space.
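
A minimal sketch of the bounded loop, with the retriever and the rewriter injected as plain
callables so it runs without Pinecone or Groq. The signature and the choice to keep the
best-scoring attempt are assumptions made for illustration; the `0.45` and `2` defaults mirror
`RAG_CRAG_GOOD_SCORE` and `RAG_CRAG_MAX_ITERS` as documented here.

```python
from typing import Callable, List, Tuple

Retrieve = Callable[[str], Tuple[List[str], float]]  # returns (chunks, top cosine score)
Rewrite = Callable[[str], str]


def corrective_retrieve(
    query: str,
    retrieve: Retrieve,
    rewrite: Rewrite,
    good_score: float = 0.45,
    max_iters: int = 2,
) -> Tuple[List[str], float, int]:
    """Retry retrieval with rewritten queries, never more than `max_iters` times."""
    best_chunks, best_score = retrieve(query)
    rewrites = 0
    # The counter check is unconditional: a weak score alone can never keep the loop alive.
    while best_score < good_score and rewrites < max_iters:
        query = rewrite(query)
        rewrites += 1
        chunks, score = retrieve(query)
        if score > best_score:
            best_chunks, best_score = chunks, score
    return best_chunks, best_score, rewrites


# Hypothetical out-of-corpus query: every retrieval attempt stays weak.
calls: List[str] = []


def weak_retrieve(q: str) -> Tuple[List[str], float]:
    calls.append(q)
    return ["unrelated chunk"], 0.10


chunks, score, rewrites = corrective_retrieve(
    "off-topic question", weak_retrieve, lambda q: q + " (rewritten)"
)
```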

---

### 6. Two-layer faithfulness check

| Layer | When | Model calls | What it checks |
|---|---|---|---|
| `verify_citations` | Always | Zero | `[n]` citation markers that reference out-of-range chunk indices |
| `judge_faithfulness` | When `RAG_FAITHFULNESS_ENABLED=True` + not abstaining | 1 (reuses Groq client) | Whether answer claims are supported by the retrieved context |

The judge uses the **existing Groq LLM**, not the retrieval embedder. Re-embedding the answer
with the same model used for retrieval would encode the embedder's biases into the faithfulness
signal (circular validation).

**Flag default OFF:** every `/chat` request would otherwise pay for a second LLM call. On
Groq's free tier the cost is latency, not money, but it is still undesirable for interactive
use. When the flag is OFF, `grounded` and `faithfulness_score` in `ChatResponse` are `null`;
the UI renders this as "not evaluated" and never fabricates a value.
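
The zero-model-call layer can be approximated with a regular expression. This is a sketch, not
the repo's `verify_citations`: the helper name is hypothetical, and it assumes citation markers
are 1-based indices into the list of retrieved chunks.

```python
import re
from typing import List

CITATION_RE = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, num_chunks: int) -> List[int]:
    """Return cited indices that do not point at any retrieved chunk (1-based, assumed)."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


# Hypothetical answer generated from three retrieved chunks.
answer = "Paris is the capital of France [1]. It hosts the Louvre [4]."
out_of_range = invalid_citations(answer, num_chunks=3)
```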

---

### 7. Honest streaming

`/chat/stream` uses `llm.astream` for the generation phase only, the one phase where it matters
for time to first token (TTFT). Pre-generation nodes (retrieval, CRAG, web search) run
synchronously in a thread pool; making them async-native would add complexity with no
meaningful latency improvement.

**Non-streamable paths are honest:**

- Cache hit: one token event with the full cached answer, `done.cached=true`
- Abstention: one token event with the deterministic abstention text
- Neither path calls the LLM or simulates token-by-token output

The previous implementation yielded whitespace-split words from a completed string. That
misrepresented itself as streaming.
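
The three paths can be sketched with a plain async generator. Everything here is illustrative:
`fake_llm_stream` stands in for `llm.astream`, and the event dictionaries are a hypothetical
shape rather than the repo's SSE protocol. The point is that the cached and abstention paths emit
exactly one token event, while only the live path emits one event per generated piece.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Optional


async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for `llm.astream`: yields pieces as they are produced."""
    for piece in ["Stream", "ing ", "works."]:
        await asyncio.sleep(0)  # hand control back to the event loop between pieces
        yield piece


async def stream_answer(
    prompt: str,
    cached: Optional[str] = None,
    abstention: Optional[str] = None,
) -> AsyncIterator[Dict]:
    if cached is not None:
        # Cache hit: the whole answer already exists, so send it as one event.
        yield {"event": "token", "data": cached}
        yield {"event": "done", "cached": True}
        return
    if abstention is not None:
        # Abstention: deterministic text, no LLM call, no simulated chunking.
        yield {"event": "token", "data": abstention}
        yield {"event": "done", "cached": False}
        return
    async for piece in fake_llm_stream(prompt):
        yield {"event": "token", "data": piece}
    yield {"event": "done", "cached": False}


async def collect(events: AsyncIterator[Dict]) -> List[Dict]:
    return [event async for event in events]


cached_events = asyncio.run(collect(stream_answer("q", cached="Full cached answer.")))
live_events = asyncio.run(collect(stream_answer("q")))
live_text = "".join(e["data"] for e in live_events if e["event"] == "token")
```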

---

### 8. Cost and token observability

Token counts come from the **actual API response** (`response.usage_metadata`), not a local
tokenizer estimate. All four LLM call types (generation, faithfulness judge, CRAG rewrite,
history contextualization) are tracked by `call_type` in `ChatResponse.usage.by_call_type` and
emitted as a Prometheus counter (`llm_tokens_total{call_type=...}`).

Dollar cost is an **estimate** from an as-of-date pricing table (`2026-06-25`) and is labeled
as such. Embedding token counts are not reported because the Pinecone SDK does not expose them.
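
A sketch of per-call-type accounting, assuming usage arrives as a dictionary with
`input_tokens` and `output_tokens` keys. The `UsageTracker` class, the call-type strings, and
above all the prices are hypothetical placeholders, not the repo's pricing table or Groq's
actual rates.

```python
from collections import defaultdict
from typing import Dict

# Placeholder USD prices per one million tokens. NOT real provider pricing.
PRICE_PER_MILLION = {"input": 1.00, "output": 2.00}


class UsageTracker:
    """Accumulates token counts reported by the API, keyed by call type."""

    def __init__(self) -> None:
        self.by_call_type: Dict[str, Dict[str, int]] = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0}
        )

    def record(self, call_type: str, usage_metadata: Dict[str, int]) -> None:
        bucket = self.by_call_type[call_type]
        bucket["input_tokens"] += usage_metadata.get("input_tokens", 0)
        bucket["output_tokens"] += usage_metadata.get("output_tokens", 0)

    def estimated_cost_usd(self) -> float:
        total_in = sum(b["input_tokens"] for b in self.by_call_type.values())
        total_out = sum(b["output_tokens"] for b in self.by_call_type.values())
        return (
            total_in * PRICE_PER_MILLION["input"]
            + total_out * PRICE_PER_MILLION["output"]
        ) / 1_000_000


# Hypothetical request that made a generation call and one CRAG rewrite.
tracker = UsageTracker()
tracker.record("generation", {"input_tokens": 1200, "output_tokens": 300})
tracker.record("crag_rewrite", {"input_tokens": 200, "output_tokens": 40})
estimate = tracker.estimated_cost_usd()
```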

---

### 9. Reproducible corpus + pinned dimension

A corpus manifest (`eval/corpus_manifest.py generate`) snapshots vector IDs from the live
Pinecone index to `eval/corpus_manifest.json`. A validator (`corpus_manifest.py validate`)
compares the committed manifest against the live index and reports drift without auto-reconciling.
Both operations are read-only.

The embedding model (`llama-text-embed-v2`) and dimension (1024) are now explicit in `Settings`
(`PINECONE_EMBED_MODEL`, `PINECONE_EMBED_DIMENSION`) and logged at startup, removing the
implicit dependency on Pinecone's default dimension.
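
Drift detection of this kind reduces to a set comparison between committed and live vector IDs.
The function name, the report keys, and the sample IDs below are hypothetical; the sketch only
reports differences and never writes to either side.

```python
from typing import Dict, Iterable, List


def manifest_drift(committed_ids: Iterable[str], live_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Report IDs present on one side only; performs no reconciliation."""
    committed, live = set(committed_ids), set(live_ids)
    return {
        "missing_from_index": sorted(committed - live),
        "unexpected_in_index": sorted(live - committed),
    }


# Hypothetical vector IDs.
committed = ["wiki-001", "arxiv-002", "openalex-003"]
live = ["wiki-001", "openalex-003", "wiki-004"]

report = manifest_drift(committed, live)
has_drift = any(report.values())
```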

---

## Limitations & Tradeoffs

These are the real constraints. A design doc that only lists strengths reads as incomplete.

**1. Saturated eval corpus.**
The evaluation golden set covers 34 chunks / 23 documents. At this scale, baseline dense
retrieval is already at recall@10=0.97, so the metrics are ceiling-bound. Any apparent
improvement (whether from reranking, CRAG, or parameter changes) may be noise rather than
signal. No feature can be conclusively validated until the corpus is at least 10× larger.

**2. Prompt injection mitigation, not elimination.**
The RAG system prompt instructs the LLM to use only the supplied context and cite inline.
This reduces prompt injection risk but does not eliminate it: a sufficiently adversarial document
can still attempt to override instructions via embedded directives in chunk text.

**3. Same-model faithfulness judge.**
The faithfulness judge calls the same Groq LLM that generated the answer. A model grading its
own output has a self-preference bias: it may rate its own claims as grounded even when they
are not. A second independent model (e.g. a different provider) would give a less biased
verdict but at higher cost and latency.

**4. Cost is an estimate.**
`estimated_cost_usd` is computed from a static pricing table pinned to 2026-06-25. It does
not account for free-tier credits, batch pricing, or promotional rates. Treat it as an
order-of-magnitude indicator, not a billing source of truth.

**5. Reranking and hybrid search deferred, not for lack of trying.**
Reranking was implemented and A/B tested; it is disabled because the measurement showed no
improvement on this corpus size, not because the implementation is absent. Hybrid search
(sparse + dense) is documented and designed but not implemented: the recall gap it would address
(proper-noun queries) does not exist at current corpus size, where baseline recall@10=0.97.

**6. Chunk size below recommended range.**
The `RecursiveCharacterTextSplitter` is configured to ~225 tokens per chunk (900 chars ÷ ~4
chars/token). Pinecone's guidance for `llama-text-embed-v2` suggests 400–500 tokens for best
retrieval quality. The current chunks are too short to exploit the model's full context window.
Changing `chunk_size` requires re-ingestion and re-evaluation against the golden set.

**7. CRAG threshold and faithfulness threshold are placeholders.**
`RAG_CRAG_GOOD_SCORE=0.45` (the cosine threshold that triggers query rewriting) and
`RAG_FAITHFULNESS_THRESHOLD=0.5` (the faithfulness score below which `grounded=False`) are
reasonable midpoints, not values calibrated against labeled data. Both require a held-out
answer-quality evaluation to tune.
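
For reference, the two placeholder thresholds feed decisions of roughly this shape. The function
names are hypothetical, and only the two default values come from the settings named above. The
`None` branch reflects the documented behaviour that an unevaluated answer is never reported as
grounded or ungrounded.

```python
from typing import Optional

RAG_CRAG_GOOD_SCORE = 0.45         # placeholder default, not calibrated
RAG_FAITHFULNESS_THRESHOLD = 0.5   # placeholder default, not calibrated


def needs_rewrite(top_score: float, good_score: float = RAG_CRAG_GOOD_SCORE) -> bool:
    """A top cosine score below the threshold triggers a query rewrite."""
    return top_score < good_score


def grounded_flag(
    faithfulness_score: Optional[float],
    threshold: float = RAG_FAITHFULNESS_THRESHOLD,
) -> Optional[bool]:
    """Map a judge score to `grounded`; stay None when no judge ran."""
    if faithfulness_score is None:
        return None
    return faithfulness_score >= threshold


not_evaluated = grounded_flag(None)
below = grounded_flag(0.42)
above = grounded_flag(0.80)
```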

---

## Testing & Observability

**343 tests** (321 unit + 22 integration) run in CI with zero network calls, zero credentials.

| Layer | What it tests |
|---|---|
| Unit (321) | Pure functions: metrics, chunking, normalization, dedup, prompt builders, retrieval gating, faithfulness, CRAG, streaming, Prometheus, cost accounting |
| Integration (22) | Real FastAPI app via `TestClient`: HTTP routing, auth dependency, LangGraph pipeline, SSE protocol, abstention path, faithfulness wiring; externals mocked at boundaries |

CI runs from the fully-pinned `backend/requirements.txt` lock (compiled with `uv pip compile`,
constrained to tested versions), so every CI run is a clean-environment reproducibility check.

Observability:

- **`/metrics`** (JSON, auth-gated): request counts, error counts, 20-sample timing ring buffer
- **`/metrics/prometheus`** (Prometheus text, public): `http_requests_total` (Counter),
  `http_request_duration_seconds` (Histogram), `rag_phase_duration_seconds` (Histogram),
  `llm_tokens_total` (Counter by `call_type`)
- **LangSmith:** optional trace collection via `LANGCHAIN_TRACING_V2=true`
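
A 20-sample timing ring buffer like the one behind `/metrics` can be sketched with
`collections.deque`. The class name and the snapshot fields are hypothetical; only the buffer
size of 20 comes from the list above.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, List, Optional, Union


class TimingBuffer:
    """Keeps only the most recent `size` request timings."""

    def __init__(self, size: int = 20) -> None:
        # A deque with maxlen silently discards the oldest sample when full.
        self._samples: Deque[float] = deque(maxlen=size)

    def record(self, seconds: float) -> None:
        self._samples.append(seconds)

    def snapshot(self) -> Dict[str, Union[int, Optional[float], List[float]]]:
        samples = list(self._samples)
        return {
            "count": len(samples),
            "mean_seconds": mean(samples) if samples else None,
            "samples": samples,
        }


# Hypothetical timings: 25 requests taking 0.0 s, 1.0 s, ... 24.0 s.
buffer = TimingBuffer()
for i in range(25):
    buffer.record(float(i))

snap = buffer.snapshot()
```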

---

## How to Run

```bash
# Backend
cd backend
pip install -r requirements.txt
cp .env.example .env   # fill in PINECONE_*, GROQ_API_KEY, optional API_KEY
uvicorn app.main:app --port 8000

# Frontend
pip install -r requirements.txt   # root (Streamlit)
streamlit run frontend/app.py

# Run tests (zero credentials needed)
pytest tests/ -v

# Evaluate retrieval (requires live Pinecone; reads only)
make eval

# Load benchmark (in-process, mocked externals)
PYTHONPATH=backend python scripts/bench_mocked.py
```

Full configuration reference: [`backend/.env.example`](../backend/.env.example)

Operational runbook (key rotation, rate-limit toggle, deployment): [`docs/CONTEXT.md`](CONTEXT.md)