# RAG Agent Workbench: Design Document
> **Audience:** Engineers and recruiters reviewing this repo.
> **Purpose:** Explain the *decisions* behind the system: not just what it does, but why each
> choice was made and what the real tradeoffs are.
> Exhaustive detail lives in [`docs/CONTEXT.md`](CONTEXT.md); this document curates the decisions
> that matter most.
---
## What this is
A production-style RAG (Retrieval-Augmented Generation) backend built as a deliberate engineering
exercise in decision-driven design. It ingests documents from Wikipedia, arXiv, and OpenAlex
into a Pinecone vector index, then answers questions over that corpus via a **7-node LangGraph
pipeline** backed by Groq (LLaMA) and optional Tavily web search.
The headline capability: agentic RAG with corrective retrieval, cosine-gated abstention,
two-layer faithfulness checking, honest token-level streaming, and per-request cost accounting,
all wired to a Streamlit chat UI and a Prometheus metrics endpoint.
Every major feature was preceded by a retrieval evaluation harness. The rule: no parameter
change without a measurement that justifies it.
**Stack:** FastAPI · LangGraph/LangChain · Pinecone (`llama-text-embed-v2`, 1024-dim, cosine) ·
Groq (LLaMA 3.1 8B) · Tavily (optional) · Streamlit · Prometheus · Docker
---
## Architecture
See the [pipeline diagram in the README](../README.md#architecture) for the full node flow.
A request to `POST /chat` passes through:
1. **FastAPI middleware**: CORS, API key auth (`X-API-Key`), slowapi rate limit (30 req/min),
Prometheus HTTP instrumentation, in-memory TTL cache check.
2. **`run_in_threadpool`**: dispatches the LangGraph graph into a thread.
3. **LangGraph pipeline** (7 nodes, synchronous): see diagram.
4. **Response serialization**: Pydantic `ChatResponse` with grounding metadata, timings,
token usage, and source citations.
`POST /chat/stream` splits the same pipeline into three phases. Phase 1 (pre-generation nodes)
and phase 3 (post-generation grounding) run in a thread pool; phase 2 (token generation) is
streamed asynchronously via `llm.astream`, which is what yields a real first-token latency
improvement.
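The three-phase split can be sketched with the standard library alone. This is a minimal illustration, not the repo's code: `asyncio.to_thread` stands in for the thread-pool dispatch, `fake_astream` stands in for `llm.astream`, and the function names, event shapes, and stubbed return values are all assumptions.

```python
import asyncio
from typing import AsyncIterator


def pre_generation(question: str) -> dict:
    # Phase 1 (blocking): retrieval and routing, stubbed with fixed context.
    return {"question": question, "context": ["chunk A", "chunk B"]}


async def fake_astream(state: dict) -> AsyncIterator[str]:
    # Phase 2 stand-in for llm.astream: yields tokens one at a time.
    for token in ["Answer", " based", " on", " context"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


def post_generation(state: dict, answer: str) -> dict:
    # Phase 3 (blocking): grounding checks on the complete answer, stubbed.
    return {"answer": answer, "sources": len(state["context"])}


async def stream_chat(question: str) -> AsyncIterator[dict]:
    state = await asyncio.to_thread(pre_generation, question)
    parts = []
    async for token in fake_astream(state):
        parts.append(token)
        yield {"event": "token", "data": token}
    summary = await asyncio.to_thread(post_generation, state, "".join(parts))
    yield {"event": "done", "data": summary}


async def collect(question: str) -> list[dict]:
    return [event async for event in stream_chat(question)]


events = asyncio.run(collect("What is RAG?"))
```

The client sees the first `token` event as soon as phase 1 finishes and the model emits its first token; the grounding metadata arrives only in the final `done` event.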
---
## Key Design Decisions
### 1. Eval-first, anti-circular-validation
The evaluation harness (`eval/`) was built before any parameter was tuned. Golden-set
`relevant_doc_ids` are determined by reading document content, never by running the retriever
and labelling its own output. Labelling the retriever's output would make recall@k tautological:
the retriever would appear to have perfect recall because the labels were derived from what it
returned.
**Tradeoff:** building the harness first added upfront cost with no immediate feature output.
The payoff is that every subsequent decision (reranking, top_k, cosine floor) is backed by a
number, not intuition.
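As an illustration of how hand-written labels feed the metric, here is a minimal recall@k sketch. The golden-set entry, document IDs, and function name are hypothetical; the harness in `eval/` may structure its data differently.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of hand-labelled relevant docs that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Hypothetical golden-set entry: labels written by a person reading the documents.
golden = {
    "query": "What is corrective RAG?",
    "relevant_doc_ids": ["doc-crag", "doc-rag-survey"],
}

# Hypothetical retriever output, produced only after the labels were fixed.
ranked = ["doc-crag", "doc-bm25", "doc-rag-survey", "doc-llama"]

recall_2 = recall_at_k(ranked, golden["relevant_doc_ids"], k=2)  # one of two found
recall_3 = recall_at_k(ranked, golden["relevant_doc_ids"], k=3)  # both found
```

Because the labels exist independently of `ranked`, a retriever that misses a relevant document is penalized rather than silently redefining what counts as relevant.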
---
### 2. Two-threshold retrieval gate
Two independently configurable cosine thresholds serve different purposes:
| Setting | Default | Purpose |
|---|---|---|
| `RAG_MIN_SCORE` | 0.25 | **Routing:** if `top_score < 0.25`, route to Tavily web fallback |
| `RAG_MIN_CHUNK_SCORE` | **0.20** | **Safety floor:** drop individual Pinecone chunks below this cosine score before they enter the LLM context |
The floor at 0.20 is a **data-derived safety bound**: the minimum cosine score of any
golden-relevant chunk across 30 evaluation queries was 0.2368. Setting the floor at 0.20
places it below this bound so no known-relevant chunk is dropped. It is not a tuned optimum:
sharp floor calibration requires chunk-level graded relevance labels.
**Tradeoff:** two thresholds with different semantics create configuration surface. Keeping
them distinct (even at different defaults) avoids the silent failure mode of a single threshold
accidentally serving both routing and filtering purposes.
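A minimal sketch of how the two thresholds act at different granularities, using the defaults from the table. The `Chunk` type, the `gate` function, and the route names are illustrative assumptions, as is the choice to return no chunks on the fallback route.

```python
from dataclasses import dataclass

RAG_MIN_SCORE = 0.25        # routing threshold, applied to the best score only
RAG_MIN_CHUNK_SCORE = 0.20  # safety floor, applied to every chunk


@dataclass
class Chunk:
    id: str
    score: float  # cosine similarity reported by the vector index


def gate(chunks: list[Chunk]) -> tuple[str, list[Chunk]]:
    """Return a (route, chunks_for_context) pair."""
    top_score = max((c.score for c in chunks), default=0.0)
    if top_score < RAG_MIN_SCORE:
        # Retrieval as a whole is too weak: hand off to web search.
        return "web_fallback", []
    # Retrieval is usable: drop only the individual chunks below the floor.
    kept = [c for c in chunks if c.score >= RAG_MIN_CHUNK_SCORE]
    return "generate", kept


route, kept = gate([Chunk("a", 0.41), Chunk("b", 0.22), Chunk("c", 0.12)])
weak_route, weak_kept = gate([Chunk("d", 0.24), Chunk("e", 0.21)])
```

In the second call both chunks clear the 0.20 floor, yet the request still routes to the fallback because the best score is under 0.25. That is the case a single shared threshold cannot express.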
---
### 3. Reranking: evaluated and disabled
A Pinecone hosted reranker (`bge-reranker-v2-m3`) was implemented, A/B tested against the
baseline, and **disabled by default** after measurement showed it was flat or worse on every
metric:
| Metric | Baseline | Rerank | Δ |
|---|---|---|---|
| nDCG@3 | 0.875 | 0.818 | −0.057 |
| nDCG@5 | 0.900 | 0.869 | −0.031 |
| Precision@1 | 0.966 | 0.966 | 0.000 |
| Mean latency | 360 ms | 795 ms | +435 ms |
**Root cause:** the corpus (34 chunks / 23 docs) is too small and well-separated for the
dense retriever to miscalibrate top-of-list order. The reranker cannot demonstrate headroom
it never had. `RAG_RERANK_ENABLED=False` is the empirically validated default; enable it only
after the corpus grows to where dense retrieval misfires on precision.
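For readers unfamiliar with the metric in the table, nDCG@k can be computed as below. This sketch assumes binary relevance (a document is either in the golden set or not); the harness's actual gain definition is not stated here, and the function names are illustrative.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    # Each gain is discounted by log2(rank + 1), with ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: list[str], relevant_ids: list[str], k: int) -> float:
    relevant = set(relevant_ids)
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places every relevant document first.
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0


perfect = ndcg_at_k(["a", "x", "y"], ["a"], k=3)   # relevant doc at rank 1
demoted = ndcg_at_k(["x", "a", "y"], ["a"], k=3)   # relevant doc at rank 2
```

A reranker only helps nDCG when it moves relevant documents upward. If the baseline already places them at the top, any reordering can only leave the score unchanged or lower it, which matches the flat-or-negative deltas above.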
---
### 4. top_k = 5: precision-first
The quality-vs-k curve (n=30 queries) shows:
| k | Recall@k | P@k |
|---|---|---|
| 5 | 0.914 | 0.360 |
| 8 | 0.969 | 0.242 |
| 10 | 0.981 | 0.197 |
The **recall-margin knee** is k=8 (both recall and nDCG within 0.02 of the k=10 ceiling).
Despite this, `RAG_DEFAULT_TOP_K` is kept at **5**, a precision-first choice: k=5 delivers
higher-signal context (P@5=0.36 vs P@8=0.24) at the accepted cost of 5.5 recall points against
k=8 (6.7 against the k=10 ceiling).
**Tradeoff:** recall@k cannot settle this: it measures whether relevant docs appear in the
ranked list, not whether a larger-but-noisier context improves LLM answer quality. The
tiebreaker is a head-to-head answer-quality evaluation, which does not yet exist. Until it
does, context signal quality is preferred over recall coverage.
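One way to read the table is to convert P@k into the average number of relevant and non-relevant items per context. The arithmetic below uses only the figures from the table and assumes P@k is measured over the same items that enter the LLM context.

```python
# (k, P@k) pairs copied from the quality-vs-k table.
curve = {5: 0.360, 8: 0.242, 10: 0.197}


def context_mix(k: int, precision: float) -> tuple[float, float]:
    """Average (relevant, non_relevant) item counts in a top-k context."""
    relevant = k * precision
    return relevant, k - relevant


mix = {k: context_mix(k, p) for k, p in curve.items()}

# Marginal effect of raising k from 5 to 8.
extra_relevant = mix[8][0] - mix[5][0]
extra_noise = mix[8][1] - mix[5][1]
```

Under that assumption, moving from k=5 to k=8 adds roughly 0.14 relevant items per query on average while adding roughly 2.9 non-relevant ones, which is the noise the precision-first default avoids.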
---
### 5. Bounded CRAG corrective loop
`corrective_retrieve` (between `retrieve_context` and `decide_next`) grades retrieval quality
by the cosine score already in state. If weak, it rewrites the query with Groq and re-queries
Pinecone, up to `RAG_CRAG_MAX_ITERS=2` times (a hard, unconditional loop bound).
The bound is **non-negotiable**: without it, a query on a topic not in the knowledge base would
spin indefinitely on weak retrieval, exhausting rate limits and blocking the response.
**Disabled by default** (`RAG_CRAG_ENABLED=False`): the corpus is saturated at recall@10=0.97,
so the corrective loop fires rarely on in-corpus queries. Enable it only after observing
out-of-corpus queries where initial retrieval fails and the rewrite demonstrably helps.
**Circular-validation avoidance:** the grader uses the cosine score already in state; it does
not re-embed with the retrieval model. Re-embedding would assess the retriever's output with
the retriever's own semantic space.
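The bounded loop can be sketched as follows. The `retrieve` and `rewrite` callables are stand-ins for the Pinecone query and the Groq rewrite, and the function signature is an assumption; only the two setting names and their values come from this document.

```python
from typing import Callable

RAG_CRAG_MAX_ITERS = 2      # hard bound on rewrite attempts
RAG_CRAG_GOOD_SCORE = 0.45  # top cosine score at or above this counts as good


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], tuple[list[str], float]],
    rewrite: Callable[[str], str],
) -> tuple[list[str], float, int]:
    """Return (chunks, top_score, rewrites_used)."""
    chunks, top_score = retrieve(query)
    rewrites = 0
    # A for-loop over a fixed range cannot run more than MAX_ITERS times,
    # no matter how weak retrieval stays.
    for _ in range(RAG_CRAG_MAX_ITERS):
        if top_score >= RAG_CRAG_GOOD_SCORE:
            break
        query = rewrite(query)
        chunks, top_score = retrieve(query)
        rewrites += 1
    return chunks, top_score, rewrites


# Worst case: a topic that is not in the knowledge base never scores well.
calls: list[str] = []


def always_weak(q: str) -> tuple[list[str], float]:
    calls.append(q)
    return [], 0.10


_, final_score, rewrites_used = corrective_retrieve(
    "unknown topic", always_weak, lambda q: q + " (rewritten)"
)
```

Even in the worst case the index is queried at most `1 + RAG_CRAG_MAX_ITERS` times, and the grading step reads the score that retrieval already produced rather than making another model call.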
---
### 6. Two-layer faithfulness check
| Layer | When | Model calls | What it checks |
|---|---|---|---|
| `verify_citations` | Always | Zero | `[n]` citation markers that reference out-of-range chunk indices |
| `judge_faithfulness` | When `RAG_FAITHFULNESS_ENABLED=True` + not abstaining | 1 (reuses Groq client) | Whether answer claims are supported by the retrieved context |
The judge uses the **existing Groq LLM** β not the retrieval embedder. Re-embedding the answer
with the same model used for retrieval would encode the embedder's biases into the faithfulness
signal (circular validation).
**Flag default OFF:** every `/chat` request would otherwise pay for a second LLM call. On
Groq's free tier the cost is latency, not money, but it is still undesirable for interactive
use. When the flag is OFF, `grounded` and `faithfulness_score` in `ChatResponse` are `null`;
the UI renders this as "not evaluated" and never fabricates a value.
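The zero-model-call layer amounts to a regular-expression check. The sketch below assumes citations are 1-based (`[1]` refers to the first retrieved chunk) and that the function returns the offending indices; the real `verify_citations` node may differ on both points.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def verify_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited indices that do not correspond to any retrieved chunk."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


# Three chunks were retrieved, so [4] cannot refer to anything.
invalid = verify_citations("RAG grounds answers in sources [1][4].", num_chunks=3)
clean = verify_citations("Dense retrieval uses embeddings [2].", num_chunks=3)
```

This catches fabricated citation markers cheaply. It says nothing about whether the cited chunk supports the claim, which is what the optional judge layer is for.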
---
### 7. Honest streaming
`/chat/stream` uses `llm.astream` for the generation phase only, the one phase where streaming
affects time to first token (TTFT). Pre-generation nodes (retrieval, CRAG, web search) run
synchronously in a thread pool; making them async-native would add complexity with no meaningful
latency improvement.
**Non-streamable paths are honest:**
- Cache hit: one token event with the full cached answer, `done.cached=true`
- Abstention: one token event with the deterministic abstention text
- Neither path calls the LLM or simulates token-by-token output
The previous implementation yielded whitespace-split words from a completed string. That
misrepresented itself as streaming.
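A sketch of what the non-streamable paths emit, framed as Server-Sent Events. The `token` and `done` event names follow the description above; the payload field names and helper functions are assumptions.

```python
import json
from typing import Iterator


def sse(event: str, data: dict) -> str:
    # One SSE frame: an event name, a JSON data line, and a blank-line terminator.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def non_streamed_events(text: str, cached: bool) -> Iterator[str]:
    """Emit a complete answer as a single token event, then a done event."""
    yield sse("token", {"text": text})
    yield sse("done", {"cached": cached})


frames = list(non_streamed_events("Paris is the capital of France [1].", cached=True))
```

The whole answer travels in one frame, so the client can tell from the event count alone that nothing was generated incrementally.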
---
### 8. Cost and token observability
Token counts come from the **actual API response** (`response.usage_metadata`), not a local
tokenizer estimate. All four LLM call types (generation, faithfulness judge, CRAG rewrite,
history contextualization) are tracked by `call_type` in `ChatResponse.usage.by_call_type` and
emitted as a Prometheus counter (`llm_tokens_total{call_type=...}`).
Dollar cost is an **estimate** from an as-of-date pricing table (`2026-06-25`) and is labeled
as such. Embedding token counts are not reported because the Pinecone SDK does not expose them.
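A minimal sketch of per-call-type accounting. The `input_tokens` and `output_tokens` keys match LangChain's `usage_metadata` shape; the `UsageTracker` class is hypothetical, and the prices are placeholder numbers chosen for easy arithmetic, not Groq's actual rates.

```python
from collections import defaultdict

PRICING_AS_OF = "2026-06-25"
# Placeholder prices in USD per million tokens; NOT real provider rates.
USD_PER_MILLION = {"input": 0.10, "output": 0.20}


class UsageTracker:
    def __init__(self) -> None:
        self.by_call_type: dict[str, dict[str, int]] = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0}
        )

    def record(self, call_type: str, usage_metadata: dict) -> None:
        # Counts come from the API response, never from a local tokenizer.
        bucket = self.by_call_type[call_type]
        bucket["input_tokens"] += usage_metadata.get("input_tokens", 0)
        bucket["output_tokens"] += usage_metadata.get("output_tokens", 0)

    def estimated_cost_usd(self) -> float:
        total_in = sum(b["input_tokens"] for b in self.by_call_type.values())
        total_out = sum(b["output_tokens"] for b in self.by_call_type.values())
        return (
            total_in * USD_PER_MILLION["input"] + total_out * USD_PER_MILLION["output"]
        ) / 1_000_000


tracker = UsageTracker()
tracker.record("generation", {"input_tokens": 1000, "output_tokens": 200})
tracker.record("faithfulness_judge", {"input_tokens": 500, "output_tokens": 10})
cost = tracker.estimated_cost_usd()
```

Keeping the buckets keyed by call type is what makes it possible to see, for example, how much of a request's token budget the faithfulness judge consumes.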
---
### 9. Reproducible corpus + pinned dimension
A corpus manifest (`eval/corpus_manifest.py generate`) snapshots vector IDs from the live
Pinecone index to `eval/corpus_manifest.json`. A validator (`corpus_manifest.py validate`)
compares the committed manifest against the live index and reports drift without auto-reconciling.
Both operations are read-only.
The embedding model (`llama-text-embed-v2`) and dimension (1024) are now explicit in `Settings`
(`PINECONE_EMBED_MODEL`, `PINECONE_EMBED_DIMENSION`) and logged at startup, removing the
implicit dependency on Pinecone's default dimension.
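Drift detection reduces to a set comparison between committed and live vector IDs. The function name, report keys, and IDs below are illustrative; the actual validator's output format is not specified here.

```python
def diff_manifest(manifest_ids: list[str], live_ids: list[str]) -> dict:
    """Compare committed vector IDs with the live index and report drift only."""
    manifest, live = set(manifest_ids), set(live_ids)
    return {
        "missing_from_index": sorted(manifest - live),
        "unexpected_in_index": sorted(live - manifest),
        "in_sync": manifest == live,
    }


# Hypothetical IDs: one vector was deleted from the index and one was added.
report = diff_manifest(
    manifest_ids=["wiki-1#0", "wiki-1#1", "arxiv-7#0"],
    live_ids=["wiki-1#0", "arxiv-7#0", "openalex-3#0"],
)
```

The function only reports. Deciding whether to re-ingest or regenerate the manifest stays a manual step, consistent with the no-auto-reconcile rule.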
---
## Limitations & Tradeoffs
These are the real constraints. A design doc that only lists strengths reads as incomplete.
**1. Saturated eval corpus.**
The evaluation golden set covers 34 chunks / 23 documents. At this scale, baseline dense
retrieval is already at recall@10=0.97, so the metrics are ceiling-bound. Any apparent
improvement (whether from reranking, CRAG, or parameter changes) may be noise rather than
signal. No feature can be conclusively validated until the corpus is at least 10× larger.
**2. Prompt injection mitigation, not elimination.**
The RAG system prompt instructs the LLM to use only the supplied context and cite inline.
This reduces prompt injection risk but does not eliminate it: a sufficiently adversarial document
can still attempt to override instructions via embedded directives in chunk text.
**3. Same-model faithfulness judge.**
The faithfulness judge calls the same Groq LLM that generated the answer. A model grading its
own output has a self-preference bias β it may rate its own claims as grounded even when they
are not. A second independent model (e.g. a different provider) would give a less biased
verdict but at higher cost and latency.
**4. Cost is an estimate.**
`estimated_cost_usd` is computed from a static pricing table pinned to 2026-06-25. It does
not account for free-tier credits, batch pricing, or promotional rates. Treat it as an
order-of-magnitude indicator, not a billing source of truth.
**5. Reranking and hybrid search deferred, not for lack of trying.**
Reranking was implemented and A/B tested; it is disabled because the measurement showed no
improvement on this corpus size, not because the implementation is absent. Hybrid search
(sparse + dense) is documented and designed but not implemented: the recall gap it would address
(proper-noun queries) does not exist at current corpus size, where baseline recall@10=0.97.
**6. Chunk size below recommended range.**
The `RecursiveCharacterTextSplitter` is configured for roughly 225 tokens per chunk (900 chars ÷
~4 chars/token). Pinecone's guidance for `llama-text-embed-v2` suggests 400–500 tokens for best
retrieval quality. The current chunks are too short to exploit the model's full context window.
Changing `chunk_size` requires re-ingestion and re-evaluation against the golden set.
**7. CRAG threshold and faithfulness threshold are placeholders.**
`RAG_CRAG_GOOD_SCORE=0.45` (the cosine threshold that triggers query rewriting) and
`RAG_FAITHFULNESS_THRESHOLD=0.5` (the faithfulness score below which `grounded=False`) are
reasonable midpoints, not values calibrated against labeled data. Both require a held-out
answer-quality evaluation to tune.
---
## Testing & Observability
**343 tests** (321 unit + 22 integration) run in CI with zero network calls, zero credentials.
| Layer | What it tests |
|---|---|
| Unit (321) | Pure functions: metrics, chunking, normalization, dedup, prompt builders, retrieval gating, faithfulness, CRAG, streaming, Prometheus, cost accounting |
| Integration (22) | Real FastAPI app via `TestClient`: HTTP routing, auth dependency, LangGraph pipeline, SSE protocol, abstention path, faithfulness wiring; externals mocked at boundaries |
CI runs from the fully pinned `backend/requirements.txt` lock (compiled with `uv pip compile`,
constrained to tested versions), so every CI run is a clean-environment reproducibility check.
Observability:
- **`/metrics`** (JSON, auth-gated): request counts, error counts, 20-sample timing ring buffer
- **`/metrics/prometheus`** (Prometheus text, public): `http_requests_total` (Counter),
`http_request_duration_seconds` (Histogram), `rag_phase_duration_seconds` (Histogram),
`llm_tokens_total` (Counter by `call_type`)
- **LangSmith**: optional trace collection via `LANGCHAIN_TRACING_V2=true`
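The 20-sample timing ring buffer behind `/metrics` can be approximated with a bounded deque. The `TimingBuffer` class and its snapshot fields are assumptions made for illustration.

```python
from collections import deque


class TimingBuffer:
    """Keeps only the most recent `size` timing samples."""

    def __init__(self, size: int = 20) -> None:
        # A deque with maxlen discards the oldest sample on overflow.
        self._samples: deque = deque(maxlen=size)

    def add(self, seconds: float) -> None:
        self._samples.append(seconds)

    def snapshot(self) -> dict:
        samples = list(self._samples)
        return {
            "count": len(samples),
            "mean_s": sum(samples) / len(samples) if samples else None,
            "max_s": max(samples) if samples else None,
        }


buffer = TimingBuffer()
for i in range(25):
    buffer.add(float(i))  # samples 0.0 .. 24.0; the first five are evicted
stats = buffer.snapshot()
```

Memory stays constant regardless of traffic, at the cost of only reflecting recent requests; long-horizon latency distributions are the Prometheus histograms' job.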
---
## How to Run
```bash
# Backend
cd backend
pip install -r requirements.txt
cp .env.example .env # fill in PINECONE_*, GROQ_API_KEY, optional API_KEY
uvicorn app.main:app --port 8000
# Frontend (run from the repo root, in a second terminal)
pip install -r requirements.txt # root (Streamlit)
streamlit run frontend/app.py
# Run tests (zero credentials needed)
pytest tests/ -v
# Evaluate retrieval (requires live Pinecone; reads only)
make eval
# Load benchmark (in-process, mocked externals)
PYTHONPATH=backend python scripts/bench_mocked.py
```
Full configuration reference: [`backend/.env.example`](../backend/.env.example)
Operational runbook (key rotation, rate-limit toggle, deployment): [`docs/CONTEXT.md`](CONTEXT.md)