# RAG Agent Workbench: Design Document

> **Audience:** Engineers and recruiters reviewing this repo.  
> **Purpose:** Explain the *decisions* behind the system: not just what it does, but why each
> choice was made and what the real tradeoffs are.  
> Exhaustive detail lives in [`docs/CONTEXT.md`](CONTEXT.md); this document curates the decisions
> that matter most.

---

## What this is

A production-style RAG (Retrieval-Augmented Generation) backend built as a deliberate engineering
exercise in decision-driven design.  It ingests documents from Wikipedia, arXiv, and OpenAlex
into a Pinecone vector index, then answers questions over that corpus via a **7-node LangGraph
pipeline** backed by Groq (LLaMA) and optional Tavily web search.

The headline capability: agentic RAG with corrective retrieval, cosine-gated abstention,
two-layer faithfulness checking, honest token-level streaming, and per-request cost accounting,
all wired to a Streamlit chat UI and a Prometheus metrics endpoint.

Every major feature was preceded by a retrieval evaluation harness.  The rule: no parameter
change without a measurement that justifies it.

**Stack:** FastAPI · LangGraph/LangChain · Pinecone (`llama-text-embed-v2`, 1024-dim, cosine) ·
Groq (LLaMA 3.1 8B) · Tavily (optional) · Streamlit · Prometheus · Docker

---

## Architecture

See the [pipeline diagram in the README](../README.md#architecture) for the full node flow.

A request to `POST /chat` passes through:

1. **FastAPI middleware**: CORS, API key auth (`X-API-Key`), slowapi rate limit (30 req/min),
   Prometheus HTTP instrumentation, in-memory TTL cache check.
2. **`run_in_threadpool`**: dispatches the LangGraph graph into a thread.
3. **LangGraph pipeline** (7 nodes, synchronous): see diagram.
4. **Response serialization**: Pydantic `ChatResponse` with grounding metadata, timings,
   token usage, and source citations.

`POST /chat/stream` splits the pipeline itself into three phases.  Phases 1 and 3 (pre-generation
nodes and post-generation grounding) run in a thread pool; phase 2 (token generation) is streamed
asynchronously via `llm.astream`, which is what yields a real first-token latency improvement.

---

## Key Design Decisions

### 1. Eval-first, anti-circular-validation

The evaluation harness (`eval/`) was built before any parameter was tuned.  Golden-set
`relevant_doc_ids` are determined by reading document content, never by running the retriever
and labelling its own output.  Labelling retriever output would make recall@k tautological (the
retriever would appear to have perfect recall because the labels were derived from its output).

**Tradeoff:** building the harness first added upfront cost with no immediate feature output.
The payoff is that every subsequent decision (reranking, top_k, cosine floor) is backed by a
number, not intuition.
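
As a rough illustration of what the harness measures, the sketch below computes recall@k and precision@k against hand-labelled IDs. The function names and the sample IDs are assumptions made for this example and are not taken from `eval/`.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of hand-labelled relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by hand-labelled relevant docs."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical data: labels come from reading documents, not from retriever output.
golden = {"doc-a", "doc-c"}
ranked = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
```

Because `golden` is fixed before the retriever runs, a poor ranking can actually lower the score, which is the property circular labelling destroys.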

---

### 2. Two-threshold retrieval gate

Two independently configurable cosine thresholds serve different purposes:

| Setting | Default | Purpose |
|---|---|---|
| `RAG_MIN_SCORE` | 0.25 | **Routing:** if `top_score < 0.25`, route to Tavily web fallback |
| `RAG_MIN_CHUNK_SCORE` | **0.20** | **Safety floor:** drop individual Pinecone chunks below this cosine score before they enter the LLM context |

The floor at 0.20 is a **data-derived safety bound**: the minimum cosine score of any
golden-relevant chunk across 30 evaluation queries was 0.2368.  Setting the floor at 0.20
places it below this bound so no known-relevant chunk is dropped.  It is not a tuned optimum:
sharp floor calibration requires chunk-level graded relevance labels.

**Tradeoff:** two thresholds with different semantics create configuration surface.  Keeping
them distinct (even at different defaults) avoids the silent failure mode of a single threshold
accidentally serving both routing and filtering purposes.
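
The separation between routing and filtering might look roughly like the sketch below. The `Chunk` type, the `gate` helper, the route labels, and the choice to pass no chunks on the fallback route are all assumptions for illustration; only the two default values come from the table above.

```python
from dataclasses import dataclass
from typing import List, Tuple

RAG_MIN_SCORE = 0.25        # routing threshold (default from the table above)
RAG_MIN_CHUNK_SCORE = 0.20  # per-chunk safety floor (default from the table above)


@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity reported by the vector index


def gate(chunks: List[Chunk]) -> Tuple[str, List[Chunk]]:
    """Return (route, kept_chunks). Hypothetical helper, not the repo's function."""
    top_score = max((c.score for c in chunks), default=0.0)
    # Routing decision: looks only at the best score.
    if top_score < RAG_MIN_SCORE:
        return "web_fallback", []
    # Filtering decision: applied per chunk, independently of routing.
    kept = [c for c in chunks if c.score >= RAG_MIN_CHUNK_SCORE]
    return "generate", kept
```

With these defaults, a chunk scoring 0.22 survives the floor as long as some other chunk clears the 0.25 routing threshold.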

---

### 3. Reranking: evaluated and disabled

A Pinecone hosted reranker (`bge-reranker-v2-m3`) was implemented, A/B tested against the
baseline, and **disabled by default** after measurement showed it was flat-or-negative on every
metric:

| Metric | Baseline | Rerank | Δ |
|---|---|---|---|
| nDCG@3 | 0.875 | 0.818 | −0.057 |
| nDCG@5 | 0.900 | 0.869 | −0.031 |
| Precision@1 | 0.966 | 0.966 | 0.000 |
| Mean latency | 360 ms | 795 ms | +435 ms |

**Root cause:** the corpus (34 chunks / 23 docs) is too small and well-separated for the
dense retriever to miscalibrate top-of-list order.  The reranker cannot demonstrate headroom
it never had.  `RAG_RERANK_ENABLED=False` is the empirically validated default; enable it only
after the corpus grows to where dense retrieval misfires on precision.
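
For readers unfamiliar with the metric, here is a minimal nDCG@k sketch. It assumes binary relevance (a document is either in the golden set or not); the repo's harness may use a different gain scheme, and the function names are illustrative.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: rank i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of an ideal ranking."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(k, len(relevant_ids))
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A reranker can only raise this score by moving relevant documents upward, so when the baseline ordering is already near-ideal there is little left to gain.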

---

### 4. top_k = 5: precision-first

The quality-vs-k curve (n=30 queries) shows:

| k | Recall@k | P@k |
|---|---|---|
| 5 | 0.914 | 0.360 |
| 8 | 0.969 | 0.242 |
| 10 | 0.981 | 0.197 |

The **recall-margin knee** is k=8 (both recall and nDCG within 0.02 of the k=10 ceiling).
Despite this, `RAG_DEFAULT_TOP_K` is kept at **5**, a precision-first choice: k=5 delivers
higher-signal context (P@5=0.36 vs P@8=0.24) at the accepted cost of 5.5 recall points relative
to k=8 and 6.7 relative to k=10.

**Tradeoff:** recall@k cannot settle this: it measures whether relevant docs appear in the
ranked list, not whether a larger-but-noisier context improves LLM answer quality.  The
tiebreaker is a head-to-head answer-quality evaluation, which does not yet exist.  Until it
does, context signal quality is preferred over recall coverage.
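
One way to read the table is in terms of what actually lands in the prompt. Since precision@k is the relevant fraction of the top k, `k * P@k` is the mean number of relevant chunks per context and the remainder is noise. The helper below is an illustrative calculation over the table's figures, not code from the repo.

```python
from typing import Tuple


def context_profile(k: int, precision_at_k: float) -> Tuple[float, float]:
    """Return (mean relevant chunks, mean non-relevant chunks) for a top-k context."""
    relevant = k * precision_at_k
    return round(relevant, 2), round(k - relevant, 2)


# (k, P@k) pairs copied from the quality-vs-k table above.
table = [(5, 0.360), (8, 0.242), (10, 0.197)]
profiles = {k: context_profile(k, p) for k, p in table}
```

Under this reading, going from k=5 to k=8 adds roughly 0.14 relevant chunks per query on average while adding nearly three non-relevant ones.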

---

### 5. Bounded CRAG corrective loop

`corrective_retrieve` (between `retrieve_context` and `decide_next`) grades retrieval quality
by the cosine score already in state.  If weak, it rewrites the query with Groq and re-queries
Pinecone, up to `RAG_CRAG_MAX_ITERS=2` times (a hard, unconditional loop bound).

The bound is **non-negotiable**: without it, a query on a topic not in the knowledge base would
spin indefinitely on weak retrieval, exhausting rate limits and blocking the response.

**Disabled by default** (`RAG_CRAG_ENABLED=False`): the corpus is saturated at recall@10=0.97,
so the corrective loop fires rarely on in-corpus queries.  Enable it only after observing
out-of-corpus queries where initial retrieval fails and the rewrite demonstrably helps.

**Circular-validation avoidance:** the grader uses the cosine score already in state; it does
not re-embed with the retrieval model.  Re-embedding would assess the retriever's output with
the retriever's own semantic space.
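
A minimal sketch of the bounded loop follows. The `retrieve` and `rewrite` callables are hypothetical stand-ins for the Pinecone query and the Groq rewrite, and keeping the most recent retrieval (rather than the best one seen) is an assumption; the two constants reuse defaults quoted in this document.

```python
from typing import Callable, List, Tuple

RAG_CRAG_MAX_ITERS = 2      # hard loop bound
RAG_CRAG_GOOD_SCORE = 0.45  # top cosine score treated as good enough

Retrieve = Callable[[str], Tuple[List[str], float]]  # query -> (chunks, top cosine score)
Rewrite = Callable[[str], str]                       # query -> rewritten query


def corrective_retrieve(
    query: str, retrieve: Retrieve, rewrite: Rewrite
) -> Tuple[List[str], float, int]:
    """Retrieve, then rewrite and retry while retrieval is weak, at most MAX_ITERS times."""
    chunks, top_score = retrieve(query)
    rewrites = 0
    # The counter check makes termination unconditional, whatever the scores do.
    while top_score < RAG_CRAG_GOOD_SCORE and rewrites < RAG_CRAG_MAX_ITERS:
        query = rewrite(query)
        chunks, top_score = retrieve(query)
        rewrites += 1
    return chunks, top_score, rewrites
```

For a topic missing from the knowledge base, this performs exactly three retrievals and two rewrites before handing control back to the graph.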

---

### 6. Two-layer faithfulness check

| Layer | When | Model calls | What it checks |
|---|---|---|---|
| `verify_citations` | Always | Zero | `[n]` citation markers that reference out-of-range chunk indices |
| `judge_faithfulness` | When `RAG_FAITHFULNESS_ENABLED=True` + not abstaining | 1 (reuses Groq client) | Whether answer claims are supported by the retrieved context |

The judge uses the **existing Groq LLM**, not the retrieval embedder.  Re-embedding the answer
with the same model used for retrieval would encode the embedder's biases into the faithfulness
signal (circular validation).

**Flag default OFF:** every `/chat` request would otherwise pay for a second LLM call.  On
Groq's free tier the cost is latency, not money, but it is still undesirable for interactive
use.  When the flag is OFF, `grounded` and `faithfulness_score` in `ChatResponse` are `null`;
the UI renders this as "not evaluated" and never fabricates a value.
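
The zero-model-call layer can be approximated with a regular expression. The sketch below assumes citations are 1-based `[n]` markers; the helper name is hypothetical and this is not the repo's `verify_citations` implementation.

```python
import re
from typing import List

_CITATION = re.compile(r"\[(\d+)\]")


def out_of_range_citations(answer: str, num_chunks: int) -> List[int]:
    """Return cited indices that do not map to any retrieved chunk (1-based assumed)."""
    cited = {int(n) for n in _CITATION.findall(answer)}
    return sorted(n for n in cited if n < 1 or n > num_chunks)
```

A check like this catches citations to chunks that were never retrieved, but it says nothing about whether a valid citation actually supports the claim; that is the judge layer's job.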

---

### 7. Honest streaming

`/chat/stream` uses `llm.astream` for the generation phase only, the one phase where it matters
for time to first token (TTFT).  Pre-generation nodes (retrieval, CRAG, web search) run
synchronously in a thread pool; making them async-native would add complexity with no
meaningful latency improvement.

**Non-streamable paths are honest:**
- Cache hit → one token event with the full cached answer, `done.cached=true`
- Abstention → one token event with the deterministic abstention text
- Neither path calls the LLM or simulates token-by-token output

The previous implementation yielded whitespace-split words from a completed string.  That
misrepresented itself as streaming.
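
The difference between honest and simulated streaming can be sketched with an async generator. The event dictionaries and function names below are assumptions for illustration; the real SSE payloads may differ, and `fake_tokens` merely stands in for a live model stream.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Optional


async def stream_events(
    cached_answer: Optional[str],
    token_source: AsyncIterator[str],
) -> AsyncIterator[Dict]:
    """Yield SSE-style events; the event dict shape is hypothetical."""
    if cached_answer is not None:
        # Cache hit: one token event carrying the whole answer, no fake chunking.
        yield {"event": "token", "text": cached_answer}
        yield {"event": "done", "cached": True}
        return
    # Live path: forward tokens exactly as the model produces them.
    async for token in token_source:
        yield {"event": "token", "text": token}
    yield {"event": "done", "cached": False}


async def fake_tokens() -> AsyncIterator[str]:
    """Stand-in for a real model token stream such as `llm.astream`."""
    for token in ["Hel", "lo"]:
        yield token


async def collect(events: AsyncIterator[Dict]) -> List[Dict]:
    """Drain an event stream into a list, for inspection."""
    return [event async for event in events]
```

Running `asyncio.run(collect(stream_events("cached answer", fake_tokens())))` returns two events, one token and one done, rather than a word-by-word replay.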

---

### 8. Cost and token observability

Token counts come from the **actual API response** (`response.usage_metadata`), not a local
tokenizer estimate.  All four LLM call types (generation, faithfulness judge, CRAG rewrite,
history contextualization) are tracked by `call_type` in `ChatResponse.usage.by_call_type` and
emitted as a Prometheus counter (`llm_tokens_total{call_type=...}`).

Dollar cost is an **estimate** from an as-of-date pricing table (`2026-06-25`) and is labeled
as such.  Embedding token counts are not reported because the Pinecone SDK does not expose them.
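
A rough sketch of per-call-type accounting is shown below. The `UsageTracker` class, the call-type labels, and the prices are all hypothetical (the prices are placeholders, not real Groq rates); the `input_tokens` and `output_tokens` keys follow the shape of LangChain's `usage_metadata`.

```python
from collections import defaultdict
from typing import Dict

# Placeholder USD prices per million tokens. NOT real provider pricing.
PRICE_PER_MTOK = {"input": 0.05, "output": 0.08}


class UsageTracker:
    """Accumulates API-reported token counts, keyed by call type."""

    def __init__(self) -> None:
        self.by_call_type: Dict[str, Dict[str, int]] = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0}
        )

    def record(self, call_type: str, usage_metadata: Dict[str, int]) -> None:
        """Add the counts from one LLM response to that call type's bucket."""
        bucket = self.by_call_type[call_type]
        bucket["input_tokens"] += usage_metadata.get("input_tokens", 0)
        bucket["output_tokens"] += usage_metadata.get("output_tokens", 0)

    def estimated_cost_usd(self) -> float:
        """Estimate cost from the static price table; an estimate, not a bill."""
        total_in = sum(b["input_tokens"] for b in self.by_call_type.values())
        total_out = sum(b["output_tokens"] for b in self.by_call_type.values())
        cost = total_in * PRICE_PER_MTOK["input"] + total_out * PRICE_PER_MTOK["output"]
        return cost / 1_000_000
```

Keeping the buckets separate is what lets a dashboard show, for example, how much of a request's spend went to the faithfulness judge rather than to generation.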

---

### 9. Reproducible corpus + pinned dimension

A corpus manifest (`eval/corpus_manifest.py generate`) snapshots vector IDs from the live
Pinecone index to `eval/corpus_manifest.json`.  A validator (`corpus_manifest.py validate`)
compares the committed manifest against the live index and reports drift without auto-reconciling.
Both operations are read-only.
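
Conceptually, validation is a set comparison that reports differences and changes nothing. The sketch below assumes the manifest reduces to a list of vector IDs; the function name and report keys are hypothetical, not the script's actual output format.

```python
from typing import Dict, Iterable, List


def manifest_drift(committed_ids: Iterable[str], live_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare committed vs live vector IDs and report drift without reconciling."""
    committed, live = set(committed_ids), set(live_ids)
    return {
        "missing_from_index": sorted(committed - live),   # in manifest, gone from index
        "unexpected_in_index": sorted(live - committed),  # in index, absent from manifest
    }
```

Two empty lists mean the live index still matches the snapshot the evaluation numbers were measured against.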

The embedding model (`llama-text-embed-v2`) and dimension (1024) are now explicit in `Settings`
(`PINECONE_EMBED_MODEL`, `PINECONE_EMBED_DIMENSION`) and logged at startup, removing the
implicit dependency on Pinecone's default dimension.

---

## Limitations & Tradeoffs

These are the real constraints.  A design doc that only lists strengths reads as incomplete.

**1. Saturated eval corpus.**
The evaluation golden set covers 34 chunks / 23 documents.  At this scale, baseline dense
retrieval is already at recall@10=0.97, so the metrics are ceiling-bound.  Any apparent
improvement (whether from reranking, CRAG, or parameter changes) may be noise rather than
signal.  No feature can be conclusively validated until the corpus is at least 10× larger.

**2. Prompt injection mitigation, not elimination.**
The RAG system prompt instructs the LLM to use only the supplied context and cite inline.
This reduces prompt injection risk but does not eliminate it: a sufficiently adversarial document
can still attempt to override instructions via embedded directives in chunk text.

**3. Same-model faithfulness judge.**
The faithfulness judge calls the same Groq LLM that generated the answer.  A model grading its
own output has a self-preference bias β€” it may rate its own claims as grounded even when they
are not.  A second independent model (e.g. a different provider) would give a less biased
verdict but at higher cost and latency.

**4. Cost is an estimate.**
`estimated_cost_usd` is computed from a static pricing table pinned to 2026-06-25.  It does
not account for free-tier credits, batch pricing, or promotional rates.  Treat it as an
order-of-magnitude indicator, not a billing source of truth.

**5. Reranking and hybrid search deferred, not for lack of trying.**
Reranking was implemented and A/B tested; it is disabled because the measurement showed no
improvement on this corpus size, not because the implementation is absent.  Hybrid search
(sparse + dense) is documented and designed but not implemented: the recall gap it would address
(proper-noun queries) does not exist at current corpus size, where baseline recall@10=0.97.

**6. Chunk size below recommended range.**
The `RecursiveCharacterTextSplitter` is configured to ~225 tokens per chunk (900 chars ÷ ~4
chars/token).  Pinecone's guidance for `llama-text-embed-v2` suggests 400–500 tokens for best
retrieval quality.  The current chunks are too short to exploit the model's full context window.
Changing `chunk_size` requires re-ingestion and re-evaluation against the golden set.

**7. CRAG threshold and faithfulness threshold are placeholders.**
`RAG_CRAG_GOOD_SCORE=0.45` (the cosine score below which query rewriting is triggered) and
`RAG_FAITHFULNESS_THRESHOLD=0.5` (the faithfulness score below which `grounded=False`) are
reasonable midpoints, not values calibrated against labeled data.  Both require a held-out
answer-quality evaluation to tune.

---

## Testing & Observability

**343 tests** (321 unit + 22 integration) run in CI with zero network calls, zero credentials.

| Layer | What it tests |
|---|---|
| Unit (321) | Pure functions: metrics, chunking, normalization, dedup, prompt builders, retrieval gating, faithfulness, CRAG, streaming, Prometheus, cost accounting |
| Integration (22) | Real FastAPI app via `TestClient`: HTTP routing, auth dependency, LangGraph pipeline, SSE protocol, abstention path, faithfulness wiring; externals mocked at boundaries |

CI runs from the fully-pinned `backend/requirements.txt` lock (compiled with `uv pip compile`,
constrained to tested versions), so every CI run is a clean-environment reproducibility check.

Observability:
- **`/metrics`** (JSON, auth-gated): request counts, error counts, 20-sample timing ring buffer
- **`/metrics/prometheus`** (Prometheus text, public): `http_requests_total` (Counter),
  `http_request_duration_seconds` (Histogram), `rag_phase_duration_seconds` (Histogram),
  `llm_tokens_total` (Counter by `call_type`)
- **LangSmith**: optional trace collection via `LANGCHAIN_TRACING_V2=true`

---

## How to Run

```bash
# Backend
cd backend
pip install -r requirements.txt
cp .env.example .env           # fill in PINECONE_*, GROQ_API_KEY, optional API_KEY
uvicorn app.main:app --port 8000

# Frontend (run from the repo root, in a separate terminal)
pip install -r requirements.txt   # root (Streamlit)
streamlit run frontend/app.py

# Run tests (zero credentials needed)
pytest tests/ -v

# Evaluate retrieval (requires live Pinecone; read-only)
make eval

# Load benchmark (in-process, mocked externals)
PYTHONPATH=backend python scripts/bench_mocked.py
```

Full configuration reference: [`backend/.env.example`](../backend/.env.example)  
Operational runbook (key rotation, rate-limit toggle, deployment): [`docs/CONTEXT.md`](CONTEXT.md)