RAG Agent Workbench — Design Document

Audience: Engineers and recruiters reviewing this repo.
Purpose: Explain the decisions behind the system — not just what it does, but why each choice was made and what the real tradeoffs are.
Exhaustive detail lives in docs/CONTEXT.md; this document curates the decisions that matter most.

What this is

A production-style RAG (Retrieval-Augmented Generation) backend built as a deliberate engineering exercise in decision-driven design. It ingests documents from Wikipedia, arXiv, and OpenAlex into a Pinecone vector index, then answers questions over that corpus via a 7-node LangGraph pipeline backed by Groq (LLaMA) and optional Tavily web search.

The headline capability: agentic RAG with corrective retrieval, cosine-gated abstention, two-layer faithfulness checking, honest token-level streaming, and per-request cost accounting — all wired to a Streamlit chat UI and a Prometheus metrics endpoint.

Every major feature was preceded by a retrieval evaluation harness. The rule: no parameter change without a measurement that justifies it.

Stack: FastAPI · LangGraph/LangChain · Pinecone (llama-text-embed-v2, 1024-dim, cosine) · Groq (LLaMA 3.1 8B) · Tavily (optional) · Streamlit · Prometheus · Docker

Architecture

See the pipeline diagram in the README for the full node flow.

A request to POST /chat passes through:

1. FastAPI middleware: CORS, API key auth (X-API-Key), slowapi rate limit (30 req/min), Prometheus HTTP instrumentation, in-memory TTL cache check.
2. run_in_threadpool: dispatches the LangGraph graph into a thread.
3. LangGraph pipeline (7 nodes, synchronous): see diagram.
4. Response serialization: Pydantic ChatResponse with grounding metadata, timings, token usage, and source citations.
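The in-memory TTL cache check in the middleware step can be pictured with a minimal sketch. The class name `TTLCache`, its methods, and the injectable clock are assumptions made for illustration; the repo's actual cache may be keyed and evicted differently.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are evicted lazily, on read.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)
```

A cache hit returns before the LangGraph pipeline is dispatched, which is why a cached response involves no LLM call.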

POST /chat/stream splits the pipeline into three phases. Phase 1 (pre-generation nodes) and phase 3 (post-generation grounding) run in a thread pool; phase 2 (token generation) is streamed asynchronously via llm.astream, so the first token reaches the client as soon as the model emits it.

Key Design Decisions

1. Eval-first, anti-circular-validation

The evaluation harness (eval/) was built before any parameter was tuned. Golden-set relevant_doc_ids are determined by reading document content, never by running the retriever and labelling its own output. Labels derived from the retriever's output would make recall@k tautological: the retriever would appear to have perfect recall because it is being scored against its own results.
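To make the tautology concrete, here is a minimal sketch of recall@k and precision@k against a golden set. The function names and the sample IDs are illustrative assumptions, not the harness's actual code.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of golden relevant docs that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("golden set must contain at least one relevant id")
    hits = relevant_ids.intersection(ranked_ids[:k])
    return len(hits) / len(relevant_ids)


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by golden relevant docs."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


ranked = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical retriever output

# Labels chosen by reading documents: d2 is relevant but was never retrieved.
independent_labels = {"d1", "d2"}
honest_recall = recall_at_k(ranked, independent_labels, 5)

# Labels copied from the retriever's own top-5: recall is perfect by construction.
circular_labels = set(ranked[:5])
circular_recall = recall_at_k(ranked, circular_labels, 5)
```

With independent labels the missed document `d2` lowers recall to 0.5; with circular labels the same ranking scores 1.0 regardless of quality.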

Tradeoff: building the harness first added upfront cost with no immediate feature output. The payoff is that every subsequent decision (reranking, top_k, cosine floor) is backed by a number, not intuition.

2. Two-threshold retrieval gate

Two independently configurable cosine thresholds serve different purposes:

| Setting | Default | Purpose |
| --- | --- | --- |
| `RAG_MIN_SCORE` | 0.25 | Routing: if `top_score < 0.25`, route to Tavily web fallback |
| `RAG_MIN_CHUNK_SCORE` | 0.20 | Safety floor: drop individual Pinecone chunks below this cosine score before they enter the LLM context |

The floor at 0.20 is a data-derived safety bound: the minimum cosine score of any golden-relevant chunk across 30 evaluation queries was 0.2368. Setting the floor at 0.20 places it below this bound so no known-relevant chunk is dropped. It is not a tuned optimum — sharp floor calibration requires chunk-level graded relevance labels.

Tradeoff: two thresholds with different semantics create configuration surface. Keeping them distinct (even at different defaults) avoids the silent failure mode of a single threshold accidentally serving both routing and filtering purposes.
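The two thresholds can be sketched as one routing decision followed by one per-chunk filter. The `Chunk` type, the `gate` function, and the route labels below are hypothetical; only the two setting names and their defaults come from the table above. Returning an empty chunk list on the web route is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

RAG_MIN_SCORE = 0.25        # routing threshold (default from the table)
RAG_MIN_CHUNK_SCORE = 0.20  # per-chunk safety floor (default from the table)


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    score: float  # cosine similarity reported by the vector index


def gate(
    chunks: List[Chunk],
    min_score: float = RAG_MIN_SCORE,
    min_chunk_score: float = RAG_MIN_CHUNK_SCORE,
) -> Tuple[str, List[Chunk]]:
    """Return (route, kept_chunks) for one retrieval result."""
    top_score = max((c.score for c in chunks), default=0.0)
    if top_score < min_score:
        # Routing decision: retrieval as a whole is too weak.
        return "web_fallback", []
    # Filtering decision: drop weak individual chunks, keep the rest.
    kept = [c for c in chunks if c.score >= min_chunk_score]
    return "rag", kept
```

Because the two checks read different settings, raising the routing threshold never silently changes which chunks reach the LLM context.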

3. Reranking: evaluated and disabled

A Pinecone hosted reranker (bge-reranker-v2-m3) was implemented, A/B tested against the baseline, and disabled by default after measurement showed it was flat-or-negative at every metric:

| Metric | Baseline | Rerank | Δ |
| --- | --- | --- | --- |
| nDCG@3 | 0.875 | 0.818 | −0.057 |
| nDCG@5 | 0.900 | 0.869 | −0.031 |
| Precision@1 | 0.966 | 0.966 | 0.000 |
| Mean latency | 360 ms | 795 ms | +435 ms |

Root cause: the corpus (34 chunks / 23 docs) is too small and well-separated for the dense retriever to miscalibrate top-of-list order. The reranker cannot demonstrate headroom it never had. RAG_RERANK_ENABLED=False is the empirically-validated default — enable only after the corpus grows to where dense retrieval misfires on precision.
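For readers unfamiliar with the metric in the table, a minimal nDCG@k sketch follows. It assumes binary relevance (a doc is either in the golden set or not); whether the harness uses binary or graded gains is not stated here, so treat this as one plausible formulation.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: rank i (0-based) is discounted by log2(i + 2)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """nDCG@k with binary gains, normalized by the best achievable ordering."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A reranker can only raise this score by moving relevant docs up the list. When the dense retriever already orders the top of the list correctly, there is nothing left to gain.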

4. top_k = 5: precision-first

The quality-vs-k curve (n=30 queries) shows:

| k | Recall@k | P@k |
| --- | --- | --- |
| 5 | 0.914 | 0.360 |
| 8 | 0.969 | 0.242 |
| 10 | 0.981 | 0.197 |

The recall-margin knee is k=8 (both recall and nDCG within 0.02 of the k=10 ceiling). Despite this, RAG_DEFAULT_TOP_K is kept at 5, a precision-first choice: k=5 delivers higher-signal context (P@5=0.36 vs P@8=0.24) at the accepted cost of 5.5 recall points relative to k=8 (6.7 relative to the k=10 ceiling).

Tradeoff: recall@k cannot settle this — it measures whether relevant docs appear in the ranked list, not whether a larger-but-noisier context improves LLM answer quality. The tiebreaker is a head-to-head answer-quality evaluation, which does not yet exist. Until it does, context signal quality is preferred over recall coverage.
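The arithmetic behind the precision-first choice can be checked directly from the quality-vs-k table. The helper names are illustrative; the numbers are copied from the table. Multiplying P@k by k gives the mean number of relevant chunks that land in the context.

```python
# k -> (recall@k, P@k), copied from the quality-vs-k table (n=30 queries).
CURVE = {5: (0.914, 0.360), 8: (0.969, 0.242), 10: (0.981, 0.197)}


def recall_gap_points(k_small: int, k_large: int) -> float:
    """Recall given up by choosing k_small over k_large, in percentage points."""
    return round((CURVE[k_large][0] - CURVE[k_small][0]) * 100, 1)


def mean_relevant_in_context(k: int) -> float:
    """Mean count of relevant chunks among the top k (P@k times k)."""
    return round(CURVE[k][1] * k, 2)


gap_vs_8 = recall_gap_points(5, 8)     # 5.5 points
gap_vs_10 = recall_gap_points(5, 10)   # 6.7 points
relevant_at_5 = mean_relevant_in_context(5)  # 1.8
relevant_at_8 = mean_relevant_in_context(8)  # 1.94
```

On these numbers, moving from k=5 to k=8 adds three chunks to the context but only about 0.14 relevant ones on average, which is the sense in which the larger context is noisier.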

5. Bounded CRAG corrective loop

corrective_retrieve (between retrieve_context and decide_next) grades retrieval quality by the cosine score already in state. If weak, it rewrites the query with Groq and re-queries Pinecone — up to RAG_CRAG_MAX_ITERS=2 times (a hard, unconditional loop bound).

The bound is non-negotiable: without it, a query on a topic not in the knowledge base would spin indefinitely on weak retrieval, exhausting rate limits and blocking the response.

Disabled by default (RAG_CRAG_ENABLED=False): the corpus is saturated at recall@10=0.97, so the corrective loop fires rarely on in-corpus queries. Enable it only after observing out-of-corpus queries where initial retrieval fails and the rewrite demonstrably helps.

Circular-validation avoidance: the grader uses the cosine score already in state — it does not re-embed with the retrieval model. Re-embedding would assess the retriever's output with the retriever's own semantic space.
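A minimal sketch of the bounded loop, with retrieval and rewriting injected as callables so the control flow stands on its own. The function signature, the return shape, and the choice to keep the latest retrieval result are assumptions; the defaults mirror RAG_CRAG_GOOD_SCORE=0.45 and RAG_CRAG_MAX_ITERS=2.

```python
from typing import Callable, List, Tuple

Retrieve = Callable[[str], Tuple[List[str], float]]  # query -> (chunks, top cosine score)
Rewrite = Callable[[str], str]                        # query -> rewritten query


def corrective_retrieve(
    query: str,
    retrieve: Retrieve,
    rewrite: Rewrite,
    good_score: float = 0.45,
    max_iters: int = 2,
) -> Tuple[List[str], float, int]:
    """Retrieve, then rewrite and retry while the top score is weak.

    Returns (chunks, top_score, rewrites_used). The for-loop bound is
    unconditional: at most max_iters rewrites happen, however weak retrieval is.
    """
    chunks, top_score = retrieve(query)
    rewrites_used = 0
    for _ in range(max_iters):
        if top_score >= good_score:
            break  # retrieval is good enough, stop correcting
        query = rewrite(query)
        rewrites_used += 1
        chunks, top_score = retrieve(query)
    return chunks, top_score, rewrites_used
```

For an out-of-corpus query this performs exactly one initial retrieval plus two retries, then hands the weak result on to the downstream routing gate.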

6. Two-layer faithfulness check

| Layer | When | Model calls | What it checks |
| --- | --- | --- | --- |
| `verify_citations` | Always | Zero | `[n]` citation markers that reference out-of-range chunk indices |
| `judge_faithfulness` | When `RAG_FAITHFULNESS_ENABLED=True` + not abstaining | 1 (reuses Groq client) | Whether answer claims are supported by the retrieved context |

The judge uses the existing Groq LLM — not the retrieval embedder. Re-embedding the answer with the same model used for retrieval would encode the embedder's biases into the faithfulness signal (circular validation).

Flag default OFF: every /chat request would otherwise pay for a second LLM call. On Groq's free tier the cost is latency, not money, but it is still undesirable for interactive use. When the flag is OFF, grounded and faithfulness_score in ChatResponse are null; the UI renders this as "not evaluated" and never fabricates a value.
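The zero-model-call layer amounts to a regular-expression scan. The sketch below assumes citation markers are 1-based (`[1]` refers to the first retrieved chunk) and uses a hypothetical function name; the real `verify_citations` may return a different shape.

```python
import re
from typing import List

_CITATION = re.compile(r"\[(\d+)\]")


def out_of_range_citations(answer: str, num_chunks: int) -> List[int]:
    """Return the sorted, de-duplicated citation numbers that point at no chunk.

    Assumes 1-based markers: valid values are 1..num_chunks inclusive.
    """
    cited = {int(number) for number in _CITATION.findall(answer)}
    return sorted(n for n in cited if n < 1 or n > num_chunks)
```

For example, `out_of_range_citations("Paris is the capital [1], founded earlier [4].", 3)` flags `4`, because only three chunks were supplied to the model.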

7. Honest streaming

/chat/stream uses llm.astream for the generation phase only, the one place where streaming improves time to first token (TTFT). Pre-generation nodes (retrieval, CRAG, web search) run synchronously in a thread pool; making them async-native would add complexity with no meaningful latency improvement.

Non-streamable paths are honest:

Cache hit → one token event with the full cached answer, done.cached=true
Abstention → one token event with the deterministic abstention text
Neither path calls the LLM or simulates token-by-token output

The previous implementation yielded whitespace-split words from a completed string. That misrepresented itself as streaming.
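The difference between honest and simulated streaming can be sketched with an async generator. The `token` and `done` event names follow the list above; the payload field `text`, the helper names, and the stand-in `fake_llm_astream` are assumptions that replace the real `llm.astream` call so the example runs without credentials.

```python
import asyncio
import json
from typing import AsyncIterator, List, Optional


async def fake_llm_astream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a real streaming LLM call; yields pieces as they 'arrive'."""
    for piece in ["Par", "is", "."]:
        await asyncio.sleep(0)
        yield piece


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_answer(prompt: str, cached_answer: Optional[str] = None) -> AsyncIterator[str]:
    if cached_answer is not None:
        # Cache hit: one token event with the whole answer, no fake chunking.
        yield sse("token", {"text": cached_answer})
        yield sse("done", {"cached": True})
        return
    # Live path: forward each piece the moment the model produces it.
    async for piece in fake_llm_astream(prompt):
        yield sse("token", {"text": piece})
    yield sse("done", {"cached": False})


async def collect(prompt: str, cached_answer: Optional[str] = None) -> List[str]:
    return [frame async for frame in stream_answer(prompt, cached_answer)]
```

Running `asyncio.run(collect("capital of France?", cached_answer="Paris."))` produces two frames, while the live path produces one frame per model chunk plus the final `done` frame.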

8. Cost and token observability

Token counts come from the actual API response (response.usage_metadata), not a local tokenizer estimate. All four LLM call types (generation, faithfulness judge, CRAG rewrite, history contextualization) are tracked by call_type in ChatResponse.usage.by_call_type and emitted as a Prometheus counter (llm_tokens_total{call_type=...}).

Dollar cost is an estimate from an as-of-date pricing table (2026-06-25) and is labeled as such. Embedding token counts are not reported — the Pinecone SDK does not expose them.
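Per-call-type accounting can be sketched as a small accumulator fed by each response's usage metadata. The `UsageTracker` class, the call-type strings, and the prices are hypothetical placeholders; only the as-of date and the `input_tokens` / `output_tokens` keys of LangChain's `usage_metadata` are taken as given.

```python
from collections import defaultdict
from typing import Dict, Mapping

PRICING_AS_OF = "2026-06-25"
# Placeholder prices in USD per million tokens; not the repo's actual table.
PRICE_PER_MILLION_USD = {"input_tokens": 0.05, "output_tokens": 0.08}


class UsageTracker:
    """Accumulates provider-reported token counts, grouped by call type."""

    def __init__(self) -> None:
        self.by_call_type: Dict[str, Dict[str, int]] = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0}
        )

    def record(self, call_type: str, usage_metadata: Mapping[str, int]) -> None:
        bucket = self.by_call_type[call_type]
        for key in ("input_tokens", "output_tokens"):
            bucket[key] += int(usage_metadata.get(key, 0))

    def estimated_cost_usd(self) -> float:
        """Estimate only: static prices, no free-tier or batch adjustments."""
        total = 0.0
        for bucket in self.by_call_type.values():
            for key, price in PRICE_PER_MILLION_USD.items():
                total += bucket[key] * price / 1_000_000
        return total
```

Because the counts come from the API response rather than a local tokenizer, the only estimated quantity in the result is the price.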

9. Reproducible corpus + pinned dimension

A corpus manifest (eval/corpus_manifest.py generate) snapshots vector IDs from the live Pinecone index to eval/corpus_manifest.json. A validator (corpus_manifest.py validate) compares the committed manifest against the live index and reports drift without auto-reconciling. Both operations are read-only.

The embedding model (llama-text-embed-v2) and dimension (1024) are now explicit in Settings (PINECONE_EMBED_MODEL, PINECONE_EMBED_DIMENSION) and logged at startup — removing the implicit dependency on Pinecone's default dimension.
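Drift detection between the committed manifest and the live index reduces to two set differences. This sketch takes plain lists of vector IDs; the function name and the report keys are assumptions, and fetching IDs from Pinecone is deliberately left out.

```python
from typing import Dict, Iterable, List


def manifest_drift(committed_ids: Iterable[str], live_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Report differences between the committed manifest and the live index.

    Read-only by design: it describes drift and never reconciles it.
    """
    committed, live = set(committed_ids), set(live_ids)
    return {
        "missing_from_index": sorted(committed - live),
        "unexpected_in_index": sorted(live - committed),
    }


def has_drift(report: Dict[str, List[str]]) -> bool:
    return any(report.values())
```

A non-empty report means evaluation numbers may no longer be comparable with earlier runs, since the corpus they were measured on has changed.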

Limitations & Tradeoffs

These are the real constraints. A design doc that only lists strengths reads as incomplete.

1. Saturated eval corpus. The evaluation golden set covers 34 chunks / 23 documents. At this scale, baseline dense retrieval is already at recall@10=0.97 — the metrics are ceiling-bound. Any apparent improvement (whether from reranking, CRAG, or parameter changes) may be noise rather than signal. No feature can be conclusively validated until the corpus is at least 10× larger.

2. Prompt injection mitigation, not elimination. The RAG system prompt instructs the LLM to use only the supplied context and cite inline. This reduces prompt injection risk but does not eliminate it: a sufficiently adversarial document can still attempt to override instructions via embedded directives in chunk text.

3. Same-model faithfulness judge. The faithfulness judge calls the same Groq LLM that generated the answer. A model grading its own output has a self-preference bias — it may rate its own claims as grounded even when they are not. A second independent model (e.g. a different provider) would give a less biased verdict but at higher cost and latency.

4. Cost is an estimate. estimated_cost_usd is computed from a static pricing table pinned to 2026-06-25. It does not account for free-tier credits, batch pricing, or promotional rates. Treat it as an order-of-magnitude indicator, not a billing source of truth.

5. Reranking and hybrid search deferred — not for lack of trying. Reranking was implemented and A/B tested; it is disabled because the measurement showed no improvement on this corpus size, not because the implementation is absent. Hybrid search (sparse + dense) is documented and designed but not implemented — the recall gap it would address (proper-noun queries) does not exist at current corpus size, where baseline recall@10=0.97.

6. Chunk size below recommended range. The RecursiveCharacterTextSplitter is configured to ~225 tokens per chunk (900 chars ÷ ~4 chars/token). Pinecone's guidance for llama-text-embed-v2 suggests 400–500 tokens for best retrieval quality. The current chunks are too short to exploit the model's full context window. Changing chunk_size requires re-ingestion and re-evaluation against the golden set.
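The chunk-size gap can be worked out with the same ~4 chars/token heuristic used above. The helper names are illustrative, and the heuristic is only a rough approximation of the actual tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic from the text, not a tokenizer measurement


def approx_tokens(chunk_chars: int) -> float:
    """Approximate token count for a chunk of the given character length."""
    return chunk_chars / CHARS_PER_TOKEN


def chars_for_tokens(target_tokens: int) -> int:
    """Approximate chunk_size (in characters) needed to reach a token target."""
    return target_tokens * CHARS_PER_TOKEN


current_tokens = approx_tokens(900)  # 225.0
suggested_chars = (chars_for_tokens(400), chars_for_tokens(500))  # (1600, 2000)
```

Under this heuristic, reaching the suggested 400 to 500 tokens would mean a chunk_size of roughly 1600 to 2000 characters, about double the current setting.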

7. CRAG threshold and faithfulness threshold are placeholders. RAG_CRAG_GOOD_SCORE=0.45 (the cosine score below which query rewriting is triggered) and RAG_FAITHFULNESS_THRESHOLD=0.5 (the faithfulness score below which grounded=False) are reasonable midpoints, not values calibrated against labeled data. Both require a held-out answer-quality evaluation to tune.

Testing & Observability

343 tests (321 unit + 22 integration) run in CI with zero network calls and zero credentials.

| Layer | What it tests |
| --- | --- |
| Unit (321) | Pure functions: metrics, chunking, normalization, dedup, prompt builders, retrieval gating, faithfulness, CRAG, streaming, Prometheus, cost accounting |
| Integration (22) | Real FastAPI app via `TestClient`: HTTP routing, auth dependency, LangGraph pipeline, SSE protocol, abstention path, faithfulness wiring; externals mocked at boundaries |

CI runs from the fully-pinned backend/requirements.txt lock (compiled with uv pip compile, constrained to tested versions) — every CI run is a clean-environment reproducibility check.

Observability:

/metrics (JSON, auth-gated) — request counts, error counts, 20-sample timing ring buffer
/metrics/prometheus (Prometheus text, public) — http_requests_total (Counter), http_request_duration_seconds (Histogram), rag_phase_duration_seconds (Histogram), llm_tokens_total (Counter by call_type)
LangSmith — optional trace collection via LANGCHAIN_TRACING_V2=true
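The 20-sample timing ring buffer behind the JSON /metrics endpoint can be approximated with a bounded deque. The class name and the snapshot fields are assumptions; the real endpoint may expose different statistics.

```python
from collections import deque
from typing import Deque, Dict


class TimingRing:
    """Keeps only the most recent `size` request durations (in seconds)."""

    def __init__(self, size: int = 20) -> None:
        # A deque with maxlen drops the oldest sample automatically on append.
        self._samples: Deque[float] = deque(maxlen=size)

    def record(self, seconds: float) -> None:
        self._samples.append(seconds)

    def snapshot(self) -> Dict[str, float]:
        if not self._samples:
            return {"count": 0, "mean": 0.0, "max": 0.0}
        return {
            "count": len(self._samples),
            "mean": sum(self._samples) / len(self._samples),
            "max": max(self._samples),
        }
```

A fixed-size buffer keeps memory constant and reflects recent behavior only, which complements the Prometheus histograms that aggregate over the process lifetime.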

How to Run

# Backend
cd backend
pip install -r requirements.txt
cp .env.example .env           # fill in PINECONE_*, GROQ_API_KEY, optional API_KEY
uvicorn app.main:app --port 8000

# Frontend (from the repo root, in a second terminal)
pip install -r requirements.txt   # root requirements (Streamlit)
streamlit run frontend/app.py

# Run tests (zero credentials needed)
pytest tests/ -v

# Evaluate retrieval (requires live Pinecone — reads only)
make eval

# Load benchmark (in-process, mocked externals)
PYTHONPATH=backend python scripts/bench_mocked.py

Full configuration reference: backend/.env.example
Operational runbook (key rotation, rate-limit toggle, deployment): docs/CONTEXT.md