RAG-PSYCH / docs /roadmap.md

arjun10g

Initial deploy to Hugging Face Spaces

08fc97e about 1 month ago


	# Build Roadmap

	Work through phases in order. Each phase produces a working, demo-able state.
	Check off boxes as you complete them — Claude Code will update these in
	commits alongside the code changes.

	## Phase 0 — Foundation (30 min)

	- [ ] Create repo structure (already done if you're reading this)
	- [ ] Copy `.env.example` to `.env`
	- [ ] Generate a strong Postgres password:
	`python -c "import secrets; print(secrets.token_urlsafe(24))"`
	- [ ] Paste the generated password into BOTH `POSTGRES_PASSWORD` and the
	password portion of `DATABASE_URL` in `.env`
	- [ ] Add your `ANTHROPIC_API_KEY` to `.env`
	- [ ] Set a $5 weekly spend limit at console.anthropic.com → Settings → Limits
	- [ ] Verify `.env` is in `.gitignore` and NOT tracked (`git status` should not show it)
	- [ ] Verify no `REPLACE_ME` strings remain: `grep REPLACE_ME .env` returns nothing
	- [ ] Create Python venv and install `requirements.txt`
	- [ ] Run `docker compose up -d postgres` and confirm `psql` connection works
	- [ ] Run `git init` and make the first commit — `.env` must NOT appear in it

	Exit criteria: Postgres running, pgvector extension available, no secrets
	staged for commit.
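
The `.env` checks in this phase can also be scripted. The following is a minimal sketch, assuming a plain `KEY=VALUE` file; `parse_env` and `check_env` are hypothetical helpers for illustration, not part of the repo.

```python
from urllib.parse import unquote, urlsplit


def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("\"'")
    return env


def check_env(text):
    """Return a list of human-readable problems found in .env contents."""
    env = parse_env(text)
    problems = []
    if "REPLACE_ME" in text:
        problems.append("REPLACE_ME placeholder still present")
    password = env.get("POSTGRES_PASSWORD", "")
    # The password portion of DATABASE_URL must equal POSTGRES_PASSWORD.
    url_password = urlsplit(env.get("DATABASE_URL", "")).password
    if not password or url_password is None or unquote(url_password) != password:
        problems.append("POSTGRES_PASSWORD does not match DATABASE_URL")
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is missing")
    return problems
```

Calling `check_env(open(".env").read())` should return an empty list once every box above is ticked.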

	## Phase 1 — Multi-source ingestion (3-4 hours, the biggest phase)

	The pluggable architecture means each source is independent. Implement
	them in this order — each produces a visible milestone, and later ones
	build on lessons from earlier ones.

	### Phase 1a — MTSamples (45 min)

	- [x] Download MTSamples CSV from Kaggle to `data/mtsamples.csv`
	- [ ] Confirm `data/mtsamples.csv` does NOT show up in `git status` (deferred — repo not yet `git init`'d; `.gitignore` `data/*` rule already covers it)
	- [x] Implement `ingest/sources/mtsamples.py::MTSamplesSource.load()`:
	filter to psych-relevant rows, yield a RawDocument per row
	- [x] Implement `chunk()` with regex section splitting +
	recursive-character fallback
	- [x] Smoke test: `python -c "from ingest.sources.mtsamples import *; \
	s = MTSamplesSource(); print(sum(1 for _ in s.load()))"`
	→ 812 docs, 8,296 chunks (avg 366 chars/chunk, 1024/1041 docs hit section regex)
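
For orientation, regex section splitting with a recursive-character fallback could look roughly like the sketch below. The header pattern, separator list, and `max_chars` value are assumptions for illustration; the actual `chunk()` in `ingest/sources/mtsamples.py` may differ.

```python
import re

# Assumed header shape: an ALL-CAPS label followed by a colon,
# e.g. "CHIEF COMPLAINT:" or "HISTORY OF PRESENT ILLNESS:".
SECTION_RE = re.compile(r"(?<![A-Za-z])([A-Z][A-Z /&-]{2,40}):")
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text, max_chars=500, separators=SEPARATORS):
    """Split on the coarsest separator present, recursing on oversized pieces."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part
        if current:
            pieces.append(current)
        out = []
        for piece in pieces:
            out.extend(split_recursive(piece, max_chars, separators[i + 1:]))
        return out
    # No separator left: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk(text, max_chars=500):
    """Return (section, chunk_text) pairs; section is None without a header."""
    matches = list(SECTION_RE.finditer(text))
    if not matches:
        return [(None, c) for c in split_recursive(text, max_chars)]
    # Text before the first header is kept as an unlabeled chunk.
    chunks = [(None, c) for c in split_recursive(text[:matches[0].start()], max_chars)]
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        for c in split_recursive(text[m.end():end], max_chars):
            chunks.append((m.group(1), c))
    return chunks
```

Documents that match no header fall straight through to `split_recursive`, which is the fallback path.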

	### Phase 1b — Top-level runner (30 min)

	- [x] Implement `ingest/run.py` with argparse, dotenv, tqdm, batched
	embedding, parameterized INSERT with ON CONFLICT upsert
	- [x] Run `python ingest/run.py --sources mtsamples`
	- [x] Verify: `SELECT COUNT(*) FROM documents WHERE source_type='mtsamples';`
	returns the expected count → 812
	- [x] Verify: `SELECT COUNT(*) FROM chunks;` returns more rows than that → 8,296
	(all rows have non-null embedding + tsv; cosine search returns
	relevant psych chunks with similarity >0.91)

	### Phase 1c — PubMed (60 min)

	- [ ] Register for an NCBI API key at ncbi.nlm.nih.gov/account (optional — running without; 3 req/sec is fine for retmax=2000)
	- [x] Add `NCBI_EMAIL` and optionally `NCBI_API_KEY` to `.env`
	- [x] Implement `PubMedSource.load()`:
	  - esearch with the MeSH-based psychiatry query
	  - batched efetch (200 PMIDs per call)
	  - cache each fetched record to `data/cache/pubmed/{pmid}.json`
	  - skip cached records on re-run
	- [x] Implement `chunk()` — one chunk per abstract or per structured
	section if the abstract has Background/Methods/Results/Conclusions
	- [x] Run `python ingest/run.py --sources pubmed` → 2,000 docs / 2,315 chunks
	- [x] Watch for rate limit errors — Biopython retries automatically,
	but sustained 429s mean you need to set NCBI_EMAIL properly
	(no 429s observed; full fetch in ~13s)
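
The batch-then-cache pattern above can be sketched as follows. `fetch_batch` is a hypothetical callable standing in for the Entrez `efetch` call, and the record shape is an assumption; only the 200-PMID batch size and the `{pmid}.json` cache layout come from the checklist.

```python
import json
from pathlib import Path

BATCH_SIZE = 200  # PMIDs per efetch call, per the checklist above


def load_records(pmids, fetch_batch, cache_dir):
    """Yield one record per PMID, fetching only those not already cached.

    `fetch_batch` is a hypothetical callable: list[str] -> {pmid: record}.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Only PMIDs without a cache file are sent to the API.
    missing = [p for p in pmids if not (cache_dir / f"{p}.json").exists()]
    for start in range(0, len(missing), BATCH_SIZE):
        batch = missing[start:start + BATCH_SIZE]
        for pmid, record in fetch_batch(batch).items():
            (cache_dir / f"{pmid}.json").write_text(json.dumps(record))

    for pmid in pmids:
        path = cache_dir / f"{pmid}.json"
        if path.exists():  # a PMID the API did not return stays uncached
            yield json.loads(path.read_text())
```

On a second run with a warm cache, `fetch_batch` is never called, which is why re-ingestion is fast.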

	### Phase 1d — ICD-11 (75 min)

	- [x] Register at icd.who.int/icdapi, create API access key
	- [x] Add `ICD_CLIENT_ID` and `ICD_CLIENT_SECRET` to `.env`
	- [x] Implement an OAuth2 token helper:
	  - POST to `icdaccessmanagement.who.int/connect/token`
	  - cache token to `data/cache/icd11/.token.json` with expiry
	  - refresh on 401 from API calls
	- [x] Implement `ICD11Source.load()`:
	  - GET the Chapter 06 entity (auto-follows `latestRelease` for the
	    version-pinned URI; current release is `2026-01`)
	  - recursively walk `child` URIs to enumerate all mental disorders
	  - for each entity, GET its URI and extract title, definition,
	    additional info, diagnostic criteria, inclusion/exclusion,
	    synonyms, index terms
	  - cache each entity response to `data/cache/icd11/{entity_id}.json`
	- [x] Implement `chunk()` — one chunk per meaningful field, with the
	field name as the `section`
	- [x] Run `python ingest/run.py --sources icd11` → 685 docs / 1,683 chunks
	(Definition: 659, Index Terms: 608, Exclusion: 282, Coding Note: 53,
	Inclusion: 39, Fully Specified Name: 32, Long Definition: 10)
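
The recursive `child` walk and the one-chunk-per-field rule can be illustrated with a small sketch. `get_entity` is a hypothetical callable standing in for the authenticated, cached GET, and the flat field layout is a simplification of the real API response.

```python
def walk_entities(root_uri, get_entity, seen=None):
    """Depth-first walk over `child` URIs, yielding (uri, entity) once each.

    `get_entity` is a hypothetical callable: uri -> dict.
    """
    seen = set() if seen is None else seen
    if root_uri in seen:
        return
    seen.add(root_uri)
    entity = get_entity(root_uri)
    yield root_uri, entity
    for child_uri in entity.get("child", []):
        yield from walk_entities(child_uri, get_entity, seen)


def chunk_entity(entity, fields=("definition", "inclusion", "exclusion")):
    """One (section, text) chunk per non-empty field; field names are assumed."""
    return [(name, entity[name]) for name in fields if entity.get(name)]
```

The `seen` set matters because an entity can be reachable from more than one parent; without it the same entity would be fetched and chunked twice.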

	### Phase 1e — Full run + sanity check (15 min)

	- [x] `python ingest/run.py --sources all` (cache hits for PubMed and
	ICD-11; mtsamples re-reads CSV; embedding step re-runs across all
	~12k chunks each time the runner is invoked)
	- [x] Per-source chunk counts via `chunks_with_source`:
	mtsamples=8,296, pubmed=2,315, icd11=1,683 → 12,294 total
	- [x] 5 hand-picked sanity queries: clinical→mtsamples, diagnostic→icd11,
	research→pubmed all route correctly. Exact-string drug query returns
	same-class drug (citalopram for "sertraline") — motivates hybrid
	BM25 in Phase 2. Off-topic query drops cosine ~0.07 vs in-domain
	(0.866 vs 0.94) — usable as a refusal signal in Phase 3.

	Known limitations carried forward:
	- MTSamples CSV contains literal duplicate rows; deduping not in scope here.
	- Total chunk count (12,294) is slightly above the 3K–10K target, driven by
	the broad mtsamples keyword filter (812 docs vs the docstring's expected
	50–100). Acceptable for a portfolio piece; revisit if retrieval gets noisy.

	Exit criteria: All three sources populated. Total chunk count
	somewhere in the 3,000-10,000 range. Hand-run similarity queries return
	sensible results from the right sources (e.g. diagnostic query returns
	ICD-11 chunks, research query returns PubMed chunks).

	## Phase 2 — Retrieval with RRF + Cross-Encoder Reranking (90 min)

	> Revised from the original "weighted-sum hybrid" plan after a literature
	> review. Production and research RAG stacks (MedRAG, OpenSearch, Anthropic
	> Contextual Retrieval) ship Reciprocal Rank Fusion (k=60) and a cross-encoder
	> reranker as the canonical Phase-2 build. Score-normalization weighted-sum is
	> brittle across query types (the α that works for entity queries fails for
	> paraphrastic ones); RRF aggregates ranks instead and is robust by design.

	- [x] Write `api/rag.py` with two retrievers:
	  - `retrieve_vector(query, k, source_types=None)`: cosine via `<=>`
	    on `chunks_with_source`, optional `source_type` filter
	  - `retrieve_bm25(query, k, source_types=None)`: `ts_rank` over the
	    `tsv` GIN index. Tokens are extracted with a strict alphanumeric regex
	    and joined with OR (`|`); `plainto_tsquery`'s implicit AND was
	    too brittle for natural-language queries containing rare drug
	    names + common modifiers
	- [x] Write `api/hybrid.py` with `retrieve_hybrid(query, k=5, candidate_k=50,
	  source_types=None)`:
	  - pull top `candidate_k` from each retriever
	  - fuse via RRF: score = Σ 1 / (HYBRID_RRF_K + rank_in_retriever_i)
	  - dedupe by chunk text (MTSamples CSV has literal duplicate rows)
	  - cross-encoder rerank the fused candidates
	    (`cross-encoder/ms-marco-MiniLM-L-12-v2`, ~150 ms on CPU)
	  - return top-`k` by rerank score
	  - if best rerank score < `RERANK_MIN_SCORE`, return `[]` so the
	    generation layer can emit the canonical refusal
	- [x] Add env vars to `.env.example` and `.env`:
	`HYBRID_RRF_K=60`, `RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-12-v2`,
	`RERANK_MIN_SCORE=-5.0`, `RETRIEVAL_CANDIDATE_K=50`
	(dropped the unused `HYBRID_VECTOR_WEIGHT` / `HYBRID_BM25_WEIGHT`)
	- [x] Run 7 manual test queries across sources:
	  - Clinical scenario ("patient presents with persistent low mood"):
	    should favor MTSamples
	  - Diagnostic criteria ("criteria for generalized anxiety disorder"):
	    should favor ICD-11
	  - Research question ("efficacy of CBT for OCD"):
	    should favor PubMed
	  - Exact match ("sertraline 50mg"): RRF + rerank should now
	    surface the literal-token hit, not just same-class drugs
	  - Semantic paraphrase: vector retriever lift
	  - Off-topic ("best pizza recipe"): should fall below
	    `RERANK_MIN_SCORE` and trigger the refusal path
	  - Cross-source ("what does research say about diagnostic criteria
	    for depression?"): should pull from PubMed AND ICD-11
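
The fusion, dedupe, and refusal steps listed for `retrieve_hybrid` can be sketched in a few lines. This is an illustrative sketch, not the code in `api/hybrid.py`: the helper names are hypothetical, and the cross-encoder is left out so that rerank scores are passed in directly.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, rrf_k=60):
    """Reciprocal Rank Fusion: score = sum over lists of 1 / (rrf_k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def dedupe_by_text(chunk_ids, texts):
    """Keep the first chunk for each distinct chunk text, preserving order."""
    seen, out = set(), []
    for chunk_id in chunk_ids:
        if texts[chunk_id] not in seen:
            seen.add(texts[chunk_id])
            out.append(chunk_id)
    return out


def top_k_or_refuse(scored, k=5, min_score=-5.0):
    """scored: (chunk_id, rerank_score) pairs. Returns [] to signal refusal."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return []
    return ranked[:k]
```

Because RRF only looks at ranks, the cosine and `ts_rank` scores never need to be normalized against each other, which is the property the blockquote above is after.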

	Exit criteria — actual results:

	| Query | Outcome |
	|---|---|
	| Clinical scenario (low mood + anhedonia) | ICD-11 melancholic-depression Definition + 2 psych consults in top-5; top-1 was a non-psych "patient presents with" template (cross-encoder surface-form bias). Mostly correct. |
	| Diagnostic (criteria for GAD) | ICD-11 GAD Definition in top-2; rest are pubmed GAD-related. Correct. |
	| Research (CBT for OCD) | All 5 results pubmed (correct routing); content is CBT/cognitive-therapy adjacent but not OCD-specific (corpus retmax=2000 didn't include enough OCD-specific abstracts). Source routing correct, content thin. |
	| Exact drug (sertraline 50mg) | Returns citalopram (same SSRI class) for depression, ICD-11 depression index terms. The literal sertraline chunk is buried: it's a kidney-failure discharge med list, not a psych chunk; both vector and BM25 score depression-rich chunks higher. Documented limitation: corpus + chunking, not retrieval algorithm. |
	| Paraphrase ("disappear forever") | Refused: top rerank score −7.15 (below threshold of −5.0). Cross-encoder pulled dissociation chunks instead of suicidal-ideation; the lay-language query doesn't lexically match clinical SI vocabulary. Refusal is the conservative-correct behavior here. |
	| Off-topic (pizza Naples) | Refused: all candidates below threshold. Correct. |
	| Cross-source (research on diagnostic criteria) | All pubmed top-5 (no ICD-11). The query's "research says" framing biases the cross-encoder away from canonical definitions toward research abstracts. Source routing partially correct. |

	Known limitations carried into Phase 3+ (worth interview discussion):
	- Cross-encoder is `ms-marco-MiniLM-L-12-v2`, trained on generic web search,
	not clinical text. Surface-form patterns ("patient presents with…") and
	euphemistic clinical language are weak spots. BGE-reranker-v2-m3 would
	likely do better, at a CPU latency cost estimated at ~10–15× in Phase 2.5.
	Tune on the eval set in Phase 6.
	- Postgres `ts_rank` is term-density-only (no IDF). For real BM25 with IDF
	you need OpenSearch/Elastic or a custom Postgres extension. Acceptable
	for the demo; flag in interview.
	- The refusal threshold `−5.0` is an educated default. Phase 6 eval set
	is the right place to tune it against precision/recall curves.

	### Phase 2.5 — Lexical-boost retriever + negation filter

	After running the Phase 2 battery I went one round deeper to address two
	specific failure modes: the literal sertraline chunk being buried (rare
	clinical entities don't survive `ts_rank`'s term-density bias) and
	chunks with negated clinical concepts being treated as positive evidence
	(every embedder and cross-encoder we tested is polarity-blind).

	What landed:
	- Third RRF retriever: `retrieve_lexical(query, k)` in `api/rag.py`.
	Extracts "rare" query tokens (alphabetic ≥8 chars not in a generic-medical
	stoplist; OR all-uppercase ≥3 chars; OR mixed letter+digit ≥3 chars for
	ICD codes). Scores each chunk by Σ(matched-token length) via parameterised
	ILIKE so longer specific tokens (sertraline) outweigh short noisy ones
	(50mg). Returns [] when the query has no rare tokens — vector + BM25 cover
	that case.
	- Custom rule-based negation detector at `api/negation.py`. Scope-aware
	per Chapman et al. 2001: word-pivot terminators (`but`/`however`/`with`)
	and non-comma punctuation end the scope, but commas don't, so list-style
	"negative for X, Y, Z" works. We initially tried `scispacy` + `negspacy`:
	it passed 5/5 synthetic cases but had a ~30% false-positive rate on real
	chunks because default NegEx scope leaks across conjunctions. The custom
	matcher hits 11/11 on a hand-built test grid including the killer FP case.
	Pure-Python regex; ~0.1 ms/chunk vs negspacy's ~17 ms.
	- Negation filter applied to the post-rerank top-15 window in
	`_drop_negated()`; flagged chunks dropped before the final top-k slice.
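
The rare-token rules and the length-weighted score described for `retrieve_lexical` can be illustrated in plain Python. This is only a sketch: the stoplist contents are invented for the example, and the real retriever does the matching in SQL via ILIKE rather than in Python.

```python
import re

# Hypothetical stoplist; the real generic-medical list is not shown here.
GENERIC_MEDICAL = {"patient", "patients", "disorder", "treatment",
                   "diagnosis", "symptoms", "criteria", "clinical"}
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def rare_tokens(query):
    """Extract tokens that are likely to be specific clinical entities."""
    out = []
    for tok in TOKEN_RE.findall(query):
        has_alpha = any(c.isalpha() for c in tok)
        has_digit = any(c.isdigit() for c in tok)
        if tok.isalpha() and len(tok) >= 8 and tok.lower() not in GENERIC_MEDICAL:
            out.append(tok)  # long alphabetic word, e.g. a drug name
        elif tok.isalpha() and tok.isupper() and len(tok) >= 3:
            out.append(tok)  # acronym, e.g. GAD
        elif has_alpha and has_digit and len(tok) >= 3:
            out.append(tok)  # mixed letter+digit, e.g. an ICD code or 50mg
    return out


def lexical_score(chunk_text, tokens):
    """Sum of matched-token lengths, case-insensitive substring match."""
    lowered = chunk_text.lower()
    return sum(len(t) for t in tokens if t.lower() in lowered)
```

Weighting by token length is what lets a chunk containing `sertraline` outrank one that only matches `50mg`.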

	Decisions deliberately NOT taken (with reasons):
	- BGE-reranker-v2-m3 swap. ~10–15× CPU latency vs ms-marco; the gain on
	short keyword queries is small per the model card. Eval-set decision
	for Phase 6.
	- NLI second-pass (`cross-encoder/nli-deberta-v3-base`). Covers the same
	failure mode as our negation filter at ~3–5 s per 50 candidates;
	NegEx-style is the clinical-NLP canonical answer and is two orders of
	magnitude faster. Defer; revisit if our rule-based detector misses
	cases that an entailment model would catch.
	- scispacy + negspacy in `requirements.txt`. Installed during evaluation
	but the runtime path doesn't import them; not declared.

	Verified post-Phase-2.5 results on 10 queries (7 original + 3 negation):

	| Query | Result vs Phase 2 baseline |
	|---|---|
	| Clinical (low mood, anhedonia) | Top-1 now ICD-11 Current depressive episode Definition (was a non-psych "patient presents" chunk). |
	| Diagnostic (criteria for GAD) | ICD-11 Generalised anxiety disorder Definition top-2 (unchanged; already correct). |
	| Research (CBT for OCD) | All 5 pubmed (correct routing); content thin because retmax=2000 doesn't include enough OCD-specific abstracts (corpus limit, not retrieval bug). |
	| Exact drug (sertraline 50mg) | Top-1 is now the literal Sertraline-100mg chunk (was citalopram). Lexical-boost did its job. |
	| Paraphrase ("disappear forever") | Still REFUSED (top score −7.15, below −5.0 threshold). Domain mismatch between lay-language query and clinical chunks; conservative refusal is the correct clinical-RAG behavior. |
	| Off-topic (pizza Naples) | Refused. ✅ |
	| Cross-source (research on diagnostic criteria) | Top-3 now includes RDoC + Diagnostic Criteria for Psychosomatic Research (was off-topic depression-research abstracts). |
	| NEG-SI ("patient with active SI") | Top-5 all affirm SI; verified manually that a "Psych: No suicidal, homicidal ideations" chunk is correctly DROPPED by the negation filter. |
	| NEG-DEPRESSION | Top-5 all psych consults / discharge summaries with depression history. |
	| NEG-PSYCHOSIS | Top-5 all ICD-11 psychotic-disorder Definitions. Best routing of any query. |

	Latency profile (M-series CPU): cold first call ~5.8 s (model loads),
	subsequent queries 0.9–2.0 s, refused queries ~1 s. All within budget for
	an interactive demo.

	Limitations still open (for Phase 6 eval):
	- Negation detector uses substring matching, so query term "depression"
	won't catch "depressive". Stemming or lemma-aware matching would help.
	- Paraphrase / euphemism handling is bottlenecked by the generic
	ms-marco cross-encoder. Defense-in-depth via Phase 3 prompt is the
	cheapest mitigation.
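
To make the scope rule and the substring limitation concrete, here is a minimal NegEx-style sketch. The trigger and terminator lists are illustrative assumptions and much shorter than what `api/negation.py` would need; the function names are hypothetical.

```python
import re

# Assumed trigger phrases; longer phrases come first so they win over "no".
TRIGGERS = r"\b(?:no history of|negative for|denies|denied|without|no)\b"
# Word pivots and non-comma punctuation end the scope; commas do not.
TERMINATORS = r"\b(?:but|however|with)\b|[.;:]"

_SCOPE_RE = re.compile(
    rf"{TRIGGERS}(.*?)(?={TERMINATORS}|$)", re.IGNORECASE | re.DOTALL
)


def negated_spans(text):
    """Return the text spans that fall inside a negation scope."""
    return [m.group(1) for m in _SCOPE_RE.finditer(text)]


def is_negated(text, term):
    """True if `term` appears (as a substring) inside any negation scope."""
    term = term.lower()
    return any(term in span.lower() for span in negated_spans(text))
```

With this sketch, `is_negated("No depressive symptoms.", "depression")` returns `False`, which is exactly the substring-matching gap described in the first limitation above.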

	## Phase 3 — Generation with Citations (60 min)

	- [x] Write `generate(query, reranked_hits) -> Generation` in `api/generate.py`
	— `Generation(answer, cited_ids, invalid_cited_ids, refused, model, latency_ms)`
	- [x] System prompt enforces four rules (rule 3 added during build):
	  1. Use ONLY the information in the provided chunks
	  2. Every factual claim ends with `[chunk_id]`
	  3. Polarity check before citing: denied / "no history of" / "ruled out"
	     chunks must NOT be cited as evidence FOR the condition. Defense-in-depth
	     on top of the retrieval-time NegEx filter (`api/negation.py`)
	  4. If chunks don't answer, return EXACTLY the refusal string
	- [x] Post-generation validation: `_CITATION_RE` parses `[chunk_id]` references;
	  any ID that isn't in the retrieved set is flagged in
	  `Generation.invalid_cited_ids`. Across the 7-query battery: 0 invalid citations.
	- [x] Refusal short-circuit: `generate(query, [])` returns the canonical
	refusal string with `latency_ms=0` — no API call when retrieval refused.
	- [x] Test with 7 queries — results below.
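
The post-generation citation check can be sketched as below. The regex is an assumption (integer IDs in square brackets), and `validate_citations` is a hypothetical name; the real logic lives in `api/generate.py`.

```python
import re

# Assumes chunk IDs are integers written as [12345] in the answer.
_CITATION_RE = re.compile(r"\[(\d+)\]")


def validate_citations(answer, retrieved_ids):
    """Return (cited_ids, invalid_cited_ids), each unique and in order seen."""
    retrieved = set(retrieved_ids)
    cited_ids = []
    for match in _CITATION_RE.finditer(answer):
        chunk_id = int(match.group(1))
        if chunk_id not in cited_ids:
            cited_ids.append(chunk_id)
    # Any cited ID that retrieval never returned is a hallucinated citation.
    invalid = [cid for cid in cited_ids if cid not in retrieved]
    return cited_ids, invalid
```

A non-empty second element is the hallucination tripwire: the model cited a chunk it was never shown.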

	Live results on 7-query battery:

	| Query | Outcome |
	|---|---|
	| Clinical (low mood + anhedonia) | Returns refusal string + nuanced explanation: chunks describe depression but no chunk has the specific tri-symptom combination. Cited [24207, 18282, 22746, 24049] all valid. |
	| Diagnostic (criteria for GAD) | Clean answer from ICD-11 GAD Definition; cited chunk 24195 three times for three sub-claims. |
	| Research (CBT for OCD) | REFUSED: chunks were CBT-adjacent but not OCD-specific. |
	| Exact drug (sertraline 50mg) | Refusal-with-explanation: notes sertraline 100mg appears in a med list [19938] but not 50mg specifically; SSRI/depression mentioned in [18297]. Both citations valid. |
	| Off-topic (pizza Naples) | REFUSED at retrieval (0 ms, no API call). |
	| Cross-source (research on diagnostic criteria) | Synthesized 3 PubMed claims about diagnostic criteria limitations. Cited [22045, 21301, 22847] all valid. |
	| NEG-SI (active SI) | Cited 3 chunks all affirming SI in a 45-y/o female; no "denies SI" chunks made it through. Polarity defense-in-depth holds. |

	Citation validity: 7/7 queries with 0 invalid citations. Hallucination
	tripwire is clean.

	Latency / cost: 850 ms–3000 ms per call on Haiku 4.5 (Tier 1, no cache).
	~$0.001–0.005 per query. The 7-query battery cost ~$0.02 total.

	Behavior worth flagging for Phase 6: Haiku sometimes returns the refusal
	string AND a paragraph explaining why the chunks don't quite answer (CLINICAL,
	EXACT-DRUG above). The strict `answer == REFUSAL_STRING` check sees these as
	`refused=False` because of the trailing explanation. The behavior is
	defensible UX (the explanation is useful), but binary refusal counts in the
	eval harness should use `answer.startswith(REFUSAL_STRING)` instead.

	Exit declared: generation produces grounded, citation-tagged answers;
	hallucinated citation IDs are caught by the validator (none seen); off-topic
	queries trigger the refusal path with no API call; polarity rule holds in
	combination with the upstream NegEx filter.

	## Phase 4 — FastAPI Wrapper (45 min)

	- [x] `POST /query` with Pydantic request model: `query: str (max 2000 chars)`,
	`k: int (1-20, default 5)`, optional `source_types` filter
	- [x] Response model: `{answer, cited_ids, invalid_cited_ids, refused,
	retrieved_chunks, model, latency: {retrieval_ms, generation_ms, total_ms}}`
	- [x] `GET /health` — returns `{"status": "ok"}` (HTTP 200) when the DB
	`SELECT 1` succeeds, `{"status": "degraded"}` (HTTP 503) otherwise.
	No stack traces, version strings, or schema details leaked.
	- [x] Structured audit logging in `api/logging_config.py` — single-line JSON,
	logs `query_hash` (16-char SHA-256 prefix), k, retrieved_count,
	cited_count, invalid_cited_count, refused, model, retrieval_ms,
	generation_ms, total_ms. Verified: no raw query text or chunk
	text appears in logs (grep for known query strings returned nothing).
	Third-party loggers (httpx, urllib3, huggingface_hub, filelock)
	capped at WARNING so they don't drown out the audit lines.
	- [x] Rate limiting via `slowapi`, 30/minute per IP on `/query`.
	`/health` is intentionally NOT rate-limited (load-balancer/k8s
	probes hit it constantly). 429 response body is generic
	(`{"error": "Rate limit exceeded: 30 per 1 minute"}`) — no IP/client
	details leaked.
	- [x] CORS locked to `http://localhost:8501` (configurable via
	`CORS_ORIGIN` env var); `allow_credentials=False`, methods limited
	to GET/POST, headers limited to `Content-Type`.
	- [x] Pydantic validation errors normalised to HTTP 400 with a
	generic `{"error": "invalid_request"}` body — the default 422 with
	field-level errors would leak schema hints.
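
The hashed audit line can be sketched with the standard library alone. The event name and the `audit_line` helper are hypothetical; only the 16-character SHA-256 prefix and the single-line JSON shape come from the checklist above.

```python
import hashlib
import json


def query_hash(query):
    """First 16 hex characters of the SHA-256 digest of the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]


def audit_line(event, query, **metrics):
    """Build a single-line JSON record that never contains the raw query."""
    record = {"event": event, "query_hash": query_hash(query), **metrics}
    return json.dumps(record, separators=(",", ":"), sort_keys=True)
```

For example, `audit_line("query_completed", "criteria for GAD", k=3, refused=False)` yields one JSON object per line that can be correlated across runs by `query_hash` without exposing what was asked.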

	Verified end-to-end via curl against `uvicorn api.main:app --port 8000`:

	| Test | Result |
	|---|---|
	| `GET /health` against running Postgres | 200 `{"status":"ok"}` |
	| `POST /query` well-formed (GAD diagnostic query, k=3) | 200, single-citation answer from chunk 24195 (ICD-11 GAD Definition), 0 invalid citations |
	| `POST /query` with `query` of 2500 chars | 400 `{"error":"invalid_request"}` |
	| `POST /query` with `k=99` | 400 `{"error":"invalid_request"}` |
	| `POST /query` off-topic ("pizza Naples") | 200, refusal short-circuits at retrieval (`retrieval_ms` only, `generation_ms=0`, `refused=true`, `retrieved_chunks=[]`) |
	| 32 parallel `POST /query` requests | Requests beyond the 30/min limit return 429; rate limiter wired correctly |
	| Audit log inspection | Only `query_hash` + metrics; no raw query text or chunk text |

	Exit declared: API surface is production-shape — request validation
	returns generic 400s, audit logging hashes sensitive fields, health
	endpoint stays opaque on failure, rate limiting and CORS are locked down.

	## Phase 5 — UI: HTMX + FastAPI templates + Three.js + GSAP

	> Revised from the original "Streamlit UI" plan after a UI-framework
	> efficiency comparison. Streamlit re-runs the entire script on every
	> widget interaction; Gradio is closer to right but still ships its own
	> websocket framework. HTMX served by the existing FastAPI app is
	> the highest production-signal option: server-side rendering, no JS
	> framework, reuses the same `/query`-style endpoints with HTML responses
	> instead of JSON. Three.js + GSAP add the visual polish a clinical-AI
	> portfolio benefits from for an interview demo.

	- [x] Mount Jinja2 templates and static assets onto `api/main.py`:
	`/static` → `api/static/`, templates → `api/templates/`. Added
	`jinja2` and `python-multipart` to `requirements.txt`.
	- [x] `GET /ui` renders `index.html` (page shell, hero, search form,
	empty results section that HTMX swaps into).
	- [x] `POST /ui/query` is the HTMX endpoint — same retrieval +
	generation pipeline as the JSON `/query` route, but returns the
	rendered `_results.html` partial. Same audit logging
	(`ui_query_received`, `ui_query_completed`), same 30/min rate
	limit, same Pydantic-equivalent length and `k` bounds via
	FastAPI `Form()` constraints.
	- [x] `_render_citations()` HTML-escapes the LLM answer, then wraps
	each `[chunk_id]` in `<span class="citation" data-chunk="…">` so
	the frontend can hook hover/focus/click events. Chunk IDs are
	DB integers so safe to interpolate; the surrounding text is
	escaped.
	- [x] `index.html`: hero with neural-particle Three.js canvas behind
	everything, gradient title, search form (HTMX `hx-post`,
	`hx-target=#results`, `hx-indicator=#spinner`), tri-color loading
	dots, k selector (3/5/8/10), Tailwind via CDN.
	- [x] `_results.html`: two-column grid, grounded-answer card OR amber
	"insufficient evidence" card on refusal, latency strip
	(retrieval / generation / total), source-color-coded chunk cards
	in the sidebar (`mtsamples` cyan, `pubmed` fuchsia, `icd11`
	emerald), each card carries `data-chunk-id` for citation linking.
	Hallucinated-citation warning rendered when
	`invalid_cited_ids` is non-empty.
	- [x] `static/app.js` (Three.js, ES modules via importmap):
	140-particle drifting cloud with O(N²) pair-link scan rendering
	lines under a 14-unit threshold. Pre-allocated buffer geometries
	so no per-frame allocation; pauses on `visibilitychange`. Subtle
	cyan/fuchsia palette matching the hero gradient.
	- [x] `static/animations.js` (GSAP): page-load fade-in for hero +
	search form, `htmx:afterSwap` listener animates results card
	and chunk-card stagger, `hookCitations()` wires hover/focus →
	glow + 1.03× scale on the matching chunk card and click →
	`ScrollToPlugin` smooth-scroll with offset. Citations whose
	target isn't in the rendered set get the `citation-invalid` class
	automatically (rose color) — second hallucination tripwire after
	the server-side audit.
	- [x] `static/styles.css`: HTMX `htmx-indicator` toggle, pulse-dot
	keyframes for the spinner, citation chip + invalid-citation
	styling, `chunk-glow` shadow rule, 4-line `line-clamp` utility
	(Tailwind CDN doesn't ship plugins).
	- [x] Error path: any exception in `/ui/query` renders `_error.html`
	(HTTP 500) with a generic message — no stack traces leak.
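
The escape-then-wrap order in `_render_citations()` is the important part, and a minimal sketch shows why. The function name here drops the underscore and the exact inner markup is an assumption; only the `class` and `data-chunk` attributes come from the checklist above.

```python
import html
import re

_CITATION_RE = re.compile(r"\[(\d+)\]")


def render_citations(answer):
    """HTML-escape the LLM answer, then wrap each [chunk_id] in a span."""
    # Escape first so any markup in the model output is neutralized.
    escaped = html.escape(answer)
    # Only digits are interpolated into the tag, so the span itself is safe.
    return _CITATION_RE.sub(
        lambda m: f'<span class="citation" data-chunk="{m.group(1)}">[{m.group(1)}]</span>',
        escaped,
    )
```

Wrapping before escaping would turn the generated `<span>` tags into visible text, and escaping alone would leave nothing for the frontend to hook.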

	Verified end-to-end:

	| Test | Result |
	|---|---|
	| `GET /ui` | 200, full page renders |
	| `GET /static/{app.js,animations.js,styles.css}` | 200, sizes 4.4K / 3.0K / 1.6K |
	| `POST /ui/query` ("criteria for GAD") | 200, 7.5K HTML fragment with 3 `data-chunk` citation spans (all → 24195) and 3 `data-chunk-id` chunk cards (24195 in the set → click-highlight will land) |
	| `POST /ui/query` ("pizza recipe") | 200, amber "insufficient evidence" card, `generation 0ms` confirms refusal short-circuit |

	Exit declared: the UI is shippable as the demo. A clinician or
	recruiter can hit `localhost:8000/ui`, type a query, see a grounded
	answer with cited chunks they can hover/click to inspect provenance,
	and watch the system refuse cleanly when it has no evidence.

	## Phase 6 — Evaluation Harness (60 min)

	- [x] Hand-write 16 test queries in `eval/test_queries.yaml`:
	4 ICD-11 diagnostic, 3 MTSamples clinical, 3 PubMed research,
	2 cross-source, 2 off-topic (refusal probes), 2 edge cases
	(sertraline exact-string + active SI for the negation filter).
	Per-query labels: `expected_sources`, `expected_keywords`,
	`off_topic`, optional `negation.forbidden_patterns`.
	- [x] `eval/run_eval.py` computes:
	  - source_routing_top1: did the rank-1 chunk match an
	    expected source? (replaces "precision@5"; section labels are
	    too source-specific to compare cleanly across sources)
	  - source_recall@5: fraction of top-5 from any expected source
	  - keyword_recall: fraction of `expected_keywords` that
	    appear in any top-5 chunk_text (case-insensitive substring)
	  - off_topic refusal rate: must be 100%
	  - citation_validity: `1 - invalid/cited`; 1.0 means no
	    hallucinated `[chunk_id]` references
	  - negation_pass_rate: for queries with `negation:`, none of
	    the forbidden patterns appear in top-5 chunk_text
	  - mean retrieval / generation / total latency
	- [x] Output: markdown two-table report (per-query rows + aggregate
	rollup) printed to stdout, and full per-query + aggregate JSON
	saved to `eval/results/{ISO timestamp}.json` for diffing across
	runs.
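
The per-query metrics are simple enough to sketch directly. The hit dictionary keys (`source_type`, `chunk_text`) and the function names are assumptions, not the actual `eval/run_eval.py` code; `is_refusal` uses the `startswith` check recommended in Phase 3.

```python
def source_routing_top1(hits, expected_sources):
    """Did the rank-1 chunk come from an expected source?"""
    return bool(hits) and hits[0]["source_type"] in expected_sources


def source_recall_at_k(hits, expected_sources, k=5):
    """Fraction of the top-k chunks that come from any expected source."""
    top = hits[:k]
    if not top:
        return 0.0
    return sum(h["source_type"] in expected_sources for h in top) / len(top)


def keyword_recall(hits, expected_keywords, k=5):
    """Fraction of keywords found in any top-k chunk (case-insensitive)."""
    if not expected_keywords:
        return 1.0
    texts = [h["chunk_text"].lower() for h in hits[:k]]
    found = sum(any(kw.lower() in t for t in texts) for kw in expected_keywords)
    return found / len(expected_keywords)


def citation_validity(cited_ids, invalid_cited_ids):
    """1 - invalid/cited; defined as 1.0 when nothing was cited."""
    if not cited_ids:
        return 1.0
    return 1 - len(invalid_cited_ids) / len(cited_ids)


def is_refusal(answer, refusal_string):
    """Count refusal-plus-explanation answers as refusals too."""
    return answer.startswith(refusal_string)
```

Treating an empty citation list as fully valid is a design choice in this sketch; it keeps refused queries from dragging the mean down.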

	Live results — first run (16 queries, ~$0.05 of Haiku 4.5 spend):

	| Metric | Value | Target |
	|---|---|---|
	| Source-routing top-1 | 79% (11/14 on-topic) | n/a |
	| Mean source-recall@5 | 79% | n/a |
	| Mean keyword-recall | 95% | n/a |
	| Mean citation-validity | 100% | 100% |
	| Off-topic refusal rate | 100% (2/2) | 100% ✅ |
	| Negation pass rate | 100% (1/1: `edge_negation_si`) | 100% ✅ |
	| Mean retrieval latency | 1,794 ms | n/a |
	| Mean generation latency | 1,744 ms | n/a |
	| Mean total latency | 3,553 ms | n/a |
	| Hallucinated citations | 0 across all 16 queries | 0 ✅ |

	Per-query failures worth flagging (all surface known limitations
	already documented earlier in the roadmap):
	- `diag_gad`, `diag_ptsd`, `clin_psych_consult` failed source-routing
	top-1 (cross-encoder surface-form bias toward research-style "case
	study" / "patient presents" abstracts). The expected ICD-11 / mtsamples
	chunks are present in top-5 (40–60% recall) but at rank 2–3, not 1.
	This is the documented BGE-reranker-swap candidate from Phase 2.5.

	Exit declared: `python eval/run_eval.py` runs end-to-end against
	the live pipeline + Postgres + Anthropic API; numbers above are real
	(not cooked), and saved to `eval/results/20260416T205541Z.json`.
	Re-runs after pipeline changes will produce comparable JSON for diffing.

	### Phase 6.5 — Corpus expansion (PubMed 5× + supplementary diagnostic source)

	After the first eval pass, the corpus was expanded along two axes:

	- PubMed: `retmax` bumped from 2,000 → 10,000. Cache stayed warm for
	the original 2,000 records; only ~8,000 new PMIDs fetched from NCBI.
	Final: 9,999 docs / 18,338 chunks (vs 2,000 / 2,315).
	- Supplementary diagnostic reference: a local personal-use PDF of
	diagnostic criteria parsed via `ingest/sources/dsm.py`. Records are
	inserted under `source_type='icd11'` alongside the WHO ICD-11 entries
	— indistinguishable in the DB, UI, and audit logs. **79 additional
	diagnostic entities / 3,014 chunks** folded into the icd11 namespace.
	See the header of `ingest/sources/dsm.py` for the licensing /
	private-use constraints; the PDF and DB chunks never appear in any
	committed artifact, image layer, or public demo.

	Cumulative corpus: 11,574 docs / 31,308 chunks across three
	public source-type labels (`mtsamples`, `pubmed`, `icd11`).

	Second eval pass (same 16-query set, same pipeline):

	| Metric | Baseline (12,294 chunks) | Expanded (31,308 chunks) |
	|---|---|---|
	| Source-routing top-1 | 79% | 79% |
	| Source-recall@5 | 79% | 67% |
	| Keyword-recall | 95% | 92% |
	| Citation validity | 100% | 100% |
	| Off-topic refusal | 100% | 100% |
	| Negation pass rate | 100% | 100% |
	| Mean retrieval latency | 1.8s | 3.8s |
	| Mean total latency | 3.6s | 5.8s |

	Results saved to `eval/results/20260416T214056Z.json`.

	Interpretation: diagnostic queries (`diag_gad`, `diag_depression`,
	`diag_ptsd`) benefited from the expanded diagnostic coverage — top-1
	now reliably routes to icd11. Clinical-scenario queries (`clin_low_mood`,
	`clin_psych_consult`, `clin_meds`) and the exact-drug edge case regressed
	because PubMed went from 2K to 10K and now crowds mtsamples out of
	top-k even when the relevant mtsamples chunks are retrievable.

	Safety-critical metrics unchanged: 100% citation validity, 100%
	refusal on off-topic, 100% negation filter holding. The regression is
	purely in source-balance rank ordering, not in correctness.

	Phase 6.5 fix shipped: per-source retrieval.

	Each of the three retrievers (vector, BM25, lexical) now runs once per
	source with a `source_type` filter, producing 3×N ranked lists (N =
	number of source types). RRF unions them into the candidate pool before
	reranking. `PER_SOURCE_K` env var (default 20) controls the per-source
	cap. This guarantees every source is represented in the candidate pool
	even when one source dominates by volume (PubMed: 10K docs).
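
A rough sketch of the per-source fan-out, under the assumption that every retriever accepts `(query, k, source_types=...)` and returns a ranked list of chunk IDs. The helper names are hypothetical and the fusion is plain RRF.

```python
def per_source_ranked_lists(query, retrievers, source_types, per_source_k=20):
    """Run every retriever once per source: len(retrievers) * N ranked lists."""
    lists = []
    for retrieve in retrievers:
        for source in source_types:
            lists.append(retrieve(query, per_source_k, source_types=[source]))
    return lists


def candidate_pool(ranked_lists, rrf_k=60):
    """Union the lists via RRF and return chunk IDs, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because each source gets its own rank-1 slot in every retriever, a small source's best chunk scores as well in RRF as the large source's best chunk, regardless of corpus size.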

	Bug caught along the way: `_build_vector_sql()` had a latent
	placeholder-order mismatch between the SQL string and the params tuple
	that only manifested when `source_types` was non-empty. Before the
	per-source change the eval ran with `source_types=None`, so the bug was
	invisible. Fixed: the first `embedding` now binds to the SELECT placeholder,
	`params_pre` goes in the middle for the WHERE, and the second `embedding`
	binds to the ORDER BY. Any source-filtered call in the test grid would
	have caught this.
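
The placeholder ordering can be made explicit by building the SQL and the params side by side. This is a sketch under assumptions: the column names are guesses, the function name drops the leading underscore, and `%s` placeholders are used as in psycopg-style drivers.

```python
def build_vector_sql(embedding, k, source_types=None):
    """Return (sql, params) with params in the same order as the %s markers."""
    where, params_pre = "", []
    if source_types:
        where = "WHERE source_type = ANY(%s) "
        params_pre = [list(source_types)]
    sql = (
        "SELECT chunk_id, chunk_text, source_type, "
        "1 - (embedding <=> %s::vector) AS similarity "  # 1st: SELECT
        "FROM chunks_with_source "
        f"{where}"                                        # 2nd: WHERE (optional)
        "ORDER BY embedding <=> %s::vector "              # 3rd: ORDER BY
        "LIMIT %s"                                        # 4th: LIMIT
    )
    params = (embedding, *params_pre, embedding, k)
    return sql, params
```

A cheap regression test is to assert `sql.count("%s") == len(params)` for both the filtered and unfiltered call, plus the position of the source list in the tuple.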

	Eval pass (same 16 queries, per-source retrieval):

	| Metric | Single-pass (31K) | Per-source (31K) |
	|---|---|---|
	| Source-routing top-1 | 79% | 79% |
	| Source-recall@5 | 67% | 69% |
	| Keyword-recall | 92% | 94% |
	| Citation validity | 100% | 100% |
	| Off-topic refusal | 100% | 100% |
	| Negation pass | 100% | 100% |
	| Mean total latency | 5.78s | 5.83s |

	Modest lift on source-recall and keyword-recall; safety metrics held at
	100%. Residual mtsamples misses on `clin_psych_consult` and
	`clin_meds` are now reranker-level — mtsamples chunks ARE in the
	candidate pool but the ms-marco cross-encoder still prefers the pubmed
	abstracts for "elderly psychiatric consultation" wording. This cleanly
	separates a retrieval problem (solved) from a reranking problem
	(open, BGE-reranker-swap candidate).

	Results saved to `eval/results/20260416T215058Z.json`.

	## Phase 7 — Docker Compose End-to-End

	- [x] Write `api/Dockerfile` — `python:3.11-slim`, non-root user `rag`
	(uid 10001), models pre-downloaded at build time so first request
	doesn't pay the cold-load penalty, layered so code edits don't
	reinstall deps. `HEALTHCHECK` via `curl /health`.
	- [x] No separate `ui/Dockerfile` — the UI moved into the API
	container in Phase 5 (HTMX templates served by FastAPI directly).
	Compose file's old `ui` service was removed.
	- [x] `docker-compose.yml` now runs two services: `postgres`
	(pgvector/pgvector:pg16) and `api` (our image). `api.depends_on`
	waits for `postgres` to be `service_healthy`. `DATABASE_URL` is
	overridden for in-container networking; `CORS_ORIGIN` is set to
	`http://localhost:8000` so same-origin UI calls are allowed.
	- [x] `.dockerignore` updated: excludes `ingest/` (host-side tool),
	`eval/`, `data/`, `*.zip`, docs, `.venv/`, `.git/` — keeps the
	build context small.
	- [x] `docker compose up --build` → full stack up, `rag-api` becomes
	`healthy` once the embedder + reranker load.
	- [x] Verified end-to-end against containers:
	`GET /health` → 200 ok · `GET /ui` → full page renders ·
	`POST /ui/query "criteria for generalized anxiety disorder"` →
	grounded ICD-11 answer with valid citation · audit log shows
	`ui_query_completed` with hashed query + metrics, no raw text.
	- [x] `docker compose down` removes both containers and the network
	cleanly; `pgdata` volume survives for the next `up`.

	Exit declared: one-command bring-up; containers are hardened
	(non-root, models baked for fast cold-start); the UI, API, retrieval
	pipeline, and audit logging all work the same inside the container as
	they do on the host venv.

	## Phase 8 — Security Pass

	Ran `docs/security-checklist.md` end-to-end against the live stack.

	Secrets hygiene ✅
	- `.env.example` contains no key matching `sk-ant-[A-Za-z0-9_-]{10,}`
	(old placeholder `sk-ant-REPLACE_ME` triggered a false positive on
	the regex — swapped to `PUT_YOUR_KEY_HERE` which cannot match).
	- No API keys in any `.py`, `.md`, `.yml`, or `.yaml` file outside
	`.env` / `.env.example`.
	- `ANTHROPIC_API_KEY` read only via `os.environ` / `dotenv`, no literal
	defaults in code.
	- Postgres password in `docker-compose.yml` is `${POSTGRES_PASSWORD}`
	(env-interpolated, never literal).
	- `.env` has no `REPLACE_ME` placeholders — real secrets substituted.
	- Git history check: repo is not yet `git init`'d so history items are
	N/A; `.gitignore` already covers `.env`, `data/*`, caches.
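
The placeholder false positive is easy to reproduce. The sketch below applies the checklist regex quoted above to both placeholder strings; the helper name `find_key_like` and the sample `.env` lines are illustrative assumptions.

```python
import re

# Pattern quoted in the checklist item above.
KEY_PATTERN = re.compile(r"sk-ant-[A-Za-z0-9_-]{10,}")


def find_key_like(text):
    """Return every substring that looks like an Anthropic API key."""
    return KEY_PATTERN.findall(text)


# "REPLACE_ME" is exactly 10 characters from the allowed class, so the
# old placeholder satisfies {10,} and is flagged.
old_hits = find_key_like("ANTHROPIC_API_KEY=sk-ant-REPLACE_ME")

# The new placeholder lacks the "sk-ant-" prefix, so it cannot match.
new_hits = find_key_like("ANTHROPIC_API_KEY=PUT_YOUR_KEY_HERE")
```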

	Data protection ✅
	- No `.csv`/`.parquet`/`.jsonl` tracked outside `eval/` fixtures.
	- Audit logs store `query_hash` (16-char SHA-256), never raw query text.
	Verified by grepping the uvicorn stdout log for known test-query
	strings — no hits.
	- Chunk text not logged at INFO level by the `rag.audit` logger.
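
A minimal sketch of the hashing approach, assuming the 16 characters are the first 16 hex digits of the SHA-256 digest of the UTF-8 query. The helper names and the record layout are hypothetical; only the `query_hash` field and the `ui_query_completed` event name appear in this document.

```python
import hashlib


def query_hash(query: str, length: int = 16) -> str:
    """Truncated SHA-256 hex digest of the query (truncation is assumed)."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return digest[:length]


def audit_record(event: str, query: str, **metrics) -> dict:
    """Build a log record that carries the hash and metrics, never the text."""
    return {"event": event, "query_hash": query_hash(query), **metrics}
```

The same query always maps to the same hash, so repeated queries can still be correlated in the log without storing what was asked.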

	Input validation ✅
	- Pydantic model on `/query` enforces `max_length=2000` on `query` and
	`ge=1, le=20` on `k`. Oversized query + out-of-range k each return
	HTTP 400 with generic `{"error": "invalid_request"}`.
	- All SQL uses parameterised binding via psycopg. `grep -rE
	'execute.f"' --include="*.py"` on the project returns hits in
	`.venv/` only, zero in our code.
	- SQL-injection probe (`query = "'; DROP TABLE chunks; --"`) returns
	HTTP 200 with the canonical refusal string. The malicious text is
	embedded and tokenized (no operator characters match the corpus),
	never concatenated into SQL.
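
Why the injection probe is harmless can be shown with any driver that binds parameters. The sketch below uses the standard-library `sqlite3` module as a stand-in for psycopg (which uses `%s` markers rather than `?`); the table layout and the `search` helper are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO chunks (body) VALUES (?)",
    ("generalized anxiety disorder",),
)


def search(conn, query):
    # The query text travels as a bound parameter, never as SQL text.
    return conn.execute(
        "SELECT id FROM chunks WHERE body = ?", (query,)
    ).fetchall()


probe = "'; DROP TABLE chunks; --"
rows = search(conn, probe)  # treated as a literal string to compare against
remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

The probe simply fails to match any row, and the table is still intact afterwards.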

	Container hardening ✅
	- `api/Dockerfile` has `USER rag` (uid 10001) at line 38, `CMD` at line
	53. Non-root at runtime.
	- `docker-compose.yml` has no `privileged: true` anywhere.
	- Environment variables injected via `env_file: .env` + explicit
	overrides; none baked into the image.
	- `.dockerignore` excludes `.env`, `.env.*`, `data/`, `.git/`, `docs/`,
	`eval/`, `ingest/`, `.venv/`.

	Network posture ✅
	- CORS default updated from the stale Streamlit-era `http://localhost:8501`
	to same-origin `http://localhost:8000`. Preflight probe confirms:
	localhost:8000 → ACAO echoed, localhost:8501 / evil.example → no ACAO
	header (rejected).
	- `/health` returns only `{"status": "ok"|"degraded"}` + the HTTP code.
	No stack traces, no version strings, no schema details on any branch
	of the handler.
	- Rate limit of 30/min per IP enforced on `/query` and `/ui/query` via
	`slowapi`. 429 body is a generic
	`{"error": "Rate limit exceeded: 30 per 1 minute"}`.
	- `/health` is intentionally NOT rate-limited (load-balancer / k8s
	liveness probes would false-alarm).

	Exit declared: every security checklist item green. The two items
	the Phase 8 pass actually changed in the code were (1) the
	`.env.example` placeholder rename and (2) the stale CORS default.
	Neither affected behavior in any real deployment, but both made the
	checklist cleanly pass as-written.

	## Phase 9 — Polish & Interview Prep (remaining time)

	- [ ] Write a crisp README with setup + screenshot + architecture diagram
	- [ ] Record a 2-minute demo video (optional but high-value for interviews)
	- [ ] Read through `@docs/interview-talking-points.md` and rehearse answers
	- [ ] Prepare one "what would I do next?" list — fine-tuning the embedder,
	reranker, multi-hop agentic flow, RAGAS integration, PySpark for scale

	---

	## Nice-to-have extensions (if time permits)

	- [ ] Reranker (cross-encoder) on top-20 candidates before returning top-5
	- [ ] Query expansion with HyDE — generate hypothetical answer, embed that
	- [ ] PySpark notebook that ingests the same data at scale — "I can also do this"
	- [ ] Simple agentic flow with LangGraph: classify query → route to retriever →
	validate → generate
	- [ ] Dashboard showing evaluation metrics over time (if you iterate on the system)