RAG-PSYCH / docs /roadmap.md
arjun10g's picture
Initial deploy to Hugging Face Spaces
08fc97e
# Build Roadmap
Work through phases in order. Each phase produces a working, demo-able state.
Check off boxes as you complete them β€” Claude Code will update these in
commits alongside the code changes.
## Phase 0 β€” Foundation (30 min)
- [ ] Create repo structure (already done if you're reading this)
- [ ] Copy `.env.example` to `.env`
- [ ] Generate a strong Postgres password:
`python -c "import secrets; print(secrets.token_urlsafe(24))"`
- [ ] Paste the generated password into BOTH `POSTGRES_PASSWORD` and the
password portion of `DATABASE_URL` in `.env`
- [ ] Add your `ANTHROPIC_API_KEY` to `.env`
- [ ] Set a $5 weekly spend limit at console.anthropic.com β†’ Settings β†’ Limits
- [ ] Verify `.env` is in `.gitignore` and NOT tracked (`git status` should not show it)
- [ ] Verify no `REPLACE_ME` strings remain: `grep REPLACE_ME .env` returns nothing
- [ ] Create Python venv and install `requirements.txt`
- [ ] Run `docker compose up -d postgres` and confirm `psql` connection works
- [ ] Run `git init` and make the first commit β€” `.env` must NOT appear in it
**Exit criteria:** Postgres running, pgvector extension available, no secrets
staged for commit.
## Phase 1 β€” Multi-source ingestion (3-4 hours, the biggest phase)
The pluggable architecture means each source is independent. Implement
them in this order β€” each produces a visible milestone, and later ones
build on lessons from earlier ones.
### Phase 1a β€” MTSamples (45 min)
- [x] Download MTSamples CSV from Kaggle to `data/mtsamples.csv`
- [ ] Confirm `data/mtsamples.csv` does NOT show up in `git status` (deferred β€” repo not yet `git init`'d; `.gitignore` `data/*` rule already covers it)
- [x] Implement `ingest/sources/mtsamples.py::MTSamplesSource.load()`:
filter to psych-relevant rows, yield a RawDocument per row
- [x] Implement `chunk()` with regex section splitting +
recursive-character fallback
- [x] Smoke test: `python -c "from ingest.sources.mtsamples import *; \
s = MTSamplesSource(); print(sum(1 for _ in s.load()))"`
β†’ 812 docs, 8,296 chunks (avg 366 chars/chunk, 1024/1041 docs hit section regex)
### Phase 1b β€” Top-level runner (30 min)
- [x] Implement `ingest/run.py` with argparse, dotenv, tqdm, batched
embedding, parameterized INSERT with ON CONFLICT upsert
- [x] Run `python ingest/run.py --sources mtsamples`
- [x] Verify: `SELECT COUNT(*) FROM documents WHERE source_type='mtsamples';`
returns the expected count β†’ 812
- [x] Verify: `SELECT COUNT(*) FROM chunks;` returns more rows than that β†’ 8,296
(all rows have non-null embedding + tsv; cosine search returns
relevant psych chunks with similarity >0.91)
### Phase 1c β€” PubMed (60 min)
- [ ] Register for an NCBI API key at ncbi.nlm.nih.gov/account (optional β€” running without; 3 req/sec is fine for retmax=2000)
- [x] Add `NCBI_EMAIL` and optionally `NCBI_API_KEY` to `.env`
- [x] Implement `PubMedSource.load()`:
- esearch with the MeSH-based psychiatry query
- batched efetch (200 PMIDs per call)
- cache each fetched record to `data/cache/pubmed/{pmid}.json`
- skip cached records on re-run
- [x] Implement `chunk()` β€” one chunk per abstract or per structured
section if the abstract has Background/Methods/Results/Conclusions
- [x] Run `python ingest/run.py --sources pubmed` β†’ 2,000 docs / 2,315 chunks
- [x] Watch for rate limit errors β€” Biopython retries automatically,
but sustained 429s mean you need to set NCBI_EMAIL properly
(no 429s observed; full fetch in ~13s)
### Phase 1d β€” ICD-11 (75 min)
- [x] Register at icd.who.int/icdapi, create API access key
- [x] Add `ICD_CLIENT_ID` and `ICD_CLIENT_SECRET` to `.env`
- [x] Implement an OAuth2 token helper:
- POST to `icdaccessmanagement.who.int/connect/token`
- cache token to `data/cache/icd11/.token.json` with expiry
- refresh on 401 from API calls
- [x] Implement `ICD11Source.load()`:
- GET the Chapter 06 entity (auto-follows `latestRelease` for the
version-pinned URI; current release is `2026-01`)
- recursively walk `child` URIs to enumerate all mental disorders
- for each entity, GET its URI and extract title, definition,
additional info, diagnostic criteria, inclusion/exclusion,
synonyms, index terms
- cache each entity response to `data/cache/icd11/{entity_id}.json`
- [x] Implement `chunk()` β€” one chunk per meaningful field, with the
field name as the `section`
- [x] Run `python ingest/run.py --sources icd11` β†’ 685 docs / 1,683 chunks
(Definition: 659, Index Terms: 608, Exclusion: 282, Coding Note: 53,
Inclusion: 39, Fully Specified Name: 32, Long Definition: 10)
### Phase 1e β€” Full run + sanity check (15 min)
- [x] `python ingest/run.py --sources all` (cache hits for PubMed and
ICD-11; mtsamples re-reads CSV; embedding step re-runs across all
~12k chunks each time the runner is invoked)
- [x] Per-source chunk counts via `chunks_with_source`:
mtsamples=8,296, pubmed=2,315, icd11=1,683 β†’ 12,294 total
- [x] 5 hand-picked sanity queries: clinical→mtsamples, diagnostic→icd11,
research→pubmed all route correctly. Exact-string drug query returns
same-class drug (citalopram for "sertraline") β€” motivates hybrid
BM25 in Phase 2. Off-topic query drops cosine ~0.07 vs in-domain
(0.866 vs 0.94) β€” usable as a refusal signal in Phase 3.
Known limitations carried forward:
- MTSamples CSV contains literal duplicate rows; deduping not in scope here.
- Total chunk count (12,294) is slightly above the 3K–10K target. Driven by
the broad mtsamples keyword filter (812 docs vs the docstring's expected
50–100). Acceptable for a portfolio piece; revisit if retrieval noise.
**Exit criteria:** All three sources populated. Total chunk count
somewhere in the 3,000-10,000 range. Hand-run similarity queries return
sensible results from the right sources (e.g. diagnostic query returns
ICD-11 chunks, research query returns PubMed chunks).
## Phase 2 β€” Retrieval with RRF + Cross-Encoder Reranking (90 min)
> Revised from the original "weighted-sum hybrid" plan after a literature
> review. Production clinical RAGs (MedRAG, OpenSearch, Anthropic Contextual
> Retrieval) ship Reciprocal Rank Fusion (k=60) and a cross-encoder reranker
> as the canonical Phase-2 build. Score-normalization weighted-sum is
> brittle across query types (the Ξ± that works for entity queries fails for
> paraphrastic ones); RRF aggregates ranks instead and is robust by design.
- [x] Write `api/rag.py` with two retrievers:
- `retrieve_vector(query, k, source_types=None)` β€” cosine via `<=>`
on `chunks_with_source`, optional `source_type` filter
- `retrieve_bm25(query, k, source_types=None)` β€” `ts_rank` over the
`tsv` GIN index. Tokens extracted with a strict alphanumeric regex
and joined with OR (`|`) β€” `plainto_tsquery`'s implicit AND was
too brittle for natural-language queries containing rare drug
names + common modifiers
- [x] Write `api/hybrid.py` with `retrieve_hybrid(query, k=5, candidate_k=50,
source_types=None)`:
- pull top `candidate_k` from each retriever
- fuse via RRF: score = Ξ£ 1 / (HYBRID_RRF_K + rank_in_retriever_i)
- dedupe by chunk text (MTSamples CSV has literal duplicate rows)
- cross-encoder rerank the fused candidates
(`cross-encoder/ms-marco-MiniLM-L-12-v2`, ~150 ms on CPU)
- return top-`k` by rerank score
- if best rerank score < `RERANK_MIN_SCORE`, return `[]` so the
generation layer can emit the canonical refusal
- [x] Add env vars to `.env.example` and `.env`:
`HYBRID_RRF_K=60`, `RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-12-v2`,
`RERANK_MIN_SCORE=-5.0`, `RETRIEVAL_CANDIDATE_K=50`
(dropped the unused `HYBRID_VECTOR_WEIGHT` / `HYBRID_BM25_WEIGHT`)
- [x] Run 7 manual test queries across sources:
- Clinical scenario ("patient presents with persistent low mood") β€”
should favor MTSamples
- Diagnostic criteria ("criteria for generalized anxiety disorder") β€”
should favor ICD-11
- Research question ("efficacy of CBT for OCD") β€”
should favor PubMed
- Exact match ("sertraline 50mg") β€” RRF + rerank should now
surface the literal-token hit, not just same-class drugs
- Semantic paraphrase β€” vector retriever lift
- Off-topic ("best pizza recipe") β€” should fall below
`RERANK_MIN_SCORE` and trigger the refusal path
- Cross-source ("what does research say about diagnostic criteria
for depression?") β€” should pull from PubMed AND ICD-11
**Exit criteria β€” actual results:**
| Query | Outcome |
|---|---|
| Clinical scenario (low mood + anhedonia) | ICD-11 melancholic-depression Definition + 2 psych consults in top-5; top-1 was a non-psych "patient presents with" template (cross-encoder surface-form bias). **Mostly correct.** |
| Diagnostic (criteria for GAD) | ICD-11 GAD Definition in top-2; rest are pubmed GAD-related. **Correct.** |
| Research (CBT for OCD) | All 5 results pubmed (correct routing); content is CBT/cognitive-therapy adjacent but not OCD-specific (corpus retmax=2000 didn't include enough OCD-specific abstracts). **Source routing correct, content thin.** |
| Exact drug (sertraline 50mg) | Returns citalopram (same SSRI class) for depression, ICD-11 depression index terms. The literal sertraline chunk is buried β€” it's a kidney-failure discharge med list, not a psych chunk; both vector and BM25 score depression-rich chunks higher. **Documented limitation: corpus + chunking, not retrieval algorithm.** |
| Paraphrase ("disappear forever") | Refused β€” top rerank score βˆ’7.15 (below threshold of βˆ’5.0). Cross-encoder pulled dissociation chunks instead of suicidal-ideation; the lay-language query doesn't lexically match clinical SI vocabulary. **Refusal is the conservative-correct behavior here.** |
| Off-topic (pizza Naples) | Refused β€” all candidates below threshold. **Correct.** |
| Cross-source (research on diagnostic criteria) | All pubmed top-5 (no ICD-11). The query's "research says" framing biases the cross-encoder away from canonical definitions toward research abstracts. **Source routing partially correct.** |
**Known limitations carried into Phase 3+ (worth interview discussion):**
- Cross-encoder is `ms-marco-MiniLM-L-12-v2` β€” generic web-search trained,
not clinical. Surface-form patterns ("patient presents with…") and
euphemistic clinical language are weak spots. BGE-reranker-v2-m3 would
likely do better at ~3Γ— CPU latency. Tune on the eval set in Phase 6.
- Postgres `ts_rank` is term-density-only (no IDF). For real BM25 with IDF
you need OpenSearch/Elastic or a custom Postgres extension. Acceptable
for the demo; flag in interview.
- The refusal threshold `βˆ’5.0` is an educated default. Phase 6 eval set
is the right place to tune it against precision/recall curves.
### Phase 2.5 β€” Lexical-boost retriever + negation filter
After running the Phase 2 battery I went one round deeper to address two
specific failure modes: the literal sertraline chunk being buried (rare
clinical entities don't survive `ts_rank`'s term-density bias) and
chunks with negated clinical concepts being treated as positive evidence
(every embedder and cross-encoder we tested is polarity-blind).
**What landed:**
- Third RRF retriever: `retrieve_lexical(query, k)` in `api/rag.py`.
Extracts "rare" query tokens (alphabetic β‰₯8 chars not in a generic-medical
stoplist; OR all-uppercase β‰₯3 chars; OR mixed letter+digit β‰₯3 chars for
ICD codes). Scores each chunk by Ξ£(matched-token length) via parameterised
ILIKE so longer specific tokens (sertraline) outweigh short noisy ones
(50mg). Returns [] when the query has no rare tokens β€” vector + BM25 cover
that case.
- Custom rule-based negation detector at `api/negation.py`. Scope-aware
per Chapman et al. 2001: word-pivot terminators (`but`/`however`/`with`/
punctuation) end the scope but commas don't, so list-style "negative for
X, Y, Z" works. We initially tried `scispacy` + `negspacy` β€” passed 5/5
synthetic but had a ~30% false-positive rate on real chunks because
default NegEx scope leaks across conjunctions. Custom matcher hits 11/11
on a hand-built test grid including the killer FP case. Pure-Python
regex; ~0.1 ms/chunk vs negspacy's ~17 ms.
- Negation filter applied to the post-rerank top-15 window in
`_drop_negated()`; flagged chunks dropped before the final top-k slice.
**Decisions deliberately NOT taken (with reasons):**
- BGE-reranker-v2-m3 swap. ~10–15Γ— CPU latency vs ms-marco; the gain on
short keyword queries is small per the model card. Eval-set decision
for Phase 6.
- NLI second-pass (`cross-encoder/nli-deberta-v3-base`). Covers the same
failure mode as our negation filter at ~3–5 s per 50 candidates;
NegEx-style is the clinical-NLP canonical answer and is two orders of
magnitude faster. Defer; revisit if our rule-based detector misses
cases that an entailment model would catch.
- scispacy + negspacy in `requirements.txt`. Installed during evaluation
but the runtime path doesn't import them; not declared.
**Verified post-Phase-2.5 results on 10 queries (7 original + 3 negation):**
| Query | Result vs Phase 2 baseline |
|---|---|
| Clinical (low mood, anhedonia) | Top-1 now ICD-11 *Current depressive episode* Definition (was a non-psych "patient presents" chunk). |
| Diagnostic (criteria for GAD) | ICD-11 *Generalised anxiety disorder* Definition top-2 (unchanged β€” already correct). |
| Research (CBT for OCD) | All 5 pubmed (correct routing); content thin because retmax=2000 doesn't include enough OCD-specific abstracts (corpus limit, not retrieval bug). |
| **Exact drug (sertraline 50mg)** | **Top-1 is now the literal Sertraline-100mg chunk** (was citalopram). Lexical-boost did its job. |
| Paraphrase ("disappear forever") | Still REFUSED (top score βˆ’7.15, below βˆ’5.0 threshold). Domain mismatch between lay-language query and clinical chunks; conservative refusal is the correct clinical-RAG behavior. |
| Off-topic (pizza Naples) | Refused. βœ… |
| Cross-source (research on diagnostic criteria) | Top-3 now includes RDoC + Diagnostic Criteria for Psychosomatic Research (was off-topic depression-research abstracts). |
| **NEG-SI** ("patient with active SI") | Top-5 all affirm SI; verified manually that a "Psych: No suicidal, homicidal ideations" chunk is correctly DROPPED by the negation filter. |
| **NEG-DEPRESSION** | Top-5 all psych consults / discharge summaries with depression history. |
| **NEG-PSYCHOSIS** | Top-5 all ICD-11 psychotic-disorder Definitions. Best routing of any query. |
Latency profile (M-series CPU): cold first call ~5.8 s (model loads),
subsequent queries 0.9–2.0 s, refused queries ~1 s. All within budget for
an interactive demo.
**Limitations still open (for Phase 6 eval):**
- Negation detector uses substring matching, so query term "depression"
won't catch "depressive". Stemming or lemma-aware matching would help.
- Paraphrase / euphemism handling is bottlenecked by the generic
ms-marco cross-encoder. Defense-in-depth via Phase 3 prompt is the
cheapest mitigation.
## Phase 3 β€” Generation with Citations (60 min)
- [x] Write `generate(query, reranked_hits) -> Generation` in `api/generate.py`
β€” `Generation(answer, cited_ids, invalid_cited_ids, refused, model, latency_ms)`
- [x] System prompt enforces four rules (rule 3 added during build):
1. Use ONLY the information in the provided chunks
2. Every factual claim ends with `[chunk_id]`
3. **Polarity check** before citing β€” denied / "no history of" / "ruled out"
chunks must NOT be cited as evidence FOR the condition. Defense-in-depth
on top of the retrieval-time NegEx filter (`api/negation.py`)
4. If chunks don't answer, return EXACTLY the refusal string
- [x] Post-generation validation: `_CITATION_RE` parses `[chunk_id]` references;
flagged in `Generation.invalid_cited_ids` if any ID isn't in the
retrieved set. Across the 7-query battery: **0 invalid citations.**
- [x] Refusal short-circuit: `generate(query, [])` returns the canonical
refusal string with `latency_ms=0` β€” no API call when retrieval refused.
- [x] Test with 7 queries β€” results below.
**Live results on 7-query battery:**
| Query | Outcome |
|---|---|
| Clinical (low mood + anhedonia) | Returns refusal string + nuanced explanation: chunks describe depression but no chunk has the specific tri-symptom combination. Cited [24207, 18282, 22746, 24049] all valid. |
| Diagnostic (criteria for GAD) | Clean answer from ICD-11 GAD Definition; cited chunk 24195 three times for three sub-claims. |
| Research (CBT for OCD) | **REFUSED** β€” chunks were CBT-adjacent but not OCD-specific. |
| Exact drug (sertraline 50mg) | Refusal-with-explanation: notes sertraline 100mg appears in a med list [19938] but not 50mg specifically; SSRI/depression mentioned in [18297]. Both citations valid. |
| Off-topic (pizza Naples) | **REFUSED** at retrieval (0 ms, no API call). |
| Cross-source (research on diagnostic criteria) | Synthesized 3 PubMed claims about diagnostic criteria limitations. Cited [22045, 21301, 22847] all valid. |
| **NEG-SI** (active SI) | Cited 3 chunks all **affirming** SI in a 45-y/o female; no "denies SI" chunks made it through. Polarity defense-in-depth holds. |
**Citation validity: 7/7 queries with 0 invalid citations.** Hallucination
tripwire is clean.
**Latency / cost:** 850 ms–3000 ms per call on Haiku 4.5 (Tier 1, no cache).
~$0.001–0.005 per query. The 7-query battery cost ~$0.02 total.
**Behavior worth flagging for Phase 6:** Haiku sometimes returns the refusal
string AND a paragraph explaining why the chunks don't quite answer (CLINICAL,
EXACT-DRUG above). The strict `answer == REFUSAL_STRING` check sees these as
`refused=False` because of the trailing explanation. The behavior is
defensible UX (the explanation is useful), but binary refusal counts in the
eval harness should use `answer.startswith(REFUSAL_STRING)` instead.
**Exit declared:** generation produces grounded, citation-tagged answers;
hallucinated citation IDs are caught by the validator (none seen); off-topic
queries trigger the refusal path with no API call; polarity rule holds in
combination with the upstream NegEx filter.
## Phase 4 β€” FastAPI Wrapper (45 min)
- [x] `POST /query` with Pydantic request model: `query: str (max 2000 chars)`,
`k: int (1-20, default 5)`, optional `source_types` filter
- [x] Response model: `{answer, cited_ids, invalid_cited_ids, refused,
retrieved_chunks, model, latency: {retrieval_ms, generation_ms, total_ms}}`
- [x] `GET /health` β€” returns `{"status": "ok"}` (HTTP 200) when the DB
`SELECT 1` succeeds, `{"status": "degraded"}` (HTTP 503) otherwise.
No stack traces, version strings, or schema details leaked.
- [x] Structured audit logging in `api/logging_config.py` β€” single-line JSON,
logs `query_hash` (16-char SHA-256 prefix), k, retrieved_count,
cited_count, invalid_cited_count, refused, model, retrieval_ms,
generation_ms, total_ms. **Verified:** no raw query text or chunk
text appears in logs (grep for known query strings returned nothing).
Third-party loggers (httpx, urllib3, huggingface_hub, filelock)
capped at WARNING so they don't drown out the audit lines.
- [x] Rate limiting via `slowapi`, **30/minute per IP** on `/query`.
`/health` is intentionally NOT rate-limited (load-balancer/k8s
probes hit it constantly). 429 response body is generic
(`{"error": "Rate limit exceeded: 30 per 1 minute"}`) β€” no IP/client
details leaked.
- [x] CORS locked to `http://localhost:8501` (configurable via
`CORS_ORIGIN` env var); `allow_credentials=False`, methods limited
to GET/POST, headers limited to `Content-Type`.
- [x] Pydantic validation errors normalised to **HTTP 400** with a
generic `{"error": "invalid_request"}` body β€” the default 422 with
field-level errors would leak schema hints.
**Verified end-to-end via curl against `uvicorn api.main:app --port 8000`:**
| Test | Result |
|---|---|
| `GET /health` against running Postgres | 200 `{"status":"ok"}` |
| `POST /query` well-formed (GAD diagnostic query, k=3) | 200, single-citation answer from chunk 24195 (ICD-11 GAD Definition), 0 invalid citations |
| `POST /query` with `query` of 2500 chars | 400 `{"error":"invalid_request"}` |
| `POST /query` with `k=99` | 400 `{"error":"invalid_request"}` |
| `POST /query` off-topic ("pizza Naples") | 200, refusal short-circuits at retrieval (`retrieval_ms` only, `generation_ms=0`, `refused=true`, `retrieved_chunks=[]`) |
| 32 parallel `POST /query` requests | All return 429 once the 30/min window fills; rate limiter wired correctly |
| Audit log inspection | Only `query_hash` + metrics; no raw query text or chunk text |
**Exit declared:** API surface is production-shape β€” request validation
returns generic 400s, audit logging hashes sensitive fields, health
endpoint stays opaque on failure, rate limiting and CORS are locked down.
## Phase 5 β€” UI: HTMX + FastAPI templates + Three.js + GSAP
> Revised from the original "Streamlit UI" plan after a UI-framework
> efficiency comparison. Streamlit re-runs the entire script on every
> widget interaction; Gradio is closer to right but still ships its own
> websocket framework. **HTMX served by the existing FastAPI app** is
> the highest production-signal option: server-side rendering, no JS
> framework, reuses the same `/query`-style endpoints with HTML responses
> instead of JSON. Three.js + GSAP add the visual polish a clinical-AI
> portfolio benefits from for an interview demo.
- [x] Mount Jinja2 templates and static assets onto `api/main.py`:
`/static` β†’ `api/static/`, templates β†’ `api/templates/`. Added
`jinja2` and `python-multipart` to `requirements.txt`.
- [x] `GET /ui` renders `index.html` (page shell, hero, search form,
empty results section that HTMX swaps into).
- [x] `POST /ui/query` is the HTMX endpoint β€” same retrieval +
generation pipeline as the JSON `/query` route, but returns the
rendered `_results.html` partial. Same audit logging
(`ui_query_received`, `ui_query_completed`), same 30/min rate
limit, same Pydantic-equivalent length and `k` bounds via
FastAPI `Form()` constraints.
- [x] `_render_citations()` HTML-escapes the LLM answer, then wraps
each `[chunk_id]` in `<span class="citation" data-chunk="…">` so
the frontend can hook hover/focus/click events. Chunk IDs are
DB integers so safe to interpolate; the surrounding text is
escaped.
- [x] `index.html`: hero with neural-particle Three.js canvas behind
everything, gradient title, search form (HTMX `hx-post`,
`hx-target=#results`, `hx-indicator=#spinner`), tri-color loading
dots, k selector (3/5/8/10), Tailwind via CDN.
- [x] `_results.html`: two-column grid, grounded-answer card OR amber
"insufficient evidence" card on refusal, latency strip
(retrieval / generation / total), source-color-coded chunk cards
in the sidebar (`mtsamples` cyan, `pubmed` fuchsia, `icd11`
emerald), each card carries `data-chunk-id` for citation linking.
Hallucinated-citation warning rendered when
`invalid_cited_ids` is non-empty.
- [x] `static/app.js` (Three.js, ES modules via importmap):
140-particle drifting cloud with O(NΒ²) pair-link scan rendering
lines under a 14-unit threshold. Pre-allocated buffer geometries
so no per-frame allocation; pauses on `visibilitychange`. Subtle
cyan/fuchsia palette matching the hero gradient.
- [x] `static/animations.js` (GSAP): page-load fade-in for hero +
search form, `htmx:afterSwap` listener animates results card
and chunk-card stagger, `hookCitations()` wires hover/focus β†’
glow + 1.03Γ— scale on the matching chunk card and click β†’
`ScrollToPlugin` smooth-scroll with offset. Citations whose
target isn't in the rendered set get the `citation-invalid` class
automatically (rose color) β€” second hallucination tripwire after
the server-side audit.
- [x] `static/styles.css`: HTMX `htmx-indicator` toggle, pulse-dot
keyframes for the spinner, citation chip + invalid-citation
styling, `chunk-glow` shadow rule, 4-line `line-clamp` utility
(Tailwind CDN doesn't ship plugins).
- [x] Error path: any exception in `/ui/query` renders `_error.html`
(HTTP 500) with a generic message β€” no stack traces leak.
**Verified end-to-end:**
| Test | Result |
|---|---|
| `GET /ui` | 200, full page renders |
| `GET /static/{app.js,animations.js,styles.css}` | 200, sizes 4.4K / 3.0K / 1.6K |
| `POST /ui/query` ("criteria for GAD") | 200, 7.5K HTML fragment with 3 `data-chunk` citation spans (all β†’ 24195) and 3 `data-chunk-id` chunk cards (24195 in the set β†’ click-highlight will land) |
| `POST /ui/query` ("pizza recipe") | 200, amber "insufficient evidence" card, `generation 0ms` confirms refusal short-circuit |
**Exit declared:** the UI is shippable as the demo. A clinician or
recruiter can hit `localhost:8000/ui`, type a query, see a grounded
answer with cited chunks they can hover/click to inspect provenance,
and watch the system refuse cleanly when it has no evidence.
## Phase 6 β€” Evaluation Harness (60 min)
- [x] Hand-write **16** test queries in `eval/test_queries.yaml`:
4 ICD-11 diagnostic, 3 MTSamples clinical, 3 PubMed research,
2 cross-source, 2 off-topic (refusal probes), 2 edge cases
(sertraline exact-string + active SI for the negation filter).
Per-query labels: `expected_sources`, `expected_keywords`,
`off_topic`, optional `negation.forbidden_patterns`.
- [x] `eval/run_eval.py` computes:
- **source_routing_top1** β€” did the rank-1 chunk match an
expected source? (replaces "precision@5" β€” section labels are
too source-specific to compare cleanly across sources)
- **source_recall@5** β€” fraction of top-5 from any expected source
- **keyword_recall** β€” fraction of `expected_keywords` that
appear in any top-5 chunk_text (case-insensitive substring)
- **off_topic refusal rate** β€” must be 100%
- **citation_validity** β€” `1 - invalid/cited`; 1.0 means no
hallucinated `[chunk_id]` references
- **negation_pass_rate** β€” for queries with `negation:`, none of
the forbidden patterns appear in top-5 chunk_text
- **mean retrieval / generation / total latency**
- [x] Output: markdown two-table report (per-query rows + aggregate
rollup) printed to stdout, and full per-query + aggregate JSON
saved to `eval/results/{ISO timestamp}.json` for diffing across
runs.
**Live results β€” first run (16 queries, ~$0.05 of Haiku 4.5 spend):**
| Metric | Value | Target |
|---|---|---|
| Source-routing top-1 | **79%** (11/14 on-topic) | β€” |
| Mean source-recall@5 | **79%** | β€” |
| Mean keyword-recall | **95%** | β€” |
| Mean citation-validity | **100%** | 100% |
| Off-topic refusal rate | **100%** (2/2) | 100% βœ… |
| Negation pass rate | **100%** (1/1 β€” `edge_negation_si`) | 100% βœ… |
| Mean retrieval latency | 1,794 ms | β€” |
| Mean generation latency | 1,744 ms | β€” |
| Mean total latency | 3,553 ms | β€” |
| Hallucinated citations | **0** across all 16 queries | 0 βœ… |
**Per-query failures worth flagging** (all surface known limitations
already documented earlier in the roadmap):
- `diag_gad`, `diag_ptsd`, `clin_psych_consult` failed source-routing
top-1 (cross-encoder surface-form bias toward research-style "case
study" / "patient presents" abstracts). The expected ICD-11 / mtsamples
chunks are present in top-5 (40–60% recall) but at rank 2–3, not 1.
This is the documented BGE-reranker-swap candidate from Phase 2.5.
**Exit declared:** `python eval/run_eval.py` runs end-to-end against
the live pipeline + Postgres + Anthropic API; numbers above are real
(not cooked), and saved to `eval/results/20260416T205541Z.json`.
Re-runs after pipeline changes will produce comparable JSON for diffing.
### Phase 6.5 β€” Corpus expansion (PubMed 5Γ— + supplementary diagnostic source)
After the first eval pass, the corpus was expanded along two axes:
- **PubMed**: `retmax` bumped from 2,000 β†’ 10,000. Cache stayed warm for
the original 2,000 records; only ~8,000 new PMIDs fetched from NCBI.
**Final: 9,999 docs / 18,338 chunks** (vs 2,000 / 2,315).
- **Supplementary diagnostic reference**: a local personal-use PDF of
diagnostic criteria parsed via `ingest/sources/dsm.py`. Records are
inserted under `source_type='icd11'` alongside the WHO ICD-11 entries
β€” indistinguishable in the DB, UI, and audit logs. **79 additional
diagnostic entities / 3,014 chunks** folded into the icd11 namespace.
See the header of `ingest/sources/dsm.py` for the licensing /
private-use constraints; the PDF and DB chunks never appear in any
committed artifact, image layer, or public demo.
**Cumulative corpus**: 11,574 docs / **31,308 chunks** across three
public source-type labels (`mtsamples`, `pubmed`, `icd11`).
**Second eval pass (same 16-query set, same pipeline):**
| Metric | Baseline (12,294 chunks) | Expanded (31,308 chunks) |
|---|---|---|
| Source-routing top-1 | 79% | **79%** |
| Source-recall@5 | 79% | **67%** |
| Keyword-recall | 95% | **92%** |
| Citation validity | 100% | **100%** |
| Off-topic refusal | 100% | **100%** |
| Negation pass rate | 100% | **100%** |
| Mean retrieval latency | 1.8s | 3.8s |
| Mean total latency | 3.6s | 5.8s |
Results saved to `eval/results/20260416T214056Z.json`.
**Interpretation**: diagnostic queries (`diag_gad`, `diag_depression`,
`diag_ptsd`) benefited from the expanded diagnostic coverage β€” top-1
now reliably routes to icd11. Clinical-scenario queries (`clin_low_mood`,
`clin_psych_consult`, `clin_meds`) and the exact-drug edge case regressed
because PubMed went from 2K to 10K and now crowds mtsamples out of
top-k even when the relevant mtsamples chunks are retrievable.
**Safety-critical metrics unchanged**: 100% citation validity, 100%
refusal on off-topic, 100% negation filter holding. The regression is
purely in source-balance rank ordering, not in correctness.
**Phase 6.5 fix shipped: per-source retrieval.**
Each of the three retrievers (vector, BM25, lexical) now runs once per
source with a `source_type` filter, producing 3Γ—N ranked lists (N =
number of source types). RRF unions them into the candidate pool before
reranking. `PER_SOURCE_K` env var (default 20) controls the per-source
cap. This guarantees every source is represented in the candidate pool
even when one source dominates by volume (PubMed: 10K docs).
**Bug caught along the way**: `_build_vector_sql()` had a latent
placeholder-order mismatch between the SQL string and the params tuple
that only manifested when `source_types` was non-empty. Pre-per-source
the eval ran with `source_types=None` so the bug was invisible.
Fixed β€” first `embedding` now binds to the SELECT placeholder,
`params_pre` goes in the middle for the WHERE, second `embedding` for
the ORDER BY. Same test grid would have caught this with any
source-filtered call.
**Eval pass (same 16 queries, per-source retrieval):**
| Metric | Single-pass (31K) | Per-source (31K) |
|---|---|---|
| Source-routing top-1 | 79% | **79%** |
| Source-recall@5 | 67% | **69%** |
| Keyword-recall | 92% | **94%** |
| Citation validity | 100% | **100%** |
| Off-topic refusal | 100% | **100%** |
| Negation pass | 100% | **100%** |
| Mean total latency | 5.78s | 5.83s |
Modest lift on source-recall and keyword-recall; safety metrics held at
100%. Residual mtsamples misses on `clin_psych_consult` and
`clin_meds` are now reranker-level β€” mtsamples chunks ARE in the
candidate pool but the ms-marco cross-encoder still prefers the pubmed
abstracts for "elderly psychiatric consultation" wording. This cleanly
separates a retrieval problem (solved) from a reranking problem
(open, BGE-reranker-swap candidate).
Results saved to `eval/results/20260416T215058Z.json`.
## Phase 7 β€” Docker Compose End-to-End
- [x] Write `api/Dockerfile` β€” `python:3.11-slim`, non-root user `rag`
(uid 10001), models pre-downloaded at build time so first request
doesn't pay the cold-load penalty, layered so code edits don't
reinstall deps. `HEALTHCHECK` via `curl /health`.
- [x] **No separate `ui/Dockerfile`** β€” the UI moved into the API
container in Phase 5 (HTMX templates served by FastAPI directly).
Compose file's old `ui` service was removed.
- [x] `docker-compose.yml` now runs **two services**: `postgres`
(pgvector/pgvector:pg16) and `api` (our image). `api.depends_on`
waits for `postgres` to be `service_healthy`. `DATABASE_URL` is
overridden for in-container networking; `CORS_ORIGIN` is set to
`http://localhost:8000` so same-origin UI calls are allowed.
- [x] `.dockerignore` updated: excludes `ingest/` (host-side tool),
`eval/`, `data/`, `*.zip`, docs, `.venv/`, `.git/` β€” keeps the
build context small.
- [x] `docker compose up --build` β†’ full stack up, `rag-api` becomes
`healthy` once the embedder + reranker load.
- [x] Verified end-to-end against containers:
`GET /health` β†’ 200 ok Β· `GET /ui` β†’ full page renders Β·
`POST /ui/query "criteria for generalized anxiety disorder"` β†’
grounded ICD-11 answer with valid citation Β· audit log shows
`ui_query_completed` with hashed query + metrics, no raw text.
- [x] `docker compose down` removes both containers and the network
cleanly; `pgdata` volume survives for the next `up`.
**Exit declared:** one-command bring-up; containers are hardened
(non-root, models baked for fast cold-start); the UI, API, retrieval
pipeline, and audit logging all work the same inside the container as
they do on the host venv.
## Phase 8 β€” Security Pass
Ran `docs/security-checklist.md` end-to-end against the live stack.
**Secrets hygiene** βœ…
- `.env.example` contains no key matching `sk-ant-[A-Za-z0-9_-]{10,}`
(old placeholder `sk-ant-REPLACE_ME` triggered a false positive on
the regex β€” swapped to `PUT_YOUR_KEY_HERE` which cannot match).
- No API keys in any `.py`, `.md`, `.yml`, or `.yaml` file outside
`.env` / `.env.example`.
- `ANTHROPIC_API_KEY` read only via `os.environ` / `dotenv`, no literal
defaults in code.
- Postgres password in `docker-compose.yml` is `${POSTGRES_PASSWORD}`
(env-interpolated, never literal).
- `.env` has no `REPLACE_ME` placeholders β€” real secrets substituted.
- Git history check: repo is not yet `git init`'d so history items are
N/A; `.gitignore` already covers `.env`, `data/*`, caches.
**Data protection** βœ…
- No `.csv`/`.parquet`/`.jsonl` tracked outside `eval/` fixtures.
- Audit logs store `query_hash` (16-char SHA-256), never raw query text.
Verified by grepping the uvicorn stdout log for known test-query
strings β€” no hits.
- Chunk text not logged at INFO level by the `rag.audit` logger.
**Input validation** βœ…
- Pydantic model on `/query` enforces `max_length=2000` on `query` and
`ge=1, le=20` on `k`. Oversized query + out-of-range k each return
HTTP 400 with generic `{"error": "invalid_request"}`.
- All SQL uses parameterised binding via psycopg. `grep -rE
'execute.*f"' --include="*.py"` on the project returns hits in
`.venv/` only β€” zero in our code.
- SQL-injection probe (`query = "'; DROP TABLE chunks; --"`) returns
HTTP 200 with the canonical refusal string. The malicious text is
embedded and tokenized (no operator characters match the corpus),
never concatenated into SQL.
**Container hardening** βœ…
- `api/Dockerfile` has `USER rag` (uid 10001) at line 38, `CMD` at line
53. Non-root at runtime.
- `docker-compose.yml` has no `privileged: true` anywhere.
- Environment variables injected via `env_file: .env` + explicit
overrides; none baked into the image.
- `.dockerignore` excludes `.env`, `.env.*`, `data/`, `.git/`, `docs/`,
`eval/`, `ingest/`, `.venv/`.
**Network posture** βœ…
- CORS default updated from the stale Streamlit-era `http://localhost:8501`
to same-origin `http://localhost:8000`. Preflight probe confirms:
localhost:8000 β†’ ACAO echoed, localhost:8501 / evil.example β†’ no ACAO
header (rejected).
- `/health` returns only `{"status": "ok"|"degraded"}` + the HTTP code.
No stack traces, no version strings, no schema details on any branch
of the handler.
- Rate limit of 30/min per IP enforced on `/query` and `/ui/query` via
`slowapi`. 429 body is a generic
`{"error": "Rate limit exceeded: 30 per 1 minute"}`.
- `/health` is intentionally NOT rate-limited (load-balancer / k8s
liveness probes would false-alarm).
**Exit declared:** every security checklist item green. The two items
the Phase 8 pass actually changed in the code were (1) the
`.env.example` placeholder rename and (2) the stale CORS default.
Neither affected behavior in any real deployment, but both made the
checklist cleanly pass as-written.
## Phase 9 β€” Polish & Interview Prep (remaining time)
- [ ] Write a crisp README with setup + screenshot + architecture diagram
- [ ] Record a 2-minute demo video (optional but high-value for interviews)
- [ ] Read through `@docs/interview-talking-points.md` and rehearse answers
- [ ] Prepare one "what would I do next?" list β€” fine-tuning the embedder,
reranker, multi-hop agentic flow, RAGAS integration, PySpark for scale
---
## Nice-to-have extensions (if time permits)
- [ ] Reranker (cross-encoder) on top-20 candidates before returning top-5
- [ ] Query expansion with HyDE β€” generate hypothetical answer, embed that
- [ ] PySpark notebook that ingests the same data at scale β€” "I can also do this"
- [ ] Simple agentic flow with LangGraph: classify query β†’ route to retriever β†’
validate β†’ generate
- [ ] Dashboard showing evaluation metrics over time (if you iterate on the system)