
VariantLens: Lab-Grade Variant Interpretation Tool

Full Implementation Plan: Claude Code Build

Jordan Lerner-Ellis Lab · University of Toronto · April 2026


1. Design Philosophy

Core principle: Human-in-the-loop augmentation. The tool accelerates evidence gathering, applies ACMG criteria, and uses Claude to synthesize unstructured literature, but a trained curator makes every final classification decision. This matches the design of all three tools from the November 2025 CGLC session (AI CURA, EvAgg, AutoPM3) and is the safest path to clinical adoption.

Non-negotiables:

  • All patient data stays on-premise (no genomic data sent to cloud APIs without explicit opt-in)
  • Full evidence audit trail: every criterion is traceable to a source
  • Compatible with ACMG SVC v4.0 when finalized
  • Export to ClinVar, PDF, and HL7 FHIR

2. Selected Tools & Frameworks to Integrate

2.1 Existing tools to build ON TOP OF (don't reinvent)

| Tool | Role in VariantLens | Why |
|---|---|---|
| autoPVS1 | PVS1 criterion automation | Best-in-class null variant assessment; open-source Python; integrates with pyhgvs |
| InterVar | ACMG rule engine scaffold | Implements ~18 criteria; open-source; use as base then extend to all 28 |
| Mutalyzer | HGVS normalization | Industry standard; Python API available; solves the nomenclature inconsistency problem |
| PyHGVS | Secondary normalization | Lightweight Python library; good fallback |
| SpliceAI | Splice effect prediction | Pre-scored lookup tables available (avoid running the model per variant) |
| REVEL | Missense pathogenicity | Pre-computed for all possible human missense variants; load as SQLite |
| AlphaMissense | Missense pathogenicity | 2023 DeepMind model; scores for all ~71 million possible human missense variants; download as flat file |
| CADD | Combined annotation | Pre-scored tracks; REST API available |
| ChromaDB | Vector store for RAG | Local, embedded, no server needed; Python-native; HIPAA-friendly |
| sentence-transformers | Embeddings for RAG | all-MiniLM-L6-v2 for speed; BioLinkBERT for biomedical accuracy |

2.2 Data sources to connect

| Source | Data | Access method |
|---|---|---|
| gnomAD v4.1 | Population allele frequencies | GraphQL API + local SQLite for BA1/BS1/BS2/PM2 |
| ClinVar | Existing classifications | Entrez E-utilities + local VCF download (weekly sync) |
| OMIM | Gene-disease + inheritance | API (free for academic use) |
| ClinGen VCEPs | Expert panel rules | ClinGen Criteria Specification Registry (CSpec) |
| HGMD (lite) | Published variants | Public variant lists (full version if lab has license) |
| PubMed | Literature | E-utilities for abstract retrieval; full-text via PMC API |
| UniProt | Protein and functional domains | REST API for PM1 |

2.3 What NOT to rebuild

  • Do not implement your own in silico predictors; use pre-scored tables
  • Do not build your own variant normalizer; Mutalyzer handles this
  • Do not build your own vector database; ChromaDB is production-ready

3. System Architecture

┌──────────────────────────────────────────────────┐
│                  FRONTEND (React)                │
│   Variant input · HPO terms · Curator dashboard  │
└────────────────────┬─────────────────────────────┘
                     │ REST API
┌────────────────────▼─────────────────────────────┐
│               BACKEND (FastAPI / Python)         │
│                                                  │
│  ┌──────────────────────────────────────────┐    │
│  │  1. Normalization Layer                  │    │
│  │     Mutalyzer → canonical HGVS           │    │
│  └─────────────────┬────────────────────────┘    │
│                    │                             │
│  ┌─────────────────▼────────────────────────┐    │
│  │  2. Evidence Gathering (parallel async)  │    │
│  │                                          │    │
│  │  Databases:         RAG Pipeline:        │    │
│  │  • gnomAD           • PubMed fetch       │    │
│  │  • ClinVar          • ChromaDB query     │    │
│  │  • OMIM             • Relevant chunks    │    │
│  │  • REVEL/SpliceAI   • Context assembly   │    │
│  │  • AlphaMissense                         │    │
│  │  • autoPVS1                              │    │
│  └─────────────────┬────────────────────────┘    │
│                    │                             │
│  ┌─────────────────▼────────────────────────┐    │
│  │  3. ACMG Rule Engine                     │    │
│  │     InterVar base + custom extensions    │    │
│  │     28 criteria → weighted scores        │    │
│  └─────────────────┬────────────────────────┘    │
│                    │                             │
│  ┌─────────────────▼────────────────────────┐    │
│  │  4. Claude Reasoning Layer               │    │
│  │     RAG context + ACMG pre-scores        │    │
│  │     → literature evidence synthesis      │    │
│  │     → VUS reasoning + uncertainty flags  │    │
│  └─────────────────┬────────────────────────┘    │
│                    │                             │
│  ┌─────────────────▼────────────────────────┐    │
│  │  5. Classification Combiner              │    │
│  │     Table 5 (Richards 2015) logic        │    │
│  │     → provisional 5-tier + confidence    │    │
│  └─────────────────┬────────────────────────┘    │
│                    │                             │
│  ┌─────────────────▼────────────────────────┐    │
│  │  6. Output: audit trail + report draft   │    │
│  └──────────────────────────────────────────┘    │
│                                                  │
└────────────────────┬─────────────────────────────┘
                     │
┌────────────────────▼─────────────────────────────┐
│              CURATOR REVIEW UI                   │
│   Evidence table · Criterion override · Sign-off │
│   ClinVar export · PDF report · LIMS integration │
└──────────────────────────────────────────────────┘

4. RAG Pipeline Design (Hallucination Reduction)

This is the most critical architectural decision. The RAG system is what separates a reliable clinical tool from a hallucination-prone chatbot.

4.1 Why RAG works here

Instead of asking Claude to "recall" information about a variant from training data (which is stale and unverifiable), RAG:

  1. Retrieves the actual PubMed abstracts/PMC full-texts relevant to the variant
  2. Chunks and embeds them into a vector store
  3. At query time, retrieves only the most semantically relevant chunks
  4. Passes those chunks as explicit context to Claude
  5. Claude is instructed to reason ONLY over what's in the context window, which sharply reduces (but does not fully eliminate) the risk of unsupported claims

4.2 Index Construction

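# Pseudocode for the index build pipeline.
# fetch_pubmed, fetch_fulltext, fetch_abstract, sliding_window_chunk and
# detect_criteria_signals are project helpers still to be written.

# Step 1: Query PubMed for variant + gene
# The OR is parenthesized so the gene filter applies to both variant spellings.
pubmed_results = fetch_pubmed(
    query=f'"{gene_symbol}" AND ("{variant_hgvs}" OR "{protein_change}")',
    max_results=200
)

# Step 2: Fetch full text where available (PMC), otherwise the abstract
papers = [fetch_fulltext(pmid) or fetch_abstract(pmid)
          for pmid in pubmed_results]

# Step 3: Chunk with overlap (preserve context around variant mentions).
# Assumption: each chunk object carries its source (chunk.text, chunk.pmid, chunk.year),
# because one paper yields many chunks and the two lists cannot be zipped 1:1.
chunks = sliding_window_chunk(
    papers,
    chunk_size=512,      # tokens
    overlap=128,         # tokens
    anchor_keywords=[variant_hgvs, protein_change, gene_symbol]
)

# Step 4: Embed (BioLinkBERT for biomedical domain accuracy)
embeddings = model.encode([c.text for c in chunks])

# Step 5: Store with metadata (ChromaDB needs a unique id per document)
chroma_collection.add(
    ids=[f"{c.pmid}-{i}" for i, c in enumerate(chunks)],
    documents=[c.text for c in chunks],
    embeddings=embeddings,
    metadatas=[{
        "pmid": c.pmid,
        "year": c.year,
        "variant": variant_hgvs,
        "gene": gene_symbol,
        # Joined into one string so the metadata value stays a scalar (PM3, PP1, PS3 etc.)
        "criteria_hint": ",".join(detect_criteria_signals(c.text))
    } for c in chunks]
)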
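
The chunking in Step 3 can be illustrated with a short sketch. It assumes whitespace-separated words stand in for tokens; a real build would count tokens with the embedding model's tokenizer and would also handle the keyword anchoring, which is left out here. The function name `sliding_window_chunk_words` is hypothetical and not part of any library.

```python
from typing import List


def sliding_window_chunk_words(text: str, chunk_size: int = 512,
                               overlap: int = 128) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Assumption: words approximate tokens; swap in a real tokenizer for production.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=512` and `overlap=128`, consecutive chunks share 128 words, so a sentence that mentions the variant near a chunk boundary still appears with surrounding context in at least one chunk.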

4.3 Retrieval Strategy (Criterion-Aware)

Different ACMG criteria need different retrieval strategies:

| Criterion | Retrieval focus | Query augmentation |
|---|---|---|
| PM3 | in trans, compound het | "in trans" OR "compound heterozygous" OR "biallelic" |
| PP1 | co-segregation | "segregation" OR "affected family members" OR "co-segregates" |
| PS3/BS3 | functional studies | "functional" OR "in vitro" OR "in vivo" OR "assay" |
| PS4 | case-control prevalence | "cases" OR "prevalence" OR "odds ratio" |
| PP4 | phenotype specificity | "phenotype" OR "clinical features" OR "presentation" |
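
One way to turn the table above into queries is sketched below. The keyword lists are copied from the table; combining them with the gene and variant terms into a single boolean string is an assumption about how the retriever would use them, and `build_criterion_query` is a hypothetical helper name.

```python
from typing import Dict, List

# Keyword expansions copied from the criterion table.
CRITERION_TERMS: Dict[str, List[str]] = {
    "PM3": ["in trans", "compound heterozygous", "biallelic"],
    "PP1": ["segregation", "affected family members", "co-segregates"],
    "PS3/BS3": ["functional", "in vitro", "in vivo", "assay"],
    "PS4": ["cases", "prevalence", "odds ratio"],
    "PP4": ["phenotype", "clinical features", "presentation"],
}


def build_criterion_query(gene_symbol: str, variant_terms: List[str],
                          criterion: str) -> str:
    """Build a boolean query: gene AND (any variant spelling) AND (any criterion keyword)."""
    variant_clause = " OR ".join(f'"{t}"' for t in variant_terms)
    criterion_clause = " OR ".join(f'"{t}"' for t in CRITERION_TERMS[criterion])
    return f'"{gene_symbol}" AND ({variant_clause}) AND ({criterion_clause})'


query = build_criterion_query("TSC2", ["c.4639A>T", "p.Lys1547Ter"], "PM3")
```

Each clause is parenthesized so that the OR groups cannot leak across the AND boundaries.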

4.4 Context Assembly for Claude

# The context passed to Claude is structured, not raw text
context = {
    "variant": "NM_000548.5(TSC2):c.4639A>T (p.Lys1547Ter)",
    "gene": "TSC2",
    "disease": "Tuberous sclerosis complex",
    "acmg_preliminary": {
        "PVS1": {"triggered": True, "source": "autoPVS1", "note": "NMD predicted"},
        "PM2": {"triggered": True, "source": "gnomAD v4.1", "af": 0.000002},
        # ... other auto-scored criteria
    },
    "retrieved_literature": [
        {
            "pmid": "12345678",
            "chunk": "...five affected family members carried the p.Lys1547Ter variant...",
            "criteria_relevance": "PP1"
        },
        # top-k chunks
    ]
}

4.5 Claude Prompt Design (Hallucination-Suppressed)

SYSTEM_PROMPT = """
You are a clinical genetics variant curator assistant. Your role is to 
extract structured evidence from the provided literature context ONLY.

CRITICAL RULES:
1. Do NOT use any knowledge from your training data about this variant
2. Only cite evidence that appears verbatim in the provided context chunks
3. If the context does not contain sufficient evidence for a criterion, say "insufficient evidence in provided literature"
4. For each criterion you assess, cite the specific PMID and quote the relevant sentence
5. Output structured JSON only, no free text
6. Flag any ambiguous phasing, uncertain phenotype matches, or potential ascertainment bias
"""

USER_PROMPT = f"""
Variant: {variant.hgvs}
Gene/Disease: {variant.gene} / {disease}

PRE-SCORED CRITERIA (from databases; do not re-evaluate these):
{json.dumps(acmg_preliminary, indent=2)}

LITERATURE CONTEXT (evaluate PM3, PP1, PS3, PS4, PP4 from these only):
{format_chunks(retrieved_chunks)}

For each literature-dependent criterion, output:
{{
  "criterion": "PM3",
  "triggered": true/false,
  "strength": "supporting/moderate/strong",
  "evidence": "exact quote from context",
  "pmid": "12345678",
  "confidence": "high/medium/low",
  "caveat": "any ascertainment concerns"
}}
"""

5. ACMG Criteria Coverage Map

Automated (database-driven, no LLM needed)

| Criterion | Automation approach | Tool |
|---|---|---|
| PVS1 | Loss-of-function prediction + transcript check | autoPVS1 |
| BA1 | gnomAD AF > 5% | gnomAD API |
| BS1 | gnomAD AF > expected for disorder | gnomAD + disease incidence table |
| BS2 | Healthy homozygote/heterozygote in gnomAD | gnomAD |
| PM2 | Absent from gnomAD / very low AF | gnomAD API |
| PM4 | In-frame indel length + conservation | Custom rule |
| PM5 | Same aa position as known pathogenic missense | ClinVar lookup |
| PS1 | Same aa change as established pathogenic | ClinVar lookup |
| PP3 / BP4 | REVEL, SpliceAI, AlphaMissense, CADD | Pre-scored tables |
| BP1 | Missense in truncation-only gene | ClinGen curated gene list |
| BP3 | In-frame indel in repeat region | RepeatMasker annotation |
| BP7 | Synonymous + no splice prediction + non-conserved | SpliceAI + PhyloP |
| PP2 | Missense in low-benign-missense gene | ClinGen gene-level stats |

LLM-assisted (RAG + Claude)

| Criterion | Claude task |
|---|---|
| PM3 | Extract in trans observations from literature (AutoPM3 approach) |
| PP1 / BS4 | Count segregating/non-segregating family members |
| PS3 / BS3 | Identify and assess functional assay data |
| PS4 | Extract case counts and odds ratios |
| PP4 | Assess phenotype specificity match |
| PS2 / PM6 | Identify confirmed/assumed de novo reports |
| PP5 / BP6 | Check recent authoritative database submissions |

Requires curator input (cannot automate)

| Criterion | Why manual |
|---|---|
| PM1 | Requires domain expert judgment about "critical" functional domains |
| BP5 | Requires knowledge of the specific patient's alternative diagnosis |
| PM3 (phasing) | Parental testing results needed from clinician |

6. Tech Stack

Backend:    Python 3.12 · FastAPI · SQLAlchemy · Celery (async jobs)
Frontend:   React 18 · TypeScript · Tailwind CSS
Databases:  PostgreSQL (variants, audit trail) · SQLite (REVEL, gnomAD offline)
Vector DB:  ChromaDB (embedded, on-premise)
Embeddings: sentence-transformers (BioLinkBERT or all-MiniLM-L6-v2)
LLM:        Claude API (on-premise option: Ollama + open-source LLM as fallback)
Auth:       OAuth2 / LDAP (for hospital integration)
Containers: Docker + docker-compose (single-command deployment)
Tests:      pytest Β· hypothesis (property-based testing of ACMG logic)

7. Claude Code Implementation Plan

Use Claude Code (claude CLI) to build this in phases. Run from the project root.

Prerequisites

# Install Claude Code
npm install -g @anthropic-ai/claude-code

# Verify
claude --version

Phase 0: Project Setup (Day 1)

mkdir variantlens && cd variantlens

claude "Create a Python FastAPI project called VariantLens for clinical genomic 
variant interpretation. Set up:
- /backend: FastAPI app with routers for variants, evidence, classification
- /frontend: React + TypeScript + Tailwind project
- /data: SQLite databases for REVEL and gnomAD offline lookups
- docker-compose.yml with services: api, frontend, postgres, chroma
- pyproject.toml with dependencies: fastapi, sqlalchemy, chromadb, 
  sentence-transformers, anthropic, biopython, requests, httpx, celery
- .env.example with ANTHROPIC_API_KEY, NCBI_API_KEY, OMIM_API_KEY
Include a README with setup instructions."

Phase 1: Variant Normalization (Day 2–3)

claude "In /backend/app/services/normalization.py, implement a VariantNormalizer 
class that:
1. Accepts variants in HGVS, VCF, or protein notation
2. Uses the Mutalyzer REST API (https://mutalyzer.nl/api/v2/) for normalization
3. Falls back to PyHGVS for offline normalization
4. Returns: canonical HGVS (genomic + coding + protein), transcript, gene symbol
5. Handles batch normalization with rate limiting
6. Includes comprehensive unit tests with 20+ test variants including edge cases 
   (stop-loss, indels, splice variants, mitochondrial)
Use pydantic models for all inputs/outputs."

Phase 2: Database Integrations (Day 4–7)

# gnomAD integration
claude "In /backend/app/services/gnomad.py, implement a GnomADClient that:
1. Queries gnomAD v4.1 GraphQL API for variant allele frequencies
2. Returns AF by population (AFR, NFE, ASJ, EAS, SAS, AMR, FIN)
3. Implements local SQLite caching to avoid redundant API calls
4. Computes BA1 (>5% AF), BS1 (>expected), BS2 (healthy homozygotes), PM2 (<0.0001)
5. Handles missing data and low coverage warnings
Include the gnomAD GraphQL query template as a constant."
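
The frequency rules in item 4 of the prompt above can be sketched as a small pure function. The 5% and 0.0001 cut-offs come from this plan; the `max_expected_af` parameter, the rule that BS1 takes precedence over PM2, and the name `frequency_criteria` are assumptions made for illustration. BS2 is left out because it depends on inheritance mode and penetrance, not on allele frequency alone.

```python
from typing import List, Optional


def frequency_criteria(af: Optional[float],
                       max_expected_af: Optional[float] = None,
                       ba1_threshold: float = 0.05,
                       pm2_threshold: float = 0.0001) -> List[str]:
    """Return the frequency-based ACMG criteria triggered by one allele frequency.

    af is None when the variant is absent from gnomAD.
    max_expected_af is the highest AF compatible with the disorder (hypothetical input).
    """
    if af is None:
        return ["PM2"]   # absent from population data
    if af > ba1_threshold:
        return ["BA1"]   # stand-alone benign, nothing else to test
    if max_expected_af is not None and af > max_expected_af:
        return ["BS1"]   # more common than the disorder allows
    if af < pm2_threshold:
        return ["PM2"]   # very rare
    return []


print(frequency_criteria(0.000002))  # the TSC2 example AF from section 4.4
```

In practice the input would likely be the highest population-specific frequency, not the global one, so that a variant common in one ancestry group is not missed.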

# ClinVar integration  
claude "In /backend/app/services/clinvar.py, implement a ClinVarClient that:
1. Queries ClinVar via NCBI Entrez E-utilities for a given variant
2. Parses existing classifications and review status (star rating)
3. Extracts PS1 evidence (same aa change, different nucleotide)
4. Extracts PM5 evidence (same position, different pathogenic missense)  
5. Extracts PP5/BP6 evidence (recent reputable submissions)
6. Downloads weekly ClinVar VCF for local lookup (faster batch queries)
Use BioPython's Entrez module."
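
The PS1 and PM5 checks in items 3 and 4 reduce to comparing protein-level changes at one residue. The sketch below assumes the ClinVar lookup has already produced a list of (coding HGVS, protein HGVS) pairs for pathogenic missense variants on the same transcript; `parse_missense` and `ps1_pm5` are hypothetical helper names.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Three-letter protein notation, e.g. p.Arg34Cys
_PROTEIN_RE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_missense(protein_hgvs: str) -> Optional[Tuple[str, int, str]]:
    """Return (ref_aa, position, alt_aa) for a missense change, else None."""
    m = _PROTEIN_RE.match(protein_hgvs)
    if not m or m.group(3) == "Ter" or m.group(1) == m.group(3):
        return None  # nonsense, synonymous, or not simple three-letter notation
    return m.group(1), int(m.group(2)), m.group(3)


def ps1_pm5(query_c: str, query_p: str,
            known_pathogenic: Iterable[Tuple[str, str]]) -> List[str]:
    """Compare a query variant against known pathogenic missense variants."""
    query = parse_missense(query_p)
    if query is None:
        return []
    ref, pos, alt = query
    ps1 = pm5 = False
    for known_c, known_p in known_pathogenic:
        known = parse_missense(known_p)
        if known is None or known[0] != ref or known[1] != pos:
            continue  # different residue, not relevant
        if known[2] == alt and known_c != query_c:
            ps1 = True   # same amino acid change, different nucleotide change
        elif known[2] != alt:
            pm5 = True   # different missense change at the same residue
    return [name for name, hit in (("PS1", ps1), ("PM5", pm5)) if hit]
```

A curator would still need to confirm that the known variant's pathogenic classification is well established and that splicing effects are not the real mechanism.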

# In silico predictors
claude "In /backend/app/services/insilico.py, implement InSilicoPredictor that:
1. Loads REVEL scores from a local SQLite database (build script included)
2. Loads AlphaMissense scores from the downloaded TSV (2.5GB flat file)
3. Calls the SpliceAI lookup API (https://spliceailookup-api.broadinstitute.org)
4. Calls the CADD REST API (https://cadd.gs.washington.edu)
5. Aggregates concordant/discordant predictions for PP3/BP4
6. Follows the ACMG rule: concordant predictions = 1 piece of evidence (not additive)
Returns a structured InSilicoResult with per-tool scores and overall PP3/BP4 call."
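
The concordance rule in items 5 and 6 can be expressed without reference to any particular score threshold. The sketch below assumes each tool's raw score has already been mapped to a call of "damaging" or "benign" (or `None` when no score exists); how that mapping is calibrated is outside this example, and `insilico_call` is a hypothetical name.

```python
from typing import Dict, Optional


def insilico_call(tool_calls: Dict[str, Optional[str]]) -> Optional[str]:
    """Collapse per-tool calls into a single PP3/BP4 decision.

    Concordant predictions count as ONE piece of evidence, however many tools agree.
    """
    calls = {call for call in tool_calls.values() if call is not None}
    if calls == {"damaging"}:
        return "PP3"
    if calls == {"benign"}:
        return "BP4"
    return None  # discordant predictions, or no tool produced a score


result = insilico_call({"REVEL": "damaging", "AlphaMissense": "damaging", "CADD": None})
```

Returning a single label, never a count, is what keeps agreeing tools from being stacked as additive evidence.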

# autoPVS1 integration
claude "Integrate the autoPVS1 Python package into /backend/app/services/pvs1.py.
Create a PVS1Assessor wrapper that:
1. Takes a normalized HGVS variant
2. Runs autoPVS1 to classify PVS1 strength (very strong, strong, moderate, supporting, or not met)
3. Returns structured output with reasoning for the rule applied
4. Handles the 5 caveats from the ACMG guidelines (LOF mechanism, 3' end, splice variants, 
   multiple transcripts, alternatively spliced exons)
Include comprehensive tests for CFTR, MYH7, and BRCA1 known variants."

Phase 3: RAG Pipeline (Day 8–11)

claude "Build the RAG literature pipeline in /backend/app/services/rag/:

1. fetcher.py:
   - Query PubMed E-utilities with variant-aware search queries
   - Fetch full text from PMC where available, abstract otherwise
   - Build criterion-specific queries for PM3, PP1, PS3, PS4
   - Cache results to avoid re-fetching the same papers

2. chunker.py:
   - Sliding window chunker (512 tokens, 128 overlap)
   - Anchor chunks near variant mention sentences
   - Detect which ACMG criteria each chunk is relevant to (keyword heuristic)

3. embedder.py:
   - Use sentence-transformers BioLinkBERT for biomedical-domain embeddings
   - Batch embedding with progress tracking
   - Store to ChromaDB with full metadata (pmid, year, variant, gene, criteria_hint)

4. retriever.py:
   - Criterion-aware query construction (different for PM3 vs PP1 vs PS3)
   - Retrieve top-k chunks (k=8 per criterion)
   - Deduplicate across criteria
   - Return structured context for Claude

Each module must have typed interfaces (pydantic) and unit tests."

Phase 4: ACMG Rule Engine (Day 12–15)

claude "Build the ACMG rule engine in /backend/app/services/acmg/:

1. criteria.py: Pydantic models for each of the 28 criteria with:
   - triggered (bool)
   - strength (very_strong/strong/moderate/supporting/standalone)
   - source (database name or PMID)
   - evidence_text (quote or numeric value)
   - confidence (high/medium/low)
   - caveat (optional warning text)

2. rules.py: Implement all auto-scorable criteria:
   - PVS1 (from autoPVS1 result)
   - PS1, PM5 (from ClinVar)
   - BA1, BS1, BS2, PM2 (from gnomAD)
   - PP3/BP4 (from InSilico concordant predictions)
   - PM4/BP3 (in-frame indel outside a repeat region for PM4, inside one for BP3)
   - BP1 (gene-level truncation-only flag from ClinGen)
   - BP7 (synonymous + no splice impact + non-conserved)
   - PP2 (low benign missense gene)

3. combiner.py: Implement Table 5 from Richards 2015 exactly:
   - All combination rules for Pathogenic/Likely Pathogenic/Benign/Likely Benign
   - Returns provisional classification + list of triggered criteria
   - Flags conflicting evidence (pathogenic + benign criteria both present)
   - Exports to structured JSON for audit trail

4. validator.py: Unit tests using 50 known ClinVar variants 
   (10 P, 10 LP, 10 VUS, 10 LB, 10 B); verify the combiner matches the ClinVar
   classification at ≥85% concordance."
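
As a reference point for combiner.py, the combining rules can be written as a compact function over criterion counts. This is a sketch transcribed from Table 5 of Richards 2015 as commonly summarized; it should be checked line by line against the paper before any clinical use. The function name `classify` and its count-based signature are assumptions, and strength modifications (for example PVS1 downgraded to strong) are assumed to be resolved before counting.

```python
def classify(pvs: int = 0, ps: int = 0, pm: int = 0, pp: int = 0,
             ba: int = 0, bs: int = 0, bp: int = 0) -> str:
    """Combine counts of triggered criteria per strength into a 5-tier class."""
    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Pathogenic is tested before likely pathogenic, benign before likely benign.
    path_tier = ("Pathogenic" if pathogenic
                 else "Likely pathogenic" if likely_pathogenic else None)
    benign_tier = ("Benign" if benign
                   else "Likely benign" if likely_benign else None)
    if path_tier and benign_tier:
        return "Uncertain significance"  # contradictory evidence
    return path_tier or benign_tier or "Uncertain significance"


# PVS1 + PM2, as in the TSC2 example context from section 4.4
print(classify(pvs=1, pm=1))
```

Keeping the combiner a pure function of counts makes it straightforward to cover with the property-based tests planned for Phase 7.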

Phase 5: Claude Reasoning Layer (Day 16–18)

claude "Build the Claude reasoning layer in /backend/app/services/llm/:

1. prompts.py: Structured prompt templates for each literature-dependent criterion:
   - PM3 prompt (in trans extraction, inspired by AutoPM3)
   - PP1 prompt (segregation counting with anti-hallucination guards)
   - PS3 prompt (functional assay quality assessment)
   - PS4 prompt (case count and OR extraction)
   - PP4 prompt (phenotype specificity matching)
   
   Each prompt must:
   - Explicitly instruct Claude to only use provided context (no training recall)
   - Request JSON output with evidence quotes + PMIDs
   - Include uncertainty and caveat detection
   - Include examples of what hallucination looks like and how to avoid it

2. reasoner.py: LLM reasoning orchestrator that:
   - Takes pre-scored criteria + RAG context
   - Calls Claude for each literature-dependent criterion
   - Parses and validates JSON responses
   - Falls back gracefully if Claude response is malformed
   - Logs all LLM calls with input/output for audit

3. synthesizer.py: Final synthesis pass that:
   - Merges database-scored + LLM-scored criteria
   - Produces human-readable evidence summary (for curator)
   - Highlights conflicting or ambiguous evidence
   - Generates uncertainty flags for VUS cases

Use the Anthropic Python SDK. Model: claude-sonnet-4-6. max_tokens: 2000."
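
The "parses and validates JSON responses" step in reasoner.py is where the citation rules from the system prompt can be enforced in code instead of trusted. The sketch below assumes one JSON object per criterion in the shape shown in section 4.5, and context chunks shaped like the `retrieved_literature` entries in section 4.4; `validate_criterion_response` is a hypothetical name.

```python
import json
from typing import Dict, List


def validate_criterion_response(raw: str,
                                context_chunks: List[Dict[str, str]]) -> Dict:
    """Parse one criterion response and reject citations not backed by the context.

    Returns the parsed object with an added "valid" flag and a "problems" list.
    """
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid": False, "problems": ["response is not valid JSON"]}
    if not isinstance(result, dict):
        return {"valid": False, "problems": ["response is not a JSON object"]}

    # Group the chunks that were actually sent to the model by PMID.
    by_pmid: Dict[str, List[str]] = {}
    for c in context_chunks:
        by_pmid.setdefault(c["pmid"], []).append(c["chunk"])

    problems = []
    if result.get("triggered"):
        pmid = result.get("pmid")
        quote = (result.get("evidence") or "").strip()
        if pmid not in by_pmid:
            problems.append("cited PMID is not in the provided context")
        elif not quote or not any(quote in chunk for chunk in by_pmid[pmid]):
            problems.append("evidence quote not found verbatim in cited chunk")

    result["valid"] = not problems
    result["problems"] = problems
    return result
```

A response that fails either check can be logged and shown to the curator as unverified, so a fabricated PMID or paraphrased quote never silently triggers a criterion.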

Phase 6: Frontend (Day 19–22)

claude "Build the React frontend in /frontend/src/:

1. VariantInput component:
   - Text field for HGVS entry with live validation against Mutalyzer
   - VCF file upload (single variant or batch)
   - HPO term autocomplete (using HPO API)
   - Gene/disease context selector

2. EvidenceDashboard component (the main curator view):
   - Criteria table showing all 28 criteria with status (triggered/not triggered/pending)
   - Color coding: green (benign criteria), red (pathogenic), gray (not triggered)
   - Each row expandable to show evidence source, quote, and confidence
   - Override button per criterion with required free-text justification
   - Literature panel showing RAG-retrieved papers with relevant quotes highlighted

3. ClassificationPanel component:
   - Shows provisional 5-tier classification
   - Shows confidence and any conflicting evidence flags
   - Curator sign-off button with authentication
   - Classification history / previous submissions

4. ReportGenerator component:
   - Preview of clinical report in standard format
   - Export to PDF, ClinVar submission XML, HL7 FHIR R4
   
Use React Query for API calls, Zustand for state, Tailwind for styling."

Phase 7: Testing & Validation (Day 23–25)

claude "Create a comprehensive validation suite in /tests/:

1. test_known_variants.py:
   - Use 100 variants from ClinVar with expert panel review (3-star) or practice guideline (4-star) status
   - Assert classification concordance ≥ 85%
   - Assert all triggered criteria are traceable to a source
   - Assert no criterion is triggered without evidence

2. test_hallucination_guard.py:
   - Feed the LLM prompts with deliberately irrelevant or misleading literature (negative controls)
   - Assert Claude does not trigger PM3/PP1 when context contains no relevant evidence
   - Assert Claude cites only PMIDs present in the provided context

3. test_acmg_combiner.py:
   - Property-based tests using hypothesis
   - Test all combination rules from Table 5 of Richards 2015
   - Test edge cases: conflicting evidence, single criterion only

4. performance_benchmark.py:
   - Time per variant (target: < 30 seconds including RAG)
   - Batch throughput (target: 100 variants/hour)
   - Memory usage per worker"

8. Directory Structure

variantlens/
├── backend/
│   ├── app/
│   │   ├── api/                  # FastAPI routers
│   │   │   ├── variants.py       # POST /variants/classify
│   │   │   ├── evidence.py       # GET /variants/{id}/evidence
│   │   │   └── reports.py        # GET /variants/{id}/report
│   │   ├── services/
│   │   │   ├── normalization.py  # Mutalyzer wrapper
│   │   │   ├── gnomad.py         # gnomAD client
│   │   │   ├── clinvar.py        # ClinVar client
│   │   │   ├── insilico.py       # REVEL, SpliceAI, CADD, AlphaMissense
│   │   │   ├── pvs1.py           # autoPVS1 wrapper
│   │   │   ├── rag/
│   │   │   │   ├── fetcher.py    # PubMed fetch
│   │   │   │   ├── chunker.py    # Text chunking
│   │   │   │   ├── embedder.py   # sentence-transformers
│   │   │   │   └── retriever.py  # ChromaDB query
│   │   │   ├── acmg/
│   │   │   │   ├── criteria.py   # Pydantic models
│   │   │   │   ├── rules.py      # 28 criteria automation
│   │   │   │   └── combiner.py   # Table 5 logic
│   │   │   └── llm/
│   │   │       ├── prompts.py    # Criterion-specific prompts
│   │   │       ├── reasoner.py   # Claude API calls
│   │   │       └── synthesizer.py
│   │   └── models/               # SQLAlchemy DB models
│   └── tests/
├── frontend/
│   └── src/
│       ├── components/
│       │   ├── VariantInput.tsx
│       │   ├── EvidenceDashboard.tsx
│       │   ├── ClassificationPanel.tsx
│       │   └── ReportGenerator.tsx
│       └── hooks/
├── data/
│   ├── revel_scores.db           # SQLite: pre-scored missense positions
│   ├── alphamissense.tsv.gz      # Downloaded AlphaMissense flat file
│   └── gnomad_cache.db           # Local AF cache
├── docker-compose.yml
├── .env.example
└── README.md

9. Privacy & Deployment

On-premise deployment (recommended for clinical data)

# docker-compose.yml excerpt
services:
  api:
    build: ./backend
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - USE_LOCAL_LLM=false  # set true + configure Ollama for air-gap
    volumes:
      - ./data:/app/data     # all patient data stays local
  
  chroma:
    image: chromadb/chroma    # runs locally, no cloud
    volumes:
      - chroma_data:/chroma

# Named volumes must be declared at the top level
volumes:
  chroma_data:

Air-gapped option

If patient data cannot touch external APIs even for Claude:

  • Replace Claude with a locally-hosted open-source LLM via Ollama
  • Recommended model: mistral-nemo or qwen2.5 (strong instruction following)
  • Performance will be lower than Claude but maintains privacy
  • Toggle via USE_LOCAL_LLM=true in .env
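
The toggle can be resolved in one place at startup so the rest of the backend never reads the environment directly. This is a minimal sketch: the variable names `USE_LOCAL_LLM` and `ANTHROPIC_API_KEY` come from this plan, while `select_llm_backend` and the returned labels are hypothetical.

```python
import os
from typing import Mapping, Optional


def select_llm_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "ollama" for air-gapped mode, otherwise "claude".

    Fails fast if the Claude path is selected without an API key.
    """
    env = os.environ if env is None else env
    flag = env.get("USE_LOCAL_LLM", "false").strip().lower()
    if flag in {"1", "true", "yes"}:
        return "ollama"
    if not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError(
            "ANTHROPIC_API_KEY is required when USE_LOCAL_LLM is not true")
    return "claude"
```

Passing the environment as a parameter keeps the function easy to test and makes the air-gapped path the one that needs no external credentials.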

Keys you need

| Key | Source | Free? |
|---|---|---|
| ANTHROPIC_API_KEY | console.anthropic.com | Pay per token |
| NCBI_API_KEY | ncbi.nlm.nih.gov/account | Free |
| OMIM_API_KEY | omim.org/api | Free for academic |
| gnomAD | No key needed (GraphQL API) | Free |

10. Development Timeline

| Phase | Duration | Milestone |
|---|---|---|
| 0: Setup | Day 1 | Project scaffolded, Docker running |
| 1: Normalization | Day 2–3 | Mutalyzer integration + tests passing |
| 2: Databases | Day 4–7 | gnomAD, ClinVar, REVEL, SpliceAI, autoPVS1 integrated |
| 3: RAG | Day 8–11 | Literature retrieval + ChromaDB indexing working |
| 4: ACMG engine | Day 12–15 | All auto-scorable criteria + combiner; ≥85% concordance |
| 5: LLM layer | Day 16–18 | Claude synthesizing PM3/PP1/PS3 from RAG context |
| 6: Frontend | Day 19–22 | Full curator dashboard; report export |
| 7: Validation | Day 23–25 | 100-variant benchmark suite passing |

Total: ~5 weeks of focused development using Claude Code throughout


11. Key Design Decisions Summary

| Decision | Choice | Rationale |
|---|---|---|
| LLM for literature only | Claude handles PM3, PP1, PS3, PS4, PP4, not DB criteria | Reduces hallucination surface area; DB facts never go through LLM |
| RAG over training-data recall | ChromaDB + BioLinkBERT embeddings | Grounds Claude in actual retrieved text; avoids training-data staleness |
| Prompt includes only context | System prompt explicitly forbids using training recall | Mirrors AI CURA's anti-hallucination strategy that achieved 96% concordance |
| autoPVS1 for PVS1 | Don't reinvent PVS1 logic | autoPVS1 has been validated extensively; reuse it |
| InterVar as ACMG scaffold | Build on existing 18-criteria implementation | Extend rather than rewrite; saves ~2 weeks |
| Human-in-the-loop always | Curator must review + sign off every classification | Matches ACMG guidance; required for clinical lab accreditation |
| On-premise ChromaDB | No patient data leaves the network | HIPAA/PHIPA compliance |
| JSON-only LLM output | All Claude responses are structured JSON | Enables reliable parsing + audit trail |