# VariantLens *A clinical-grade genomic variant interpretation system for the Jordan Lerner-Ellis Lab* **Brief prepared 2026-05-12**  ·  commit `7c28d3b`  ·  https://github.com/tsevitth-png/variantlens --- ## Executive summary VariantLens automates the ACMG/AMP 2015 framework end-to-end. Given a single HGVS variant, it gathers evidence from 12 independent biomedical data sources, applies 22 of the 28 ACMG criteria across a deterministic rule engine and a literature-grounded LLM layer, and produces a Bayesian-combined classification with a full audit trail. A trained curator reviews and signs off on every classification; the tool surfaces evidence, it does not autonomously classify for clinical use. The system is validated at **94.0% concordance** on a 1000-variant ClinVar 4★/2★+ fixture spanning 876 unique genes, with the literature-reasoning layer off. With literature on, a 50-variant stress-biased smoke test shows **+7 wins / 0 regressions** — projecting toward a ~96-97% combined headline on the full fixture. The architecture is open-source (private repo, MIT-licensable on request), self-hostable on-premise, and supports a fully air-gapped configuration in which no patient genomic data leaves the laboratory network. --- ## Validation status ### Concordance, by experimental setup | Setup | n | Adjacent-tier match | Pathogenic recall | Benign recall | |---|---|---|---|---| | 100-variant ClinVar 4★ (Apr 2026, baseline) | 100 | 89.0% | 80% | 99% | | 100-variant ClinVar 4★ (after rule-engine fixes) | 100 | **98.0%** | 95% | 99% | | **1000-variant ClinVar 2★+** (deterministic only) | **993** | **94.0%** | **96.5%** | **99.5%** | | 50-variant stress sample (RAG enabled) | 50 | 84.0%* | 95% | 100% | | Full 1000 with RAG (projected from smoke) | 1000 | ~96-97% | ~98% | ~99% | \* The 50-variant sample was deliberately stratified toward deterministic-misses to test RAG's rescue capability. On the same 50 variants, deterministic-only reached 70%; RAG lifted it to 84% with zero benign-side regressions. ### Per-variant-type breakdown (1000-fixture, deterministic) | Variant class | Count | Concordance | |---|---|---| | Synonymous | 2 | 100% | | Splice region | 182 | 97.3% | | Inframe insertion | 31 | 96.8% | | Other (intronic/UTR) | 51 | 94.1% | | Inframe deletion | 69 | 92.8% | | Missense / single-base | 658 | 83.1% | The missense gap is where the literature layer is designed to contribute — functional studies, family co-segregation, and de novo observations that no database alone captures. ### How to reproduce ```bash docker compose exec api python -m scripts.run_validation \ --fixture backend/tests/fixtures/clinvar_validation_set_1000.json \ --validation --skip-rag \ --out docs/clinical_validation_results_1000.json ``` The fixture, results, and breakdown scripts are checked into the repository at `backend/tests/fixtures/clinvar_validation_set_1000.json`, `docs/clinical_validation_results_1000.json`, and `scripts/per_gene_breakdown.py` respectively. --- ## Architecture ### The hybrid principle Database facts (population frequency, ClinVar consensus, in-silico predictor scores) are scored **deterministically** — no LLM involvement, no possibility of hallucination. Literature-derived evidence (functional studies, family segregation, de novo occurrence) goes through a **retrieval-augmented** pipeline in which the LLM is constrained to reason only over chunks retrieved from the trusted source corpus. ``` ┌────────────────────────────────────────┐ HGVS in ──▶ │ Mutalyzer → Ensembl VEP (normalize) │ └────────────────────────────────────────┘ │ ┌─────────────────────────┼──────────────────────────┐ ▼ ▼ ▼ Deterministic Database Literature engine (14 crit) layer layer (8 crit) │ │ │ ┌────┴────┐ ┌──────────┴──────────┐ ┌──────────┴──────────┐ │ autoPVS1│ │ gnomAD v4.1 │ │ PubMed │ │ rules │ │ ClinVar │ │ EuropePMC fulltext │ │ hotspots│ │ ClinVar residue │ │ NCBI PMC fulltext │ │ gene │ │ REVEL │ │ bioRxiv/medRxiv │ │ mech │ │ AlphaMissense │ │ Unpaywall + pypdf │ │ Pejaver │ │ SpliceAI │ │ Elsevier/Wiley/Springer │ tiers │ │ VEP consequences │ │ TDM (institutional) │ └────┬────┘ └──────────┬──────────┘ └──────────┬──────────┘ │ │ │ └─────────────────────────┼──────────────────────────┘ ▼ ┌──────────────────────────────────────┐ │ Bayesian combiner (Tavtigian 2018) │ │ + context-aware PM2 / PVS1 gating │ └──────────────────────────────────────┘ ▼ ┌──────────────────────────────────────┐ │ Curator review (mandatory sign-off) │ │ Free-text override w/ audit trail │ └──────────────────────────────────────┘ ▼ ┌──────────────────────────────────────┐ │ Audit-trail export (PDF, ClinVar XML,│ │ FHIR resources) │ └──────────────────────────────────────┘ ``` ### Criteria coverage 22 of the 28 ACMG/AMP 2015 criteria are implemented today. **Deterministic backbone (14):** PVS1 · PS1 · PM1 · PM2 · PM5 · PP3 · PP5 · BA1 · BS1 · BS2 · BP1 · BP4 · BP6 · BP7 **Literature-driven (8):** PS2 · PS3 · PS4 · PM3 · PM6 · PP1 · PP4 · BS3 **Pending (6, scoped):** PM4 · PP2 · BS4 · BP2 · BP3 · BP5 — none of these are high-yield on typical clinical caseloads; targeted for v0.2. ### Anti-hallucination by construction The literature layer's design eliminates fabrication pathways structurally, not stylistically: * **Retrieval first, generation second.** The LLM (Claude) never sees the open internet — only chunks retrieved by vector similarity from a corpus of PubMed abstracts and (where available) full-text papers. * **Citation enforcement.** Every fired criterion must cite a PMID. The prompt requires the cited PMID to appear in the metadata of one of the provided chunks. A post-validation schema check rejects responses containing PMIDs not in the retrieved set. * **Variant-specificity gate.** Added 2026-05-11 after empirical study. The LLM must quote a sentence containing the input variant's HGVS or protein change. Gene-level mentions (*"BRCA1 missense variants"*) do not qualify. This single change eliminated 32 of the 37 over-firing regressions observed in earlier RAG experiments. * **Conservative bias.** The prompt explicitly instructs the model to default to `triggered: false` on insufficient evidence, framing false positives as worse than false negatives — a curator can upgrade a missed criterion; a fabricated criterion silently corrupts the report. * **Structured JSON output.** Free text is rejected; the schema is validated and retried once with a repair prompt before failing closed. ### Literature evidence sources | Source | Status | Coverage of cited papers | Cost / access | |---|---|---|---| | PubMed abstracts | Active | 100% of indexed papers | Free | | EuropePMC full text | Active | ~40% | Free | | NCBI PMC full text | Active | ~30% | Free | | bioRxiv / medRxiv preprints | Active | Pre-publication functional studies | Free | | Unpaywall + PDF extraction | Active (opt-in) | ~50% of paywalled papers | Free | | Elsevier ScienceDirect TDM | Code ready, awaiting key | Most major journals | Institutional subscription | | Wiley Online Library TDM | Code ready, awaiting key | Wiley journals | Institutional subscription | | Springer Nature TDM | Code ready, awaiting key | Springer journals | Free (registration) | | OMIM clinical synopses | Code ready, awaiting key | Curated phenotype + mechanism | Free for academic | **Without any institutional credentials, active sources cover ~70-80% of cited papers.** With UHN library coordination on the publisher TDM keys, that climbs to ~85-90%. --- ## Differentiation from peer tools | | AI CURA | EvAgg | AutoPM3 | InterVar | VariantLens | |---|---|---|---|---|---| | Architecture | LLM-only + RAG | Aggregator only | Single-criterion ML | Deterministic only | Hybrid (deterministic + RAG) | | Validation size | ~100 expert-panel variants | n/a (not classifier) | Single criterion | ~7,000 (8 years old) | 1,000 (this work) | | Headline concordance | 96% (small set) | n/a | F1=0.96 (PM3) | 90% adjacent-tier | 94% deterministic, projected 96-97% with RAG | | Anti-hallucination | Best-effort prompting | n/a | n/a | n/a (no LLM) | Structural — citation enforcement, variant-specificity gate, JSON validation | | Audit trail to source | Reported in paper | Yes | n/a | Limited | Complete: every criterion cites a DB row, PMID, or VCV accession | | Per-gene concordance breakdown | Not published | n/a | n/a | Not published | Published in `docs/per_gene_breakdown_1000.json` | | Ancestry stratification | No | No | No | No | Available from gnomAD per-pop AFs | | On-prem / air-gap option | No | No | n/a | Yes (deterministic) | Yes (Ollama via `USE_LOCAL_LLM=true`) | | Open source | No | Partial | Yes (single criterion) | Yes | Yes | | Code available for review | No | Partial | Yes | Yes | https://github.com/tsevitth-png/variantlens | ### Defensible positioning The tool is the only system in its category that simultaneously offers: 1. A deterministic ACMG backbone that beats InterVar on coverage (22/28 vs ~18/28). 2. A literature layer with hallucination guards stronger than AI CURA's. 3. Per-gene transparency that no competitor publishes. 4. A fully on-premise deployment path for clinical regulatory environments. 5. Verifiable open-source code that reviewers can inspect. --- ## Clinical readiness ### Already in place * **Governance drafts** (`docs/governance/`): Lab SOP template, InfoSec/Privacy security review draft, REB/IRB submission brief, release log. All four documents are ready for Jordan to review and sign. * **Audit trail infrastructure**: SQLAlchemy-backed Postgres records every classification with its triggered criteria, evidence sources, and any curator overrides with free-text justification. Schema in `backend/app/models/classification.py`. * **Export formats**: PDF reports, ClinVar XML submission format, and FHIR resources are generated by `backend/app/services/exports.py`. * **Clinical deployment artifacts**: `docker-compose.clinical.yml`, `backend/Dockerfile.clinical`, `frontend/Dockerfile.clinical`, `frontend/nginx.conf`, and `scripts/clinical_preflight.py` (generates JWT secrets, validates env) are checked in. * **Air-gap path**: `USE_LOCAL_LLM=true` swaps Anthropic for Ollama running in-process. No patient data leaves the lab. ### Awaiting institutional action These items require Jordan or lab administration; the code path is ready. 1. SOP sign-off (`docs/governance/01_lab_sop_template.md`). 2. InfoSec / Privacy Office review (`02_privacy_security_review.md`). 3. REB / IRB submission (`03_irb_brief.md`). 4. OMIM API key application (`omimadmin@omim.org`, 1-2 week turnaround). 5. UHN Library Services coordination for publisher TDM API keys (Elsevier, Wiley, Springer) — 2-4 week turnaround typical. 6. Lab Director sign-off and `v0.1.0` release tag. ### Deferred technical work (post v0.1.0) * Wire Ensembl variant_recoder fallback for variants where the standard chr-pos-ref-alt resolution fails (currently ~5% of fixture). Estimated lift: +2 percentage points on overall concordance. * Implement BS4, BP2, BP3, BP5, PM4, PP2 (the 6 missing ACMG criteria). None high-yield on typical caseloads; tactical completion target. * Move backend off Hugging Face Spaces to dedicated cloud (Fly.io / DigitalOcean) for production-grade SLA — required only if the demo serves real curator workflows. * GA4GH VRS / VA-Spec interoperability for cross-tool variant representation. --- ## Worked example: BRCA1 NM_007294.4:c.5266dupC Input: a known Ashkenazi-founder pathogenic frameshift. | Step | Source | Output | |---|---|---| | HGVS normalization | Mutalyzer + Ensembl VEP | `chr=17, pos=43057064, frameshift_variant, p.Gln1756ProfsTer74` | | Population frequency (primary) | gnomAD chr-pos-ref-alt lookup | Skipped — empty alt allele for `dup` notation | | Population frequency (fallback) | gnomAD `variant_search` by ClinVar variation ID | Resolved to `13-32340300-GT-G`, AF 0.000136, 0 homozygotes | | ClinVar consensus | NCBI esummary | `VCV000548237` (3★ Pathogenic) | | In-silico predictors | REVEL / AlphaMissense / SpliceAI | n/a for frameshift | | autoPVS1 | rule engine | Triggered (very_strong) — frameshift in established LoF gene | | Bayesian score | combiner | PVS1 (+8) + PP5 (+8) + PM2_supporting (+1) = +17 | | Final | combiner | **Pathogenic** | | Audit | Postgres | Every criterion above persisted with its evidence_text, source, and confidence fields | The classification is reproducible to the byte for any variant in the validation fixture. Every triggered criterion includes a `source` field (database accession or PMID), an `evidence_text` field with the literal quote or score, and a `confidence` rating. --- ## Honest limitations These are surfaced explicitly because they will surface anyway during review: * The 94% number is adjacent-tier (P↔LP and B↔LB collapsed). Strict-tier exact-match concordance is ~75-80%; lower than published but not unreasonable given that even expert panels disagree on the P/LP boundary. * The 1000-variant fixture is balanced (200 per tier) and may not reflect the natural prevalence of a specific lab's case mix. * Population frequency lookups via the `dup`/complex-indel fallback path add ~2-5 seconds per variant for cases where the primary lookup misses. Affects roughly 5% of variants in the validation fixture. * The literature layer is deliberately deployed only behind authentication in production (cost control); the public demo URL runs deterministic-only. * Six ACMG criteria are not yet implemented (PM4, PP2, BS4, BP2, BP3, BP5). None of these meaningfully changes final classifications on more than ~1-2% of typical caseloads, but full 28/28 coverage is the v0.2 target. --- ## How to verify everything in this document | Claim | Verifiable artifact | |---|---| | 94.0% concordance on 1000 variants | `docs/clinical_validation_results_1000.json` | | 22/28 ACMG criteria implemented | `backend/app/services/acmg/rules.py` + `backend/app/services/llm/prompts.py` | | Per-gene concordance breakdown | `docs/per_gene_breakdown_1000.json` | | RAG smoke test result | `docs/smoke_test_50_results.json` | | Anti-hallucination prompt design | `backend/app/services/llm/prompts.py` | | 102 / 103 backend tests passing | `pytest backend/tests/` | | Air-gap deployment artifacts | `docker-compose.clinical.yml` | | Governance drafts | `docs/governance/` | --- ## Single-paragraph positioning statement > VariantLens is an open-source clinical genomic variant interpretation > tool combining a calibrated deterministic ACMG/AMP rule engine with a > structurally hallucination-resistant LLM-driven literature reasoning > layer. It reaches **94.0% adjacent-tier concordance** on a 1000-variant > ClinVar fixture spanning 876 genes — exceeding the published numbers > for InterVar and architecturally distinct from AI CURA, EvAgg, and > AutoPM3. It is deployable on-premise with no cloud dependency, ships > with a complete audit trail to source for every triggered criterion, > and is positioned to support the ACMG/AMP SVC v4.0 transition through > a versioned rule-engine architecture. --- *Contact*: Theo Sevitt  ·  intern, Jordan Lerner-Ellis Lab *Repository*: https://github.com/tsevitth-png/variantlens *Live demo*: https://frontend-coral-omega-54.vercel.app