lvwerra HF Staff Claude Opus 4.7 (1M context) committed on
Commit 41739ad · 1 Parent(s): 21329c2

§3 VEP: replace variant set with audited 8-pick spread

Old set: 6 variants, two disagreed with ClinVar at the 4 kb scoring
window, one was effectively invisible, one had a contested clinical
classification (TP53 Pro72Arg). Replaced with 8 picks (6 Pathogenic +
1 Risk + 1 Benign), all checked against Carbon-3B at the demo's window
length: HBB, BRCA2, TP53 c.712T>A, F9, LDLR splice donor, VHL stop,
LRRK2 G2019S, VHL 3' UTR benign. The two VHL rows are paired by design
as the narrative payoff (same gene, opposite verdicts: −323 nats vs
≈0). Regenerated data/variants.json end to end via fetch_variants.py +
precompute_vep against the live /score endpoint; rewrote GENE_INFO /
VARIANT_DESC for the new genes; updated the VEP takeaway to point at
the VHL pair instead of the retired sign-disagreement framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

assets/js/sections/vep.js CHANGED
@@ -27,21 +27,25 @@
  // One-line gene description (what the protein does).
  const GENE_INFO = {
  HBB: "Hemoglobin β-subunit, the protein that carries oxygen in red blood cells.",
+ BRCA2: "Tumor suppressor essential for repairing DNA double-strand breaks by homologous recombination. Germline loss-of-function variants drive much of hereditary breast and ovarian cancer.",
  TP53: "Tumor suppressor that guards the cell cycle and triggers apoptosis when DNA is damaged. Disabling mutations are the most common driver across cancers.",
+ F9: "Coagulation factor IX, a serine protease in the blood-clotting cascade. Loss-of-function variants cause hemophilia B.",
+ LDLR: "Low-density lipoprotein receptor, which pulls cholesterol-carrying LDL particles out of the blood. Loss-of-function variants cause familial hypercholesterolemia.",
+ VHL: "Tumor suppressor that marks the HIF transcription factors for degradation when oxygen is plentiful. Loss of function drives Von Hippel-Lindau disease, with hemangioblastomas and renal cell carcinoma.",
  LRRK2: "Leucine-rich repeat kinase that regulates vesicle trafficking. Specific mutations are the most common monogenic cause of Parkinson's disease.",
- APOE: "Apolipoprotein E, which packages cholesterol for transport in the blood.",
- F5: "Coagulation factor V, a key protein in the blood-clotting cascade.",
  };

  // Per-variant description, written to flow as a natural sentence after the
  // gene description (so the two paragraphs read as one merged blurb).
  const VARIANT_DESC = {
- rs334: "The c.20A>T mutation flips a single base in the second exon, replacing glutamic acid with valine at position 6 of the protein (p.Glu6Val). The altered hemoglobin polymerises in low oxygen, deforming red blood cells into the characteristic sickle shape, the cause of sickle cell anemia.",
- rs28934578:"The Arg175His (R175H) mutation sits in one of the six most-mutated residues in TP53 across all human cancers. It disrupts the DNA-binding domain so the tumor suppressor can no longer transcribe its targets, central to Li-Fraumeni syndrome.",
- rs34637584:"The Gly2019Ser (G2019S) mutation sits in LRRK2's kinase activation loop and turbocharges its kinase activity. Carriers have roughly 30% lifetime risk of Parkinson's disease.",
- rs429358: "The c.388T>C variant (p.Cys112Arg) defines the APOE-ε4 isoform, the strongest common genetic risk factor for late-onset Alzheimer's disease.",
- rs6025: "The Factor V Leiden variant (p.Arg506Gln) makes the protein resistant to inactivation by Protein C; carriers have 5–10× higher risk of deep-vein thrombosis.",
- rs1042522: "The Pro72Arg (P72R) variant is a common polymorphism in TP53's proline-rich domain. Functional consequences are subtle and debated; usually classified as benign in clinical settings.",
+ rs334: "The c.20A>T mutation flips a single base in the second exon, replacing glutamic acid with valine at position 6 of the protein (p.Glu6Val). The altered hemoglobin polymerises in low oxygen, deforming red blood cells into the characteristic sickle shape, the cause of sickle cell anemia.",
+ rs80359027: "The c.7976G>T substitution replaces arginine with isoleucine at position 2659 (p.Arg2659Ile) inside one of the helical domains BRCA2 uses to bind its DNA-repair partner proteins. Missense variants in this region are recurrently reported in families with hereditary breast and ovarian cancer.",
+ rs1057519981: "The c.712T>A substitution converts cysteine to serine at position 238 (p.Cys238Ser) inside TP53's DNA-binding core. The cysteine helps coordinate a structural zinc ion, and losing it destabilises the fold the tumor suppressor uses to recognise its DNA targets.",
+ rs1603267420: "The c.1186T>A mutation swaps cysteine for serine at position 396 (p.Cys396Ser) in factor IX, breaking one of the disulfide bonds that stabilise the protein's catalytic domain. Loss-of-function variants in F9 cause hemophilia B.",
+ rs112029328: "The c.313+1G>T mutation hits the +1 base of an LDLR splice donor, the most conserved position in any intron and one the spliceosome essentially can't miss. With the donor lost, the affected exon is skipped or read through into intronic sequence, producing a non-functional LDL receptor and causing familial hypercholesterolemia.",
+ rs1575932011: "The c.475A>T mutation rewrites a lysine codon into a stop, truncating VHL at p.Lys159Ter. The remaining protein is missing the β-domain it needs to bind HIF transcription factors and mark them for degradation, the hallmark loss-of-function pattern behind Von Hippel-Lindau disease.",
+ rs34637584: "The Gly2019Ser (G2019S) mutation sits in LRRK2's kinase activation loop and turbocharges its kinase activity. Carriers have roughly 30% lifetime risk of Parkinson's disease.",
+ rs182781943: "The c.*820A>G change sits 820 bases past VHL's stop codon, deep in the 3' untranslated region. It changes no amino acid and disrupts no splice site, so the encoded protein is identical to the reference and ClinVar classifies it as benign.",
  };

  function setStatus(text, mode = "") {
data/variants.json CHANGED
The diff for this file is too large to render. See raw diff
 
demo.html CHANGED
@@ -792,10 +792,11 @@ for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()),
  Read each row two ways: the <em>dot color</em> is what ClinVar says (red = pathogenic,
  orange = risk, green = benign); the <em>bar direction</em> is what Carbon says (red bar
  pointing left = mutation less likely than original; charcoal bar pointing right =
- mutation looks fine or more likely). When dot and bar agree on "left of zero", as in
- HBB c.20A>T sickle cell, Carbon has independently picked up the pathogenicity signal.
- When they disagree, the common reason is allele frequency: variants common enough in
- human populations look perfectly normal to a model trained on natural sequence.
+ mutation looks fine or more likely). Watch the two VHL rows for the cleanest
+ demonstration: a premature stop codon (c.475A>T) swings the bar hundreds of nats to
+ the left, while a common 3' UTR variant (c.*820A>G) in the very same gene sits at
+ zero. Same model, same window length, opposite verdicts. Carbon learned the
+ distinction from raw sequence alone, with no labels.
  </p>
  </div>

scripts/fetch_variants.py CHANGED
@@ -23,17 +23,23 @@ ENSEMBL = "https://rest.ensembl.org"
  WINDOW = 4002 # multiple of 6 for the 6-mer BPE tokenizer + matches HALF math
  HALF = WINDOW // 2 # variant placed at center

- # Curated set: famous pathogenic + benign SNVs across several genes/diseases.
- # Coordinates resolved via the Ensembl variation API. We specify which alt
- # allele is the disease-relevant one in plus-strand coordinates (Ensembl can
- # list 3+ alts for a position; we pick the canonical pathogenic one).
+ # Curated set: famous pathogenic + risk + benign SNVs across several
+ # genes/diseases. Coordinates resolved via the Ensembl variation API; the
+ # plus-strand alt is the ClinVar canonical disease-associated allele for
+ # pathogenic / risk picks, or the single reported alt for benign.
+ #
+ # Spread: 6 Pathogenic + 1 Risk + 1 Benign. The two VHL rows (rs1575932011
+ # and rs182781943) are paired on purpose, same gene, opposite verdicts:
+ # a premature stop codon vs. a benign 3' UTR SNP.
  VARIANTS = [
- {"rs": "rs334", "gene": "HBB", "name": "HBB c.20A>T", "sig": "Pathogenic", "blurb": "sickle cell · p.Glu6Val", "plus_alt": "A"},
- {"rs": "rs28934578", "gene": "TP53", "name": "TP53 c.524G>A", "sig": "Pathogenic", "blurb": "Li-Fraumeni hotspot · p.Arg175His", "plus_alt": "T"},
- {"rs": "rs34637584", "gene": "LRRK2", "name": "LRRK2 c.6055G>A", "sig": "Pathogenic", "blurb": "Parkinson's · p.Gly2019Ser", "plus_alt": "A"},
- {"rs": "rs429358", "gene": "APOE", "name": "APOE c.388T>C", "sig": "Risk", "blurb": "APOE4 · Alzheimer's risk · p.Cys112Arg","plus_alt": "C"},
- {"rs": "rs6025", "gene": "F5", "name": "F5 Leiden", "sig": "Risk", "blurb": "factor V Leiden · thrombosis", "plus_alt": "T"},
- {"rs": "rs1042522", "gene": "TP53", "name": "TP53 p.Pro72Arg", "sig": "Benign", "blurb": "common polymorphism · benign", "plus_alt": "C"},
+ {"rs": "rs334", "gene": "HBB", "name": "HBB c.20A>T", "sig": "Pathogenic", "blurb": "sickle cell anemia · p.Glu6Val", "plus_alt": "A"},
+ {"rs": "rs80359027", "gene": "BRCA2", "name": "BRCA2 c.7976G>T", "sig": "Pathogenic", "blurb": "hereditary breast/ovarian cancer · missense", "plus_alt": "T"},
+ {"rs": "rs1057519981", "gene": "TP53", "name": "TP53 c.712T>A", "sig": "Pathogenic", "blurb": "Li-Fraumeni cancer · missense", "plus_alt": "T"},
+ {"rs": "rs1603267420", "gene": "F9", "name": "F9 c.1186T>A", "sig": "Pathogenic", "blurb": "hemophilia B · clotting factor missense", "plus_alt": "A"},
+ {"rs": "rs112029328", "gene": "LDLR", "name": "LDLR c.313+1G>T", "sig": "Pathogenic", "blurb": "familial high cholesterol · splice donor lost","plus_alt": "T"},
+ {"rs": "rs1575932011", "gene": "VHL", "name": "VHL c.475A>T", "sig": "Pathogenic", "blurb": "Von Hippel-Lindau · premature STOP", "plus_alt": "T"},
+ {"rs": "rs34637584", "gene": "LRRK2", "name": "LRRK2 c.6055G>A", "sig": "Risk", "blurb": "Parkinson's · G2019S kinase variant", "plus_alt": "A"},
+ {"rs": "rs182781943", "gene": "VHL", "name": "VHL c.*820A>G", "sig": "Benign", "blurb": "common 3' UTR variant · same gene as row above","plus_alt": "G"},
  ]

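The `WINDOW` / `HALF` constants and the `plus_alt` field in the hunk above suggest how a scoring window might be built. The following is a rough sketch, not the script's actual code: it assumes the variant sits at 0-based offset `HALF` inside a `WINDOW`-base plus-strand window, and that a coding-strand allele is complemented when the gene lies on the minus strand. The helpers `plus_strand_allele`, `window_bounds`, and `apply_snv` are hypothetical names introduced for illustration.

```python
WINDOW = 4002        # from the script: a multiple of 6
HALF = WINDOW // 2   # from the script: variant placed at center

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def plus_strand_allele(coding_allele, strand):
    """Convert a single coding-strand base to its plus-strand base.

    strand is +1 for plus-strand genes and -1 for minus-strand genes.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    if strand == 1:
        return coding_allele
    return coding_allele.translate(_COMPLEMENT)


def window_bounds(pos):
    """Return 1-based inclusive (start, end) with pos at offset HALF (assumed)."""
    start = pos - HALF
    return start, start + WINDOW - 1


def apply_snv(ref_window, plus_ref, plus_alt):
    """Swap the center base of a plus-strand window for the alt allele."""
    if len(ref_window) != WINDOW:
        raise ValueError("window has the wrong length")
    if ref_window[HALF] != plus_ref:
        raise ValueError("reference base mismatch at the window center")
    return ref_window[:HALF] + plus_alt + ref_window[HALF + 1:]
```

As a usage example, HBB lies on the minus strand, so `plus_strand_allele("T", -1)` yields `"A"`, which agrees with the `"plus_alt": "A"` recorded for HBB c.20A>T. Checking the reference base before substituting is a cheap guard against strand or coordinate mix-ups.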