§3 VEP: replace variant set with audited 8-pick spread
Old set: 6 variants, two disagreed with ClinVar at the 4 kb scoring
window, one was effectively invisible, one had a contested clinical
classification (TP53 Pro72Arg). Replaced with 8 picks (6 Pathogenic +
1 Risk + 1 Benign), all checked against Carbon-3B at the demo's window
length: HBB, BRCA2, TP53 c.712T>A, F9, LDLR splice donor, VHL stop,
LRRK2 G2019S, VHL 3' UTR benign. The two VHL rows are paired by design
as the narrative payoff (same gene, opposite verdicts: −323 nats vs
≈0). Regenerated data/variants.json end to end via fetch_variants.py +
precompute_vep against the live /score endpoint; rewrote GENE_INFO /
VARIANT_DESC for the new genes; updated the VEP takeaway to point at
the VHL pair instead of the retired sign-disagreement framing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- assets/js/sections/vep.js +12 -8
- data/variants.json +0 -0
- demo.html +5 -4
- scripts/fetch_variants.py +16 -10
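
As a quick sanity check on the spread described in the commit message, the eight picks can be tallied by ClinVar significance and by gene. This is only a sketch: `PICKS` and `summarize` are illustrative names, not part of the repository, and the tuples are copied by hand from the variant list in this commit.

```python
from collections import Counter

# (rsID, gene, ClinVar significance) for the eight picks listed in this commit.
PICKS = [
    ("rs334", "HBB", "Pathogenic"),
    ("rs80359027", "BRCA2", "Pathogenic"),
    ("rs1057519981", "TP53", "Pathogenic"),
    ("rs1603267420", "F9", "Pathogenic"),
    ("rs112029328", "LDLR", "Pathogenic"),
    ("rs1575932011", "VHL", "Pathogenic"),
    ("rs34637584", "LRRK2", "Risk"),
    ("rs182781943", "VHL", "Benign"),
]


def summarize(picks):
    """Return (significance counts, sorted genes that appear more than once)."""
    sig_counts = Counter(sig for _, _, sig in picks)
    gene_counts = Counter(gene for _, gene, _ in picks)
    paired = sorted(gene for gene, n in gene_counts.items() if n > 1)
    return dict(sig_counts), paired


sig_counts, paired_genes = summarize(PICKS)
print(sig_counts, paired_genes)
```

The only gene that appears twice is VHL, which is the deliberate pathogenic/benign pair the message describes.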
--- a/assets/js/sections/vep.js
+++ b/assets/js/sections/vep.js
@@ -27,21 +27,25 @@
 // One-line gene description (what the protein does).
 const GENE_INFO = {
   HBB: "Hemoglobin β-subunit, the protein that carries oxygen in red blood cells.",
+  BRCA2: "Tumor suppressor essential for repairing DNA double-strand breaks by homologous recombination. Germline loss-of-function variants drive much of hereditary breast and ovarian cancer.",
   TP53: "Tumor suppressor that guards the cell cycle and triggers apoptosis when DNA is damaged. Disabling mutations are the most common driver across cancers.",
+  F9: "Coagulation factor IX, a serine protease in the blood-clotting cascade. Loss-of-function variants cause hemophilia B.",
+  LDLR: "Low-density lipoprotein receptor, which pulls cholesterol-carrying LDL particles out of the blood. Loss-of-function variants cause familial hypercholesterolemia.",
+  VHL: "Tumor suppressor that marks the HIF transcription factors for degradation when oxygen is plentiful. Loss of function drives Von Hippel-Lindau disease, with hemangioblastomas and renal cell carcinoma.",
   LRRK2: "Leucine-rich repeat kinase that regulates vesicle trafficking. Specific mutations are the most common monogenic cause of Parkinson's disease.",
-  APOE: "Apolipoprotein E, which packages cholesterol for transport in the blood.",
-  F5: "Coagulation factor V, a key protein in the blood-clotting cascade.",
 };
 
 // Per-variant description, written to flow as a natural sentence after the
 // gene description (so the two paragraphs read as one merged blurb).
 const VARIANT_DESC = {
-  rs334:
-
-
-
-
-
+  rs334: "The c.20A>T mutation flips a single base in the second exon, replacing glutamic acid with valine at position 6 of the protein (p.Glu6Val). The altered hemoglobin polymerises in low oxygen, deforming red blood cells into the characteristic sickle shape, the cause of sickle cell anemia.",
+  rs80359027: "The c.7976G>T substitution replaces arginine with isoleucine at position 2659 (p.Arg2659Ile) inside one of the helical domains BRCA2 uses to bind its DNA-repair partner proteins. Missense variants in this region are recurrently reported in families with hereditary breast and ovarian cancer.",
+  rs1057519981: "The c.712T>A substitution converts cysteine to serine at position 238 (p.Cys238Ser) inside TP53's DNA-binding core. The cysteine helps coordinate a structural zinc ion, and losing it destabilises the fold the tumor suppressor uses to recognise its DNA targets.",
+  rs1603267420: "The c.1186T>A mutation swaps cysteine for serine at position 396 (p.Cys396Ser) in factor IX, breaking one of the disulfide bonds that stabilise the protein's catalytic domain. Loss-of-function variants in F9 cause hemophilia B.",
+  rs112029328: "The c.313+1G>T mutation hits the +1 base of an LDLR splice donor, the most conserved position in any intron and one the spliceosome essentially can't miss. With the donor lost, the affected exon is skipped or read through into intronic sequence, producing a non-functional LDL receptor and causing familial hypercholesterolemia.",
+  rs1575932011: "The c.475A>T mutation rewrites a lysine codon into a stop, truncating VHL at p.Lys159Ter. The remaining protein is missing the β-domain it needs to bind HIF transcription factors and mark them for degradation, the hallmark loss-of-function pattern behind Von Hippel-Lindau disease.",
+  rs34637584: "The Gly2019Ser (G2019S) mutation sits in LRRK2's kinase activation loop and turbocharges its kinase activity. Carriers have roughly 30% lifetime risk of Parkinson's disease.",
+  rs182781943: "The c.*820A>G change sits 820 bases past VHL's stop codon, deep in the 3' untranslated region. It changes no amino acid and disrupts no splice site, so the encoded protein is identical to the reference and ClinVar classifies it as benign.",
 };
 
 function setStatus(text, mode = "") {
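
The variant descriptions pair a coding-sequence position (c.) with a protein residue number (p.). A minimal sketch of the arithmetic linking the two, assuming standard 1-based coding coordinates that start at the A of the start codon; `codon_number` and `CHECKS` are illustrative names and not part of the repository:

```python
def codon_number(cdna_pos: int) -> int:
    """1-based index of the codon containing a 1-based coding-sequence position."""
    if cdna_pos < 1:
        raise ValueError("coding positions start at 1")
    return (cdna_pos - 1) // 3 + 1


# (c. position, residue number quoted in this commit)
CHECKS = {
    "BRCA2 c.7976G>T": (7976, 2659),
    "TP53 c.712T>A": (712, 238),
    "F9 c.1186T>A": (1186, 396),
    "VHL c.475A>T": (475, 159),
    "LRRK2 c.6055G>A": (6055, 2019),
}

for name, (pos, residue) in CHECKS.items():
    assert codon_number(pos) == residue, name
```

Under this convention HBB c.20A>T falls in codon 7, not 6. The p.Glu6Val label in the rs334 description appears to follow the traditional sickle-cell numbering, which counts from the mature protein without the initiator methionine.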
--- a/data/variants.json
+++ b/data/variants.json
The diff for this file is too large to render.
--- a/demo.html
+++ b/demo.html
@@ -792,10 +792,11 @@ for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()),
 Read each row two ways: the <em>dot color</em> is what ClinVar says (red = pathogenic,
 orange = risk, green = benign); the <em>bar direction</em> is what Carbon says (red bar
 pointing left = mutation less likely than original; charcoal bar pointing right =
-mutation looks fine or more likely).
-
-
-
+mutation looks fine or more likely). Watch the two VHL rows for the cleanest
+demonstration: a premature stop codon (c.475A>T) swings the bar hundreds of nats to
+the left, while a common 3' UTR variant (c.*820A>G) in the very same gene sits at
+zero. Same model, same window length, opposite verdicts. Carbon learned the
+distinction from raw sequence alone, with no labels.
 </p>
 </div>
 
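
The bar direction described in the demo text can be read as the sign of a log-likelihood difference between the mutated and the reference window. The sketch below assumes the score is the sum of per-token natural-log probabilities of the alternate window minus that of the reference window (hence nats). The real `/score` endpoint's request and response shapes are not shown in this commit, so the function names and toy numbers here are purely illustrative:

```python
import math


def sequence_loglik(token_logprobs):
    """Total log-likelihood of a window: sum of per-token log-probabilities (nats)."""
    return math.fsum(token_logprobs)


def delta_nats(ref_logprobs, alt_logprobs):
    """Alt minus ref; negative means the mutated window is less likely."""
    return sequence_loglik(alt_logprobs) - sequence_loglik(ref_logprobs)


def bar_direction(delta):
    """Plot convention from the demo text: less likely -> left, otherwise right."""
    return "left" if delta < 0 else "right"


# Made-up per-token log-probs (not model output): only the middle token differs.
ref = [-1.0, -0.5, -2.0]
alt = [-1.0, -3.5, -2.0]
d = delta_nats(ref, alt)
print(d, bar_direction(d))
```

A delta of exactly zero maps to "right", matching the text's "mutation looks fine or more likely" reading.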
--- a/scripts/fetch_variants.py
+++ b/scripts/fetch_variants.py
@@ -23,17 +23,23 @@ ENSEMBL = "https://rest.ensembl.org"
 WINDOW = 4002 # multiple of 6 for the 6-mer BPE tokenizer + matches HALF math
 HALF = WINDOW // 2 # variant placed at center
 
-# Curated set: famous pathogenic + benign SNVs across several
-# Coordinates resolved via the Ensembl variation API
-#
-#
+# Curated set: famous pathogenic + risk + benign SNVs across several
+# genes/diseases. Coordinates resolved via the Ensembl variation API; the
+# plus-strand alt is the ClinVar canonical disease-associated allele for
+# pathogenic / risk picks, or the single reported alt for benign.
+#
+# Spread: 6 Pathogenic + 1 Risk + 1 Benign. The two VHL rows (rs1575932011
+# and rs182781943) are paired on purpose, same gene, opposite verdicts:
+# a premature stop codon vs. a benign 3' UTR SNP.
 VARIANTS = [
-    {"rs": "rs334",
-    {"rs": "
-    {"rs": "
-    {"rs": "
-    {"rs": "
-    {"rs": "
+    {"rs": "rs334", "gene": "HBB", "name": "HBB c.20A>T", "sig": "Pathogenic", "blurb": "sickle cell anemia · p.Glu6Val", "plus_alt": "A"},
+    {"rs": "rs80359027", "gene": "BRCA2", "name": "BRCA2 c.7976G>T", "sig": "Pathogenic", "blurb": "hereditary breast/ovarian cancer · missense", "plus_alt": "T"},
+    {"rs": "rs1057519981", "gene": "TP53", "name": "TP53 c.712T>A", "sig": "Pathogenic", "blurb": "Li-Fraumeni cancer · missense", "plus_alt": "T"},
+    {"rs": "rs1603267420", "gene": "F9", "name": "F9 c.1186T>A", "sig": "Pathogenic", "blurb": "hemophilia B · clotting factor missense", "plus_alt": "A"},
+    {"rs": "rs112029328", "gene": "LDLR", "name": "LDLR c.313+1G>T", "sig": "Pathogenic", "blurb": "familial high cholesterol · splice donor lost","plus_alt": "T"},
+    {"rs": "rs1575932011", "gene": "VHL", "name": "VHL c.475A>T", "sig": "Pathogenic", "blurb": "Von Hippel-Lindau · premature STOP", "plus_alt": "T"},
+    {"rs": "rs34637584", "gene": "LRRK2", "name": "LRRK2 c.6055G>A", "sig": "Risk", "blurb": "Parkinson's · G2019S kinase variant", "plus_alt": "A"},
+    {"rs": "rs182781943", "gene": "VHL", "name": "VHL c.*820A>G", "sig": "Benign", "blurb": "common 3' UTR variant · same gene as row above","plus_alt": "G"},
 ]
 
 
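
The `WINDOW` / `HALF` constants and the `plus_alt` field rest on two small pieces of arithmetic: where a variant-centred window starts and ends, and how a coding-strand allele maps onto the plus strand. The sketch below is hedged in two ways. The inclusive 1-based coordinate convention and both helper names are hypothetical, since the script's fetch logic is not shown in this hunk. The gene orientations (HBB and TP53 on the minus strand, BRCA2 on the plus strand) are general genome-annotation knowledge and are not stated in the commit.

```python
WINDOW = 4002          # value from fetch_variants.py
HALF = WINDOW // 2     # 2001

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def window_bounds(pos):
    """Hypothetical convention: 1-based inclusive (start, end) such that the
    variant lands at 0-based offset HALF inside a WINDOW-long sequence."""
    start = pos - HALF
    end = start + WINDOW - 1
    return start, end


def plus_strand_alt(coding_alt, strand):
    """Plus-strand alt allele for a single-base change written on the coding strand.
    strand is +1 for plus-strand genes and -1 for minus-strand genes."""
    return coding_alt if strand == 1 else COMPLEMENT[coding_alt]


start, end = window_bounds(10_000)
print(start, end, end - start + 1)
print(plus_strand_alt("T", -1))  # HBB c.20A>T on a minus-strand gene
```

Because `WINDOW` is even there is no exact middle base, so "center" here means offset `HALF`, with 2001 bases before the variant and 2000 after it. The complement step agrees with the table above: HBB c.20A>T carries `plus_alt` "A" and TP53 c.712T>A carries "T", while the plus-strand picks keep their coding-strand allele.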