Release v1.0: HPRC-trained 35/21-feature calibrated SV+STR models

#1
by khyeom - opened
README.md CHANGED
@@ -1,109 +1,82 @@
1
  ---
2
  license: mit
3
- tags:
4
- - genomics
5
- - structural-variants
6
- - short-tandem-repeats
7
- - random-forest
8
- - variant-calling
9
  library_name: sklearn
10
  ---
11
 
12
- # SVSTR-Score: long-read-guided confidence scoring for short-read SV and STR genotypes
13
 
14
- Per-class random-forest models that assign a calibrated **confidence score (CS [0,1])**
15
- and a four-tier operating point to each **short-read** structural-variant (SV) or
16
- short-tandem-repeat (STR) call, learned from paired short-/long-read genomes. The primary SV
17
- model is the 23-feature caller-output random forest; a caller-independent sequence-only model
18
- is provided as a portable secondary option.
19
- The long-read genotype is used only to build the training label; **inference needs
20
- short-read caller output only**.
21
 
22
- ## Files
23
- | File | Description |
24
- |---|---|
25
- | `sv_model_v13_parents.joblib` | SV random forest (23 caller-output features; **primary SV model / external-validation headline**) |
26
- | `str_model_v13_parents.joblib` | STR random forest (25 features) |
27
- | `seqonly/` | **caller-independent sequence-only SV model (svspr, 11 features) — portable secondary option** for callers whose fields differ from training. `seqonly/model/svspr_v14_seq.pkl` + the `svspr` inference package (VCF + reference → CS/tier; tiers HIGH≥0.9/MOD≥0.7/WARN≥0.5) + example. 11 features = SV length/log-length, SVTYPE one-hot, ±100 bp flank GC/AT/inner-GC, motif-2/3 counts; no caller fields |
28
- | `str_locus_lookup.parquet` | per-locus historical concordance (`locus_conc_rate`) keyed by (chrom, pos); 163,726 loci |
29
- | `sv_config.json` / `str_config.json` | feature order, tier thresholds, lookup metadata |
30
- | `feature_manifest.json` | definition of every SV/STR feature |
31
- | `tier_thresholds.json` | tier cut-offs + override rule |
32
- | `score_svstr.py` | inference entry point |
33
- | `requirements.txt` | pinned dependencies (**scikit-learn==1.5.1**) |
34
- | `example/` | real example calls from the study cohort (input features + scored output) |
35
 
36
- > Models were trained with **scikit-learn 1.5.1**; load with the same version to
37
- > avoid `InconsistentVersionWarning` and guarantee identical results.
38
 
39
- ## Quick start
40
- ```bash
41
- pip install -r requirements.txt
42
- # SV
43
- python score_svstr.py --variant sv --model-dir . --features example/sv_features.tsv --out sv_scored.tsv
44
- # STR (locus_conc_rate / locus_in_lookup are filled here from str_locus_lookup.parquet)
45
- python score_svstr.py --variant str --model-dir . --features example/str_features.tsv --out str_scored.tsv
46
- ```
47
- Input is a table extracted from the caller VCF (e.g. with `bcftools query`) holding
48
- the features in `*_config.json`. For STR, supply the 23 caller-output features plus
49
- `chrom,pos`; the script joins the catalogue lookup to add `locus_conc_rate` and
50
- `locus_in_lookup`.
51
 
52
- `example/` holds real calls drawn from the study cohort (`*_features.tsv`) and
53
- their scored output (`*_scored.tsv`), spanning the tier range. They double as a
54
- reproducibility check: the SV set includes precise PASS calls scored HIGH and
55
- override cases (`is_imprecise==1` or `filter_pass==0`) demoted to LOW; the STR
56
- set includes high-concordance loci scored HIGH and a feature-clean but
57
- historically discordant locus scored LOW, illustrating that `locus_conc_rate`
58
- dominates the STR score.
59
 
60
- ## Confidence tiers
61
- | Tier | Cut |
62
- |---|---|
63
- | HIGH | CS ≥ 0.70 |
64
- | MODERATE | 0.50 ≤ CS < 0.70 |
65
- | WARNING | 0.30 ≤ CS < 0.50 |
66
- | LOW | CS < 0.30 |
67
 
68
- **Override.** Deterministic quality-demotion rules in each `*_config.json` (`rules`)
69
- are applied on top of the CS tier by `score_svstr.py`; any rule triggered →
70
- `override_target` regardless of CS. **SV LOW** (QUAL<20, GQ<15, total_alt_support≤2,
71
- vaf_estimate<0.15, is_imprecise==1); **STR WARNING** (locus_coverage<20, is_low_depth==1,
72
- support_type_a2≤1, total_support_a2<5, ci_width_a2>3, allele_balance<0.3). **Only the
73
- MODERATE→HIGH range is a monotone precision ladder; LOW and WARNING are not rank-ordered
74
- by precision.** Use the **HIGH tier as a candidate-triage filter**.
75
 
76
- ## Intended use & performance
77
- - **Sample-LOSO cross-validation** (143 unrelated parents): SV F1 = **0.9308**, STR F1 = **0.9960**.
78
- - **Within-cohort generalization** — re-scored on 3,608 unseen short-read genomes without retraining;
79
- the score distribution and per-1-Mb genomic reliability map reproduce (training↔unseen Spearman ρ = **0.90**).
80
- - **External multi-sample validation — 194 HPRC genomes** (sequence-aware Truvari labels; long-read truth Sawfish/SV, TRGT/STR):
81
- - SV (this 23-feature model, the primary external object): per-genome median AUROC **0.927**, above raw
82
- QUAL (0.741), GQ (0.476) and read-depth (0.671) and the applicable dedicated tools (duphold, Paragraph),
83
- most clearly for deletions and insertions (AUPRC 0.948, expected calibration error 0.053). Robust across
84
- the tested callers (median **0.915** on an independent DRAGEN reanalysis). The companion
85
- **caller-independent sequence-only model** transfers at a lower ceiling (median AUROC **0.851**).
86
- - STR (this model): per-genome median AUROC **0.912**; in-catalogue 0.909 vs out-of-catalogue 0.688 (catalogue-bound reliability atlas).
87
- - **GIAB HG002** (single-sample supplementary check): HIGH-tier SV precision **97.7 %** (consensus v0.6).
88
 
89
- ## Limitations (read before use)
90
- - **External validation is multi-sample (194 HPRC genomes; see above).** Outputs remain
91
- candidate triage, not validated clinical calls. The 23-feature caller-output model is robust
92
- across the tested Manta-family callers (Manta v1.6, DRAGEN), but it consumes caller-specific
93
- fields and is untested on structurally different callers (e.g. DELLY, GATK-SV); for those use
94
- the caller-independent sequence-only model (lower ceiling).
95
- - **STR scoring is a per-locus reliability atlas.** ~94 % of STR model importance is
96
- `locus_conc_rate`; removing it drops cross-validated AUROC 0.998 → 0.62. The score
97
- does **not** extend to STR loci outside the 163,726-locus catalogue (out-of-catalogue
98
- loci get a conservative score by construction), and only the HIGH tier transfers
99
- across callers — lower-tier ranking does not.
100
- - Trained on a single ancestry and pipeline (Illumina DRAGEN + Manta/ExpansionHunter;
101
- PacBio HiFi + Sawfish/TRGT, GRCh38). Behaviour under other pipelines is untested.
102
- - `.joblib` is a pickle: load only from trusted sources and with the pinned versions.
103
 
104
- ## Citation
105
- Kim W*, Yeom K*, et al. SVSTR-Score: long-read-guided confidence scoring for
106
- short-read SV and STR genotypes. (manuscript in preparation.)
107
 
108
- ## Licence
109
- MIT.
1
  ---
2
  license: mit
3
  library_name: sklearn
4
+ tags:
5
+ - genomics
6
+ - structural-variants
7
+ - short-tandem-repeats
8
+ - variant-calling
9
+ - confidence-calibration
10
+ - random-forest
11
+ pipeline_tag: tabular-classification
12
  ---
13
 
14
+ # SVSTR-Score (v1.0) — long-read-guided confidence scoring for short-read SV/STR calls
15
 
16
+ Two per-class **RandomForest + isotonic calibrator** models that assign a
17
+ **calibrated confidence score** `CS ∈ [0,1]` and a four-tier operating point to
18
+ each short-read **structural-variant (SV)** or **short-tandem-repeat (STR)** call.
19
+ The long-read genotype is used **only to build the training label**, so inference
20
+ needs short-read caller output only.
21
 
22
+ `CS = isotonic_calibrator(RandomForest.predict_proba(X)[:, 1])`, an estimate of
23
+ P(the short-read call is concordant with the long-read truth).
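
The calibration step in the formula above can be illustrated without the released weights. The sketch below is a minimal, from-scratch isotonic fit (pool-adjacent-violators) in plain Python. It is not the repository's code: the function names `fit_isotonic` and `calibrate` and the toy scores and labels are hypothetical, tied raw scores are not pooled, and the step-function lookup is a simplification (scikit-learn's `IsotonicRegression` interpolates between fitted points).

```python
from bisect import bisect_left


def fit_isotonic(raw_scores, labels):
    """Pool-adjacent-violators fit; returns block upper bounds and block means."""
    blocks = []  # each block is [label_sum, count, largest_raw_score]
    for score, label in sorted(zip(raw_scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while a block's mean is below its predecessor's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(raw_score, uppers, means):
    """Step-function lookup, clipped to the last block above the fitted range."""
    index = min(bisect_left(uppers, raw_score), len(means) - 1)
    return means[index]


# Hypothetical toy data: raw forest scores and 0/1 concordance labels.
toy_raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1, 1]
uppers, means = fit_isotonic(toy_raw, toy_labels)
calibrated = [calibrate(r, uppers, means) for r in toy_raw]
print(calibrated)
```

The fitted mapping is non-decreasing by construction, so calibration can change the score values but never reverses the ranking produced by the forest.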
24
 
25
+ | | SV | STR |
26
+ |---|---|---|
27
+ | weights | `sv_model.joblib` + `sv_calibrator.joblib` | `str_model.joblib` + `str_calibrator.joblib` |
28
+ | features | 35 (`feature_builder.py`) | 21 |
29
+ | short-read caller | Manta | ExpansionHunter |
30
+ | long-read truth (label only) | sawfish | TRGT |
31
+ | training cohort | 208 HPRC paired genomes | 208 HPRC |
32
 
33
+ ## Tiers
34
+ `HIGH CS ≥ 0.70` · `MODERATE 0.50 ≤ CS < 0.70` · `WARNING 0.30 ≤ CS < 0.50` · `LOW CS < 0.30`.
35
+ Tiers are buckets of the calibrated CS (no heuristic overrides). **HIGH** is the
36
+ candidate-triage tier.
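
Because the tiers are plain buckets of the calibrated score, they can be reproduced from the `CS` column alone. A minimal sketch, assuming each lower bound is inclusive (as written for HIGH); the function name `assign_tier` is hypothetical and is not taken from `score_svstr.py`:

```python
def assign_tier(cs):
    """Map a calibrated confidence score in [0, 1] to one of the four tiers."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:
        return "MODERATE"
    if cs >= 0.30:
        return "WARNING"
    return "LOW"


# CS values taken from example/str_scored.tsv in this change.
for cs in (0.8546, 0.5696, 0.3967, 0.1673):
    print(cs, assign_tier(cs))
```

The four values above come out as HIGH, MODERATE, WARNING and LOW, matching the `tier` column recorded for those rows in the example output.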
37
 
38
+ ## Intended use
39
+ Triage / down-weight *emitted* short-read SV & STR calls by their probability of
40
+ matching long-read truth, **without long-read sequencing every sample**. Not a
41
+ variant caller; does not recover missed variants; does not replace long-read
42
+ sequencing for complete discovery.
43
 
44
+ ## Performance
45
+ **Internal 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes**
46
 
47
+ | | AUROC | AUPRC | per-sample AUROC median |
48
+ |---|---|---|---|
49
+ | SV | 0.950 | 0.951 | 0.981 |
50
+ | STR | 0.834 | 0.886 | 0.835 |
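
Grouping folds by sample keeps every call from one genome on the same side of the split, so an out-of-fold score is never produced by a model that saw that genome during training. A minimal sketch in plain Python, assuming a simple round-robin assignment of samples to folds (scikit-learn's `GroupKFold` gives the same guarantee and additionally balances fold sizes). The function name `grouped_folds` and every sample ID other than HG00097 are hypothetical:

```python
def grouped_folds(groups, n_splits):
    """Yield (train_indices, test_indices) with no group shared between them."""
    unique_groups = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique_groups):
        raise ValueError("n_splits must be between 2 and the number of groups")
    # Round-robin assignment of whole groups (samples) to folds.
    fold_of = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# One entry per call; the value is the sample the call came from.
call_samples = ["HG00097", "HG00097", "S2", "S2", "S3", "S4", "S4", "S5"]
folds = list(grouped_folds(call_samples, n_splits=5))
for train, test in folds:
    print(sorted({call_samples[i] for i in test}))
```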
51
 
52
+ **External validation: 295 ASD genomes, models applied unchanged (no retraining)**
53
 
54
+ | | AUROC | observed concordance (LOW / WARNING / MODERATE / HIGH) |
55
+ |---|---|---|
56
+ | SV | 0.891 | 12% / 35% / 50% / **86%** |
57
+ | STR | 0.831 | 16% / 40% / 60% / **85%** |
58
 
59
+ On the external cohort, observed concordance rises monotonically from LOW to
60
+ HIGH for both models; isotonic calibration lowers the Brier score (SV 0.077→0.074, STR 0.167→0.159).
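
For reference, the Brier score quoted above is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. A minimal sketch in plain Python; the function name and the toy values are hypothetical and do not reproduce the reported figures:

```python
def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Hypothetical toy values; 1 = concordant with the long-read truth.
toy_labels = [1, 1, 0, 1, 0]
toy_raw = [0.6, 0.7, 0.4, 0.8, 0.3]
toy_calibrated = [0.8, 0.9, 0.2, 0.9, 0.1]
print(brier_score(toy_raw, toy_labels), brier_score(toy_calibrated, toy_labels))
```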
61
 
62
+ ## Usage
63
+ ```bash
64
+ pip install -U huggingface_hub scikit-learn==1.7.1
65
+ hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
66
+ # features from a single-sample VCF, then score:
67
+ python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
68
+ --fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
69
+ python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
70
+ ```
71
+ Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score
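
Once a scored table exists, candidate triage amounts to keeping the HIGH rows. The sketch below uses only the standard library and reads from an in-memory stand-in rather than a real file. It assumes the output is tab-delimited and carries `CS` and `tier` columns as in `example/str_scored.tsv`; the rows are a column subset copied from that example, and whether the SV output has the same columns is an assumption. The function name `high_tier_calls` is hypothetical.

```python
import csv
import io

# Stand-in for scored.tsv: a column subset of example/str_scored.tsv,
# assumed here to be tab-delimited.
scored_text = (
    "sample\tvariant_ID\tCS_raw\tCS\ttier\n"
    "HG00097\tchr1:165954:165962\t0.7378\t0.8546\tHIGH\n"
    "HG00097\tchr1:370632:370648\t0.3337\t0.3967\tWARNING\n"
    "HG00097\tchr1:832736:832781\t0.9082\t0.9876\tHIGH\n"
    "HG00097\tchr1:1010481:1010497\t0.1697\t0.1673\tLOW\n"
)


def high_tier_calls(handle):
    """Return the rows whose tier column is HIGH, in file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["tier"] == "HIGH"]


high = high_tier_calls(io.StringIO(scored_text))
for row in high:
    print(row["variant_ID"], row["CS"])
```

To run this against a real file, pass an open file handle (for example `open("scored.tsv", newline="")`) in place of the `io.StringIO` object.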
72
+
73
+ ## Limitations
74
+ - Relies on **caller support / breakpoint-confidence fields** (PR/SR, CIPOS/CIEND,
75
+ VAF, GQ, depth). On **merged or heavily filtered call sets that drop these**,
76
+ scores deflate and tiers are unreliable (rank-discrimination degrades only
77
+ moderately, but calibration breaks).
78
+ - Strongest for **DEL/INS**; DUP/BND are mostly down-weighted/triaged.
79
+ - STR scoring is **bound to the caller's genotyped locus set**.
80
+
81
+ ## Citation
82
+ Kim W\*, Yeom K\*, et al. *SVSTR-Score* (manuscript in preparation). Licence: MIT.
example/str_features.tsv CHANGED
@@ -1,7 +1,51 @@
1
- chrom pos repcn_a1 repcn_a2 ru_length ref_count delta_from_ref_a1 delta_from_ref_a2 ci_width_a1 ci_width_a2 adsp_a1 adsp_a2 adfl_a1 adfl_a2 adir_a1 adir_a2 total_support_a1 total_support_a2 spanning_frac_a1 spanning_frac_a2 locus_coverage allele_balance support_type_a1 support_type_a2 is_low_depth
2
- chr12 10728964 8.0 8.0 2 19 -11.0 -11.0 7.0 146.0 3.0 3.0 2.0 2.0 0.0 0.0 5.0 5.0 0.6 0.6 23.5197 1.0 2.0 2.0 1
3
- chr13 70022282 11.0 18.0 2 11 0.0 7.0 0.0 23.0 34.0 1.0 5.0 9.0 0.0 0.0 39.0 10.0 0.8717948717948718 0.1 40.7512 0.2564102564102564 2.0 2.0 0
4
- chr4 98868919 6.0 6.0 3 6 0.0 0.0 0.0 0.0 27.0 27.0 9.0 9.0 0.0 0.0 36.0 36.0 0.75 0.75 37.1579 1.0 2.0 2.0 0
5
- chr6 88530656 6.0 5.0 4 6 0.0 -1.0 0.0 0.0 13.0 8.0 8.0 4.0 0.0 0.0 21.0 12.0 0.6190476190476191 0.6666666666666666 37.5663 0.5714285714285714 2.0 2.0 0
6
- chr8 37146469 14.0 18.0 2 14 0.0 4.0 0.0 0.0 15.0 15.0 14.0 18.0 0.0 0.0 29.0 33.0 0.5172413793103449 0.4545454545454545 43.7729 0.8787878787878788 2.0 2.0 0
7
- chr8 50668495 9.0 15.0 3 14 -5.0 1.0 0.0 0.0 8.0 6.0 16.0 18.0 0.0 0.0 24.0 24.0 0.3333333333333333 0.25 37.7296 1.0 2.0 2.0 0
1
+ sample caller variant_ID is_pass motif_len ref_copynum gt_repcn_max gt_repcn_min expansion_over_ref repci_width_max spanning_reads flanking_reads inrepeat_reads locus_depth gt_hom ref_tract_bp spanning_frac allele_vs_readlen motif_is_homopolymer gc_flank entropy_flank in_segdup in_difficult flank_lowmap
2
+ HG00097 expansionhunter chr1:165954:165962 1 4.0 2.0 2.0 1.0 0.0 0.0 25.0 2.0 0.0 34.5405 0.0 8.0 0.9259259259259259 0.05333333333333334 0 0.297029702970297 1.8767240669197331 1 1 1
3
+ HG00097 expansionhunter chr1:370632:370648 1 4.0 4.0 4.0 3.0 0.0 1.0 30.0 14.0 0.0 40.3784 0.0 16.0 0.6818181818181818 0.10666666666666667 0 0.37623762376237624 1.7475795513453003 1 1 1
4
+ HG00097 expansionhunter chr1:832736:832781 1 5.0 9.0 10.0 10.0 1.0 0.0 44.0 104.0 0.0 42.8919 1.0 45.0 0.2972972972972973 0.3333333333333333 0 0.36633663366336633 1.7496871374624745 1 1 0
5
+ HG00097 expansionhunter chr1:932613:932621 1 4.0 2.0 1.0 1.0 -1.0 0.0 88.0 4.0 0.0 49.3784 1.0 8.0 0.9565217391304348 0.02666666666666667 0 0.6138613861386139 1.93240710844821 0 1 0
6
+ HG00097 expansionhunter chr1:1010481:1010497 1 4.0 4.0 3.0 3.0 -1.0 0.0 62.0 20.0 0.0 35.1081 1.0 16.0 0.7560975609756098 0.08 0 0.13861386138613863 1.5081706750248076 0 1 0
7
+ HG00097 expansionhunter chr1:1052514:1052528 1 2.0 7.0 8.0 7.0 1.0 0.0 40.0 4.0 0.0 46.3784 0.0 14.0 0.9090909090909091 0.10666666666666667 0 0.46534653465346537 1.8443047122008371 0 1 0
8
+ HG00097 expansionhunter chr1:1063661:1063681 1 2.0 10.0 10.0 9.0 0.0 0.0 39.0 26.0 0.0 51.0811 0.0 20.0 0.6 0.13333333333333333 0 0.5643564356435643 1.7595456304802637 0 1 0
9
+ HG00097 expansionhunter chr1:1265603:1265649 1 2.0 23.0 24.0 23.0 1.0 0.0 19.0 44.0 0.0 38.3514 0.0 46.0 0.30158730158730157 0.32 0 0.48514851485148514 1.4319496839589465 0 1 0
10
+ HG00097 expansionhunter chr1:1431026:1431058 1 8.0 4.0 5.0 4.0 1.0 0.0 25.0 65.0 0.0 53.2703 0.0 32.0 0.2777777777777778 0.26666666666666666 0 0.7128712871287128 1.8074915272734042 0 1 0
11
+ HG00097 expansionhunter chr1:1585949:1585976 1 3.0 9.0 16.0 15.0 7.0 0.0 23.0 57.0 0.0 44.4324 0.0 27.0 0.2875 0.32 0 0.33663366336633666 1.870380810485376 0 1 0
12
+ HG00097 expansionhunter chr1:1653825:1653863 1 2.0 19.0 21.0 21.0 2.0 1.0 22.0 28.0 0.0 34.8649 1.0 38.0 0.44 0.28 0 0.49504950495049505 1.861174864442864 1 1 0
13
+ HG00097 expansionhunter chr1:1752908:1752935 1 3.0 9.0 10.0 9.0 1.0 0.0 39.0 31.0 0.0 49.4595 0.0 27.0 0.5571428571428572 0.2 0 0.6435643564356436 1.8970057796754802 0 1 0
14
+ HG00097 expansionhunter chr1:1762759:1762791 1 4.0 8.0 9.0 8.0 1.0 0.0 38.0 32.0 0.0 40.5405 0.0 32.0 0.5428571428571428 0.24 0 0.38613861386138615 1.9107593839500945 0 1 0
15
+ HG00097 expansionhunter chr1:1769969:1770011 1 6.0 7.0 5.0 5.0 -2.0 0.0 36.0 38.0 0.0 44.7568 1.0 42.0 0.4864864864864865 0.2 0 0.32673267326732675 1.680234216638318 0 1 0
16
+ HG00097 expansionhunter chr1:1776461:1776473 1 4.0 3.0 2.0 2.0 -1.0 0.0 74.0 12.0 0.0 38.7568 1.0 12.0 0.8604651162790697 0.05333333333333334 0 0.45544554455445546 1.8897699414887892 0 0 0
17
+ HG00097 expansionhunter chr1:1812407:1812429 1 2.0 11.0 12.0 12.0 1.0 0.0 64.0 20.0 0.0 39.6486 1.0 22.0 0.7619047619047619 0.16 0 0.40594059405940597 1.8347498835870788 0 1 0
18
+ HG00097 expansionhunter chr1:1845824:1845872 1 2.0 24.0 23.0 20.0 -1.0 0.0 37.0 34.0 0.0 43.3784 0.0 48.0 0.5211267605633803 0.30666666666666664 0 0.40594059405940597 1.8265902247349062 0 1 0
19
+ HG00097 expansionhunter chr1:1891654:1891658 1 2.0 2.0 4.0 4.0 2.0 0.0 60.0 4.0 0.0 40.7027 1.0 4.0 0.9375 0.05333333333333334 0 0.7128712871287128 1.854785288155074 0 1 0
20
+ HG00097 expansionhunter chr1:1904424:1904448 1 4.0 6.0 8.0 8.0 2.0 0.0 44.0 32.0 0.0 39.2432 1.0 24.0 0.5789473684210527 0.21333333333333335 0 0.3069306930693069 1.7889467525708116 0 1 0
21
+ HG00097 expansionhunter chr1:1948412:1948428 1 2.0 8.0 10.0 10.0 2.0 0.0 78.0 20.0 0.0 37.7838 1.0 16.0 0.7959183673469388 0.13333333333333333 0 0.33663366336633666 1.908726104503474 0 1 0
22
+ HG00097 expansionhunter chr1:2003928:2003940 1 6.0 2.0 3.0 2.0 1.0 0.0 51.0 19.0 0.0 45.8919 0.0 12.0 0.7285714285714285 0.12 0 0.7920792079207921 1.7281983769914673 0 1 0
23
+ HG00097 expansionhunter chr1:2012279:2012309 1 5.0 6.0 8.0 6.0 2.0 0.0 23.0 59.0 0.0 40.8649 0.0 30.0 0.2804878048780488 0.26666666666666666 0 0.2871287128712871 1.8125908839274099 0 1 0
24
+ HG00097 expansionhunter chr1:2018334:2018389 1 5.0 11.0 11.0 10.0 0.0 0.0 29.0 43.0 0.0 34.7838 0.0 55.0 0.4027777777777778 0.36666666666666664 0 0.36633663366336633 1.8582276266039703 0 1 0
25
+ HG00097 expansionhunter chr1:2112675:2112691 1 4.0 4.0 5.0 5.0 1.0 0.0 68.0 26.0 0.0 48.4054 1.0 16.0 0.723404255319149 0.13333333333333333 0 0.37623762376237624 1.7586128527134226 0 1 0
26
+ HG00097 expansionhunter chr1:2173095:2173103 1 4.0 2.0 2.0 1.0 0.0 0.0 44.0 8.0 0.0 41.6757 0.0 8.0 0.8461538461538461 0.05333333333333334 0 0.504950495049505 1.9625707732852478 0 0 0
27
+ HG00097 expansionhunter chr1:2207909:2207955 1 2.0 23.0 25.0 20.0 2.0 0.0 25.0 65.0 0.0 39.8919 0.0 46.0 0.2777777777777778 0.3333333333333333 0 0.5841584158415841 1.7558294024999062 0 1 0
28
+ HG00097 expansionhunter chr1:2219851:2219871 1 5.0 4.0 4.0 3.0 0.0 0.0 21.0 9.0 0.0 41.7568 0.0 20.0 0.7 0.13333333333333333 0 0.4752475247524752 1.8595044600250858 0 1 0
29
+ HG00097 expansionhunter chr1:2345829:2345855 1 2.0 13.0 14.0 10.0 1.0 0.0 24.0 25.0 0.0 47.1892 0.0 26.0 0.4897959183673469 0.18666666666666668 0 0.5247524752475248 1.9180855002676722 0 1 0
30
+ HG00097 expansionhunter chr1:2371211:2371241 1 3.0 10.0 13.0 10.0 3.0 0.0 27.0 40.0 0.0 50.1892 0.0 30.0 0.40298507462686567 0.26 0 0.5148514851485149 1.8305650137637772 0 1 0
31
+ HG00097 expansionhunter chr1:2431330:2431346 1 2.0 8.0 9.0 8.0 1.0 0.0 41.0 32.0 0.0 48.6486 0.0 16.0 0.5616438356164384 0.12 0 0.5841584158415841 1.9166005274309357 0 1 0
32
+ HG00097 expansionhunter chr1:2435454:2435489 1 7.0 5.0 5.0 4.0 0.0 0.0 39.0 30.0 0.0 39.0 0.0 35.0 0.5652173913043478 0.23333333333333334 0 0.5742574257425742 1.9335830591930787 0 1 0
33
+ HG00097 expansionhunter chr1:2449499:2449535 1 4.0 9.0 9.0 8.0 0.0 0.0 17.0 48.0 0.0 40.9459 0.0 36.0 0.26153846153846155 0.24 0 0.5247524752475248 1.8838136852474037 0 1 0
34
+ HG00097 expansionhunter chr1:2508612:2508630 1 2.0 9.0 11.0 9.0 2.0 0.0 35.0 18.0 0.0 49.0541 0.0 18.0 0.660377358490566 0.14666666666666667 0 0.3564356435643564 1.933754201968858 0 1 0
35
+ HG00097 expansionhunter chr1:2566825:2566869 1 4.0 11.0 12.0 11.0 1.0 0.0 19.0 124.0 0.0 47.8378 0.0 44.0 0.13286713286713286 0.32 0 0.4158415841584158 1.9433008996133987 0 1 0
36
+ HG00097 expansionhunter chr1:2580534:2580555 1 3.0 7.0 8.0 8.0 1.0 0.0 42.0 32.0 0.0 33.8108 1.0 21.0 0.5675675675675675 0.16 0 0.37623762376237624 1.9016529508428732 0 1 0
37
+ HG00097 expansionhunter chr1:2600684:2600708 1 4.0 6.0 6.0 3.0 0.0 0.0 35.0 17.0 0.0 37.9459 0.0 24.0 0.6730769230769231 0.16 0 0.44554455445544555 1.8855340778294845 0 1 0
38
+ HG00097 expansionhunter chr1:2782508:2782518 1 2.0 5.0 6.0 6.0 1.0 0.0 80.0 8.0 0.0 40.2162 1.0 10.0 0.9090909090909091 0.08 0 0.31683168316831684 1.539579197931996 0 1 0
39
+ HG00097 expansionhunter chr1:2784990:2785010 1 10.0 2.0 4.0 3.0 2.0 0.0 33.0 28.0 0.0 40.1351 0.0 20.0 0.5409836065573771 0.26666666666666666 0 0.693069306930693 1.849024777034605 0 1 0
40
+ HG00097 expansionhunter chr1:2824891:2824941 1 2.0 25.0 23.0 22.0 -2.0 0.0 20.0 28.0 0.0 41.1081 0.0 50.0 0.4166666666666667 0.30666666666666664 0 0.5148514851485149 1.736792200782953 0 1 0
41
+ HG00097 expansionhunter chr1:2847577:2847593 1 8.0 2.0 2.0 1.0 0.0 0.0 45.0 4.0 0.0 41.3514 0.0 16.0 0.9183673469387755 0.10666666666666667 0 0.6039603960396039 1.893728291589638 0 1 0
42
+ HG00097 expansionhunter chr1:2899043:2899075 1 4.0 8.0 7.0 7.0 -1.0 0.0 54.0 32.0 0.0 41.5946 1.0 32.0 0.627906976744186 0.18666666666666668 0 0.44554455445544555 1.7976208220041983 0 1 0
43
+ HG00097 expansionhunter chr1:2952909:2952927 1 2.0 9.0 12.0 12.0 3.0 0.0 60.0 22.0 0.0 36.7297 1.0 18.0 0.7317073170731707 0.16 0 0.6138613861386139 1.4586949056098804 0 1 0
44
+ HG00097 expansionhunter chr1:3089915:3089939 1 4.0 6.0 8.0 8.0 2.0 0.0 72.0 34.0 0.0 49.6216 1.0 24.0 0.6792452830188679 0.21333333333333335 0 0.44554455445544555 1.9438900377229549 0 1 0
45
+ HG00097 expansionhunter chr1:3109083:3109116 1 3.0 11.0 11.0 9.0 0.0 0.0 24.0 49.0 0.0 41.3514 0.0 33.0 0.3287671232876712 0.22 0 0.32673267326732675 1.8491066059426058 0 1 0
46
+ HG00097 expansionhunter chr1:3152348:3152380 1 4.0 8.0 6.0 6.0 -2.0 0.0 58.0 10.0 0.0 44.9189 1.0 32.0 0.8529411764705882 0.16 0 0.43564356435643564 1.6103551420752904 0 1 0
47
+ HG00097 expansionhunter chr1:3175983:3176011 1 4.0 7.0 8.0 8.0 1.0 0.0 50.0 42.0 0.0 45.2432 1.0 28.0 0.5434782608695652 0.21333333333333335 0 0.5643564356435643 1.8171755888564571 0 1 0
48
+ HG00097 expansionhunter chr1:3225530:3225574 1 2.0 22.0 21.0 17.0 -1.0 0.0 20.0 40.0 0.0 42.973 0.0 44.0 0.3333333333333333 0.28 0 0.5742574257425742 1.7700475389123633 0 1 0
49
+ HG00097 expansionhunter chr1:3240056:3240106 1 5.0 10.0 5.0 4.0 -5.0 0.0 31.0 16.0 0.0 39.6486 0.0 50.0 0.6595744680851063 0.16666666666666666 0 0.5544554455445545 1.0599256723036075 0 1 0
50
+ HG00097 expansionhunter chr1:3273438:3273450 1 2.0 6.0 4.0 4.0 -2.0 0.0 64.0 12.0 0.0 36.7297 1.0 12.0 0.8421052631578947 0.05333333333333334 0 0.5247524752475248 1.856780293719988 0 1 0
51
+ HG00097 expansionhunter chr1:3273817:3273851 1 2.0 17.0 17.0 15.0 0.0 0.0 25.0 34.0 0.0 36.973 0.0 34.0 0.423728813559322 0.22666666666666666 0 0.4752475247524752 1.78618987974027 0 1 0
example/str_scored.tsv CHANGED
@@ -1,7 +1,51 @@
1
- chrom pos repcn_a1 repcn_a2 ru_length ref_count delta_from_ref_a1 delta_from_ref_a2 ci_width_a1 ci_width_a2 adsp_a1 adsp_a2 adfl_a1 adfl_a2 adir_a1 adir_a2 total_support_a1 total_support_a2 spanning_frac_a1 spanning_frac_a2 locus_coverage allele_balance support_type_a1 support_type_a2 is_low_depth locus_conc_rate locus_in_lookup CS tier
2
- chr12 10728964 8.0 8.0 2 19 -11.0 -11.0 7.0 146.0 3.0 3.0 2.0 2.0 0.0 0.0 5.0 5.0 0.6 0.6 23.5197 1.0 2.0 2.0 1 0.0 1 0.005 WARNING
3
- chr13 70022282 11.0 18.0 2 11 0.0 7.0 0.0 23.0 34.0 1.0 5.0 9.0 0.0 0.0 39.0 10.0 0.8717948717948718 0.1 40.7512 0.2564102564102564 2.0 2.0 0 0.0 1 0.005 WARNING
4
- chr4 98868919 6.0 6.0 3 6 0.0 0.0 0.0 0.0 27.0 27.0 9.0 9.0 0.0 0.0 36.0 36.0 0.75 0.75 37.1579 1.0 2.0 2.0 0 1.0 1 1.0 HIGH
5
- chr6 88530656 6.0 5.0 4 6 0.0 -1.0 0.0 0.0 13.0 8.0 8.0 4.0 0.0 0.0 21.0 12.0 0.6190476190476191 0.6666666666666666 37.5663 0.5714285714285714 2.0 2.0 0 0.0 1 0.0 LOW
6
- chr8 37146469 14.0 18.0 2 14 0.0 4.0 0.0 0.0 15.0 15.0 14.0 18.0 0.0 0.0 29.0 33.0 0.5172413793103449 0.4545454545454545 43.7729 0.8787878787878788 2.0 2.0 0 0.7062937062937062 1 0.97 HIGH
7
- chr8 50668495 9.0 15.0 3 14 -5.0 1.0 0.0 0.0 8.0 6.0 16.0 18.0 0.0 0.0 24.0 24.0 0.3333333333333333 0.25 37.7296 1.0 2.0 2.0 0 0.986013986013986 1 1.0 HIGH
1
+ sample caller variant_ID is_pass motif_len ref_copynum gt_repcn_max gt_repcn_min expansion_over_ref repci_width_max spanning_reads flanking_reads inrepeat_reads locus_depth gt_hom ref_tract_bp spanning_frac allele_vs_readlen motif_is_homopolymer gc_flank entropy_flank in_segdup in_difficult flank_lowmap CS_raw CS tier
2
+ HG00097 expansionhunter chr1:165954:165962 1 4.0 2.0 2.0 1.0 0.0 0.0 25.0 2.0 0.0 34.5405 0.0 8.0 0.925925925925926 0.0533333333333333 0 0.297029702970297 1.8767240669197327 1 1 1 0.7378 0.8546 HIGH
3
+ HG00097 expansionhunter chr1:370632:370648 1 4.0 4.0 4.0 3.0 0.0 1.0 30.0 14.0 0.0 40.3784 0.0 16.0 0.6818181818181818 0.1066666666666666 0 0.3762376237623762 1.7475795513453003 1 1 1 0.3337 0.3967 WARNING
4
+ HG00097 expansionhunter chr1:832736:832781 1 5.0 9.0 10.0 10.0 1.0 0.0 44.0 104.0 0.0 42.8919 1.0 45.0 0.2972972972972973 0.3333333333333333 0 0.3663366336633663 1.7496871374624745 1 1 0 0.9082 0.9876 HIGH
5
+ HG00097 expansionhunter chr1:932613:932621 1 4.0 2.0 1.0 1.0 -1.0 0.0 88.0 4.0 0.0 49.3784 1.0 8.0 0.9565217391304348 0.0266666666666666 0 0.6138613861386139 1.93240710844821 0 1 0 0.4569 0.5696 MODERATE
6
+ HG00097 expansionhunter chr1:1010481:1010497 1 4.0 4.0 3.0 3.0 -1.0 0.0 62.0 20.0 0.0 35.1081 1.0 16.0 0.7560975609756098 0.08 0 0.1386138613861386 1.5081706750248076 0 1 0 0.1697 0.1673 LOW
7
+ HG00097 expansionhunter chr1:1052514:1052528 1 2.0 7.0 8.0 7.0 1.0 0.0 40.0 4.0 0.0 46.3784 0.0 14.0 0.9090909090909092 0.1066666666666666 0 0.4653465346534653 1.8443047122008367 0 1 0 0.1755 0.1752 LOW
8
+ HG00097 expansionhunter chr1:1063661:1063681 1 2.0 10.0 10.0 9.0 0.0 0.0 39.0 26.0 0.0 51.0811 0.0 20.0 0.6 0.1333333333333333 0 0.5643564356435643 1.7595456304802637 0 1 0 0.0987 0.0615 LOW
9
+ HG00097 expansionhunter chr1:1265603:1265649 1 2.0 23.0 24.0 23.0 1.0 0.0 19.0 44.0 0.0 38.3514 0.0 46.0 0.3015873015873015 0.32 0 0.4851485148514851 1.4319496839589465 0 1 0 0.1248 0.0944 LOW
10
+ HG00097 expansionhunter chr1:1431026:1431058 1 8.0 4.0 5.0 4.0 1.0 0.0 25.0 65.0 0.0 53.2703 0.0 32.0 0.2777777777777778 0.2666666666666666 0 0.7128712871287128 1.807491527273404 0 1 0 0.6394 0.7505 HIGH
11
+ HG00097 expansionhunter chr1:1585949:1585976 1 3.0 9.0 16.0 15.0 7.0 0.0 23.0 57.0 0.0 44.4324 0.0 27.0 0.2875 0.32 0 0.3366336633663366 1.870380810485376 0 1 0 0.8984 0.9819 HIGH
12
+ HG00097 expansionhunter chr1:1653825:1653863 1 2.0 19.0 21.0 21.0 2.0 1.0 22.0 28.0 0.0 34.8649 1.0 38.0 0.44 0.28 0 0.495049504950495 1.861174864442864 1 1 0 0.6052 0.7199 HIGH
13
+ HG00097 expansionhunter chr1:1752908:1752935 1 3.0 9.0 10.0 9.0 1.0 0.0 39.0 31.0 0.0 49.4595 0.0 27.0 0.5571428571428572 0.2 0 0.6435643564356436 1.89700577967548 0 1 0 0.7839 0.8853 HIGH
14
+ HG00097 expansionhunter chr1:1762759:1762791 1 4.0 8.0 9.0 8.0 1.0 0.0 38.0 32.0 0.0 40.5405 0.0 32.0 0.5428571428571428 0.24 0 0.3861386138613861 1.9107593839500945 0 1 0 0.9107 0.988 HIGH
15
+ HG00097 expansionhunter chr1:1769969:1770011 1 6.0 7.0 5.0 5.0 -2.0 0.0 36.0 38.0 0.0 44.7568 1.0 42.0 0.4864864864864865 0.2 0 0.3267326732673267 1.680234216638318 0 1 0 0.8992 0.9829 HIGH
16
+ HG00097 expansionhunter chr1:1776461:1776473 1 4.0 3.0 2.0 2.0 -1.0 0.0 74.0 12.0 0.0 38.7568 1.0 12.0 0.8604651162790697 0.0533333333333333 0 0.4554455445544554 1.8897699414887887 0 0 0 0.8386 0.9299 HIGH
17
+ HG00097 expansionhunter chr1:1812407:1812429 1 2.0 11.0 12.0 12.0 1.0 0.0 64.0 20.0 0.0 39.6486 1.0 22.0 0.7619047619047619 0.16 0 0.4059405940594059 1.8347498835870788 0 1 0 0.2376 0.2636 LOW
18
+ HG00097 expansionhunter chr1:1845824:1845872 1 2.0 24.0 23.0 20.0 -1.0 0.0 37.0 34.0 0.0 43.3784 0.0 48.0 0.5211267605633803 0.3066666666666666 0 0.4059405940594059 1.8265902247349064 0 1 0 0.7847 0.8853 HIGH
19
+ HG00097 expansionhunter chr1:1891654:1891658 1 2.0 2.0 4.0 4.0 2.0 0.0 60.0 4.0 0.0 40.7027 1.0 4.0 0.9375 0.0533333333333333 0 0.7128712871287128 1.854785288155074 0 1 0 0.1098 0.0718 LOW
20
+ HG00097 expansionhunter chr1:1904424:1904448 1 4.0 6.0 8.0 8.0 2.0 0.0 44.0 32.0 0.0 39.2432 1.0 24.0 0.5789473684210527 0.2133333333333333 0 0.3069306930693069 1.7889467525708116 0 1 0 0.6801 0.8002 HIGH
21
+ HG00097 expansionhunter chr1:1948412:1948428 1 2.0 8.0 10.0 10.0 2.0 0.0 78.0 20.0 0.0 37.7838 1.0 16.0 0.7959183673469388 0.1333333333333333 0 0.3366336633663366 1.908726104503474 0 1 0 0.3487 0.4133 WARNING
22
+ HG00097 expansionhunter chr1:2003928:2003940 1 6.0 2.0 3.0 2.0 1.0 0.0 51.0 19.0 0.0 45.8919 0.0 12.0 0.7285714285714285 0.12 0 0.7920792079207921 1.7281983769914673 0 1 0 0.7141 0.8336 HIGH
23
+ HG00097 expansionhunter chr1:2012279:2012309 1 5.0 6.0 8.0 6.0 2.0 0.0 23.0 59.0 0.0 40.8649 0.0 30.0 0.2804878048780488 0.2666666666666666 0 0.2871287128712871 1.8125908839274096 0 1 0 0.9037 0.985 HIGH
24
+ HG00097 expansionhunter chr1:2018334:2018389 1 5.0 11.0 11.0 10.0 0.0 0.0 29.0 43.0 0.0 34.7838 0.0 55.0 0.4027777777777778 0.3666666666666666 0 0.3663366336633663 1.8582276266039703 0 1 0 0.8468 0.9378 HIGH
25
+ HG00097 expansionhunter chr1:2112675:2112691 1 4.0 4.0 5.0 5.0 1.0 0.0 68.0 26.0 0.0 48.4054 1.0 16.0 0.723404255319149 0.1333333333333333 0 0.3762376237623762 1.7586128527134226 0 1 0 0.371 0.4441 WARNING
26
+ HG00097 expansionhunter chr1:2173095:2173103 1 4.0 2.0 2.0 1.0 0.0 0.0 44.0 8.0 0.0 41.6757 0.0 8.0 0.8461538461538461 0.0533333333333333 0 0.504950495049505 1.962570773285248 0 0 0 0.6207 0.7324 HIGH
27
+ HG00097 expansionhunter chr1:2207909:2207955 1 2.0 23.0 25.0 20.0 2.0 0.0 25.0 65.0 0.0 39.8919 0.0 46.0 0.2777777777777778 0.3333333333333333 0 0.5841584158415841 1.7558294024999062 0 1 0 0.7201 0.8388 HIGH
28
+ HG00097 expansionhunter chr1:2219851:2219871 1 5.0 4.0 4.0 3.0 0.0 0.0 21.0 9.0 0.0 41.7568 0.0 20.0 0.7 0.1333333333333333 0 0.4752475247524752 1.8595044600250856 0 1 0 0.7221 0.8402 HIGH
29
+ HG00097 expansionhunter chr1:2345829:2345855 1 2.0 13.0 14.0 10.0 1.0 0.0 24.0 25.0 0.0 47.1892 0.0 26.0 0.4897959183673469 0.1866666666666666 0 0.5247524752475248 1.918085500267672 0 1 0 0.3912 0.4822 WARNING
30
+ HG00097 expansionhunter chr1:2371211:2371241 1 3.0 10.0 13.0 10.0 3.0 0.0 27.0 40.0 0.0 50.1892 0.0 30.0 0.4029850746268656 0.26 0 0.5148514851485149 1.8305650137637768 0 1 0 0.796 0.8924 HIGH
31
+ HG00097 expansionhunter chr1:2431330:2431346 1 2.0 8.0 9.0 8.0 1.0 0.0 41.0 32.0 0.0 48.6486 0.0 16.0 0.5616438356164384 0.12 0 0.5841584158415841 1.916600527430936 0 1 0 0.2435 0.2759 LOW
32
+ HG00097 expansionhunter chr1:2435454:2435489 1 7.0 5.0 5.0 4.0 0.0 0.0 39.0 30.0 0.0 39.0 0.0 35.0 0.5652173913043478 0.2333333333333333 0 0.5742574257425742 1.9335830591930787 0 1 0 0.7708 0.8778 HIGH
33
+ HG00097 expansionhunter chr1:2449499:2449535 1 4.0 9.0 9.0 8.0 0.0 0.0 17.0 48.0 0.0 40.9459 0.0 36.0 0.2615384615384615 0.24 0 0.5247524752475248 1.883813685247404 0 1 0 0.8208 0.9087 HIGH
34
+ HG00097 expansionhunter chr1:2508612:2508630 1 2.0 9.0 11.0 9.0 2.0 0.0 35.0 18.0 0.0 49.0541 0.0 18.0 0.660377358490566 0.1466666666666666 0 0.3564356435643564 1.933754201968858 0 1 0 0.6421 0.7584 HIGH
35
+ HG00097 expansionhunter chr1:2566825:2566869 1 4.0 11.0 12.0 11.0 1.0 0.0 19.0 124.0 0.0 47.8378 0.0 44.0 0.1328671328671328 0.32 0 0.4158415841584158 1.9433008996133987 0 1 0 0.7715 0.8778 HIGH
36
+ HG00097 expansionhunter chr1:2580534:2580555 1 3.0 7.0 8.0 8.0 1.0 0.0 42.0 32.0 0.0 33.8108 1.0 21.0 0.5675675675675675 0.16 0 0.3762376237623762 1.9016529508428728 0 1 0 0.8111 0.9036 HIGH
37
+ HG00097 expansionhunter chr1:2600684:2600708 1 4.0 6.0 6.0 3.0 0.0 0.0 35.0 17.0 0.0 37.9459 0.0 24.0 0.6730769230769231 0.16 0 0.4455445544554455 1.8855340778294845 0 1 0 0.8571 0.9491 HIGH
38
+ HG00097 expansionhunter chr1:2782508:2782518 1 2.0 5.0 6.0 6.0 1.0 0.0 80.0 8.0 0.0 40.2162 1.0 10.0 0.9090909090909092 0.08 0 0.3168316831683168 1.539579197931996 0 1 0 0.0884 0.052 LOW
39
+ HG00097 expansionhunter chr1:2784990:2785010 1 10.0 2.0 4.0 3.0 2.0 0.0 33.0 28.0 0.0 40.1351 0.0 20.0 0.5409836065573771 0.2666666666666666 0 0.693069306930693 1.849024777034605 0 1 0 0.4897 0.5908 MODERATE
40
+ HG00097 expansionhunter chr1:2824891:2824941 1 2.0 25.0 23.0 22.0 -2.0 0.0 20.0 28.0 0.0 41.1081 0.0 50.0 0.4166666666666667 0.3066666666666666 0 0.5148514851485149 1.736792200782953 0 1 0 0.6974 0.8162 HIGH
41
+ HG00097 expansionhunter chr1:2847577:2847593 1 8.0 2.0 2.0 1.0 0.0 0.0 45.0 4.0 0.0 41.3514 0.0 16.0 0.9183673469387756 0.1066666666666666 0 0.6039603960396039 1.893728291589638 0 1 0 0.523 0.6287 MODERATE
42
+ HG00097 expansionhunter chr1:2899043:2899075 1 4.0 8.0 7.0 7.0 -1.0 0.0 54.0 32.0 0.0 41.5946 1.0 32.0 0.627906976744186 0.1866666666666666 0 0.4455445544554455 1.7976208220041985 0 1 0 0.8667 0.9568 HIGH
43
+ HG00097 expansionhunter chr1:2952909:2952927 1 2.0 9.0 12.0 12.0 3.0 0.0 60.0 22.0 0.0 36.7297 1.0 18.0 0.7317073170731707 0.16 0 0.6138613861386139 1.4586949056098804 0 1 0 0.0848 0.0452 LOW
44
+ HG00097 expansionhunter chr1:3089915:3089939 1 4.0 6.0 8.0 8.0 2.0 0.0 72.0 34.0 0.0 49.6216 1.0 24.0 0.6792452830188679 0.2133333333333333 0 0.4455445544554455 1.9438900377229549 0 1 0 0.9023 0.985 HIGH
45
+ HG00097 expansionhunter chr1:3109083:3109116 1 3.0 11.0 11.0 9.0 0.0 0.0 24.0 49.0 0.0 41.3514 0.0 33.0 0.3287671232876712 0.22 0 0.3267326732673267 1.8491066059426056 0 1 0 0.8115 0.9036 HIGH
46
+ HG00097 expansionhunter chr1:3152348:3152380 1 4.0 8.0 6.0 6.0 -2.0 0.0 58.0 10.0 0.0 44.9189 1.0 32.0 0.8529411764705882 0.16 0 0.4356435643564356 1.6103551420752904 0 1 0 0.4377 0.5542 MODERATE
47
+ HG00097 expansionhunter chr1:3175983:3176011 1 4.0 7.0 8.0 8.0 1.0 0.0 50.0 42.0 0.0 45.2432 1.0 28.0 0.5434782608695652 0.2133333333333333 0 0.5643564356435643 1.8171755888564567 0 1 0 0.6944 0.8135 HIGH
48
+ HG00097 expansionhunter chr1:3225530:3225574 1 2.0 22.0 21.0 17.0 -1.0 0.0 20.0 40.0 0.0 42.973 0.0 44.0 0.3333333333333333 0.28 0 0.5742574257425742 1.770047538912363 0 1 0 0.6244 0.7366 HIGH
49
+ HG00097 expansionhunter chr1:3240056:3240106 1 5.0 10.0 5.0 4.0 -5.0 0.0 31.0 16.0 0.0 39.6486 0.0 50.0 0.6595744680851063 0.1666666666666666 0 0.5544554455445545 1.0599256723036077 0 1 0 0.2903 0.3499 WARNING
50
+ HG00097 expansionhunter chr1:3273438:3273450 1 2.0 6.0 4.0 4.0 -2.0 0.0 64.0 12.0 0.0 36.7297 1.0 12.0 0.8421052631578947 0.0533333333333333 0 0.5247524752475248 1.856780293719988 0 1 0 0.2091 0.2239 LOW
51
+ HG00097 expansionhunter chr1:3273817:3273851 1 2.0 17.0 17.0 15.0 0.0 0.0 25.0 34.0 0.0 36.973 0.0 34.0 0.423728813559322 0.2266666666666666 0 0.4752475247524752 1.78618987974027 0 1 0 0.2897 0.3499 WARNING
example/sv_features.tsv CHANGED
@@ -1,8 +1,51 @@
1
- chrom pos qual gq pr_ref pr_alt sr_ref sr_alt vf_ref vf_alt total_alt_support sr_pr_ratio vaf_estimate strand_bias_fs cipos_width ciend_width homlen svlen_abs svtype_DEL svtype_INS svtype_DUP svtype_BND is_imprecise filter_pass n_filter_flags
2
- chr10 42254638 182 6 2.0 0.0 3.0 5.0 5.0 5.0 5.0 5.0 0.5 9.7 0.0 50.0 1 0 0 0 0 1 0
3
- chr12 12391920 999 281 11.0 20.0 3.0 14.0 12.0 30.0 34.0 0.6666666666666666 0.7083333333333334 7.186 0.0 0 0 0 1 0 1 0
4
- chr16 8562503 693 693 54.0 22.0 36.0 9.0 69.0 26.0 31.0 0.391304347826087 0.256198347107438 0.0 0.0 1781.0 0 0 1 0 0 1 0
5
- chr17 23232202 310 38 0.0 0.0 0.0 21.0 0.0 21.0 21.0 21.0 1.0 0.0 5.0 5.0 5.0 0 1 0 0 0 0 1
6
- chr3 163746379 999 118 0.0 10.0 0.0 33.0 0.0 43.0 43.0 3.0 1.0 0.0 8.0 8.0 306.0 1 0 0 0 0 1 0
7
- chr6 32648154 82 82 7.0 5.0 7.0 5.0 5.0 0.4166666666666667 415.0 238.0 0.0 4156.0 1 0 0 0 1 1 0
8
- chr8 91664843 999 148 0.0 28.0 0.0 36.0 0.0 56.0 64.0 1.2413793103448276 1.0 0.0 15.0 15.0 316.0 0 1 0 0 0 1 0
1
+ sample caller variant_ID is_pass svtype_DEL svtype_DUP svtype_INS svtype_INV svtype_BND svlen_log cipos_width ciend_width is_imprecise pe_support sr_support total_support vaf gt_hom gq qual_norm local_depth gc_min gc_max entropy_min microhom_max in_segdup_either in_segdup_both in_difficult_either in_difficult_both in_lowmap_either in_tandem_either in_Alu_either in_L1_either in_SVA_either in_LTR_either frac_span_repeat n_neighbors nn_log_dist
2
+ HG00097 manta chr1:861885:BND:None 1 0 0 0 0 1 -99999.0 31.0 -99999.0 0 5.0 7.0 12.0 0.15384615384615385 0 182.0 182.0 43.0 0.297029702970297 0.297029702970297 1.6350564503142975 -99999.0 1 1 1 1 0 1 0 0 0 0 -99999.0 2 4.424979586745809
3
+ HG00097 manta chr1:888490:DEL:888577 1 1 0 0 0 0 1.9444826721501687 55.0 -99999.0 0 0.0 5.0 5.0 0.2631578947368421 0 93.0 116.0 5.0 0.49504950495049505 0.5247524752475248 1.119875913690787 50.0 1 1 1 1 0 1 0 0 0 0 1.0 2 4.20382130251655
4
+ HG00097 manta chr1:904478:DEL:904576 1 1 0 0 0 0 1.99563519459755 14.0 -99999.0 0 0.0 10.0 10.0 0.4166666666666667 0 31.0 294.0 3.0 0.7227722772277227 0.7722772277227723 1.6490755403817534 14.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.20382130251655
5
+ HG00097 manta chr1:998763:INS:None 1 0 0 1 0 0 1.7708520116421442 -99999.0 -99999.0 0 0.0 15.0 15.0 0.6521739130434783 0 43.0 538.0 -99999.0 0.7920792079207921 0.7920792079207921 1.6906469647232745 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.472888033769636
6
+ HG00097 manta chr1:1028471:DEL:1029081 1 1 0 0 0 0 2.786041210242554 19.0 -99999.0 0 13.0 12.0 25.0 0.352112676056338 0 477.0 654.0 36.0 0.6534653465346535 0.7227722772277227 1.7262819301695698 19.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.472888033769636
7
+ HG00097 manta chr1:1068824:INS:None 1 0 0 1 0 0 1.8976270912904414 10.0 -99999.0 0 0.0 29.0 29.0 0.9666666666666667 1 75.0 999.0 -99999.0 0.8514851485148515 0.8514851485148515 1.5447689726449716 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.60588658966101
8
+ HG00097 manta chr1:1427385:DEL:1427442 0 1 0 0 0 0 1.7634279935629373 -99999.0 -99999.0 0 0.0 2.0 2.0 0.16666666666666666 0 13.0 13.0 3.0 0.693069306930693 0.7029702970297029 1.3844773445130296 0.0 0 0 1 1 0 1 0 0 0 0 1.0 0 5.22457708545644
9
+ HG00097 manta chr1:1595101:DEL:1595183 1 1 0 0 0 0 1.919078092376074 2.0 -99999.0 0 0.0 10.0 10.0 0.19230769230769232 0 266.0 266.0 12.0 0.5148514851485149 0.5247524752475248 1.4455984547909617 2.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.856571815297643
10
+ HG00097 manta chr1:1666974:DEL:1667141 1 1 0 0 0 0 2.225309281725863 18.0 -99999.0 0 11.0 21.0 32.0 1.0 1 59.0 888.0 11.0 0.5247524752475248 0.5445544554455446 1.9680479893568856 18.0 1 1 1 1 1 0 1 0 0 0 1.0 3 4.763375575548453
11
+ HG00097 manta chr1:1724966:DEL:1726924 1 1 0 0 0 0 3.2920344359947364 27.0 27.0 0 8.0 10.0 18.0 0.2727272727272727 0 420.0 420.0 31.0 0.5643564356435643 0.6435643564356436 1.9231444533753872 27.0 1 1 1 1 1 1 1 0 0 0 1.0 2 4.391640703492388
12
+ HG00097 manta chr1:1749605:INS:None 1 0 0 1 0 0 1.7075701760979363 23.0 -99999.0 0 0.0 10.0 10.0 0.37037037037037035 0 262.0 358.0 -99999.0 0.4752475247524752 0.4752475247524752 1.6885390899981634 -99999.0 1 1 1 1 1 1 0 0 0 0 -99999.0 2 4.391640703492388
13
+ HG00097 manta chr1:1924223:INS:None 1 0 0 1 0 0 2.0170333392987803 10.0 -99999.0 0 0.0 16.0 16.0 1.0 1 41.0 630.0 -99999.0 0.7425742574257426 0.7425742574257426 1.6063691498853343 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785
14
+ HG00097 manta chr1:1929384:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 0.0 19.0 19.0 1.0 1 39.0 580.0 -99999.0 0.693069306930693 0.693069306930693 1.4041502486751618 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785
15
+ HG00097 manta chr1:1934989:DEL:1935584 1 1 0 0 0 0 2.7752462597402365 10.0 -99999.0 0 16.0 18.0 34.0 1.0 1 86.0 934.0 16.0 0.44554455445544555 0.504950495049505 1.8451028377340413 10.0 0 0 1 1 0 1 0 0 0 0 1.0 5 3.7486530934242674
16
+ HG00097 manta chr1:1948934:INS:None 1 0 0 1 0 0 2.0492180226701815 2.0 -99999.0 0 3.0 22.0 25.0 1.0 1 64.0 864.0 3.0 0.5643564356435643 0.5643564356435643 1.0561905395876316 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.1444496608689
17
+ HG00097 manta chr1:1993704:INS:None 1 0 0 1 0 0 2.1702617153949575 15.0 -99999.0 0 8.0 45.0 53.0 1.0 1 128.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.9812115261970087 -99999.0 0 0 0 0 0 0 1 0 0 0 -99999.0 5 4.406829613621544
18
+ HG00097 manta chr1:2019220:INS:None 1 0 0 1 0 0 -99999.0 23.0 23.0 0 2.0 12.0 14.0 1.0 1 20.0 261.0 2.0 0.8613861386138614 0.8613861386138614 1.3524948796891727 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.406829613621544
19
+ HG00097 manta chr1:2421838:BND:None 1 0 0 0 0 1 -99999.0 3.0 -99999.0 0 4.0 25.0 29.0 0.37662337662337664 0 582.0 999.0 23.0 0.49504950495049505 0.49504950495049505 1.983371708749389 -99999.0 0 0 1 1 0 1 1 0 0 0 -99999.0 0 5.255942731372637
20
+ HG00097 manta chr1:2602115:DUP:2602189 1 0 1 0 0 0 1.8750612633917 -99999.0 -99999.0 0 0.0 11.0 11.0 0.12941176470588237 0 255.0 255.0 30.0 0.5643564356435643 0.6435643564356436 1.913521655146875 1.0 0 0 0 0 0 0 0 0 0 0 0.0 1 1.6989700043360187
21
+ HG00097 manta chr1:2602164:INS:None 1 0 0 1 0 0 1.919078092376074 -99999.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 137.0 697.0 -99999.0 0.594059405940594 0.594059405940594 1.9680479893568856 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 1.6989700043360187
22
+ HG00097 manta chr1:3026038:INS:None 0 0 0 1 0 0 -99999.0 5.0 5.0 0 2.0 26.0 28.0 0.875 1 6.0 831.0 6.0 0.48514851485148514 0.48514851485148514 1.556688426030284 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.192483819902663
23
+ HG00097 manta chr1:3181807:DEL:3181925 1 1 0 0 0 0 2.075546961392531 14.0 -99999.0 0 0.0 27.0 27.0 1.0 1 65.0 999.0 -99999.0 0.44554455445544555 0.46534653465346537 1.9395446972377677 14.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.525860772931853
24
+ HG00097 manta chr1:3215369:INS:None 1 0 0 1 0 0 1.792391689498254 54.0 -99999.0 0 0.0 27.0 27.0 1.0 1 68.0 999.0 -99999.0 0.6831683168316832 0.6831683168316832 1.8503664672483442 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.525860772931853
25
+ HG00097 manta chr1:3260742:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 8.0 29.0 37.0 1.0 1 95.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.6383659495376368 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.6568070667104475
26
+ HG00097 manta chr1:3316611:DEL:3316667 1 1 0 0 0 0 1.7558748556724915 23.0 -99999.0 0 0.0 10.0 10.0 0.37037037037037035 0 177.0 306.0 5.0 0.49504950495049505 0.49504950495049505 1.8200734972498984 23.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.747178671360165
27
+ HG00097 manta chr1:3832372:DEL:3832444 1 1 0 0 0 0 1.863322860120456 45.0 -99999.0 0 0.0 5.0 5.0 0.625 0 25.0 127.0 1.0 0.5247524752475248 0.5841584158415841 1.9137004533259432 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.560086048497414
28
+ HG00097 manta chr1:3868686:INS:None 1 0 0 1 0 0 2.4712917110589387 1.0 -99999.0 0 4.0 38.0 42.0 1.0 1 29.0 708.0 4.0 0.6336633663366337 0.6336633663366337 1.5250608688455582 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.560086048497414
29
+ HG00097 manta chr1:3929498:INS:None 1 0 0 1 0 0 2.4668676203541096 4.0 -99999.0 0 2.0 27.0 29.0 1.0 1 62.0 707.0 2.0 0.44554455445544555 0.44554455445544555 1.4880199108409586 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.7839964283643
30
+ HG00097 manta chr1:3999775:DEL:3999895 1 1 0 0 0 0 2.08278537031645 -99999.0 -99999.0 0 0.0 25.0 25.0 1.0 1 73.0 999.0 -99999.0 0.6633663366336634 0.693069306930693 1.7648687041978608 1.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.846819393669543
31
+ HG00097 manta chr1:4129813:DEL:4129873 1 1 0 0 0 0 1.7853298350107671 45.0 -99999.0 0 0.0 9.0 9.0 0.34615384615384615 0 162.0 312.0 6.0 0.6039603960396039 0.6831683168316832 1.8738460050301151 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.16660770308391
32
+ HG00097 manta chr1:4144488:DUP:4144620 0 0 1 0 0 0 2.123851640967086 4.0 4.0 0 0.0 22.0 22.0 0.43137254901960786 0 332.0 518.0 11.0 0.42574257425742573 0.48514851485148514 1.4784771363149791 4.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.1492191126553797
33
+ HG00097 manta chr1:4144628:INS:None 0 0 0 1 0 0 1.9084850188786497 2.0 -99999.0 0 0.0 28.0 28.0 1.0 1 71.0 982.0 -99999.0 0.48514851485148514 0.48514851485148514 1.4687802326926382 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.1492191126553797
34
+ HG00097 manta chr1:4333580:BND:None 1 0 0 0 0 1 -99999.0 -99999.0 -99999.0 0 12.0 0.0 12.0 0.5 0 142.0 142.0 24.0 0.5643564356435643 0.5643564356435643 1.9623130244403804 -99999.0 0 0 1 1 0 1 0 0 0 1 -99999.0 1 3.465680211598278
35
+ HG00097 manta chr1:4336501:INS:None 1 0 0 1 0 0 -99999.0 27.0 27.0 0 3.0 37.0 40.0 1.0 1 68.0 929.0 3.0 0.40594059405940597 0.40594059405940597 1.747964514410695 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 3.465680211598278
36
+ HG00097 manta chr1:4939764:BND:None 0 0 0 0 0 1 -99999.0 599.0 -99999.0 1 14.0 0.0 14.0 0.4117647058823529 0 115.0 115.0 34.0 0.45544554455445546 0.45544554455445546 1.9728402340561 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.133363239048624
37
+ HG00097 manta chr1:5075708:INS:None 1 0 0 1 0 0 1.863322860120456 13.0 -99999.0 0 0.0 12.0 12.0 1.0 1 26.0 288.0 -99999.0 0.07920792079207921 0.07920792079207921 1.3986907959255488 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689
38
+ HG00097 manta chr1:5160300:INS:None 0 0 0 1 0 0 2.765668554759014 -99999.0 -99999.0 0 0.0 46.0 46.0 0.92 1 52.0 999.0 4.0 0.37623762376237624 0.37623762376237624 1.7050759800383022 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689
39
+ HG00097 manta chr1:5387025:BND:None 1 0 0 0 0 1 -99999.0 466.0 -99999.0 1 15.0 0.0 15.0 0.6 0 116.0 228.0 25.0 0.5148514851485149 0.5148514851485149 1.3511081047001698 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.161368002234975
40
+ HG00097 manta chr1:5387169:DUP:5387401 1 0 1 0 0 0 2.367355921026019 -99999.0 -99999.0 0 1.0 23.0 24.0 0.42857142857142855 0 338.0 481.0 13.0 0.45544554455445546 0.4752475247524752 1.5481463016741694 0.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.161368002234975
41
+ HG00097 manta chr1:5414152:DEL:5414227 1 1 0 0 0 0 1.8808135922807914 18.0 -99999.0 0 0.0 7.0 7.0 0.3333333333333333 0 102.0 157.0 6.0 0.5346534653465347 0.5346534653465347 1.9698531038077964 18.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.431106328181145
42
+ HG00097 manta chr1:5499440:DEL:5499546 1 1 0 0 0 0 2.0293837776852097 8.0 -99999.0 0 0.0 9.0 9.0 0.21951219512195122 0 276.0 276.0 7.0 0.26732673267326734 0.26732673267326734 1.6246863594780123 8.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.930893022406026
43
+ HG00097 manta chr1:5593288:INS:None 1 0 0 1 0 0 1.806179973983887 2.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 142.0 889.0 -99999.0 0.5544554455445545 0.5544554455445545 1.9892256298878759 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 2 4.9076585372801205
44
+ HG00097 manta chr1:5674133:BND:None 0 0 0 0 0 1 -99999.0 7.0 -99999.0 0 11.0 47.0 58.0 0.3020833333333333 0 999.0 999.0 78.0 0.38613861386138615 0.38613861386138615 1.9050918955475322 -99999.0 0 0 0 0 0 0 0 1 0 0 -99999.0 1 4.9076585372801205
45
+ HG00097 manta chr1:5874228:INS:None 1 0 0 1 0 0 2.4533183400470375 13.0 -99999.0 0 18.0 23.0 41.0 0.5394736842105263 0 327.0 571.0 37.0 0.44554455445544555 0.44554455445544555 1.9714887292053216 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 0 5.116717299877443
46
+ HG00097 manta chr1:6005060:DEL:6005340 1 1 0 0 0 0 2.44870631990508 -99999.0 -99999.0 0 14.0 15.0 29.0 0.9666666666666667 1 23.0 314.0 15.0 0.6237623762376238 0.6831683168316832 1.5376840111363732 2.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.317666442356501
47
+ HG00097 manta chr1:6025840:DEL:6025908 1 1 0 0 0 0 1.8388490907372552 74.0 -99999.0 0 0.0 6.0 6.0 0.2222222222222222 0 182.0 184.0 9.0 0.6435643564356436 0.7029702970297029 1.7307956407165326 50.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.317666442356501
48
+ HG00097 manta chr1:6742557:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 5.0 31.0 36.0 1.0 1 101.0 999.0 5.0 0.07920792079207921 0.07920792079207921 1.320097363865938 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.297445341827969
49
+ HG00097 manta chr1:6940912:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 0.0 34.0 34.0 0.9714285714285714 1 95.0 999.0 -99999.0 0.6336633663366337 0.6336633663366337 1.902230293166466 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 4.635272580112365
50
+ HG00097 manta chr1:6984090:INS:None 1 0 0 1 0 0 2.311753861055754 2.0 -99999.0 0 6.0 35.0 41.0 1.0 1 72.0 999.0 6.0 0.48514851485148514 0.48514851485148514 1.5136654464703396 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.635272580112365
51
+ HG00097 manta chr1:7510011:DEL:7511458 1 1 0 0 0 0 3.1607685618611283 -99999.0 -99999.0 0 11.0 15.0 26.0 0.36619718309859156 0 527.0 711.0 35.0 0.46534653465346537 0.48514851485148514 1.8175669943687356 0.0 0 0 1 0 0 1 0 0 0 0 0.015193370165745856 1 4.5124575861973435
example/sv_scored.tsv CHANGED
@@ -1,8 +1,51 @@
1
- chrom pos qual gq pr_ref pr_alt sr_ref sr_alt vf_ref vf_alt total_alt_support sr_pr_ratio vaf_estimate strand_bias_fs cipos_width ciend_width homlen svlen_abs svtype_DEL svtype_INS svtype_DUP svtype_BND is_imprecise filter_pass n_filter_flags CS tier
2
- chr10 42254638 182 6 2.0 0.0 3.0 5.0 5.0 5.0 5.0 5.0 0.5 9.7 0.0 50.0 1 0 0 0 0 1 0 0.37 LOW
3
- chr12 12391920 999 281 11.0 20.0 3.0 14.0 12.0 30.0 34.0 0.6666666666666666 0.7083333333333334 7.186 0.0 0 0 0 1 0 1 0 0.1762 LOW
4
- chr16 8562503 693 693 54.0 22.0 36.0 9.0 69.0 26.0 31.0 0.391304347826087 0.256198347107438 0.0 0.0 1781.0 0 0 1 0 0 1 0 0.075 LOW
5
- chr17 23232202 310 38 0.0 0.0 0.0 21.0 0.0 21.0 21.0 21.0 1.0 0.0 5.0 5.0 5.0 0 1 0 0 0 0 1 0.3 WARNING
6
- chr3 163746379 999 118 0.0 10.0 0.0 33.0 0.0 43.0 43.0 3.0 1.0 0.0 8.0 8.0 306.0 1 0 0 0 0 1 0 1.0 HIGH
7
- chr6 32648154 82 82 7.0 5.0 7.0 5.0 5.0 0.4166666666666667 415.0 238.0 0.0 4156.0 1 0 0 0 1 1 0 0.495 LOW
8
- chr8 91664843 999 148 0.0 28.0 0.0 36.0 0.0 56.0 64.0 1.2413793103448276 1.0 0.0 15.0 15.0 316.0 0 1 0 0 0 1 0 1.0 HIGH
1
+ sample caller variant_ID is_pass svtype_DEL svtype_DUP svtype_INS svtype_INV svtype_BND svlen_log cipos_width ciend_width is_imprecise pe_support sr_support total_support vaf gt_hom gq qual_norm local_depth gc_min gc_max entropy_min microhom_max in_segdup_either in_segdup_both in_difficult_either in_difficult_both in_lowmap_either in_tandem_either in_Alu_either in_L1_either in_SVA_either in_LTR_either frac_span_repeat n_neighbors nn_log_dist CS_raw CS tier
2
+ HG00097 manta chr1:861885:BND:None 1 0 0 0 0 1 -99999.0 31.0 -99999.0 0 5.0 7.0 12.0 0.1538461538461538 0 182.0 182.0 43.0 0.297029702970297 0.297029702970297 1.6350564503142977 -99999.0 1 1 1 1 0 1 0 0 0 0 -99999.0 2 4.424979586745809 0.0077 0.0016 LOW
3
+ HG00097 manta chr1:888490:DEL:888577 1 1 0 0 0 0 1.944482672150169 55.0 -99999.0 0 0.0 5.0 5.0 0.2631578947368421 0 93.0 116.0 5.0 0.495049504950495 0.5247524752475248 1.119875913690787 50.0 1 1 1 1 0 1 0 0 0 0 1.0 2 4.20382130251655 0.7047 0.8104 HIGH
4
+ HG00097 manta chr1:904478:DEL:904576 1 1 0 0 0 0 1.99563519459755 14.0 -99999.0 0 0.0 10.0 10.0 0.4166666666666667 0 31.0 294.0 3.0 0.7227722772277227 0.7722772277227723 1.6490755403817534 14.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.20382130251655 0.616 0.6822 MODERATE
5
+ HG00097 manta chr1:998763:INS:None 1 0 0 1 0 0 1.7708520116421442 -99999.0 -99999.0 0 0.0 15.0 15.0 0.6521739130434783 0 43.0 538.0 -99999.0 0.7920792079207921 0.7920792079207921 1.6906469647232745 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.472888033769636 0.6379 0.7226 HIGH
6
+ HG00097 manta chr1:1028471:DEL:1029081 1 1 0 0 0 0 2.786041210242554 19.0 -99999.0 0 13.0 12.0 25.0 0.352112676056338 0 477.0 654.0 36.0 0.6534653465346535 0.7227722772277227 1.7262819301695698 19.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.472888033769636 0.809 0.8988 HIGH
7
+ HG00097 manta chr1:1068824:INS:None 1 0 0 1 0 0 1.8976270912904412 10.0 -99999.0 0 0.0 29.0 29.0 0.9666666666666668 1 75.0 999.0 -99999.0 0.8514851485148515 0.8514851485148515 1.5447689726449716 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.60588658966101 0.7489 0.8545 HIGH
8
+ HG00097 manta chr1:1427385:DEL:1427442 0 1 0 0 0 0 1.7634279935629371 -99999.0 -99999.0 0 0.0 2.0 2.0 0.1666666666666666 0 13.0 13.0 3.0 0.693069306930693 0.7029702970297029 1.3844773445130296 0.0 0 0 1 1 0 1 0 0 0 0 1.0 0 5.22457708545644 0.1506 0.1012 LOW
9
+ HG00097 manta chr1:1595101:DEL:1595183 1 1 0 0 0 0 1.919078092376074 2.0 -99999.0 0 0.0 10.0 10.0 0.1923076923076923 0 266.0 266.0 12.0 0.5148514851485149 0.5247524752475248 1.4455984547909615 2.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.856571815297643 0.4295 0.4349 WARNING
10
+ HG00097 manta chr1:1666974:DEL:1667141 1 1 0 0 0 0 2.225309281725863 18.0 -99999.0 0 11.0 21.0 32.0 1.0 1 59.0 888.0 11.0 0.5247524752475248 0.5445544554455446 1.9680479893568856 18.0 1 1 1 1 1 0 1 0 0 0 1.0 3 4.763375575548453 0.7289 0.8283 HIGH
11
+ HG00097 manta chr1:1724966:DEL:1726924 1 1 0 0 0 0 3.2920344359947364 27.0 27.0 0 8.0 10.0 18.0 0.2727272727272727 0 420.0 420.0 31.0 0.5643564356435643 0.6435643564356436 1.9231444533753872 27.0 1 1 1 1 1 1 1 0 0 0 1.0 2 4.391640703492388 0.8005 0.8964 HIGH
12
+ HG00097 manta chr1:1749605:INS:None 1 0 0 1 0 0 1.7075701760979365 23.0 -99999.0 0 0.0 10.0 10.0 0.3703703703703703 0 262.0 358.0 -99999.0 0.4752475247524752 0.4752475247524752 1.6885390899981634 -99999.0 1 1 1 1 1 1 0 0 0 0 -99999.0 2 4.391640703492388 0.7091 0.8117 HIGH
13
+ HG00097 manta chr1:1924223:INS:None 1 0 0 1 0 0 2.0170333392987803 10.0 -99999.0 0 0.0 16.0 16.0 1.0 1 41.0 630.0 -99999.0 0.7425742574257426 0.7425742574257426 1.6063691498853343 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785 0.526 0.5557 MODERATE
14
+ HG00097 manta chr1:1929384:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 0.0 19.0 19.0 1.0 1 39.0 580.0 -99999.0 0.693069306930693 0.693069306930693 1.4041502486751618 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785 0.2185 0.1761 LOW
15
+ HG00097 manta chr1:1934989:DEL:1935584 1 1 0 0 0 0 2.7752462597402365 10.0 -99999.0 0 16.0 18.0 34.0 1.0 1 86.0 934.0 16.0 0.4455445544554455 0.504950495049505 1.8451028377340413 10.0 0 0 1 1 0 1 0 0 0 0 1.0 5 3.748653093424267 0.686 0.7801 HIGH
16
+ HG00097 manta chr1:1948934:INS:None 1 0 0 1 0 0 2.049218022670181 2.0 -99999.0 0 3.0 22.0 25.0 1.0 1 64.0 864.0 3.0 0.5643564356435643 0.5643564356435643 1.0561905395876316 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.1444496608689 0.7533 0.8545 HIGH
17
+ HG00097 manta chr1:1993704:INS:None 1 0 0 1 0 0 2.170261715394957 15.0 -99999.0 0 8.0 45.0 53.0 1.0 1 128.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.981211526197009 -99999.0 0 0 0 0 0 0 1 0 0 0 -99999.0 5 4.406829613621544 0.8105 0.8988 HIGH
18
+ HG00097 manta chr1:2019220:INS:None 1 0 0 1 0 0 -99999.0 23.0 23.0 0 2.0 12.0 14.0 1.0 1 20.0 261.0 2.0 0.8613861386138614 0.8613861386138614 1.3524948796891727 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.406829613621544 0.5454 0.5772 MODERATE
19
+ HG00097 manta chr1:2421838:BND:None 1 0 0 0 0 1 -99999.0 3.0 -99999.0 0 4.0 25.0 29.0 0.3766233766233766 0 582.0 999.0 23.0 0.495049504950495 0.495049504950495 1.983371708749389 -99999.0 0 0 1 1 0 1 1 0 0 0 -99999.0 0 5.255942731372637 0.002 0.0003 LOW
20
+ HG00097 manta chr1:2602115:DUP:2602189 1 0 1 0 0 0 1.8750612633917 -99999.0 -99999.0 0 0.0 11.0 11.0 0.1294117647058823 0 255.0 255.0 30.0 0.5643564356435643 0.6435643564356436 1.913521655146875 1.0 0 0 0 0 0 0 0 0 0 0 0.0 1 1.6989700043360187 0.0 0.0 LOW
21
+ HG00097 manta chr1:2602164:INS:None 1 0 0 1 0 0 1.919078092376074 -99999.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 137.0 697.0 -99999.0 0.594059405940594 0.594059405940594 1.9680479893568856 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 1.6989700043360187 0.7853 0.8865 HIGH
22
+ HG00097 manta chr1:3026038:INS:None 0 0 0 1 0 0 -99999.0 5.0 5.0 0 2.0 26.0 28.0 0.875 1 6.0 831.0 6.0 0.4851485148514851 0.4851485148514851 1.556688426030284 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.192483819902663 0.8821 0.935 HIGH
23
+ HG00097 manta chr1:3181807:DEL:3181925 1 1 0 0 0 0 2.075546961392531 14.0 -99999.0 0 0.0 27.0 27.0 1.0 1 65.0 999.0 -99999.0 0.4455445544554455 0.4653465346534653 1.939544697237768 14.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.525860772931853 0.6932 0.798 HIGH
24
+ HG00097 manta chr1:3215369:INS:None 1 0 0 1 0 0 1.792391689498254 54.0 -99999.0 0 0.0 27.0 27.0 1.0 1 68.0 999.0 -99999.0 0.6831683168316832 0.6831683168316832 1.850366467248344 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.525860772931853 0.8417 0.9154 HIGH
25
+ HG00097 manta chr1:3260742:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 8.0 29.0 37.0 1.0 1 95.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.6383659495376368 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.656807066710448 0.8274 0.9091 HIGH
26
+ HG00097 manta chr1:3316611:DEL:3316667 1 1 0 0 0 0 1.7558748556724917 23.0 -99999.0 0 0.0 10.0 10.0 0.3703703703703703 0 177.0 306.0 5.0 0.495049504950495 0.495049504950495 1.8200734972498984 23.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.747178671360165 0.8483 0.917 HIGH
27
+ HG00097 manta chr1:3832372:DEL:3832444 1 1 0 0 0 0 1.863322860120456 45.0 -99999.0 0 0.0 5.0 5.0 0.625 0 25.0 127.0 1.0 0.5247524752475248 0.5841584158415841 1.9137004533259432 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.560086048497414 0.6316 0.7133 HIGH
28
+ HG00097 manta chr1:3868686:INS:None 1 0 0 1 0 0 2.4712917110589387 1.0 -99999.0 0 4.0 38.0 42.0 1.0 1 29.0 708.0 4.0 0.6336633663366337 0.6336633663366337 1.5250608688455582 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.560086048497414 0.6765 0.7715 HIGH
29
+ HG00097 manta chr1:3929498:INS:None 1 0 0 1 0 0 2.4668676203541096 4.0 -99999.0 0 2.0 27.0 29.0 1.0 1 62.0 707.0 2.0 0.4455445544554455 0.4455445544554455 1.4880199108409586 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.7839964283643 0.4086 0.3836 WARNING
30
+ HG00097 manta chr1:3999775:DEL:3999895 1 1 0 0 0 0 2.08278537031645 -99999.0 -99999.0 0 0.0 25.0 25.0 1.0 1 73.0 999.0 -99999.0 0.6633663366336634 0.693069306930693 1.7648687041978608 1.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.846819393669543 0.7755 0.874 HIGH
31
+ HG00097 manta chr1:4129813:DEL:4129873 1 1 0 0 0 0 1.7853298350107671 45.0 -99999.0 0 0.0 9.0 9.0 0.3461538461538461 0 162.0 312.0 6.0 0.6039603960396039 0.6831683168316832 1.8738460050301151 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.16660770308391 0.8085 0.8988 HIGH
32
+ HG00097 manta chr1:4144488:DUP:4144620 0 0 1 0 0 0 2.123851640967086 4.0 4.0 0 0.0 22.0 22.0 0.4313725490196078 0 332.0 518.0 11.0 0.4257425742574257 0.4851485148514851 1.4784771363149791 4.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.14921911265538 0.0007 0.0 LOW
33
+ HG00097 manta chr1:4144628:INS:None 0 0 0 1 0 0 1.9084850188786495 2.0 -99999.0 0 0.0 28.0 28.0 1.0 1 71.0 982.0 -99999.0 0.4851485148514851 0.4851485148514851 1.4687802326926382 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.14921911265538 0.1717 0.1265 LOW
34
+ HG00097 manta chr1:4333580:BND:None 1 0 0 0 0 1 -99999.0 -99999.0 -99999.0 0 12.0 0.0 12.0 0.5 0 142.0 142.0 24.0 0.5643564356435643 0.5643564356435643 1.9623130244403804 -99999.0 0 0 1 1 0 1 0 0 0 1 -99999.0 1 3.465680211598278 0.0055 0.0012 LOW
35
+ HG00097 manta chr1:4336501:INS:None 1 0 0 1 0 0 -99999.0 27.0 27.0 0 3.0 37.0 40.0 1.0 1 68.0 929.0 3.0 0.4059405940594059 0.4059405940594059 1.747964514410695 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 3.465680211598278 0.7674 0.8684 HIGH
36
+ HG00097 manta chr1:4939764:BND:None 0 0 0 0 0 1 -99999.0 599.0 -99999.0 1 14.0 0.0 14.0 0.4117647058823529 0 115.0 115.0 34.0 0.4554455445544554 0.4554455445544554 1.9728402340561 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.133363239048624 0.0303 0.0134 LOW
37
+ HG00097 manta chr1:5075708:INS:None 1 0 0 1 0 0 1.863322860120456 13.0 -99999.0 0 0.0 12.0 12.0 1.0 1 26.0 288.0 -99999.0 0.0792079207920792 0.0792079207920792 1.3986907959255488 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689 0.749 0.8545 HIGH
38
+ HG00097 manta chr1:5160300:INS:None 0 0 0 1 0 0 2.765668554759014 -99999.0 -99999.0 0 0.0 46.0 46.0 0.92 1 52.0 999.0 4.0 0.3762376237623762 0.3762376237623762 1.7050759800383022 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689 0.7965 0.8914 HIGH
39
+ HG00097 manta chr1:5387025:BND:None 1 0 0 0 0 1 -99999.0 466.0 -99999.0 1 15.0 0.0 15.0 0.6 0 116.0 228.0 25.0 0.5148514851485149 0.5148514851485149 1.3511081047001698 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.161368002234975 0.0015 0.0 LOW
40
+ HG00097 manta chr1:5387169:DUP:5387401 1 0 1 0 0 0 2.367355921026019 -99999.0 -99999.0 0 1.0 23.0 24.0 0.4285714285714285 0 338.0 481.0 13.0 0.4554455445544554 0.4752475247524752 1.5481463016741694 0.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.161368002234975 0.0005 0.0 LOW
41
+ HG00097 manta chr1:5414152:DEL:5414227 1 1 0 0 0 0 1.8808135922807916 18.0 -99999.0 0 0.0 7.0 7.0 0.3333333333333333 0 102.0 157.0 6.0 0.5346534653465347 0.5346534653465347 1.9698531038077964 18.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.431106328181145 0.6073 0.67 MODERATE
42
+ HG00097 manta chr1:5499440:DEL:5499546 1 1 0 0 0 0 2.0293837776852097 8.0 -99999.0 0 0.0 9.0 9.0 0.2195121951219512 0 276.0 276.0 7.0 0.2673267326732673 0.2673267326732673 1.6246863594780123 8.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.930893022406026 0.5474 0.5784 MODERATE
43
+ HG00097 manta chr1:5593288:INS:None 1 0 0 1 0 0 1.806179973983887 2.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 142.0 889.0 -99999.0 0.5544554455445545 0.5544554455445545 1.989225629887876 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 2 4.9076585372801205 0.9166 0.9509 HIGH
44
+ HG00097 manta chr1:5674133:BND:None 0 0 0 0 0 1 -99999.0 7.0 -99999.0 0 11.0 47.0 58.0 0.3020833333333333 0 999.0 999.0 78.0 0.3861386138613861 0.3861386138613861 1.905091895547532 -99999.0 0 0 0 0 0 0 0 1 0 0 -99999.0 1 4.9076585372801205 0.0368 0.0188 LOW
45
+ HG00097 manta chr1:5874228:INS:None 1 0 0 1 0 0 2.453318340047037 13.0 -99999.0 0 18.0 23.0 41.0 0.5394736842105263 0 327.0 571.0 37.0 0.4455445544554455 0.4455445544554455 1.9714887292053216 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 0 5.116717299877443 0.9601 0.9697 HIGH
46
+ HG00097 manta chr1:6005060:DEL:6005340 1 1 0 0 0 0 2.44870631990508 -99999.0 -99999.0 0 14.0 15.0 29.0 0.9666666666666668 1 23.0 314.0 15.0 0.6237623762376238 0.6831683168316832 1.5376840111363732 2.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.317666442356501 0.7321 0.8286 HIGH
47
+ HG00097 manta chr1:6025840:DEL:6025908 1 1 0 0 0 0 1.8388490907372552 74.0 -99999.0 0 0.0 6.0 6.0 0.2222222222222222 0 182.0 184.0 9.0 0.6435643564356436 0.7029702970297029 1.7307956407165326 50.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.317666442356501 0.876 0.9325 HIGH
48
+ HG00097 manta chr1:6742557:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 5.0 31.0 36.0 1.0 1 101.0 999.0 5.0 0.0792079207920792 0.0792079207920792 1.320097363865938 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.297445341827969 0.6745 0.7695 HIGH
49
+ HG00097 manta chr1:6940912:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 0.0 34.0 34.0 0.9714285714285714 1 95.0 999.0 -99999.0 0.6336633663366337 0.6336633663366337 1.902230293166466 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 4.635272580112365 0.7578 0.8633 HIGH
50
+ HG00097 manta chr1:6984090:INS:None 1 0 0 1 0 0 2.311753861055754 2.0 -99999.0 0 6.0 35.0 41.0 1.0 1 72.0 999.0 6.0 0.4851485148514851 0.4851485148514851 1.5136654464703396 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.635272580112365 0.8751 0.9325 HIGH
51
+ HG00097 manta chr1:7510011:DEL:7511458 1 1 0 0 0 0 3.1607685618611283 -99999.0 -99999.0 0 11.0 15.0 26.0 0.3661971830985915 0 527.0 711.0 35.0 0.4653465346534653 0.4851485148514851 1.817566994368736 0.0 0 0 1 0 0 1 0 0 0 0 0.0151933701657458 1 4.512457586197344 0.7788 0.8742 HIGH
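A quick way to sanity-check a scored table like the one above is to tabulate calls per tier together with the calibrated `CS` range seen in each tier. The following is a minimal sketch, not part of this PR: it assumes the scored file is tab-delimited (inferred from the `.tsv` extension) and relies only on the `variant_ID`, `CS_raw`, `CS` and `tier` columns shown in `example/sv_scored.tsv`. The helper name `tier_summary` is hypothetical, and the inline sample is a four-row excerpt of the rows above with all other columns dropped.

```python
import csv
import io
from collections import Counter

# Four rows excerpted from example/sv_scored.tsv, reduced to the columns used here.
# Tab delimiting is an assumption based on the file extension.
SAMPLE = (
    "variant_ID\tCS_raw\tCS\ttier\n"
    "chr1:861885:BND:None\t0.0077\t0.0016\tLOW\n"
    "chr1:888490:DEL:888577\t0.7047\t0.8104\tHIGH\n"
    "chr1:904478:DEL:904576\t0.616\t0.6822\tMODERATE\n"
    "chr1:1595101:DEL:1595183\t0.4295\t0.4349\tWARNING\n"
)


def tier_summary(handle):
    """Return (calls per tier, (min CS, max CS) per tier) for a scored TSV handle."""
    counts = Counter()
    cs_by_tier = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["tier"]] += 1
        cs_by_tier.setdefault(row["tier"], []).append(float(row["CS"]))
    ranges = {tier: (min(vals), max(vals)) for tier, vals in cs_by_tier.items()}
    return counts, ranges


counts, ranges = tier_summary(io.StringIO(SAMPLE))
for tier in sorted(counts):
    low, high = ranges[tier]
    print(f"{tier}\tn={counts[tier]}\tCS in [{low}, {high}]")
```

To run it on the real file, pass `open("example/sv_scored.tsv", newline="")` instead of the `io.StringIO` wrapper. The observed `CS` ranges only describe the rows that were read; the actual tier cut-offs are defined by the release's configuration, not by this summary.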
feature_builder.py ADDED
@@ -0,0 +1,708 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ SVSTR_Score feature builder (VCF + reference only, single sample).
4
+
5
+ Computes the RandomForest input features defined in:
6
+ sv_features.tsv (callers: manta, delly, lumpy)
7
+ str_features.tsv (callers: expansionhunter, gangstr)
8
+
9
+ Design constraints (head model):
10
+ - Inputs are ONLY a short-read VCF + reference FASTA + static annotation BEDs.
11
+ No BAM, no cohort, no long-read (long-read is used elsewhere for labeling only).
12
+ - Features are caller-common *concepts*; each caller is parsed by its own parser.
13
+ - `caller` is recorded for bookkeeping but is NOT emitted as a model feature.
14
+
15
+ Annotation BEDs must be sorted, bgzipped and tabix-indexed (see
16
+ scripts/prepare_annotations or the resources/ prep step).
17
+
18
+ ExpansionHunter input is its flat (optionally gzipped) TSV, not a VCF — pass it to --vcf.
19
+
20
+ VALIDATION: validated on HG00097 (Manta/Delly/GangSTR VCFs + ExpansionHunter TSV).
21
+ Four parsing bugs were found & fixed against real data:
22
+ 1. GangSTR REPCN/REPCI come back from pysam as tuples (Number=2), not strings.
23
+ 2. pysam returns absent Flags as False (not KeyError) -> is_imprecise now tests `"IMPRECISE" in rec.info`.
24
+ 3. INFO/END is consumed into rec.stop; rec.info['END'] is empty.
25
+ 4. missing sentinel must be out-of-range (-99999); -1 collided with real negative
26
+ expansion_over_ref (contractions). LUMPY (smoove/SVTyper) not yet run.
27
+
28
+ Usage:
29
+ python feature_builder.py \
30
+ --vcf sample.manta.vcf.gz --caller manta \
31
+ --fasta GRCh38.fa \
32
+ --giab-dir ../resources/giab_prepared \
33
+ --repeatmasker ../resources/repeatmasker/rmsk_class.bed.gz \
34
+ -o sample.manta.features.tsv
35
+ """
36
+
37
+ import os
38
+ import sys
39
+ import math
40
+ import bisect
41
+ import argparse
42
+ from collections import defaultdict
43
+
44
+ import numpy as np
45
+ import pandas as pd
46
+ import pysam
47
+
48
+ MISSING = -99999.0 # out-of-range sentinel for missing fields (paired with *_missing indicators).
49
+ # Must be outside every feature's real range: expansion_over_ref can legitimately be negative,
50
+ # so a small sentinel like -1 would collide with real contractions.
51
+ SV_CALLERS = {"manta", "delly", "lumpy"}
52
+ STR_CALLERS = {"expansionhunter", "gangstr"}
53
+ PRIMARY_CONTIGS = ({f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]}
54
+ | {str(c) for c in list(range(1, 23)) + ["X", "Y", "MT", "M"]})
55
+
56
+ # Features that can legitimately be MISSING. Their `<feat>_missing` indicator is
57
+ # emitted for every listed feature whenever --missing-indicators is set (even if all-zero for a given caller) so every caller's output
58
+ # has an identical, fixed column schema — one trained model consumes any caller's
59
+ # converted VCF directly, no per-caller alignment needed.
60
+ SV_MISSING_INDICATORS = [
61
+ "svlen_log", "cipos_width", "ciend_width", "vaf", "qual_norm", "gq",
62
+ "local_depth", "gt_hom", "gc_min", "gc_max", "entropy_min", "microhom_max",
63
+ "frac_span_repeat", "nn_log_dist",
64
+ ]
65
+ STR_MISSING_INDICATORS = [
66
+ "motif_len", "ref_copynum", "locus_depth", "gt_hom", "gt_repcn_max", "gt_repcn_min",
67
+ "expansion_over_ref", "repci_width_max", "spanning_frac", "ref_tract_bp",
68
+ "allele_vs_readlen", "motif_is_homopolymer", "gc_flank", "entropy_flank",
69
+ ]
70
+
71
+
72
+ # ---------------------------------------------------------------------------
73
+ # Reference-sequence features (reused from A2Denovo conventions)
74
+ # ---------------------------------------------------------------------------
75
+ def gc_content(seq):
76
+ if not seq:
77
+ return MISSING
78
+ seq = seq.upper()
79
+ n = sum(1 for b in seq if b in "ACGT")
80
+ if n == 0:
81
+ return MISSING
82
+ return sum(1 for b in seq if b in "GC") / n
83
+
84
+
85
+ def shannon_entropy(seq):
86
+ if not seq:
87
+ return MISSING
88
+ seq = seq.upper()
89
+ counts = defaultdict(int)
90
+ for b in seq:
91
+ if b in "ACGT":
92
+ counts[b] += 1
93
+ total = sum(counts.values())
94
+ if total == 0:
95
+ return MISSING
96
+ h = 0.0
97
+ for c in counts.values():
98
+ p = c / total
99
+ h -= p * math.log2(p)
100
+ return h
101
+
102
+
103
+ def fetch(fasta, chrom, start, end):
104
+ """0-based half-open fetch with clamping; returns '' on failure."""
105
+ try:
106
+ start = max(0, start)
107
+ return fasta.fetch(chrom, start, end)
108
+ except Exception:
109
+ return ""
110
+
111
+
112
+ def gc_entropy_at(fasta, chrom, pos1, win):
113
+ """GC and entropy in pos +/- win (pos is 1-based)."""
114
+ seq = fetch(fasta, chrom, pos1 - 1 - win, pos1 + win)
115
+ return gc_content(seq), shannon_entropy(seq)
116
+
117
+
118
+ def microhomology(fasta, chrom, pos1, end1, max_k=50):
119
+ """
120
+ Approximate microhomology between the two breakpoints of an intra-chromosomal
121
+ SV: longest k (<=max_k) where the sequence adjacent to bp1 matches bp2.
122
+ Returns MISSING for inter-chromosomal / undefined cases.
123
+ """
124
+ if end1 is None or end1 <= pos1:
125
+ return MISSING
126
+ left = fetch(fasta, chrom, pos1 - max_k, pos1 + max_k).upper()
127
+ right = fetch(fasta, chrom, end1 - max_k, end1 + max_k).upper()
128
+ if len(left) < 2 * max_k or len(right) < 2 * max_k:
129
+ return MISSING
130
+ k = 0
131
+ while k < max_k and left[max_k + k] == right[max_k + k]: # rightward match
132
+ k += 1
133
+ j = 0
134
+ while j < max_k and left[max_k - 1 - j] == right[max_k - 1 - j]: # leftward
135
+ j += 1
136
+ return float(max(k, j))
137
+
138
+
139
+ # ---------------------------------------------------------------------------
140
+ # Tabix annotation (binary overlap + RepeatMasker element class)
141
+ # ---------------------------------------------------------------------------
142
+ class Annotator:
143
+ """Binary overlap against tabixed BEDs, with chr-naming fallback."""
144
+
145
+ RMSK_ELEMENTS = { # label prefix in rmsk_class.bed (repClass/repFamily) -> flag
146
+ "SINE/Alu": "Alu",
147
+ "LINE/L1": "L1",
148
+ "Retroposon/SVA": "SVA",
149
+ "LTR": "LTR",
150
+ }
151
+
152
+ def __init__(self, giab_dir=None, repeatmasker=None):
153
+ self.tbx = {}
154
+ if giab_dir:
155
+ for name in ("segdups", "lowmap", "tandem", "difficult"):
156
+ p = os.path.join(giab_dir, f"{name}.bed.gz")
157
+ if os.path.exists(p):
158
+ self.tbx[name] = pysam.TabixFile(p)
159
+ else:
160
+ sys.stderr.write(f"[warn] missing GIAB bed: {p}\n")
161
+ self.rmsk = pysam.TabixFile(repeatmasker) if repeatmasker and os.path.exists(repeatmasker) else None
162
+
163
+ def _contigs(self, tbx, chrom):
164
+ if chrom in tbx.contigs:
165
+ return chrom
166
+ alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
167
+ return alt if alt in tbx.contigs else None
168
+
169
+ def overlaps(self, name, chrom, pos1):
170
+ """1 if 1-based pos overlaps any interval in bed `name`, else 0."""
171
+ tbx = self.tbx.get(name)
172
+ if tbx is None:
173
+ return MISSING
174
+ c = self._contigs(tbx, chrom)
175
+ if c is None:
176
+ return 0
177
+ try:
178
+ for _ in tbx.fetch(c, pos1 - 1, pos1):
179
+ return 1
180
+ except Exception:
181
+ return 0
182
+ return 0
183
+
184
+ def frac_overlap(self, name, chrom, start1, end1):
185
+ """Fraction of [start1,end1] (1-based inclusive) covered by bed `name`."""
186
+ tbx = self.tbx.get(name)
187
+ if tbx is None or end1 is None or end1 < start1:
188
+ return MISSING
189
+ c = self._contigs(tbx, chrom)
190
+ if c is None:
191
+ return 0.0
192
+ span = end1 - start1 + 1
193
+ covered = 0
194
+ try:
195
+ for row in tbx.fetch(c, start1 - 1, end1):
196
+ f = row.split("\t")
197
+ s, e = int(f[1]), int(f[2])
198
+ covered += max(0, min(end1, e) - max(start1 - 1, s))
199
+ except Exception:
200
+ return 0.0
201
+ return min(1.0, covered / span) if span > 0 else 0.0
202
+
203
+ def rmsk_elements(self, chrom, pos1):
204
+ """Return dict {Alu,L1,SVA,LTR -> 0/1} for the position."""
205
+ flags = {"Alu": 0, "L1": 0, "SVA": 0, "LTR": 0}
206
+ if self.rmsk is None:
207
+ return {k: MISSING for k in flags}
208
+ c = self._contigs(self.rmsk, chrom)
209
+ if c is None:
210
+ return flags
211
+ try:
212
+ for row in self.rmsk.fetch(c, pos1 - 1, pos1):
213
+ label = row.split("\t")[3]
214
+ for prefix, flag in self.RMSK_ELEMENTS.items():
215
+ if label.startswith(prefix):
216
+ flags[flag] = 1
217
+ except Exception:
218
+ pass
219
+ return flags
220
+
221
+
222
+ def agg_either_both(a, b):
223
+ """Order-invariant aggregation for the two breakpoints."""
224
+ if a == MISSING or b == MISSING:
225
+ v = a if b == MISSING else b
226
+ return v, v
227
+ return (1 if (a or b) else 0), (1 if (a and b) else 0)
228
+
229
+
230
+ # ---------------------------------------------------------------------------
231
+ # Small helpers for VCF field access
232
+ # ---------------------------------------------------------------------------
233
+ def info(rec, key, default=None):
234
+ try:
235
+ return rec.info[key]
236
+ except Exception:
237
+ return default
238
+
239
+
240
+ def fmt(rec, key, default=None):
241
+ try:
242
+ return rec.samples[0][key]
243
+ except Exception:
244
+ return default
245
+
246
+
247
+ def is_pass(rec):
248
+ fk = list(rec.filter.keys())
249
+ return 1 if (not fk or fk == ["PASS"] or fk == ["."]) else 0
250
+
251
+
252
+ def gt_is_hom_alt(rec):
253
+ gt = fmt(rec, "GT")
254
+ if not gt or any(a is None for a in gt):
255
+ return MISSING
256
+ alleles = [a for a in gt]
257
+ return 1 if all(a == alleles[0] and a > 0 for a in alleles) else 0
258
+
259
+
260
+ def first(x, default=MISSING):
261
+ """Coerce a possibly-tuple INFO/FORMAT value to a scalar number."""
262
+ if x is None:
263
+ return default
264
+ if isinstance(x, (tuple, list)):
265
+ x = x[0] if x else default
266
+ try:
267
+ return float(x)
268
+ except Exception:
269
+ return default
270
+
271
+
272
+ def width(ci):
273
+ if not ci or not isinstance(ci, (tuple, list)) or len(ci) < 2:
274
+ return MISSING
275
+ try:
276
+ return abs(float(ci[1]) - float(ci[0]))
277
+ except Exception:
278
+ return MISSING
279
+
280
+
281
+ def norm_svtype(rec):
282
+ st = info(rec, "SVTYPE")
283
+ if st is None:
284
+ alt = str(rec.alts[0]) if rec.alts else ""
285
+ st = alt.strip("<>").split(":")[0] if alt.startswith("<") else "BND"
286
+ st = str(st).upper().split(":")[0]
287
+ if st in ("TRA", "CTX"):
288
+ st = "BND"
289
+ if st not in ("DEL", "DUP", "INS", "INV", "BND"):
290
+ st = "BND"
291
+ return st
292
+
293
+
294
+ # ---------------------------------------------------------------------------
295
+ # Per-caller SV parsers -> normalized concept dict
296
+ # ---------------------------------------------------------------------------
297
+ def parse_sv_common(rec):
298
+ st = norm_svtype(rec)
299
+ chrom = rec.chrom
300
+ pos = rec.pos
301
+ # pysam consumes INFO/END into rec.stop; meaningful only for spanned SVs.
302
+ # BND/INS are annotated at their primary breakend only (bp2 = bp1 via end=None).
303
+ end = rec.stop if st in ("DEL", "DUP", "INV") else None
304
+ return {
305
+ "chrom": chrom, "pos": pos, "end": end, "chrom2": chrom,
306
+ "svtype": st,
307
+ "is_pass": is_pass(rec),
308
+ "cipos_width": width(info(rec, "CIPOS") or info(rec, "CIPOS95")),
309
+ "ciend_width": width(info(rec, "CIEND") or info(rec, "CIEND95")),
310
+ "is_imprecise": 1 if ("IMPRECISE" in rec.info) else 0,
311
+ "gt_hom": gt_is_hom_alt(rec),
312
+ "svlen_raw": info(rec, "SVLEN"),
313
+ }
314
+
315
+
316
+ def parse_manta(rec):
317
+ d = parse_sv_common(rec)
318
+ pr = fmt(rec, "PR") or (None, None)
319
+ sr = fmt(rec, "SR") or (None, None)
320
+ pr_ref, pr_alt = (first(pr[0], 0), first(pr[1], 0)) if len(pr) == 2 else (0, 0)
321
+ sr_ref, sr_alt = (first(sr[0], 0), first(sr[1], 0)) if len(sr) == 2 else (0, 0)
322
+ tot = pr_ref + pr_alt + sr_ref + sr_alt
323
+ d.update({
324
+ "pe_support": pr_alt, "sr_support": sr_alt, "total_support": pr_alt + sr_alt,
325
+ "vaf": (pr_alt + sr_alt) / tot if tot > 0 else MISSING,
326
+ "gq": first(fmt(rec, "GQ")), "qual_norm": first(rec.qual),
327
+ "local_depth": (pr_ref + pr_alt) or first(info(rec, "BND_DEPTH")),
328
+ })
329
+ return d
330
+
331
+
332
+ def parse_delly(rec):
333
+ d = parse_sv_common(rec)
334
+ dr, dv = first(fmt(rec, "DR"), 0), first(fmt(rec, "DV"), 0)
335
+ rr, rv = first(fmt(rec, "RR"), 0), first(fmt(rec, "RV"), 0)
336
+ tot = dr + dv + rr + rv
337
+ if d["svlen_raw"] is None and d["end"] is not None: # v0.7 has no SVLEN
338
+ d["svlen_raw"] = d["end"] - d["pos"]
339
+ d.update({
340
+ "pe_support": dv, "sr_support": rv, "total_support": dv + rv,
341
+ "vaf": (dv + rv) / tot if tot > 0 else MISSING,
342
+ "gq": first(fmt(rec, "GQ")), "qual_norm": first(rec.qual),
343
+ "local_depth": dr + dv,
344
+ })
345
+ return d
346
+
347
+
348
+ def parse_lumpy(rec):
349
+ d = parse_sv_common(rec)
350
+ ao, ro = first(fmt(rec, "AO"), 0), first(fmt(rec, "RO"), 0)
351
+ ab = fmt(rec, "AB")
352
+ # smoove/LUMPY put SU/PE/SR in INFO (site-level), not FORMAT; fall back to FORMAT for other dialects
353
+ pe = info(rec, "PE"); pe = first(pe) if pe is not None else first(fmt(rec, "PE"), 0)
354
+ sr = info(rec, "SR"); sr = first(sr) if sr is not None else first(fmt(rec, "SR"), 0)
355
+ su = info(rec, "SU"); su = first(su) if su is not None else first(fmt(rec, "SU"), 0)
356
+ d.update({
357
+ "pe_support": pe, "sr_support": sr, "total_support": su,
358
+ "vaf": first(ab) if ab is not None else ((ao / (ao + ro)) if (ao + ro) > 0 else MISSING),
359
+ "gq": first(fmt(rec, "GQ")), "qual_norm": first(fmt(rec, "SQ")),
360
+ "local_depth": first(fmt(rec, "DP")),
361
+ })
362
+ return d
363
+
364
+
365
+ SV_PARSERS = {"manta": parse_manta, "delly": parse_delly, "lumpy": parse_lumpy}
366
+
367
+
368
+ # ---------------------------------------------------------------------------
369
+ # Per-caller STR parsers
370
+ # ---------------------------------------------------------------------------
371
+ def _split_pair(val, sep):
372
+ if val is None:
373
+ return []
374
+ if isinstance(val, (tuple, list)): # pysam returns Number=2 fields (e.g. GangSTR REPCN) as tuples
375
+ out = []
376
+ for x in val:
377
+ try:
378
+ out.append(float(x))
379
+ except Exception:
380
+ pass
381
+ return out
382
+ s = str(val)
383
+ for d in sep:
384
+ s = s.replace(d, "|")
385
+ out = []
386
+ for tok in s.split("|"):
387
+ try:
388
+ out.append(float(tok))
389
+ except Exception:
390
+ pass
391
+ return out
392
+
393
+
394
+ def parse_eh(rec):
395
+ ru = info(rec, "RU") or ""
396
+ repcn = _split_pair(fmt(rec, "REPCN"), "/")
397
+ ref_cn = first(info(rec, "REF"))
398
+ adsp = sum(_split_pair(fmt(rec, "ADSP"), "/"))
399
+ adfl = sum(_split_pair(fmt(rec, "ADFL"), "/"))
400
+ adir = sum(_split_pair(fmt(rec, "ADIR"), "/"))
401
+ return {
402
+ "chrom": rec.chrom, "pos": rec.pos, "end": rec.stop,
403
+ "is_pass": is_pass(rec), "motif_len": float(len(ru)) if ru else first(info(rec, "RL")),
404
+ "ref_copynum": ref_cn,
405
+ "repcn": repcn, "repci_raw": fmt(rec, "REPCI"),
406
+ "spanning_reads": adsp, "flanking_reads": adfl, "inrepeat_reads": adir,
407
+ "locus_depth": first(fmt(rec, "LC")), "gt_hom": gt_is_hom_alt(rec),
408
+ "qual_post": first(rec.qual), "ref_tract_bp": first(info(rec, "RL")),
409
+ "ru": ru,
410
+ }
411
+
412
+
413
+ def parse_gangstr(rec):
414
+ ru = info(rec, "RU") or ""
415
+ period = first(info(rec, "PERIOD"))
416
+ repcn = _split_pair(fmt(rec, "REPCN"), ",")
417
+ ref_cn = first(info(rec, "REF"))
418
+ rc = _split_pair(fmt(rec, "RC"), ",") # enclosing,spanning,FRR,bounding
419
+ enclosing, spanning, frr, bounding = (rc + [0, 0, 0, 0])[:4]
420
+ return {
421
+ "chrom": rec.chrom, "pos": rec.pos, "end": rec.stop,
422
+ "is_pass": is_pass(rec), "motif_len": period if period != MISSING else (float(len(ru)) if ru else MISSING),
423
+ "ref_copynum": ref_cn,
424
+ "repcn": repcn, "repci_raw": fmt(rec, "REPCI"),
425
+ "spanning_reads": enclosing + spanning, "flanking_reads": bounding, "inrepeat_reads": frr,
426
+ "locus_depth": first(fmt(rec, "DP")), "gt_hom": gt_is_hom_alt(rec),
427
+ "qual_post": first(fmt(rec, "Q")),
428
+ "ref_tract_bp": (ref_cn * period) if (ref_cn != MISSING and period != MISSING) else MISSING,
429
+ "ru": ru,
430
+ }
431
+
432
+
433
+ def _num(x, default=MISSING):
434
+ try:
435
+ if x is None or x == "":
436
+ return default
437
+ v = float(x)
438
+ return default if v != v else v # NaN guard
439
+ except Exception:
440
+ return default
441
+
442
+
443
+ def parse_eh_tsv(row):
444
+ """One row of an ExpansionHunter flat TSV:
445
+ chrom,pos,end,filter,repid,ru,rl,ref,repcn,repci,adsp,adfl,adir,lc,so"""
446
+ ru = str(row.get("ru") or "")
447
+ repcn = _split_pair(row.get("repcn"), "/")
448
+ ref_cn = _num(row.get("ref"))
449
+ rl = _num(row.get("rl"))
450
+ adsp = sum(_split_pair(row.get("adsp"), "/"))
451
+ adfl = sum(_split_pair(row.get("adfl"), "/"))
452
+ adir = sum(_split_pair(row.get("adir"), "/"))
453
+ gt_hom = MISSING
454
+ if len(repcn) >= 2: # hom-ALT = both alleles equal and differ from reference
455
+ gt_hom = 1 if (repcn[0] == repcn[1] and repcn[0] != ref_cn) else 0
456
+ return {
457
+ "chrom": str(row["chrom"]), "pos": int(float(row["pos"])), "end": _num(row.get("end")),
458
+ "is_pass": 1 if str(row.get("filter", "")).upper() == "PASS" else 0,
459
+ "motif_len": float(len(ru)) if ru else rl,
460
+ "ref_copynum": ref_cn,
461
+ "repcn": repcn, "repci_raw": row.get("repci"),
462
+ "spanning_reads": adsp, "flanking_reads": adfl, "inrepeat_reads": adir,
463
+ "locus_depth": _num(row.get("lc")), "gt_hom": gt_hom,
464
+ "qual_post": MISSING, # EH TSV carries no site quality
465
+ "ref_tract_bp": rl, "ru": ru,
466
+ }
467
+
468
+
469
+ STR_PARSERS = {"expansionhunter": parse_eh, "gangstr": parse_gangstr}
470
+
471
+
472
+ def repci_width_max(repci_raw):
473
+ """Max allele CI width. EH: '2-2/10-10' (str); GangSTR: ('1-2','2-2') (pysam tuple)."""
474
+ if repci_raw is None:
475
+ return MISSING
476
+ if isinstance(repci_raw, (tuple, list)):
477
+ alleles = [str(x) for x in repci_raw]
478
+ else:
479
+ alleles = str(repci_raw).replace("/", ",").split(",")
480
+ best = MISSING
481
+ for allele in alleles:
482
+ if "-" in allele:
483
+ try:
484
+ parts = allele.split("-")
485
+ w = abs(float(parts[1]) - float(parts[0]))
486
+ best = w if best == MISSING else max(best, w)
487
+ except Exception:
488
+ pass
489
+ return best
490
+
491
+
492
+ # ---------------------------------------------------------------------------
493
+ # Feature assembly
494
+ # ---------------------------------------------------------------------------
495
+ def sv_features(d, ann, fasta, win):
496
+ chrom, pos, end = d["chrom"], d["pos"], d["end"]
497
+ chrom2, end2 = d["chrom2"], (end if end is not None else pos)
498
+ st = d["svtype"]
499
+ svlen = first(d["svlen_raw"])
500
+ f = {
501
+ "variant_ID": f"{chrom}:{pos}:{st}:{end}",
502
+ "is_pass": d["is_pass"],
503
+ "svtype_DEL": int(st == "DEL"), "svtype_DUP": int(st == "DUP"),
504
+ "svtype_INS": int(st == "INS"), "svtype_INV": int(st == "INV"),
505
+ "svtype_BND": int(st == "BND"),
506
+ "svlen_log": math.log10(abs(svlen) + 1) if svlen != MISSING else MISSING,
507
+ "cipos_width": d["cipos_width"], "ciend_width": d["ciend_width"],
508
+ "is_imprecise": d["is_imprecise"],
509
+ "pe_support": d["pe_support"], "sr_support": d["sr_support"],
510
+ "total_support": d["total_support"], "vaf": d["vaf"],
511
+ "gt_hom": d["gt_hom"], "gq": d["gq"], "qual_norm": d["qual_norm"],
512
+ "local_depth": d["local_depth"],
513
+ }
514
+ # reference sequence context at both breakpoints
515
+ gc1, e1 = gc_entropy_at(fasta, chrom, pos, win)
516
+ gc2, e2 = gc_entropy_at(fasta, chrom2, end2, win)
517
+ f["gc_min"], f["gc_max"] = (min(gc1, gc2), max(gc1, gc2)) if MISSING not in (gc1, gc2) else (MISSING, MISSING)
518
+ f["entropy_min"] = min(e1, e2) if MISSING not in (e1, e2) else MISSING
519
+ f["microhom_max"] = microhomology(fasta, chrom, pos, end if chrom2 == chrom else None)
520
+ # GIAB binary overlap, both breakpoints
521
+ for name, key in (("segdups", "segdup"), ("difficult", "difficult")):
522
+ ei, bo = agg_either_both(ann.overlaps(name, chrom, pos), ann.overlaps(name, chrom2, end2))
523
+ f[f"in_{key}_either"], f[f"in_{key}_both"] = ei, bo
524
+ for name, key in (("lowmap", "lowmap"), ("tandem", "tandem")):
525
+ ei, _ = agg_either_both(ann.overlaps(name, chrom, pos), ann.overlaps(name, chrom2, end2))
526
+ f[f"in_{key}_either"] = ei
527
+ # RepeatMasker element class, either breakpoint
528
+ r1 = ann.rmsk_elements(chrom, pos)
529
+ r2 = ann.rmsk_elements(chrom2, end2)
530
+ for elt in ("Alu", "L1", "SVA", "LTR"):
531
+ ei, _ = agg_either_both(r1[elt], r2[elt])
532
+ f[f"in_{elt}_either"] = ei
533
+ # fraction of the SV interval covered by repeats (intra-chrom interval SVs only)
534
+ if st in ("DEL", "DUP", "INV") and end is not None and chrom2 == chrom:
535
+ f["frac_span_repeat"] = max(ann.frac_overlap("tandem", chrom, pos, end),
536
+ ann.frac_overlap("segdups", chrom, pos, end))
537
+ else:
538
+ f["frac_span_repeat"] = MISSING
539
+ # neighbor density (SV only) — precomputed onto d by compute_clustering()
540
+ f["n_neighbors"] = d.get("n_neighbors", 0)
541
+ f["nn_log_dist"] = d.get("nn_log_dist", MISSING)
542
+ return f
543
+
544
+
545
+ def str_features(d, ann, fasta, win, read_len):
546
+ chrom, pos = d["chrom"], d["pos"]
547
+ repcn = d["repcn"] or []
548
+ cn_max = max(repcn) if repcn else MISSING
549
+ cn_min = min(repcn) if repcn else MISSING
550
+ ref_cn = d["ref_copynum"]
551
+ motif = d["motif_len"]
552
+ f = {
553
+ "variant_ID": f"{chrom}:{pos}:{info_end(d)}",
554
+ "is_pass": d["is_pass"], "motif_len": motif, "ref_copynum": ref_cn,
555
+ "gt_repcn_max": cn_max, "gt_repcn_min": cn_min,
556
+ "expansion_over_ref": (cn_max - ref_cn) if MISSING not in (cn_max, ref_cn) else MISSING,
557
+ "repci_width_max": repci_width_max(d["repci_raw"]),
558
+ "spanning_reads": d["spanning_reads"], "flanking_reads": d["flanking_reads"],
559
+ "inrepeat_reads": d["inrepeat_reads"],
560
+ "locus_depth": d["locus_depth"], "gt_hom": d["gt_hom"],
561
+ # qual_post dropped: EH never emits it -> structurally-missing -> caller-identity proxy
562
+ "ref_tract_bp": d["ref_tract_bp"],
563
+ }
564
+ tot = d["spanning_reads"] + d["flanking_reads"] + d["inrepeat_reads"]
565
+ f["spanning_frac"] = d["spanning_reads"] / tot if tot > 0 else MISSING
566
+ f["allele_vs_readlen"] = (cn_max * motif / read_len) if MISSING not in (cn_max, motif) else MISSING
567
+ f["motif_is_homopolymer"] = int(motif == 1) if motif != MISSING else MISSING
568
+ gc, ent = gc_entropy_at(fasta, chrom, pos, win)
569
+ f["gc_flank"], f["entropy_flank"] = gc, ent
570
+ f["in_segdup"] = ann.overlaps("segdups", chrom, pos)
571
+ f["in_difficult"] = ann.overlaps("difficult", chrom, pos)
572
+ f["flank_lowmap"] = ann.overlaps("lowmap", chrom, pos)
573
+ return f
574
+
575
+
576
+ def info_end(d):
577
+ return int(d["end"]) if d.get("end") is not None else d["pos"]
578
+
579
+
580
+ # ---------------------------------------------------------------------------
581
+ # Clustering (SV) — within-callset neighbor density
582
+ # ---------------------------------------------------------------------------
583
+ def compute_clustering(parsed, radius):
584
+ """Set on each parsed SV dict:
585
+ nn_log_dist = log10(distance to nearest other call + 1), UNCAPPED (isolation).
586
+ n_neighbors = number of other calls within +/-radius.
587
+ SV calls are sparse (median nearest neighbor ~5-90 kb), so radius must be SV-scale
588
+ (default 100 kb), not the 1 kb used for dense small variants. Vectorized per chrom."""
589
+ by_chrom = defaultdict(list)
590
+ for j, d in enumerate(parsed):
591
+ by_chrom[d["chrom"]].append((d["pos"], j))
592
+ for items in by_chrom.values():
593
+ items.sort()
594
+ pos = np.array([p for p, _ in items])
595
+ n = len(pos)
596
+ for k, (_, j) in enumerate(items):
597
+ if n < 2:
598
+ parsed[j]["nn_log_dist"], parsed[j]["n_neighbors"] = MISSING, 0
599
+ continue
600
+ p = pos[k]
601
+ nearest = min((p - pos[k - 1]) if k > 0 else float("inf"),
602
+ (pos[k + 1] - p) if k < n - 1 else float("inf"))
603
+ parsed[j]["nn_log_dist"] = math.log10(nearest + 1)
604
+ lo = int(np.searchsorted(pos, p - radius, "left"))
605
+ hi = int(np.searchsorted(pos, p + radius, "right"))
606
+ parsed[j]["n_neighbors"] = hi - lo - 1 # exclude self
607
+
608
+
609
+ # ---------------------------------------------------------------------------
610
+ # Main
611
+ # ---------------------------------------------------------------------------
612
+ def main():
613
+ ap = argparse.ArgumentParser(description="SVSTR_Score feature builder")
614
+ ap.add_argument("--vcf", required=True)
615
+ ap.add_argument("--caller", required=True,
616
+ choices=sorted(SV_CALLERS | STR_CALLERS))
617
+ ap.add_argument("--fasta", required=True)
618
+ ap.add_argument("--giab-dir", default=None, help="dir with segdups/lowmap/tandem/difficult .bed.gz (tabixed)")
619
+ ap.add_argument("--repeatmasker", default=None, help="tabixed rmsk_class.bed.gz")
620
+ ap.add_argument("--win", type=int, default=50, help="GC/entropy window (+/- bp)")
621
+ ap.add_argument("--neighbor-radius", type=int, default=100000,
622
+ help="SV clustering radius for n_neighbors (+/- bp). Default 100kb — SV calls are "
623
+ "sparse (median nearest ~5-90kb); 1kb is for dense small variants.")
624
+ ap.add_argument("--read-len", type=int, default=150, help="short-read length (STR spanning feasibility)")
625
+ ap.add_argument("--primary-only", dest="primary_only", action="store_true", default=True,
626
+ help="keep only primary-assembly contigs chr1-22,X,Y,M (default on)")
627
+ ap.add_argument("--all-contigs", dest="primary_only", action="store_false",
628
+ help="include ALT/decoy/HLA contigs (off by default)")
629
+ ap.add_argument("--str-drop-homref", action="store_true",
630
+ help="(STR) drop hom-ref 0/0 genotype loci (catalog non-variants)")
631
+ ap.add_argument("--sample", default=None,
632
+ help="sample id (default: auto from VCF's single sample, or EH-TSV filename prefix). "
633
+ "Emitted as a `sample` column — the label join key with the truth set.")
634
+ ap.add_argument("--missing-indicators", action="store_true",
635
+ help="also emit <feat>_missing 0/1 columns. OFF by default: redundant for tree "
636
+ "models (the -99999 sentinel is already split-separable). Turn on for linear/NN models.")
637
+ ap.add_argument("-o", "--output", required=True)
638
+ args = ap.parse_args()
639
+
640
+ variant_class = "SV" if args.caller in SV_CALLERS else "STR"
641
+ fasta = pysam.FastaFile(args.fasta)
642
+ ann = Annotator(args.giab_dir, args.repeatmasker)
643
+
644
+ eh_tsv = (args.caller == "expansionhunter") # EH ships a flat (gzipped) TSV, not a VCF
645
+ if eh_tsv:
646
+ with open(args.vcf, "rb") as fh:
647
+ comp = "gzip" if fh.read(2) == b"\x1f\x8b" else None
648
+ records = pd.read_csv(args.vcf, sep="\t", dtype=str, compression=comp).to_dict("records")
649
+ sample = args.sample or os.path.basename(args.vcf).split(".")[0]
650
+ get_chrom = lambda r: str(r["chrom"])
651
+ def is_homref(r):
652
+ cn, ref = _split_pair(r.get("repcn"), "/"), _num(r.get("ref"))
653
+ return bool(cn) and all(x == ref for x in cn)
654
+ else:
655
+ vf = pysam.VariantFile(args.vcf)
656
+ hdr = list(vf.header.samples)
657
+ sample = args.sample or (hdr[0] if len(hdr) == 1 else None)
658
+ if sample is None:
659
+ sys.exit(f"[error] --sample required: VCF has {len(hdr)} samples {hdr}")
660
+ records = list(vf)
661
+ get_chrom = lambda r: r.chrom
662
+ is_homref = lambda r: not (set(fmt(r, "GT") or ()) - {0})
663
+ sys.stderr.write(f"[info] sample={sample}\n")
664
+
665
+ n_raw = len(records)
666
+ if args.primary_only:
667
+ records = [r for r in records if get_chrom(r) in PRIMARY_CONTIGS]
668
+ sys.stderr.write(f"[info] primary-only: dropped {n_raw - len(records):,} non-primary-contig records\n")
669
+ if variant_class == "STR" and args.str_drop_homref:
670
+ before = len(records)
671
+ records = [r for r in records if not is_homref(r)]
672
+ sys.stderr.write(f"[info] str-drop-homref: dropped {before - len(records):,} hom-ref loci\n")
673
+ sys.stderr.write(f"[info] {len(records):,} records to process | caller={args.caller} class={variant_class}\n")
674
+
675
+ rows = []
676
+ if variant_class == "SV":
677
+ parser = SV_PARSERS[args.caller]
678
+ parsed = [parser(r) for r in records]
679
+ compute_clustering(parsed, args.neighbor_radius)
680
+ for d in parsed:
681
+ f = sv_features(d, ann, fasta, args.win)
682
+ f["caller"] = args.caller
683
+ rows.append(f)
684
+ else:
685
+ parser = parse_eh_tsv if eh_tsv else STR_PARSERS[args.caller]
686
+ for r in records:
687
+ d = parser(r)
688
+ f = str_features(d, ann, fasta, args.win, args.read_len)
689
+ f["caller"] = args.caller
690
+ rows.append(f)
691
+
692
+ out = pd.DataFrame(rows)
693
+ out["sample"] = sample
694
+ # Missingness is carried by the -99999 sentinel in each feature (trees split on it
695
+ # directly). Optional explicit indicators (fixed list -> stable schema) for linear/NN.
696
+ if args.missing_indicators:
697
+ indicators = SV_MISSING_INDICATORS if variant_class == "SV" else STR_MISSING_INDICATORS
698
+ for col in indicators:
699
+ out[f"{col}_missing"] = (out[col] == MISSING).astype(int) if col in out.columns else 0
700
+ # meta (label join key) first: sample, caller, variant_ID — NOT model features
701
+ meta = [c for c in ("sample", "caller", "variant_ID") if c in out.columns]
702
+ out = out[meta + [c for c in out.columns if c not in meta]]
703
+ out.to_csv(args.output, sep="\t", index=False)
704
+ sys.stderr.write(f"[info] wrote {len(out):,} rows x {out.shape[1]} cols -> {args.output}\n")
705
+
706
+
707
+ if __name__ == "__main__":
708
+ main()
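
For anyone reviewing the neighbor-density features, the logic of `compute_clustering()` can be sketched for a single chromosome as below. This is a standalone illustration, not code from the PR: `neighbor_stats` is a hypothetical helper name and the positions are made up. It assumes the same `-99999` sentinel and the same `searchsorted`-based window count as the diff above.

```python
import math

import numpy as np

MISSING = -99999.0  # same out-of-range sentinel as feature_builder.py


def neighbor_stats(positions, radius):
    """Per-call (nn_log_dist, n_neighbors) for one chromosome.

    Results are returned in sorted-position order.
    """
    pos = np.asarray(sorted(positions))
    n = len(pos)
    out = []
    for k, p in enumerate(pos):
        if n < 2:
            # A lone call has no neighbor: distance is undefined, count is 0.
            out.append((MISSING, 0))
            continue
        left = p - pos[k - 1] if k > 0 else float("inf")
        right = pos[k + 1] - p if k < n - 1 else float("inf")
        nearest = min(left, right)
        # Count calls inside [p - radius, p + radius], then exclude the call itself.
        lo = int(np.searchsorted(pos, p - radius, "left"))
        hi = int(np.searchsorted(pos, p + radius, "right"))
        out.append((math.log10(nearest + 1), hi - lo - 1))
    return out


stats = neighbor_stats([1_000, 1_500, 250_000], radius=100_000)
```

With the default 100 kb radius, the two calls 500 bp apart see each other as neighbors, while the call at 250 kb has none but still gets a finite, uncapped `nn_log_dist`.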
feature_manifest.json CHANGED
@@ -1,54 +1,64 @@
  {
+ "release_version": "1.0",
  "sv_features": [
- "qual",
- "gq",
- "pr_ref",
- "pr_alt",
- "sr_ref",
- "sr_alt",
- "vf_ref",
- "vf_alt",
- "total_alt_support",
- "sr_pr_ratio",
- "vaf_estimate",
- "strand_bias_fs",
- "cipos_width",
- "ciend_width",
- "homlen",
- "svlen_abs",
+ "is_pass",
  "svtype_DEL",
- "svtype_INS",
  "svtype_DUP",
+ "svtype_INS",
+ "svtype_INV",
  "svtype_BND",
+ "svlen_log",
+ "cipos_width",
+ "ciend_width",
  "is_imprecise",
- "filter_pass",
- "n_filter_flags"
+ "pe_support",
+ "sr_support",
+ "total_support",
+ "vaf",
+ "gt_hom",
+ "gq",
+ "qual_norm",
+ "local_depth",
+ "gc_min",
+ "gc_max",
+ "entropy_min",
+ "microhom_max",
+ "in_segdup_either",
+ "in_segdup_both",
+ "in_difficult_either",
+ "in_difficult_both",
+ "in_lowmap_either",
+ "in_tandem_either",
+ "in_Alu_either",
+ "in_L1_either",
+ "in_SVA_either",
+ "in_LTR_either",
+ "frac_span_repeat",
+ "n_neighbors",
+ "nn_log_dist"
  ],
  "str_features": [
- "repcn_a1",
- "repcn_a2",
- "ru_length",
- "ref_count",
- "delta_from_ref_a1",
- "delta_from_ref_a2",
- "ci_width_a1",
- "ci_width_a2",
- "adsp_a1",
- "adsp_a2",
- "adfl_a1",
- "adfl_a2",
- "adir_a1",
- "adir_a2",
- "total_support_a1",
- "total_support_a2",
- "spanning_frac_a1",
- "spanning_frac_a2",
- "locus_coverage",
- "allele_balance",
- "support_type_a1",
- "support_type_a2",
- "is_low_depth",
- "locus_conc_rate",
- "locus_in_lookup"
- ]
+ "is_pass",
+ "motif_len",
+ "ref_copynum",
+ "gt_repcn_max",
+ "gt_repcn_min",
+ "expansion_over_ref",
+ "repci_width_max",
+ "spanning_reads",
+ "flanking_reads",
+ "inrepeat_reads",
+ "locus_depth",
+ "gt_hom",
+ "ref_tract_bp",
+ "spanning_frac",
+ "allele_vs_readlen",
+ "motif_is_homopolymer",
+ "gc_flank",
+ "entropy_flank",
+ "in_segdup",
+ "in_difficult",
+ "flank_lowmap"
+ ],
+ "note": "Produced by feature_builder.py from a short-read VCF + reference FASTA + static BEDs; missing fields use the -99999 sentinel."
  }
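
Since the manifest fixes the feature order, a consumer would presumably use it to pull the model matrix out of the builder's TSV and to fail early on a schema mismatch. A minimal sketch under that assumption follows. `select_features` is a hypothetical helper, not part of this release, and the toy manifest carries only three of the 21 STR features.

```python
import pandas as pd


def select_features(table, manifest, variant):
    """Return the feature columns in manifest order; raise if any is absent."""
    cols = manifest[f"{variant}_features"]
    absent = [c for c in cols if c not in table.columns]
    if absent:
        raise KeyError(f"feature table lacks columns: {absent}")
    return table[cols]


# Toy manifest and one-row table, for illustration only.
manifest = {"str_features": ["is_pass", "motif_len", "gc_flank"]}
table = pd.DataFrame({
    "sample": ["S1"], "caller": ["gangstr"], "variant_ID": ["chr1:100:120"],
    "gc_flank": [0.41], "is_pass": [1], "motif_len": [2.0],
})

# Meta columns (sample, caller, variant_ID) are dropped; order follows the manifest.
X = select_features(table, manifest, "str")
```

Selecting by name rather than by position keeps the `sample`, `caller` and `variant_ID` bookkeeping columns out of the matrix regardless of where the builder writes them.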
requirements.txt CHANGED
@@ -1,4 +1,4 @@
- scikit-learn==1.5.1
+ scikit-learn==1.7.1
  pandas>=2.0
  numpy>=1.24
  joblib>=1.3
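
The exact `scikit-learn` pin matters because a joblib-serialized estimator is generally only safe to load with the scikit-learn version that wrote it. A small guard like the one below could run before loading a model. This is a sketch, not part of the PR: `pin_matches` is a hypothetical helper and the pinned version is simply copied from the diff above.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_matches(package, pinned):
    """True only if `package` is installed at exactly the pinned version."""
    try:
        return version(package) == pinned
    except PackageNotFoundError:
        return False


if not pin_matches("scikit-learn", "1.7.1"):
    print("[warn] scikit-learn is not at the pinned 1.7.1; model loading may warn or fail")
```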
score_svstr.py CHANGED
@@ -1,110 +1,91 @@
1
  #!/usr/bin/env python3
2
  """
3
  score_svstr.py — apply the SVSTR-Score confidence model to short-read SV or STR
4
- calls and emit a per-call confidence score (CS) and tier.
5
 
6
- This is the inference entry point for the released models
7
- (sv_model_v13_parents.joblib / str_model_v13_parents.joblib). It loads the
8
- trained random forest + its feature/config sidecar and (for STR) the per-locus
9
- catalogue lookup, then scores a tabular feature matrix extracted from the
10
- caller VCF.
11
 
12
- TIERS (as published):
13
- HIGH CS >= 0.70
 
 
14
  MODERATE 0.50 <= CS < 0.70
15
  WARNING 0.30 <= CS < 0.50
16
  LOW CS < 0.30
17
- Override: the deployed quality-demotion rules in each `*_config.json` ("rules")
18
- are applied on top of the CS tier. A call flagged by ANY rule is reassigned to
19
- the config's `override_target` regardless of CS: SV -> LOW (5 rules:
20
- QUAL<20, GQ<15, total_alt_support<=2, vaf_estimate<0.15, is_imprecise==1);
21
- STR -> WARNING (6 rules: locus_coverage<20, is_low_depth==1, support_type_a2<=1,
22
- total_support_a2<5, ci_width_a2>3, allele_balance<0.3). Only the MODERATE-to-HIGH
23
- range is a monotone precision ladder; LOW/WARNING are not.
24
-
25
- INPUT (--features): tab/comma table extracted from the caller VCF.
26
- SV : the 23 features in sv_config.json (+ chrom,pos optional for provenance)
27
- STR : the 23 caller-output features in str_config.json EXCLUDING locus_conc_rate
28
- and locus_in_lookup, PLUS chrom,pos (used to join the catalogue lookup).
29
- locus_conc_rate / locus_in_lookup are filled here from the released lookup.
30
 
31
  USAGE
32
- python score_svstr.py --variant sv --model-dir . --features sv_feats.tsv --out sv_scored.tsv
33
- python score_svstr.py --variant str --model-dir . --features str_feats.tsv --out str_scored.tsv
34
 
35
- Requires the package versions in requirements.txt (scikit-learn==1.5.1).
36
- Licence: MIT.
37
  """
38
  import argparse, json, os, sys
39
  import numpy as np
40
  import pandas as pd
41
  import joblib
42
 
43
- # out-of-catalogue STR loci: published convention (Methods) — conservative score
44
- STR_MISS_LOCUS_CONC_RATE = 0.0
45
- STR_MISS_LOCUS_IN_LOOKUP = 0
46
-
47
- _OP = {"<": lambda a, v: a < v, "<=": lambda a, v: a <= v,
48
- ">": lambda a, v: a > v, ">=": lambda a, v: a >= v, "==": lambda a, v: a == v}
49
 
50
 
51
- def assign_tier(cs, df, cfg):
52
- """CS tier (HIGH/MODERATE/WARNING/LOW), then apply the config's deterministic
53
- quality-demotion rules: any rule triggered -> cfg['override_target'] (SV: LOW,
54
- STR: WARNING), regardless of CS. NaN feature values never trigger a rule."""
55
- tier = np.where(cs >= 0.70, "HIGH",
56
- np.where(cs >= 0.50, "MODERATE",
57
- np.where(cs >= 0.30, "WARNING", "LOW")))
58
- rules = cfg.get("rules", {})
59
- target = cfg.get("override_target", "LOW")
60
- trig = np.zeros(len(cs), dtype=bool)
61
- for r in rules.values():
62
- f = r["feature"]
63
- if f in df.columns:
64
- v = pd.to_numeric(df[f], errors="coerce").values
65
- trig = trig | np.nan_to_num(_OP[r["op"]](v, r["value"]), nan=False).astype(bool)
66
- return np.where(trig, target, tier)
67
 
68
 
69
  def main():
70
  ap = argparse.ArgumentParser(description=__doc__,
71
  formatter_class=argparse.RawDescriptionHelpFormatter)
72
  ap.add_argument("--variant", choices=["sv", "str"], required=True)
73
- ap.add_argument("--model-dir", default=".", help="dir with *_config.json, *.joblib, lookup")
74
- ap.add_argument("--features", required=True, help="feature table (tsv/csv)")
 
75
  ap.add_argument("--out", required=True)
 
76
  a = ap.parse_args()
77
 
78
  cfg = json.load(open(os.path.join(a.model_dir, f"{a.variant}_config.json")))
79
  model = joblib.load(os.path.join(a.model_dir, cfg["model_file"]))
 
80
  feats = cfg["features"]
81
 
82
  sep = "\t" if a.features.endswith((".tsv", ".txt", ".gz")) else ","
83
  df = pd.read_csv(a.features, sep=sep)
84
 
85
- if a.variant == "str":
86
- lk = pd.read_parquet(os.path.join(a.model_dir, cfg["lookup_file"]))
87
- keys = cfg.get("lookup_keys", ["chrom", "pos"])
88
- rate_col = [c for c in lk.columns if c not in keys][0] if "locus_conc_rate" not in lk.columns else "locus_conc_rate"
89
- lk = lk.rename(columns={rate_col: "locus_conc_rate"})[keys + ["locus_conc_rate"]]
90
- df = df.merge(lk, on=keys, how="left")
91
- df["locus_in_lookup"] = df["locus_conc_rate"].notna().astype(int)
92
- df["locus_conc_rate"] = df["locus_conc_rate"].fillna(STR_MISS_LOCUS_CONC_RATE)
93
- df.loc[df["locus_in_lookup"] == 0, "locus_in_lookup"] = STR_MISS_LOCUS_IN_LOOKUP
94
-
95
- missing = [f for f in feats if f not in df.columns]
96
- if missing:
97
- sys.exit(f"ERROR: input is missing required features: {missing}")
98
-
99
- X = df[feats]
100
- cs = model.predict_proba(X)[:, 1]
101
- df["CS"] = np.round(cs, 4)
102
- df["tier"] = assign_tier(cs, df, cfg)
103
-
104
- df.to_csv(a.out, sep="\t", index=False)
105
- n = len(df)
106
- hi = int((df["tier"] == "HIGH").sum())
107
- print(f"{a.variant.upper()}: scored {n:,} calls; HIGH {hi:,} ({hi/n:.1%})", file=sys.stderr)
 
108
  print(f"wrote {a.out}", file=sys.stderr)
109
 
110
 
 
1
  #!/usr/bin/env python3
2
  """
3
  score_svstr.py — apply the SVSTR-Score confidence model to short-read SV or STR
4
+ calls and emit a per-call calibrated confidence score (CS) and tier.
5
 
6
+ Inference entry point for the released models (sv_model.joblib / str_model.joblib,
7
+ each paired with an isotonic calibrator). It loads the trained random forest + its
8
+ isotonic calibrator + the feature/config sidecar, then scores a tabular feature
9
+ matrix produced by `feature_builder.py` from the caller VCF (short-read VCF +
10
+ reference FASTA + static annotation BEDs):
11
 
12
+ CS = isotonic_calibrator( RF.predict_proba(X)[:, 1] ) # P(concordant)
13
+
14
+ TIERS:
15
+ HIGH CS >= 0.70 (candidate-triage filter)
16
  MODERATE 0.50 <= CS < 0.70
17
  WARNING 0.30 <= CS < 0.50
18
  LOW CS < 0.30
19
+
20
+ The score is isotonic-calibrated, so the tier is a pure bucket of the calibrated
21
+ CS; there are no heuristic override rules, and STR needs no per-locus catalogue
22
+ lookup (its features are self-contained). Missing features (fields a merged or
23
+ filtered callset may not carry) are filled with the -99999 sentinel that the
24
+ trees were trained to split on.
 
 
 
 
 
 
 
25
 
26
  USAGE
27
+ python score_svstr.py --variant sv --model-dir . --features sv_features.tsv --out sv_scored.tsv
28
+ python score_svstr.py --variant str --model-dir . --features str_features.tsv --out str_scored.tsv
29
 
30
+ Requires the versions in requirements.txt (scikit-learn==1.7.1). Licence: MIT.
 
31
  """
32
  import argparse, json, os, sys
33
  import numpy as np
34
  import pandas as pd
35
  import joblib
36
 
37
+ MISSING = -99999.0
38
+ TIER_EDGES = (0.30, 0.50, 0.70)
39
+ TIER_NAMES = ("LOW", "WARNING", "MODERATE", "HIGH")
 
 
 
40
 
41
 
42
+ def to_tier(cs):
43
+ return np.asarray(TIER_NAMES)[np.digitize(np.asarray(cs, float), TIER_EDGES, right=False)]
44
 
45
 
46
  def main():
47
  ap = argparse.ArgumentParser(description=__doc__,
48
  formatter_class=argparse.RawDescriptionHelpFormatter)
49
  ap.add_argument("--variant", choices=["sv", "str"], required=True)
50
+ ap.add_argument("--model-dir", default=".", help="dir with *_config.json + *.joblib")
51
+ ap.add_argument("--features", required=True,
52
+ help="feature table from feature_builder.py (tsv/csv[.gz])")
53
  ap.add_argument("--out", required=True)
54
+ ap.add_argument("--raw", action="store_true", help="also emit the uncalibrated RF score (CS_raw)")
55
  a = ap.parse_args()
56
 
57
  cfg = json.load(open(os.path.join(a.model_dir, f"{a.variant}_config.json")))
58
  model = joblib.load(os.path.join(a.model_dir, cfg["model_file"]))
59
+ cal = joblib.load(os.path.join(a.model_dir, cfg["calibrator_file"]))
60
  feats = cfg["features"]
61
 
62
  sep = "\t" if a.features.endswith((".tsv", ".txt", ".gz")) and not a.features.endswith(".csv.gz") else ","
63
  df = pd.read_csv(a.features, sep=sep)
64
 
65
+ absent = [f for f in feats if f not in df.columns]
66
+ if absent:
67
+ print(f"[warn] {len(absent)} model features absent -> -99999 sentinel: {absent}",
68
+ file=sys.stderr)
69
+ for f in absent:
70
+ df[f] = MISSING
71
+
72
+ X = df[feats].astype("float32")
73
+ p = model.predict_proba(X)
74
+ raw = p[:, list(model.classes_).index(1)] if p.shape[1] > 1 else p[:, 0]
75
+ cs = np.clip(cal.predict(raw), 0.0, 1.0)
76
+
77
+ out = df.copy()
78
+ if a.raw:
79
+ out["CS_raw"] = np.round(raw, 4)
80
+ out["CS"] = np.round(cs, 4)
81
+ out["tier"] = to_tier(cs)
82
+ out.to_csv(a.out, sep="\t", index=False)
83
+
84
+ n = len(out)
85
+ vc = pd.Series(out["tier"]).value_counts()
86
+ hi = int(vc.get("HIGH", 0))
87
+ print(f"{a.variant.upper()}: scored {n:,} calls; HIGH {hi:,} ({hi / max(n, 1):.1%}) | "
88
+ + " ".join(f"{t}:{int(vc.get(t,0)):,}" for t in TIER_NAMES), file=sys.stderr)
89
  print(f"wrote {a.out}", file=sys.stderr)
90
 
91
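
The tier assignment in `to_tier` is a plain half-open bucketing of the calibrated score. The sketch below reproduces the same boundaries in pure Python with `bisect.bisect_right`, which should agree with `np.digitize(..., right=False)` for increasing edges. `tier_of` is an illustrative name, not a function in the script.

```python
from bisect import bisect_right

TIER_EDGES = (0.30, 0.50, 0.70)
TIER_NAMES = ("LOW", "WARNING", "MODERATE", "HIGH")


def tier_of(cs):
    """Map one calibrated score to its tier.

    bisect_right counts how many edges are <= cs, so a score exactly on an
    edge falls into the upper bucket (0.70 is HIGH, 0.30 is WARNING).
    """
    return TIER_NAMES[bisect_right(TIER_EDGES, float(cs))]


if __name__ == "__main__":
    for score in (0.0, 0.2999, 0.30, 0.50, 0.6999, 0.70, 1.0):
        print(f"{score:.4f} -> {tier_of(score)}")
```

The lower bound of each tier is inclusive, matching the `tiers` strings in the config files (for example `0.50<=CS<0.70` for MODERATE).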
 
str_locus_lookup.parquet → str_calibrator.joblib RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f9e0f5d6becccb4f130486de1eb80341b44e8e1d66b935a41e5c81f9ae79f8e6
3
- size 2356330
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3af5671f62a57815797d4c3b71b1c5b5b0102d36d9275651b3ba088a3827e453
3
+ size 16219
str_config.json CHANGED
@@ -1,74 +1,53 @@
1
  {
2
- "model_file": "str_model_v13_parents.joblib",
 
 
 
3
  "features": [
4
- "repcn_a1",
5
- "repcn_a2",
6
- "ru_length",
7
- "ref_count",
8
- "delta_from_ref_a1",
9
- "delta_from_ref_a2",
10
- "ci_width_a1",
11
- "ci_width_a2",
12
- "adsp_a1",
13
- "adsp_a2",
14
- "adfl_a1",
15
- "adfl_a2",
16
- "adir_a1",
17
- "adir_a2",
18
- "total_support_a1",
19
- "total_support_a2",
20
- "spanning_frac_a1",
21
- "spanning_frac_a2",
22
- "locus_coverage",
23
- "allele_balance",
24
- "support_type_a1",
25
- "support_type_a2",
26
- "is_low_depth",
27
- "locus_conc_rate",
28
- "locus_in_lookup"
29
  ],
30
- "tier_thresholds": {
31
- "low": 0.3,
32
- "high": 0.7
 
 
 
 
33
  },
34
- "rules": {
35
- "low_coverage": {
36
- "feature": "locus_coverage",
37
- "op": "<",
38
- "value": 20
39
- },
40
- "low_depth_flag": {
41
- "feature": "is_low_depth",
42
- "op": "==",
43
- "value": 1
44
- },
45
- "no_spanning_a2": {
46
- "feature": "support_type_a2",
47
- "op": "<=",
48
- "value": 1
49
- },
50
- "low_support_a2": {
51
- "feature": "total_support_a2",
52
- "op": "<",
53
- "value": 5
54
- },
55
- "wide_ci_a2": {
56
- "feature": "ci_width_a2",
57
- "op": ">",
58
- "value": 3
59
- },
60
- "low_allele_balance": {
61
- "feature": "allele_balance",
62
- "op": "<",
63
- "value": 0.3
64
- }
65
- },
66
- "lookup_file": "str_locus_lookup.parquet",
67
- "lookup_keys": [
68
- "chrom",
69
- "pos"
70
  ],
71
- "sklearn_version_trained": "1.5.1",
72
  "variant_class": "STR",
73
- "override_target": "WARNING"
 
 
 
 
 
 
 
 
74
  }
 
1
  {
2
+ "release_version": "1.0",
3
+ "model_file": "str_model.joblib",
4
+ "calibrator_file": "str_calibrator.joblib",
5
+ "calibration": "isotonic regression on out-of-fold scores",
6
  "features": [
7
+ "is_pass",
8
+ "motif_len",
9
+ "ref_copynum",
10
+ "gt_repcn_max",
11
+ "gt_repcn_min",
12
+ "expansion_over_ref",
13
+ "repci_width_max",
14
+ "spanning_reads",
15
+ "flanking_reads",
16
+ "inrepeat_reads",
17
+ "locus_depth",
18
+ "gt_hom",
19
+ "ref_tract_bp",
20
+ "spanning_frac",
21
+ "allele_vs_readlen",
22
+ "motif_is_homopolymer",
23
+ "gc_flank",
24
+ "entropy_flank",
25
+ "in_segdup",
26
+ "in_difficult",
27
+ "flank_lowmap"
 
 
 
 
28
  ],
29
+ "n_features": 21,
30
+ "missing_sentinel": -99999.0,
31
+ "tiers": {
32
+ "HIGH": "CS>=0.70",
33
+ "MODERATE": "0.50<=CS<0.70",
34
+ "WARNING": "0.30<=CS<0.50",
35
+ "LOW": "CS<0.30"
36
  },
37
+ "tier_edges": [
38
+ 0.3,
39
+ 0.5,
40
+ 0.7
41
  ],
42
+ "score": "CS = isotonic-calibrated P(call concordant with long-read truth)",
43
  "variant_class": "STR",
44
+ "sklearn_version_trained": "1.7.1",
45
+ "training": {
46
+ "cohort": "HPRC",
47
+ "n_samples": 208,
48
+ "n_train_rows": 22651133,
49
+ "cv": "5-fold GroupKFold by sample",
50
+ "oof_auroc": 0.8342,
51
+ "oof_auprc": 0.886
52
+ }
53
  }
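
The `features` order and `missing_sentinel` in this config define the matrix the model expects. As a rough illustration, the sketch below assembles rows in config order from a tab-separated table using only the standard library. `build_matrix` and the `MISSING_TOKENS` set are hypothetical. Filling empty or `NA` cells with the sentinel is an assumption of this sketch: the released script only fills columns that are absent entirely.

```python
import csv
import io

MISSING = -99999.0                      # "missing_sentinel" from the config
MISSING_TOKENS = {"", "NA", "NaN", "nan"}  # assumption: tokens treated as missing


def build_matrix(rows, features, missing=MISSING):
    """Return (matrix, absent): rows ordered like `features`, plus absent columns."""
    absent = [f for f in features if rows and f not in rows[0]]
    matrix = []
    for row in rows:
        values = []
        for name in features:
            cell = row.get(name)
            if cell is None or cell in MISSING_TOKENS:
                values.append(missing)   # absent column or empty cell
            else:
                values.append(float(cell))
        matrix.append(values)
    return matrix, absent


# Toy table with two of the STR features; "is_pass" is deliberately absent.
table = "chrom\tpos\tmotif_len\tlocus_depth\nchr1\t100\t3\t31\nchr1\t200\t2\t\n"
rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
matrix, absent = build_matrix(rows, ["is_pass", "motif_len", "locus_depth"])
print(absent)
print(matrix)
```

Extra columns such as `chrom` and `pos` are ignored because only the names listed in `features` are read.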
str_model_v13_parents.joblib → str_model.joblib RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a8a3f4a19de1cfd46ea3241b6c2ad31b4f71bddcd2e2a23d0e7f1c35d54bf044
3
- size 1067649121
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:589539ef52b9c0d6518ac7e6a7beda82abfe767d7a5e27fb1dd9a55099f373d5
3
+ size 512205159
str_model_meta.json ADDED
@@ -0,0 +1,448 @@
1
+ {
2
+ "variant": "str",
3
+ "created_unix": 1782043477,
4
+ "feature_cols": [
5
+ "is_pass",
6
+ "motif_len",
7
+ "ref_copynum",
8
+ "gt_repcn_max",
9
+ "gt_repcn_min",
10
+ "expansion_over_ref",
11
+ "repci_width_max",
12
+ "spanning_reads",
13
+ "flanking_reads",
14
+ "inrepeat_reads",
15
+ "locus_depth",
16
+ "gt_hom",
17
+ "ref_tract_bp",
18
+ "spanning_frac",
19
+ "allele_vs_readlen",
20
+ "motif_is_homopolymer",
21
+ "gc_flank",
22
+ "entropy_flank",
23
+ "in_segdup",
24
+ "in_difficult",
25
+ "flank_lowmap"
26
+ ],
27
+ "n_features": 21,
28
+ "tier_edges": [
29
+ 0.3,
30
+ 0.5,
31
+ 0.7
32
+ ],
33
+ "tier_names": [
34
+ "LOW",
35
+ "Warning",
36
+ "Moderate",
37
+ "High"
38
+ ],
39
+ "missing_sentinel": -99999.0,
40
+ "rf_params": {
41
+ "bootstrap": true,
42
+ "ccp_alpha": 0.0,
43
+ "class_weight": "balanced_subsample",
44
+ "criterion": "gini",
45
+ "max_depth": null,
46
+ "max_features": "sqrt",
47
+ "max_leaf_nodes": null,
48
+ "max_samples": 2000000,
49
+ "min_impurity_decrease": 0.0,
50
+ "min_samples_leaf": 50,
51
+ "min_samples_split": 2,
52
+ "min_weight_fraction_leaf": 0.0,
53
+ "monotonic_cst": null,
54
+ "n_estimators": 300,
55
+ "n_jobs": -1,
56
+ "oob_score": false,
57
+ "random_state": 42,
58
+ "verbose": 0,
59
+ "warm_start": false
60
+ },
61
+ "n_train_rows": 22651133,
62
+ "n_samples": 208,
63
+ "qc": {
64
+ "label_rows_raw": 36254400,
65
+ "label_dist_raw": {
66
+ "concordant": 21350382,
67
+ "discordant": 13838163,
68
+ "unlabeled": 1065855
69
+ },
70
+ "label_rows_usable": 35188545,
71
+ "ambiguous_keys_dropped": 0,
72
+ "ambiguous_feat_rows": 0,
73
+ "ambiguous_label_rows": 0,
74
+ "dup_keys_feature": 0,
75
+ "dup_keys_label": 0,
76
+ "merged_rows": 22651133,
77
+ "match_rate_vs_labels": 0.6437075758602693,
78
+ "match_rate_vs_features": 0.9832673629385175,
79
+ "class_balance": {
80
+ "concordant": 13960015,
81
+ "discordant": 8691118
82
+ },
83
+ "concordant_rate": 0.6163053742168217
84
+ },
85
+ "cv_folds": 5,
86
+ "cv_fold_metrics": [
87
+ {
88
+ "n": 4469639,
89
+ "pos_rate": 0.6172603648751052,
90
+ "auroc": 0.8345731588413778,
91
+ "auprc": 0.8868311937682424,
92
+ "brier": 0.16715199887480572,
93
+ "logloss": 0.505031384190826,
94
+ "fold": 0,
95
+ "seconds": 404.5
96
+ },
97
+ {
98
+ "n": 4469658,
99
+ "pos_rate": 0.6172628867801518,
100
+ "auroc": 0.8348793797657998,
101
+ "auprc": 0.8871277104956028,
102
+ "brier": 0.16710046702995693,
103
+ "logloss": 0.5048207582711781,
104
+ "fold": 1,
105
+ "seconds": 457.3
106
+ },
107
+ {
108
+ "n": 4569998,
109
+ "pos_rate": 0.6173429397562099,
110
+ "auroc": 0.8345632397054213,
111
+ "auprc": 0.8867279699640327,
112
+ "brier": 0.16717765756008388,
113
+ "logloss": 0.5050623605875793,
114
+ "fold": 2,
115
+ "seconds": 480.6
116
+ },
117
+ {
118
+ "n": 4570859,
119
+ "pos_rate": 0.6168989242503433,
120
+ "auroc": 0.8350534258010407,
121
+ "auprc": 0.8870572426757822,
122
+ "brier": 0.1669604630273807,
123
+ "logloss": 0.5044600822147348,
124
+ "fold": 3,
125
+ "seconds": 546.9
126
+ },
127
+ {
128
+ "n": 4570979,
129
+ "pos_rate": 0.6128043904817765,
130
+ "auroc": 0.8317845587452297,
131
+ "auprc": 0.8823297885222531,
132
+ "brier": 0.16790578845588436,
133
+ "logloss": 0.5066430427730261,
134
+ "fold": 4,
135
+ "seconds": 558.6
136
+ }
137
+ ],
138
+ "cv_report": {
139
+ "overall": {
140
+ "n": 22651133,
141
+ "pos_rate": 0.6163053742168217,
142
+ "auroc": 0.8341539493365068,
143
+ "auprc": 0.885996637709877,
144
+ "brier": 0.16726047042063633,
145
+ "logloss": 0.5052060179718258
146
+ },
147
+ "calibration": [
148
+ {
149
+ "bin": "[0.0,0.1)",
150
+ "n": 759079,
151
+ "mean_pred": 0.06623314081824806,
152
+ "obs_rate": 0.027333123429840636
153
+ },
154
+ {
155
+ "bin": "[0.1,0.2)",
156
+ "n": 1807689,
157
+ "mean_pred": 0.15353118408631086,
158
+ "obs_rate": 0.1398288090484591
159
+ },
160
+ {
161
+ "bin": "[0.2,0.3)",
162
+ "n": 2278662,
163
+ "mean_pred": 0.250703986073481,
164
+ "obs_rate": 0.2854271497922904
165
+ },
166
+ {
167
+ "bin": "[0.3,0.4)",
168
+ "n": 2401825,
169
+ "mean_pred": 0.35114505321433914,
170
+ "obs_rate": 0.4219845325950059
171
+ },
172
+ {
173
+ "bin": "[0.4,0.5)",
174
+ "n": 2503890,
175
+ "mean_pred": 0.4496778698066448,
176
+ "obs_rate": 0.5559477453083003
177
+ },
178
+ {
179
+ "bin": "[0.5,0.6)",
180
+ "n": 2743182,
181
+ "mean_pred": 0.5514420283736253,
182
+ "obs_rate": 0.6633803371413198
183
+ },
184
+ {
185
+ "bin": "[0.6,0.7)",
186
+ "n": 3201411,
187
+ "mean_pred": 0.6513120336728542,
188
+ "obs_rate": 0.7673941271520589
189
+ },
190
+ {
191
+ "bin": "[0.7,0.8)",
192
+ "n": 2972899,
193
+ "mean_pred": 0.7478180823491758,
194
+ "obs_rate": 0.8596629081579966
195
+ },
196
+ {
197
+ "bin": "[0.8,0.9)",
198
+ "n": 2979925,
199
+ "mean_pred": 0.8513437073854806,
200
+ "obs_rate": 0.9412015403072225
201
+ },
202
+ {
203
+ "bin": "[0.9,1.0)",
204
+ "n": 1002571,
205
+ "mean_pred": 0.9221679799864609,
206
+ "obs_rate": 0.9910769411842154
207
+ }
208
+ ],
209
+ "per_sample_auroc": {
210
+ "n_samples": 208,
211
+ "median": 0.8353140721290141,
212
+ "p25": 0.8326614184016954,
213
+ "p75": 0.8373927525350378,
214
+ "min": 0.740174387702103,
215
+ "max": 0.8401855333526593
216
+ },
217
+ "by_homopolymer": {
218
+ "homopolymer": {
219
+ "n": 176,
220
+ "pos_rate": 0.0,
221
+ "auroc": null,
222
+ "auprc": null,
223
+ "brier": 0.12461994174893026
224
+ },
225
+ "other": {
226
+ "n": 22650957,
227
+ "pos_rate": 0.6163101629657414,
228
+ "auroc": 0.8341526308855854,
229
+ "auprc": 0.8859973231761953,
230
+ "brier": 0.16726080174142982,
231
+ "logloss": 0.5052065639352175
232
+ }
233
+ },
234
+ "by_is_pass": {
235
+ "PASS": {
236
+ "n": 22645309,
237
+ "pos_rate": 0.6163365225000904,
238
+ "auroc": 0.8341536917536043,
239
+ "auprc": 0.8860084593752011,
240
+ "brier": 0.1672574382686718,
241
+ "logloss": 0.505198302627369
242
+ },
243
+ "nonPASS": {
244
+ "n": 5824,
245
+ "pos_rate": 0.4951923076923077,
246
+ "auroc": 0.821139738835895,
247
+ "auprc": 0.8249088115206255,
248
+ "brier": 0.17905030870563365,
249
+ "logloss": 0.5352053928461165
250
+ }
251
+ }
252
+ },
253
+ "importances": {
254
+ "impurity": [
255
+ {
256
+ "feature": "entropy_flank",
257
+ "impurity_importance": 0.28992320685730033
258
+ },
259
+ {
260
+ "feature": "motif_len",
261
+ "impurity_importance": 0.15078304844246473
262
+ },
263
+ {
264
+ "feature": "gc_flank",
265
+ "impurity_importance": 0.11765967510912077
266
+ },
267
+ {
268
+ "feature": "ref_tract_bp",
269
+ "impurity_importance": 0.09594543197447271
270
+ },
271
+ {
272
+ "feature": "allele_vs_readlen",
273
+ "impurity_importance": 0.06304989891121958
274
+ },
275
+ {
276
+ "feature": "ref_copynum",
277
+ "impurity_importance": 0.06281644250839796
278
+ },
279
+ {
280
+ "feature": "gt_repcn_max",
281
+ "impurity_importance": 0.045375808024477604
282
+ },
283
+ {
284
+ "feature": "gt_repcn_min",
285
+ "impurity_importance": 0.04503548319154128
286
+ },
287
+ {
288
+ "feature": "flanking_reads",
289
+ "impurity_importance": 0.04081082547154657
290
+ },
291
+ {
292
+ "feature": "spanning_frac",
293
+ "impurity_importance": 0.02788421749138721
294
+ },
295
+ {
296
+ "feature": "expansion_over_ref",
297
+ "impurity_importance": 0.017739812221077934
298
+ },
299
+ {
300
+ "feature": "locus_depth",
301
+ "impurity_importance": 0.014556405292958223
302
+ },
303
+ {
304
+ "feature": "spanning_reads",
305
+ "impurity_importance": 0.011672664495590936
306
+ },
307
+ {
308
+ "feature": "in_difficult",
309
+ "impurity_importance": 0.009656418449637608
310
+ },
311
+ {
312
+ "feature": "gt_hom",
313
+ "impurity_importance": 0.0024291103645865167
314
+ },
315
+ {
316
+ "feature": "in_segdup",
317
+ "impurity_importance": 0.001648983588740384
318
+ },
319
+ {
320
+ "feature": "flank_lowmap",
321
+ "impurity_importance": 0.001477948437034436
322
+ },
323
+ {
324
+ "feature": "repci_width_max",
325
+ "impurity_importance": 0.0012018133362063474
326
+ },
327
+ {
328
+ "feature": "inrepeat_reads",
329
+ "impurity_importance": 0.00033183288445321743
330
+ },
331
+ {
332
+ "feature": "is_pass",
333
+ "impurity_importance": 9.729477856164029e-07
334
+ },
335
+ {
336
+ "feature": "motif_is_homopolymer",
337
+ "impurity_importance": 0.0
338
+ }
339
+ ],
340
+ "permutation": [
341
+ {
342
+ "feature": "entropy_flank",
343
+ "perm_importance_mean": 0.13934060781658777,
344
+ "perm_importance_std": 0.0006361765279266924
345
+ },
346
+ {
347
+ "feature": "motif_len",
348
+ "perm_importance_mean": 0.1232472127797279,
349
+ "perm_importance_std": 0.0005893220011599711
350
+ },
351
+ {
352
+ "feature": "gc_flank",
353
+ "perm_importance_mean": 0.06320217026546789,
354
+ "perm_importance_std": 0.00039522027338993824
355
+ },
356
+ {
357
+ "feature": "ref_tract_bp",
358
+ "perm_importance_mean": 0.056776687651067095,
359
+ "perm_importance_std": 0.00015236878123781785
360
+ },
361
+ {
362
+ "feature": "ref_copynum",
363
+ "perm_importance_mean": 0.02267318905161917,
364
+ "perm_importance_std": 0.00014989102435837524
365
+ },
366
+ {
367
+ "feature": "allele_vs_readlen",
368
+ "perm_importance_mean": 0.020529595235711205,
369
+ "perm_importance_std": 0.00017190103816491447
370
+ },
371
+ {
372
+ "feature": "gt_repcn_min",
373
+ "perm_importance_mean": 0.01731383830567197,
374
+ "perm_importance_std": 0.000195043199990813
375
+ },
376
+ {
377
+ "feature": "gt_repcn_max",
378
+ "perm_importance_mean": 0.014405902490600276,
379
+ "perm_importance_std": 0.00013955774976049523
380
+ },
381
+ {
382
+ "feature": "expansion_over_ref",
383
+ "perm_importance_mean": 0.008579439049389648,
384
+ "perm_importance_std": 8.141211169349268e-05
385
+ },
386
+ {
387
+ "feature": "flanking_reads",
388
+ "perm_importance_mean": 0.005908979701386818,
389
+ "perm_importance_std": 8.933000723756271e-05
390
+ },
391
+ {
392
+ "feature": "spanning_frac",
393
+ "perm_importance_mean": 0.005236130437139996,
394
+ "perm_importance_std": 4.831785228506296e-05
395
+ },
396
+ {
397
+ "feature": "in_difficult",
398
+ "perm_importance_mean": 0.003852866555695589,
399
+ "perm_importance_std": 2.129084797378384e-05
400
+ },
401
+ {
402
+ "feature": "spanning_reads",
403
+ "perm_importance_mean": 0.0029217009056680563,
404
+ "perm_importance_std": 4.176582259464099e-05
405
+ },
406
+ {
407
+ "feature": "gt_hom",
408
+ "perm_importance_mean": 0.002172501389781667,
409
+ "perm_importance_std": 8.3379119655914e-06
410
+ },
411
+ {
412
+ "feature": "locus_depth",
413
+ "perm_importance_mean": 0.0020709165127682284,
414
+ "perm_importance_std": 2.549011860464095e-05
415
+ },
416
+ {
417
+ "feature": "in_segdup",
418
+ "perm_importance_mean": 0.0009386532858458585,
419
+ "perm_importance_std": 1.750671402431846e-05
420
+ },
421
+ {
422
+ "feature": "flank_lowmap",
423
+ "perm_importance_mean": 0.0005812032061902617,
424
+ "perm_importance_std": 1.3028115550094254e-05
425
+ },
426
+ {
427
+ "feature": "repci_width_max",
428
+ "perm_importance_mean": 0.00026026760399893155,
429
+ "perm_importance_std": 1.492427417015547e-05
430
+ },
431
+ {
432
+ "feature": "inrepeat_reads",
433
+ "perm_importance_mean": 5.1632608300478114e-05,
434
+ "perm_importance_std": 5.166444569830962e-06
435
+ },
436
+ {
437
+ "feature": "is_pass",
438
+ "perm_importance_mean": 5.758337677796987e-08,
439
+ "perm_importance_std": 3.3427425855204445e-08
440
+ },
441
+ {
442
+ "feature": "motif_is_homopolymer",
443
+ "perm_importance_mean": 0.0,
444
+ "perm_importance_std": 0.0
445
+ }
446
+ ]
447
+ }
448
+ }
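
The `cv_report.calibration` bins in this file can be condensed into a single expected calibration error (ECE), the count-weighted mean of |mean_pred - obs_rate|. The sketch below hard-codes the ten bins with values rounded to four decimals. The meta file does not state whether these bins describe the score before or after isotonic calibration, so the result is best read as a summary of the reported table only.

```python
# (n, mean_pred, obs_rate) per bin, rounded from str_model_meta.json
CALIBRATION_BINS = [
    (759079, 0.0662, 0.0273),
    (1807689, 0.1535, 0.1398),
    (2278662, 0.2507, 0.2854),
    (2401825, 0.3511, 0.4220),
    (2503890, 0.4497, 0.5559),
    (2743182, 0.5514, 0.6634),
    (3201411, 0.6513, 0.7674),
    (2972899, 0.7478, 0.8597),
    (2979925, 0.8513, 0.9412),
    (1002571, 0.9222, 0.9911),
]


def expected_calibration_error(bins):
    """Count-weighted mean absolute gap between predicted and observed rate."""
    total = sum(n for n, _, _ in bins)
    return sum(n * abs(pred - obs) for n, pred, obs in bins) / total


total_rows = sum(n for n, _, _ in CALIBRATION_BINS)
ece = expected_calibration_error(CALIBRATION_BINS)
print(f"rows covered: {total_rows:,}")
print(f"ECE over reported bins: {ece:.4f}")
```

The bin counts sum to the `n` reported under `cv_report.overall`, which is a quick sanity check that no bin was dropped in transcription.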
sv_model_v13_parents.joblib → sv_calibrator.joblib RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3a6ac169d5079ed6dbc5ece294c562ce9595b029147cee5debd48f27b53c2dfe
3
- size 1809789541
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:307cbb00688a5a3b5aa5caac1700cdbd1c2fdd5b4a7e08072a121a4f7526e2e5
3
+ size 7872
sv_config.json CHANGED
@@ -1,62 +1,67 @@
1
  {
2
- "model_file": "sv_model_v13_parents.joblib",
 
 
 
3
  "features": [
4
- "qual",
5
- "gq",
6
- "pr_ref",
7
- "pr_alt",
8
- "sr_ref",
9
- "sr_alt",
10
- "vf_ref",
11
- "vf_alt",
12
- "total_alt_support",
13
- "sr_pr_ratio",
14
- "vaf_estimate",
15
- "strand_bias_fs",
16
- "cipos_width",
17
- "ciend_width",
18
- "homlen",
19
- "svlen_abs",
20
  "svtype_DEL",
21
- "svtype_INS",
22
  "svtype_DUP",
 
 
23
  "svtype_BND",
 
 
 
24
  "is_imprecise",
25
- "filter_pass",
26
- "n_filter_flags"
27
  ],
28
- "tier_thresholds": {
29
- "low": 0.3,
30
- "high": 0.7
 
 
 
 
31
  },
32
- "rules": {
33
- "low_qual": {
34
- "feature": "qual",
35
- "op": "<",
36
- "value": 20
37
- },
38
- "low_gq": {
39
- "feature": "gq",
40
- "op": "<",
41
- "value": 15
42
- },
43
- "no_alt_support": {
44
- "feature": "total_alt_support",
45
- "op": "<=",
46
- "value": 2
47
- },
48
- "low_vaf": {
49
- "feature": "vaf_estimate",
50
- "op": "<",
51
- "value": 0.15
52
- },
53
- "imprecise": {
54
- "feature": "is_imprecise",
55
- "op": "==",
56
- "value": 1
57
- }
58
- },
59
- "sklearn_version_trained": "1.5.1",
60
  "variant_class": "SV",
61
- "override_target": "LOW"
 
 
 
 
 
 
 
 
62
  }
 
1
  {
2
+ "release_version": "1.0",
3
+ "model_file": "sv_model.joblib",
4
+ "calibrator_file": "sv_calibrator.joblib",
5
+ "calibration": "isotonic regression on out-of-fold scores",
6
  "features": [
7
+ "is_pass",
8
  "svtype_DEL",
 
9
  "svtype_DUP",
10
+ "svtype_INS",
11
+ "svtype_INV",
12
  "svtype_BND",
13
+ "svlen_log",
14
+ "cipos_width",
15
+ "ciend_width",
16
  "is_imprecise",
17
+ "pe_support",
18
+ "sr_support",
19
+ "total_support",
20
+ "vaf",
21
+ "gt_hom",
22
+ "gq",
23
+ "qual_norm",
24
+ "local_depth",
25
+ "gc_min",
26
+ "gc_max",
27
+ "entropy_min",
28
+ "microhom_max",
29
+ "in_segdup_either",
30
+ "in_segdup_both",
31
+ "in_difficult_either",
32
+ "in_difficult_both",
33
+ "in_lowmap_either",
34
+ "in_tandem_either",
35
+ "in_Alu_either",
36
+ "in_L1_either",
37
+ "in_SVA_either",
38
+ "in_LTR_either",
39
+ "frac_span_repeat",
40
+ "n_neighbors",
41
+ "nn_log_dist"
42
  ],
43
+ "n_features": 35,
44
+ "missing_sentinel": -99999.0,
45
+ "tiers": {
46
+ "HIGH": "CS>=0.70",
47
+ "MODERATE": "0.50<=CS<0.70",
48
+ "WARNING": "0.30<=CS<0.50",
49
+ "LOW": "CS<0.30"
50
  },
51
+ "tier_edges": [
52
+ 0.3,
53
+ 0.5,
54
+ 0.7
55
+ ],
56
+ "score": "CS = isotonic-calibrated P(call concordant with long-read truth)",
57
  "variant_class": "SV",
58
+ "sklearn_version_trained": "1.7.1",
59
+ "training": {
60
+ "cohort": "HPRC",
61
+ "n_samples": 208,
62
+ "n_train_rows": 2575116,
63
+ "cv": "5-fold GroupKFold by sample",
64
+ "oof_auroc": 0.9502,
65
+ "oof_auprc": 0.9512
66
+ }
67
  }
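
Since the config carries redundant fields (`n_features` alongside `features`, `tier_edges` alongside `tiers`), a small self-consistency check can catch hand-editing mistakes before scoring. This is a sketch under that assumption only. `config_problems` is a hypothetical helper and the specific checks are illustrative rather than rules enforced by the release.

```python
def config_problems(cfg):
    """Return a list of human-readable inconsistencies found in a config dict."""
    problems = []
    feats = cfg.get("features", [])
    if cfg.get("n_features") != len(feats):
        problems.append("n_features does not match len(features)")
    if len(set(feats)) != len(feats):
        problems.append("duplicate feature names")
    edges = list(cfg.get("tier_edges", []))
    if edges != sorted(edges) or any(not 0.0 < e < 1.0 for e in edges):
        problems.append("tier_edges must be sorted and lie inside (0, 1)")
    if len(cfg.get("tiers", {})) != len(edges) + 1:
        problems.append("expected len(tier_edges) + 1 tiers")
    return problems


# Abbreviated stand-in for sv_config.json (only three of the 35 features).
example = {
    "features": ["is_pass", "svtype_DEL", "svlen_log"],
    "n_features": 3,
    "tier_edges": [0.3, 0.5, 0.7],
    "tiers": {"HIGH": "CS>=0.70", "MODERATE": "0.50<=CS<0.70",
              "WARNING": "0.30<=CS<0.50", "LOW": "CS<0.30"},
}
print(config_problems(example))
```

An empty list means the checked fields agree with each other. It says nothing about whether the feature names match what the model was trained on.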
sv_model.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6cafd7dd091a0c1f9ceb2629c8bd6315e4967c37c4bf8910fcfea27454be64e7
3
+ size 992442127
sv_model_meta.json ADDED
@@ -0,0 +1,583 @@
1
+ {
2
+ "variant": "sv",
3
+ "feature_cols": [
4
+ "is_pass",
5
+ "svtype_DEL",
6
+ "svtype_DUP",
7
+ "svtype_INS",
8
+ "svtype_INV",
9
+ "svtype_BND",
10
+ "svlen_log",
11
+ "cipos_width",
12
+ "ciend_width",
13
+ "is_imprecise",
14
+ "pe_support",
15
+ "sr_support",
16
+ "total_support",
17
+ "vaf",
18
+ "gt_hom",
19
+ "gq",
20
+ "qual_norm",
21
+ "local_depth",
22
+ "gc_min",
23
+ "gc_max",
24
+ "entropy_min",
25
+ "microhom_max",
26
+ "in_segdup_either",
27
+ "in_segdup_both",
28
+ "in_difficult_either",
29
+ "in_difficult_both",
30
+ "in_lowmap_either",
31
+ "in_tandem_either",
32
+ "in_Alu_either",
33
+ "in_L1_either",
34
+ "in_SVA_either",
35
+ "in_LTR_either",
36
+ "frac_span_repeat",
37
+ "n_neighbors",
38
+ "nn_log_dist"
39
+ ],
40
+ "n_features": 35,
41
+ "tier_edges": [
42
+ 0.3,
43
+ 0.5,
44
+ 0.7
45
+ ],
46
+ "tier_names": [
47
+ "LOW",
48
+ "Warning",
49
+ "Moderate",
50
+ "High"
51
+ ],
52
+ "missing_sentinel": -99999.0,
53
+ "rf_params": {
54
+ "bootstrap": true,
55
+ "ccp_alpha": 0.0,
56
+ "class_weight": "balanced_subsample",
57
+ "criterion": "gini",
58
+ "max_depth": null,
59
+ "max_features": "sqrt",
60
+ "max_leaf_nodes": null,
61
+ "max_samples": null,
62
+ "min_impurity_decrease": 0.0,
63
+ "min_samples_leaf": 20,
64
+ "min_samples_split": 2,
65
+ "min_weight_fraction_leaf": 0.0,
66
+ "monotonic_cst": null,
67
+ "n_estimators": 400,
68
+ "n_jobs": -1,
69
+ "oob_score": false,
70
+ "random_state": 42,
71
+ "verbose": 0,
72
+ "warm_start": false
73
+ },
74
+ "n_train_rows": 2575116,
75
+ "n_samples": 208,
76
+ "qc": {
77
+ "label_rows_raw": 2782190,
78
+ "label_dist_raw": {
79
+ "concordant": 1530286,
80
+ "discordant": 1251904
81
+ },
82
+ "label_rows_usable": 2782190,
83
+ "ambiguous_keys_dropped": 9462,
84
+ "ambiguous_feat_rows": 18568,
85
+ "ambiguous_label_rows": 19114,
86
+ "dup_keys_feature": 18568,
87
+ "dup_keys_label": 19114,
88
+ "merged_rows": 2575116,
89
+ "match_rate_vs_labels": 0.9255715820989939,
90
+ "match_rate_vs_features": 1.0,
91
+ "class_balance": {
92
+ "concordant": 1511906,
93
+ "discordant": 1063210
94
+ },
95
+ "concordant_rate": 0.5871215121959554
96
+ },
97
+ "importances": {
98
+ "impurity": [
99
+ {
100
+ "feature": "svlen_log",
101
+ "impurity_importance": 0.14231107100051552
102
+ },
103
+ {
104
+ "feature": "svtype_BND",
105
+ "impurity_importance": 0.13513441637259968
106
+ },
107
+ {
108
+ "feature": "nn_log_dist",
109
+ "impurity_importance": 0.06888043456708229
110
+ },
111
+ {
112
+ "feature": "svtype_DUP",
113
+ "impurity_importance": 0.05967552431932549
114
+ },
115
+ {
116
+ "feature": "cipos_width",
117
+ "impurity_importance": 0.05563805020798381
118
+ },
119
+ {
120
+ "feature": "svtype_DEL",
121
+ "impurity_importance": 0.053612638587210104
122
+ },
123
+ {
124
+ "feature": "sr_support",
125
+ "impurity_importance": 0.04950031891627467
126
+ },
127
+ {
128
+ "feature": "vaf",
129
+ "impurity_importance": 0.04822805852199238
130
+ },
131
+ {
132
+ "feature": "qual_norm",
133
+ "impurity_importance": 0.04321008187665855
134
+ },
135
+ {
136
+ "feature": "svtype_INS",
137
+ "impurity_importance": 0.03671741257598638
138
+ },
139
+ {
140
+ "feature": "ciend_width",
141
+ "impurity_importance": 0.027655467190250325
142
+ },
143
+ {
144
+ "feature": "local_depth",
145
+ "impurity_importance": 0.024667786835611386
146
+ },
147
+ {
148
+ "feature": "microhom_max",
149
+ "impurity_importance": 0.023198612318248754
150
+ },
151
+ {
152
+ "feature": "is_imprecise",
153
+ "impurity_importance": 0.02244493530457544
154
+ },
155
+ {
156
+ "feature": "frac_span_repeat",
157
+ "impurity_importance": 0.02223685870094091
158
+ },
159
+ {
160
+ "feature": "entropy_min",
161
+ "impurity_importance": 0.02149966456515826
162
+ },
163
+ {
164
+ "feature": "pe_support",
165
+ "impurity_importance": 0.018807543132767727
166
+ },
167
+ {
168
+ "feature": "gc_min",
169
+ "impurity_importance": 0.018609267758191137
170
+ },
171
+ {
172
+ "feature": "gq",
173
+ "impurity_importance": 0.017999043161167707
174
+ },
175
+ {
176
+ "feature": "gc_max",
177
+ "impurity_importance": 0.01691329606031783
178
+ },
179
+ {
180
+ "feature": "total_support",
181
+ "impurity_importance": 0.01639193545906872
182
+ },
183
+ {
184
+ "feature": "gt_hom",
185
+ "impurity_importance": 0.014227746565587479
186
+ },
187
+ {
188
+ "feature": "n_neighbors",
189
+ "impurity_importance": 0.013414404592739407
190
+ },
191
+ {
192
+ "feature": "in_difficult_both",
193
+ "impurity_importance": 0.01186932037949349
194
+ },
195
+ {
196
+ "feature": "is_pass",
197
+ "impurity_importance": 0.011331347970027637
198
+ },
199
+ {
200
+ "feature": "in_tandem_either",
201
+ "impurity_importance": 0.006678373764632643
202
+ },
203
+ {
204
+ "feature": "in_lowmap_either",
205
+ "impurity_importance": 0.004352768263433798
206
+ },
207
+ {
208
+ "feature": "in_Alu_either",
209
+ "impurity_importance": 0.004250882794571496
210
+ },
211
+ {
212
+ "feature": "in_difficult_either",
213
+ "impurity_importance": 0.004067397992242813
214
+ },
215
+ {
216
+ "feature": "in_segdup_either",
217
+ "impurity_importance": 0.001790844691729305
218
+ },
219
+ {
220
+ "feature": "in_segdup_both",
221
+ "impurity_importance": 0.001567017346771133
222
+ },
223
+ {
224
+ "feature": "in_L1_either",
225
+ "impurity_importance": 0.0014931685060102253
226
+ },
227
+ {
228
+ "feature": "in_LTR_either",
229
+ "impurity_importance": 0.001243669659616864
230
+ },
231
+ {
232
+ "feature": "in_SVA_either",
233
+ "impurity_importance": 0.00038064004121650836
234
+ },
235
+ {
236
+ "feature": "svtype_INV",
237
+ "impurity_importance": 0.0
238
+ }
239
+ ],
240
+ "permutation": [
241
+ {
242
+ "feature": "svlen_log",
243
+ "perm_importance_mean": 0.04194058803806115,
244
+ "perm_importance_std": 0.0007089186317830076
245
+ },
246
+ {
247
+ "feature": "nn_log_dist",
248
+ "perm_importance_mean": 0.019778079687185944,
249
+ "perm_importance_std": 0.0002553374546546104
250
+ },
251
+ {
252
+ "feature": "svtype_DEL",
253
+ "perm_importance_mean": 0.018689927317770305,
254
+ "perm_importance_std": 0.0002501511162311443
255
+ },
256
+ {
257
+ "feature": "cipos_width",
258
+ "perm_importance_mean": 0.017400205941047363,
259
+ "perm_importance_std": 0.00021672163272084185
260
+ },
261
+ {
262
+ "feature": "svtype_BND",
263
+ "perm_importance_mean": 0.015007739432103828,
264
+ "perm_importance_std": 0.00026380208237979455
265
+ },
266
+ {
267
+ "feature": "qual_norm",
268
+ "perm_importance_mean": 0.013299944084231186,
269
+ "perm_importance_std": 0.00013420815060181905
270
+ },
271
+ {
272
+ "feature": "gc_min",
273
+ "perm_importance_mean": 0.012263946411167393,
274
+ "perm_importance_std": 0.0001367518392425017
275
+ },
276
+ {
277
+ "feature": "entropy_min",
278
+ "perm_importance_mean": 0.01175411732205851,
279
+ "perm_importance_std": 0.00010827028744123835
280
+ },
281
+ {
282
+ "feature": "pe_support",
283
+ "perm_importance_mean": 0.011431666717043187,
284
+ "perm_importance_std": 0.0002522902431662052
285
+ },
286
+ {
287
+ "feature": "vaf",
288
+ "perm_importance_mean": 0.010811854209153338,
289
+ "perm_importance_std": 0.000206278576011294
290
+ },
291
+ {
292
+ "feature": "svtype_DUP",
293
+ "perm_importance_mean": 0.010256370264851777,
294
+ "perm_importance_std": 0.00011705190707458093
295
+ },
296
+ {
297
+ "feature": "frac_span_repeat",
298
+ "perm_importance_mean": 0.009963778341341434,
299
+ "perm_importance_std": 0.0003054705160430916
300
+ },
301
+ {
302
+ "feature": "local_depth",
303
+ "perm_importance_mean": 0.009591312749510661,
304
+ "perm_importance_std": 9.987579143240777e-05
305
+ },
306
+ {
307
+ "feature": "gc_max",
308
+ "perm_importance_mean": 0.009472834004097907,
309
+ "perm_importance_std": 3.73573580102502e-05
310
+ },
311
+ {
312
+ "feature": "gq",
313
+ "perm_importance_mean": 0.009203283859545164,
314
+ "perm_importance_std": 0.00010130110270414088
315
+ },
316
+ {
317
+ "feature": "sr_support",
318
+ "perm_importance_mean": 0.008346937601715121,
319
+ "perm_importance_std": 9.766321902716232e-05
320
+ },
321
+ {
322
+ "feature": "microhom_max",
323
+ "perm_importance_mean": 0.0074945231035745685,
324
+ "perm_importance_std": 3.858741211472431e-05
325
+ },
326
+ {
327
+ "feature": "svtype_INS",
328
+ "perm_importance_mean": 0.007191278115888,
329
+ "perm_importance_std": 6.25382140306982e-05
330
+ },
331
+ {
332
+ "feature": "total_support",
333
+ "perm_importance_mean": 0.0071782291459881135,
334
+ "perm_importance_std": 9.233685802608801e-05
335
+ },
336
+ {
337
+ "feature": "is_pass",
338
+ "perm_importance_mean": 0.006096510294509483,
339
+ "perm_importance_std": 0.00018164506103297094
340
+ },
341
+ {
342
+ "feature": "n_neighbors",
343
+ "perm_importance_mean": 0.005750802897597151,
344
+ "perm_importance_std": 5.2407508065749236e-05
345
+ },
346
+ {
347
+ "feature": "in_difficult_both",
348
+ "perm_importance_mean": 0.005015233708107925,
349
+ "perm_importance_std": 0.00018000694143725676
350
+ },
351
+ {
352
+ "feature": "ciend_width",
353
+ "perm_importance_mean": 0.004891217221742616,
354
+ "perm_importance_std": 8.693371224059806e-05
355
+ },
356
+ {
357
+ "feature": "in_tandem_either",
358
+ "perm_importance_mean": 0.0043522978742952965,
359
+ "perm_importance_std": 0.00013096877220331791
360
+ },
361
+ {
362
+ "feature": "gt_hom",
363
+ "perm_importance_mean": 0.00323902471619224,
364
+ "perm_importance_std": 6.580316549205161e-05
365
+ },
366
+ {
367
+ "feature": "in_lowmap_either",
368
+ "perm_importance_mean": 0.002848785493209416,
369
+ "perm_importance_std": 6.035487488575199e-05
370
+ },
371
+ {
372
+ "feature": "in_Alu_either",
373
+ "perm_importance_mean": 0.002534492148327061,
374
+ "perm_importance_std": 8.942851729207139e-05
375
+ },
376
+ {
377
+ "feature": "in_difficult_either",
378
+ "perm_importance_mean": 0.002091988241603948,
379
+ "perm_importance_std": 5.002148003217409e-05
380
+ },
381
+ {
382
+ "feature": "is_imprecise",
383
+ "perm_importance_mean": 0.001979962861476592,
384
+ "perm_importance_std": 6.0060264184685146e-05
385
+ },
386
+ {
387
+ "feature": "in_L1_either",
388
+ "perm_importance_mean": 0.0011150058316057754,
389
+ "perm_importance_std": 2.43935928349277e-05
390
+ },
391
+ {
392
+ "feature": "in_LTR_either",
393
+ "perm_importance_mean": 0.0006375153501665843,
394
+ "perm_importance_std": 3.5907563733425047e-05
395
+ },
396
+ {
397
+ "feature": "in_segdup_either",
398
+ "perm_importance_mean": 0.0006168866779678206,
399
+ "perm_importance_std": 2.936960388188349e-05
400
+ },
401
+ {
402
+ "feature": "in_segdup_both",
403
+ "perm_importance_mean": 0.0005383371585652164,
404
+ "perm_importance_std": 3.300168720103858e-05
405
+ },
406
+ {
407
+ "feature": "in_SVA_either",
408
+ "perm_importance_mean": 0.00017570394306039018,
409
+ "perm_importance_std": 3.211137729753612e-06
410
+ },
411
+ {
412
+ "feature": "svtype_INV",
413
+ "perm_importance_mean": 0.0,
414
+ "perm_importance_std": 0.0
415
+ }
416
+ ]
417
+ },
418
+ "finalized_unix": 1782044045,
419
+ "cv_report": {
420
+ "overall": {
421
+ "n": 2575116,
422
+ "pos_rate": 0.5871215121959554,
423
+ "auroc": 0.9501778843150939,
424
+ "auprc": 0.9511739760033938,
425
+ "brier": 0.07706030782330338,
426
+ "logloss": 0.2628003224452437
427
+ },
428
+ "calibration": [
429
+ {
430
+ "bin": "[0.0,0.1)",
431
+ "n": 701606,
432
+ "mean_pred": 0.018278732155845624,
433
+ "obs_rate": 0.008842569761376044
434
+ },
435
+ {
436
+ "bin": "[0.1,0.2)",
437
+ "n": 77383,
438
+ "mean_pred": 0.1443051591972469,
439
+ "obs_rate": 0.10029334608376517
440
+ },
441
+ {
442
+ "bin": "[0.2,0.3)",
443
+ "n": 53690,
444
+ "mean_pred": 0.24934861469329547,
445
+ "obs_rate": 0.19802570311044887
446
+ },
447
+ {
448
+ "bin": "[0.3,0.4)",
449
+ "n": 57370,
450
+ "mean_pred": 0.35138865912279116,
451
+ "obs_rate": 0.31784905002614605
452
+ },
453
+ {
454
+ "bin": "[0.4,0.5)",
455
+ "n": 70890,
456
+ "mean_pred": 0.4521905700361762,
457
+ "obs_rate": 0.45171392297926366
458
+ },
459
+ {
460
+ "bin": "[0.5,0.6)",
461
+ "n": 98358,
462
+ "mean_pred": 0.5530042682672152,
463
+ "obs_rate": 0.5941153744484434
464
+ },
465
+ {
466
+ "bin": "[0.6,0.7)",
467
+ "n": 158080,
468
+ "mean_pred": 0.6545151929144646,
469
+ "obs_rate": 0.7394863360323887
470
+ },
471
+ {
472
+ "bin": "[0.7,0.8)",
473
+ "n": 264297,
474
+ "mean_pred": 0.7534285409368959,
475
+ "obs_rate": 0.8542737904705691
476
+ },
477
+ {
478
+ "bin": "[0.8,0.9)",
479
+ "n": 407540,
480
+ "mean_pred": 0.8554391726312989,
481
+ "obs_rate": 0.922547480001963
482
+ },
483
+ {
484
+ "bin": "[0.9,1.0)",
485
+ "n": 685902,
486
+ "mean_pred": 0.9410329493878745,
487
+ "obs_rate": 0.962179728299378
488
+ }
489
+ ],
490
+ "per_sample_auroc": {
491
+ "n_samples": 208,
492
+ "median": 0.9812468623755227,
493
+ "p25": 0.9541872315128539,
494
+ "p75": 0.9849148282479339,
495
+ "min": 0.8241255348590714,
496
+ "max": 0.9884727126524584
497
+ },
498
+ "by_svtype": {
499
+ "BND": {
500
+ "n": 545562,
501
+ "pos_rate": 0.05967241120165994,
502
+ "auroc": 0.9578724482896165,
503
+ "auprc": 0.7205291737443336,
504
+ "brier": 0.03197870532515856,
505
+ "logloss": 0.11268068083789429
506
+ },
507
+ "DEL": {
508
+ "n": 1168706,
509
+ "pos_rate": 0.7785747655954535,
510
+ "auroc": 0.8919672267722084,
511
+ "auprc": 0.9557196435784474,
512
+ "brier": 0.0922761808544916,
513
+ "logloss": 0.3149783791403593
514
+ },
515
+ "DUP": {
516
+ "n": 156534,
517
+ "pos_rate": 0.023419832113151136,
518
+ "auroc": 0.9936692829891886,
519
+ "auprc": 0.8545445334945798,
520
+ "brier": 0.009443816981288416,
521
+ "logloss": 0.03587112293286734
522
+ },
523
+ "INS": {
524
+ "n": 704314,
525
+ "pos_rate": 0.8032780833548673,
526
+ "auroc": 0.8549802852596324,
527
+ "auprc": 0.9494948614191011,
528
+ "brier": 0.1017598124373945,
529
+ "logloss": 0.3429363119373415
530
+ }
531
+ },
532
+ "by_size": {
533
+ "1-10kb": {
534
+ "n": 171605,
535
+ "pos_rate": 0.7607121004632732,
536
+ "auroc": 0.9296938425020418,
537
+ "auprc": 0.967200346507,
538
+ "brier": 0.06806650111282099,
539
+ "logloss": 0.2455626050274475
540
+ },
541
+ "10-100kb": {
542
+ "n": 42645,
543
+ "pos_rate": 0.21177160276703014,
544
+ "auroc": 0.9920116885561145,
545
+ "auprc": 0.9676826349150152,
546
+ "brier": 0.029729192191828648,
547
+ "logloss": 0.10986227030572825
548
+ },
549
+ "100bp-1kb": {
550
+ "n": 906987,
551
+ "pos_rate": 0.7867466678133204,
552
+ "auroc": 0.8976865414893258,
553
+ "auprc": 0.9590485952985468,
554
+ "brier": 0.08302330275831495,
555
+ "logloss": 0.28853539649876037
556
+ },
557
+ "<100bp": {
558
+ "n": 733616,
559
+ "pos_rate": 0.7488359032518375,
560
+ "auroc": 0.867906684781433,
561
+ "auprc": 0.93767984701013,
562
+ "brier": 0.11261501240225187,
563
+ "logloss": 0.3687373180284347
564
+ },
565
+ ">100kb": {
566
+ "n": 54796,
567
+ "pos_rate": 0.00839477334111979,
568
+ "auroc": 0.9897200830900804,
569
+ "auprc": 0.788119068363777,
570
+ "brier": 0.004907638846204342,
571
+ "logloss": 0.029204468344063542
572
+ },
573
+ "NA": {
574
+ "n": 665467,
575
+ "pos_rate": 0.16371360262792894,
576
+ "auroc": 0.976441017719166,
577
+ "auprc": 0.9070303608956529,
578
+ "brier": 0.04103092730468184,
579
+ "logloss": 0.1444199784010817
580
+ }
581
+ }
582
+ }
583
+ }
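
The `calibration` bins in `cv_report` can be condensed into one count-weighted calibration gap. The sketch below is illustrative only and is not part of the release: the function name `expected_calibration_error` and the tuple layout are assumptions, and the bin values are copied from the report above, rounded to four decimals. The overall Brier score in this report (0.0771) matches the pre-calibration SV figure quoted in `tier_thresholds.json`, so these bins appear to describe the out-of-fold scores before isotonic calibration.

```python
# (n, mean_pred, obs_rate) for each 0.1-wide score bin, copied from
# cv_report["calibration"] and rounded to four decimals.
CALIBRATION_BINS = [
    (701606, 0.0183, 0.0088),
    (77383, 0.1443, 0.1003),
    (53690, 0.2493, 0.1980),
    (57370, 0.3514, 0.3178),
    (70890, 0.4522, 0.4517),
    (98358, 0.5530, 0.5941),
    (158080, 0.6545, 0.7395),
    (264297, 0.7534, 0.8543),
    (407540, 0.8554, 0.9225),
    (685902, 0.9410, 0.9622),
]


def expected_calibration_error(bins):
    """Count-weighted mean absolute gap between predicted and observed rates."""
    total = sum(n for n, _, _ in bins)
    return sum(n * abs(pred - obs) for n, pred, obs in bins) / total


# The bin counts should add up to cv_report["overall"]["n"].
total_rows = sum(n for n, _, _ in CALIBRATION_BINS)
ece = expected_calibration_error(CALIBRATION_BINS)
print(f"rows={total_rows} ECE={ece:.4f}")
```

The per-bin gaps also show the direction of the miscalibration: below 0.5 the observed concordance rate sits under the mean prediction, and above 0.5 it sits over it, which is the pattern a monotone recalibration such as isotonic regression is suited to correct.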
tier_thresholds.json CHANGED
@@ -1,11 +1,12 @@
1
  {
 
2
  "tiers": {
3
- "HIGH": "CS >= 0.70",
4
- "MODERATE": "0.50 <= CS < 0.70",
5
- "WARNING": "0.30 <= CS < 0.50",
6
- "LOW": "CS < 0.30"
7
  },
8
- "override_sv": "any of {QUAL<20, GQ<15, total_alt_support<=2, vaf_estimate<0.15, is_imprecise==1} -> LOW (see sv_config.json 'rules')",
9
- "override_str": "any of {locus_coverage<20, is_low_depth==1, support_type_a2<=1, total_support_a2<5, ci_width_a2>3, allele_balance<0.3} -> WARNING (see str_config.json 'rules')",
10
- "note": "Only MODERATE-to-HIGH is a monotone precision ladder; LOW/WARNING are not rank-ordered by precision. Use HIGH as a candidate-triage filter."
11
  }
 
1
  {
2
+ "release_version": "1.0",
3
  "tiers": {
4
+ "HIGH": "CS>=0.70",
5
+ "MODERATE": "0.50<=CS<0.70",
6
+ "WARNING": "0.30<=CS<0.50",
7
+ "LOW": "CS<0.30"
8
  },
9
+ "score": "CS = isotonic-calibrated probability of concordance with long-read truth",
10
+ "calibration": "isotonic on OOF; SV Brier 0.0771->0.0744, STR 0.1673->0.1589",
11
+ "note": "Tiers are buckets of the calibrated CS. HIGH is the candidate-triage tier."
12
  }
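
As a minimal sketch of how the tier cut-offs above could be applied, the snippet below maps a calibrated CS to a tier using the `tier_edges` and `tier_names` values from the SV config. The helper `tier_for` is hypothetical and is not part of the released inference code; it only assumes that each lower bound is inclusive, as the `CS>=0.70` and `0.50<=CS<0.70` strings indicate.

```python
from bisect import bisect_right

# Values copied from "tier_edges" and "tier_names" in the SV config.
TIER_EDGES = [0.3, 0.5, 0.7]
TIER_NAMES = ["LOW", "Warning", "Moderate", "High"]


def tier_for(cs, edges=TIER_EDGES, names=TIER_NAMES):
    """Return the tier name for a calibrated confidence score in [0, 1]."""
    if not 0.0 <= cs <= 1.0:  # also rejects NaN, which fails both comparisons
        raise ValueError(f"CS must be in [0, 1], got {cs!r}")
    # bisect_right counts the edges that are <= cs, so a score exactly on an
    # edge falls into the upper tier (for example, 0.70 maps to High).
    return names[bisect_right(edges, cs)]


for score in (0.10, 0.30, 0.65, 0.70):
    print(score, tier_for(score))
```

Note that the config spells the tier names `LOW`, `Warning`, `Moderate`, `High`, while `tier_thresholds.json` uses upper-case keys throughout, so downstream code comparing tier labels may want to normalise case first.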