Tabular Classification
Scikit-learn
Joblib
genomics
structural-variants
short-tandem-repeats
variant-calling
confidence-calibration
random-forest
Release v1.0: HPRC-trained 35/21-feature calibrated SV+STR models
#1
by khyeom - opened
- README.md +66 -93
- example/str_features.tsv +51 -7
- example/str_scored.tsv +51 -7
- example/sv_features.tsv +51 -8
- example/sv_scored.tsv +51 -8
- feature_builder.py +708 -0
- feature_manifest.json +55 -45
- requirements.txt +1 -1
- score_svstr.py +53 -72
- str_locus_lookup.parquet → str_calibrator.joblib +2 -2
- str_config.json +46 -67
- str_model_v13_parents.joblib → str_model.joblib +2 -2
- str_model_meta.json +448 -0
- sv_model_v13_parents.joblib → sv_calibrator.joblib +2 -2
- sv_config.json +57 -52
- sv_model.joblib +3 -0
- sv_model_meta.json +583 -0
- tier_thresholds.json +8 -7
README.md
CHANGED
@@ -1,109 +1,82 @@
 ---
 license: mit
-tags:
-- genomics
-- structural-variants
-- short-tandem-repeats
-- random-forest
-- variant-calling
 library_name: sklearn
 ---

-# SVSTR-Score

-
-and a four-tier operating point to
-short-
-
-
-The long-read genotype is used only to build the training label; **inference needs
-short-read caller output only**.

-
-
-|---|---|
-| `sv_model_v13_parents.joblib` | SV random forest (23 caller-output features; **primary SV model / external-validation headline**) |
-| `str_model_v13_parents.joblib` | STR random forest (25 features) |
-| `seqonly/` | **caller-independent sequence-only SV model (svspr, 11 features) — portable secondary option** for callers whose fields differ from training. `seqonly/model/svspr_v14_seq.pkl` + the `svspr` inference package (VCF + reference → CS/tier; tiers HIGH≥0.9/MOD≥0.7/WARN≥0.5) + example. 11 features = SV length/log-length, SVTYPE one-hot, ±100 bp flank GC/AT/inner-GC, motif-2/3 counts; no caller fields |
-| `str_locus_lookup.parquet` | per-locus historical concordance (`locus_conc_rate`) keyed by (chrom, pos); 163,726 loci |
-| `sv_config.json` / `str_config.json` | feature order, tier thresholds, lookup metadata |
-| `feature_manifest.json` | definition of every SV/STR feature |
-| `tier_thresholds.json` | tier cut-offs + override rule |
-| `score_svstr.py` | inference entry point |
-| `requirements.txt` | pinned dependencies (**scikit-learn==1.5.1**) |
-| `example/` | real example calls from the study cohort (input features + scored output) |

-
-

-##
-```
-
-
-python score_svstr.py --variant sv --model-dir . --features example/sv_features.tsv --out sv_scored.tsv
-# STR (locus_conc_rate / locus_in_lookup are filled here from str_locus_lookup.parquet)
-python score_svstr.py --variant str --model-dir . --features example/str_features.tsv --out str_scored.tsv
-```
-Input is a table extracted from the caller VCF (e.g. with `bcftools query`) holding
-the features in `*_config.json`. For STR, supply the 23 caller-output features plus
-`chrom,pos`; the script joins the catalogue lookup to add `locus_conc_rate` and
-`locus_in_lookup`.

-
-
-
-
-
-historically discordant locus scored LOW, illustrating that `locus_conc_rate`
-dominates the STR score.

-##
-
-|---|---|
-| HIGH | CS ≥ 0.70 |
-| MODERATE | 0.50 ≤ CS < 0.70 |
-| WARNING | 0.30 ≤ CS < 0.50 |
-| LOW | CS < 0.30 |

-
-
-
-
-support_type_a2≤1, total_support_a2<5, ci_width_a2>3, allele_balance<0.3). **Only the
-MODERATE→HIGH range is a monotone precision ladder; LOW and WARNING are not rank-ordered
-by precision.** Use the **HIGH tier as a candidate-triage filter**.

-
-- **Sample-LOSO cross-validation** (143 unrelated parents): SV F1 = **0.9308**, STR F1 = **0.9960**.
-- **Within-cohort generalization** — re-scored on 3,608 unseen short-read genomes without retraining;
-the score distribution and per-1-Mb genomic reliability map reproduce (training↔unseen Spearman ρ = **0.90**).
-- **External multi-sample validation — 194 HPRC genomes** (sequence-aware Truvari labels; long-read truth Sawfish/SV, TRGT/STR):
-- SV (this 23-feature model, the primary external object): per-genome median AUROC **0.927**, above raw
-QUAL (0.741), GQ (0.476) and read-depth (0.671) and the applicable dedicated tools (duphold, Paragraph),
-most clearly for deletions and insertions (AUPRC 0.948, expected calibration error 0.053). Robust across
-the tested callers (median **0.915** on an independent DRAGEN reanalysis). The companion
-**caller-independent sequence-only model** transfers at a lower ceiling (median AUROC **0.851**).
-- STR (this model): per-genome median AUROC **0.912**; in-catalogue 0.909 vs out-of-catalogue 0.688 (catalogue-bound reliability atlas).
-- **GIAB HG002** (single-sample supplementary check): HIGH-tier SV precision **97.7 %** (consensus v0.6).

-
--
-
-
-fields and is untested on structurally different callers (e.g. DELLY, GATK-SV); for those use
-the caller-independent sequence-only model (lower ceiling).
-- **STR scoring is a per-locus reliability atlas.** ~94 % of STR model importance is
-`locus_conc_rate`; removing it drops cross-validated AUROC 0.998 → 0.62. The score
-does **not** extend to STR loci outside the 163,726-locus catalogue (out-of-catalogue
-loci get a conservative score by construction), and only the HIGH tier transfers
-across callers — lower-tier ranking does not.
-- Trained on a single ancestry and pipeline (Illumina DRAGEN + Manta/ExpansionHunter;
-PacBio HiFi + Sawfish/TRGT, GRCh38). Behaviour under other pipelines is untested.
-- `.joblib` is a pickle: load only from trusted sources and with the pinned versions.

-
-
-short-read SV and STR genotypes. (manuscript in preparation.)

-##
-
 ---
 license: mit
 library_name: sklearn
+tags:
+- genomics
+- structural-variants
+- short-tandem-repeats
+- variant-calling
+- confidence-calibration
+- random-forest
+pipeline_tag: tabular-classification
 ---

+# SVSTR-Score (v1.0) — long-read-guided confidence scoring for short-read SV/STR calls

+Two per-class **RandomForest + isotonic calibrator** models that assign a
+**calibrated confidence score** `CS ∈ [0,1]` and a four-tier operating point to
+each short-read **structural-variant (SV)** or **short-tandem-repeat (STR)** call.
+The long-read genotype is used **only to build the training label**, so inference
+needs short-read caller output only.

+`CS = isotonic_calibrator( RandomForest.predict_proba(X)[:, 1] )` = P(the
+short-read call is concordant with the long-read truth).
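
As a rough illustration of the two-stage formula above, the sketch below feeds a classifier's positive-class probability into a calibrator. It assumes both objects follow the scikit-learn interface (`predict_proba` on the forest, `predict` on the calibrator). The function name `confidence_scores` and the `_Stub*` classes are hypothetical stand-ins so the snippet runs without the released weights; they are not part of `score_svstr.py`.

```python
from typing import List, Sequence, Tuple


def confidence_scores(model, calibrator, X: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (CS_raw, CS) for every feature row in X.

    Assumes model.predict_proba(X) yields rows of [P(class 0), P(class 1)]
    and calibrator.predict(scores) maps raw scores to calibrated ones.
    """
    raw = [float(row[1]) for row in model.predict_proba(X)]
    calibrated = [float(v) for v in calibrator.predict(raw)]
    # Clip defensively so CS stays inside [0, 1].
    calibrated = [min(1.0, max(0.0, v)) for v in calibrated]
    return raw, calibrated


class _StubForest:
    """Hypothetical stand-in for the random forest (not the released model)."""

    def predict_proba(self, X):
        # Pretend the first feature already is P(concordant).
        return [[1.0 - row[0], row[0]] for row in X]


class _StubCalibrator:
    """Hypothetical monotone map standing in for the isotonic calibrator."""

    def predict(self, scores):
        return [s ** 0.5 for s in scores]


raw, cs = confidence_scores(_StubForest(), _StubCalibrator(), [[0.25], [0.81]])
print(raw, cs)
```

With the real files, `model` and `calibrator` would presumably be the objects loaded from `sv_model.joblib` and `sv_calibrator.joblib`, which is an assumption about how the weights are packaged.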

+| | SV | STR |
+|---|---|---|
+| weights | `sv_model.joblib` + `sv_calibrator.joblib` | `str_model.joblib` + `str_calibrator.joblib` |
+| features | 35 (`feature_builder.py`) | 21 |
+| short-read caller | Manta | ExpansionHunter |
+| long-read truth (label only) | sawfish | TRGT |
+| training cohort | 208 HPRC paired genomes | 208 HPRC |

+## Tiers
+`HIGH CS≥0.70` · `MODERATE 0.50–0.70` · `WARNING 0.30–0.50` · `LOW <0.30`.
+Tiers are buckets of the calibrated CS (no heuristic overrides). **HIGH** is the
+candidate-triage tier.
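
The tier rule above can be sketched as a small bucketing function. This is a minimal illustration, not the code in `score_svstr.py`. It assumes each lower bound is inclusive, which matches `HIGH CS≥0.70` and `LOW <0.30`; the behaviour at exactly 0.50 is an assumption. The name `assign_tier` is hypothetical.

```python
def assign_tier(cs: float) -> str:
    """Map a calibrated confidence score in [0, 1] to one of four tiers."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs!r}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:  # assumed inclusive lower bound
        return "MODERATE"
    if cs >= 0.30:
        return "WARNING"
    return "LOW"


# CS values copied from rows of example/str_scored.tsv in this release.
for cs in (0.8546, 0.5696, 0.3967, 0.1673):
    print(cs, assign_tier(cs))
```

The four values shown are labelled HIGH, MODERATE, WARNING and LOW in the example file, which is consistent with this reading of the thresholds.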

+## Intended use
+Triage / down-weight *emitted* short-read SV & STR calls by their probability of
+matching long-read truth, **without long-read sequencing every sample**. Not a
+variant caller; does not recover missed variants; does not replace long-read
+sequencing for complete discovery.

+## Performance
+**Internal — 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes**

+| | AUROC | AUPRC | per-sample AUROC median |
+|---|---|---|---|
+| SV | 0.950 | 0.951 | 0.981 |
+| STR | 0.834 | 0.886 | 0.835 |

+**External — 295 ASD genomes, applied unchanged (no retraining)**

+| | AUROC | observed concordance LOW / WARN / MOD / HIGH |
+|---|---|---|
+| SV | 0.891 | 12% / 35% / 50% / **86%** |
+| STR | 0.831 | 16% / 40% / 60% / **85%** |
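
For readers who want to reproduce a per-tier concordance column on their own labelled calls, the sketch below shows one way to compute it. It assumes each call carries a tier and a boolean long-read concordance label. Those labels are not part of the example files, so the toy input is made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

TIER_ORDER = ("LOW", "WARNING", "MODERATE", "HIGH")


def observed_concordance(calls: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of concordant calls per tier; tiers with no calls are omitted."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for tier, concordant in calls:
        totals[tier] += 1
        hits[tier] += int(concordant)
    return {t: hits[t] / totals[t] for t in TIER_ORDER if totals[t] > 0}


def is_monotone(rates: Dict[str, float]) -> bool:
    """True if concordance never decreases from LOW up to HIGH."""
    ordered = [rates[t] for t in TIER_ORDER if t in rates]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Toy labels, invented for illustration (not study data).
toy = [("LOW", False), ("LOW", False), ("WARNING", True),
       ("WARNING", False), ("HIGH", True), ("HIGH", True)]
toy_rates = observed_concordance(toy)
print(toy_rates)

# SV rates as reported in the external table.
reported_sv = {"LOW": 0.12, "WARNING": 0.35, "MODERATE": 0.50, "HIGH": 0.86}
print(is_monotone(reported_sv))
```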

+Calibrated tiers stay monotone and meaningful on the external cohort; isotonic
+calibration improves Brier (SV 0.077→0.074, STR 0.167→0.159).
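
The Brier score quoted here is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. A minimal sketch of that definition follows. The inputs are toy values chosen for illustration, and `brier_score` is a hypothetical helper rather than a function from the repository.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of (p - y)^2 over all calls, with y in {0, 1}."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Toy example: three calls, two concordant (1) and one discordant (0).
toy_brier = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
print(round(toy_brier, 4))
```

Comparing this quantity on `CS_raw` and on `CS` against the same labels is how a before/after calibration comparison of this kind would typically be made.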

+## Usage
+```bash
+pip install -U huggingface_hub scikit-learn==1.7.1
+hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
+# features from a single-sample VCF, then score:
+python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
+--fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
+python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
+```
+Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score
+
+## Limitations
+- Relies on **caller support / breakpoint-confidence fields** (PR/SR, CIPOS/CIEND,
+VAF, GQ, depth). On **merged or heavily filtered call sets that drop these**,
+scores deflate and tiers are unreliable (rank-discrimination degrades only
+moderately, but calibration breaks).
+- Strongest for **DEL/INS**; DUP/BND are mostly down-weighted/triaged.
+- STR scoring is **bound to the caller's genotyped locus set**.
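
Because the first limitation hinges on caller-support fields being present, a quick pre-flight check on the feature table can catch a stripped call set before scoring. The sketch below is only an illustration. The mapping from the fields named above (PR/SR, CIPOS/CIEND, VAF, GQ, depth) to the column names in `SUPPORT_COLUMNS` is an assumption based on the header of `example/sv_features.tsv`, and `missing_support_columns` is a hypothetical helper.

```python
import csv
import io
from typing import List, Sequence

# Assumed column names for the support / breakpoint-confidence fields.
SUPPORT_COLUMNS = (
    "pe_support", "sr_support", "cipos_width", "ciend_width",
    "vaf", "gq", "local_depth",
)


def missing_support_columns(tsv_text: str,
                            required: Sequence[str] = SUPPORT_COLUMNS) -> List[str]:
    """Return the required columns that are absent from the TSV header."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"), [])
    return [name for name in required if name not in header]


# A deliberately incomplete header, as a merged call set might produce.
demo = "sample\tcaller\tvariant_ID\tpe_support\tsr_support\tvaf\tgq\n"
missing = missing_support_columns(demo)
print(missing)
```

A header-only check cannot detect columns that exist but were filled with defaults, so it is a necessary rather than sufficient test.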
+
+## Citation
+Kim W\*, Yeom K\*, et al. *SVSTR-Score* (manuscript in preparation). Licence: MIT.
example/str_features.tsv
CHANGED
@@ -1,7 +1,51 @@
sample caller variant_ID is_pass motif_len ref_copynum gt_repcn_max gt_repcn_min expansion_over_ref repci_width_max spanning_reads flanking_reads inrepeat_reads locus_depth gt_hom ref_tract_bp spanning_frac allele_vs_readlen motif_is_homopolymer gc_flank entropy_flank in_segdup in_difficult flank_lowmap
HG00097 expansionhunter chr1:165954:165962 1 4.0 2.0 2.0 1.0 0.0 0.0 25.0 2.0 0.0 34.5405 0.0 8.0 0.9259259259259259 0.05333333333333334 0 0.297029702970297 1.8767240669197331 1 1 1
HG00097 expansionhunter chr1:370632:370648 1 4.0 4.0 4.0 3.0 0.0 1.0 30.0 14.0 0.0 40.3784 0.0 16.0 0.6818181818181818 0.10666666666666667 0 0.37623762376237624 1.7475795513453003 1 1 1
HG00097 expansionhunter chr1:832736:832781 1 5.0 9.0 10.0 10.0 1.0 0.0 44.0 104.0 0.0 42.8919 1.0 45.0 0.2972972972972973 0.3333333333333333 0 0.36633663366336633 1.7496871374624745 1 1 0
HG00097 expansionhunter chr1:932613:932621 1 4.0 2.0 1.0 1.0 -1.0 0.0 88.0 4.0 0.0 49.3784 1.0 8.0 0.9565217391304348 0.02666666666666667 0 0.6138613861386139 1.93240710844821 0 1 0
HG00097 expansionhunter chr1:1010481:1010497 1 4.0 4.0 3.0 3.0 -1.0 0.0 62.0 20.0 0.0 35.1081 1.0 16.0 0.7560975609756098 0.08 0 0.13861386138613863 1.5081706750248076 0 1 0
HG00097 expansionhunter chr1:1052514:1052528 1 2.0 7.0 8.0 7.0 1.0 0.0 40.0 4.0 0.0 46.3784 0.0 14.0 0.9090909090909091 0.10666666666666667 0 0.46534653465346537 1.8443047122008371 0 1 0
HG00097 expansionhunter chr1:1063661:1063681 1 2.0 10.0 10.0 9.0 0.0 0.0 39.0 26.0 0.0 51.0811 0.0 20.0 0.6 0.13333333333333333 0 0.5643564356435643 1.7595456304802637 0 1 0
HG00097 expansionhunter chr1:1265603:1265649 1 2.0 23.0 24.0 23.0 1.0 0.0 19.0 44.0 0.0 38.3514 0.0 46.0 0.30158730158730157 0.32 0 0.48514851485148514 1.4319496839589465 0 1 0
HG00097 expansionhunter chr1:1431026:1431058 1 8.0 4.0 5.0 4.0 1.0 0.0 25.0 65.0 0.0 53.2703 0.0 32.0 0.2777777777777778 0.26666666666666666 0 0.7128712871287128 1.8074915272734042 0 1 0
HG00097 expansionhunter chr1:1585949:1585976 1 3.0 9.0 16.0 15.0 7.0 0.0 23.0 57.0 0.0 44.4324 0.0 27.0 0.2875 0.32 0 0.33663366336633666 1.870380810485376 0 1 0
HG00097 expansionhunter chr1:1653825:1653863 1 2.0 19.0 21.0 21.0 2.0 1.0 22.0 28.0 0.0 34.8649 1.0 38.0 0.44 0.28 0 0.49504950495049505 1.861174864442864 1 1 0
HG00097 expansionhunter chr1:1752908:1752935 1 3.0 9.0 10.0 9.0 1.0 0.0 39.0 31.0 0.0 49.4595 0.0 27.0 0.5571428571428572 0.2 0 0.6435643564356436 1.8970057796754802 0 1 0
HG00097 expansionhunter chr1:1762759:1762791 1 4.0 8.0 9.0 8.0 1.0 0.0 38.0 32.0 0.0 40.5405 0.0 32.0 0.5428571428571428 0.24 0 0.38613861386138615 1.9107593839500945 0 1 0
HG00097 expansionhunter chr1:1769969:1770011 1 6.0 7.0 5.0 5.0 -2.0 0.0 36.0 38.0 0.0 44.7568 1.0 42.0 0.4864864864864865 0.2 0 0.32673267326732675 1.680234216638318 0 1 0
HG00097 expansionhunter chr1:1776461:1776473 1 4.0 3.0 2.0 2.0 -1.0 0.0 74.0 12.0 0.0 38.7568 1.0 12.0 0.8604651162790697 0.05333333333333334 0 0.45544554455445546 1.8897699414887892 0 0 0
HG00097 expansionhunter chr1:1812407:1812429 1 2.0 11.0 12.0 12.0 1.0 0.0 64.0 20.0 0.0 39.6486 1.0 22.0 0.7619047619047619 0.16 0 0.40594059405940597 1.8347498835870788 0 1 0
HG00097 expansionhunter chr1:1845824:1845872 1 2.0 24.0 23.0 20.0 -1.0 0.0 37.0 34.0 0.0 43.3784 0.0 48.0 0.5211267605633803 0.30666666666666664 0 0.40594059405940597 1.8265902247349062 0 1 0
HG00097 expansionhunter chr1:1891654:1891658 1 2.0 2.0 4.0 4.0 2.0 0.0 60.0 4.0 0.0 40.7027 1.0 4.0 0.9375 0.05333333333333334 0 0.7128712871287128 1.854785288155074 0 1 0
HG00097 expansionhunter chr1:1904424:1904448 1 4.0 6.0 8.0 8.0 2.0 0.0 44.0 32.0 0.0 39.2432 1.0 24.0 0.5789473684210527 0.21333333333333335 0 0.3069306930693069 1.7889467525708116 0 1 0
HG00097 expansionhunter chr1:1948412:1948428 1 2.0 8.0 10.0 10.0 2.0 0.0 78.0 20.0 0.0 37.7838 1.0 16.0 0.7959183673469388 0.13333333333333333 0 0.33663366336633666 1.908726104503474 0 1 0
HG00097 expansionhunter chr1:2003928:2003940 1 6.0 2.0 3.0 2.0 1.0 0.0 51.0 19.0 0.0 45.8919 0.0 12.0 0.7285714285714285 0.12 0 0.7920792079207921 1.7281983769914673 0 1 0
HG00097 expansionhunter chr1:2012279:2012309 1 5.0 6.0 8.0 6.0 2.0 0.0 23.0 59.0 0.0 40.8649 0.0 30.0 0.2804878048780488 0.26666666666666666 0 0.2871287128712871 1.8125908839274099 0 1 0
HG00097 expansionhunter chr1:2018334:2018389 1 5.0 11.0 11.0 10.0 0.0 0.0 29.0 43.0 0.0 34.7838 0.0 55.0 0.4027777777777778 0.36666666666666664 0 0.36633663366336633 1.8582276266039703 0 1 0
HG00097 expansionhunter chr1:2112675:2112691 1 4.0 4.0 5.0 5.0 1.0 0.0 68.0 26.0 0.0 48.4054 1.0 16.0 0.723404255319149 0.13333333333333333 0 0.37623762376237624 1.7586128527134226 0 1 0
HG00097 expansionhunter chr1:2173095:2173103 1 4.0 2.0 2.0 1.0 0.0 0.0 44.0 8.0 0.0 41.6757 0.0 8.0 0.8461538461538461 0.05333333333333334 0 0.504950495049505 1.9625707732852478 0 0 0
HG00097 expansionhunter chr1:2207909:2207955 1 2.0 23.0 25.0 20.0 2.0 0.0 25.0 65.0 0.0 39.8919 0.0 46.0 0.2777777777777778 0.3333333333333333 0 0.5841584158415841 1.7558294024999062 0 1 0
HG00097 expansionhunter chr1:2219851:2219871 1 5.0 4.0 4.0 3.0 0.0 0.0 21.0 9.0 0.0 41.7568 0.0 20.0 0.7 0.13333333333333333 0 0.4752475247524752 1.8595044600250858 0 1 0
HG00097 expansionhunter chr1:2345829:2345855 1 2.0 13.0 14.0 10.0 1.0 0.0 24.0 25.0 0.0 47.1892 0.0 26.0 0.4897959183673469 0.18666666666666668 0 0.5247524752475248 1.9180855002676722 0 1 0
HG00097 expansionhunter chr1:2371211:2371241 1 3.0 10.0 13.0 10.0 3.0 0.0 27.0 40.0 0.0 50.1892 0.0 30.0 0.40298507462686567 0.26 0 0.5148514851485149 1.8305650137637772 0 1 0
HG00097 expansionhunter chr1:2431330:2431346 1 2.0 8.0 9.0 8.0 1.0 0.0 41.0 32.0 0.0 48.6486 0.0 16.0 0.5616438356164384 0.12 0 0.5841584158415841 1.9166005274309357 0 1 0
HG00097 expansionhunter chr1:2435454:2435489 1 7.0 5.0 5.0 4.0 0.0 0.0 39.0 30.0 0.0 39.0 0.0 35.0 0.5652173913043478 0.23333333333333334 0 0.5742574257425742 1.9335830591930787 0 1 0
HG00097 expansionhunter chr1:2449499:2449535 1 4.0 9.0 9.0 8.0 0.0 0.0 17.0 48.0 0.0 40.9459 0.0 36.0 0.26153846153846155 0.24 0 0.5247524752475248 1.8838136852474037 0 1 0
HG00097 expansionhunter chr1:2508612:2508630 1 2.0 9.0 11.0 9.0 2.0 0.0 35.0 18.0 0.0 49.0541 0.0 18.0 0.660377358490566 0.14666666666666667 0 0.3564356435643564 1.933754201968858 0 1 0
HG00097 expansionhunter chr1:2566825:2566869 1 4.0 11.0 12.0 11.0 1.0 0.0 19.0 124.0 0.0 47.8378 0.0 44.0 0.13286713286713286 0.32 0 0.4158415841584158 1.9433008996133987 0 1 0
HG00097 expansionhunter chr1:2580534:2580555 1 3.0 7.0 8.0 8.0 1.0 0.0 42.0 32.0 0.0 33.8108 1.0 21.0 0.5675675675675675 0.16 0 0.37623762376237624 1.9016529508428732 0 1 0
HG00097 expansionhunter chr1:2600684:2600708 1 4.0 6.0 6.0 3.0 0.0 0.0 35.0 17.0 0.0 37.9459 0.0 24.0 0.6730769230769231 0.16 0 0.44554455445544555 1.8855340778294845 0 1 0
HG00097 expansionhunter chr1:2782508:2782518 1 2.0 5.0 6.0 6.0 1.0 0.0 80.0 8.0 0.0 40.2162 1.0 10.0 0.9090909090909091 0.08 0 0.31683168316831684 1.539579197931996 0 1 0
HG00097 expansionhunter chr1:2784990:2785010 1 10.0 2.0 4.0 3.0 2.0 0.0 33.0 28.0 0.0 40.1351 0.0 20.0 0.5409836065573771 0.26666666666666666 0 0.693069306930693 1.849024777034605 0 1 0
HG00097 expansionhunter chr1:2824891:2824941 1 2.0 25.0 23.0 22.0 -2.0 0.0 20.0 28.0 0.0 41.1081 0.0 50.0 0.4166666666666667 0.30666666666666664 0 0.5148514851485149 1.736792200782953 0 1 0
HG00097 expansionhunter chr1:2847577:2847593 1 8.0 2.0 2.0 1.0 0.0 0.0 45.0 4.0 0.0 41.3514 0.0 16.0 0.9183673469387755 0.10666666666666667 0 0.6039603960396039 1.893728291589638 0 1 0
HG00097 expansionhunter chr1:2899043:2899075 1 4.0 8.0 7.0 7.0 -1.0 0.0 54.0 32.0 0.0 41.5946 1.0 32.0 0.627906976744186 0.18666666666666668 0 0.44554455445544555 1.7976208220041983 0 1 0
HG00097 expansionhunter chr1:2952909:2952927 1 2.0 9.0 12.0 12.0 3.0 0.0 60.0 22.0 0.0 36.7297 1.0 18.0 0.7317073170731707 0.16 0 0.6138613861386139 1.4586949056098804 0 1 0
HG00097 expansionhunter chr1:3089915:3089939 1 4.0 6.0 8.0 8.0 2.0 0.0 72.0 34.0 0.0 49.6216 1.0 24.0 0.6792452830188679 0.21333333333333335 0 0.44554455445544555 1.9438900377229549 0 1 0
HG00097 expansionhunter chr1:3109083:3109116 1 3.0 11.0 11.0 9.0 0.0 0.0 24.0 49.0 0.0 41.3514 0.0 33.0 0.3287671232876712 0.22 0 0.32673267326732675 1.8491066059426058 0 1 0
HG00097 expansionhunter chr1:3152348:3152380 1 4.0 8.0 6.0 6.0 -2.0 0.0 58.0 10.0 0.0 44.9189 1.0 32.0 0.8529411764705882 0.16 0 0.43564356435643564 1.6103551420752904 0 1 0
HG00097 expansionhunter chr1:3175983:3176011 1 4.0 7.0 8.0 8.0 1.0 0.0 50.0 42.0 0.0 45.2432 1.0 28.0 0.5434782608695652 0.21333333333333335 0 0.5643564356435643 1.8171755888564571 0 1 0
HG00097 expansionhunter chr1:3225530:3225574 1 2.0 22.0 21.0 17.0 -1.0 0.0 20.0 40.0 0.0 42.973 0.0 44.0 0.3333333333333333 0.28 0 0.5742574257425742 1.7700475389123633 0 1 0
HG00097 expansionhunter chr1:3240056:3240106 1 5.0 10.0 5.0 4.0 -5.0 0.0 31.0 16.0 0.0 39.6486 0.0 50.0 0.6595744680851063 0.16666666666666666 0 0.5544554455445545 1.0599256723036075 0 1 0
HG00097 expansionhunter chr1:3273438:3273450 1 2.0 6.0 4.0 4.0 -2.0 0.0 64.0 12.0 0.0 36.7297 1.0 12.0 0.8421052631578947 0.05333333333333334 0 0.5247524752475248 1.856780293719988 0 1 0
HG00097 expansionhunter chr1:3273817:3273851 1 2.0 17.0 17.0 15.0 0.0 0.0 25.0 34.0 0.0 36.973 0.0 34.0 0.423728813559322 0.22666666666666666 0 0.4752475247524752 1.78618987974027 0 1 0
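
Several columns in this table appear to be simple functions of the raw ExpansionHunter fields. The sketch below reproduces five of them. The formulas are inferred from the example rows, not taken from `feature_builder.py`, so treat them as assumptions. In particular, the 150 bp read length is a guess that happens to match `allele_vs_readlen` in the rows checked, and the zero-read fallback is arbitrary. `derive_str_features` is a hypothetical name.

```python
from typing import Dict

READ_LENGTH_BP = 150.0  # assumption inferred from the example rows


def derive_str_features(motif_len: float, ref_copynum: float,
                        gt_repcn_max: float, gt_repcn_min: float,
                        spanning_reads: float, flanking_reads: float,
                        inrepeat_reads: float) -> Dict[str, float]:
    """Recompute derived STR columns from raw per-locus fields (inferred rules)."""
    total_reads = spanning_reads + flanking_reads + inrepeat_reads
    return {
        # Longest genotyped allele relative to the reference copy number.
        "expansion_over_ref": gt_repcn_max - ref_copynum,
        # 1.0 when both alleles carry the same repeat count.
        "gt_hom": 1.0 if gt_repcn_max == gt_repcn_min else 0.0,
        # Reference tract length in base pairs.
        "ref_tract_bp": motif_len * ref_copynum,
        # Share of supporting reads that span the whole repeat
        # (0.0 when no reads are reported; this fallback is an assumption).
        "spanning_frac": spanning_reads / total_reads if total_reads else 0.0,
        # Longest allele length as a fraction of the assumed read length.
        "allele_vs_readlen": motif_len * gt_repcn_max / READ_LENGTH_BP,
    }


# Raw fields of the first example row (chr1:165954:165962).
first_row = derive_str_features(4.0, 2.0, 2.0, 1.0, 25.0, 2.0, 0.0)
print(first_row)
```

For that first row the table lists `expansion_over_ref` 0.0, `gt_hom` 0.0, `ref_tract_bp` 8.0, `spanning_frac` 0.9259... and `allele_vs_readlen` 0.0533..., which the sketch matches.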
example/str_scored.tsv
CHANGED
@@ -1,7 +1,51 @@
sample caller variant_ID is_pass motif_len ref_copynum gt_repcn_max gt_repcn_min expansion_over_ref repci_width_max spanning_reads flanking_reads inrepeat_reads locus_depth gt_hom ref_tract_bp spanning_frac allele_vs_readlen motif_is_homopolymer gc_flank entropy_flank in_segdup in_difficult flank_lowmap CS_raw CS tier
HG00097 expansionhunter chr1:165954:165962 1 4.0 2.0 2.0 1.0 0.0 0.0 25.0 2.0 0.0 34.5405 0.0 8.0 0.925925925925926 0.0533333333333333 0 0.297029702970297 1.8767240669197327 1 1 1 0.7378 0.8546 HIGH
HG00097 expansionhunter chr1:370632:370648 1 4.0 4.0 4.0 3.0 0.0 1.0 30.0 14.0 0.0 40.3784 0.0 16.0 0.6818181818181818 0.1066666666666666 0 0.3762376237623762 1.7475795513453003 1 1 1 0.3337 0.3967 WARNING
HG00097 expansionhunter chr1:832736:832781 1 5.0 9.0 10.0 10.0 1.0 0.0 44.0 104.0 0.0 42.8919 1.0 45.0 0.2972972972972973 0.3333333333333333 0 0.3663366336633663 1.7496871374624745 1 1 0 0.9082 0.9876 HIGH
HG00097 expansionhunter chr1:932613:932621 1 4.0 2.0 1.0 1.0 -1.0 0.0 88.0 4.0 0.0 49.3784 1.0 8.0 0.9565217391304348 0.0266666666666666 0 0.6138613861386139 1.93240710844821 0 1 0 0.4569 0.5696 MODERATE
HG00097 expansionhunter chr1:1010481:1010497 1 4.0 4.0 3.0 3.0 -1.0 0.0 62.0 20.0 0.0 35.1081 1.0 16.0 0.7560975609756098 0.08 0 0.1386138613861386 1.5081706750248076 0 1 0 0.1697 0.1673 LOW
HG00097 expansionhunter chr1:1052514:1052528 1 2.0 7.0 8.0 7.0 1.0 0.0 40.0 4.0 0.0 46.3784 0.0 14.0 0.9090909090909092 0.1066666666666666 0 0.4653465346534653 1.8443047122008367 0 1 0 0.1755 0.1752 LOW
HG00097 expansionhunter chr1:1063661:1063681 1 2.0 10.0 10.0 9.0 0.0 0.0 39.0 26.0 0.0 51.0811 0.0 20.0 0.6 0.1333333333333333 0 0.5643564356435643 1.7595456304802637 0 1 0 0.0987 0.0615 LOW
HG00097 expansionhunter chr1:1265603:1265649 1 2.0 23.0 24.0 23.0 1.0 0.0 19.0 44.0 0.0 38.3514 0.0 46.0 0.3015873015873015 0.32 0 0.4851485148514851 1.4319496839589465 0 1 0 0.1248 0.0944 LOW
HG00097 expansionhunter chr1:1431026:1431058 1 8.0 4.0 5.0 4.0 1.0 0.0 25.0 65.0 0.0 53.2703 0.0 32.0 0.2777777777777778 0.2666666666666666 0 0.7128712871287128 1.807491527273404 0 1 0 0.6394 0.7505 HIGH
HG00097 expansionhunter chr1:1585949:1585976 1 3.0 9.0 16.0 15.0 7.0 0.0 23.0 57.0 0.0 44.4324 0.0 27.0 0.2875 0.32 0 0.3366336633663366 1.870380810485376 0 1 0 0.8984 0.9819 HIGH
HG00097 expansionhunter chr1:1653825:1653863 1 2.0 19.0 21.0 21.0 2.0 1.0 22.0 28.0 0.0 34.8649 1.0 38.0 0.44 0.28 0 0.495049504950495 1.861174864442864 1 1 0 0.6052 0.7199 HIGH
HG00097 expansionhunter chr1:1752908:1752935 1 3.0 9.0 10.0 9.0 1.0 0.0 39.0 31.0 0.0 49.4595 0.0 27.0 0.5571428571428572 0.2 0 0.6435643564356436 1.89700577967548 0 1 0 0.7839 0.8853 HIGH
HG00097 expansionhunter chr1:1762759:1762791 1 4.0 8.0 9.0 8.0 1.0 0.0 38.0 32.0 0.0 40.5405 0.0 32.0 0.5428571428571428 0.24 0 0.3861386138613861 1.9107593839500945 0 1 0 0.9107 0.988 HIGH
HG00097 expansionhunter chr1:1769969:1770011 1 6.0 7.0 5.0 5.0 -2.0 0.0 36.0 38.0 0.0 44.7568 1.0 42.0 0.4864864864864865 0.2 0 0.3267326732673267 1.680234216638318 0 1 0 0.8992 0.9829 HIGH
HG00097 expansionhunter chr1:1776461:1776473 1 4.0 3.0 2.0 2.0 -1.0 0.0 74.0 12.0 0.0 38.7568 1.0 12.0 0.8604651162790697 0.0533333333333333 0 0.4554455445544554 1.8897699414887887 0 0 0 0.8386 0.9299 HIGH
HG00097 expansionhunter chr1:1812407:1812429 1 2.0 11.0 12.0 12.0 1.0 0.0 64.0 20.0 0.0 39.6486 1.0 22.0 0.7619047619047619 0.16 0 0.4059405940594059 1.8347498835870788 0 1 0 0.2376 0.2636 LOW
HG00097 expansionhunter chr1:1845824:1845872 1 2.0 24.0 23.0 20.0 -1.0 0.0 37.0 34.0 0.0 43.3784 0.0 48.0 0.5211267605633803 0.3066666666666666 0 0.4059405940594059 1.8265902247349064 0 1 0 0.7847 0.8853 HIGH
HG00097 expansionhunter chr1:1891654:1891658 1 2.0 2.0 4.0 4.0 2.0 0.0 60.0 4.0 0.0 40.7027 1.0 4.0 0.9375 0.0533333333333333 0 0.7128712871287128 1.854785288155074 0 1 0 0.1098 0.0718 LOW
HG00097 expansionhunter chr1:1904424:1904448 1 4.0 6.0 8.0 8.0 2.0 0.0 44.0 32.0 0.0 39.2432 1.0 24.0 0.5789473684210527 0.2133333333333333 0 0.3069306930693069 1.7889467525708116 0 1 0 0.6801 0.8002 HIGH
HG00097 expansionhunter chr1:1948412:1948428 1 2.0 8.0 10.0 10.0 2.0 0.0 78.0 20.0 0.0 37.7838 1.0 16.0 0.7959183673469388 0.1333333333333333 0 0.3366336633663366 1.908726104503474 0 1 0 0.3487 0.4133 WARNING
HG00097 expansionhunter chr1:2003928:2003940 1 6.0 2.0 3.0 2.0 1.0 0.0 51.0 19.0 0.0 45.8919 0.0 12.0 0.7285714285714285 0.12 0 0.7920792079207921 1.7281983769914673 0 1 0 0.7141 0.8336 HIGH
HG00097 expansionhunter chr1:2012279:2012309 1 5.0 6.0 8.0 6.0 2.0 0.0 23.0 59.0 0.0 40.8649 0.0 30.0 0.2804878048780488 0.2666666666666666 0 0.2871287128712871 1.8125908839274096 0 1 0 0.9037 0.985 HIGH
HG00097 expansionhunter chr1:2018334:2018389 1 5.0 11.0 11.0 10.0 0.0 0.0 29.0 43.0 0.0 34.7838 0.0 55.0 0.4027777777777778 0.3666666666666666 0 0.3663366336633663 1.8582276266039703 0 1 0 0.8468 0.9378 HIGH
HG00097 expansionhunter chr1:2112675:2112691 1 4.0 4.0 5.0 5.0 1.0 0.0 68.0 26.0 0.0 48.4054 1.0 16.0 0.723404255319149 0.1333333333333333 0 0.3762376237623762 1.7586128527134226 0 1 0 0.371 0.4441 WARNING
HG00097 expansionhunter chr1:2173095:2173103 1 4.0 2.0 2.0 1.0 0.0 0.0 44.0 8.0 0.0 41.6757 0.0 8.0 0.8461538461538461 0.0533333333333333 0 0.504950495049505 1.962570773285248 0 0 0 0.6207 0.7324 HIGH
HG00097 expansionhunter chr1:2207909:2207955 1 2.0 23.0 25.0 20.0 2.0 0.0 25.0 65.0 0.0 39.8919 0.0 46.0 0.2777777777777778 0.3333333333333333 0 0.5841584158415841 1.7558294024999062 0 1 0 0.7201 0.8388 HIGH
HG00097 expansionhunter chr1:2219851:2219871 1 5.0 4.0 4.0 3.0 0.0 0.0 21.0 9.0 0.0 41.7568 0.0 20.0 0.7 0.1333333333333333 0 0.4752475247524752 1.8595044600250856 0 1 0 0.7221 0.8402 HIGH
HG00097 expansionhunter chr1:2345829:2345855 1 2.0 13.0 14.0 10.0 1.0 0.0 24.0 25.0 0.0 47.1892 0.0 26.0 0.4897959183673469 0.1866666666666666 0 0.5247524752475248 1.918085500267672 0 1 0 0.3912 0.4822 WARNING
HG00097 expansionhunter chr1:2371211:2371241 1 3.0 10.0 13.0 10.0 3.0 0.0 27.0 40.0 0.0 50.1892 0.0 30.0 0.4029850746268656 0.26 0 0.5148514851485149 1.8305650137637768 0 1 0 0.796 0.8924 HIGH
HG00097 expansionhunter chr1:2431330:2431346 1 2.0 8.0 9.0 8.0 1.0 0.0 41.0 32.0 0.0 48.6486 0.0 16.0 0.5616438356164384 0.12 0 0.5841584158415841 1.916600527430936 0 1 0 0.2435 0.2759 LOW
HG00097 expansionhunter chr1:2435454:2435489 1 7.0 5.0 5.0 4.0 0.0 0.0 39.0 30.0 0.0 39.0 0.0 35.0 0.5652173913043478 0.2333333333333333 0 0.5742574257425742 1.9335830591930787 0 1 0 0.7708 0.8778 HIGH
HG00097 expansionhunter chr1:2449499:2449535 1 4.0 9.0 9.0 8.0 0.0 0.0 17.0 48.0 0.0 40.9459 0.0 36.0 0.2615384615384615 0.24 0 0.5247524752475248 1.883813685247404 0 1 0 0.8208 0.9087 HIGH
HG00097 expansionhunter chr1:2508612:2508630 1 2.0 9.0 11.0 9.0 2.0 0.0 35.0 18.0 0.0 49.0541 0.0 18.0 0.660377358490566 0.1466666666666666 0 0.3564356435643564 1.933754201968858 0 1 0 0.6421 0.7584 HIGH
HG00097 expansionhunter chr1:2566825:2566869 1 4.0 11.0 12.0 11.0 1.0 0.0 19.0 124.0 0.0 47.8378 0.0 44.0 0.1328671328671328 0.32 0 0.4158415841584158 1.9433008996133987 0 1 0 0.7715 0.8778 HIGH
HG00097 expansionhunter chr1:2580534:2580555 1 3.0 7.0 8.0 8.0 1.0 0.0 42.0 32.0 0.0 33.8108 1.0 21.0 0.5675675675675675 0.16 0 0.3762376237623762 1.9016529508428728 0 1 0 0.8111 0.9036 HIGH
HG00097 expansionhunter chr1:2600684:2600708 1 4.0 6.0 6.0 3.0 0.0 0.0 35.0 17.0 0.0 37.9459 0.0 24.0 0.6730769230769231 0.16 0 0.4455445544554455 1.8855340778294845 0 1 0 0.8571 0.9491 HIGH
HG00097 expansionhunter chr1:2782508:2782518 1 2.0 5.0 6.0 6.0 1.0 0.0 80.0 8.0 0.0 40.2162 1.0 10.0 0.9090909090909092 0.08 0 0.3168316831683168 1.539579197931996 0 1 0 0.0884 0.052 LOW
HG00097 expansionhunter chr1:2784990:2785010 1 10.0 2.0 4.0 3.0 2.0 0.0 33.0 28.0 0.0 40.1351 0.0 20.0 0.5409836065573771 0.2666666666666666 0 0.693069306930693 1.849024777034605 0 1 0 0.4897 0.5908 MODERATE
HG00097 expansionhunter chr1:2824891:2824941 1 2.0 25.0 23.0 22.0 -2.0 0.0 20.0 28.0 0.0 41.1081 0.0 50.0 0.4166666666666667 0.3066666666666666 0 0.5148514851485149 1.736792200782953 0 1 0 0.6974 0.8162 HIGH
HG00097 expansionhunter chr1:2847577:2847593 1 8.0 2.0 2.0 1.0 0.0 0.0 45.0 4.0 0.0 41.3514 0.0 16.0 0.9183673469387756 0.1066666666666666 0 0.6039603960396039 1.893728291589638 0 1 0 0.523 0.6287 MODERATE
HG00097 expansionhunter chr1:2899043:2899075 1 4.0 8.0 7.0 7.0 -1.0 0.0 54.0 32.0 0.0 41.5946 1.0 32.0 0.627906976744186 0.1866666666666666 0 0.4455445544554455 1.7976208220041985 0 1 0 0.8667 0.9568 HIGH
HG00097 expansionhunter chr1:2952909:2952927 1 2.0 9.0 12.0 12.0 3.0 0.0 60.0 22.0 0.0 36.7297 1.0 18.0 0.7317073170731707 0.16 0 0.6138613861386139 1.4586949056098804 0 1 0 0.0848 0.0452 LOW
HG00097 expansionhunter chr1:3089915:3089939 1 4.0 6.0 8.0 8.0 2.0 0.0 72.0 34.0 0.0 49.6216 1.0 24.0 0.6792452830188679 0.2133333333333333 0 0.4455445544554455 1.9438900377229549 0 1 0 0.9023 0.985 HIGH
HG00097 expansionhunter chr1:3109083:3109116 1 3.0 11.0 11.0 9.0 0.0 0.0 24.0 49.0 0.0 41.3514 0.0 33.0 0.3287671232876712 0.22 0 0.3267326732673267 1.8491066059426056 0 1 0 0.8115 0.9036 HIGH
|
| 46 |
+
HG00097 expansionhunter chr1:3152348:3152380 1 4.0 8.0 6.0 6.0 -2.0 0.0 58.0 10.0 0.0 44.9189 1.0 32.0 0.8529411764705882 0.16 0 0.4356435643564356 1.6103551420752904 0 1 0 0.4377 0.5542 MODERATE
|
| 47 |
+
HG00097 expansionhunter chr1:3175983:3176011 1 4.0 7.0 8.0 8.0 1.0 0.0 50.0 42.0 0.0 45.2432 1.0 28.0 0.5434782608695652 0.2133333333333333 0 0.5643564356435643 1.8171755888564567 0 1 0 0.6944 0.8135 HIGH
|
| 48 |
+
HG00097 expansionhunter chr1:3225530:3225574 1 2.0 22.0 21.0 17.0 -1.0 0.0 20.0 40.0 0.0 42.973 0.0 44.0 0.3333333333333333 0.28 0 0.5742574257425742 1.770047538912363 0 1 0 0.6244 0.7366 HIGH
|
| 49 |
+
HG00097 expansionhunter chr1:3240056:3240106 1 5.0 10.0 5.0 4.0 -5.0 0.0 31.0 16.0 0.0 39.6486 0.0 50.0 0.6595744680851063 0.1666666666666666 0 0.5544554455445545 1.0599256723036077 0 1 0 0.2903 0.3499 WARNING
|
| 50 |
+
HG00097 expansionhunter chr1:3273438:3273450 1 2.0 6.0 4.0 4.0 -2.0 0.0 64.0 12.0 0.0 36.7297 1.0 12.0 0.8421052631578947 0.0533333333333333 0 0.5247524752475248 1.856780293719988 0 1 0 0.2091 0.2239 LOW
|
| 51 |
+
HG00097 expansionhunter chr1:3273817:3273851 1 2.0 17.0 17.0 15.0 0.0 0.0 25.0 34.0 0.0 36.973 0.0 34.0 0.423728813559322 0.2266666666666666 0 0.4752475247524752 1.78618987974027 0 1 0 0.2897 0.3499 WARNING
|
example/sv_features.tsv
CHANGED
|
@@ -1,8 +1,51 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 1 |
+
sample caller variant_ID is_pass svtype_DEL svtype_DUP svtype_INS svtype_INV svtype_BND svlen_log cipos_width ciend_width is_imprecise pe_support sr_support total_support vaf gt_hom gq qual_norm local_depth gc_min gc_max entropy_min microhom_max in_segdup_either in_segdup_both in_difficult_either in_difficult_both in_lowmap_either in_tandem_either in_Alu_either in_L1_either in_SVA_either in_LTR_either frac_span_repeat n_neighbors nn_log_dist
|
| 2 |
+
HG00097 manta chr1:861885:BND:None 1 0 0 0 0 1 -99999.0 31.0 -99999.0 0 5.0 7.0 12.0 0.15384615384615385 0 182.0 182.0 43.0 0.297029702970297 0.297029702970297 1.6350564503142975 -99999.0 1 1 1 1 0 1 0 0 0 0 -99999.0 2 4.424979586745809
|
| 3 |
+
HG00097 manta chr1:888490:DEL:888577 1 1 0 0 0 0 1.9444826721501687 55.0 -99999.0 0 0.0 5.0 5.0 0.2631578947368421 0 93.0 116.0 5.0 0.49504950495049505 0.5247524752475248 1.119875913690787 50.0 1 1 1 1 0 1 0 0 0 0 1.0 2 4.20382130251655
|
| 4 |
+
HG00097 manta chr1:904478:DEL:904576 1 1 0 0 0 0 1.99563519459755 14.0 -99999.0 0 0.0 10.0 10.0 0.4166666666666667 0 31.0 294.0 3.0 0.7227722772277227 0.7722772277227723 1.6490755403817534 14.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.20382130251655
|
| 5 |
+
HG00097 manta chr1:998763:INS:None 1 0 0 1 0 0 1.7708520116421442 -99999.0 -99999.0 0 0.0 15.0 15.0 0.6521739130434783 0 43.0 538.0 -99999.0 0.7920792079207921 0.7920792079207921 1.6906469647232745 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.472888033769636
|
| 6 |
+
HG00097 manta chr1:1028471:DEL:1029081 1 1 0 0 0 0 2.786041210242554 19.0 -99999.0 0 13.0 12.0 25.0 0.352112676056338 0 477.0 654.0 36.0 0.6534653465346535 0.7227722772277227 1.7262819301695698 19.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.472888033769636
|
| 7 |
+
HG00097 manta chr1:1068824:INS:None 1 0 0 1 0 0 1.8976270912904414 10.0 -99999.0 0 0.0 29.0 29.0 0.9666666666666667 1 75.0 999.0 -99999.0 0.8514851485148515 0.8514851485148515 1.5447689726449716 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.60588658966101
|
| 8 |
+
HG00097 manta chr1:1427385:DEL:1427442 0 1 0 0 0 0 1.7634279935629373 -99999.0 -99999.0 0 0.0 2.0 2.0 0.16666666666666666 0 13.0 13.0 3.0 0.693069306930693 0.7029702970297029 1.3844773445130296 0.0 0 0 1 1 0 1 0 0 0 0 1.0 0 5.22457708545644
|
| 9 |
+
HG00097 manta chr1:1595101:DEL:1595183 1 1 0 0 0 0 1.919078092376074 2.0 -99999.0 0 0.0 10.0 10.0 0.19230769230769232 0 266.0 266.0 12.0 0.5148514851485149 0.5247524752475248 1.4455984547909617 2.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.856571815297643
|
| 10 |
+
HG00097 manta chr1:1666974:DEL:1667141 1 1 0 0 0 0 2.225309281725863 18.0 -99999.0 0 11.0 21.0 32.0 1.0 1 59.0 888.0 11.0 0.5247524752475248 0.5445544554455446 1.9680479893568856 18.0 1 1 1 1 1 0 1 0 0 0 1.0 3 4.763375575548453
|
| 11 |
+
HG00097 manta chr1:1724966:DEL:1726924 1 1 0 0 0 0 3.2920344359947364 27.0 27.0 0 8.0 10.0 18.0 0.2727272727272727 0 420.0 420.0 31.0 0.5643564356435643 0.6435643564356436 1.9231444533753872 27.0 1 1 1 1 1 1 1 0 0 0 1.0 2 4.391640703492388
|
| 12 |
+
HG00097 manta chr1:1749605:INS:None 1 0 0 1 0 0 1.7075701760979363 23.0 -99999.0 0 0.0 10.0 10.0 0.37037037037037035 0 262.0 358.0 -99999.0 0.4752475247524752 0.4752475247524752 1.6885390899981634 -99999.0 1 1 1 1 1 1 0 0 0 0 -99999.0 2 4.391640703492388
|
| 13 |
+
HG00097 manta chr1:1924223:INS:None 1 0 0 1 0 0 2.0170333392987803 10.0 -99999.0 0 0.0 16.0 16.0 1.0 1 41.0 630.0 -99999.0 0.7425742574257426 0.7425742574257426 1.6063691498853343 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785
|
| 14 |
+
HG00097 manta chr1:1929384:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 0.0 19.0 19.0 1.0 1 39.0 580.0 -99999.0 0.693069306930693 0.693069306930693 1.4041502486751618 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785
|
| 15 |
+
HG00097 manta chr1:1934989:DEL:1935584 1 1 0 0 0 0 2.7752462597402365 10.0 -99999.0 0 16.0 18.0 34.0 1.0 1 86.0 934.0 16.0 0.44554455445544555 0.504950495049505 1.8451028377340413 10.0 0 0 1 1 0 1 0 0 0 0 1.0 5 3.7486530934242674
|
| 16 |
+
HG00097 manta chr1:1948934:INS:None 1 0 0 1 0 0 2.0492180226701815 2.0 -99999.0 0 3.0 22.0 25.0 1.0 1 64.0 864.0 3.0 0.5643564356435643 0.5643564356435643 1.0561905395876316 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.1444496608689
|
| 17 |
+
HG00097 manta chr1:1993704:INS:None 1 0 0 1 0 0 2.1702617153949575 15.0 -99999.0 0 8.0 45.0 53.0 1.0 1 128.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.9812115261970087 -99999.0 0 0 0 0 0 0 1 0 0 0 -99999.0 5 4.406829613621544
|
| 18 |
+
HG00097 manta chr1:2019220:INS:None 1 0 0 1 0 0 -99999.0 23.0 23.0 0 2.0 12.0 14.0 1.0 1 20.0 261.0 2.0 0.8613861386138614 0.8613861386138614 1.3524948796891727 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.406829613621544
|
| 19 |
+
HG00097 manta chr1:2421838:BND:None 1 0 0 0 0 1 -99999.0 3.0 -99999.0 0 4.0 25.0 29.0 0.37662337662337664 0 582.0 999.0 23.0 0.49504950495049505 0.49504950495049505 1.983371708749389 -99999.0 0 0 1 1 0 1 1 0 0 0 -99999.0 0 5.255942731372637
|
| 20 |
+
HG00097 manta chr1:2602115:DUP:2602189 1 0 1 0 0 0 1.8750612633917 -99999.0 -99999.0 0 0.0 11.0 11.0 0.12941176470588237 0 255.0 255.0 30.0 0.5643564356435643 0.6435643564356436 1.913521655146875 1.0 0 0 0 0 0 0 0 0 0 0 0.0 1 1.6989700043360187
|
| 21 |
+
HG00097 manta chr1:2602164:INS:None 1 0 0 1 0 0 1.919078092376074 -99999.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 137.0 697.0 -99999.0 0.594059405940594 0.594059405940594 1.9680479893568856 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 1.6989700043360187
|
| 22 |
+
HG00097 manta chr1:3026038:INS:None 0 0 0 1 0 0 -99999.0 5.0 5.0 0 2.0 26.0 28.0 0.875 1 6.0 831.0 6.0 0.48514851485148514 0.48514851485148514 1.556688426030284 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.192483819902663
|
| 23 |
+
HG00097 manta chr1:3181807:DEL:3181925 1 1 0 0 0 0 2.075546961392531 14.0 -99999.0 0 0.0 27.0 27.0 1.0 1 65.0 999.0 -99999.0 0.44554455445544555 0.46534653465346537 1.9395446972377677 14.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.525860772931853
|
| 24 |
+
HG00097 manta chr1:3215369:INS:None 1 0 0 1 0 0 1.792391689498254 54.0 -99999.0 0 0.0 27.0 27.0 1.0 1 68.0 999.0 -99999.0 0.6831683168316832 0.6831683168316832 1.8503664672483442 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.525860772931853
|
| 25 |
+
HG00097 manta chr1:3260742:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 8.0 29.0 37.0 1.0 1 95.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.6383659495376368 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.6568070667104475
|
| 26 |
+
HG00097 manta chr1:3316611:DEL:3316667 1 1 0 0 0 0 1.7558748556724915 23.0 -99999.0 0 0.0 10.0 10.0 0.37037037037037035 0 177.0 306.0 5.0 0.49504950495049505 0.49504950495049505 1.8200734972498984 23.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.747178671360165
|
| 27 |
+
HG00097 manta chr1:3832372:DEL:3832444 1 1 0 0 0 0 1.863322860120456 45.0 -99999.0 0 0.0 5.0 5.0 0.625 0 25.0 127.0 1.0 0.5247524752475248 0.5841584158415841 1.9137004533259432 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.560086048497414
|
| 28 |
+
HG00097 manta chr1:3868686:INS:None 1 0 0 1 0 0 2.4712917110589387 1.0 -99999.0 0 4.0 38.0 42.0 1.0 1 29.0 708.0 4.0 0.6336633663366337 0.6336633663366337 1.5250608688455582 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.560086048497414
|
| 29 |
+
HG00097 manta chr1:3929498:INS:None 1 0 0 1 0 0 2.4668676203541096 4.0 -99999.0 0 2.0 27.0 29.0 1.0 1 62.0 707.0 2.0 0.44554455445544555 0.44554455445544555 1.4880199108409586 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.7839964283643
|
| 30 |
+
HG00097 manta chr1:3999775:DEL:3999895 1 1 0 0 0 0 2.08278537031645 -99999.0 -99999.0 0 0.0 25.0 25.0 1.0 1 73.0 999.0 -99999.0 0.6633663366336634 0.693069306930693 1.7648687041978608 1.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.846819393669543
|
| 31 |
+
HG00097 manta chr1:4129813:DEL:4129873 1 1 0 0 0 0 1.7853298350107671 45.0 -99999.0 0 0.0 9.0 9.0 0.34615384615384615 0 162.0 312.0 6.0 0.6039603960396039 0.6831683168316832 1.8738460050301151 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.16660770308391
|
| 32 |
+
HG00097 manta chr1:4144488:DUP:4144620 0 0 1 0 0 0 2.123851640967086 4.0 4.0 0 0.0 22.0 22.0 0.43137254901960786 0 332.0 518.0 11.0 0.42574257425742573 0.48514851485148514 1.4784771363149791 4.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.1492191126553797
|
| 33 |
+
HG00097 manta chr1:4144628:INS:None 0 0 0 1 0 0 1.9084850188786497 2.0 -99999.0 0 0.0 28.0 28.0 1.0 1 71.0 982.0 -99999.0 0.48514851485148514 0.48514851485148514 1.4687802326926382 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.1492191126553797
|
| 34 |
+
HG00097 manta chr1:4333580:BND:None 1 0 0 0 0 1 -99999.0 -99999.0 -99999.0 0 12.0 0.0 12.0 0.5 0 142.0 142.0 24.0 0.5643564356435643 0.5643564356435643 1.9623130244403804 -99999.0 0 0 1 1 0 1 0 0 0 1 -99999.0 1 3.465680211598278
|
| 35 |
+
HG00097 manta chr1:4336501:INS:None 1 0 0 1 0 0 -99999.0 27.0 27.0 0 3.0 37.0 40.0 1.0 1 68.0 929.0 3.0 0.40594059405940597 0.40594059405940597 1.747964514410695 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 3.465680211598278
|
| 36 |
+
HG00097 manta chr1:4939764:BND:None 0 0 0 0 0 1 -99999.0 599.0 -99999.0 1 14.0 0.0 14.0 0.4117647058823529 0 115.0 115.0 34.0 0.45544554455445546 0.45544554455445546 1.9728402340561 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.133363239048624
|
| 37 |
+
HG00097 manta chr1:5075708:INS:None 1 0 0 1 0 0 1.863322860120456 13.0 -99999.0 0 0.0 12.0 12.0 1.0 1 26.0 288.0 -99999.0 0.07920792079207921 0.07920792079207921 1.3986907959255488 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689
|
| 38 |
+
HG00097 manta chr1:5160300:INS:None 0 0 0 1 0 0 2.765668554759014 -99999.0 -99999.0 0 0.0 46.0 46.0 0.92 1 52.0 999.0 4.0 0.37623762376237624 0.37623762376237624 1.7050759800383022 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689
|
| 39 |
+
HG00097 manta chr1:5387025:BND:None 1 0 0 0 0 1 -99999.0 466.0 -99999.0 1 15.0 0.0 15.0 0.6 0 116.0 228.0 25.0 0.5148514851485149 0.5148514851485149 1.3511081047001698 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.161368002234975
|
| 40 |
+
HG00097 manta chr1:5387169:DUP:5387401 1 0 1 0 0 0 2.367355921026019 -99999.0 -99999.0 0 1.0 23.0 24.0 0.42857142857142855 0 338.0 481.0 13.0 0.45544554455445546 0.4752475247524752 1.5481463016741694 0.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.161368002234975
|
| 41 |
+
HG00097 manta chr1:5414152:DEL:5414227 1 1 0 0 0 0 1.8808135922807914 18.0 -99999.0 0 0.0 7.0 7.0 0.3333333333333333 0 102.0 157.0 6.0 0.5346534653465347 0.5346534653465347 1.9698531038077964 18.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.431106328181145
|
| 42 |
+
HG00097 manta chr1:5499440:DEL:5499546 1 1 0 0 0 0 2.0293837776852097 8.0 -99999.0 0 0.0 9.0 9.0 0.21951219512195122 0 276.0 276.0 7.0 0.26732673267326734 0.26732673267326734 1.6246863594780123 8.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.930893022406026
|
| 43 |
+
HG00097 manta chr1:5593288:INS:None 1 0 0 1 0 0 1.806179973983887 2.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 142.0 889.0 -99999.0 0.5544554455445545 0.5544554455445545 1.9892256298878759 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 2 4.9076585372801205
|
| 44 |
+
HG00097 manta chr1:5674133:BND:None 0 0 0 0 0 1 -99999.0 7.0 -99999.0 0 11.0 47.0 58.0 0.3020833333333333 0 999.0 999.0 78.0 0.38613861386138615 0.38613861386138615 1.9050918955475322 -99999.0 0 0 0 0 0 0 0 1 0 0 -99999.0 1 4.9076585372801205
|
| 45 |
+
HG00097 manta chr1:5874228:INS:None 1 0 0 1 0 0 2.4533183400470375 13.0 -99999.0 0 18.0 23.0 41.0 0.5394736842105263 0 327.0 571.0 37.0 0.44554455445544555 0.44554455445544555 1.9714887292053216 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 0 5.116717299877443
|
| 46 |
+
HG00097 manta chr1:6005060:DEL:6005340 1 1 0 0 0 0 2.44870631990508 -99999.0 -99999.0 0 14.0 15.0 29.0 0.9666666666666667 1 23.0 314.0 15.0 0.6237623762376238 0.6831683168316832 1.5376840111363732 2.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.317666442356501
|
| 47 |
+
HG00097 manta chr1:6025840:DEL:6025908 1 1 0 0 0 0 1.8388490907372552 74.0 -99999.0 0 0.0 6.0 6.0 0.2222222222222222 0 182.0 184.0 9.0 0.6435643564356436 0.7029702970297029 1.7307956407165326 50.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.317666442356501
|
| 48 |
+
HG00097 manta chr1:6742557:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 5.0 31.0 36.0 1.0 1 101.0 999.0 5.0 0.07920792079207921 0.07920792079207921 1.320097363865938 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.297445341827969
|
| 49 |
+
HG00097 manta chr1:6940912:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 0.0 34.0 34.0 0.9714285714285714 1 95.0 999.0 -99999.0 0.6336633663366337 0.6336633663366337 1.902230293166466 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 4.635272580112365
|
| 50 |
+
HG00097 manta chr1:6984090:INS:None 1 0 0 1 0 0 2.311753861055754 2.0 -99999.0 0 6.0 35.0 41.0 1.0 1 72.0 999.0 6.0 0.48514851485148514 0.48514851485148514 1.5136654464703396 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.635272580112365
|
| 51 |
+
HG00097 manta chr1:7510011:DEL:7511458 1 1 0 0 0 0 3.1607685618611283 -99999.0 -99999.0 0 11.0 15.0 26.0 0.36619718309859156 0 527.0 711.0 35.0 0.46534653465346537 0.48514851485148514 1.8175669943687356 0.0 0 0 1 0 0 1 0 0 0 0 0.015193370165745856 1 4.5124575861973435
|
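The rows above can be sanity-checked against their own `variant_ID`. The sketch below is not taken from `feature_builder.py`. It assumes, based only on the example rows, that `variant_ID` has the form `chrom:pos:svtype:end`, that `svlen_log` equals `log10(end - pos + 1)` when an end coordinate is present, that `total_support` is `pe_support + sr_support`, and that `-99999.0` marks a value that is not available. The helper names are hypothetical.

```python
import math

# Assumption: -99999.0 is the "not available" sentinel seen in the example rows.
MISSING = -99999.0


def parse_variant_id(variant_id):
    """Split 'chrom:pos:svtype:end'; end is None when the ID carries the text 'None'."""
    chrom, pos, svtype, end = variant_id.split(":")
    return chrom, int(pos), svtype, (None if end == "None" else int(end))


def expected_svlen_log(variant_id):
    """Return log10(end - pos + 1), or None when the ID has no end coordinate."""
    _, pos, _, end = parse_variant_id(variant_id)
    if end is None:
        return None
    return math.log10(end - pos + 1)


def is_missing(value):
    """True when a feature value equals the presumed missing-value sentinel."""
    return float(value) == MISSING


# Values copied from the chr1:888490 deletion row above (columns trimmed).
row = {
    "variant_ID": "chr1:888490:DEL:888577",
    "svlen_log": 1.9444826721501687,
    "pe_support": 0.0,
    "sr_support": 5.0,
    "total_support": 5.0,
    "ciend_width": -99999.0,
}

svlen_matches = math.isclose(
    expected_svlen_log(row["variant_ID"]), row["svlen_log"], abs_tol=1e-6
)
support_matches = row["pe_support"] + row["sr_support"] == row["total_support"]
```

Insertions and breakends have no end coordinate in the ID, so their `svlen_log` cannot be recomputed this way and the helper returns `None` for them.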
example/sv_scored.tsv
CHANGED
|
@@ -1,8 +1,51 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
|
| 1 |
+
sample caller variant_ID is_pass svtype_DEL svtype_DUP svtype_INS svtype_INV svtype_BND svlen_log cipos_width ciend_width is_imprecise pe_support sr_support total_support vaf gt_hom gq qual_norm local_depth gc_min gc_max entropy_min microhom_max in_segdup_either in_segdup_both in_difficult_either in_difficult_both in_lowmap_either in_tandem_either in_Alu_either in_L1_either in_SVA_either in_LTR_either frac_span_repeat n_neighbors nn_log_dist CS_raw CS tier
|
| 2 |
+
HG00097 manta chr1:861885:BND:None 1 0 0 0 0 1 -99999.0 31.0 -99999.0 0 5.0 7.0 12.0 0.1538461538461538 0 182.0 182.0 43.0 0.297029702970297 0.297029702970297 1.6350564503142977 -99999.0 1 1 1 1 0 1 0 0 0 0 -99999.0 2 4.424979586745809 0.0077 0.0016 LOW
|
| 3 |
+
HG00097 manta chr1:888490:DEL:888577 1 1 0 0 0 0 1.944482672150169 55.0 -99999.0 0 0.0 5.0 5.0 0.2631578947368421 0 93.0 116.0 5.0 0.495049504950495 0.5247524752475248 1.119875913690787 50.0 1 1 1 1 0 1 0 0 0 0 1.0 2 4.20382130251655 0.7047 0.8104 HIGH
|
| 4 |
+
HG00097 manta chr1:904478:DEL:904576 1 1 0 0 0 0 1.99563519459755 14.0 -99999.0 0 0.0 10.0 10.0 0.4166666666666667 0 31.0 294.0 3.0 0.7227722772277227 0.7722772277227723 1.6490755403817534 14.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.20382130251655 0.616 0.6822 MODERATE
|
| 5 |
+
HG00097 manta chr1:998763:INS:None 1 0 0 1 0 0 1.7708520116421442 -99999.0 -99999.0 0 0.0 15.0 15.0 0.6521739130434783 0 43.0 538.0 -99999.0 0.7920792079207921 0.7920792079207921 1.6906469647232745 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.472888033769636 0.6379 0.7226 HIGH
|
| 6 |
+
HG00097 manta chr1:1028471:DEL:1029081 1 1 0 0 0 0 2.786041210242554 19.0 -99999.0 0 13.0 12.0 25.0 0.352112676056338 0 477.0 654.0 36.0 0.6534653465346535 0.7227722772277227 1.7262819301695698 19.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.472888033769636 0.809 0.8988 HIGH
|
| 7 |
+
HG00097 manta chr1:1068824:INS:None 1 0 0 1 0 0 1.8976270912904412 10.0 -99999.0 0 0.0 29.0 29.0 0.9666666666666668 1 75.0 999.0 -99999.0 0.8514851485148515 0.8514851485148515 1.5447689726449716 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.60588658966101 0.7489 0.8545 HIGH
|
| 8 |
+
HG00097 manta chr1:1427385:DEL:1427442 0 1 0 0 0 0 1.7634279935629371 -99999.0 -99999.0 0 0.0 2.0 2.0 0.1666666666666666 0 13.0 13.0 3.0 0.693069306930693 0.7029702970297029 1.3844773445130296 0.0 0 0 1 1 0 1 0 0 0 0 1.0 0 5.22457708545644 0.1506 0.1012 LOW
|
| 9 |
+
HG00097 manta chr1:1595101:DEL:1595183 1 1 0 0 0 0 1.919078092376074 2.0 -99999.0 0 0.0 10.0 10.0 0.1923076923076923 0 266.0 266.0 12.0 0.5148514851485149 0.5247524752475248 1.4455984547909615 2.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.856571815297643 0.4295 0.4349 WARNING
|
| 10 |
+
HG00097 manta chr1:1666974:DEL:1667141 1 1 0 0 0 0 2.225309281725863 18.0 -99999.0 0 11.0 21.0 32.0 1.0 1 59.0 888.0 11.0 0.5247524752475248 0.5445544554455446 1.9680479893568856 18.0 1 1 1 1 1 0 1 0 0 0 1.0 3 4.763375575548453 0.7289 0.8283 HIGH
|
| 11 |
+
HG00097 manta chr1:1724966:DEL:1726924 1 1 0 0 0 0 3.2920344359947364 27.0 27.0 0 8.0 10.0 18.0 0.2727272727272727 0 420.0 420.0 31.0 0.5643564356435643 0.6435643564356436 1.9231444533753872 27.0 1 1 1 1 1 1 1 0 0 0 1.0 2 4.391640703492388 0.8005 0.8964 HIGH
|
| 12 |
+
HG00097 manta chr1:1749605:INS:None 1 0 0 1 0 0 1.7075701760979365 23.0 -99999.0 0 0.0 10.0 10.0 0.3703703703703703 0 262.0 358.0 -99999.0 0.4752475247524752 0.4752475247524752 1.6885390899981634 -99999.0 1 1 1 1 1 1 0 0 0 0 -99999.0 2 4.391640703492388 0.7091 0.8117 HIGH
|
| 13 |
+
HG00097 manta chr1:1924223:INS:None 1 0 0 1 0 0 2.0170333392987803 10.0 -99999.0 0 0.0 16.0 16.0 1.0 1 41.0 630.0 -99999.0 0.7425742574257426 0.7425742574257426 1.6063691498853343 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785 0.526 0.5557 MODERATE
|
| 14 |
+
HG00097 manta chr1:1929384:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 0.0 19.0 19.0 1.0 1 39.0 580.0 -99999.0 0.693069306930693 0.693069306930693 1.4041502486751618 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 3.71281800020785 0.2185 0.1761 LOW
|
| 15 |
+
HG00097 manta chr1:1934989:DEL:1935584 1 1 0 0 0 0 2.7752462597402365 10.0 -99999.0 0 16.0 18.0 34.0 1.0 1 86.0 934.0 16.0 0.4455445544554455 0.504950495049505 1.8451028377340413 10.0 0 0 1 1 0 1 0 0 0 0 1.0 5 3.748653093424267 0.686 0.7801 HIGH
|
| 16 |
+
HG00097 manta chr1:1948934:INS:None 1 0 0 1 0 0 2.049218022670181 2.0 -99999.0 0 3.0 22.0 25.0 1.0 1 64.0 864.0 3.0 0.5643564356435643 0.5643564356435643 1.0561905395876316 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.1444496608689 0.7533 0.8545 HIGH
|
| 17 |
+
HG00097 manta chr1:1993704:INS:None 1 0 0 1 0 0 2.170261715394957 15.0 -99999.0 0 8.0 45.0 53.0 1.0 1 128.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.981211526197009 -99999.0 0 0 0 0 0 0 1 0 0 0 -99999.0 5 4.406829613621544 0.8105 0.8988 HIGH
|
| 18 |
+
HG00097 manta chr1:2019220:INS:None 1 0 0 1 0 0 -99999.0 23.0 23.0 0 2.0 12.0 14.0 1.0 1 20.0 261.0 2.0 0.8613861386138614 0.8613861386138614 1.3524948796891727 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 5 4.406829613621544 0.5454 0.5772 MODERATE
|
| 19 |
+
HG00097 manta chr1:2421838:BND:None 1 0 0 0 0 1 -99999.0 3.0 -99999.0 0 4.0 25.0 29.0 0.3766233766233766 0 582.0 999.0 23.0 0.495049504950495 0.495049504950495 1.983371708749389 -99999.0 0 0 1 1 0 1 1 0 0 0 -99999.0 0 5.255942731372637 0.002 0.0003 LOW
|
| 20 |
+
HG00097 manta chr1:2602115:DUP:2602189 1 0 1 0 0 0 1.8750612633917 -99999.0 -99999.0 0 0.0 11.0 11.0 0.1294117647058823 0 255.0 255.0 30.0 0.5643564356435643 0.6435643564356436 1.913521655146875 1.0 0 0 0 0 0 0 0 0 0 0 0.0 1 1.6989700043360187 0.0 0.0 LOW
|
| 21 |
+
HG00097 manta chr1:2602164:INS:None 1 0 0 1 0 0 1.919078092376074 -99999.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 137.0 697.0 -99999.0 0.594059405940594 0.594059405940594 1.9680479893568856 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 1.6989700043360187 0.7853 0.8865 HIGH
|
| 22 |
+
HG00097 manta chr1:3026038:INS:None 0 0 0 1 0 0 -99999.0 5.0 5.0 0 2.0 26.0 28.0 0.875 1 6.0 831.0 6.0 0.4851485148514851 0.4851485148514851 1.556688426030284 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.192483819902663 0.8821 0.935 HIGH
|
| 23 |
+
HG00097 manta chr1:3181807:DEL:3181925 1 1 0 0 0 0 2.075546961392531 14.0 -99999.0 0 0.0 27.0 27.0 1.0 1 65.0 999.0 -99999.0 0.4455445544554455 0.4653465346534653 1.939544697237768 14.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.525860772931853 0.6932 0.798 HIGH
|
| 24 |
+
HG00097 manta chr1:3215369:INS:None 1 0 0 1 0 0 1.792391689498254 54.0 -99999.0 0 0.0 27.0 27.0 1.0 1 68.0 999.0 -99999.0 0.6831683168316832 0.6831683168316832 1.850366467248344 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.525860772931853 0.8417 0.9154 HIGH
|
| 25 |
+
HG00097 manta chr1:3260742:INS:None 1 0 0 1 0 0 -99999.0 5.0 5.0 0 8.0 29.0 37.0 1.0 1 95.0 999.0 8.0 0.4752475247524752 0.4752475247524752 1.6383659495376368 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.656807066710448 0.8274 0.9091 HIGH
|
| 26 |
+
HG00097 manta chr1:3316611:DEL:3316667 1 1 0 0 0 0 1.7558748556724917 23.0 -99999.0 0 0.0 10.0 10.0 0.3703703703703703 0 177.0 306.0 5.0 0.495049504950495 0.495049504950495 1.8200734972498984 23.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.747178671360165 0.8483 0.917 HIGH
|
| 27 |
+
HG00097 manta chr1:3832372:DEL:3832444 1 1 0 0 0 0 1.863322860120456 45.0 -99999.0 0 0.0 5.0 5.0 0.625 0 25.0 127.0 1.0 0.5247524752475248 0.5841584158415841 1.9137004533259432 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.560086048497414 0.6316 0.7133 HIGH
|
| 28 |
+
HG00097 manta chr1:3868686:INS:None 1 0 0 1 0 0 2.4712917110589387 1.0 -99999.0 0 4.0 38.0 42.0 1.0 1 29.0 708.0 4.0 0.6336633663366337 0.6336633663366337 1.5250608688455582 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 4.560086048497414 0.6765 0.7715 HIGH
|
| 29 |
+
HG00097 manta chr1:3929498:INS:None 1 0 0 1 0 0 2.4668676203541096 4.0 -99999.0 0 2.0 27.0 29.0 1.0 1 62.0 707.0 2.0 0.4455445544554455 0.4455445544554455 1.4880199108409586 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 3 4.7839964283643 0.4086 0.3836 WARNING
|
| 30 |
+
HG00097 manta chr1:3999775:DEL:3999895 1 1 0 0 0 0 2.08278537031645 -99999.0 -99999.0 0 0.0 25.0 25.0 1.0 1 73.0 999.0 -99999.0 0.6633663366336634 0.693069306930693 1.7648687041978608 1.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.846819393669543 0.7755 0.874 HIGH
|
| 31 |
+
HG00097 manta chr1:4129813:DEL:4129873 1 1 0 0 0 0 1.7853298350107671 45.0 -99999.0 0 0.0 9.0 9.0 0.3461538461538461 0 162.0 312.0 6.0 0.6039603960396039 0.6831683168316832 1.8738460050301151 45.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.16660770308391 0.8085 0.8988 HIGH
|
| 32 |
+
HG00097 manta chr1:4144488:DUP:4144620 0 0 1 0 0 0 2.123851640967086 4.0 4.0 0 0.0 22.0 22.0 0.4313725490196078 0 332.0 518.0 11.0 0.4257425742574257 0.4851485148514851 1.4784771363149791 4.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.14921911265538 0.0007 0.0 LOW
|
| 33 |
+
HG00097 manta chr1:4144628:INS:None 0 0 0 1 0 0 1.9084850188786495 2.0 -99999.0 0 0.0 28.0 28.0 1.0 1 71.0 982.0 -99999.0 0.4851485148514851 0.4851485148514851 1.4687802326926382 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.14921911265538 0.1717 0.1265 LOW
|
| 34 |
+
HG00097 manta chr1:4333580:BND:None 1 0 0 0 0 1 -99999.0 -99999.0 -99999.0 0 12.0 0.0 12.0 0.5 0 142.0 142.0 24.0 0.5643564356435643 0.5643564356435643 1.9623130244403804 -99999.0 0 0 1 1 0 1 0 0 0 1 -99999.0 1 3.465680211598278 0.0055 0.0012 LOW
|
| 35 |
+
HG00097 manta chr1:4336501:INS:None 1 0 0 1 0 0 -99999.0 27.0 27.0 0 3.0 37.0 40.0 1.0 1 68.0 929.0 3.0 0.4059405940594059 0.4059405940594059 1.747964514410695 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 3.465680211598278 0.7674 0.8684 HIGH
|
| 36 |
+
HG00097 manta chr1:4939764:BND:None 0 0 0 0 0 1 -99999.0 599.0 -99999.0 1 14.0 0.0 14.0 0.4117647058823529 0 115.0 115.0 34.0 0.4554455445544554 0.4554455445544554 1.9728402340561 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.133363239048624 0.0303 0.0134 LOW
|
| 37 |
+
HG00097 manta chr1:5075708:INS:None 1 0 0 1 0 0 1.863322860120456 13.0 -99999.0 0 0.0 12.0 12.0 1.0 1 26.0 288.0 -99999.0 0.0792079207920792 0.0792079207920792 1.3986907959255488 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689 0.749 0.8545 HIGH
|
| 38 |
+
HG00097 manta chr1:5160300:INS:None 0 0 0 1 0 0 2.765668554759014 -99999.0 -99999.0 0 0.0 46.0 46.0 0.92 1 52.0 999.0 4.0 0.3762376237623762 0.3762376237623762 1.7050759800383022 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.927334427020689 0.7965 0.8914 HIGH
|
| 39 |
+
HG00097 manta chr1:5387025:BND:None 1 0 0 0 0 1 -99999.0 466.0 -99999.0 1 15.0 0.0 15.0 0.6 0 116.0 228.0 25.0 0.5148514851485149 0.5148514851485149 1.3511081047001698 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 2 2.161368002234975 0.0015 0.0 LOW
|
| 40 |
+
HG00097 manta chr1:5387169:DUP:5387401 1 0 1 0 0 0 2.367355921026019 -99999.0 -99999.0 0 1.0 23.0 24.0 0.4285714285714285 0 338.0 481.0 13.0 0.4554455445544554 0.4752475247524752 1.5481463016741694 0.0 0 0 1 1 0 1 0 0 0 0 1.0 2 2.161368002234975 0.0005 0.0 LOW
|
| 41 |
+
HG00097 manta chr1:5414152:DEL:5414227 1 1 0 0 0 0 1.8808135922807916 18.0 -99999.0 0 0.0 7.0 7.0 0.3333333333333333 0 102.0 157.0 6.0 0.5346534653465347 0.5346534653465347 1.9698531038077964 18.0 0 0 1 1 0 1 0 0 0 0 1.0 3 4.431106328181145 0.6073 0.67 MODERATE
|
| 42 |
+
HG00097 manta chr1:5499440:DEL:5499546 1 1 0 0 0 0 2.0293837776852097 8.0 -99999.0 0 0.0 9.0 9.0 0.2195121951219512 0 276.0 276.0 7.0 0.2673267326732673 0.2673267326732673 1.6246863594780123 8.0 0 0 1 1 0 1 0 0 0 0 1.0 2 4.930893022406026 0.5474 0.5784 MODERATE
|
| 43 |
+
HG00097 manta chr1:5593288:INS:None 1 0 0 1 0 0 1.806179973983887 2.0 -99999.0 0 0.0 22.0 22.0 0.6470588235294118 0 142.0 889.0 -99999.0 0.5544554455445545 0.5544554455445545 1.989225629887876 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 2 4.9076585372801205 0.9166 0.9509 HIGH
|
| 44 |
+
HG00097 manta chr1:5674133:BND:None 0 0 0 0 0 1 -99999.0 7.0 -99999.0 0 11.0 47.0 58.0 0.3020833333333333 0 999.0 999.0 78.0 0.3861386138613861 0.3861386138613861 1.905091895547532 -99999.0 0 0 0 0 0 0 0 1 0 0 -99999.0 1 4.9076585372801205 0.0368 0.0188 LOW
|
| 45 |
+
HG00097 manta chr1:5874228:INS:None 1 0 0 1 0 0 2.453318340047037 13.0 -99999.0 0 18.0 23.0 41.0 0.5394736842105263 0 327.0 571.0 37.0 0.4455445544554455 0.4455445544554455 1.9714887292053216 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 0 5.116717299877443 0.9601 0.9697 HIGH
|
| 46 |
+
HG00097 manta chr1:6005060:DEL:6005340 1 1 0 0 0 0 2.44870631990508 -99999.0 -99999.0 0 14.0 15.0 29.0 0.9666666666666668 1 23.0 314.0 15.0 0.6237623762376238 0.6831683168316832 1.5376840111363732 2.0 0 0 1 1 0 1 0 0 0 0 1.0 1 4.317666442356501 0.7321 0.8286 HIGH
|
| 47 |
+
HG00097 manta chr1:6025840:DEL:6025908 1 1 0 0 0 0 1.8388490907372552 74.0 -99999.0 0 0.0 6.0 6.0 0.2222222222222222 0 182.0 184.0 9.0 0.6435643564356436 0.7029702970297029 1.7307956407165326 50.0 0 0 1 1 1 1 0 0 0 0 1.0 1 4.317666442356501 0.876 0.9325 HIGH
|
| 48 |
+
HG00097 manta chr1:6742557:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 5.0 31.0 36.0 1.0 1 101.0 999.0 5.0 0.0792079207920792 0.0792079207920792 1.320097363865938 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 0 5.297445341827969 0.6745 0.7695 HIGH
|
| 49 |
+
HG00097 manta chr1:6940912:INS:None 1 0 0 1 0 0 -99999.0 4.0 4.0 0 0.0 34.0 34.0 0.9714285714285714 1 95.0 999.0 -99999.0 0.6336633663366337 0.6336633663366337 1.902230293166466 -99999.0 0 0 0 0 0 0 0 0 0 0 -99999.0 1 4.635272580112365 0.7578 0.8633 HIGH
|
| 50 |
+
HG00097 manta chr1:6984090:INS:None 1 0 0 1 0 0 2.311753861055754 2.0 -99999.0 0 6.0 35.0 41.0 1.0 1 72.0 999.0 6.0 0.4851485148514851 0.4851485148514851 1.5136654464703396 -99999.0 0 0 1 1 0 1 0 0 0 0 -99999.0 1 4.635272580112365 0.8751 0.9325 HIGH
|
| 51 |
+
HG00097 manta chr1:7510011:DEL:7511458 1 1 0 0 0 0 3.1607685618611283 -99999.0 -99999.0 0 11.0 15.0 26.0 0.3661971830985915 0 527.0 711.0 35.0 0.4653465346534653 0.4851485148514851 1.817566994368736 0.0 0 0 1 0 0 1 0 0 0 0 0.0151933701657458 1 4.512457586197344 0.7788 0.8742 HIGH
|
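A quick way to inspect a scored file is to count calls per tier. The sketch below is illustrative and is not part of `score_svstr.py`. It assumes the scored file is tab-separated with a header row that includes `CS` and `tier`, as in the header above, and the function names are hypothetical. The sample rows are copied from the table above with most feature columns dropped for brevity.

```python
import csv
import io
from collections import Counter


def tally_tiers(handle):
    """Count rows per tier label in a tab-separated scored file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["tier"] for row in reader)


def select_tier(handle, tier):
    """Return the variant_ID and calibrated score (CS) of rows with the given tier."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["variant_ID"], float(row["CS"])) for row in reader if row["tier"] == tier]


# Trimmed sample: only the identifier and the three scoring columns are kept.
SAMPLE = (
    "variant_ID\tCS_raw\tCS\ttier\n"
    "chr1:861885:BND:None\t0.0077\t0.0016\tLOW\n"
    "chr1:888490:DEL:888577\t0.7047\t0.8104\tHIGH\n"
    "chr1:904478:DEL:904576\t0.616\t0.6822\tMODERATE\n"
    "chr1:1595101:DEL:1595183\t0.4295\t0.4349\tWARNING\n"
    "chr1:1028471:DEL:1029081\t0.809\t0.8988\tHIGH\n"
)

tier_counts = tally_tiers(io.StringIO(SAMPLE))
high_calls = select_tier(io.StringIO(SAMPLE), "HIGH")
```

For a real file, pass an open handle instead of the `io.StringIO` wrapper, for example `open("example/sv_scored.tsv", newline="")`.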
feature_builder.py
ADDED
|
@@ -0,0 +1,708 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
SVSTR_Score feature builder (VCF + reference only, single sample).
|
| 4 |
+
|
| 5 |
+
Computes the RandomForest input features defined in:
|
| 6 |
+
sv_features.tsv (callers: manta, delly, lumpy)
|
| 7 |
+
str_features.tsv (callers: expansionhunter, gangstr)
|
| 8 |
+
|
| 9 |
+
Design constraints (head model):
|
| 10 |
+
- Inputs are ONLY a short-read VCF + reference FASTA + static annotation BEDs.
|
| 11 |
+
No BAM, no cohort, no long-read (long-read is used elsewhere for labeling only).
|
| 12 |
+
- Features are caller-common *concepts*; each caller is parsed by its own parser.
|
| 13 |
+
- `caller` is recorded for bookkeeping but is NOT emitted as a model feature.
|
| 14 |
+
|
| 15 |
+
Annotation BEDs must be sorted, bgzipped and tabix-indexed (see
|
| 16 |
+
scripts/prepare_annotations or the resources/ prep step).
|
| 17 |
+
|
| 18 |
+
ExpansionHunter input is its flat (optionally gzipped) TSV, not a VCF — pass it to --vcf.
|
| 19 |
+
|
| 20 |
+
VALIDATION: validated on HG00097 (Manta/Delly/GangSTR VCFs + ExpansionHunter TSV).
|
| 21 |
+
Four parsing bugs were found & fixed against real data:
|
| 22 |
+
1. GangSTR REPCN/REPCI come back from pysam as tuples (Number=2), not strings.
|
| 23 |
+
2. pysam returns absent Flag fields as False (no KeyError) -> is_imprecise now uses a membership test (`in rec.info`).
|
| 24 |
+
3. INFO/END is consumed into rec.stop; rec.info['END'] is empty.
|
| 25 |
+
4. missing sentinel must be out-of-range (-99999); -1 collided with real negative
|
| 26 |
+
expansion_over_ref (contractions). The LUMPY (smoove/SVTyper) parser has not yet been run on real data.
|
| 27 |
+
|
| 28 |
+
Usage:
|
| 29 |
+
python feature_builder.py \
|
| 30 |
+
--vcf sample.manta.vcf.gz --caller manta \
|
| 31 |
+
--fasta GRCh38.fa \
|
| 32 |
+
--giab-dir ../resources/giab_prepared \
|
| 33 |
+
--repeatmasker ../resources/repeatmasker/rmsk_class.bed.gz \
|
| 34 |
+
-o sample.manta.features.tsv
|
| 35 |
+
"""
|
| 36 |
+
|
| 37 |
+
import os
|
| 38 |
+
import sys
|
| 39 |
+
import math
|
| 40 |
+
import bisect
|
| 41 |
+
import argparse
|
| 42 |
+
from collections import defaultdict
|
| 43 |
+
|
| 44 |
+
import numpy as np
|
| 45 |
+
import pandas as pd
|
| 46 |
+
import pysam
|
| 47 |
+
|
| 48 |
+
MISSING = -99999.0 # out-of-range sentinel for missing fields (paired with *_missing indicators).
|
| 49 |
+
# Must be outside every feature's real range: expansion_over_ref can legitimately be negative,
|
| 50 |
+
# so a small sentinel like -1 would collide with real contractions.
|
| 51 |
+
SV_CALLERS = {"manta", "delly", "lumpy"}
|
| 52 |
+
STR_CALLERS = {"expansionhunter", "gangstr"}
|
| 53 |
+
PRIMARY_CONTIGS = ({f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]}
|
| 54 |
+
| {str(c) for c in list(range(1, 23)) + ["X", "Y", "MT", "M"]})
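To see why the sentinel has to sit outside every feature's real range, the sketch below compares a hypothetical legacy sentinel of -1 with the -99999 value defined above. The helper `expansion_over_ref` is a stand-in written for this illustration only; it is assumed to mirror the `cn_max - ref_cn` definition used for the STR features.

```python
MISSING = -99999.0  # out-of-range sentinel, same value as the module constant


def expansion_over_ref(cn_max, ref_cn, missing=MISSING):
    """Largest allele copy number minus reference copy number (illustrative helper)."""
    if missing in (cn_max, ref_cn):
        return missing
    return cn_max - ref_cn


# A real one-unit contraction: 9 copies called against a 10-copy reference.
contraction = expansion_over_ref(9.0, 10.0)

# Hypothetical legacy behaviour: -1 marks "no call", so it looks like the contraction.
legacy_missing = expansion_over_ref(-1.0, 10.0, missing=-1.0)

# With the out-of-range sentinel the two cases stay separable.
safe_missing = expansion_over_ref(MISSING, 10.0)

print("legacy sentinel collides:", contraction == legacy_missing)
print("out-of-range sentinel collides:", contraction == safe_missing)
```

A tree model can isolate -99999 with a single split, whereas a value of -1 would be mixed in with genuine contractions.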
|
| 55 |
+
|
| 56 |
+
# Features that can legitimately be MISSING. With --missing-indicators, each `<feat>_missing` indicator is
|
| 57 |
+
# emitted ALWAYS (even if all-zero for a given caller) so every caller's output
|
| 58 |
+
# has an identical, fixed column schema — one trained model consumes any caller's
|
| 59 |
+
# converted VCF directly, no per-caller alignment needed.
|
| 60 |
+
SV_MISSING_INDICATORS = [
|
| 61 |
+
"svlen_log", "cipos_width", "ciend_width", "vaf", "qual_norm", "gq",
|
| 62 |
+
"local_depth", "gt_hom", "gc_min", "gc_max", "entropy_min", "microhom_max",
|
| 63 |
+
"frac_span_repeat", "nn_log_dist",
|
| 64 |
+
]
|
| 65 |
+
STR_MISSING_INDICATORS = [
|
| 66 |
+
"motif_len", "ref_copynum", "locus_depth", "gt_hom", "gt_repcn_max", "gt_repcn_min",
|
| 67 |
+
"expansion_over_ref", "repci_width_max", "spanning_frac", "ref_tract_bp",
|
| 68 |
+
"allele_vs_readlen", "motif_is_homopolymer", "gc_flank", "entropy_flank",
|
| 69 |
+
]
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
# ---------------------------------------------------------------------------
|
| 73 |
+
# Reference-sequence features (reused from A2Denovo conventions)
|
| 74 |
+
# ---------------------------------------------------------------------------
|
| 75 |
+
def gc_content(seq):
|
| 76 |
+
if not seq:
|
| 77 |
+
return MISSING
|
| 78 |
+
seq = seq.upper()
|
| 79 |
+
n = sum(1 for b in seq if b in "ACGT")
|
| 80 |
+
if n == 0:
|
| 81 |
+
return MISSING
|
| 82 |
+
return sum(1 for b in seq if b in "GC") / n
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def shannon_entropy(seq):
|
| 86 |
+
if not seq:
|
| 87 |
+
return MISSING
|
| 88 |
+
seq = seq.upper()
|
| 89 |
+
counts = defaultdict(int)
|
| 90 |
+
for b in seq:
|
| 91 |
+
if b in "ACGT":
|
| 92 |
+
counts[b] += 1
|
| 93 |
+
total = sum(counts.values())
|
| 94 |
+
if total == 0:
|
| 95 |
+
return MISSING
|
| 96 |
+
h = 0.0
|
| 97 |
+
for c in counts.values():
|
| 98 |
+
p = c / total
|
| 99 |
+
h -= p * math.log2(p)
|
| 100 |
+
return h
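As a quick sanity check of the two sequence features above, the following sketch re-implements them compactly with `collections.Counter` and evaluates a few toy sequences. The names `gc_fraction` and `entropy_bits` are illustrative stand-ins, not functions of `feature_builder.py`, and the sentinel handling is assumed to match `MISSING`.

```python
import math
from collections import Counter

MISSING = -99999.0  # same out-of-range sentinel as the module (assumed)


def gc_fraction(seq):
    """GC fraction over A/C/G/T only; N and other symbols are ignored."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else MISSING


def entropy_bits(seq):
    """Shannon entropy in bits over A/C/G/T; ranges from 0 to 2."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return MISSING
    h = 0.0
    for c in counts.values():
        p = c / total
        h -= p * math.log2(p)
    return h


print(gc_fraction("GGCCNN"))     # N is excluded from the denominator
print(entropy_bits("AAAAAAAA"))  # homopolymer: lowest complexity
print(entropy_bits("ACGTACGT"))  # uniform base usage: highest complexity
print(gc_fraction("NNNN"))       # no callable bases: sentinel
```

Low entropy around a breakpoint typically indicates simple or repetitive sequence, which is why the SV features keep the minimum over the two breakpoints.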
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def fetch(fasta, chrom, start, end):
|
| 104 |
+
"""0-based half-open fetch with clamping; returns '' on failure."""
|
| 105 |
+
try:
|
| 106 |
+
start = max(0, start)
|
| 107 |
+
return fasta.fetch(chrom, start, end)
|
| 108 |
+
except Exception:
|
| 109 |
+
return ""
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def gc_entropy_at(fasta, chrom, pos1, win):
|
| 113 |
+
"""GC and entropy in pos +/- win (pos is 1-based)."""
|
| 114 |
+
seq = fetch(fasta, chrom, pos1 - 1 - win, pos1 + win)
|
| 115 |
+
return gc_content(seq), shannon_entropy(seq)
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
def microhomology(fasta, chrom, pos1, end1, max_k=50):
|
| 119 |
+
"""
|
| 120 |
+
Approximate microhomology between the two breakpoints of an intra-chromosomal
|
| 121 |
+
SV: longest k (<=max_k) where the sequence adjacent to bp1 matches bp2.
|
| 122 |
+
Returns MISSING for inter-chromosomal / undefined cases.
|
| 123 |
+
"""
|
| 124 |
+
if end1 is None or end1 <= pos1:
|
| 125 |
+
return MISSING
|
| 126 |
+
left = fetch(fasta, chrom, pos1 - max_k, pos1 + max_k).upper()
|
| 127 |
+
right = fetch(fasta, chrom, end1 - max_k, end1 + max_k).upper()
|
| 128 |
+
if len(left) < 2 * max_k or len(right) < 2 * max_k:
|
| 129 |
+
return MISSING
|
| 130 |
+
k = 0
|
| 131 |
+
while k < max_k and left[max_k + k] == right[max_k + k]: # rightward match
|
| 132 |
+
k += 1
|
| 133 |
+
j = 0
|
| 134 |
+
while j < max_k and left[max_k - 1 - j] == right[max_k - 1 - j]: # leftward
|
| 135 |
+
j += 1
|
| 136 |
+
return float(max(k, j))
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
# ---------------------------------------------------------------------------
|
| 140 |
+
# Tabix annotation (binary overlap + RepeatMasker element class)
|
| 141 |
+
# ---------------------------------------------------------------------------
|
| 142 |
+
class Annotator:
|
| 143 |
+
"""Binary overlap against tabixed BEDs, with chr-naming fallback."""
|
| 144 |
+
|
| 145 |
+
RMSK_ELEMENTS = { # label prefix in rmsk_class.bed (repClass/repFamily) -> flag
|
| 146 |
+
"SINE/Alu": "Alu",
|
| 147 |
+
"LINE/L1": "L1",
|
| 148 |
+
"Retroposon/SVA": "SVA",
|
| 149 |
+
"LTR": "LTR",
|
| 150 |
+
}
|
| 151 |
+
|
| 152 |
+
def __init__(self, giab_dir=None, repeatmasker=None):
|
| 153 |
+
self.tbx = {}
|
| 154 |
+
if giab_dir:
|
| 155 |
+
for name in ("segdups", "lowmap", "tandem", "difficult"):
|
| 156 |
+
p = os.path.join(giab_dir, f"{name}.bed.gz")
|
| 157 |
+
if os.path.exists(p):
|
| 158 |
+
self.tbx[name] = pysam.TabixFile(p)
|
| 159 |
+
else:
|
| 160 |
+
sys.stderr.write(f"[warn] missing GIAB bed: {p}\n")
|
| 161 |
+
self.rmsk = pysam.TabixFile(repeatmasker) if repeatmasker and os.path.exists(repeatmasker) else None
|
| 162 |
+
|
| 163 |
+
def _contigs(self, tbx, chrom):
|
| 164 |
+
if chrom in tbx.contigs:
|
| 165 |
+
return chrom
|
| 166 |
+
alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
|
| 167 |
+
return alt if alt in tbx.contigs else None
|
| 168 |
+
|
| 169 |
+
def overlaps(self, name, chrom, pos1):
|
| 170 |
+
"""1 if 1-based pos overlaps any interval in bed `name`, else 0."""
|
| 171 |
+
tbx = self.tbx.get(name)
|
| 172 |
+
if tbx is None:
|
| 173 |
+
return MISSING
|
| 174 |
+
c = self._contigs(tbx, chrom)
|
| 175 |
+
if c is None:
|
| 176 |
+
return 0
|
| 177 |
+
try:
|
| 178 |
+
for _ in tbx.fetch(c, pos1 - 1, pos1):
|
| 179 |
+
return 1
|
| 180 |
+
except Exception:
|
| 181 |
+
return 0
|
| 182 |
+
return 0
|
| 183 |
+
|
| 184 |
+
def frac_overlap(self, name, chrom, start1, end1):
|
| 185 |
+
"""Fraction of [start1,end1] (1-based inclusive) covered by bed `name`."""
|
| 186 |
+
tbx = self.tbx.get(name)
|
| 187 |
+
if tbx is None or end1 is None or end1 < start1:
|
| 188 |
+
return MISSING
|
| 189 |
+
c = self._contigs(tbx, chrom)
|
| 190 |
+
if c is None:
|
| 191 |
+
return 0.0
|
| 192 |
+
span = end1 - start1 + 1
|
| 193 |
+
covered = 0
|
| 194 |
+
try:
|
| 195 |
+
for row in tbx.fetch(c, start1 - 1, end1):
|
| 196 |
+
f = row.split("\t")
|
| 197 |
+
s, e = int(f[1]), int(f[2])
|
| 198 |
+
covered += max(0, min(end1, e) - max(start1 - 1, s))
|
| 199 |
+
except Exception:
|
| 200 |
+
return 0.0
|
| 201 |
+
return min(1.0, covered / span) if span > 0 else 0.0
|
| 202 |
+
|
| 203 |
+
def rmsk_elements(self, chrom, pos1):
|
| 204 |
+
"""Return dict {Alu,L1,SVA,LTR -> 0/1} for the position."""
|
| 205 |
+
flags = {"Alu": 0, "L1": 0, "SVA": 0, "LTR": 0}
|
| 206 |
+
if self.rmsk is None:
|
| 207 |
+
return {k: MISSING for k in flags}
|
| 208 |
+
c = self._contigs(self.rmsk, chrom)
|
| 209 |
+
if c is None:
|
| 210 |
+
return flags
|
| 211 |
+
try:
|
| 212 |
+
for row in self.rmsk.fetch(c, pos1 - 1, pos1):
|
| 213 |
+
label = row.split("\t")[3]
|
| 214 |
+
for prefix, flag in self.RMSK_ELEMENTS.items():
|
| 215 |
+
if label.startswith(prefix):
|
| 216 |
+
flags[flag] = 1
|
| 217 |
+
except Exception:
|
| 218 |
+
pass
|
| 219 |
+
return flags
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
def agg_either_both(a, b):
|
| 223 |
+
"""Order-invariant aggregation for the two breakpoints."""
|
| 224 |
+
if a == MISSING or b == MISSING:
|
| 225 |
+
v = a if b == MISSING else b
|
| 226 |
+
return v, v
|
| 227 |
+
return (1 if (a or b) else 0), (1 if (a and b) else 0)
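The aggregation above is order-invariant: swapping the two breakpoints does not change the resulting features. A small truth-table sketch, restating the function under the hypothetical name `either_both` so that the block runs on its own:

```python
MISSING = -99999.0  # sentinel, assumed identical to the module constant


def either_both(a, b):
    """Return (either, both) flags for two per-breakpoint 0/1 values."""
    if a == MISSING or b == MISSING:
        # Fall back to whichever side is known (or MISSING if neither is).
        v = a if b == MISSING else b
        return v, v
    return int(bool(a or b)), int(bool(a and b))


cases = [(0, 0), (1, 0), (0, 1), (1, 1), (1, MISSING), (MISSING, MISSING)]
for a, b in cases:
    # Symmetry check: the breakpoint order must not matter.
    assert either_both(a, b) == either_both(b, a)
    print(a, b, either_both(a, b))
```

Note that when only one breakpoint has an annotation, both outputs take the value of the known side, so `both` can be 1 even though only one breakpoint was actually checked.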
|
| 228 |
+
|
| 229 |
+
|
| 230 |
+
# ---------------------------------------------------------------------------
|
| 231 |
+
# Small helpers for VCF field access
|
| 232 |
+
# ---------------------------------------------------------------------------
|
| 233 |
+
def info(rec, key, default=None):
|
| 234 |
+
try:
|
| 235 |
+
return rec.info[key]
|
| 236 |
+
except Exception:
|
| 237 |
+
return default
|
| 238 |
+
|
| 239 |
+
|
| 240 |
+
def fmt(rec, key, default=None):
|
| 241 |
+
try:
|
| 242 |
+
return rec.samples[0][key]
|
| 243 |
+
except Exception:
|
| 244 |
+
return default
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
def is_pass(rec):
|
| 248 |
+
fk = list(rec.filter.keys())
|
| 249 |
+
return 1 if (not fk or fk == ["PASS"] or fk == ["."]) else 0
|
| 250 |
+
|
| 251 |
+
|
| 252 |
+
def gt_is_hom_alt(rec):
|
| 253 |
+
gt = fmt(rec, "GT")
|
| 254 |
+
if not gt or any(a is None for a in gt):
|
| 255 |
+
return MISSING
|
| 256 |
+
alleles = list(gt)
|
| 257 |
+
return 1 if all(a == alleles[0] and a > 0 for a in alleles) else 0
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
def first(x, default=MISSING):
|
| 261 |
+
"""Coerce a possibly-tuple INFO/FORMAT value to a scalar number."""
|
| 262 |
+
if x is None:
|
| 263 |
+
return default
|
| 264 |
+
if isinstance(x, (tuple, list)):
|
| 265 |
+
x = x[0] if x else default
|
| 266 |
+
try:
|
| 267 |
+
return float(x)
|
| 268 |
+
except Exception:
|
| 269 |
+
return default
|
| 270 |
+
|
| 271 |
+
|
| 272 |
+
def width(ci):
|
| 273 |
+
if not ci or not isinstance(ci, (tuple, list)) or len(ci) < 2:
|
| 274 |
+
return MISSING
|
| 275 |
+
try:
|
| 276 |
+
return abs(float(ci[1]) - float(ci[0]))
|
| 277 |
+
except Exception:
|
| 278 |
+
return MISSING
|
| 279 |
+
|
| 280 |
+
|
| 281 |
+
def norm_svtype(rec):
|
| 282 |
+
st = info(rec, "SVTYPE")
|
| 283 |
+
if st is None:
|
| 284 |
+
alt = str(rec.alts[0]) if rec.alts else ""
|
| 285 |
+
st = alt.strip("<>").split(":")[0] if alt.startswith("<") else "BND"
|
| 286 |
+
st = str(st).upper().split(":")[0]
|
| 287 |
+
if st in ("TRA", "CTX"):
|
| 288 |
+
st = "BND"
|
| 289 |
+
if st not in ("DEL", "DUP", "INS", "INV", "BND"):
|
| 290 |
+
st = "BND"
|
| 291 |
+
return st
|
| 292 |
+
|
| 293 |
+
|
| 294 |
+
# ---------------------------------------------------------------------------
|
| 295 |
+
# Per-caller SV parsers -> normalized concept dict
|
| 296 |
+
# ---------------------------------------------------------------------------
|
| 297 |
+
def parse_sv_common(rec):
|
| 298 |
+
st = norm_svtype(rec)
|
| 299 |
+
chrom = rec.chrom
|
| 300 |
+
pos = rec.pos
|
| 301 |
+
# pysam consumes INFO/END into rec.stop; meaningful only for spanned SVs.
|
| 302 |
+
# BND/INS are annotated at their primary breakend only (bp2 = bp1 via end=None).
|
| 303 |
+
end = rec.stop if st in ("DEL", "DUP", "INV") else None
|
| 304 |
+
return {
|
| 305 |
+
"chrom": chrom, "pos": pos, "end": end, "chrom2": chrom,
|
| 306 |
+
"svtype": st,
|
| 307 |
+
"is_pass": is_pass(rec),
|
| 308 |
+
"cipos_width": width(info(rec, "CIPOS") or info(rec, "CIPOS95")),
|
| 309 |
+
"ciend_width": width(info(rec, "CIEND") or info(rec, "CIEND95")),
|
| 310 |
+
"is_imprecise": 1 if ("IMPRECISE" in rec.info) else 0,
|
| 311 |
+
"gt_hom": gt_is_hom_alt(rec),
|
| 312 |
+
"svlen_raw": info(rec, "SVLEN"),
|
| 313 |
+
}
|
| 314 |
+
|
| 315 |
+
|
| 316 |
+
def parse_manta(rec):
|
| 317 |
+
d = parse_sv_common(rec)
|
| 318 |
+
pr = fmt(rec, "PR") or (None, None)
|
| 319 |
+
sr = fmt(rec, "SR") or (None, None)
|
| 320 |
+
pr_ref, pr_alt = (first(pr[0], 0), first(pr[1], 0)) if len(pr) == 2 else (0, 0)
|
| 321 |
+
sr_ref, sr_alt = (first(sr[0], 0), first(sr[1], 0)) if len(sr) == 2 else (0, 0)
|
| 322 |
+
tot = pr_ref + pr_alt + sr_ref + sr_alt
|
| 323 |
+
d.update({
|
| 324 |
+
"pe_support": pr_alt, "sr_support": sr_alt, "total_support": pr_alt + sr_alt,
|
| 325 |
+
"vaf": (pr_alt + sr_alt) / tot if tot > 0 else MISSING,
|
| 326 |
+
"gq": first(fmt(rec, "GQ")), "qual_norm": first(rec.qual),
|
| 327 |
+
"local_depth": (pr_ref + pr_alt) or first(info(rec, "BND_DEPTH")),
|
| 328 |
+
})
|
| 329 |
+
return d
|
| 330 |
+
|
| 331 |
+
|
| 332 |
+
def parse_delly(rec):
|
| 333 |
+
d = parse_sv_common(rec)
|
| 334 |
+
dr, dv = first(fmt(rec, "DR"), 0), first(fmt(rec, "DV"), 0)
|
| 335 |
+
rr, rv = first(fmt(rec, "RR"), 0), first(fmt(rec, "RV"), 0)
|
| 336 |
+
tot = dr + dv + rr + rv
|
| 337 |
+
if d["svlen_raw"] is None and d["end"] is not None: # v0.7 has no SVLEN
|
| 338 |
+
d["svlen_raw"] = d["end"] - d["pos"]
|
| 339 |
+
d.update({
|
| 340 |
+
"pe_support": dv, "sr_support": rv, "total_support": dv + rv,
|
| 341 |
+
"vaf": (dv + rv) / tot if tot > 0 else MISSING,
|
| 342 |
+
"gq": first(fmt(rec, "GQ")), "qual_norm": first(rec.qual),
|
| 343 |
+
"local_depth": dr + dv,
|
| 344 |
+
})
|
| 345 |
+
return d
|
| 346 |
+
|
| 347 |
+
|
| 348 |
+
def parse_lumpy(rec):
|
| 349 |
+
d = parse_sv_common(rec)
|
| 350 |
+
ao, ro = first(fmt(rec, "AO"), 0), first(fmt(rec, "RO"), 0)
|
| 351 |
+
ab = fmt(rec, "AB")
|
| 352 |
+
# smoove/LUMPY put SU/PE/SR in INFO (site-level), not FORMAT; fall back to FORMAT for other dialects
|
| 353 |
+
pe = info(rec, "PE"); pe = first(pe) if pe is not None else first(fmt(rec, "PE"), 0)
|
| 354 |
+
sr = info(rec, "SR"); sr = first(sr) if sr is not None else first(fmt(rec, "SR"), 0)
|
| 355 |
+
su = info(rec, "SU"); su = first(su) if su is not None else first(fmt(rec, "SU"), 0)
|
| 356 |
+
d.update({
|
| 357 |
+
"pe_support": pe, "sr_support": sr, "total_support": su,
|
| 358 |
+
"vaf": first(ab) if ab is not None else ((ao / (ao + ro)) if (ao + ro) > 0 else MISSING),
|
| 359 |
+
"gq": first(fmt(rec, "GQ")), "qual_norm": first(fmt(rec, "SQ")),
|
| 360 |
+
"local_depth": first(fmt(rec, "DP")),
|
| 361 |
+
})
|
| 362 |
+
return d
|
| 363 |
+
|
| 364 |
+
|
| 365 |
+
SV_PARSERS = {"manta": parse_manta, "delly": parse_delly, "lumpy": parse_lumpy}
|
| 366 |
+
|
| 367 |
+
|
| 368 |
+
# ---------------------------------------------------------------------------
|
| 369 |
+
# Per-caller STR parsers
|
| 370 |
+
# ---------------------------------------------------------------------------
|
| 371 |
+
def _split_pair(val, sep):
|
| 372 |
+
if val is None:
|
| 373 |
+
return []
|
| 374 |
+
if isinstance(val, (tuple, list)): # pysam returns Number=2 fields (e.g. GangSTR REPCN) as tuples
|
| 375 |
+
out = []
|
| 376 |
+
for x in val:
|
| 377 |
+
try:
|
| 378 |
+
out.append(float(x))
|
| 379 |
+
except Exception:
|
| 380 |
+
pass
|
| 381 |
+
return out
|
| 382 |
+
s = str(val)
|
| 383 |
+
for d in sep:
|
| 384 |
+
s = s.replace(d, "|")
|
| 385 |
+
out = []
|
| 386 |
+
for tok in s.split("|"):
|
| 387 |
+
try:
|
| 388 |
+
out.append(float(tok))
|
| 389 |
+
except Exception:
|
| 390 |
+
pass
|
| 391 |
+
return out
|
| 392 |
+
|
| 393 |
+
|
| 394 |
+
def parse_eh(rec):
|
| 395 |
+
ru = info(rec, "RU") or ""
|
| 396 |
+
repcn = _split_pair(fmt(rec, "REPCN"), "/")
|
| 397 |
+
ref_cn = first(info(rec, "REF"))
|
| 398 |
+
adsp = sum(_split_pair(fmt(rec, "ADSP"), "/"))
|
| 399 |
+
adfl = sum(_split_pair(fmt(rec, "ADFL"), "/"))
|
| 400 |
+
adir = sum(_split_pair(fmt(rec, "ADIR"), "/"))
|
| 401 |
+
return {
|
| 402 |
+
"chrom": rec.chrom, "pos": rec.pos, "end": rec.stop,
|
| 403 |
+
"is_pass": is_pass(rec), "motif_len": float(len(ru)) if ru else first(info(rec, "RL")),
|
| 404 |
+
"ref_copynum": ref_cn,
|
| 405 |
+
"repcn": repcn, "repci_raw": fmt(rec, "REPCI"),
|
| 406 |
+
"spanning_reads": adsp, "flanking_reads": adfl, "inrepeat_reads": adir,
|
| 407 |
+
"locus_depth": first(fmt(rec, "LC")), "gt_hom": gt_is_hom_alt(rec),
|
| 408 |
+
"qual_post": first(rec.qual), "ref_tract_bp": first(info(rec, "RL")),
|
| 409 |
+
"ru": ru,
|
| 410 |
+
}
|
| 411 |
+
|
| 412 |
+
|
| 413 |
+
def parse_gangstr(rec):
|
| 414 |
+
ru = info(rec, "RU") or ""
|
| 415 |
+
period = first(info(rec, "PERIOD"))
|
| 416 |
+
repcn = _split_pair(fmt(rec, "REPCN"), ",")
|
| 417 |
+
ref_cn = first(info(rec, "REF"))
|
| 418 |
+
rc = _split_pair(fmt(rec, "RC"), ",") # enclosing,spanning,FRR,bounding
|
| 419 |
+
enclosing, spanning, frr, bounding = (rc + [0, 0, 0, 0])[:4]
|
| 420 |
+
return {
|
| 421 |
+
"chrom": rec.chrom, "pos": rec.pos, "end": rec.stop,
|
| 422 |
+
"is_pass": is_pass(rec), "motif_len": period if period != MISSING else float(len(ru)),
|
| 423 |
+
"ref_copynum": ref_cn,
|
| 424 |
+
"repcn": repcn, "repci_raw": fmt(rec, "REPCI"),
|
| 425 |
+
"spanning_reads": enclosing + spanning, "flanking_reads": bounding, "inrepeat_reads": frr,
|
| 426 |
+
"locus_depth": first(fmt(rec, "DP")), "gt_hom": gt_is_hom_alt(rec),
|
| 427 |
+
"qual_post": first(fmt(rec, "Q")),
|
| 428 |
+
"ref_tract_bp": (ref_cn * period) if (ref_cn != MISSING and period != MISSING) else MISSING,
|
| 429 |
+
"ru": ru,
|
| 430 |
+
}
|
| 431 |
+
|
| 432 |
+
|
| 433 |
+
def _num(x, default=MISSING):
|
| 434 |
+
try:
|
| 435 |
+
if x is None or x == "":
|
| 436 |
+
return default
|
| 437 |
+
v = float(x)
|
| 438 |
+
return default if v != v else v # NaN guard
|
| 439 |
+
except Exception:
|
| 440 |
+
return default
|
| 441 |
+
|
| 442 |
+
|
| 443 |
+
def parse_eh_tsv(row):
|
| 444 |
+
"""One row of an ExpansionHunter flat TSV:
|
| 445 |
+
chrom,pos,end,filter,repid,ru,rl,ref,repcn,repci,adsp,adfl,adir,lc,so"""
|
| 446 |
+
ru = str(row.get("ru") or "")
|
| 447 |
+
repcn = _split_pair(row.get("repcn"), "/")
|
| 448 |
+
ref_cn = _num(row.get("ref"))
|
| 449 |
+
rl = _num(row.get("rl"))
|
| 450 |
+
adsp = sum(_split_pair(row.get("adsp"), "/"))
|
| 451 |
+
adfl = sum(_split_pair(row.get("adfl"), "/"))
|
| 452 |
+
adir = sum(_split_pair(row.get("adir"), "/"))
|
| 453 |
+
gt_hom = MISSING
|
| 454 |
+
if len(repcn) >= 2: # hom-ALT = both alleles equal and differ from reference
|
| 455 |
+
gt_hom = 1 if (repcn[0] == repcn[1] and repcn[0] != ref_cn) else 0
|
| 456 |
+
return {
|
| 457 |
+
"chrom": str(row["chrom"]), "pos": int(float(row["pos"])), "end": _num(row.get("end")),
|
| 458 |
+
"is_pass": 1 if str(row.get("filter", "")).upper() == "PASS" else 0,
|
| 459 |
+
"motif_len": float(len(ru)) if ru else rl,
|
| 460 |
+
"ref_copynum": ref_cn,
|
| 461 |
+
"repcn": repcn, "repci_raw": row.get("repci"),
|
| 462 |
+
"spanning_reads": adsp, "flanking_reads": adfl, "inrepeat_reads": adir,
|
| 463 |
+
"locus_depth": _num(row.get("lc")), "gt_hom": gt_hom,
|
| 464 |
+
"qual_post": MISSING, # EH TSV carries no site quality
|
| 465 |
+
"ref_tract_bp": rl, "ru": ru,
|
| 466 |
+
}
|
| 467 |
+
|
| 468 |
+
|
| 469 |
+
STR_PARSERS = {"expansionhunter": parse_eh, "gangstr": parse_gangstr}
|
| 470 |
+
|
| 471 |
+
|
| 472 |
+
def repci_width_max(repci_raw):
|
| 473 |
+
"""Max allele CI width. EH: '2-2/10-10' (str); GangSTR: ('1-2','2-2') (pysam tuple)."""
|
| 474 |
+
if repci_raw is None:
|
| 475 |
+
return MISSING
|
| 476 |
+
if isinstance(repci_raw, (tuple, list)):
|
| 477 |
+
alleles = [str(x) for x in repci_raw]
|
| 478 |
+
else:
|
| 479 |
+
alleles = str(repci_raw).replace("/", ",").split(",")
|
| 480 |
+
best = MISSING
|
| 481 |
+
for allele in alleles:
|
| 482 |
+
if "-" in allele:
|
| 483 |
+
try:
|
| 484 |
+
parts = allele.split("-")
|
| 485 |
+
w = abs(float(parts[1]) - float(parts[0]))
|
| 486 |
+
best = w if best == MISSING else max(best, w)
|
| 487 |
+
except Exception:
|
| 488 |
+
pass
|
| 489 |
+
return best
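The two confidence-interval dialects named in the docstring above can be exercised with a standalone sketch. `ci_width_max` is a hypothetical re-statement of the same logic using `str.partition`; the input strings are made-up examples in the formats the docstring describes, not real caller output.

```python
MISSING = -99999.0  # sentinel, assumed identical to the module constant


def ci_width_max(repci_raw):
    """Widest per-allele confidence interval, or MISSING if none can be parsed."""
    if repci_raw is None:
        return MISSING
    if isinstance(repci_raw, (tuple, list)):
        alleles = [str(x) for x in repci_raw]      # tuple form, e.g. from pysam
    else:
        alleles = str(repci_raw).replace("/", ",").split(",")
    widths = []
    for allele in alleles:
        lo, sep, hi = allele.partition("-")
        if not sep:
            continue                                # no interval, e.g. "."
        try:
            widths.append(abs(float(hi) - float(lo)))
        except ValueError:
            continue                                # unparsable bound
    return max(widths) if widths else MISSING


print(ci_width_max("2-2/10-13"))       # slash-separated string form
print(ci_width_max(("1-2", "2-2")))    # tuple form
print(ci_width_max("."))               # no usable interval
```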
|
| 490 |
+
|
| 491 |
+
|
| 492 |
+
# ---------------------------------------------------------------------------
|
| 493 |
+
# Feature assembly
|
| 494 |
+
# ---------------------------------------------------------------------------
|
| 495 |
+
def sv_features(d, ann, fasta, win):
|
| 496 |
+
chrom, pos, end = d["chrom"], d["pos"], d["end"]
|
| 497 |
+
chrom2, end2 = d["chrom2"], (end if end is not None else pos)
|
| 498 |
+
st = d["svtype"]
|
| 499 |
+
svlen = first(d["svlen_raw"])
|
| 500 |
+
f = {
|
| 501 |
+
"variant_ID": f"{chrom}:{pos}:{st}:{end}",
|
| 502 |
+
"is_pass": d["is_pass"],
|
| 503 |
+
"svtype_DEL": int(st == "DEL"), "svtype_DUP": int(st == "DUP"),
|
| 504 |
+
"svtype_INS": int(st == "INS"), "svtype_INV": int(st == "INV"),
|
| 505 |
+
"svtype_BND": int(st == "BND"),
|
| 506 |
+
"svlen_log": math.log10(abs(svlen) + 1) if svlen != MISSING else MISSING,
|
| 507 |
+
"cipos_width": d["cipos_width"], "ciend_width": d["ciend_width"],
|
| 508 |
+
"is_imprecise": d["is_imprecise"],
|
| 509 |
+
"pe_support": d["pe_support"], "sr_support": d["sr_support"],
|
| 510 |
+
"total_support": d["total_support"], "vaf": d["vaf"],
|
| 511 |
+
"gt_hom": d["gt_hom"], "gq": d["gq"], "qual_norm": d["qual_norm"],
|
| 512 |
+
"local_depth": d["local_depth"],
|
| 513 |
+
}
|
| 514 |
+
# reference sequence context at both breakpoints
|
| 515 |
+
gc1, e1 = gc_entropy_at(fasta, chrom, pos, win)
|
| 516 |
+
gc2, e2 = gc_entropy_at(fasta, chrom2, end2, win)
|
| 517 |
+
f["gc_min"], f["gc_max"] = (min(gc1, gc2), max(gc1, gc2)) if MISSING not in (gc1, gc2) else (MISSING, MISSING)
|
| 518 |
+
f["entropy_min"] = min(e1, e2) if MISSING not in (e1, e2) else MISSING
|
| 519 |
+
f["microhom_max"] = microhomology(fasta, chrom, pos, end if chrom2 == chrom else None)
|
| 520 |
+
# GIAB binary overlap, both breakpoints
|
| 521 |
+
for name, key in (("segdups", "segdup"), ("difficult", "difficult")):
|
| 522 |
+
ei, bo = agg_either_both(ann.overlaps(name, chrom, pos), ann.overlaps(name, chrom2, end2))
|
| 523 |
+
f[f"in_{key}_either"], f[f"in_{key}_both"] = ei, bo
|
| 524 |
+
for name, key in (("lowmap", "lowmap"), ("tandem", "tandem")):
|
| 525 |
+
ei, _ = agg_either_both(ann.overlaps(name, chrom, pos), ann.overlaps(name, chrom2, end2))
|
| 526 |
+
f[f"in_{key}_either"] = ei
|
| 527 |
+
# RepeatMasker element class, either breakpoint
|
| 528 |
+
r1 = ann.rmsk_elements(chrom, pos)
|
| 529 |
+
r2 = ann.rmsk_elements(chrom2, end2)
|
| 530 |
+
for elt in ("Alu", "L1", "SVA", "LTR"):
|
| 531 |
+
ei, _ = agg_either_both(r1[elt], r2[elt])
|
| 532 |
+
f[f"in_{elt}_either"] = ei
|
| 533 |
+
# fraction of the SV interval covered by repeats (intra-chrom interval SVs only)
|
| 534 |
+
if st in ("DEL", "DUP", "INV") and end is not None and chrom2 == chrom:
|
| 535 |
+
f["frac_span_repeat"] = max(ann.frac_overlap("tandem", chrom, pos, end),
|
| 536 |
+
ann.frac_overlap("segdups", chrom, pos, end))
|
| 537 |
+
else:
|
| 538 |
+
f["frac_span_repeat"] = MISSING
|
| 539 |
+
# neighbor density (SV only) — precomputed onto d by compute_clustering()
|
| 540 |
+
f["n_neighbors"] = d.get("n_neighbors", 0)
|
| 541 |
+
f["nn_log_dist"] = d.get("nn_log_dist", MISSING)
|
| 542 |
+
return f
|
| 543 |
+
|
| 544 |
+
|
| 545 |
+
def str_features(d, ann, fasta, win, read_len):
|
| 546 |
+
chrom, pos = d["chrom"], d["pos"]
|
| 547 |
+
repcn = d["repcn"] or []
|
| 548 |
+
cn_max = max(repcn) if repcn else MISSING
|
| 549 |
+
cn_min = min(repcn) if repcn else MISSING
|
| 550 |
+
ref_cn = d["ref_copynum"]
|
| 551 |
+
motif = d["motif_len"]
|
| 552 |
+
f = {
|
| 553 |
+
"variant_ID": f"{chrom}:{pos}:{info_end(d)}",
|
| 554 |
+
"is_pass": d["is_pass"], "motif_len": motif, "ref_copynum": ref_cn,
|
| 555 |
+
"gt_repcn_max": cn_max, "gt_repcn_min": cn_min,
|
| 556 |
+
"expansion_over_ref": (cn_max - ref_cn) if MISSING not in (cn_max, ref_cn) else MISSING,
|
| 557 |
+
"repci_width_max": repci_width_max(d["repci_raw"]),
|
| 558 |
+
"spanning_reads": d["spanning_reads"], "flanking_reads": d["flanking_reads"],
|
| 559 |
+
"inrepeat_reads": d["inrepeat_reads"],
|
| 560 |
+
"locus_depth": d["locus_depth"], "gt_hom": d["gt_hom"],
|
| 561 |
+
# qual_post dropped: EH never emits it -> structurally-missing -> caller-identity proxy
|
| 562 |
+
"ref_tract_bp": d["ref_tract_bp"],
|
| 563 |
+
}
|
| 564 |
+
tot = d["spanning_reads"] + d["flanking_reads"] + d["inrepeat_reads"]
|
| 565 |
+
f["spanning_frac"] = d["spanning_reads"] / tot if tot > 0 else MISSING
|
| 566 |
+
f["allele_vs_readlen"] = (cn_max * motif / read_len) if MISSING not in (cn_max, motif) else MISSING
|
| 567 |
+
f["motif_is_homopolymer"] = int(motif == 1) if motif != MISSING else MISSING
|
| 568 |
+
gc, ent = gc_entropy_at(fasta, chrom, pos, win)
|
| 569 |
+
f["gc_flank"], f["entropy_flank"] = gc, ent
|
| 570 |
+
f["in_segdup"] = ann.overlaps("segdups", chrom, pos)
|
| 571 |
+
f["in_difficult"] = ann.overlaps("difficult", chrom, pos)
|
| 572 |
+
f["flank_lowmap"] = ann.overlaps("lowmap", chrom, pos)
|
| 573 |
+
return f
|
| 574 |
+
|
| 575 |
+
|
| 576 |
+
def info_end(d):
|
| 577 |
+
return int(d["end"]) if d.get("end") is not None else d["pos"]
|
| 578 |
+
|
| 579 |
+
|
| 580 |
+
# ---------------------------------------------------------------------------
|
| 581 |
+
# Clustering (SV) — within-callset neighbor density
|
| 582 |
+
# ---------------------------------------------------------------------------
|
| 583 |
+
def compute_clustering(parsed, radius):
|
| 584 |
+
"""Set on each parsed SV dict:
|
| 585 |
+
nn_log_dist = log10(distance to nearest other call + 1), UNCAPPED (isolation).
|
| 586 |
+
n_neighbors = number of other calls within +/-radius.
|
| 587 |
+
SV calls are sparse (median nearest neighbor ~5-90 kb), so radius must be SV-scale
|
| 588 |
+
(default 100 kb), not the 1 kb used for dense small variants. Computed per chrom on sorted positions."""
|
| 589 |
+
by_chrom = defaultdict(list)
|
| 590 |
+
for j, d in enumerate(parsed):
|
| 591 |
+
by_chrom[d["chrom"]].append((d["pos"], j))
|
| 592 |
+
for items in by_chrom.values():
|
| 593 |
+
items.sort()
|
| 594 |
+
pos = np.array([p for p, _ in items])
|
| 595 |
+
n = len(pos)
|
| 596 |
+
for k, (_, j) in enumerate(items):
|
| 597 |
+
if n < 2:
|
| 598 |
+
parsed[j]["nn_log_dist"], parsed[j]["n_neighbors"] = MISSING, 0
|
| 599 |
+
continue
|
| 600 |
+
p = pos[k]
|
| 601 |
+
nearest = min((p - pos[k - 1]) if k > 0 else float("inf"),
|
| 602 |
+
(pos[k + 1] - p) if k < n - 1 else float("inf"))
|
| 603 |
+
parsed[j]["nn_log_dist"] = math.log10(nearest + 1)
|
| 604 |
+
lo = int(np.searchsorted(pos, p - radius, "left"))
|
| 605 |
+
hi = int(np.searchsorted(pos, p + radius, "right"))
|
| 606 |
+
parsed[j]["n_neighbors"] = hi - lo - 1 # exclude self
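The neighbor-density logic above can be checked on a toy callset without NumPy. The sketch below uses the standard-library `bisect` module in place of `np.searchsorted`; the function name `neighbor_stats` and the example positions are assumptions made for illustration.

```python
import math
from bisect import bisect_left, bisect_right

MISSING = -99999.0  # sentinel, assumed identical to the module constant


def neighbor_stats(positions, radius):
    """For sorted positions on one contig, return [(nn_log_dist, n_neighbors), ...]."""
    n = len(positions)
    out = []
    for k, p in enumerate(positions):
        if n < 2:
            out.append((MISSING, 0))  # a lone call has no neighbor to measure against
            continue
        left = p - positions[k - 1] if k > 0 else math.inf
        right = positions[k + 1] - p if k < n - 1 else math.inf
        nearest = min(left, right)
        lo = bisect_left(positions, p - radius)
        hi = bisect_right(positions, p + radius)
        out.append((math.log10(nearest + 1), hi - lo - 1))  # -1 excludes the call itself
    return out


calls = [1_000, 1_500, 250_000]  # two clustered calls and one isolated call
for pos, (nn, cnt) in zip(calls, neighbor_stats(calls, radius=100_000)):
    print(pos, round(nn, 3), cnt)
```

With a 100 kb radius the two clustered calls each see one neighbor, while the isolated call sees none and gets a large `nn_log_dist`.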
|
| 607 |
+
|
| 608 |
+
|
| 609 |
+
# ---------------------------------------------------------------------------
|
| 610 |
+
# Main
|
| 611 |
+
# ---------------------------------------------------------------------------
|
| 612 |
+
def main():
|
| 613 |
+
ap = argparse.ArgumentParser(description="SVSTR_Score feature builder")
|
| 614 |
+
ap.add_argument("--vcf", required=True)
|
| 615 |
+
ap.add_argument("--caller", required=True,
|
| 616 |
+
choices=sorted(SV_CALLERS | STR_CALLERS))
|
| 617 |
+
ap.add_argument("--fasta", required=True)
|
| 618 |
+
ap.add_argument("--giab-dir", default=None, help="dir with segdups/lowmap/tandem/difficult .bed.gz (tabixed)")
|
| 619 |
+
ap.add_argument("--repeatmasker", default=None, help="tabixed rmsk_class.bed.gz")
|
| 620 |
+
ap.add_argument("--win", type=int, default=50, help="GC/entropy window (+/- bp)")
|
| 621 |
+
ap.add_argument("--neighbor-radius", type=int, default=100000,
|
| 622 |
+
help="SV clustering radius for n_neighbors (+/- bp). Default 100kb — SV calls are "
|
| 623 |
+
"sparse (median nearest ~5-90kb); 1kb is for dense small variants.")
|
| 624 |
+
ap.add_argument("--read-len", type=int, default=150, help="short-read length (STR spanning feasibility)")
|
| 625 |
+
ap.add_argument("--primary-only", dest="primary_only", action="store_true", default=True,
|
| 626 |
+
help="keep only primary-assembly contigs chr1-22,X,Y,M (default on)")
|
| 627 |
+
ap.add_argument("--all-contigs", dest="primary_only", action="store_false",
|
| 628 |
+
help="include ALT/decoy/HLA contigs (off by default)")
|
| 629 |
+
ap.add_argument("--str-drop-homref", action="store_true",
|
| 630 |
+
help="(STR) drop hom-ref 0/0 genotype loci (catalog non-variants)")
|
| 631 |
+
ap.add_argument("--sample", default=None,
|
| 632 |
+
help="sample id (default: auto from VCF's single sample, or EH-TSV filename prefix). "
|
| 633 |
+
"Emitted as a `sample` column — the label join key with the truth set.")
|
| 634 |
+
ap.add_argument("--missing-indicators", action="store_true",
|
| 635 |
+
help="also emit <feat>_missing 0/1 columns. OFF by default: redundant for tree "
|
| 636 |
+
"models (the -99999 sentinel is already split-separable). Turn on for linear/NN models.")
|
| 637 |
+
ap.add_argument("-o", "--output", required=True)
|
| 638 |
+
args = ap.parse_args()
|
| 639 |
+
|
| 640 |
+
variant_class = "SV" if args.caller in SV_CALLERS else "STR"
|
| 641 |
+
fasta = pysam.FastaFile(args.fasta)
|
| 642 |
+
ann = Annotator(args.giab_dir, args.repeatmasker)
|
| 643 |
+
|
| 644 |
+
eh_tsv = (args.caller == "expansionhunter") # EH ships a flat (gzipped) TSV, not a VCF
|
| 645 |
+
if eh_tsv:
|
| 646 |
+
with open(args.vcf, "rb") as fh:
|
| 647 |
+
comp = "gzip" if fh.read(2) == b"\x1f\x8b" else None
|
| 648 |
+
records = pd.read_csv(args.vcf, sep="\t", dtype=str, compression=comp, keep_default_na=False).to_dict("records")  # keep empty cells as "" (NaN would stringify to "nan")
|
| 649 |
+
sample = args.sample or os.path.basename(args.vcf).split(".")[0]
|
| 650 |
+
get_chrom = lambda r: str(r["chrom"])
|
| 651 |
+
def is_homref(r):
|
| 652 |
+
cn, ref = _split_pair(r.get("repcn"), "/"), _num(r.get("ref"))
|
| 653 |
+
return bool(cn) and all(x == ref for x in cn)
|
| 654 |
+
else:
|
| 655 |
+
vf = pysam.VariantFile(args.vcf)
|
| 656 |
+
hdr = list(vf.header.samples)
|
| 657 |
+
sample = args.sample or (hdr[0] if len(hdr) == 1 else None)
|
| 658 |
+
if sample is None:
|
| 659 |
+
sys.exit(f"[error] --sample required: VCF has {len(hdr)} samples {hdr}")
|
| 660 |
+
records = list(vf)
|
| 661 |
+
get_chrom = lambda r: r.chrom
|
| 662 |
+
is_homref = lambda r: not (set(fmt(r, "GT") or ()) - {0})
|
| 663 |
+
sys.stderr.write(f"[info] sample={sample}\n")
|
| 664 |
+
|
| 665 |
+
n_raw = len(records)
|
| 666 |
+
if args.primary_only:
|
| 667 |
+
records = [r for r in records if get_chrom(r) in PRIMARY_CONTIGS]
|
| 668 |
+
sys.stderr.write(f"[info] primary-only: dropped {n_raw - len(records):,} non-primary-contig records\n")
|
| 669 |
+
if variant_class == "STR" and args.str_drop_homref:
|
| 670 |
+
before = len(records)
|
| 671 |
+
records = [r for r in records if not is_homref(r)]
|
| 672 |
+
sys.stderr.write(f"[info] str-drop-homref: dropped {before - len(records):,} hom-ref loci\n")
|
| 673 |
+
sys.stderr.write(f"[info] {len(records):,} records to process | caller={args.caller} class={variant_class}\n")
|
| 674 |
+
|
| 675 |
+
rows = []
|
| 676 |
+
if variant_class == "SV":
|
| 677 |
+
parser = SV_PARSERS[args.caller]
|
| 678 |
+
parsed = [parser(r) for r in records]
|
| 679 |
+
compute_clustering(parsed, args.neighbor_radius)
|
| 680 |
+
for d in parsed:
|
| 681 |
+
f = sv_features(d, ann, fasta, args.win)
|
| 682 |
+
f["caller"] = args.caller
|
| 683 |
+
rows.append(f)
|
| 684 |
+
else:
|
| 685 |
+
parser = parse_eh_tsv if eh_tsv else STR_PARSERS[args.caller]
|
| 686 |
+
for r in records:
|
| 687 |
+
d = parser(r)
|
| 688 |
+
f = str_features(d, ann, fasta, args.win, args.read_len)
|
| 689 |
+
f["caller"] = args.caller
|
| 690 |
+
rows.append(f)
|
| 691 |
+
|
| 692 |
+
out = pd.DataFrame(rows)
|
| 693 |
+
out["sample"] = sample
|
| 694 |
+
# Missingness is carried by the -99999 sentinel in each feature (trees split on it
|
| 695 |
+
# directly). Optional explicit indicators (fixed list -> stable schema) for linear/NN.
|
| 696 |
+
if args.missing_indicators:
|
| 697 |
+
indicators = SV_MISSING_INDICATORS if variant_class == "SV" else STR_MISSING_INDICATORS
|
| 698 |
+
for col in indicators:
|
| 699 |
+
out[f"{col}_missing"] = (out[col] == MISSING).astype(int) if col in out.columns else 0
|
| 700 |
+
# meta (label join key) first: sample, caller, variant_ID — NOT model features
|
| 701 |
+
meta = [c for c in ("sample", "caller", "variant_ID") if c in out.columns]
|
| 702 |
+
out = out[meta + [c for c in out.columns if c not in meta]]
|
| 703 |
+
out.to_csv(args.output, sep="\t", index=False)
|
| 704 |
+
sys.stderr.write(f"[info] wrote {len(out):,} rows x {out.shape[1]} cols -> {args.output}\n")
|
| 705 |
+
|
| 706 |
+
|
| 707 |
+
if __name__ == "__main__":
|
| 708 |
+
main()
|
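The `--missing-indicators` branch above derives `<feat>_missing` flags from the -99999 sentinel. The logic can be sketched without pandas as follows. This is a minimal illustration, not the PR's implementation: `add_missing_indicators` and the demo rows are hypothetical, and rows are plain dicts rather than a DataFrame.

```python
MISSING = -99999.0  # sentinel value used by feature_builder.py in the diff above


def add_missing_indicators(rows, cols):
    """Return copies of `rows` with a 0/1 `<col>_missing` flag per listed column."""
    out = []
    for row in rows:
        new = dict(row)
        for col in cols:
            # A column absent from the row gets 0, mirroring the `else 0` branch above.
            new[f"{col}_missing"] = int(row[col] == MISSING) if col in row else 0
        out.append(new)
    return out


# Hypothetical two-call table; the second call lacks a usable `gq`.
demo = [{"gq": 30.0, "vaf": 0.5}, {"gq": MISSING, "vaf": 0.4}]
flagged = add_missing_indicators(demo, ["gq", "vaf", "local_depth"])
```

Because the column list is fixed, every output row carries the same indicator columns even when a feature is absent, which is what keeps the schema stable for linear or neural models.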
feature_manifest.json
CHANGED
|
@@ -1,54 +1,64 @@
|
|
| 1 |
{
|
|
|
|
| 2 |
"sv_features": [
|
| 3 |
-
"
|
| 4 |
-
"gq",
|
| 5 |
-
"pr_ref",
|
| 6 |
-
"pr_alt",
|
| 7 |
-
"sr_ref",
|
| 8 |
-
"sr_alt",
|
| 9 |
-
"vf_ref",
|
| 10 |
-
"vf_alt",
|
| 11 |
-
"total_alt_support",
|
| 12 |
-
"sr_pr_ratio",
|
| 13 |
-
"vaf_estimate",
|
| 14 |
-
"strand_bias_fs",
|
| 15 |
-
"cipos_width",
|
| 16 |
-
"ciend_width",
|
| 17 |
-
"homlen",
|
| 18 |
-
"svlen_abs",
|
| 19 |
"svtype_DEL",
|
| 20 |
-
"svtype_INS",
|
| 21 |
"svtype_DUP",
|
| 22 |
"svtype_BND",
|
| 23 |
"is_imprecise",
|
| 24 |
-
"
|
| 25 |
-
"
|
| 26 |
],
|
| 27 |
"str_features": [
|
| 28 |
-
"
|
| 29 |
-
"
|
| 30 |
-
"
|
| 31 |
-
"
|
| 32 |
-
"
|
| 33 |
-
"
|
| 34 |
-
"
|
| 35 |
-
"
|
| 36 |
-
"
|
| 37 |
-
"
|
| 38 |
-
"
|
| 39 |
-
"
|
| 40 |
-
"
|
| 41 |
-
"
|
| 42 |
-
"
|
| 43 |
-
"
|
| 44 |
-
"
|
| 45 |
-
"
|
| 46 |
-
"
|
| 47 |
-
"
|
| 48 |
-
"
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
"locus_conc_rate",
|
| 52 |
-
"locus_in_lookup"
|
| 53 |
-
]
|
| 54 |
}
|
|
|
|
| 1 |
{
|
| 2 |
+
"release_version": "1.0",
|
| 3 |
"sv_features": [
|
| 4 |
+
"is_pass",
|
| 5 |
"svtype_DEL",
|
|
|
|
| 6 |
"svtype_DUP",
|
| 7 |
+
"svtype_INS",
|
| 8 |
+
"svtype_INV",
|
| 9 |
"svtype_BND",
|
| 10 |
+
"svlen_log",
|
| 11 |
+
"cipos_width",
|
| 12 |
+
"ciend_width",
|
| 13 |
"is_imprecise",
|
| 14 |
+
"pe_support",
|
| 15 |
+
"sr_support",
|
| 16 |
+
"total_support",
|
| 17 |
+
"vaf",
|
| 18 |
+
"gt_hom",
|
| 19 |
+
"gq",
|
| 20 |
+
"qual_norm",
|
| 21 |
+
"local_depth",
|
| 22 |
+
"gc_min",
|
| 23 |
+
"gc_max",
|
| 24 |
+
"entropy_min",
|
| 25 |
+
"microhom_max",
|
| 26 |
+
"in_segdup_either",
|
| 27 |
+
"in_segdup_both",
|
| 28 |
+
"in_difficult_either",
|
| 29 |
+
"in_difficult_both",
|
| 30 |
+
"in_lowmap_either",
|
| 31 |
+
"in_tandem_either",
|
| 32 |
+
"in_Alu_either",
|
| 33 |
+
"in_L1_either",
|
| 34 |
+
"in_SVA_either",
|
| 35 |
+
"in_LTR_either",
|
| 36 |
+
"frac_span_repeat",
|
| 37 |
+
"n_neighbors",
|
| 38 |
+
"nn_log_dist"
|
| 39 |
],
|
| 40 |
"str_features": [
|
| 41 |
+
"is_pass",
|
| 42 |
+
"motif_len",
|
| 43 |
+
"ref_copynum",
|
| 44 |
+
"gt_repcn_max",
|
| 45 |
+
"gt_repcn_min",
|
| 46 |
+
"expansion_over_ref",
|
| 47 |
+
"repci_width_max",
|
| 48 |
+
"spanning_reads",
|
| 49 |
+
"flanking_reads",
|
| 50 |
+
"inrepeat_reads",
|
| 51 |
+
"locus_depth",
|
| 52 |
+
"gt_hom",
|
| 53 |
+
"ref_tract_bp",
|
| 54 |
+
"spanning_frac",
|
| 55 |
+
"allele_vs_readlen",
|
| 56 |
+
"motif_is_homopolymer",
|
| 57 |
+
"gc_flank",
|
| 58 |
+
"entropy_flank",
|
| 59 |
+
"in_segdup",
|
| 60 |
+
"in_difficult",
|
| 61 |
+
"flank_lowmap"
|
| 62 |
+
],
|
| 63 |
+
"note": "Produced by feature_builder.py from a short-read VCF + reference FASTA + static BEDs; missing fields use the -99999 sentinel."
|
| 64 |
}
|
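Since the manifest fixes the column names a feature table is expected to carry, a table header can be checked against it before scoring. The sketch below uses only the standard library. The three-feature manifest, the in-memory table, and the `absent_features` helper are hypothetical stand-ins for the real files.

```python
import csv
import io
import json

# Hypothetical, trimmed manifest; the real feature_manifest.json lists 21 STR features.
manifest = json.loads('{"str_features": ["is_pass", "motif_len", "ref_copynum"]}')

# Hypothetical feature table; real tables come from feature_builder.py.
table = io.StringIO("sample\tcaller\tis_pass\tmotif_len\nS1\tcaller_x\t1\t2\n")


def absent_features(handle, features):
    """List manifest features that do not appear in the table header."""
    header = next(csv.reader(handle, delimiter="\t"))
    return [f for f in features if f not in header]


missing = absent_features(table, manifest["str_features"])
```

In this PR the scorer fills absent columns with the sentinel and only warns, so a check like this is a way to see up front which features will be defaulted.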
requirements.txt
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
scikit-learn==1.
|
| 2 |
pandas>=2.0
|
| 3 |
numpy>=1.24
|
| 4 |
joblib>=1.3
|
|
|
|
| 1 |
+
scikit-learn==1.7.1
|
| 2 |
pandas>=2.0
|
| 3 |
numpy>=1.24
|
| 4 |
joblib>=1.3
|
score_svstr.py
CHANGED
|
@@ -1,110 +1,91 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
score_svstr.py — apply the SVSTR-Score confidence model to short-read SV or STR
|
| 4 |
-
calls and emit a per-call confidence score (CS) and tier.
|
| 5 |
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
MODERATE 0.50 <= CS < 0.70
|
| 15 |
WARNING 0.30 <= CS < 0.50
|
| 16 |
LOW CS < 0.30
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
range is a monotone precision ladder; LOW/WARNING are not.
|
| 24 |
-
|
| 25 |
-
INPUT (--features): tab/comma table extracted from the caller VCF.
|
| 26 |
-
SV : the 23 features in sv_config.json (+ chrom,pos optional for provenance)
|
| 27 |
-
STR : the 23 caller-output features in str_config.json EXCLUDING locus_conc_rate
|
| 28 |
-
and locus_in_lookup, PLUS chrom,pos (used to join the catalogue lookup).
|
| 29 |
-
locus_conc_rate / locus_in_lookup are filled here from the released lookup.
|
| 30 |
|
| 31 |
USAGE
|
| 32 |
-
python score_svstr.py --variant sv --model-dir . --features
|
| 33 |
-
python score_svstr.py --variant str --model-dir . --features
|
| 34 |
|
| 35 |
-
Requires the
|
| 36 |
-
Licence: MIT.
|
| 37 |
"""
|
| 38 |
import argparse, json, os, sys
|
| 39 |
import numpy as np
|
| 40 |
import pandas as pd
|
| 41 |
import joblib
|
| 42 |
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
_OP = {"<": lambda a, v: a < v, "<=": lambda a, v: a <= v,
|
| 48 |
-
">": lambda a, v: a > v, ">=": lambda a, v: a >= v, "==": lambda a, v: a == v}
|
| 49 |
|
| 50 |
|
| 51 |
-
def
|
| 52 |
-
|
| 53 |
-
quality-demotion rules: any rule triggered -> cfg['override_target'] (SV: LOW,
|
| 54 |
-
STR: WARNING), regardless of CS. NaN feature values never trigger a rule."""
|
| 55 |
-
tier = np.where(cs >= 0.70, "HIGH",
|
| 56 |
-
np.where(cs >= 0.50, "MODERATE",
|
| 57 |
-
np.where(cs >= 0.30, "WARNING", "LOW")))
|
| 58 |
-
rules = cfg.get("rules", {})
|
| 59 |
-
target = cfg.get("override_target", "LOW")
|
| 60 |
-
trig = np.zeros(len(cs), dtype=bool)
|
| 61 |
-
for r in rules.values():
|
| 62 |
-
f = r["feature"]
|
| 63 |
-
if f in df.columns:
|
| 64 |
-
v = pd.to_numeric(df[f], errors="coerce").values
|
| 65 |
-
trig = trig | np.nan_to_num(_OP[r["op"]](v, r["value"]), nan=False).astype(bool)
|
| 66 |
-
return np.where(trig, target, tier)
|
| 67 |
|
| 68 |
|
| 69 |
def main():
|
| 70 |
ap = argparse.ArgumentParser(description=__doc__,
|
| 71 |
formatter_class=argparse.RawDescriptionHelpFormatter)
|
| 72 |
ap.add_argument("--variant", choices=["sv", "str"], required=True)
|
| 73 |
-
ap.add_argument("--model-dir", default=".", help="dir with *_config.json
|
| 74 |
-
ap.add_argument("--features", required=True,
|
|
|
|
| 75 |
ap.add_argument("--out", required=True)
|
|
|
|
| 76 |
a = ap.parse_args()
|
| 77 |
|
| 78 |
cfg = json.load(open(os.path.join(a.model_dir, f"{a.variant}_config.json")))
|
| 79 |
model = joblib.load(os.path.join(a.model_dir, cfg["model_file"]))
|
|
|
|
| 80 |
feats = cfg["features"]
|
| 81 |
|
| 82 |
sep = "\t" if a.features.endswith((".tsv", ".txt", ".gz")) else ","
|
| 83 |
df = pd.read_csv(a.features, sep=sep)
|
| 84 |
|
| 85 |
-
if
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
hi = int((
|
| 107 |
-
print(f"{a.variant.upper()}: scored {n:,} calls; HIGH {hi:,} ({hi/n:.1%})
|
|
|
|
| 108 |
print(f"wrote {a.out}", file=sys.stderr)
|
| 109 |
|
| 110 |
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
score_svstr.py — apply the SVSTR-Score confidence model to short-read SV or STR
|
| 4 |
+
calls and emit a per-call calibrated confidence score (CS) and tier.
|
| 5 |
|
| 6 |
+
Inference entry point for the released models (sv_model.joblib / str_model.joblib,
|
| 7 |
+
each paired with an isotonic calibrator). It loads the trained random forest + its
|
| 8 |
+
isotonic calibrator + the feature/config sidecar, then scores a tabular feature
|
| 9 |
+
matrix produced by `feature_builder.py` from the caller VCF (short-read VCF +
|
| 10 |
+
reference FASTA + static annotation BEDs):
|
| 11 |
|
| 12 |
+
CS = isotonic_calibrator( RF.predict_proba(X)[:, 1] ) # P(concordant)
|
| 13 |
+
|
| 14 |
+
TIERS:
|
| 15 |
+
HIGH CS >= 0.70 (candidate-triage filter)
|
| 16 |
MODERATE 0.50 <= CS < 0.70
|
| 17 |
WARNING 0.30 <= CS < 0.50
|
| 18 |
LOW CS < 0.30
|
| 19 |
+
|
| 20 |
+
The score is isotonic-calibrated, so the tier is a pure bucket of the calibrated
|
| 21 |
+
CS — there are no heuristic override rules, and STR needs no per-locus catalogue
|
| 22 |
+
lookup (its features are self-contained). Missing features (fields a merged or
|
| 23 |
+
filtered callset may not carry) are filled with the -99999 sentinel that the
|
| 24 |
+
trees were trained to split on.
|
| 25 |
|
| 26 |
USAGE
|
| 27 |
+
python score_svstr.py --variant sv --model-dir . --features sv_features.tsv --out sv_scored.tsv
|
| 28 |
+
python score_svstr.py --variant str --model-dir . --features str_features.tsv --out str_scored.tsv
|
| 29 |
|
| 30 |
+
Requires the versions in requirements.txt (scikit-learn==1.7.1). Licence: MIT.
|
|
|
|
| 31 |
"""
|
| 32 |
import argparse, json, os, sys
|
| 33 |
import numpy as np
|
| 34 |
import pandas as pd
|
| 35 |
import joblib
|
| 36 |
|
| 37 |
+
MISSING = -99999.0
|
| 38 |
+
TIER_EDGES = (0.30, 0.50, 0.70)
|
| 39 |
+
TIER_NAMES = ("LOW", "WARNING", "MODERATE", "HIGH")
|
| 40 |
|
| 41 |
|
| 42 |
+
def to_tier(cs):
|
| 43 |
+
return np.asarray(TIER_NAMES)[np.digitize(np.asarray(cs, float), TIER_EDGES, right=False)]
|
| 44 |
|
| 45 |
|
| 46 |
def main():
|
| 47 |
ap = argparse.ArgumentParser(description=__doc__,
|
| 48 |
formatter_class=argparse.RawDescriptionHelpFormatter)
|
| 49 |
ap.add_argument("--variant", choices=["sv", "str"], required=True)
|
| 50 |
+
ap.add_argument("--model-dir", default=".", help="dir with *_config.json + *.joblib")
|
| 51 |
+
ap.add_argument("--features", required=True,
|
| 52 |
+
help="feature table from feature_builder.py (tsv/csv[.gz])")
|
| 53 |
ap.add_argument("--out", required=True)
|
| 54 |
+
ap.add_argument("--raw", action="store_true", help="also emit the uncalibrated RF score (CS_raw)")
|
| 55 |
a = ap.parse_args()
|
| 56 |
|
| 57 |
cfg = json.load(open(os.path.join(a.model_dir, f"{a.variant}_config.json")))
|
| 58 |
model = joblib.load(os.path.join(a.model_dir, cfg["model_file"]))
|
| 59 |
+
cal = joblib.load(os.path.join(a.model_dir, cfg["calibrator_file"]))
|
| 60 |
feats = cfg["features"]
|
| 61 |
|
| 62 |
sep = "\t" if a.features.endswith((".tsv", ".txt", ".gz")) else ","
|
| 63 |
df = pd.read_csv(a.features, sep=sep)
|
| 64 |
|
| 65 |
+
absent = [f for f in feats if f not in df.columns]
|
| 66 |
+
if absent:
|
| 67 |
+
print(f"[warn] {len(absent)} model features absent -> -99999 sentinel: {absent}",
|
| 68 |
+
file=sys.stderr)
|
| 69 |
+
for f in absent:
|
| 70 |
+
df[f] = MISSING
|
| 71 |
+
|
| 72 |
+
X = df[feats].astype("float32")
|
| 73 |
+
p = model.predict_proba(X)
|
| 74 |
+
raw = p[:, list(model.classes_).index(1)] if p.shape[1] > 1 else p[:, 0]
|
| 75 |
+
cs = np.clip(cal.predict(raw), 0.0, 1.0)
|
| 76 |
+
|
| 77 |
+
out = df.copy()
|
| 78 |
+
if a.raw:
|
| 79 |
+
out["CS_raw"] = np.round(raw, 4)
|
| 80 |
+
out["CS"] = np.round(cs, 4)
|
| 81 |
+
out["tier"] = to_tier(cs)
|
| 82 |
+
out.to_csv(a.out, sep="\t", index=False)
|
| 83 |
+
|
| 84 |
+
n = len(out)
|
| 85 |
+
vc = pd.Series(out["tier"]).value_counts()
|
| 86 |
+
hi = int(vc.get("HIGH", 0))
|
| 87 |
+
print(f"{a.variant.upper()}: scored {n:,} calls; HIGH {hi:,} ({hi / max(n, 1):.1%}) | "
|
| 88 |
+
+ " ".join(f"{t}:{int(vc.get(t,0)):,}" for t in TIER_NAMES), file=sys.stderr)
|
| 89 |
print(f"wrote {a.out}", file=sys.stderr)
|
| 90 |
|
| 91 |
|
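The tier assignment in the new `to_tier` is a pure bucketing of CS at the edges 0.30, 0.50 and 0.70, with each lower edge inclusive. A standard-library sketch of the same bucketing is shown here, assuming a scalar score. It uses `bisect_right`, which gives the same bin index as `np.digitize(..., right=False)` for increasing edges. The demo scores are made up.

```python
from bisect import bisect_right

TIER_EDGES = (0.30, 0.50, 0.70)
TIER_NAMES = ("LOW", "WARNING", "MODERATE", "HIGH")


def to_tier(cs):
    """Map one calibrated score to its tier; a score equal to an edge goes to the upper tier."""
    return TIER_NAMES[bisect_right(TIER_EDGES, cs)]


# Made-up scores chosen to sit on and around the edges.
tiers = [to_tier(x) for x in (0.10, 0.30, 0.4999, 0.50, 0.70, 1.0)]
```

Scores that sit exactly on an edge (0.30, 0.50, 0.70) land in the higher tier, matching the `CS >= 0.70` style bounds in the docstring.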
str_locus_lookup.parquet → str_calibrator.joblib
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3af5671f62a57815797d4c3b71b1c5b5b0102d36d9275651b3ba088a3827e453
|
| 3 |
+
size 16219
|
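The calibrator file is a binary artifact, so its internals are not visible in this diff. To illustrate what isotonic regression does to a sequence of labels ordered by raw score, here is a pure-Python pool-adjacent-violators sketch. The `pava` name and the toy labels are hypothetical, and this is not claimed to be how the released calibrator was fitted.

```python
def pava(y):
    """Non-decreasing least-squares fit to `y`, where `y` is already ordered by raw score."""
    blocks = []  # each block is [sum_of_values, count]
    for v in y:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted


# Toy concordance labels (1 = concordant), sorted by increasing raw RF score.
labels = [0, 1, 0, 1, 1]
calibrated = pava(labels)
```

The out-of-order pair `1, 0` in the middle is pooled to its mean, so the fitted values never decrease as the raw score increases.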
str_config.json
CHANGED
|
@@ -1,74 +1,53 @@
|
|
| 1 |
{
|
| 2 |
-
"
|
| 3 |
"features": [
|
| 4 |
-
"
|
| 5 |
-
"
|
| 6 |
-
"
|
| 7 |
-
"
|
| 8 |
-
"
|
| 9 |
-
"
|
| 10 |
-
"
|
| 11 |
-
"
|
| 12 |
-
"
|
| 13 |
-
"
|
| 14 |
-
"
|
| 15 |
-
"
|
| 16 |
-
"
|
| 17 |
-
"
|
| 18 |
-
"
|
| 19 |
-
"
|
| 20 |
-
"
|
| 21 |
-
"
|
| 22 |
-
"
|
| 23 |
-
"
|
| 24 |
-
"
|
| 25 |
-
"support_type_a2",
|
| 26 |
-
"is_low_depth",
|
| 27 |
-
"locus_conc_rate",
|
| 28 |
-
"locus_in_lookup"
|
| 29 |
],
|
| 30 |
-
"
|
| 31 |
-
|
| 32 |
-
|
| 33 |
},
|
| 34 |
-
"
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
"value": 20
|
| 39 |
-
},
|
| 40 |
-
"low_depth_flag": {
|
| 41 |
-
"feature": "is_low_depth",
|
| 42 |
-
"op": "==",
|
| 43 |
-
"value": 1
|
| 44 |
-
},
|
| 45 |
-
"no_spanning_a2": {
|
| 46 |
-
"feature": "support_type_a2",
|
| 47 |
-
"op": "<=",
|
| 48 |
-
"value": 1
|
| 49 |
-
},
|
| 50 |
-
"low_support_a2": {
|
| 51 |
-
"feature": "total_support_a2",
|
| 52 |
-
"op": "<",
|
| 53 |
-
"value": 5
|
| 54 |
-
},
|
| 55 |
-
"wide_ci_a2": {
|
| 56 |
-
"feature": "ci_width_a2",
|
| 57 |
-
"op": ">",
|
| 58 |
-
"value": 3
|
| 59 |
-
},
|
| 60 |
-
"low_allele_balance": {
|
| 61 |
-
"feature": "allele_balance",
|
| 62 |
-
"op": "<",
|
| 63 |
-
"value": 0.3
|
| 64 |
-
}
|
| 65 |
-
},
|
| 66 |
-
"lookup_file": "str_locus_lookup.parquet",
|
| 67 |
-
"lookup_keys": [
|
| 68 |
-
"chrom",
|
| 69 |
-
"pos"
|
| 70 |
],
|
| 71 |
-
"
|
| 72 |
"variant_class": "STR",
|
| 73 |
-
"
|
| 74 |
}
|
|
|
|
| 1 |
{
|
| 2 |
+
"release_version": "1.0",
|
| 3 |
+
"model_file": "str_model.joblib",
|
| 4 |
+
"calibrator_file": "str_calibrator.joblib",
|
| 5 |
+
"calibration": "isotonic regression on out-of-fold scores",
|
| 6 |
"features": [
|
| 7 |
+
"is_pass",
|
| 8 |
+
"motif_len",
|
| 9 |
+
"ref_copynum",
|
| 10 |
+
"gt_repcn_max",
|
| 11 |
+
"gt_repcn_min",
|
| 12 |
+
"expansion_over_ref",
|
| 13 |
+
"repci_width_max",
|
| 14 |
+
"spanning_reads",
|
| 15 |
+
"flanking_reads",
|
| 16 |
+
"inrepeat_reads",
|
| 17 |
+
"locus_depth",
|
| 18 |
+
"gt_hom",
|
| 19 |
+
"ref_tract_bp",
|
| 20 |
+
"spanning_frac",
|
| 21 |
+
"allele_vs_readlen",
|
| 22 |
+
"motif_is_homopolymer",
|
| 23 |
+
"gc_flank",
|
| 24 |
+
"entropy_flank",
|
| 25 |
+
"in_segdup",
|
| 26 |
+
"in_difficult",
|
| 27 |
+
"flank_lowmap"
|
| 28 |
],
|
| 29 |
+
"n_features": 21,
|
| 30 |
+
"missing_sentinel": -99999.0,
|
| 31 |
+
"tiers": {
|
| 32 |
+
"HIGH": "CS>=0.70",
|
| 33 |
+
"MODERATE": "0.50<=CS<0.70",
|
| 34 |
+
"WARNING": "0.30<=CS<0.50",
|
| 35 |
+
"LOW": "CS<0.30"
|
| 36 |
},
|
| 37 |
+
"tier_edges": [
|
| 38 |
+
0.3,
|
| 39 |
+
0.5,
|
| 40 |
+
0.7
|
| 41 |
],
|
| 42 |
+
"score": "CS = isotonic-calibrated P(call concordant with long-read truth)",
|
| 43 |
"variant_class": "STR",
|
| 44 |
+
"sklearn_version_trained": "1.7.1",
|
| 45 |
+
"training": {
|
| 46 |
+
"cohort": "HPRC",
|
| 47 |
+
"n_samples": 208,
|
| 48 |
+
"n_train_rows": 22651133,
|
| 49 |
+
"cv": "5-fold GroupKFold by sample",
|
| 50 |
+
"oof_auroc": 0.8342,
|
| 51 |
+
"oof_auprc": 0.886
|
| 52 |
+
}
|
| 53 |
}
|
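The config sidecar repeats a few facts that should agree with each other (`n_features` against the `features` list, ordered `tier_edges`). A small consistency check is sketched below, assuming a trimmed, hypothetical config. `config_problems` is not part of the PR.

```python
import json

# Hypothetical, trimmed stand-in for str_config.json (the real file lists 21 features).
cfg = json.loads("""
{
  "features": ["is_pass", "motif_len", "ref_copynum"],
  "n_features": 3,
  "missing_sentinel": -99999.0,
  "tier_edges": [0.3, 0.5, 0.7]
}
""")


def config_problems(cfg):
    """Return a list of human-readable inconsistencies (empty if none are found)."""
    problems = []
    if cfg["n_features"] != len(cfg["features"]):
        problems.append("n_features does not match len(features)")
    if len(set(cfg["features"])) != len(cfg["features"]):
        problems.append("duplicate feature names")
    edges = cfg["tier_edges"]
    if any(a >= b for a, b in zip(edges, edges[1:])):
        problems.append("tier_edges are not strictly increasing")
    return problems


problems = config_problems(cfg)
```

Running a check like this when loading the config would catch a feature list edited by hand without updating the count.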
str_model_v13_parents.joblib → str_model.joblib
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:589539ef52b9c0d6518ac7e6a7beda82abfe767d7a5e27fb1dd9a55099f373d5
|
| 3 |
+
size 512205159
|
str_model_meta.json
ADDED
|
@@ -0,0 +1,448 @@
|
| 1 |
+
{
|
| 2 |
+
"variant": "str",
|
| 3 |
+
"created_unix": 1782043477,
|
| 4 |
+
"feature_cols": [
|
| 5 |
+
"is_pass",
|
| 6 |
+
"motif_len",
|
| 7 |
+
"ref_copynum",
|
| 8 |
+
"gt_repcn_max",
|
| 9 |
+
"gt_repcn_min",
|
| 10 |
+
"expansion_over_ref",
|
| 11 |
+
"repci_width_max",
|
| 12 |
+
"spanning_reads",
|
| 13 |
+
"flanking_reads",
|
| 14 |
+
"inrepeat_reads",
|
| 15 |
+
"locus_depth",
|
| 16 |
+
"gt_hom",
|
| 17 |
+
"ref_tract_bp",
|
| 18 |
+
"spanning_frac",
|
| 19 |
+
"allele_vs_readlen",
|
| 20 |
+
"motif_is_homopolymer",
|
| 21 |
+
"gc_flank",
|
| 22 |
+
"entropy_flank",
|
| 23 |
+
"in_segdup",
|
| 24 |
+
"in_difficult",
|
| 25 |
+
"flank_lowmap"
|
| 26 |
+
],
|
| 27 |
+
"n_features": 21,
|
| 28 |
+
"tier_edges": [
|
| 29 |
+
0.3,
|
| 30 |
+
0.5,
|
| 31 |
+
0.7
|
| 32 |
+
],
|
| 33 |
+
"tier_names": [
|
| 34 |
+
"LOW",
|
| 35 |
+
"Warning",
|
| 36 |
+
"Moderate",
|
| 37 |
+
"High"
|
| 38 |
+
],
|
| 39 |
+
"missing_sentinel": -99999.0,
|
| 40 |
+
"rf_params": {
|
| 41 |
+
"bootstrap": true,
|
| 42 |
+
"ccp_alpha": 0.0,
|
| 43 |
+
"class_weight": "balanced_subsample",
|
| 44 |
+
"criterion": "gini",
|
| 45 |
+
"max_depth": null,
|
| 46 |
+
"max_features": "sqrt",
|
| 47 |
+
"max_leaf_nodes": null,
|
| 48 |
+
"max_samples": 2000000,
|
| 49 |
+
"min_impurity_decrease": 0.0,
|
| 50 |
+
"min_samples_leaf": 50,
|
| 51 |
+
"min_samples_split": 2,
|
| 52 |
+
"min_weight_fraction_leaf": 0.0,
|
| 53 |
+
"monotonic_cst": null,
|
| 54 |
+
"n_estimators": 300,
|
| 55 |
+
"n_jobs": -1,
|
| 56 |
+
"oob_score": false,
|
| 57 |
+
"random_state": 42,
|
| 58 |
+
"verbose": 0,
|
| 59 |
+
"warm_start": false
|
| 60 |
+
},
|
| 61 |
+
"n_train_rows": 22651133,
|
| 62 |
+
"n_samples": 208,
|
| 63 |
+
"qc": {
|
| 64 |
+
"label_rows_raw": 36254400,
|
| 65 |
+
"label_dist_raw": {
|
| 66 |
+
"concordant": 21350382,
|
| 67 |
+
"discordant": 13838163,
|
| 68 |
+
"unlabeled": 1065855
|
| 69 |
+
},
|
| 70 |
+
"label_rows_usable": 35188545,
|
| 71 |
+
"ambiguous_keys_dropped": 0,
|
| 72 |
+
"ambiguous_feat_rows": 0,
|
| 73 |
+
"ambiguous_label_rows": 0,
|
| 74 |
+
"dup_keys_feature": 0,
|
| 75 |
+
"dup_keys_label": 0,
|
| 76 |
+
"merged_rows": 22651133,
|
| 77 |
+
"match_rate_vs_labels": 0.6437075758602693,
|
| 78 |
+
"match_rate_vs_features": 0.9832673629385175,
|
| 79 |
+
"class_balance": {
|
| 80 |
+
"concordant": 13960015,
|
| 81 |
+
"discordant": 8691118
|
| 82 |
+
},
|
| 83 |
+
"concordant_rate": 0.6163053742168217
|
| 84 |
+
},
|
| 85 |
+
"cv_folds": 5,
|
| 86 |
+
"cv_fold_metrics": [
|
| 87 |
+
{
|
| 88 |
+
"n": 4469639,
|
| 89 |
+
"pos_rate": 0.6172603648751052,
|
| 90 |
+
"auroc": 0.8345731588413778,
|
| 91 |
+
"auprc": 0.8868311937682424,
|
| 92 |
+
"brier": 0.16715199887480572,
|
| 93 |
+
"logloss": 0.505031384190826,
|
| 94 |
+
"fold": 0,
|
| 95 |
+
"seconds": 404.5
|
| 96 |
+
},
|
| 97 |
+
{
|
| 98 |
+
"n": 4469658,
|
| 99 |
+
"pos_rate": 0.6172628867801518,
|
| 100 |
+
"auroc": 0.8348793797657998,
|
| 101 |
+
"auprc": 0.8871277104956028,
|
| 102 |
+
"brier": 0.16710046702995693,
|
| 103 |
+
"logloss": 0.5048207582711781,
|
| 104 |
+
"fold": 1,
|
| 105 |
+
"seconds": 457.3
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"n": 4569998,
|
| 109 |
+
"pos_rate": 0.6173429397562099,
|
| 110 |
+
"auroc": 0.8345632397054213,
|
| 111 |
+
"auprc": 0.8867279699640327,
|
| 112 |
+
"brier": 0.16717765756008388,
|
| 113 |
+
"logloss": 0.5050623605875793,
|
| 114 |
+
"fold": 2,
|
| 115 |
+
"seconds": 480.6
|
| 116 |
+
},
|
| 117 |
+
{
|
| 118 |
+
"n": 4570859,
|
| 119 |
+
"pos_rate": 0.6168989242503433,
|
| 120 |
+
"auroc": 0.8350534258010407,
|
| 121 |
+
"auprc": 0.8870572426757822,
|
| 122 |
+
"brier": 0.1669604630273807,
|
| 123 |
+
"logloss": 0.5044600822147348,
|
| 124 |
+
"fold": 3,
|
| 125 |
+
"seconds": 546.9
|
| 126 |
+
},
|
| 127 |
+
{
|
| 128 |
+
"n": 4570979,
|
| 129 |
+
"pos_rate": 0.6128043904817765,
|
| 130 |
+
"auroc": 0.8317845587452297,
|
| 131 |
+
"auprc": 0.8823297885222531,
|
| 132 |
+
"brier": 0.16790578845588436,
|
| 133 |
+
"logloss": 0.5066430427730261,
|
| 134 |
+
"fold": 4,
|
| 135 |
+
"seconds": 558.6
|
| 136 |
+
}
|
| 137 |
+
],
|
| 138 |
+
"cv_report": {
|
| 139 |
+
"overall": {
|
| 140 |
+
"n": 22651133,
|
| 141 |
+
"pos_rate": 0.6163053742168217,
|
| 142 |
+
"auroc": 0.8341539493365068,
|
| 143 |
+
"auprc": 0.885996637709877,
|
| 144 |
+
"brier": 0.16726047042063633,
|
| 145 |
+
"logloss": 0.5052060179718258
|
| 146 |
+
},
|
| 147 |
+
"calibration": [
|
| 148 |
+
{
|
| 149 |
+
"bin": "[0.0,0.1)",
|
| 150 |
+
"n": 759079,
|
| 151 |
+
"mean_pred": 0.06623314081824806,
|
| 152 |
+
"obs_rate": 0.027333123429840636
|
| 153 |
+
},
|
| 154 |
+
{
|
| 155 |
+
"bin": "[0.1,0.2)",
|
| 156 |
+
"n": 1807689,
|
| 157 |
+
"mean_pred": 0.15353118408631086,
|
| 158 |
+
"obs_rate": 0.1398288090484591
|
| 159 |
+
},
|
| 160 |
+
{
|
| 161 |
+
"bin": "[0.2,0.3)",
|
| 162 |
+
"n": 2278662,
|
| 163 |
+
"mean_pred": 0.250703986073481,
|
| 164 |
+
"obs_rate": 0.2854271497922904
|
| 165 |
+
},
|
| 166 |
+
{
|
| 167 |
+
"bin": "[0.3,0.4)",
|
| 168 |
+
"n": 2401825,
|
| 169 |
+
"mean_pred": 0.35114505321433914,
|
| 170 |
+
"obs_rate": 0.4219845325950059
|
| 171 |
+
},
|
| 172 |
+
{
|
| 173 |
+
"bin": "[0.4,0.5)",
|
| 174 |
+
"n": 2503890,
|
| 175 |
+
"mean_pred": 0.4496778698066448,
|
| 176 |
+
"obs_rate": 0.5559477453083003
|
| 177 |
+
},
|
| 178 |
+
{
|
| 179 |
+
"bin": "[0.5,0.6)",
|
| 180 |
+
"n": 2743182,
|
| 181 |
+
"mean_pred": 0.5514420283736253,
|
| 182 |
+
"obs_rate": 0.6633803371413198
|
| 183 |
+
},
|
| 184 |
+
{
|
| 185 |
+
"bin": "[0.6,0.7)",
|
| 186 |
+
"n": 3201411,
|
| 187 |
+
"mean_pred": 0.6513120336728542,
|
| 188 |
+
"obs_rate": 0.7673941271520589
|
| 189 |
+
},
|
| 190 |
+
{
|
| 191 |
+
"bin": "[0.7,0.8)",
|
| 192 |
+
"n": 2972899,
|
| 193 |
+
"mean_pred": 0.7478180823491758,
|
| 194 |
+
"obs_rate": 0.8596629081579966
|
| 195 |
+
},
|
| 196 |
+
{
|
| 197 |
+
"bin": "[0.8,0.9)",
|
| 198 |
+
"n": 2979925,
|
| 199 |
+
"mean_pred": 0.8513437073854806,
|
| 200 |
+
"obs_rate": 0.9412015403072225
|
| 201 |
+
},
|
| 202 |
+
{
|
| 203 |
+
"bin": "[0.9,1.0)",
|
| 204 |
+
"n": 1002571,
|
| 205 |
+
"mean_pred": 0.9221679799864609,
|
| 206 |
+
"obs_rate": 0.9910769411842154
|
| 207 |
+
}
|
| 208 |
+
],
|
| 209 |
+
"per_sample_auroc": {
|
| 210 |
+
"n_samples": 208,
|
| 211 |
+
"median": 0.8353140721290141,
|
| 212 |
+
"p25": 0.8326614184016954,
|
| 213 |
+
"p75": 0.8373927525350378,
|
| 214 |
+
"min": 0.740174387702103,
|
| 215 |
+
"max": 0.8401855333526593
|
| 216 |
+
},
|
| 217 |
+
"by_homopolymer": {
|
| 218 |
+
"homopolymer": {
|
| 219 |
+
"n": 176,
|
| 220 |
+
"pos_rate": 0.0,
|
| 221 |
+
"auroc": null,
|
| 222 |
+
"auprc": null,
|
| 223 |
+
"brier": 0.12461994174893026
|
| 224 |
+
},
|
| 225 |
+
"other": {
|
| 226 |
+
"n": 22650957,
|
| 227 |
+
"pos_rate": 0.6163101629657414,
|
| 228 |
+
"auroc": 0.8341526308855854,
|
| 229 |
+
"auprc": 0.8859973231761953,
|
| 230 |
+
"brier": 0.16726080174142982,
|
| 231 |
+
"logloss": 0.5052065639352175
|
| 232 |
+
}
|
| 233 |
+
},
|
| 234 |
+
"by_is_pass": {
|
| 235 |
+
"PASS": {
|
| 236 |
+
"n": 22645309,
|
| 237 |
+
"pos_rate": 0.6163365225000904,
|
| 238 |
+
"auroc": 0.8341536917536043,
|
| 239 |
+
"auprc": 0.8860084593752011,
|
| 240 |
+
"brier": 0.1672574382686718,
|
| 241 |
+
"logloss": 0.505198302627369
|
| 242 |
+
},
|
| 243 |
+
"nonPASS": {
|
| 244 |
+
"n": 5824,
|
| 245 |
+
"pos_rate": 0.4951923076923077,
|
| 246 |
+
"auroc": 0.821139738835895,
|
| 247 |
+
"auprc": 0.8249088115206255,
|
| 248 |
+
"brier": 0.17905030870563365,
|
| 249 |
+
"logloss": 0.5352053928461165
|
| 250 |
+
}
|
| 251 |
+
}
|
| 252 |
+
},
|
| 253 |
+
"importances": {
|
| 254 |
+
"impurity": [
|
| 255 |
+
{
|
| 256 |
+
"feature": "entropy_flank",
|
| 257 |
+
"impurity_importance": 0.28992320685730033
|
| 258 |
+
},
|
| 259 |
+
{
|
| 260 |
+
"feature": "motif_len",
|
| 261 |
+
"impurity_importance": 0.15078304844246473
|
| 262 |
+
},
|
| 263 |
+
{
|
| 264 |
+
"feature": "gc_flank",
|
| 265 |
+
"impurity_importance": 0.11765967510912077
|
| 266 |
+
},
|
| 267 |
+
{
|
| 268 |
+
"feature": "ref_tract_bp",
|
| 269 |
+
"impurity_importance": 0.09594543197447271
|
| 270 |
+
},
|
| 271 |
+
{
|
| 272 |
+
"feature": "allele_vs_readlen",
|
| 273 |
+
"impurity_importance": 0.06304989891121958
|
| 274 |
+
},
|
| 275 |
+
{
|
| 276 |
+
"feature": "ref_copynum",
|
| 277 |
+
"impurity_importance": 0.06281644250839796
|
| 278 |
+
},
|
| 279 |
+
{
|
| 280 |
+
"feature": "gt_repcn_max",
|
| 281 |
+
"impurity_importance": 0.045375808024477604
|
| 282 |
+
},
|
| 283 |
+
{
|
| 284 |
+
"feature": "gt_repcn_min",
|
| 285 |
+
"impurity_importance": 0.04503548319154128
|
| 286 |
+
},
|
| 287 |
+
{
|
| 288 |
+
"feature": "flanking_reads",
|
| 289 |
+
"impurity_importance": 0.04081082547154657
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"feature": "spanning_frac",
|
| 293 |
+
"impurity_importance": 0.02788421749138721
|
| 294 |
+
},
|
| 295 |
+
{
|
| 296 |
+
"feature": "expansion_over_ref",
|
| 297 |
+
"impurity_importance": 0.017739812221077934
|
| 298 |
+
},
|
| 299 |
+
{
|
| 300 |
+
"feature": "locus_depth",
|
| 301 |
+
"impurity_importance": 0.014556405292958223
|
| 302 |
+
},
|
| 303 |
+
{
|
| 304 |
+
"feature": "spanning_reads",
|
| 305 |
+
"impurity_importance": 0.011672664495590936
|
| 306 |
+
},
|
| 307 |
+
{
|
| 308 |
+
"feature": "in_difficult",
|
| 309 |
+
"impurity_importance": 0.009656418449637608
|
| 310 |
+
},
|
| 311 |
+
{
|
| 312 |
+
"feature": "gt_hom",
|
| 313 |
+
"impurity_importance": 0.0024291103645865167
|
| 314 |
+
},
|
| 315 |
+
{
|
| 316 |
+
"feature": "in_segdup",
|
| 317 |
+
"impurity_importance": 0.001648983588740384
|
| 318 |
+
},
|
| 319 |
+
{
|
| 320 |
+
"feature": "flank_lowmap",
|
| 321 |
+
"impurity_importance": 0.001477948437034436
|
| 322 |
+
},
|
| 323 |
+
{
|
| 324 |
+
"feature": "repci_width_max",
|
| 325 |
+
"impurity_importance": 0.0012018133362063474
|
| 326 |
+
},
|
| 327 |
+
{
|
| 328 |
+
"feature": "inrepeat_reads",
|
| 329 |
+
"impurity_importance": 0.00033183288445321743
|
| 330 |
+
},
|
| 331 |
+
{
|
| 332 |
+
"feature": "is_pass",
|
| 333 |
+
"impurity_importance": 9.729477856164029e-07
|
| 334 |
+
},
|
| 335 |
+
{
|
| 336 |
+
"feature": "motif_is_homopolymer",
|
| 337 |
+
"impurity_importance": 0.0
|
| 338 |
+
}
|
| 339 |
+
],
|
| 340 |
+
"permutation": [
|
| 341 |
+
{
|
| 342 |
+
"feature": "entropy_flank",
|
| 343 |
+
"perm_importance_mean": 0.13934060781658777,
|
| 344 |
+
"perm_importance_std": 0.0006361765279266924
|
| 345 |
+
},
|
| 346 |
+
{
|
| 347 |
+
"feature": "motif_len",
|
| 348 |
+
"perm_importance_mean": 0.1232472127797279,
|
| 349 |
+
"perm_importance_std": 0.0005893220011599711
|
| 350 |
+
},
|
| 351 |
+
{
|
| 352 |
+
"feature": "gc_flank",
|
| 353 |
+
"perm_importance_mean": 0.06320217026546789,
|
| 354 |
+
"perm_importance_std": 0.00039522027338993824
|
| 355 |
+
},
|
| 356 |
+
{
|
| 357 |
+
"feature": "ref_tract_bp",
|
| 358 |
+
"perm_importance_mean": 0.056776687651067095,
|
| 359 |
+
"perm_importance_std": 0.00015236878123781785
|
| 360 |
+
},
|
| 361 |
+
{
|
| 362 |
+
"feature": "ref_copynum",
|
| 363 |
+
"perm_importance_mean": 0.02267318905161917,
|
| 364 |
+
"perm_importance_std": 0.00014989102435837524
|
| 365 |
+
},
|
| 366 |
+
{
|
| 367 |
+
"feature": "allele_vs_readlen",
|
| 368 |
+
"perm_importance_mean": 0.020529595235711205,
|
| 369 |
+
"perm_importance_std": 0.00017190103816491447
|
| 370 |
+
},
|
| 371 |
+
{
|
| 372 |
+
"feature": "gt_repcn_min",
|
| 373 |
+
"perm_importance_mean": 0.01731383830567197,
|
| 374 |
+
"perm_importance_std": 0.000195043199990813
|
| 375 |
+
},
|
| 376 |
+
{
|
| 377 |
+
"feature": "gt_repcn_max",
|
| 378 |
+
"perm_importance_mean": 0.014405902490600276,
|
| 379 |
+
"perm_importance_std": 0.00013955774976049523
|
| 380 |
+
},
|
| 381 |
+
{
|
| 382 |
+
"feature": "expansion_over_ref",
|
| 383 |
+
"perm_importance_mean": 0.008579439049389648,
|
| 384 |
+
"perm_importance_std": 8.141211169349268e-05
|
| 385 |
+
},
|
| 386 |
+
{
|
| 387 |
+
"feature": "flanking_reads",
|
| 388 |
+
"perm_importance_mean": 0.005908979701386818,
|
| 389 |
+
"perm_importance_std": 8.933000723756271e-05
|
| 390 |
+
},
|
| 391 |
+
{
|
| 392 |
+
"feature": "spanning_frac",
|
| 393 |
+
"perm_importance_mean": 0.005236130437139996,
|
| 394 |
+
"perm_importance_std": 4.831785228506296e-05
|
| 395 |
+
},
|
| 396 |
+
{
|
| 397 |
+
"feature": "in_difficult",
|
| 398 |
+
"perm_importance_mean": 0.003852866555695589,
|
| 399 |
+
"perm_importance_std": 2.129084797378384e-05
|
| 400 |
+
},
|
| 401 |
+
{
|
| 402 |
+
"feature": "spanning_reads",
|
| 403 |
+
"perm_importance_mean": 0.0029217009056680563,
|
| 404 |
+
"perm_importance_std": 4.176582259464099e-05
|
| 405 |
+
},
|
| 406 |
+
{
|
| 407 |
+
"feature": "gt_hom",
|
| 408 |
+
"perm_importance_mean": 0.002172501389781667,
|
| 409 |
+
"perm_importance_std": 8.3379119655914e-06
|
| 410 |
+
},
|
| 411 |
+
{
|
| 412 |
+
"feature": "locus_depth",
|
| 413 |
+
"perm_importance_mean": 0.0020709165127682284,
|
| 414 |
+
"perm_importance_std": 2.549011860464095e-05
|
| 415 |
+
},
|
| 416 |
+
{
|
| 417 |
+
"feature": "in_segdup",
|
| 418 |
+
"perm_importance_mean": 0.0009386532858458585,
|
| 419 |
+
"perm_importance_std": 1.750671402431846e-05
|
| 420 |
+
},
|
| 421 |
+
{
|
| 422 |
+
"feature": "flank_lowmap",
|
| 423 |
+
"perm_importance_mean": 0.0005812032061902617,
|
| 424 |
+
"perm_importance_std": 1.3028115550094254e-05
|
| 425 |
+
},
|
| 426 |
+
{
|
| 427 |
+
"feature": "repci_width_max",
|
| 428 |
+
"perm_importance_mean": 0.00026026760399893155,
|
| 429 |
+
"perm_importance_std": 1.492427417015547e-05
|
| 430 |
+
},
|
| 431 |
+
{
|
| 432 |
+
"feature": "inrepeat_reads",
|
| 433 |
+
"perm_importance_mean": 5.1632608300478114e-05,
|
| 434 |
+
"perm_importance_std": 5.166444569830962e-06
|
| 435 |
+
},
|
| 436 |
+
{
|
| 437 |
+
"feature": "is_pass",
|
| 438 |
+
"perm_importance_mean": 5.758337677796987e-08,
|
| 439 |
+
"perm_importance_std": 3.3427425855204445e-08
|
| 440 |
+
},
|
| 441 |
+
{
|
| 442 |
+
"feature": "motif_is_homopolymer",
|
| 443 |
+
"perm_importance_mean": 0.0,
|
| 444 |
+
"perm_importance_std": 0.0
|
| 445 |
+
}
|
| 446 |
+
]
|
| 447 |
+
}
|
| 448 |
+
}
|
sv_model_v13_parents.joblib → sv_calibrator.joblib
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:307cbb00688a5a3b5aa5caac1700cdbd1c2fdd5b4a7e08072a121a4f7526e2e5
|
| 3 |
+
size 7872
|
sv_config.json
CHANGED
|
@@ -1,62 +1,67 @@
|
|
| 1 |
{
|
| 2 |
-
"
|
| 3 |
"features": [
|
| 4 |
-
"
|
| 5 |
-
"gq",
|
| 6 |
-
"pr_ref",
|
| 7 |
-
"pr_alt",
|
| 8 |
-
"sr_ref",
|
| 9 |
-
"sr_alt",
|
| 10 |
-
"vf_ref",
|
| 11 |
-
"vf_alt",
|
| 12 |
-
"total_alt_support",
|
| 13 |
-
"sr_pr_ratio",
|
| 14 |
-
"vaf_estimate",
|
| 15 |
-
"strand_bias_fs",
|
| 16 |
-
"cipos_width",
|
| 17 |
-
"ciend_width",
|
| 18 |
-
"homlen",
|
| 19 |
-
"svlen_abs",
|
| 20 |
"svtype_DEL",
|
| 21 |
-
"svtype_INS",
|
| 22 |
"svtype_DUP",
|
| 23 |
"svtype_BND",
|
| 24 |
"is_imprecise",
|
| 25 |
-
"
|
| 26 |
-
"
|
| 27 |
],
|
| 28 |
-
"
|
| 29 |
-
|
| 30 |
-
|
| 31 |
},
|
| 32 |
-
"
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
"low_gq": {
|
| 39 |
-
"feature": "gq",
|
| 40 |
-
"op": "<",
|
| 41 |
-
"value": 15
|
| 42 |
-
},
|
| 43 |
-
"no_alt_support": {
|
| 44 |
-
"feature": "total_alt_support",
|
| 45 |
-
"op": "<=",
|
| 46 |
-
"value": 2
|
| 47 |
-
},
|
| 48 |
-
"low_vaf": {
|
| 49 |
-
"feature": "vaf_estimate",
|
| 50 |
-
"op": "<",
|
| 51 |
-
"value": 0.15
|
| 52 |
-
},
|
| 53 |
-
"imprecise": {
|
| 54 |
-
"feature": "is_imprecise",
|
| 55 |
-
"op": "==",
|
| 56 |
-
"value": 1
|
| 57 |
-
}
|
| 58 |
-
},
|
| 59 |
-
"sklearn_version_trained": "1.5.1",
|
| 60 |
"variant_class": "SV",
|
| 61 |
-
"
|
| 62 |
}
|
|
|
|
| 1 |
{
|
| 2 |
+
"release_version": "1.0",
|
| 3 |
+
"model_file": "sv_model.joblib",
|
| 4 |
+
"calibrator_file": "sv_calibrator.joblib",
|
| 5 |
+
"calibration": "isotonic regression on out-of-fold scores",
|
| 6 |
"features": [
|
| 7 |
+
"is_pass",
|
| 8 |
"svtype_DEL",
|
|
|
|
| 9 |
"svtype_DUP",
|
| 10 |
+
"svtype_INS",
|
| 11 |
+
"svtype_INV",
|
| 12 |
"svtype_BND",
|
| 13 |
+
"svlen_log",
|
| 14 |
+
"cipos_width",
|
| 15 |
+
"ciend_width",
|
| 16 |
"is_imprecise",
|
| 17 |
+
"pe_support",
|
| 18 |
+
"sr_support",
|
| 19 |
+
"total_support",
|
| 20 |
+
"vaf",
|
| 21 |
+
"gt_hom",
|
| 22 |
+
"gq",
|
| 23 |
+
"qual_norm",
|
| 24 |
+
"local_depth",
|
| 25 |
+
"gc_min",
|
| 26 |
+
"gc_max",
|
| 27 |
+
"entropy_min",
|
| 28 |
+
"microhom_max",
|
| 29 |
+
"in_segdup_either",
|
| 30 |
+
"in_segdup_both",
|
| 31 |
+
"in_difficult_either",
|
| 32 |
+
"in_difficult_both",
|
| 33 |
+
"in_lowmap_either",
|
| 34 |
+
"in_tandem_either",
|
| 35 |
+
"in_Alu_either",
|
| 36 |
+
"in_L1_either",
|
| 37 |
+
"in_SVA_either",
|
| 38 |
+
"in_LTR_either",
|
| 39 |
+
"frac_span_repeat",
|
| 40 |
+
"n_neighbors",
|
| 41 |
+
"nn_log_dist"
|
| 42 |
],
|
| 43 |
+
"n_features": 35,
|
| 44 |
+
"missing_sentinel": -99999.0,
|
| 45 |
+
"tiers": {
|
| 46 |
+
"HIGH": "CS>=0.70",
|
| 47 |
+
"MODERATE": "0.50<=CS<0.70",
|
| 48 |
+
"WARNING": "0.30<=CS<0.50",
|
| 49 |
+
"LOW": "CS<0.30"
|
| 50 |
},
|
| 51 |
+
"tier_edges": [
|
| 52 |
+
0.3,
|
| 53 |
+
0.5,
|
| 54 |
+
0.7
|
| 55 |
+
],
|
| 56 |
+
"score": "CS = isotonic-calibrated P(call concordant with long-read truth)",
|
| 57 |
"variant_class": "SV",
|
| 58 |
+
"sklearn_version_trained": "1.7.1",
|
| 59 |
+
"training": {
|
| 60 |
+
"cohort": "HPRC",
|
| 61 |
+
"n_samples": 208,
|
| 62 |
+
"n_train_rows": 2575116,
|
| 63 |
+
"cv": "5-fold GroupKFold by sample",
|
| 64 |
+
"oof_auroc": 0.9502,
|
| 65 |
+
"oof_auprc": 0.9512
|
| 66 |
+
}
|
| 67 |
}
|
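The `features`, `n_features`, and `missing_sentinel` fields in `sv_config.json` suggest how one scoring input row could be assembled. The following is a minimal sketch, assuming (the diff does not confirm this) that absent values are encoded with the sentinel and that columns must follow the listed order. `build_feature_row` is a hypothetical helper, not a function from `feature_builder.py`, and the config literal is abbreviated to three of the 35 features.

```python
import math

# Abbreviated stand-in for sv_config.json (the real file lists 35 features).
config = {
    "features": ["is_pass", "svlen_log", "gq"],
    "n_features": 3,
    "missing_sentinel": -99999.0,
}


def build_feature_row(record, config):
    """Return feature values in config order, using the sentinel for gaps."""
    names = config["features"]
    if len(names) != config["n_features"]:
        raise ValueError("feature list length disagrees with n_features")
    sentinel = config["missing_sentinel"]
    row = []
    for name in names:
        value = record.get(name)
        # Assumption: absent keys, None, and NaN all count as missing.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            row.append(sentinel)
        else:
            row.append(float(value))
    return row


# svlen_log is absent and gq is NaN, so both fall back to the sentinel.
row = build_feature_row({"is_pass": 1, "gq": float("nan")}, config)
```

Checking the list length against `n_features` up front is a cheap guard against a config and model that have drifted apart.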
sv_model.joblib
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6cafd7dd091a0c1f9ceb2629c8bd6315e4967c37c4bf8910fcfea27454be64e7
|
| 3 |
+
size 992442127
|
sv_model_meta.json
ADDED
|
@@ -0,0 +1,583 @@
|
| 1 |
+
{
|
| 2 |
+
"variant": "sv",
|
| 3 |
+
"feature_cols": [
|
| 4 |
+
"is_pass",
|
| 5 |
+
"svtype_DEL",
|
| 6 |
+
"svtype_DUP",
|
| 7 |
+
"svtype_INS",
|
| 8 |
+
"svtype_INV",
|
| 9 |
+
"svtype_BND",
|
| 10 |
+
"svlen_log",
|
| 11 |
+
"cipos_width",
|
| 12 |
+
"ciend_width",
|
| 13 |
+
"is_imprecise",
|
| 14 |
+
"pe_support",
|
| 15 |
+
"sr_support",
|
| 16 |
+
"total_support",
|
| 17 |
+
"vaf",
|
| 18 |
+
"gt_hom",
|
| 19 |
+
"gq",
|
| 20 |
+
"qual_norm",
|
| 21 |
+
"local_depth",
|
| 22 |
+
"gc_min",
|
| 23 |
+
"gc_max",
|
| 24 |
+
"entropy_min",
|
| 25 |
+
"microhom_max",
|
| 26 |
+
"in_segdup_either",
|
| 27 |
+
"in_segdup_both",
|
| 28 |
+
"in_difficult_either",
|
| 29 |
+
"in_difficult_both",
|
| 30 |
+
"in_lowmap_either",
|
| 31 |
+
"in_tandem_either",
|
| 32 |
+
"in_Alu_either",
|
| 33 |
+
"in_L1_either",
|
| 34 |
+
"in_SVA_either",
|
| 35 |
+
"in_LTR_either",
|
| 36 |
+
"frac_span_repeat",
|
| 37 |
+
"n_neighbors",
|
| 38 |
+
"nn_log_dist"
|
| 39 |
+
],
|
| 40 |
+
"n_features": 35,
|
| 41 |
+
"tier_edges": [
|
| 42 |
+
0.3,
|
| 43 |
+
0.5,
|
| 44 |
+
0.7
|
| 45 |
+
],
|
| 46 |
+
"tier_names": [
|
| 47 |
+
"LOW",
|
| 48 |
+
"Warning",
|
| 49 |
+
"Moderate",
|
| 50 |
+
"High"
|
| 51 |
+
],
|
| 52 |
+
"missing_sentinel": -99999.0,
|
| 53 |
+
"rf_params": {
|
| 54 |
+
"bootstrap": true,
|
| 55 |
+
"ccp_alpha": 0.0,
|
| 56 |
+
"class_weight": "balanced_subsample",
|
| 57 |
+
"criterion": "gini",
|
| 58 |
+
"max_depth": null,
|
| 59 |
+
"max_features": "sqrt",
|
| 60 |
+
"max_leaf_nodes": null,
|
| 61 |
+
"max_samples": null,
|
| 62 |
+
"min_impurity_decrease": 0.0,
|
| 63 |
+
"min_samples_leaf": 20,
|
| 64 |
+
"min_samples_split": 2,
|
| 65 |
+
"min_weight_fraction_leaf": 0.0,
|
| 66 |
+
"monotonic_cst": null,
|
| 67 |
+
"n_estimators": 400,
|
| 68 |
+
"n_jobs": -1,
|
| 69 |
+
"oob_score": false,
|
| 70 |
+
"random_state": 42,
|
| 71 |
+
"verbose": 0,
|
| 72 |
+
"warm_start": false
|
| 73 |
+
},
|
| 74 |
+
"n_train_rows": 2575116,
|
| 75 |
+
"n_samples": 208,
|
| 76 |
+
"qc": {
|
| 77 |
+
"label_rows_raw": 2782190,
|
| 78 |
+
"label_dist_raw": {
|
| 79 |
+
"concordant": 1530286,
|
| 80 |
+
"discordant": 1251904
|
| 81 |
+
},
|
| 82 |
+
"label_rows_usable": 2782190,
|
| 83 |
+
"ambiguous_keys_dropped": 9462,
|
| 84 |
+
"ambiguous_feat_rows": 18568,
|
| 85 |
+
"ambiguous_label_rows": 19114,
|
| 86 |
+
"dup_keys_feature": 18568,
|
| 87 |
+
"dup_keys_label": 19114,
|
| 88 |
+
"merged_rows": 2575116,
|
| 89 |
+
"match_rate_vs_labels": 0.9255715820989939,
|
| 90 |
+
"match_rate_vs_features": 1.0,
|
| 91 |
+
"class_balance": {
|
| 92 |
+
"concordant": 1511906,
|
| 93 |
+
"discordant": 1063210
|
| 94 |
+
},
|
| 95 |
+
"concordant_rate": 0.5871215121959554
|
| 96 |
+
},
|
| 97 |
+
"importances": {
|
| 98 |
+
"impurity": [
|
| 99 |
+
{
|
| 100 |
+
"feature": "svlen_log",
|
| 101 |
+
"impurity_importance": 0.14231107100051552
|
| 102 |
+
},
|
| 103 |
+
{
|
| 104 |
+
"feature": "svtype_BND",
|
| 105 |
+
"impurity_importance": 0.13513441637259968
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"feature": "nn_log_dist",
|
| 109 |
+
"impurity_importance": 0.06888043456708229
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"feature": "svtype_DUP",
|
| 113 |
+
"impurity_importance": 0.05967552431932549
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"feature": "cipos_width",
|
| 117 |
+
"impurity_importance": 0.05563805020798381
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"feature": "svtype_DEL",
|
| 121 |
+
"impurity_importance": 0.053612638587210104
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"feature": "sr_support",
|
| 125 |
+
"impurity_importance": 0.04950031891627467
|
| 126 |
+
},
|
| 127 |
+
{
|
| 128 |
+
"feature": "vaf",
|
| 129 |
+
"impurity_importance": 0.04822805852199238
|
| 130 |
+
},
|
| 131 |
+
{
|
| 132 |
+
"feature": "qual_norm",
|
| 133 |
+
"impurity_importance": 0.04321008187665855
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"feature": "svtype_INS",
|
| 137 |
+
"impurity_importance": 0.03671741257598638
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"feature": "ciend_width",
|
| 141 |
+
"impurity_importance": 0.027655467190250325
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"feature": "local_depth",
|
| 145 |
+
"impurity_importance": 0.024667786835611386
|
| 146 |
+
},
|
| 147 |
+
{
|
| 148 |
+
"feature": "microhom_max",
|
| 149 |
+
"impurity_importance": 0.023198612318248754
|
| 150 |
+
},
|
| 151 |
+
{
|
| 152 |
+
"feature": "is_imprecise",
|
| 153 |
+
"impurity_importance": 0.02244493530457544
|
| 154 |
+
},
|
| 155 |
+
{
|
| 156 |
+
"feature": "frac_span_repeat",
|
| 157 |
+
"impurity_importance": 0.02223685870094091
|
| 158 |
+
},
|
| 159 |
+
{
|
| 160 |
+
"feature": "entropy_min",
|
| 161 |
+
"impurity_importance": 0.02149966456515826
|
| 162 |
+
},
|
| 163 |
+
{
|
| 164 |
+
"feature": "pe_support",
|
| 165 |
+
"impurity_importance": 0.018807543132767727
|
| 166 |
+
},
|
| 167 |
+
{
|
| 168 |
+
"feature": "gc_min",
|
| 169 |
+
"impurity_importance": 0.018609267758191137
|
| 170 |
+
},
|
| 171 |
+
{
|
| 172 |
+
"feature": "gq",
|
| 173 |
+
"impurity_importance": 0.017999043161167707
|
| 174 |
+
},
|
| 175 |
+
{
|
| 176 |
+
"feature": "gc_max",
|
| 177 |
+
"impurity_importance": 0.01691329606031783
|
| 178 |
+
},
|
| 179 |
+
{
|
| 180 |
+
"feature": "total_support",
|
| 181 |
+
"impurity_importance": 0.01639193545906872
|
| 182 |
+
},
|
| 183 |
+
{
|
| 184 |
+
"feature": "gt_hom",
|
| 185 |
+
"impurity_importance": 0.014227746565587479
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"feature": "n_neighbors",
|
| 189 |
+
"impurity_importance": 0.013414404592739407
|
| 190 |
+
},
|
| 191 |
+
{
|
| 192 |
+
"feature": "in_difficult_both",
|
| 193 |
+
"impurity_importance": 0.01186932037949349
|
| 194 |
+
},
|
| 195 |
+
{
|
| 196 |
+
"feature": "is_pass",
|
| 197 |
+
"impurity_importance": 0.011331347970027637
|
| 198 |
+
},
|
| 199 |
+
{
|
| 200 |
+
"feature": "in_tandem_either",
|
| 201 |
+
"impurity_importance": 0.006678373764632643
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
"feature": "in_lowmap_either",
|
| 205 |
+
"impurity_importance": 0.004352768263433798
|
| 206 |
+
},
|
| 207 |
+
{
|
| 208 |
+
"feature": "in_Alu_either",
|
| 209 |
+
"impurity_importance": 0.004250882794571496
|
| 210 |
+
},
|
| 211 |
+
{
|
| 212 |
+
"feature": "in_difficult_either",
|
| 213 |
+
"impurity_importance": 0.004067397992242813
|
| 214 |
+
},
|
| 215 |
+
{
|
| 216 |
+
"feature": "in_segdup_either",
|
| 217 |
+
"impurity_importance": 0.001790844691729305
|
| 218 |
+
},
|
| 219 |
+
{
|
| 220 |
+
"feature": "in_segdup_both",
|
| 221 |
+
"impurity_importance": 0.001567017346771133
|
| 222 |
+
},
|
| 223 |
+
{
|
| 224 |
+
"feature": "in_L1_either",
|
| 225 |
+
"impurity_importance": 0.0014931685060102253
|
| 226 |
+
},
|
| 227 |
+
{
|
| 228 |
+
"feature": "in_LTR_either",
|
| 229 |
+
"impurity_importance": 0.001243669659616864
|
| 230 |
+
},
|
| 231 |
+
{
|
| 232 |
+
"feature": "in_SVA_either",
|
| 233 |
+
"impurity_importance": 0.00038064004121650836
|
| 234 |
+
},
|
| 235 |
+
{
|
| 236 |
+
"feature": "svtype_INV",
|
| 237 |
+
"impurity_importance": 0.0
|
| 238 |
+
}
|
| 239 |
+
],
|
| 240 |
+
"permutation": [
|
| 241 |
+
{
|
| 242 |
+
"feature": "svlen_log",
|
| 243 |
+
"perm_importance_mean": 0.04194058803806115,
|
| 244 |
+
"perm_importance_std": 0.0007089186317830076
|
| 245 |
+
},
|
| 246 |
+
{
|
| 247 |
+
"feature": "nn_log_dist",
|
| 248 |
+
"perm_importance_mean": 0.019778079687185944,
|
| 249 |
+
"perm_importance_std": 0.0002553374546546104
|
| 250 |
+
},
|
| 251 |
+
{
|
| 252 |
+
"feature": "svtype_DEL",
|
| 253 |
+
"perm_importance_mean": 0.018689927317770305,
|
| 254 |
+
"perm_importance_std": 0.0002501511162311443
|
| 255 |
+
},
|
| 256 |
+
{
|
| 257 |
+
"feature": "cipos_width",
|
| 258 |
+
"perm_importance_mean": 0.017400205941047363,
|
| 259 |
+
"perm_importance_std": 0.00021672163272084185
|
| 260 |
+
},
|
| 261 |
+
{
|
| 262 |
+
"feature": "svtype_BND",
|
| 263 |
+
"perm_importance_mean": 0.015007739432103828,
|
| 264 |
+
"perm_importance_std": 0.00026380208237979455
|
| 265 |
+
},
|
| 266 |
+
{
|
| 267 |
+
"feature": "qual_norm",
|
| 268 |
+
"perm_importance_mean": 0.013299944084231186,
|
| 269 |
+
"perm_importance_std": 0.00013420815060181905
|
| 270 |
+
},
|
| 271 |
+
{
|
| 272 |
+
"feature": "gc_min",
|
| 273 |
+
"perm_importance_mean": 0.012263946411167393,
|
| 274 |
+
"perm_importance_std": 0.0001367518392425017
|
| 275 |
+
},
|
| 276 |
+
{
|
| 277 |
+
"feature": "entropy_min",
|
| 278 |
+
"perm_importance_mean": 0.01175411732205851,
|
| 279 |
+
"perm_importance_std": 0.00010827028744123835
|
| 280 |
+
},
|
| 281 |
+
{
|
| 282 |
+
"feature": "pe_support",
|
| 283 |
+
"perm_importance_mean": 0.011431666717043187,
|
| 284 |
+
"perm_importance_std": 0.0002522902431662052
|
| 285 |
+
},
|
| 286 |
+
{
|
| 287 |
+
"feature": "vaf",
|
| 288 |
+
"perm_importance_mean": 0.010811854209153338,
|
| 289 |
+
"perm_importance_std": 0.000206278576011294
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"feature": "svtype_DUP",
|
| 293 |
+
"perm_importance_mean": 0.010256370264851777,
|
| 294 |
+
"perm_importance_std": 0.00011705190707458093
|
| 295 |
+
},
|
| 296 |
+
{
|
| 297 |
+
"feature": "frac_span_repeat",
|
| 298 |
+
"perm_importance_mean": 0.009963778341341434,
|
| 299 |
+
"perm_importance_std": 0.0003054705160430916
|
| 300 |
+
},
|
| 301 |
+
{
|
| 302 |
+
"feature": "local_depth",
|
| 303 |
+
"perm_importance_mean": 0.009591312749510661,
|
| 304 |
+
"perm_importance_std": 9.987579143240777e-05
|
| 305 |
+
},
|
| 306 |
+
{
|
| 307 |
+
"feature": "gc_max",
|
| 308 |
+
"perm_importance_mean": 0.009472834004097907,
|
| 309 |
+
"perm_importance_std": 3.73573580102502e-05
|
| 310 |
+
},
|
| 311 |
+
{
|
| 312 |
+
"feature": "gq",
|
| 313 |
+
"perm_importance_mean": 0.009203283859545164,
|
| 314 |
+
"perm_importance_std": 0.00010130110270414088
|
| 315 |
+
},
|
| 316 |
+
{
|
| 317 |
+
"feature": "sr_support",
|
| 318 |
+
"perm_importance_mean": 0.008346937601715121,
|
| 319 |
+
"perm_importance_std": 9.766321902716232e-05
|
| 320 |
+
},
|
| 321 |
+
{
|
| 322 |
+
"feature": "microhom_max",
|
| 323 |
+
"perm_importance_mean": 0.0074945231035745685,
|
| 324 |
+
"perm_importance_std": 3.858741211472431e-05
|
| 325 |
+
},
|
| 326 |
+
{
|
| 327 |
+
"feature": "svtype_INS",
|
| 328 |
+
"perm_importance_mean": 0.007191278115888,
|
| 329 |
+
"perm_importance_std": 6.25382140306982e-05
|
| 330 |
+
},
|
| 331 |
+
{
|
| 332 |
+
"feature": "total_support",
|
| 333 |
+
"perm_importance_mean": 0.0071782291459881135,
|
| 334 |
+
"perm_importance_std": 9.233685802608801e-05
|
| 335 |
+
},
|
| 336 |
+
{
|
| 337 |
+
"feature": "is_pass",
|
| 338 |
+
"perm_importance_mean": 0.006096510294509483,
|
| 339 |
+
"perm_importance_std": 0.00018164506103297094
|
| 340 |
+
},
|
| 341 |
+
{
|
| 342 |
+
"feature": "n_neighbors",
|
| 343 |
+
"perm_importance_mean": 0.005750802897597151,
|
| 344 |
+
"perm_importance_std": 5.2407508065749236e-05
|
| 345 |
+
},
|
| 346 |
+
{
|
| 347 |
+
"feature": "in_difficult_both",
|
| 348 |
+
"perm_importance_mean": 0.005015233708107925,
|
| 349 |
+
"perm_importance_std": 0.00018000694143725676
|
| 350 |
+
},
|
| 351 |
+
{
|
| 352 |
+
"feature": "ciend_width",
|
| 353 |
+
"perm_importance_mean": 0.004891217221742616,
|
| 354 |
+
"perm_importance_std": 8.693371224059806e-05
|
| 355 |
+
},
|
| 356 |
+
{
|
| 357 |
+
"feature": "in_tandem_either",
|
| 358 |
+
"perm_importance_mean": 0.0043522978742952965,
|
| 359 |
+
"perm_importance_std": 0.00013096877220331791
|
| 360 |
+
},
|
| 361 |
+
{
|
| 362 |
+
"feature": "gt_hom",
|
| 363 |
+
"perm_importance_mean": 0.00323902471619224,
|
| 364 |
+
"perm_importance_std": 6.580316549205161e-05
|
| 365 |
+
},
|
| 366 |
+
{
|
| 367 |
+
"feature": "in_lowmap_either",
|
| 368 |
+
"perm_importance_mean": 0.002848785493209416,
|
| 369 |
+
"perm_importance_std": 6.035487488575199e-05
|
| 370 |
+
},
|
| 371 |
+
{
|
| 372 |
+
"feature": "in_Alu_either",
|
| 373 |
+
"perm_importance_mean": 0.002534492148327061,
|
| 374 |
+
"perm_importance_std": 8.942851729207139e-05
|
| 375 |
+
},
|
| 376 |
+
{
|
| 377 |
+
"feature": "in_difficult_either",
|
| 378 |
+
"perm_importance_mean": 0.002091988241603948,
|
| 379 |
+
"perm_importance_std": 5.002148003217409e-05
|
| 380 |
+
},
|
| 381 |
+
{
|
| 382 |
+
"feature": "is_imprecise",
|
| 383 |
+
"perm_importance_mean": 0.001979962861476592,
|
| 384 |
+
"perm_importance_std": 6.0060264184685146e-05
|
| 385 |
+
},
|
| 386 |
+
{
|
| 387 |
+
"feature": "in_L1_either",
|
| 388 |
+
"perm_importance_mean": 0.0011150058316057754,
|
| 389 |
+
"perm_importance_std": 2.43935928349277e-05
|
| 390 |
+
},
|
| 391 |
+
{
|
| 392 |
+
"feature": "in_LTR_either",
|
| 393 |
+
"perm_importance_mean": 0.0006375153501665843,
|
| 394 |
+
"perm_importance_std": 3.5907563733425047e-05
|
| 395 |
+
},
|
| 396 |
+
{
|
| 397 |
+
"feature": "in_segdup_either",
|
| 398 |
+
"perm_importance_mean": 0.0006168866779678206,
|
| 399 |
+
"perm_importance_std": 2.936960388188349e-05
|
| 400 |
+
},
|
| 401 |
+
{
|
| 402 |
+
"feature": "in_segdup_both",
|
| 403 |
+
"perm_importance_mean": 0.0005383371585652164,
|
| 404 |
+
"perm_importance_std": 3.300168720103858e-05
|
| 405 |
+
},
|
| 406 |
+
{
|
| 407 |
+
"feature": "in_SVA_either",
|
| 408 |
+
"perm_importance_mean": 0.00017570394306039018,
|
| 409 |
+
"perm_importance_std": 3.211137729753612e-06
|
| 410 |
+
},
|
| 411 |
+
{
|
| 412 |
+
"feature": "svtype_INV",
|
| 413 |
+
"perm_importance_mean": 0.0,
|
| 414 |
+
"perm_importance_std": 0.0
|
| 415 |
+
}
|
| 416 |
+
]
|
| 417 |
+
},
|
| 418 |
+
"finalized_unix": 1782044045,
|
| 419 |
+
"cv_report": {
|
| 420 |
+
"overall": {
|
| 421 |
+
"n": 2575116,
|
| 422 |
+
"pos_rate": 0.5871215121959554,
|
| 423 |
+
"auroc": 0.9501778843150939,
|
| 424 |
+
"auprc": 0.9511739760033938,
|
| 425 |
+
"brier": 0.07706030782330338,
|
| 426 |
+
"logloss": 0.2628003224452437
|
| 427 |
+
},
|
| 428 |
+
"calibration": [
|
| 429 |
+
{
|
| 430 |
+
"bin": "[0.0,0.1)",
|
| 431 |
+
"n": 701606,
|
| 432 |
+
"mean_pred": 0.018278732155845624,
|
| 433 |
+
"obs_rate": 0.008842569761376044
|
| 434 |
+
},
|
| 435 |
+
{
|
| 436 |
+
"bin": "[0.1,0.2)",
|
| 437 |
+
"n": 77383,
|
| 438 |
+
"mean_pred": 0.1443051591972469,
|
| 439 |
+
"obs_rate": 0.10029334608376517
|
| 440 |
+
},
|
| 441 |
+
{
|
| 442 |
+
"bin": "[0.2,0.3)",
|
| 443 |
+
"n": 53690,
|
| 444 |
+
"mean_pred": 0.24934861469329547,
|
| 445 |
+
"obs_rate": 0.19802570311044887
|
| 446 |
+
},
|
| 447 |
+
{
|
| 448 |
+
"bin": "[0.3,0.4)",
|
| 449 |
+
"n": 57370,
|
| 450 |
+
"mean_pred": 0.35138865912279116,
|
| 451 |
+
"obs_rate": 0.31784905002614605
|
| 452 |
+
},
|
| 453 |
+
{
|
| 454 |
+
"bin": "[0.4,0.5)",
|
| 455 |
+
"n": 70890,
|
| 456 |
+
"mean_pred": 0.4521905700361762,
|
| 457 |
+
"obs_rate": 0.45171392297926366
|
| 458 |
+
},
|
| 459 |
+
{
|
| 460 |
+
"bin": "[0.5,0.6)",
|
| 461 |
+
"n": 98358,
|
| 462 |
+
"mean_pred": 0.5530042682672152,
|
| 463 |
+
"obs_rate": 0.5941153744484434
|
| 464 |
+
},
|
| 465 |
+
{
|
| 466 |
+
"bin": "[0.6,0.7)",
|
| 467 |
+
"n": 158080,
|
| 468 |
+
"mean_pred": 0.6545151929144646,
|
| 469 |
+
"obs_rate": 0.7394863360323887
|
| 470 |
+
},
|
| 471 |
+
{
|
| 472 |
+
"bin": "[0.7,0.8)",
|
| 473 |
+
"n": 264297,
|
| 474 |
+
"mean_pred": 0.7534285409368959,
|
| 475 |
+
"obs_rate": 0.8542737904705691
|
| 476 |
+
},
|
| 477 |
+
{
|
| 478 |
+
"bin": "[0.8,0.9)",
|
| 479 |
+
"n": 407540,
|
| 480 |
+
"mean_pred": 0.8554391726312989,
|
| 481 |
+
"obs_rate": 0.922547480001963
|
| 482 |
+
},
|
| 483 |
+
{
|
| 484 |
+
"bin": "[0.9,1.0)",
|
| 485 |
+
"n": 685902,
|
| 486 |
+
"mean_pred": 0.9410329493878745,
|
| 487 |
+
"obs_rate": 0.962179728299378
|
| 488 |
+
}
|
| 489 |
+
],
|
| 490 |
+
"per_sample_auroc": {
|
| 491 |
+
"n_samples": 208,
|
| 492 |
+
"median": 0.9812468623755227,
|
| 493 |
+
"p25": 0.9541872315128539,
|
| 494 |
+
"p75": 0.9849148282479339,
|
| 495 |
+
"min": 0.8241255348590714,
|
| 496 |
+
"max": 0.9884727126524584
|
| 497 |
+
},
|
| 498 |
+
"by_svtype": {
|
| 499 |
+
"BND": {
|
| 500 |
+
"n": 545562,
|
| 501 |
+
"pos_rate": 0.05967241120165994,
|
| 502 |
+
"auroc": 0.9578724482896165,
|
| 503 |
+
"auprc": 0.7205291737443336,
|
| 504 |
+
"brier": 0.03197870532515856,
|
| 505 |
+
"logloss": 0.11268068083789429
|
| 506 |
+
},
|
| 507 |
+
"DEL": {
|
| 508 |
+
"n": 1168706,
|
| 509 |
+
"pos_rate": 0.7785747655954535,
|
| 510 |
+
"auroc": 0.8919672267722084,
|
| 511 |
+
"auprc": 0.9557196435784474,
|
| 512 |
+
"brier": 0.0922761808544916,
|
| 513 |
+
"logloss": 0.3149783791403593
|
| 514 |
+
},
|
| 515 |
+
"DUP": {
|
| 516 |
+
"n": 156534,
|
| 517 |
+
"pos_rate": 0.023419832113151136,
|
| 518 |
+
"auroc": 0.9936692829891886,
|
| 519 |
+
"auprc": 0.8545445334945798,
|
| 520 |
+
"brier": 0.009443816981288416,
|
| 521 |
+
"logloss": 0.03587112293286734
|
| 522 |
+
},
|
| 523 |
+
"INS": {
|
| 524 |
+
"n": 704314,
|
| 525 |
+
"pos_rate": 0.8032780833548673,
|
| 526 |
+
"auroc": 0.8549802852596324,
|
| 527 |
+
"auprc": 0.9494948614191011,
|
| 528 |
+
"brier": 0.1017598124373945,
|
| 529 |
+
"logloss": 0.3429363119373415
|
| 530 |
+
}
|
| 531 |
+
},
|
| 532 |
+
"by_size": {
|
| 533 |
+
"1-10kb": {
|
| 534 |
+
"n": 171605,
|
| 535 |
+
"pos_rate": 0.7607121004632732,
|
| 536 |
+
"auroc": 0.9296938425020418,
|
| 537 |
+
"auprc": 0.967200346507,
|
| 538 |
+
"brier": 0.06806650111282099,
|
| 539 |
+
"logloss": 0.2455626050274475
|
| 540 |
+
},
|
| 541 |
+
"10-100kb": {
|
| 542 |
+
"n": 42645,
|
| 543 |
+
"pos_rate": 0.21177160276703014,
|
| 544 |
+
"auroc": 0.9920116885561145,
|
| 545 |
+
"auprc": 0.9676826349150152,
|
| 546 |
+
"brier": 0.029729192191828648,
|
| 547 |
+
"logloss": 0.10986227030572825
|
| 548 |
+
},
|
| 549 |
+
"100bp-1kb": {
|
| 550 |
+
"n": 906987,
|
| 551 |
+
"pos_rate": 0.7867466678133204,
|
| 552 |
+
"auroc": 0.8976865414893258,
|
| 553 |
+
"auprc": 0.9590485952985468,
|
| 554 |
+
"brier": 0.08302330275831495,
|
| 555 |
+
"logloss": 0.28853539649876037
|
| 556 |
+
},
|
| 557 |
+
"<100bp": {
|
| 558 |
+
"n": 733616,
|
| 559 |
+
"pos_rate": 0.7488359032518375,
|
| 560 |
+
"auroc": 0.867906684781433,
|
| 561 |
+
"auprc": 0.93767984701013,
|
| 562 |
+
"brier": 0.11261501240225187,
|
| 563 |
+
"logloss": 0.3687373180284347
|
| 564 |
+
},
|
| 565 |
+
">100kb": {
|
| 566 |
+
"n": 54796,
|
| 567 |
+
"pos_rate": 0.00839477334111979,
|
| 568 |
+
"auroc": 0.9897200830900804,
|
| 569 |
+
"auprc": 0.788119068363777,
|
| 570 |
+
"brier": 0.004907638846204342,
|
| 571 |
+
"logloss": 0.029204468344063542
|
| 572 |
+
},
|
| 573 |
+
"NA": {
|
| 574 |
+
"n": 665467,
|
| 575 |
+
"pos_rate": 0.16371360262792894,
|
| 576 |
+
"auroc": 0.976441017719166,
|
| 577 |
+
"auprc": 0.9070303608956529,
|
| 578 |
+
"brier": 0.04103092730468184,
|
| 579 |
+
"logloss": 0.1444199784010817
|
| 580 |
+
}
|
| 581 |
+
}
|
| 582 |
+
}
|
| 583 |
+
}
|
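The `cv_report.calibration` bins in `sv_model_meta.json` (`n`, `mean_pred`, `obs_rate`) can be condensed into a single expected calibration error, the count-weighted mean of the absolute gap between predicted and observed rates. This is a sketch rather than anything the repository ships: `expected_calibration_error` is a hypothetical helper, and only three of the ten bins are copied here, with values rounded. The diff does not say whether these bins describe raw or isotonic-calibrated scores. The overall Brier of about 0.0771 matches the pre-calibration figure quoted in `tier_thresholds.json`, which hints at raw scores.

```python
def expected_calibration_error(bins):
    """Count-weighted mean of |mean_pred - obs_rate| across bins."""
    total = sum(b["n"] for b in bins)
    if total == 0:
        raise ValueError("bins contain no observations")
    gap = sum(b["n"] * abs(b["mean_pred"] - b["obs_rate"]) for b in bins)
    return gap / total


# Three of the ten bins from cv_report.calibration, values rounded.
bins = [
    {"bin": "[0.0,0.1)", "n": 701606, "mean_pred": 0.0183, "obs_rate": 0.0088},
    {"bin": "[0.4,0.5)", "n": 70890, "mean_pred": 0.4522, "obs_rate": 0.4517},
    {"bin": "[0.9,1.0)", "n": 685902, "mean_pred": 0.9410, "obs_rate": 0.9622},
]
ece = expected_calibration_error(bins)
```

Running the same function over all ten bins, and over the per-type or per-size subsets if bin tables are available for them, would show where the scores are least trustworthy.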
tier_thresholds.json
CHANGED
|
@@ -1,11 +1,12 @@
|
|
| 1 |
{
|
|
|
|
| 2 |
"tiers": {
|
| 3 |
-
"HIGH": "CS
|
| 4 |
-
"MODERATE": "0.50
|
| 5 |
-
"WARNING": "0.30
|
| 6 |
-
"LOW": "CS
|
| 7 |
},
|
| 8 |
-
"
|
| 9 |
-
"
|
| 10 |
-
"note": "
|
| 11 |
}
|
|
|
|
| 1 |
{
|
| 2 |
+
"release_version": "1.0",
|
| 3 |
"tiers": {
|
| 4 |
+
"HIGH": "CS>=0.70",
|
| 5 |
+
"MODERATE": "0.50<=CS<0.70",
|
| 6 |
+
"WARNING": "0.30<=CS<0.50",
|
| 7 |
+
"LOW": "CS<0.30"
|
| 8 |
},
|
| 9 |
+
"score": "CS = isotonic-calibrated probability of concordance with long-read truth",
|
| 10 |
+
"calibration": "isotonic on OOF; SV Brier 0.0771->0.0744, STR 0.1673->0.1589",
|
| 11 |
+
"note": "Tiers are buckets of the calibrated CS. HIGH is the candidate-triage tier."
|
| 12 |
}
|
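The tier definitions in `tier_thresholds.json` are half-open intervals over the calibrated score, with cut points matching the `tier_edges` list `[0.3, 0.5, 0.7]` in `sv_config.json`. A minimal sketch of that mapping follows, assuming the CS value is already calibrated and lies in [0, 1]. `assign_tier` is a hypothetical name and may differ from whatever `score_svstr.py` actually does.

```python
from bisect import bisect_right

# Cut points and labels taken from sv_config.json / tier_thresholds.json.
TIER_EDGES = [0.3, 0.5, 0.7]
TIER_NAMES = ["LOW", "WARNING", "MODERATE", "HIGH"]


def assign_tier(cs):
    """Map a calibrated score to its tier; lower bounds are inclusive."""
    if not 0.0 <= cs <= 1.0:  # also rejects NaN
        raise ValueError(f"CS must be in [0, 1], got {cs!r}")
    # bisect_right counts the edges <= cs, so cs == 0.70 lands in HIGH.
    return TIER_NAMES[bisect_right(TIER_EDGES, cs)]


tiers = [assign_tier(cs) for cs in (0.10, 0.30, 0.50, 0.70)]
```

`bisect_right` is used rather than `bisect_left` so that a score exactly on an edge falls into the upper tier, which is what `CS>=0.70` and `0.50<=CS<0.70` require.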