---
license: mit
library_name: sklearn
tags:
- genomics
- structural-variants
- short-tandem-repeats
- variant-calling
- confidence-calibration
- random-forest
pipeline_tag: tabular-classification
---
# SVSTR-Score (v1.0): long-read-guided confidence scoring for short-read SV/STR calls

SVSTR-Score provides two models, one per variant class, each a **RandomForest + isotonic calibrator**.
They assign a **calibrated confidence score** `CS ∈ [0,1]` and one of four tiers to
each short-read **structural-variant (SV)** or **short-tandem-repeat (STR)** call.
The long-read genotype is used **only to build the training label**, so inference
needs short-read caller output only.

`CS = isotonic_calibrator( RandomForest.predict_proba(X)[:, 1] )`, which estimates P(the
short-read call is concordant with the long-read truth).
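
The formula above can be sketched end to end on synthetic data. This is a minimal illustration, not the released weights or the authors' training procedure: the feature matrix is random, the calibrator is assumed to be a scikit-learn `IsotonicRegression` fitted on held-out forest probabilities, and the names `forest`, `calibrator`, and `confidence_score` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table; label 1 plays the role of
# "short-read call is concordant with the long-read truth".
X, y = make_classification(n_samples=2000, n_features=35, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)

# Assumption: the calibrator maps raw forest probabilities to calibrated ones
# and is fitted on data the forest has not seen.
raw_cal = forest.predict_proba(X_cal)[:, 1]
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cal, y_cal)


def confidence_score(features: np.ndarray) -> np.ndarray:
    """CS = isotonic_calibrator(RandomForest.predict_proba(X)[:, 1])."""
    raw = forest.predict_proba(features)[:, 1]
    return calibrator.predict(raw)


cs = confidence_score(X_cal[:5])
```

Because isotonic regression is a non-decreasing step function, calibration can merge nearby raw scores into ties but never reverses their order.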

| | SV | STR |
|---|---|---|
| weights | `sv_model.joblib` + `sv_calibrator.joblib` | `str_model.joblib` + `str_calibrator.joblib` |
| features | 35 (`feature_builder.py`) | 21 |
| short-read caller | Manta | ExpansionHunter |
| long-read truth (label only) | sawfish | TRGT |
| training cohort | 208 HPRC paired genomes | 208 HPRC |

## Tiers

`HIGH CS≥0.70` · `MODERATE 0.50–0.70` · `WARNING 0.30–0.50` · `LOW <0.30`.
Tiers are buckets of the calibrated CS (no heuristic overrides). **HIGH** is the
candidate-triage tier.
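
The tier cut-offs can be written as a small lookup. This is a sketch only: the function name `assign_tier` is hypothetical, and treating each lower bound as inclusive (so that exactly 0.50 is MODERATE and exactly 0.30 is WARNING) is an assumption extrapolated from `CS≥0.70` and `<0.30`.

```python
def assign_tier(cs: float) -> str:
    """Map a calibrated confidence score to one of the four tiers."""
    if not 0.0 <= cs <= 1.0:  # also rejects NaN
        raise ValueError(f"CS must lie in [0, 1], got {cs!r}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:  # assumed inclusive lower bound
        return "MODERATE"
    if cs >= 0.30:  # assumed inclusive lower bound
        return "WARNING"
    return "LOW"


tiers = [assign_tier(cs) for cs in (0.92, 0.55, 0.31, 0.05)]
```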
## Intended use

Triage or down-weight *emitted* short-read SV and STR calls by their probability of
matching long-read truth, **without long-read sequencing every sample**. SVSTR-Score is not a
variant caller: it does not recover missed variants and does not replace long-read
sequencing for complete discovery.
## Performance

**Internal: 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes**

| | AUROC | AUPRC | per-sample AUROC median |
|---|---|---|---|
| SV | 0.950 | 0.951 | 0.981 |
| STR | 0.834 | 0.886 | 0.835 |
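
For readers unfamiliar with grouped cross-validation, the evaluation scheme named above can be sketched as follows. Everything here is synthetic and illustrative: the data, the 20 hypothetical sample IDs, and the forest settings are assumptions, and average precision is used as the AUPRC estimate. The key point is that `GroupKFold` keeps all calls from one sample in the same fold, so no sample contributes to both training and its own out-of-fold predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GroupKFold, cross_val_predict

# Synthetic calls: 20 hypothetical samples with 100 calls each.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sample_ids = np.repeat(np.arange(20), 100)

# Out-of-fold probabilities: each call is scored by a model that never saw
# any call from the same sample.
oof = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    groups=sample_ids,
    cv=GroupKFold(n_splits=5),
    method="predict_proba",
)[:, 1]

pooled_auroc = roc_auc_score(y, oof)
pooled_auprc = average_precision_score(y, oof)

# Per-sample AUROC, skipping samples that contain only one class.
per_sample = []
for sid in np.unique(sample_ids):
    mask = sample_ids == sid
    if len(np.unique(y[mask])) == 2:
        per_sample.append(roc_auc_score(y[mask], oof[mask]))
median_auroc = float(np.median(per_sample))
```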

**External: 295 ASD genomes, applied unchanged (no retraining)**

| | AUROC | observed concordance (LOW / WARNING / MODERATE / HIGH) |
|---|---|---|
| SV | 0.891 | 12% / 35% / 50% / **86%** |
| STR | 0.831 | 16% / 40% / 60% / **85%** |

Observed concordance rises monotonically from LOW to HIGH on the external cohort, so the
calibrated tiers remain meaningful there; isotonic calibration improves the Brier score
(SV 0.077→0.074, STR 0.167→0.159).
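
The two calibration checks reported above, Brier score and observed concordance per tier, can be computed as in the sketch below. The scores and labels are simulated (labels are drawn so that the scores are well calibrated by construction), and the tier edges assume inclusive lower bounds; none of the numbers it produces relate to the cohorts in this card.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Simulated calibrated scores and labels: a call with score p is
# concordant with probability p.
cs = rng.uniform(0.0, 1.0, size=5000)
labels = (rng.uniform(0.0, 1.0, size=5000) < cs).astype(int)

# Brier score: mean squared difference between score and outcome (lower is better).
brier = brier_score_loss(labels, cs)

# np.digitize with increasing edges returns i such that edges[i-1] <= x < edges[i],
# giving 0 = LOW, 1 = WARNING, 2 = MODERATE, 3 = HIGH.
tier_names = ["LOW", "WARNING", "MODERATE", "HIGH"]
tier_index = np.digitize(cs, [0.30, 0.50, 0.70])

# Observed concordance: fraction of concordant calls inside each tier.
concordance = {
    name: float(labels[tier_index == i].mean()) for i, name in enumerate(tier_names)
}
```

If the tiers are informative, `concordance` increases from LOW to HIGH, which is the property the external table reports.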
## Usage

```bash
pip install -U huggingface_hub scikit-learn==1.7.1
hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
# Build features from a single-sample VCF, then score them:
python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
  --fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
```

Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score
## Limitations

- Relies on **caller support / breakpoint-confidence fields** (PR/SR, CIPOS/CIEND,
  VAF, GQ, depth). On **merged or heavily filtered call sets that drop these**,
  scores deflate and tiers are unreliable (rank discrimination degrades only
  moderately, but calibration breaks).
- Strongest for **DEL/INS**; DUP/BND are mostly down-weighted/triaged.
- STR scoring is **bound to the caller's genotyped locus set**.
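
Given the first limitation, it may be worth checking a VCF for the support fields before scoring it. The sketch below is a hypothetical pre-flight check, not part of the SVSTR-Score scripts: it assumes the usual Manta layout (CIPOS/CIEND in INFO; PR, SR and GQ in FORMAT) and reports the fraction of records carrying each field. A field can legitimately be absent from individual records, so only near-zero coverage across a whole file suggests that merging or filtering stripped it.

```python
import gzip
from collections import Counter

INFO_KEYS = ("CIPOS", "CIEND")      # assumed to live in the INFO column
FORMAT_KEYS = ("PR", "SR", "GQ")    # assumed to live in the FORMAT column


def field_coverage(lines):
    """Return the fraction of VCF records that carry each support field."""
    counts = Counter()
    n_records = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        n_records += 1
        info_keys = {item.split("=", 1)[0] for item in cols[7].split(";")}
        format_keys = set(cols[8].split(":")) if len(cols) > 8 else set()
        for key in INFO_KEYS:
            counts[key] += key in info_keys
        for key in FORMAT_KEYS:
            counts[key] += key in format_keys
    if n_records == 0:
        return {}
    return {key: counts[key] / n_records for key in INFO_KEYS + FORMAT_KEYS}


def field_coverage_from_path(path):
    """Same check for a plain or gzip-compressed VCF on disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return field_coverage(handle)


# Two made-up records: one with full caller output, one stripped by merging.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n",
    "chr1\t1000\tcall_1\tN\t<DEL>\t300\tPASS\t"
    "END=2000;SVTYPE=DEL;CIPOS=-10,10;CIEND=-5,5\t"
    "GT:FT:GQ:PL:PR:SR\t0/1:PASS:99:350,0,400:20,8:15,6\n",
    "chr1\t5000\tcall_2\tN\t<DEL>\t.\tPASS\tEND=6000;SVTYPE=DEL\tGT\t0/1\n",
]
coverage = field_coverage(example)
```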
## Citation

Kim W\*, Yeom K\*, et al. *SVSTR-Score* (manuscript in preparation). Licence: MIT.