SVSTR-Score (v1.0): long-read-guided confidence scoring for short-read SV/STR calls
Two per-class RandomForest + isotonic calibrator models that assign a
calibrated confidence score CS ∈ [0,1] and a four-tier operating point to
each short-read structural-variant (SV) or short-tandem-repeat (STR) call.
The long-read genotype is used only to build the training label, so inference
needs short-read caller output only.
CS = isotonic_calibrator( RandomForest.predict_proba(X)[:, 1] ) = P(the
short-read call is concordant with the long-read truth).
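The formula above can be sketched in Python as follows. This is a minimal illustration, not the repository's scoring code: it assumes the model object exposes `predict_proba` with the positive class in column 1 and the calibrator exposes `predict`, as scikit-learn's `RandomForestClassifier` and `IsotonicRegression` do. The helper name `confidence_scores` is hypothetical.

```python
def confidence_scores(model, calibrator, X):
    """Return one calibrated confidence score (CS) per feature row in X.

    Assumptions (not confirmed by the model card): `model` exposes
    predict_proba(X) with the positive class in column 1, and `calibrator`
    exposes predict(raw_scores) taking a 1-D sequence.
    """
    # Raw RandomForest probability that the short-read call is concordant.
    raw = [float(row[1]) for row in model.predict_proba(X)]
    # The isotonic calibrator maps raw probabilities to calibrated ones.
    calibrated = calibrator.predict(raw)
    # Clip defensively so every CS stays inside [0, 1].
    return [min(1.0, max(0.0, float(v))) for v in calibrated]
```

With the published weights, `model` and `calibrator` would presumably be the objects returned by `joblib.load` on `sv_model.joblib` and `sv_calibrator.joblib` (or the `str_` pair), and `X` the feature matrix in the column order the model was trained on.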
| | SV | STR |
|---|---|---|
| weights | sv_model.joblib + sv_calibrator.joblib | str_model.joblib + str_calibrator.joblib |
| features | 35 (feature_builder.py) | 21 |
| short-read caller | Manta | ExpansionHunter |
| long-read truth (label only) | sawfish | TRGT |
| training cohort | 208 HPRC paired genomes | 208 HPRC |
Tiers
HIGH CS ≥ 0.70 · MODERATE 0.50–0.70 · WARNING 0.30–0.50 · LOW < 0.30.
Tiers are buckets of the calibrated CS (no heuristic overrides). HIGH is the
candidate-triage tier.
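Because the tiers are plain buckets of CS, the mapping can be written as a few threshold checks. The sketch below assumes each range includes its lower bound (so 0.50 is MODERATE and 0.30 is WARNING); the card states this explicitly only for HIGH (≥ 0.70) and LOW (< 0.30). The function name `cs_to_tier` is hypothetical.

```python
def cs_to_tier(cs):
    """Map a calibrated confidence score in [0, 1] to its tier name."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs!r}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:  # assumed inclusive lower bound
        return "MODERATE"
    if cs >= 0.30:  # 0.30 itself cannot be LOW, since LOW is CS < 0.30
        return "WARNING"
    return "LOW"
```

For example, `cs_to_tier(0.72)` returns `"HIGH"` and `cs_to_tier(0.29)` returns `"LOW"`.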
Intended use
Triage / down-weight emitted short-read SV & STR calls by their probability of matching long-read truth, without long-read sequencing every sample. Not a variant caller; does not recover missed variants; does not replace long-read sequencing for complete discovery.
Performance
Internal: 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes
| | AUROC | AUPRC | per-sample AUROC median |
|---|---|---|---|
| SV | 0.950 | 0.951 | 0.981 |
| STR | 0.834 | 0.886 | 0.835 |
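Grouping the folds by sample means that all calls from one genome fall into the same fold, so an out-of-fold score never comes from a model that saw other calls from that genome. The dependency-free sketch below illustrates only that grouping idea with a round-robin assignment; scikit-learn's `GroupKFold`, which the evaluation names, balances fold sizes differently. The function names and sample IDs are hypothetical.

```python
def group_folds(sample_ids, n_splits=5):
    """Assign each row to a fold so that rows sharing a sample ID share a fold.

    A simplified stand-in for GroupKFold, not its actual algorithm.
    """
    unique = sorted(set(sample_ids))
    if len(unique) < n_splits:
        raise ValueError("need at least n_splits distinct samples")
    # Round-robin over distinct samples, never over individual rows.
    fold_of_sample = {s: i % n_splits for i, s in enumerate(unique)}
    return [fold_of_sample[s] for s in sample_ids]


def train_test_indices(folds, k):
    """Row indices used for training and for held-out scoring in fold k."""
    train = [i for i, f in enumerate(folds) if f != k]
    test = [i for i, f in enumerate(folds) if f == k]
    return train, test


# Hypothetical sample IDs, one entry per call.
ids = ["S1", "S1", "S2", "S3", "S3", "S4", "S5", "S6"]
folds = group_folds(ids, n_splits=5)
train, test = train_test_indices(folds, 0)
# No sample contributes calls to both the training and the held-out side.
assert not {ids[i] for i in train} & {ids[i] for i in test}
```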
External: 295 ASD genomes, applied unchanged (no retraining)
| | AUROC | observed concordance LOW / WARN / MOD / HIGH |
|---|---|---|
| SV | 0.891 | 12% / 35% / 50% / 86% |
| STR | 0.831 | 16% / 40% / 60% / 85% |
Calibrated tiers stay monotone and meaningful on the external cohort; isotonic calibration improves the Brier score (SV 0.077 → 0.074, STR 0.167 → 0.159).
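Both quantities reported above are simple to compute from scored calls and their long-read labels. The sketch below is illustrative only: the Brier score is the mean squared difference between CS and the binary label (lower is better), and observed concordance is the fraction of concordant calls within each tier. The function names and toy data are hypothetical, and the tier edges assume inclusive lower bounds.

```python
from bisect import bisect_right

TIER_NAMES = ["LOW", "WARNING", "MODERATE", "HIGH"]
TIER_EDGES = [0.30, 0.50, 0.70]  # lower bounds of WARNING, MODERATE, HIGH


def brier_score(scores, labels):
    """Mean squared difference between scores and 0/1 labels."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def concordance_by_tier(scores, labels):
    """Fraction of concordant calls (label 1) per tier; None if a tier is empty."""
    hits = {name: 0 for name in TIER_NAMES}
    totals = {name: 0 for name in TIER_NAMES}
    for s, y in zip(scores, labels):
        tier = TIER_NAMES[bisect_right(TIER_EDGES, s)]
        totals[tier] += 1
        hits[tier] += y
    return {t: (hits[t] / totals[t] if totals[t] else None) for t in TIER_NAMES}


# Toy data: label 1 means the call matched the long-read genotype.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 1, 1, 0]
print(brier_score(scores, labels))
print(concordance_by_tier(scores, labels))
```

A monotone result is one where the per-tier fractions rise from LOW to HIGH, as in the external table above.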
Usage
pip install -U huggingface_hub scikit-learn==1.7.1
hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
# features from a single-sample VCF, then score:
python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
--fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score
Limitations
- Relies on caller support / breakpoint-confidence fields (PR/SR, CIPOS/CIEND, VAF, GQ, depth). On merged or heavily filtered call sets that drop these, scores deflate and tiers are unreliable (rank-discrimination degrades only moderately, but calibration breaks).
- Strongest for DEL/INS; DUP/BND are mostly down-weighted/triaged.
- STR scoring is bound to the caller's genotyped locus set.
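Given the first limitation, it may be worth checking a VCF before scoring it. The sketch below is a hypothetical pre-flight check, not part of the released scripts: it scans the VCF header and reports which of the named fields are not declared. Which section (INFO or FORMAT) each ID belongs to is an assumption based on typical Manta output, and a header declaration does not guarantee that every record carries the field.

```python
import re

# Assumed placement of the fields named above in a Manta-style VCF.
REQUIRED = {"INFO": {"CIPOS", "CIEND"}, "FORMAT": {"PR", "SR", "GQ"}}
_HEADER_ID = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def missing_fields(header_lines):
    """Return required INFO/FORMAT IDs that the VCF header does not declare."""
    declared = {"INFO": set(), "FORMAT": set()}
    for line in header_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column header
        match = _HEADER_ID.match(line)
        if match:
            declared[match.group(1)].add(match.group(2))
    missing = {}
    for kind, ids in REQUIRED.items():
        absent = sorted(ids - declared[kind])
        if absent:
            missing[kind] = absent
    return missing
```

The header lines can be read with `gzip.open("sample.manta.vcf.gz", "rt")`; an empty result means all listed fields are declared.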
Citation
Kim W*, Yeom K*, et al. SVSTR-Score (manuscript in preparation). Licence: MIT.