How to use khyeom/SVSTR-Score with Scikit-learn:

```python
from huggingface_hub import hf_hub_download
import joblib

# Only load pickle files from sources you trust.
# Read more: https://skops.readthedocs.io/en/stable/persistence.html
model = joblib.load(
    hf_hub_download("khyeom/SVSTR-Score", "sklearn_model.joblib")
)
```
language:
- en
license: cc-by-4.0
library_name: scikit-learn
tags:
- genomics
- structural-variants
- bioinformatics
- vcf
- variant-filtering
pipeline_tag: tabular-classification
SV-SPR: Short-read SV confidence Scoring with PaRents-only training
Reference-only post-VCF rescoring for short-read structural-variant (SV) calls. Given a VCF entry (coordinate + type + length) and a reference FASTA, the model returns a confidence score (CS): the RandomForest probability that the call would be confirmed by long-read sequencing (LRS). The CS is uncalibrated out-of-the-box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply isotonic/Platt calibration before using it as a literal probability. Tiers (High ≥ 0.9 / Moderate 0.7–0.9 / Warning 0.5–0.7 / Low < 0.5) match Methods 2.7.2.
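The isotonic option mentioned above can be sketched without dependencies using the pool-adjacent-violators algorithm. This is an illustration only, not the author's calibration code: the names `fit_isotonic` and `calibrate` are hypothetical, and the toy scores and labels stand in for a held-out set of calls whose LRS confirmation status is known.

```python
from bisect import bisect_left

def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function mapping raw CS to observed confirmation rate.

    Returns (uppers, values): block i covers scores up to uppers[i] and maps to values[i].
    """
    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Tied scores always share one block.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Pool adjacent blocks while the earlier block has the higher mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, high = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = high
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values

def calibrate(cs, uppers, values):
    """Look up the calibrated probability for a raw CS (clamped at both ends)."""
    i = bisect_left(uppers, cs)
    return values[min(i, len(values) - 1)]

# Toy held-out data (made up for illustration): raw CS and LRS confirmation (1 = confirmed).
toy_cs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_lrs = [0, 0, 1, 0, 1, 1, 0, 1]
uppers, values = fit_isotonic(toy_cs, toy_lrs)
calibrated = [calibrate(cs, uppers, values) for cs in toy_cs]
```

In practice the mapping should be fitted on calls that were not used to train the model, and with far more than a handful of points; scikit-learn's `IsotonicRegression` offers the same fit with interpolation between blocks.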
No BAM file is needed. Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR only requires the VCF and a reference. This makes it applicable to legacy VCFs, summary-stat archives, and federated cohorts where read-level data is unavailable.
Model details
| Item | Value |
|---|---|
| Architecture | RandomForest (200 trees, balanced class weight) |
| Features | 11 (sequence-context + length + type) |
| Training data | 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi |
| Training samples | 143 parents only (no children, ASD-agnostic) |
| Cross-validation | 143-fold sample-LOSO |
| F1 (CV) | 0.9593 [95% CI 0.9564–0.9616] |
| AUROC (CV) | 0.9739 |
| Reference | GRCh38 |
| Caller agnostic | Yes; no Manta / DRAGEN / Delly fields required |
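Reading "sample-LOSO" as leave-one-sample-out, each fold holds out every call from one parent and trains on the calls from all the others. A minimal sketch of that split follows; `sample_loso_splits` and the sample IDs are hypothetical and not taken from the training script.

```python
from collections import defaultdict

def sample_loso_splits(sample_ids):
    """Yield (held_out_sample, train_indices, test_indices), one fold per sample.

    sample_ids[i] is the sample that call i came from, so all calls from the
    held-out sample land in the test fold together.
    """
    by_sample = defaultdict(list)
    for idx, sid in enumerate(sample_ids):
        by_sample[sid].append(idx)
    for held_out in sorted(by_sample):
        test_idx = by_sample[held_out]
        train_idx = [i for sid, idxs in by_sample.items()
                     if sid != held_out for i in idxs]
        yield held_out, train_idx, test_idx

# Toy example: five calls from three hypothetical parents.
toy_ids = ["P1", "P1", "P2", "P3", "P3"]
folds = list(sample_loso_splits(toy_ids))
```

Splitting by sample rather than by call keeps calls from the same genome out of both sides of a fold, which is the point of the design; it does not by itself address relatedness between samples.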
Features used (11)
| # | Feature | Source |
|---|---|---|
| 1 | svlen_abs | VCF SVLEN |
| 2 | log10_svlen | derived |
| 3-6 | svtype_DEL/INS/DUP/BND | VCF SVTYPE one-hot |
| 7 | gc_flank_w100 | GC fraction in ±100 bp flanks |
| 8 | at_flank_w100 | AT fraction in ±100 bp flanks |
| 9 | gc_inner_w100 | GC fraction inside the called region |
| 10 | n_motif_2_w100 | tandem dinucleotide count in flank |
| 11 | n_motif_3_w100 | tandem trinucleotide count in flank |
Sequence features contribute 47.4% of total importance (T01d analysis).
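To make the feature table concrete, here is a rough sketch of how the 11 columns could be derived from the VCF fields and the reference sequence around a call. It is not the package's extraction code. The helper names are hypothetical, the flank and inner sequences are assumed to have been fetched from the FASTA already, and the tandem-motif definition (positions where a k-mer is immediately repeated) is an assumption, since the card does not specify it.

```python
import math

SVTYPES = ("DEL", "INS", "DUP", "BND")

def base_fraction(seq, bases):
    """Fraction of unambiguous (A/C/G/T) bases in seq that belong to `bases`."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return sum(seq.count(b) for b in bases) / called

def count_tandem(seq, k):
    """Assumed definition: number of positions where a k-mer is immediately repeated."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 2 * k + 1)
               if seq[i:i + k] == seq[i + k:i + 2 * k])

def sequence_features(svtype, svlen, left_flank, right_flank, inner):
    """Build the 11-feature dict from VCF fields and pre-fetched reference sequence."""
    length = abs(svlen)
    feats = {
        "svlen_abs": length,
        "log10_svlen": math.log10(max(length, 1)),  # guard against SVLEN = 0
    }
    for t in SVTYPES:
        feats[f"svtype_{t}"] = int(svtype == t)
    flanks = left_flank + right_flank  # joined for base composition only
    feats["gc_flank_w100"] = base_fraction(flanks, "GC")
    feats["at_flank_w100"] = base_fraction(flanks, "AT")
    feats["gc_inner_w100"] = base_fraction(inner, "GC")
    # Count motifs per flank so no repeat is created across the artificial junction.
    feats["n_motif_2_w100"] = count_tandem(left_flank, 2) + count_tandem(right_flank, 2)
    feats["n_motif_3_w100"] = count_tandem(left_flank, 3) + count_tandem(right_flank, 3)
    return feats

# Toy call with short made-up sequences in place of real 100 bp windows.
example = sequence_features("DEL", -5000, "ACACACGTTG", "TTAGGCATCA", "GGCATTCGAA")
```

With a real reference, the three sequence arguments would come from an indexed FASTA reader over the windows around `pos` and `end`.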
Quick start
```python
from svspr import classify, score

# 1. Score a single SV
result = classify(
    chrom='chr1', pos=1000000, end=1005000,
    svtype='DEL', svlen=5000, total_alt_support=15,
    ref_path='GRCh38.fa',
)
# {'CS': 0.69, 'tier': 'moderate'}

# 2. Score every SV in a VCF
df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()
```
CLI usage:
```bash
# Batch
python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv

# Single
python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
```
Tier definition
| Tier | Confidence Score (CS) | Recommendation |
|---|---|---|
| high | ≥ 0.85 | High-confidence call. Use directly. |
| moderate | 0.50 – 0.85 | Acceptable. Consider tier in downstream filters. |
| warning | 0.30 – 0.50 | Low confidence. LRS validation recommended. |
| low | < 0.30 | Likely false positive. Exclude unless rescued. |
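The table above translates directly into a threshold lookup. The sketch below is a hypothetical helper, not the package's own function, and it assumes each tier includes its lower bound (the table does not say which side the boundaries fall on).

```python
def tier_for(cs):
    """Map a confidence score in [0, 1] to the tier names used in the table."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs}")
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"

# The Quick start example (CS 0.69) falls in the 0.50 to 0.85 band.
example_tier = tier_for(0.69)
```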
Performance vs. comparable filters
| Model | F1 | AUROC | Inputs needed |
|---|---|---|---|
| SV-SPR (this) | 0.9593 | 0.9739 | VCF + reference |
| v13_parents (Manta-specific FULL, 23 feat) | 0.9308 | 0.9283 | VCF + Manta-specific fields |
| Jan 2026 bioRxiv (15 features) | 0.957 | n/a | VCF + reference |
| PostSV (2020) | ~0.74 | n/a | VCF + caller features |
(Performance numbers reported on 226-trio Korean cohort, sample-LOSO.)
Intended use
- Post-VCF filtering / rescoring of short-read SV calls
- Quality control in population-scale SV pipelines
- Prioritising calls for long-read validation in clinical settings
Out of scope / limitations
- Trained on GRCh38 only; performance on T2T-CHM13 not validated
- Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
- Sequence-only by default; if your caller provides PR/SR/AD fields, supplying total_alt_support may improve recall but is not strictly required
- Does not call SVs; it operates on already-called VCFs
- Best for DEL/INS in the 100 bp–10 kb range. BND/DUP have lower base rates (DRAGEN Manta BND has ~88% over-call rate)
Caveats and known issues
- Sample-LOSO assumes parental independence. Korean founder effects may invalidate this; we have not measured pairwise IBD.
- Sequence vs. depth contribution. Adding depth features (T01d +both) yields F1 0.9579, which is not significantly different from sequence-only (Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant.
Prior art (substantial overlap)
This model occupies a niche close to several prior tools. Closest analogues:
- Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering": RF on 15 features, F1=0.957. The closest direct comparison.
- DeepSVFilter (Liu 2020): CNN, requires BAM.
- Samplot-ML (Belyeu 2021, Genome Biology): CNN, requires BAM.
- Duphold (Pedersen 2019, GigaScience): depth + GC-matched, requires BAM.
- CADD-SV (Kleinert 2022, Genome Research): RF on sequence annotations.
- PostSV (Alzaid 2020), SV2 (Antaki 2018), GATK-SV (Collins 2020).
Our differentiator: reference-only inputs (no BAM, no caller-specific fields). Applicable to legacy VCFs and federated cohorts where read-level access is restricted.
Training script
See training/T01d_unified_with_v13.py in the source repo for the exact
reproduction recipe.
Citation
Manuscript in preparation. Provisional citation:
Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural
variants. Preprint, 2026.
Contact
- Author: Woohun Kim (alex990713@gmail.com)
- Lab: KAIST
- Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS