How to use khyeom/SVSTR-Score with Scikit-learn:
```python
from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download("khyeom/SVSTR-Score", "sklearn_model.joblib")
)
# only load pickle files from sources you trust
# read more about it here https://skops.readthedocs.io/en/stable/persistence.html
```
| language: | |
| - en | |
| license: cc-by-4.0 | |
| library_name: scikit-learn | |
| tags: | |
| - genomics | |
| - structural-variants | |
| - bioinformatics | |
| - vcf | |
| - variant-filtering | |
| pipeline_tag: tabular-classification | |
| # SV-SPR: Short-read SV confidence Scoring with PaRents-only training | |
| Reference-only post-VCF rescoring for short-read structural-variant (SV) calls. | |
| Given a VCF entry (coordinate + type + length) and a reference FASTA, the model | |
| returns a confidence score (CS): the RandomForest probability that the call | |
| would be confirmed by long-read sequencing (LRS). The CS is **uncalibrated** | |
| out-of-the-box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply | |
| isotonic/Platt calibration before using it as a literal probability. Tiers | |
| (High ≥ 0.9 / Moderate 0.7–0.9 / Warning 0.5–0.7 / Low < 0.5) match Methods 2.7.2. | |
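The calibration step mentioned above can be sketched as follows. This is a minimal illustration, assuming a held-out set of raw CS values with long-read confirmation labels is available. The arrays here are synthetic, and `expected_calibration_error` is a hypothetical helper using equal-width bins; neither is part of `svspr`, and the binning used for the reported ECE may differ.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins: weighted mean of |observed rate - mean score|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Bin index 0..n_bins-1; a score of exactly 1.0 falls into the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of calls in this bin
    return ece


# Synthetic stand-in for held-out data (NOT the cohort): the true confirmation
# rate exceeds the raw score in the mid-range, i.e. the score is under-confident.
rng = np.random.default_rng(0)
raw_cs = rng.uniform(0.0, 1.0, size=4000)
true_rate = np.clip(raw_cs + 0.3 * raw_cs * (1.0 - raw_cs), 0.0, 1.0)
confirmed = (rng.uniform(size=raw_cs.size) < true_rate).astype(int)

# Fit the calibrator on one half and evaluate on the other half.
fit, test = slice(0, 2000), slice(2000, 4000)
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cs[fit], confirmed[fit])
calibrated_cs = calibrator.predict(raw_cs[test])

ece_before = expected_calibration_error(confirmed[test], raw_cs[test])
ece_after = expected_calibration_error(confirmed[test], calibrated_cs)
print(f"ECE before: {ece_before:.3f}, after: {ece_after:.3f}")
```

Isotonic regression only assumes the raw score is monotonically related to the confirmation rate, so it preserves the ranking of calls while reshaping the scale. Platt scaling (a logistic fit on the raw score) is the usual alternative when the held-out set is small.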
| > **No BAM file is needed.** Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR | |
| > only requires the VCF and a reference. This makes it applicable to legacy | |
| > VCFs, summary-stat archives, and federated cohorts where read-level data is | |
| > unavailable. | |
| ## Model details | |
| | Item | Value | | |
| |--|--| | |
| | Architecture | RandomForest (200 trees, balanced class weight) | | |
| | Features | 11 (sequence-context + length + type) | | |
| | Training data | 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi | | |
| | Training samples | 143 parents only (no children, ASD-agnostic) | | |
| | Cross-validation | 143-fold sample-LOSO | | |
| | **F1 (CV)** | **0.9593** [95% CI 0.9564–0.9616] | | |
| | **AUROC (CV)** | **0.9739** | | |
| | Reference | GRCh38 | | |
| | Caller agnostic | Yes; no Manta / DRAGEN / Delly fields required | | |
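The architecture and cross-validation rows can be illustrated with scikit-learn's `LeaveOneGroupOut`, treating each sample as one group so that all calls from a sample are held out together. This is a sketch on synthetic data with 6 samples instead of 143. Only the 200 trees and the balanced class weight come from the table; the other settings, and the choice to pool held-out predictions before computing metrics, are assumptions. The training script remains the authoritative recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(42)
n_samples, calls_per_sample, n_features = 6, 200, 11

# Synthetic feature matrix and labels; `groups` records which sample each call came from.
X = rng.normal(size=(n_samples * calls_per_sample, n_features))
noise = rng.normal(scale=0.5, size=X.shape[0])
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
groups = np.repeat(np.arange(n_samples), calls_per_sample)

# One fold per sample: train on all other samples, predict the held-out one.
pooled_prob = np.zeros(X.shape[0])
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0, n_jobs=-1
    )
    clf.fit(X[train_idx], y[train_idx])
    pooled_prob[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print("F1:", f1_score(y, (pooled_prob >= 0.5).astype(int)))
print("AUROC:", roc_auc_score(y, pooled_prob))
```

Grouping by sample keeps calls from the same genome out of both the training and test sides of a fold, which is the point of sample-level LOSO.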
| ### Features used (11) | |
| | # | Feature | Source | | |
| |--|--|--| | |
| | 1 | svlen_abs | VCF SVLEN | | |
| | 2 | log10_svlen | derived | | |
| | 3-6 | svtype_DEL/INS/DUP/BND | VCF SVTYPE one-hot | | |
| | 7 | gc_flank_w100 | GC fraction in ±100 bp flanks | | |
| | 8 | at_flank_w100 | AT fraction in ±100 bp flanks | | |
| | 9 | gc_inner_w100 | GC fraction inside the called region | | |
| | 10 | n_motif_2_w100 | tandem dinucleotide count in flank | | |
| | 11 | n_motif_3_w100 | tandem trinucleotide count in flank | | |
| Sequence features contribute **47.4% of total importance** (T01d analysis). | |
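To make the feature table concrete, here is a rough sketch of how the 11 values could be derived once the flanking and inner sequences have been fetched from the reference. The helper names (`base_fraction`, `tandem_motif_count`, `build_features`) are hypothetical, and the exact definitions are assumptions: in particular how tandem motifs are counted, how `N` bases are treated, and how the logarithm handles very short lengths. The packaged `svspr` code may compute them differently.

```python
import math

SVTYPES = ("DEL", "INS", "DUP", "BND")


def base_fraction(seq: str, bases: str) -> float:
    """Fraction of positions in `seq` that are one of `bases` (N counts in the length)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(seq.count(b) for b in bases) / len(seq)


def tandem_motif_count(seq: str, k: int) -> int:
    """Count positions where a k-mer is immediately followed by an identical copy.

    Assumed definition for illustration; homopolymer units such as 'AA' are skipped.
    """
    seq = seq.upper()
    count = 0
    for i in range(len(seq) - 2 * k + 1):
        unit = seq[i:i + k]
        if len(set(unit)) > 1 and unit == seq[i + k:i + 2 * k]:
            count += 1
    return count


def build_features(svtype: str, svlen: int, left_flank: str, right_flank: str, inner: str) -> dict:
    """Assemble the 11 features in the order of the table above."""
    svlen_abs = abs(svlen)
    feats = {
        "svlen_abs": svlen_abs,
        "log10_svlen": math.log10(max(svlen_abs, 1)),  # guard against log10(0)
    }
    for t in SVTYPES:
        feats[f"svtype_{t}"] = int(svtype == t)
    flank = left_flank + right_flank
    feats["gc_flank_w100"] = base_fraction(flank, "GC")
    feats["at_flank_w100"] = base_fraction(flank, "AT")
    feats["gc_inner_w100"] = base_fraction(inner, "GC")
    for k in (2, 3):
        # Count each flank separately so no repeat is invented across the junction.
        feats[f"n_motif_{k}_w100"] = (
            tandem_motif_count(left_flank, k) + tandem_motif_count(right_flank, k)
        )
    return feats


# Toy 8 bp sequences stand in for the 100 bp windows.
example = build_features("DEL", -5000, "ACACACGT", "GGGCCCAT", "ATATGCGC")
print(example)
```

All inputs are the VCF fields plus reference sequence, so no read-level data is touched at any point.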
| ## Quick start | |
| ```python | |
| from svspr import classify, score | |
| # 1. Score a single SV | |
| result = classify( | |
| chrom='chr1', pos=1000000, end=1005000, | |
| svtype='DEL', svlen=5000, total_alt_support=15, | |
| ref_path='GRCh38.fa', | |
| ) | |
| # {'CS': 0.69, 'tier': 'moderate'} | |
| # 2. Score every SV in a VCF | |
| df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa') | |
| df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head() | |
| ``` | |
| CLI usage: | |
| ```bash | |
| # Batch | |
| python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv | |
| # Single | |
| python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \ | |
| --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa | |
| ``` | |
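A downstream filter on the batch output might look like the sketch below. It assumes the TSV written by `--out` carries the same `chrom`, `pos`, `svtype`, `CS`, and `tier` columns as the DataFrame in the Python example; that layout is an assumption, and the rows here are made up for illustration.

```python
import csv
import io

# In-memory stand-in for scored.tsv (hypothetical rows, assumed column layout).
scored_tsv = io.StringIO(
    "chrom\tpos\tsvtype\tCS\ttier\n"
    "chr1\t1000000\tDEL\t0.69\tmoderate\n"
    "chr2\t2500000\tINS\t0.93\thigh\n"
    "chr3\t780000\tBND\t0.12\tlow\n"
    "chr4\t4100000\tDUP\t0.41\twarning\n"
)

rows = list(csv.DictReader(scored_tsv, delimiter="\t"))

# Keep the two upper tiers, most confident first.
kept = [r for r in rows if r["tier"] in {"high", "moderate"}]
kept.sort(key=lambda r: float(r["CS"]), reverse=True)

# Set aside 'warning' calls as candidates for long-read validation.
for_validation = [r for r in rows if r["tier"] == "warning"]

for r in kept:
    print(r["chrom"], r["pos"], r["svtype"], r["CS"], r["tier"])
```

To run this against a real file, replace the `io.StringIO` object with `open("scored.tsv", newline="")`.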
| ## Tier definition | |
| | Tier | Confidence Score (CS) | Recommendation | | |
| |--|--|--| | |
| | high | ≥ 0.85 | High-confidence call. Use directly. | | |
| | moderate | 0.50 – 0.85 | Acceptable. Consider tier in downstream filters. | | |
| | warning | 0.30 – 0.50 | Low confidence. LRS validation recommended. | | |
| | low | < 0.30 | Likely false positive. Exclude unless rescued. | | |
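The thresholds in this table translate into a small lookup. `assign_tier` is a hypothetical helper, not a function exported by `svspr`, and treating each lower bound as inclusive is an assumption, since the table does not say which tier a score of exactly 0.50 or 0.30 belongs to.

```python
def assign_tier(cs: float) -> str:
    """Map a confidence score in [0, 1] to a tier using the table's thresholds."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must be within [0, 1], got {cs}")
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:  # lower bounds assumed inclusive
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"


for cs in (0.92, 0.69, 0.41, 0.10):
    print(cs, assign_tier(cs))
```

With these cut-offs the Quick start example (CS 0.69) lands in `moderate`, matching the output shown there.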
| ## Performance vs. comparable filters | |
| | Model | F1 | AUROC | Inputs needed | | |
| |--|--|--|--| | |
| | SV-SPR (this) | **0.9593** | 0.9739 | VCF + reference | | |
| | v13_parents (Manta-specific FULL, 23 feat) | 0.9308 | 0.9283 | VCF + Manta-specific fields | | |
| | Jan 2026 bioRxiv (15 features) | 0.957 | n/a | VCF + reference | | |
| | PostSV (2020) | ~0.74 | n/a | VCF + caller features | | |
| (Performance numbers reported on 226-trio Korean cohort, sample-LOSO.) | |
| ## Intended use | |
| - Post-VCF filtering / rescoring of short-read SV calls | |
| - Quality control in population-scale SV pipelines | |
| - Prioritising calls for long-read validation in clinical settings | |
| ## Out of scope / limitations | |
| - Trained on **GRCh38** only; performance on T2T-CHM13 not validated | |
| - Trained on a single Korean cohort (226 trios); external generalisation not yet quantified | |
| - Sequence-only by default; if your caller provides PR/SR/AD fields, supplying | |
| `total_alt_support` may improve recall but is not strictly required | |
| - Does not call SVs; it rescores calls in an existing VCF | |
| - Best for DEL/INS in the 100 bp–10 kb range. BND/DUP have lower base rates | |
| (DRAGEN Manta BND has ~88% over-call rate) | |
| ## Caveats and known issues | |
| - **Sample-LOSO assumes parental independence.** Korean founder effects may | |
| invalidate this; we have not measured pairwise IBD. | |
| - **Sequence vs. depth contribution.** Adding depth features (T01d +both) | |
| yields F1 0.9579, which is not significantly different from sequence-only | |
| (Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant. | |
| ## Prior art (substantial overlap) | |
| This model occupies a niche close to several prior tools. Closest analogues: | |
| - **Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering"**: RF on | |
| 15 features, F1=0.957. The closest direct comparison. | |
| - **DeepSVFilter** (Liu 2020): CNN, requires BAM. | |
| - **Samplot-ML** (Belyeu 2021, Genome Biology): CNN, requires BAM. | |
| - **Duphold** (Pedersen 2019, GigaScience): depth + GC-matched, requires BAM. | |
| - **CADD-SV** (Kleinert 2022, Genome Research): RF on sequence annotations. | |
| - **PostSV** (Alzaid 2020), **SV2** (Antaki 2018), **GATK-SV** (Collins 2020). | |
| **Our differentiator**: reference-only inputs (no BAM, no caller-specific | |
| fields). Applicable to legacy VCFs and federated cohorts where read-level | |
| access is restricted. | |
| ## Training script | |
| See `training/T01d_unified_with_v13.py` in the source repo for the exact | |
| reproduction recipe. | |
| ## Citation | |
| Manuscript in preparation. Provisional citation: | |
| ``` | |
| Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural | |
| variants. Preprint, 2026. | |
| ``` | |
| ## Contact | |
| - Author: Woohun Kim (alex990713@gmail.com) | |
| - Lab: KAIST | |
| - Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS | |