---
language:
- en
license: cc-by-4.0
library_name: scikit-learn
tags:
- genomics
- structural-variants
- bioinformatics
- vcf
- variant-filtering
pipeline_tag: tabular-classification
---

# SV-SPR: Short-read SV confidence Scoring with PaRents-only training

Reference-only post-VCF rescoring for short-read structural-variant (SV) calls. Given a VCF entry (coordinate + type + length) and a reference FASTA, the model returns a confidence score (CS): the RandomForest probability that the call would be confirmed by long-read sequencing (LRS). The CS is **uncalibrated** out-of-the-box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply isotonic/Platt calibration before using it as a literal probability. Tiers (High ≥ 0.9 / Moderate 0.7–0.9 / Warning 0.5–0.7 / Low < 0.5) match Methods 2.7.2.

> **No BAM file is needed.** Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR
> only requires the VCF and a reference. This makes it applicable to legacy
> VCFs, summary-stat archives, and federated cohorts where read-level data is
> unavailable.
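
Because the CS is uncalibrated, the isotonic option mentioned above could be applied roughly as follows. This is a minimal sketch, not part of the `svspr` package: `fit_cs_calibrator` is a hypothetical helper, and the scores and labels are synthetic stand-ins for a held-out set with LRS truth. Only scikit-learn's `IsotonicRegression` is a real API here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def fit_cs_calibrator(raw_cs, lrs_confirmed):
    """Fit a monotone map from raw CS to a calibrated probability.

    Hypothetical helper for illustration; not provided by svspr.
    """
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(
        np.asarray(raw_cs, dtype=float),
        np.asarray(lrs_confirmed, dtype=float),
    )
    return calibrator


# Synthetic stand-in for a held-out set; NOT SV-SPR output or cohort data.
rng = np.random.default_rng(0)
raw_cs = rng.uniform(0.0, 1.0, size=2000)
# Simulated truth: the confirmation rate sits above the raw score in the
# mid-range, mimicking an under-confident scorer.
true_rate = np.clip(raw_cs + 0.3 * raw_cs * (1.0 - raw_cs), 0.0, 1.0)
lrs_confirmed = rng.uniform(size=raw_cs.size) < true_rate

calibrator = fit_cs_calibrator(raw_cs, lrs_confirmed)

# Map new raw scores to calibrated probabilities.
new_cs = np.array([0.10, 0.50, 0.69, 0.95])
calibrated = calibrator.predict(new_cs)
print(np.round(calibrated, 3))
```

In practice the calibrator would be fitted on samples with LRS truth that were not used to train the forest. Platt scaling is the parametric alternative: a one-feature logistic regression on the raw score, which may be preferable when the held-out set is small.
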
## Model details

| Item | Value |
|---|---|
| Architecture | RandomForest (200 trees, balanced class weight) |
| Features | 11 (sequence-context + length + type) |
| Training data | 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi |
| Training samples | 143 parents only (no children, ASD-agnostic) |
| Cross-validation | 143-fold sample-LOSO |
| **F1 (CV)** | **0.9593** [95% CI 0.9564–0.9616] |
| **AUROC (CV)** | **0.9739** |
| Reference | GRCh38 |
| Caller agnostic | Yes (no Manta / DRAGEN / Delly fields required) |

### Features used (11)

| # | Feature | Source |
|---|---|---|
| 1 | svlen_abs | VCF SVLEN |
| 2 | log10_svlen | derived |
| 3-6 | svtype_DEL/INS/DUP/BND | VCF SVTYPE one-hot |
| 7 | gc_flank_w100 | GC fraction in ±100bp flanks |
| 8 | at_flank_w100 | AT fraction in ±100bp flanks |
| 9 | gc_inner_w100 | GC fraction inside the called region |
| 10 | n_motif_2_w100 | tandem dinucleotide count in flank |
| 11 | n_motif_3_w100 | tandem trinucleotide count in flank |

Sequence features contribute **47.4% of total importance** (T01d analysis).

## Quick start

```python
from svspr import classify, score

# 1. Score a single SV
result = classify(
    chrom='chr1', pos=1000000, end=1005000,
    svtype='DEL', svlen=5000,
    total_alt_support=15,
    ref_path='GRCh38.fa',
)
# {'CS': 0.69, 'tier': 'moderate'}

# 2. Score every SV in a VCF
df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()
```

CLI usage:

```bash
# Batch
python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv

# Single
python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
```

## Tier definition

| Tier | Confidence Score (CS) | Recommendation |
|---|---|---|
| high | ≥ 0.85 | High-confidence call. Use directly. |
| moderate | 0.50 – 0.85 | Acceptable. Consider tier in downstream filters. |
| warning | 0.30 – 0.50 | Low confidence. LRS validation recommended. |
| low | < 0.30 | Likely false positive. Exclude unless rescued. |

## Performance vs. comparable filters

| Model | F1 | AUROC | Inputs needed |
|---|---|---|---|
| SV-SPR (this) | **0.9593** | 0.9739 | VCF + reference |
| v13_parents (Manta-specific FULL, 23 feat) | 0.9308 | 0.9283 | VCF + Manta-specific fields |
| Jan 2026 bioRxiv (15 features) | 0.957 | n/a | VCF + reference |
| PostSV (2020) | ~0.74 | n/a | VCF + caller features |

(Performance numbers reported on 226-trio Korean cohort, sample-LOSO.)

## Intended use

- Post-VCF filtering / rescoring of short-read SV calls
- Quality control in population-scale SV pipelines
- Prioritising calls for long-read validation in clinical settings

## Out of scope / limitations

- Trained on **GRCh38** only; performance on T2T-CHM13 not validated
- Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
- Sequence-only by default; if your caller provides PR/SR/AD fields, supplying `total_alt_support` may improve recall but is not strictly required
- Does not call SVs; feeds on already-called VCFs
- Best for DEL/INS in the 100bp–10kb range. BND/DUP have lower base rates (DRAGEN Manta BND has ~88% over-call rate)

## Caveats and known issues

- **Sample-LOSO assumes parental independence.** Korean founder effects may invalidate this; we have not measured pairwise IBD.
- **Sequence vs. depth contribution.** Adding depth features (T01d +both) yields F1 0.9579, which is not significantly different from sequence-only (Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant.

## Prior art (substantial overlap)

This model occupies a niche close to several prior tools. Closest analogues:

- **Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering"**: RF on 15 features, F1=0.957. The closest direct comparison.
- **DeepSVFilter** (Liu 2020): CNN, requires BAM.
- **Samplot-ML** (Belyeu 2021, Genome Biology): CNN, requires BAM.
- **Duphold** (Pedersen 2019, GigaScience): depth + GC-matched, requires BAM.
- **CADD-SV** (Kleinert 2022, Genome Research): RF on sequence annotations.
- **PostSV** (Alzaid 2020), **SV2** (Antaki 2018), **GATK-SV** (Collins 2020).

**Our differentiator**: reference-only inputs (no BAM, no caller-specific fields). Applicable to legacy VCFs and federated cohorts where read-level access is restricted.

## Training script

See `training/T01d_unified_with_v13.py` in the source repo for the exact reproduction recipe.

## Citation

Manuscript in preparation. Provisional citation:

```
Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural variants. Preprint, 2026.
```

## Contact

- Author: Woohun Kim (alex990713@gmail.com)
- Lab: KAIST
- Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS