language:
  - en
license: cc-by-4.0
library_name: scikit-learn
tags:
  - genomics
  - structural-variants
  - bioinformatics
  - vcf
  - variant-filtering
pipeline_tag: tabular-classification

SV-SPR: Short-read SV confidence Scoring with PaRents-only training

SV-SPR is a reference-only, post-VCF rescoring model for short-read structural-variant (SV) calls. Given a VCF entry (coordinate, type, and length) and a reference FASTA, it returns a confidence score (CS): the RandomForest probability that the call would be confirmed by long-read sequencing (LRS). The CS is uncalibrated out of the box (held-out expected calibration error, ECE ≈ 0.07; under-confident in the mid-range), so apply isotonic or Platt calibration before treating it as a literal probability. Each CS is also binned into one of four tiers (high, moderate, warning, low; Methods 2.7.2), using the cutoffs listed under Tier definition.
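
To make the calibration figure concrete, here is a minimal sketch of one common ECE definition: scores are split into equal-width bins, and the gap between the mean score and the observed LRS-confirmation rate is averaged across bins, weighted by bin size. The card does not say which ECE variant or bin count was used, so the function name, the 10-bin default, and the toy inputs are all assumptions.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Equal-width-bin ECE (assumed variant, not necessarily the one behind the 0.07 figure).

    scores: predicted confidence scores in [0, 1]
    labels: 1 if the call was confirmed by long reads, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        bins[idx].append((s, y))

    n = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        confirmed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(confirmed_rate - mean_score)
    return ece


# Toy, made-up inputs: scores of 0.9 that are always confirmed are under-confident by 0.1.
toy_ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(round(toy_ece, 3))
```

Running the same function on held-out scores before and after fitting a calibrator is one way to check whether calibration actually helped.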

No BAM file is needed. Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR only requires the VCF and a reference. This makes it applicable to legacy VCFs, summary-stat archives, and federated cohorts where read-level data is unavailable.

Model details

Item	Value
Architecture	RandomForest (200 trees, balanced class weight)
Features	11 (sequence-context + length + type)
Training data	226 Korean trios, paired Illumina DRAGEN + PacBio HiFi
Training samples	143 parents only (no children, ASD-agnostic)
Cross-validation	143-fold sample-LOSO
F1 (CV)	0.9593 [95% CI 0.9564–0.9616]
AUROC (CV)	0.9739
Reference	GRCh38
Caller agnostic	Yes — no Manta / DRAGEN / Delly fields required
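
Sample-LOSO is read here as leave-one-sample-out: all SV calls from one parent are held out together, so 143 parents give 143 folds. The sketch below shows that splitting logic in plain Python. The function name and the sample IDs are hypothetical, and the actual training script may implement the split differently.

```python
from collections import defaultdict


def leave_one_sample_out(sample_ids):
    """Yield (held_out_sample, train_indices, test_indices), one fold per sample.

    sample_ids has one entry per SV call, naming the sample the call came from.
    """
    rows_by_sample = defaultdict(list)
    for row, sample in enumerate(sample_ids):
        rows_by_sample[sample].append(row)

    for held_out, test_rows in rows_by_sample.items():
        train_rows = [
            row
            for sample, rows in rows_by_sample.items()
            if sample != held_out
            for row in rows
        ]
        yield held_out, train_rows, test_rows


# Hypothetical sample IDs for six SV calls from three parents.
calls = ["P001", "P001", "P002", "P003", "P003", "P003"]
folds = list(leave_one_sample_out(calls))
for held_out, train_rows, test_rows in folds:
    print(held_out, train_rows, test_rows)
```

Splitting by sample rather than by call keeps calls from the same genome out of both the training and test side of a fold.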

Features used (11)

#	Feature	Source
1	svlen_abs	VCF SVLEN
2	log10_svlen	derived
3-6	svtype_DEL/INS/DUP/BND	VCF SVTYPE one-hot
7	gc_flank_w100	GC fraction in ±100bp flanks
8	at_flank_w100	AT fraction in ±100bp flanks
9	gc_inner_w100	GC fraction inside the called region
10	n_motif_2_w100	tandem dinucleotide count in flank
11	n_motif_3_w100	tandem trinucleotide count in flank
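
The table above can be sketched as a small feature builder. This is an illustration under stated assumptions, not the packaged svspr feature code: the function names are invented, a "tandem" motif is taken to mean a unit repeated at least three times in a row, motif counts are summed over the two flanks separately, and the caller is expected to have already extracted the flank and inner sequences from the reference.

```python
import math
import re

SVTYPES = ("DEL", "INS", "DUP", "BND")


def base_fraction(seq, bases):
    """Fraction of positions in seq that are one of the given bases."""
    seq = seq.upper()
    return sum(seq.count(b) for b in bases) / len(seq) if seq else 0.0


def tandem_repeat_count(seq, unit_len, min_copies=3):
    """Count non-overlapping runs of a unit_len-mer repeated >= min_copies times (assumed definition)."""
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return sum(1 for _ in pattern.finditer(seq.upper()))


def build_features(svtype, svlen, left_flank, right_flank, inner):
    """Assemble the 11 named features from already-extracted sequences."""
    flank = left_flank + right_flank
    feats = {
        "svlen_abs": abs(svlen),
        "log10_svlen": math.log10(max(abs(svlen), 1)),  # guard against SVLEN of 0
    }
    for t in SVTYPES:
        feats["svtype_" + t] = int(svtype == t)
    feats["gc_flank_w100"] = base_fraction(flank, "GC")
    feats["at_flank_w100"] = base_fraction(flank, "AT")
    feats["gc_inner_w100"] = base_fraction(inner, "GC")
    # Count per flank so a repeat cannot be created by joining the two flanks.
    for unit_len in (2, 3):
        feats["n_motif_%d_w100" % unit_len] = (
            tandem_repeat_count(left_flank, unit_len)
            + tandem_repeat_count(right_flank, unit_len)
        )
    return feats


# Made-up, shortened sequences (real flanks would be 100 bp each).
feats = build_features("DEL", -5000, "ACACACACGT", "GGGCCCTTTA", "GCGCATAT")
print(feats)
```

Note that with this regular expression a homopolymer run such as AAAAAA also counts as a dinucleotide repeat; whether the original features treat it that way is not stated.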

Sequence features contribute 47.4% of total importance (T01d analysis).

Quick start

from svspr import classify, score

# 1. Score a single SV
result = classify(
    chrom='chr1', pos=1000000, end=1005000,
    svtype='DEL', svlen=5000, total_alt_support=15,
    ref_path='GRCh38.fa',
)
# {'CS': 0.69, 'tier': 'moderate'}

# 2. Score every SV in a VCF
df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()

CLI usage:

# Batch
python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv

# Single
python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
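
Once a batch run has written scored.tsv, downstream filtering can be a few lines of standard-library Python. The sketch below assumes the TSV has a header row with the same column names as the DataFrame in the quick start (chrom, pos, svtype, CS, tier); that layout is an assumption, so check the real header first. The helper name and the two example rows are made up.

```python
import csv
import io


def keep_rows(tsv_handle, keep_tiers=("high", "moderate")):
    """Return the rows of a scored TSV whose tier is in keep_tiers."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if row["tier"] in keep_tiers]


# Synthetic stand-in for scored.tsv; replace with open("scored.tsv", newline="").
example = io.StringIO(
    "chrom\tpos\tsvtype\tCS\ttier\n"
    "chr1\t1000000\tDEL\t0.69\tmoderate\n"
    "chr2\t5000000\tBND\t0.21\tlow\n"
)
kept = keep_rows(example)
print(kept)
```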

Tier definition

Tier	Confidence Score (CS)	Recommendation
high	≥ 0.85	High-confidence call. Use directly.
moderate	0.50 – 0.85	Acceptable. Consider tier in downstream filters.
warning	0.30 – 0.50	Low confidence. LRS validation recommended.
low	< 0.30	Likely false positive. Exclude unless rescued.
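
The table translates directly into a threshold function. The sketch below assumes each lower bound is inclusive (a CS of exactly 0.85 is high, exactly 0.50 is moderate), which the table does not spell out; the function name is hypothetical and is not part of the svspr package.

```python
def tier_from_cs(cs):
    """Map a confidence score in [0, 1] to a tier label using the table's cutoffs."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError("CS must be between 0 and 1")
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"


print(tier_from_cs(0.69))  # the quick-start example score
```

Under these cutoffs the quick-start score of 0.69 lands in the moderate tier, matching the example output shown there.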

Performance vs. comparable filters

Model	F1	AUROC	Inputs needed
SV-SPR (this)	0.9593	0.9739	VCF + reference
v13_parents (Manta-specific FULL, 23 feat)	0.9308	0.9283	VCF + Manta-specific fields
Jan 2026 bioRxiv (15 features)	0.957	—	VCF + reference
PostSV (2020)	~0.74	—	VCF + caller features

(Performance numbers reported on 226-trio Korean cohort, sample-LOSO.)

Intended use

Post-VCF filtering / rescoring of short-read SV calls
Quality control in population-scale SV pipelines
Prioritising calls for long-read validation in clinical settings

Out of scope / limitations

Trained on GRCh38 only — performance on T2T-CHM13 not validated
Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
Sequence-only by default; if your caller provides PR/SR/AD fields, supplying total_alt_support may improve recall but is not strictly required
Does not call SVs; it rescores calls already present in a VCF
Best for DEL/INS in the 100 bp–10 kb range. BND/DUP calls have lower true-positive base rates (DRAGEN Manta BND has a ~88% over-call rate)

Caveats and known issues

Sample-LOSO assumes parental independence. Korean founder effects may invalidate this; we have not measured pairwise IBD.
Sequence vs. depth contribution. Adding depth features (T01d +both) yields F1 0.9579, which is not significantly different from sequence-only (Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant.

Prior art (substantial overlap)

This model occupies a niche close to several prior tools. Closest analogues:

Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering" — RF on 15 features, F1=0.957. The closest direct comparison.
DeepSVFilter (Liu 2020) — CNN, requires BAM.
Samplot-ML (Belyeu 2021, Genome Biology) — CNN, requires BAM.
Duphold (Pedersen 2019, GigaScience) — depth + GC-matched, requires BAM.
CADD-SV (Kleinert 2022, Genome Research) — RF on sequence annotations.
PostSV (Alzaid 2020), SV2 (Antaki 2018), GATK-SV (Collins 2020).

Our differentiator: reference-only inputs (no BAM, no caller-specific fields). Applicable to legacy VCFs and federated cohorts where read-level access is restricted.

Training script

See training/T01d_unified_with_v13.py in the source repo for the exact reproduction recipe.

Citation

Manuscript in preparation. Provisional citation:

Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural
variants. Preprint, 2026.

Contact

Author: Woohun Kim (alex990713@gmail.com)
Lab: KAIST
Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS