metadata
language:
  - en
license: cc-by-4.0
library_name: scikit-learn
tags:
  - genomics
  - structural-variants
  - bioinformatics
  - vcf
  - variant-filtering
pipeline_tag: tabular-classification

SV-SPR: Short-read SV confidence Scoring with PaRents-only training

Reference-only post-VCF rescoring for short-read structural-variant (SV) calls. Given a VCF entry (coordinate + type + length) and a reference FASTA, the model returns a confidence score (CS): the RandomForest probability that the call would be confirmed by long-read sequencing (LRS). The CS is uncalibrated out-of-the-box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply isotonic/Platt calibration before using it as a literal probability. Tiers (High ≥ 0.9 / Moderate 0.7–0.9 / Warning 0.5–0.7 / Low < 0.5) match Methods 2.7.2.
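
As a rough illustration of the calibration step mentioned above, the sketch below fits scikit-learn's `IsotonicRegression` on one half of a labelled held-out set and compares a simple equal-width-bin ECE before and after on the other half. Everything here is an assumption for illustration: the scores and labels are synthetic, and `expected_calibration_error`, `raw_cs`, and the 50/50 split are hypothetical names and choices, not part of the `svspr` package.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean of |observed rate - mean score|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map every score in [0, 1] to a bin index 0..n_bins-1.
    idx = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece


rng = np.random.default_rng(0)

# Synthetic stand-in for LRS-labelled held-out calls (NOT real SV-SPR output):
# the raw score sits systematically below the true confirmation rate.
true_rate = rng.uniform(0.0, 1.0, 4000)
lrs_confirmed = (rng.uniform(0.0, 1.0, 4000) < true_rate).astype(int)
raw_cs = true_rate ** 2

# Fit the calibrator on one half, evaluate on the other half.
fit, ev = slice(0, 2000), slice(2000, 4000)
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cs[fit], lrs_confirmed[fit])
calibrated_cs = calibrator.predict(raw_cs[ev])

ece_before = expected_calibration_error(lrs_confirmed[ev], raw_cs[ev])
ece_after = expected_calibration_error(lrs_confirmed[ev], calibrated_cs)
print(f"ECE before: {ece_before:.3f}  after: {ece_after:.3f}")
```

Isotonic regression only assumes the score is monotonically related to the confirmation rate; with a small labelled set, a Platt-style logistic fit may be the more stable choice.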

No BAM file is needed. Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR requires only the VCF and a reference FASTA. This makes it applicable to legacy VCFs, summary-stat archives, and federated cohorts where read-level data are unavailable.

Model details

| Item | Value |
| --- | --- |
| Architecture | RandomForest (200 trees, balanced class weight) |
| Features | 11 (sequence-context + length + type) |
| Training data | 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi |
| Training samples | 143 parents only (no children, ASD-agnostic) |
| Cross-validation | 143-fold sample-LOSO |
| F1 (CV) | 0.9593 [95% CI 0.9564–0.9616] |
| AUROC (CV) | 0.9739 |
| Reference | GRCh38 |
| Caller agnostic | Yes (no Manta / DRAGEN / Delly fields required) |
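
The cross-validation scheme in the table (leave-one-sample-out, where every call from one parent is held out together) can be sketched with scikit-learn's `LeaveOneGroupOut`. This is a minimal illustration on synthetic data, not the training script: the feature matrix, labels, the six mock samples, and averaging per-fold F1 are all assumptions; only the forest settings (200 trees, balanced class weight) and the 11-feature width come from the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 mock samples x 60 calls each, 11 features per call.
n_samples, calls_per_sample, n_features = 6, 60, 11
X = rng.normal(size=(n_samples * calls_per_sample, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=len(X)) > 0).astype(int)
# One group id per call: the sample the call came from.
groups = np.repeat(np.arange(n_samples), calls_per_sample)

fold_f1 = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    )
    clf.fit(X[train_idx], y[train_idx])
    # Score only the held-out sample's calls.
    fold_f1.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"{len(fold_f1)} folds, mean F1 = {np.mean(fold_f1):.3f}")
```

Grouping by sample keeps calls from the same genome out of both the training and test side of a fold, which is the point of LOSO over a plain random split.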

Features used (11)

| # | Feature | Source |
| --- | --- | --- |
| 1 | svlen_abs | VCF SVLEN |
| 2 | log10_svlen | derived |
| 3-6 | svtype_DEL/INS/DUP/BND | VCF SVTYPE one-hot |
| 7 | gc_flank_w100 | GC fraction in ±100 bp flanks |
| 8 | at_flank_w100 | AT fraction in ±100 bp flanks |
| 9 | gc_inner_w100 | GC fraction inside the called region |
| 10 | n_motif_2_w100 | tandem dinucleotide count in flank |
| 11 | n_motif_3_w100 | tandem trinucleotide count in flank |

Sequence features contribute 47.4% of total importance (T01d analysis).
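
To make the feature table concrete, here is a hedged sketch of how the 11 values could be derived from already-extracted sequence strings. The feature names come from the table; the exact definitions are assumptions (in particular, a "tandem" motif is taken to mean a unit repeated at least three times in a row, and `sequence_features`, `gc_fraction`, and `count_tandem_repeats` are hypothetical helpers, not the package's implementation). Short strings replace real 100 bp flanks to keep the example readable.

```python
import math
import re

SVTYPES = ("DEL", "INS", "DUP", "BND")


def base_fraction(seq, bases):
    """Fraction of A/C/G/T positions in seq that fall in `bases` (N is ignored)."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    return sum(seq.count(b) for b in bases) / total if total else 0.0


def gc_fraction(seq):
    return base_fraction(seq, "GC")


def count_tandem_repeats(seq, unit_len, min_copies=3):
    """Count non-overlapping runs where a unit_len-mer repeats >= min_copies times.

    Assumed definition; homopolymer runs also match (e.g. AA repeated).
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return sum(1 for _ in pattern.finditer(seq.upper()))


def sequence_features(left_flank, inner, right_flank, svtype, svlen):
    """Build the 11-feature dict from flank/inner sequences and VCF fields."""
    svlen_abs = abs(svlen)
    feats = {
        "svlen_abs": svlen_abs,
        "log10_svlen": math.log10(max(svlen_abs, 1)),
    }
    for t in SVTYPES:
        feats[f"svtype_{t}"] = int(svtype == t)
    flanks = (left_flank, right_flank)
    feats["gc_flank_w100"] = gc_fraction(left_flank + right_flank)
    feats["at_flank_w100"] = base_fraction(left_flank + right_flank, "AT")
    feats["gc_inner_w100"] = gc_fraction(inner)
    # Count per flank so a repeat cannot span the artificial join.
    feats["n_motif_2_w100"] = sum(count_tandem_repeats(f, 2) for f in flanks)
    feats["n_motif_3_w100"] = sum(count_tandem_repeats(f, 3) for f in flanks)
    return feats


feats = sequence_features(
    left_flank="ACACACACGGTT", inner="GGCC", right_flank="CAGCAGCAGTTA",
    svtype="DEL", svlen=-5000,
)
print(feats)
```

In practice the flank and inner sequences would be fetched from the reference FASTA around `pos` and `end`; that step is left out here to keep the sketch dependency-free.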

Quick start

from svspr import classify, score

# 1. Score a single SV
result = classify(
    chrom='chr1', pos=1000000, end=1005000,
    svtype='DEL', svlen=5000, total_alt_support=15,
    ref_path='GRCh38.fa',
)
# {'CS': 0.69, 'tier': 'moderate'}

# 2. Score every SV in a VCF
df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()

CLI usage:

# Batch
python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv

# Single
python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
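
Once calls are scored, downstream handling is ordinary table filtering. The sketch below uses a small mock table with the same columns shown in the Python example (`chrom`, `pos`, `svtype`, `CS`, `tier`); the rows are invented, and it is an assumption that the CLI's `scored.tsv` carries the same columns.

```python
import pandas as pd

# Mock stand-in for the table returned by score(); all values are invented.
scored = pd.DataFrame({
    "chrom":  ["chr1", "chr1", "chr2", "chr7", "chrX"],
    "pos":    [1000000, 2500000, 330000, 9100000, 154000],
    "svtype": ["DEL", "INS", "DUP", "BND", "DEL"],
    "CS":     [0.93, 0.69, 0.41, 0.12, 0.88],
    "tier":   ["high", "moderate", "warning", "low", "high"],
})

# Calls to carry forward into downstream analysis.
keep = scored[scored["tier"].isin(["high", "moderate"])]

# Calls to queue for long-read validation, most promising first.
to_validate = scored[scored["tier"] == "warning"].sort_values("CS", ascending=False)

# Per-type tier breakdown, useful as a quick QC summary.
tier_counts = scored.groupby(["svtype", "tier"]).size().unstack(fill_value=0)
print(tier_counts)
```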

Tier definition

| Tier | Confidence Score (CS) | Recommendation |
| --- | --- | --- |
| high | ≥ 0.85 | High-confidence call. Use directly. |
| moderate | 0.50 – 0.85 | Acceptable. Consider tier in downstream filters. |
| warning | 0.30 – 0.50 | Low confidence. LRS validation recommended. |
| low | < 0.30 | Likely false positive. Exclude unless rescued. |
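
The table in this section translates into a simple threshold lookup. The sketch below uses this table's cut-offs, which agree with the quick-start output (CS 0.69 reported as `moderate`). The function name `tier_for` is hypothetical, and treating each lower bound as inclusive is an assumption, since the table does not say which tier a score of exactly 0.50 or 0.30 falls into.

```python
def tier_for(cs):
    """Map a confidence score in [0, 1] to a tier label.

    Lower bounds are treated as inclusive (an assumption).
    """
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must be in [0, 1], got {cs!r}")
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"


for cs in (0.93, 0.69, 0.41, 0.12):
    print(cs, tier_for(cs))
```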

Performance vs. comparable filters

| Model | F1 | AUROC | Inputs needed |
| --- | --- | --- | --- |
| SV-SPR (this) | 0.9593 | 0.9739 | VCF + reference |
| v13_parents (Manta-specific FULL, 23 feat) | 0.9308 | 0.9283 | VCF + Manta-specific fields |
| Jan 2026 bioRxiv (15 features) | 0.957 | n/a | VCF + reference |
| PostSV (2020) | ~0.74 | n/a | VCF + caller features |

(Performance numbers reported on 226-trio Korean cohort, sample-LOSO.)

Intended use

  • Post-VCF filtering / rescoring of short-read SV calls
  • Quality control in population-scale SV pipelines
  • Prioritising calls for long-read validation in clinical settings

Out of scope / limitations

  • Trained on GRCh38 only; performance on T2T-CHM13 has not been validated
  • Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
  • Sequence-only by default; if your caller provides PR/SR/AD fields, supplying total_alt_support may improve recall but is not strictly required
  • Does not call SVs; it rescores already-called VCFs
  • Best for DEL/INS in the 100bp–10kb range. BND/DUP have lower base rates (DRAGEN Manta BND has ~88% over-call rate)

Caveats and known issues

  • Sample-LOSO assumes parental independence. Korean founder effects may invalidate this; we have not measured pairwise IBD.
  • Sequence vs. depth contribution. Adding depth features (T01d +both) yields F1 0.9579, which is not significantly different from sequence-only (Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant.
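
One observation that may help when reading the Wilcoxon p value above: 0.0625 = 2/2^5 is the smallest two-sided p value an exact signed-rank test can return with five untied pairs, reached when all five differences share a sign. Whether the comparison here was run on five pairs is an assumption, not something this card states. The sketch below enumerates the exact null distribution in plain Python; `exact_wilcoxon_two_sided` is a hypothetical helper and the differences are invented.

```python
from itertools import product


def exact_wilcoxon_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank p value.

    Assumes no zero differences and no tied absolute differences.
    """
    n = len(diffs)
    total = n * (n + 1) // 2
    # Rank the absolute differences from 1 (smallest) to n (largest).
    ranks = {a: r for r, a in enumerate(sorted(abs(d) for d in diffs), start=1)}
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under H0 each of the 2**n sign patterns is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(wp, total - wp) <= observed:
            hits += 1
    return hits / 2 ** n


# Invented paired differences (sequence-only minus +both), in units of 0.001 F1.
diffs = [2, 1, 3, 4, 5]
p_value = exact_wilcoxon_two_sided(diffs)
print(p_value)  # 0.0625
```

Read this way, a p value at that floor says little about effect size; the overlapping confidence intervals carry most of the evidence for the "not significantly different" conclusion.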

Prior art (substantial overlap)

This model occupies a niche close to several prior tools. Closest analogues:

  • Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering": RF on 15 features, F1=0.957. The closest direct comparison.
  • DeepSVFilter (Liu 2020): CNN, requires BAM.
  • Samplot-ML (Belyeu 2021, Genome Biology): CNN, requires BAM.
  • Duphold (Pedersen 2019, GigaScience): depth + GC-matched, requires BAM.
  • CADD-SV (Kleinert 2022, Genome Research): RF on sequence annotations.
  • PostSV (Alzaid 2020), SV2 (Antaki 2018), GATK-SV (Collins 2020).

Our differentiator: reference-only inputs (no BAM, no caller-specific fields). Applicable to legacy VCFs and federated cohorts where read-level access is restricted.

Training script

See training/T01d_unified_with_v13.py in the source repo for the exact reproduction recipe.

Citation

Manuscript in preparation. Provisional citation:

Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural
variants. Preprint, 2026.

Contact

  • Author: Woohun Kim (alex990713@gmail.com)
  • Lab: KAIST
  • Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS