---
language:
  - en
license: cc-by-4.0
library_name: scikit-learn
tags:
  - genomics
  - structural-variants
  - bioinformatics
  - vcf
  - variant-filtering
pipeline_tag: tabular-classification
---

# SV-SPR: Short-read SV confidence Scoring with PaRents-only training

Reference-only post-VCF rescoring for short-read structural-variant (SV) calls.
Given a VCF entry (coordinate + type + length) and a reference FASTA, the model
returns a confidence score (CS) — the RandomForest probability that the call
would be confirmed by long-read sequencing (LRS). The CS is **uncalibrated**
out-of-the-box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply
isotonic/Platt calibration before using it as a literal probability. Tiers
(high ≥ 0.85 / moderate 0.50–0.85 / warning 0.30–0.50 / low < 0.30) follow the
Tier definition table below.

> **No BAM file is needed.** Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR
> only requires the VCF and a reference. This makes it applicable to legacy
> VCFs, summary-stat archives, and federated cohorts where read-level data is
> unavailable.
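
The calibration step recommended above can be sketched as follows. This is a minimal illustration, not the project's own procedure: the data are synthetic stand-ins for raw CS values and LRS confirmation labels, the equal-width 10-bin ECE is an assumed definition (the card does not say which ECE variant produced the 0.07 figure), and `expected_calibration_error` is a hypothetical helper name. Isotonic regression is shown; Platt scaling would swap in a logistic fit on the same held-out scores.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean of |observed rate - mean score|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Map each score to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of calls in the bin
    return float(ece)


# Synthetic, under-confident scores (NOT real SV-SPR output): the true
# confirmation rate sqrt(score) is higher than the score itself.
rng = np.random.default_rng(0)
raw_cs = rng.uniform(0.0, 1.0, size=4000)
confirmed = (rng.random(4000) < np.sqrt(raw_cs)).astype(int)

# Fit the calibrator on one half and evaluate on the other half.
fit_idx, eval_idx = np.arange(0, 2000), np.arange(2000, 4000)
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cs[fit_idx], confirmed[fit_idx])
cs_cal = calibrator.predict(raw_cs[eval_idx])

ece_before = expected_calibration_error(confirmed[eval_idx], raw_cs[eval_idx])
ece_after = expected_calibration_error(confirmed[eval_idx], cs_cal)
print(f"ECE raw: {ece_before:.3f}  ECE calibrated: {ece_after:.3f}")
```

In practice the calibrator should be fitted on calls with LRS truth labels that were not used to train the model, ideally from the target cohort.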

## Model details

| Item | Value |
|--|--|
| Architecture | RandomForest (200 trees, balanced class weight) |
| Features | 11 (sequence-context + length + type) |
| Training data | 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi |
| Training samples | 143 parents only (no children, ASD-agnostic) |
| Cross-validation | 143-fold leave-one-sample-out (sample-LOSO) |
| **F1 (CV)** | **0.9593** [95% CI 0.9564–0.9616] |
| **AUROC (CV)** | **0.9739** |
| Reference | GRCh38 |
| Caller agnostic | Yes — no Manta / DRAGEN / Delly fields required |
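
The architecture and cross-validation rows above correspond roughly to the scikit-learn setup sketched here. It is an illustration on synthetic data, not the training script: the feature matrix, labels, sample count, and the choice to pool out-of-fold predictions before computing metrics are all assumptions. Only the 200 trees, the balanced class weight, the 11 features, and the leave-one-sample-out grouping come from the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in: 6 "samples" with 40 SV calls each, 11 features per call.
rng = np.random.default_rng(0)
n_samples, calls_per_sample, n_features = 6, 40, 11
X = rng.normal(size=(n_samples * calls_per_sample, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=len(X)) > 0).astype(int)  # toy labels
groups = np.repeat(np.arange(n_samples), calls_per_sample)     # sample ID per call

# Leave-one-sample-out: every call from the held-out sample is scored by a
# model that never saw that sample during training.
oof = np.zeros(len(X))
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    )
    clf.fit(X[train_idx], y[train_idx])
    oof[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

# Pooled out-of-fold metrics (pooling is an assumption of this sketch).
f1 = f1_score(y, (oof >= 0.5).astype(int))
auroc = roc_auc_score(y, oof)
print(f"F1={f1:.3f}  AUROC={auroc:.3f}")
```

Grouping by sample rather than by call keeps calls from the same individual out of both the training and test folds at once.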

### Features used (11)

| # | Feature | Source |
|--|--|--|
| 1 | svlen_abs | VCF SVLEN |
| 2 | log10_svlen | derived |
| 3-6 | svtype_DEL/INS/DUP/BND | VCF SVTYPE one-hot |
| 7 | gc_flank_w100 | GC fraction in the ±100 bp flanks |
| 8 | at_flank_w100 | AT fraction in the ±100 bp flanks |
| 9 | gc_inner_w100 | GC fraction inside the called region |
| 10 | n_motif_2_w100 | tandem dinucleotide count in the flanks |
| 11 | n_motif_3_w100 | tandem trinucleotide count in the flanks |

Sequence features contribute **47.4% of total importance** (T01d analysis).
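
To make the sequence-context features in the table more concrete, here is one way they could be computed from a reference string. This is a sketch under stated assumptions, not the package's implementation: the function names are hypothetical, coordinates are 0-based half-open, a "tandem" run is assumed to mean at least three consecutive copies of a non-homopolymer unit, left and right flanks are scanned separately, and N bases count in the denominator of the fractions. The real definitions live in the training script.

```python
import re


def base_fraction(seq, bases):
    """Fraction of positions in seq whose base is in `bases` (N stays in the denominator)."""
    seq = seq.upper()
    return sum(b in bases for b in seq) / len(seq) if seq else 0.0


def count_tandem_runs(seq, unit_len, min_copies=3):
    """Count non-overlapping runs of >= min_copies copies of a unit_len-bp unit."""
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    # Skip homopolymer units such as "AA" so poly-A is not counted as a dinucleotide repeat.
    return sum(1 for m in pattern.finditer(seq.upper()) if len(set(m.group(1))) > 1)


def sequence_features(ref, start, end, w=100):
    """Flank and inner sequence features for the interval [start, end) on `ref`."""
    left = ref[max(0, start - w):start]
    right = ref[end:end + w]
    flank = left + right
    return {
        f"gc_flank_w{w}": base_fraction(flank, "GC"),
        f"at_flank_w{w}": base_fraction(flank, "AT"),
        f"gc_inner_w{w}": base_fraction(ref[start:end], "GC"),
        # Flanks are scanned separately so no repeat is created at the join.
        f"n_motif_2_w{w}": count_tandem_runs(left, 2) + count_tandem_runs(right, 2),
        f"n_motif_3_w{w}": count_tandem_runs(left, 3) + count_tandem_runs(right, 3),
    }


# Toy 50 bp "reference" with a 10 bp call at [20, 30) and 10 bp flanks.
toy_ref = (
    "AAAAAAAAAA"   # 0-10
    "CACACACACA"   # 10-20  left flank: (CA)x5
    "GGGGGCCCCC"   # 20-30  called region
    "CAGCAGCAGT"   # 30-40  right flank: (CAG)x3
    "TTTTTTTTTT"   # 40-50
)
feats = sequence_features(toy_ref, 20, 30, w=10)
print(feats)
```

With a real genome the `ref` string would come from an indexed FASTA lookup for the relevant chromosome, and VCF `POS` (1-based) would need converting to this 0-based convention.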

## Quick start

```python
from svspr import classify, score

# 1. Score a single SV
result = classify(
    chrom='chr1', pos=1000000, end=1005000,
    svtype='DEL', svlen=5000, total_alt_support=15,
    ref_path='GRCh38.fa',
)
# {'CS': 0.69, 'tier': 'moderate'}

# 2. Score every SV in a VCF
df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()
```

CLI usage:

```bash
# Batch
python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv

# Single
python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
```

## Tier definition

| Tier | Confidence Score (CS) | Recommendation |
|--|--|--|
| high | ≥ 0.85 | High-confidence call. Use directly. |
| moderate | 0.50 – 0.85 | Acceptable. Consider tier in downstream filters. |
| warning | 0.30 – 0.50 | Low confidence. LRS validation recommended. |
| low | < 0.30 | Likely false positive. Exclude unless rescued. |
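
The thresholds in the table can be written as a small lookup. This sketch assumes each boundary value belongs to the higher tier (so exactly 0.50 is "moderate" and exactly 0.30 is "warning"); `cs_to_tier` is a hypothetical helper, not necessarily a function exported by `svspr`.

```python
def cs_to_tier(cs):
    """Map a confidence score in [0, 1] to its tier label."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs!r}")
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"


# The quick-start example score falls in the moderate tier.
print(cs_to_tier(0.69))
```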

## Performance vs. comparable filters

| Model | F1 | AUROC | Inputs needed |
|--|--|--|--|
| SV-SPR (this) | **0.9593** | 0.9739 | VCF + reference |
| v13_parents (Manta-specific FULL, 23 features) | 0.9308 | 0.9283 | VCF + Manta-specific fields |
| Jan 2026 bioRxiv (15 features) | 0.957 | — | VCF + reference |
| PostSV (2020) | ~0.74 | — | VCF + caller features |

(Performance numbers reported on 226-trio Korean cohort, sample-LOSO.)

## Intended use

- Post-VCF filtering / rescoring of short-read SV calls
- Quality control in population-scale SV pipelines
- Prioritising calls for long-read validation in clinical settings

## Out of scope / limitations

- Trained on **GRCh38** only — performance on T2T-CHM13 not validated
- Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
- Sequence-only by default; if your caller provides PR/SR/AD fields, supplying
  `total_alt_support` may improve recall but is not strictly required
- Does not call SVs; it rescores calls already present in a VCF
- Best for DEL/INS in the 100 bp–10 kb range. BND and DUP calls have lower
  LRS-confirmation rates (DRAGEN Manta BND calls have an ~88% over-call rate)

## Caveats and known issues

- **Sample-LOSO assumes parental independence.** Korean founder effects may
  invalidate this; we have not measured pairwise IBD.
- **Sequence vs. depth contribution.** Adding depth features (T01d +both)
  yields F1 0.9579 versus 0.9593 for sequence-only; the difference is not
  significant (Wilcoxon p=0.0625, CIs overlap). Depth and sequence are
  partially redundant.

## Prior art (substantial overlap)

This model occupies a niche close to several prior tools. Closest analogues:

- **Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering"** — RF on
  15 features, F1=0.957. The closest direct comparison.
- **DeepSVFilter** (Liu 2020) — CNN, requires BAM.
- **Samplot-ML** (Belyeu 2021, Genome Biology) — CNN, requires BAM.
- **Duphold** (Pedersen 2019, GigaScience) — depth + GC-matched, requires BAM.
- **CADD-SV** (Kleinert 2022, Genome Research) — RF on sequence annotations.
- **PostSV** (Alzaid 2020), **SV2** (Antaki 2018), **GATK-SV** (Collins 2020).

**Our differentiator**: reference-only inputs (no BAM, no caller-specific
fields). Applicable to legacy VCFs and federated cohorts where read-level
access is restricted.

## Training script

See `training/T01d_unified_with_v13.py` in the source repo for the exact
reproduction recipe.

## Citation

Manuscript in preparation. Provisional citation:

```
Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural
variants. Preprint, 2026.
```

## Contact

- Author: Woohun Kim (alex990713@gmail.com)
- Lab: KAIST
- Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS