---
language:
- en
license: cc-by-4.0
library_name: scikit-learn
tags:
- genomics
- structural-variants
- bioinformatics
- vcf
- variant-filtering
pipeline_tag: tabular-classification
---
# SV-SPR: Short-read SV confidence Scoring with PaRents-only training
Reference-only post-VCF rescoring for short-read structural-variant (SV) calls.
Given a VCF entry (coordinate + type + length) and a reference FASTA, the model
returns a confidence score (CS): the RandomForest probability that the call
would be confirmed by long-read sequencing (LRS). The CS is **uncalibrated**
out of the box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply
isotonic/Platt calibration before using it as a literal probability. Tiers
(High ≥ 0.9 / Moderate 0.7–0.9 / Warning 0.5–0.7 / Low < 0.5) match Methods 2.7.2.
> **No BAM file is needed.** Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR
> only requires the VCF and a reference. This makes it applicable to legacy
> VCFs, summary-stat archives, and federated cohorts where read-level data is
> unavailable.
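To make the required inputs concrete, here is a minimal sketch of pulling the coordinate, type and length out of one VCF data line. It assumes the standard `SVTYPE`, `SVLEN` and `END` INFO keys are present; `parse_sv_record` is a hypothetical helper and is not part of the `svspr` package.

```python
def parse_sv_record(line):
    """Extract chrom, pos, end, svtype and |svlen| from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    tags = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        tags[key] = value

    svtype = tags.get("SVTYPE")
    # SVLEN is commonly negative for deletions; keep the absolute length.
    svlen = abs(int(tags["SVLEN"].split(",")[0])) if tags.get("SVLEN") else None
    # END may be absent (e.g. for breakends); fall back to POS in that case.
    end = int(tags["END"]) if tags.get("END") else pos
    return {"chrom": chrom, "pos": pos, "end": end, "svtype": svtype, "svlen": svlen}


record = parse_sv_record(
    "chr1\t1000000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-5000;END=1005000"
)
```

No FORMAT or sample columns are read, which is what makes the approach independent of caller-specific fields.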
## Model details
| Item | Value |
|--|--|
| Architecture | RandomForest (200 trees, balanced class weight) |
| Features | 11 (sequence-context + length + type) |
| Training data | 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi |
| Training samples | 143 parents only (no children, ASD-agnostic) |
| Cross-validation | 143-fold sample-LOSO |
| **F1 (CV)** | **0.9593** [95% CI 0.9564–0.9616] |
| **AUROC (CV)** | **0.9739** |
| Reference | GRCh38 |
| Caller agnostic | Yes; no Manta / DRAGEN / Delly fields required |
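Sample-LOSO holds out every call from one individual per fold, so calls from the same person never appear on both sides of a split. The sketch below shows that splitting with made-up sample IDs; `leave_one_sample_out` is a hypothetical helper (scikit-learn's `LeaveOneGroupOut` produces the same kind of split).

```python
def leave_one_sample_out(sample_ids):
    """Yield (held_out_sample, train_idx, test_idx), one fold per unique sample."""
    for held_out in sorted(set(sample_ids)):
        train_idx = [i for i, s in enumerate(sample_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(sample_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# One entry per SV call: the individual it was called in (illustrative IDs).
sample_ids = ["P001", "P001", "P002", "P003", "P003", "P003"]
folds = list(leave_one_sample_out(sample_ids))
```

With 143 training samples this scheme yields 143 folds, matching the cross-validation row in the table.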
### Features used (11)
| # | Feature | Source |
|--|--|--|
| 1 | svlen_abs | VCF SVLEN |
| 2 | log10_svlen | derived |
| 3-6 | svtype_DEL/INS/DUP/BND | VCF SVTYPE one-hot |
| 7 | gc_flank_w100 | GC fraction in ±100 bp flanks |
| 8 | at_flank_w100 | AT fraction in ±100 bp flanks |
| 9 | gc_inner_w100 | GC fraction inside the called region |
| 10 | n_motif_2_w100 | tandem dinucleotide count in flank |
| 11 | n_motif_3_w100 | tandem trinucleotide count in flank |
Sequence features contribute **47.4% of total importance** (T01d analysis).
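The flank features (7, 8, 10, 11) can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the exact window placement and the rule for counting tandem motifs are not given in this card, so the definition used here (a k-mer immediately followed by an identical copy) and the helper names are hypothetical, and the values need not match the packaged model's features.

```python
def base_fraction(seq, bases):
    """Fraction of the given bases among unambiguous A/C/G/T positions."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    return sum(seq.count(b) for b in bases) / counted if counted else 0.0


def tandem_motif_count(seq, k):
    """Count positions where a k-mer is immediately followed by an identical copy."""
    seq = seq.upper()
    count = 0
    for i in range(len(seq) - 2 * k + 1):
        unit = seq[i:i + k]
        if "N" not in unit and unit == seq[i + k:i + 2 * k]:
            count += 1
    return count


def flank_features(left_flank, right_flank):
    """Illustrative versions of features 7, 8, 10 and 11 from the two flanks."""
    both = left_flank + right_flank
    return {
        "gc_flank_w100": base_fraction(both, "GC"),
        "at_flank_w100": base_fraction(both, "AT"),
        # Motifs are counted per flank so no repeat is invented at the junction.
        "n_motif_2_w100": tandem_motif_count(left_flank, 2) + tandem_motif_count(right_flank, 2),
        "n_motif_3_w100": tandem_motif_count(left_flank, 3) + tandem_motif_count(right_flank, 3),
    }


# Short toy flanks; real flanks would be 100 bp each, read from the reference FASTA.
features = flank_features("ACACAC", "GGGCCC")
```

Note that under this simple definition a homopolymer run also counts as a dinucleotide repeat; a stricter definition would exclude it.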
## Quick start
```python
from svspr import classify, score

# 1. Score a single SV
result = classify(
    chrom='chr1', pos=1000000, end=1005000,
    svtype='DEL', svlen=5000, total_alt_support=15,
    ref_path='GRCh38.fa',
)
# {'CS': 0.69, 'tier': 'moderate'}

# 2. Score every SV in a VCF
df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()
```
CLI usage:
```bash
# Batch
python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv
# Single
python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
```
## Tier definition
| Tier | Confidence Score (CS) | Recommendation |
|--|--|--|
| high | ≥ 0.85 | High-confidence call. Use directly. |
| moderate | 0.50 – 0.85 | Acceptable. Consider tier in downstream filters. |
| warning | 0.30 – 0.50 | Low confidence. LRS validation recommended. |
| low | < 0.30 | Likely false positive. Exclude unless rescued. |
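The thresholds in this table translate directly into a small lookup. The sketch below assumes each lower bound is inclusive (for example, a CS of exactly 0.50 is `moderate`), which the table does not spell out; `assign_tier` is a hypothetical name and may differ from what the package does internally.

```python
def assign_tier(cs):
    """Map a confidence score in [0, 1] to a tier name from the table."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError("confidence score must lie in [0, 1]")
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"


tiers = [assign_tier(cs) for cs in (0.92, 0.69, 0.41, 0.12)]
```

A CS of 0.69 maps to `moderate`, consistent with the Quick start output.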
## Performance vs. comparable filters
| Model | F1 | AUROC | Inputs needed |
|--|--|--|--|
| SV-SPR (this) | **0.9593** | 0.9739 | VCF + reference |
| v13_parents (Manta-specific FULL, 23 feat) | 0.9308 | 0.9283 | VCF + Manta-specific fields |
| Jan 2026 bioRxiv (15 features) | 0.957 | - | VCF + reference |
| PostSV (2020) | ~0.74 | - | VCF + caller features |
(Performance numbers reported on 226-trio Korean cohort, sample-LOSO.)
## Intended use
- Post-VCF filtering / rescoring of short-read SV calls
- Quality control in population-scale SV pipelines
- Prioritising calls for long-read validation in clinical settings
## Out of scope / limitations
- Trained on **GRCh38** only; performance on T2T-CHM13 not validated
- Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
- Sequence-only by default; if your caller provides PR/SR/AD fields, supplying
`total_alt_support` may improve recall but is not strictly required
- Does not call SVs; feeds on already-called VCFs
- Best for DEL/INS in the 100 bp–10 kb range. BND/DUP have lower base rates
(DRAGEN Manta BND has ~88% over-call rate)
## Caveats and known issues
- **Sample-LOSO assumes parental independence.** Korean founder effects may
invalidate this; we have not measured pairwise IBD.
- **Sequence vs. depth contribution.** Adding depth features (T01d +both)
yields F1 0.9579, which is not significantly different from sequence-only
(Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant.
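For context on reading the Wilcoxon result: in an exact two-sided signed-rank test on five paired measurements, 0.0625 is the smallest p-value attainable (all five differences in the same direction), so no outcome can reach p < 0.05. The source does not state how many pairs were compared, so the five-pair setup and the difference values below are purely illustrative. The sketch enumerates all sign patterns with the standard library only and assumes no zero or tied differences.

```python
from itertools import product


def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value (no zeros, no tied |diffs|)."""
    if not diffs:
        raise ValueError("need at least one paired difference")
    abs_sorted = sorted(abs(d) for d in diffs)
    if abs_sorted[0] == 0 or len(set(abs_sorted)) != len(abs_sorted):
        raise ValueError("this sketch assumes non-zero, untied differences")
    rank = {a: r for r, a in enumerate(abs_sorted, start=1)}
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)

    # Under H0 every sign pattern is equally likely; enumerate all 2^n of them.
    n = len(diffs)
    null = [sum(r for r, keep in zip(range(1, n + 1), signs) if keep)
            for signs in product((0, 1), repeat=n)]
    p_low = sum(w <= w_plus for w in null) / len(null)
    p_high = sum(w >= w_plus for w in null) / len(null)
    return min(1.0, 2 * min(p_low, p_high))


# Illustrative differences only: five pairs, all favouring the same model.
p = wilcoxon_exact_two_sided([0.0014, 0.0009, 0.0021, 0.0005, 0.0017])
```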
## Prior art (substantial overlap)
This model occupies a niche close to several prior tools. Closest analogues:
- **Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering"**: RF on
15 features, F1=0.957. The closest direct comparison.
- **DeepSVFilter** (Liu 2020): CNN, requires BAM.
- **Samplot-ML** (Belyeu 2021, Genome Biology): CNN, requires BAM.
- **Duphold** (Pedersen 2019, GigaScience): depth + GC-matched, requires BAM.
- **CADD-SV** (Kleinert 2022, Genome Research): RF on sequence annotations.
- **PostSV** (Alzaid 2020), **SV2** (Antaki 2018), **GATK-SV** (Collins 2020).
**Our differentiator**: reference-only inputs (no BAM, no caller-specific
fields). Applicable to legacy VCFs and federated cohorts where read-level
access is restricted.
## Training script
See `training/T01d_unified_with_v13.py` in the source repo for the exact
reproduction recipe.
## Citation
Manuscript in preparation. Provisional citation:
```
Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural
variants. Preprint, 2026.
```
## Contact
- Author: Woohun Kim (alex990713@gmail.com)
- Lab: KAIST
- Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS