	---
	language:
	- en
	license: cc-by-4.0
	library_name: scikit-learn
	tags:
	- genomics
	- structural-variants
	- bioinformatics
	- vcf
	- variant-filtering
	pipeline_tag: tabular-classification
	---

	# SV-SPR: Short-read SV confidence Scoring with PaRents-only training

	Reference-only post-VCF rescoring for short-read structural-variant (SV) calls.
	Given a VCF entry (coordinate + type + length) and a reference FASTA, the model
	returns a confidence score (CS) — the RandomForest probability that the call
	would be confirmed by long-read sequencing (LRS). The CS is uncalibrated
	out-of-the-box (held-out ECE ≈ 0.07, under-confident in the mid-range); apply
	isotonic/Platt calibration before using it as a literal probability. Scores are
	binned into four tiers (high, moderate, warning, low); see Tier definition for
	the thresholds.
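
The calibration gap mentioned above can be checked on your own LRS-validated calls before deciding whether to recalibrate. The following is a minimal sketch, assuming equal-width bins (the binning behind the reported ECE is not stated in this card); `expected_calibration_error` and the toy scores are hypothetical and not part of `svspr`.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """ECE with equal-width bins: size-weighted mean of |observed rate - mean score|."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    n = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        confirmed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(confirmed_rate - mean_score)
    return ece


# Toy data (invented): CS values and whether LRS confirmed the call (1) or not (0).
toy_scores = [0.1, 0.1, 0.6, 0.6, 0.9, 0.9]
toy_labels = [0, 0, 1, 1, 1, 1]
toy_ece = expected_calibration_error(toy_scores, toy_labels)
```

In the toy data the mid-range calls (CS 0.6) are all confirmed, which is the under-confident pattern described above. If the ECE on your data is material, scikit-learn's `IsotonicRegression` or `CalibratedClassifierCV` are the usual tools for fitting the calibrator on a held-out set.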

	> No BAM file is needed. Unlike DeepSVFilter, Samplot-ML, or Duphold, SV-SPR
	> only requires the VCF and a reference. This makes it applicable to legacy
	> VCFs, summary-stat archives, and federated cohorts where read-level data is
	> unavailable.

	## Model details

	\| Item \| Value \|
	\|--\|--\|
	\| Architecture \| RandomForest (200 trees, balanced class weight) \|
	\| Features \| 11 (sequence-context + length + type) \|
	\| Training data \| 226 Korean trios, paired Illumina DRAGEN + PacBio HiFi \|
	\| Training samples \| 143 parents only (no children, ASD-agnostic) \|
	\| Cross-validation \| 143-fold leave-one-sample-out (sample-LOSO) \|
	\| F1 (CV) \| 0.9593 [95% CI 0.9564–0.9616] \|
	\| AUROC (CV) \| 0.9739 \|
	\| Reference \| GRCh38 \|
	\| Caller agnostic \| Yes — no Manta / DRAGEN / Delly fields required \|
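
For readers unfamiliar with sample-LOSO, the idea is that each fold holds out every variant from one sample, so no individual contributes to both training and evaluation. A minimal sketch follows, assuming each variant row carries a sample identifier; `loso_splits` and the sample IDs are hypothetical, and the actual training script may be organised differently.

```python
from collections import defaultdict


def loso_splits(sample_ids):
    """Yield (held_out_sample, train_idx, test_idx), one fold per distinct sample."""
    by_sample = defaultdict(list)
    for i, sid in enumerate(sample_ids):
        by_sample[sid].append(i)
    for held_out, test_idx in by_sample.items():
        train_idx = [i for i, sid in enumerate(sample_ids) if sid != held_out]
        yield held_out, train_idx, test_idx


# Toy example: six variant rows from three (invented) parental samples.
sample_ids = ["P001", "P001", "P002", "P003", "P003", "P003"]
folds = list(loso_splits(sample_ids))
```

With 143 parental samples this yields 143 folds. scikit-learn's `LeaveOneGroupOut` produces the same kind of split when the sample ID is passed as `groups`.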

	### Features used (11)

	\| # \| Feature \| Source \|
	\|--\|--\|--\|
	\| 1 \| svlen_abs \| VCF SVLEN \|
	\| 2 \| log10_svlen \| derived \|
	\| 3-6 \| svtype_DEL/INS/DUP/BND \| VCF SVTYPE one-hot \|
	\| 7 \| gc_flank_w100 \| GC fraction in ±100bp flanks \|
	\| 8 \| at_flank_w100 \| AT fraction in ±100bp flanks \|
	\| 9 \| gc_inner_w100 \| GC fraction inside the called region \|
	\| 10 \| n_motif_2_w100 \| tandem dinucleotide count in flank \|
	\| 11 \| n_motif_3_w100 \| tandem trinucleotide count in flank \|

	Sequence features contribute 47.4% of total importance (T01d analysis).
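
To make the sequence-context features in the table concrete, here is a rough sketch of how GC fraction and tandem-motif counts could be derived from a reference string. It is an illustration under stated assumptions, not the `svspr` implementation: the helper names, the 0-based half-open coordinates, and the "at least three back-to-back copies" rule for a tandem run are all hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (N and other codes are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def flanks(ref_seq, start, end, w=100):
    """Left and right flanks of a 0-based, half-open interval [start, end)."""
    return ref_seq[max(0, start - w):start], ref_seq[end:end + w]


def count_tandem_runs(seq, k, min_copies=3):
    """Count runs where a k-mer repeats back-to-back at least min_copies times."""
    seq = seq.upper()
    runs, i = 0, 0
    while i + k <= len(seq):
        motif = seq[i:i + k]
        copies = 1
        while seq[i + copies * k:i + (copies + 1) * k] == motif:
            copies += 1
        # Skip homopolymers (e.g. "AA") and motifs containing N.
        if copies >= min_copies and len(set(motif)) > 1 and "N" not in motif:
            runs += 1
            i += copies * k
        else:
            i += 1
    return runs


# Toy reference: (AC)x4 flank, a 5 bp "called region", then a (CAG)x3 flank.
ref = "GGACACACAC" + "TTTTT" + "CAGCAGCAGTT"
left, right = flanks(ref, 10, 15, w=10)
features = {
    "gc_flank": gc_fraction(left + right),
    "gc_inner": gc_fraction(ref[10:15]),
    "n_motif_2": count_tandem_runs(left, 2) + count_tandem_runs(right, 2),
    "n_motif_3": count_tandem_runs(left, 3) + count_tandem_runs(right, 3),
}
```

The AT fraction of the flanks is simply one minus the GC fraction when ambiguous bases are excluded, as they are here.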

	## Quick start

	```python
	from svspr import classify, score

	# 1. Score a single SV
	result = classify(
	    chrom='chr1', pos=1000000, end=1005000,
	    svtype='DEL', svlen=5000, total_alt_support=15,
	    ref_path='GRCh38.fa',
	)
	# {'CS': 0.69, 'tier': 'moderate'}

	# 2. Score every SV in a VCF
	df = score(vcf_path='my_calls.vcf', ref_path='GRCh38.fa')
	df[['chrom', 'pos', 'svtype', 'CS', 'tier']].head()
	```

	CLI usage:

	```bash
	# Batch
	python -m svspr.cli --vcf my_calls.vcf --ref GRCh38.fa --out scored.tsv

	# Single
	python -m svspr.cli --one --chrom chr1 --pos 1000000 --end 1005000 \
	    --svtype DEL --svlen 5000 --alt-support 15 --ref GRCh38.fa
	```

	## Tier definition

	\| Tier \| Confidence Score (CS) \| Recommendation \|
	\|--\|--\|--\|
	\| high \| CS ≥ 0.85 \| High-confidence call. Use directly. \|
	\| moderate \| 0.50 ≤ CS < 0.85 \| Acceptable. Consider tier in downstream filters. \|
	\| warning \| 0.30 ≤ CS < 0.50 \| Low confidence. LRS validation recommended. \|
	\| low \| CS < 0.30 \| Likely false positive. Exclude unless rescued. \|
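
If you have only raw CS values (for example after recalibrating them yourself), the tier table can be applied with a few comparisons. A minimal sketch, assuming each tier's lower bound is inclusive; `tier_for` is a hypothetical helper, not a function exported by `svspr`.

```python
def tier_for(cs):
    """Map a confidence score in [0, 1] to the tier names in the table above."""
    if cs >= 0.85:
        return "high"
    if cs >= 0.50:
        return "moderate"
    if cs >= 0.30:
        return "warning"
    return "low"


# Invented scores for illustration.
tiers = [tier_for(cs) for cs in (0.92, 0.69, 0.41, 0.12)]
```

A CS of 0.69 lands in `moderate`, which agrees with the Quick start example output.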

	## Performance vs. comparable filters

	\| Model \| F1 \| AUROC \| Inputs needed \|
	\|--\|--\|--\|--\|
	\| SV-SPR (this) \| 0.9593 \| 0.9739 \| VCF + reference \|
	\| v13_parents (Manta-specific FULL, 23 feat) \| 0.9308 \| 0.9283 \| VCF + Manta-specific fields \|
	\| Jan 2026 bioRxiv (15 features) \| 0.957 \| — \| VCF + reference \|
	\| PostSV (2020) \| ~0.74 \| — \| VCF + caller features \|

	(Performance numbers are reported on the 226-trio Korean cohort under sample-LOSO cross-validation.)

	## Intended use

	- Post-VCF filtering / rescoring of short-read SV calls
	- Quality control in population-scale SV pipelines
	- Prioritising calls for long-read validation in clinical settings

	## Out of scope / limitations

	- Trained on GRCh38 only — performance on T2T-CHM13 not validated
	- Trained on a single Korean cohort (226 trios); external generalisation not yet quantified
	- Sequence-only by default; if your caller provides PR/SR/AD fields, supplying
	`total_alt_support` may improve recall but is not strictly required
	- Does not call SVs — feeds on already-called VCFs
	- Best for DEL/INS in the 100 bp–10 kb range. BND/DUP have lower true-positive base rates
	(DRAGEN Manta BND has a ~88% over-call rate)

	## Caveats and known issues

	- Sample-LOSO assumes parental independence. Korean founder effects may
	invalidate this; we have not measured pairwise IBD.
	- Sequence vs. depth contribution. Adding depth features (T01d +both)
	yields F1 0.9579, which is not significantly different from sequence-only
	(Wilcoxon p=0.0625, CIs overlap). Depth and sequence are partially redundant.
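
For context on the paired comparison above, the sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign assignments. It is illustrative only: the F1 values are invented, the number of pairs is an assumption, and `wilcoxon_exact` is a hypothetical helper rather than the analysis code used for T01d.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value for small paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied magnitudes share the average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null every sign pattern is equally likely; count those at
    # least as extreme as the observed statistic.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n


# Invented per-split F1 values for two feature sets (five pairs, assumed).
f1_seq = [0.961, 0.958, 0.960, 0.957, 0.962]
f1_both = [0.960, 0.9555, 0.9565, 0.9525, 0.9565]
p_value = wilcoxon_exact(f1_seq, f1_both)
```

With five pairs that all differ in the same direction, the exact two-sided p-value is 2/32 = 0.0625, the smallest value the test can return at that sample size. If the comparison was indeed run on five pairs (an assumption), a result of 0.0625 could not have reached 0.05 regardless of effect size, so the overlapping confidence intervals are the more informative evidence.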

	## Prior art (substantial overlap)

	This model occupies a niche close to several prior tools. Closest analogues:

	- Jan 2026 bioRxiv "Systematic Assessment of ML for SV Filtering" — RF on
	15 features, F1=0.957. The closest direct comparison.
	- DeepSVFilter (Liu 2020) — CNN, requires BAM.
	- Samplot-ML (Belyeu 2021, Genome Biology) — CNN, requires BAM.
	- Duphold (Pedersen 2019, GigaScience) — depth + GC-matched, requires BAM.
	- CADD-SV (Kleinert 2022, Genome Research) — RF on sequence annotations.
	- PostSV (Alzaid 2020), SV2 (Antaki 2018), GATK-SV (Collins 2020).

	Our differentiator: reference-only inputs (no BAM, no caller-specific
	fields). Applicable to legacy VCFs and federated cohorts where read-level
	access is restricted.

	## Training script

	See `training/T01d_unified_with_v13.py` in the source repo for the exact
	reproduction recipe.

	## Citation

	Manuscript in preparation. Provisional citation:

	```
	Kim, W. SV-SPR: Reference-only post-VCF rescoring for short-read structural
	variants. Preprint, 2026.
	```

	## Contact

	- Author: Woohun Kim (alex990713@gmail.com)
	- Lab: KAIST
	- Cohort: 226 Korean trios with paired Illumina DRAGEN + PacBio HiFi WGS