	---
	license: mit
	library_name: sklearn
	tags:
	- genomics
	- structural-variants
	- short-tandem-repeats
	- variant-calling
	- confidence-calibration
	- random-forest
	pipeline_tag: tabular-classification
	---

	# SVSTR-Score (v1.0) — long-read-guided confidence scoring for short-read SV/STR calls

	Two models, one for structural variants (SV) and one for short tandem repeats
	(STR), each a RandomForest classifier followed by an isotonic calibrator. Each
	model assigns a calibrated confidence score `CS ∈ [0,1]` and one of four
	confidence tiers to every short-read SV or STR call. The long-read genotype is
	used only to build the training label, so inference needs only short-read
	caller output.

	`CS = isotonic_calibrator( RandomForest.predict_proba(X)[:, 1] )` = P(the
	short-read call is concordant with the long-read truth).
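
The calibration step in the formula above is a monotone map from the raw forest probability to `CS`. The released calibrators are loaded from `.joblib` files; the block below is only a from-scratch illustration of isotonic calibration (pool adjacent violators) on made-up toy data, not the repository's code. The function names `fit_isotonic` and `calibrate` are hypothetical.

```python
def fit_isotonic(raw_scores, labels):
    """Pool adjacent violators: return blocks (x_lo, x_hi, value) with non-decreasing value."""
    blocks = []  # each entry: [label_sum, count, x_lo, x_hi]
    for x, y in sorted(zip(raw_scores, labels)):
        blocks.append([float(y), 1, x, x])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count, _, x_hi = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][3] = x_hi
    return [(lo, hi, s / n) for s, n, lo, hi in blocks]


def calibrate(raw, blocks):
    """Map one raw probability to a calibrated score, interpolating linearly between blocks."""
    if raw <= blocks[0][0]:
        return blocks[0][2]
    if raw >= blocks[-1][1]:
        return blocks[-1][2]
    for (lo, hi, val), (next_lo, _, next_val) in zip(blocks, blocks[1:]):
        if lo <= raw <= hi:
            return val
        if hi < raw < next_lo:
            t = (raw - hi) / (next_lo - hi)
            return val + t * (next_val - val)
    return blocks[-1][2]  # raw lies inside the last block


# Toy data (assumption): raw forest probabilities and 0/1 concordance labels.
toy_raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
blocks = fit_isotonic(toy_raw, toy_labels)
toy_cs = [calibrate(r, blocks) for r in toy_raw]
```

The fitted map never decreases, so the ranking of calls by raw probability is preserved up to ties while the values are pulled toward observed concordance rates.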

	| | SV | STR |
	|---|---|---|
	| weights | `sv_model.joblib` + `sv_calibrator.joblib` | `str_model.joblib` + `str_calibrator.joblib` |
	| features | 35 (`feature_builder.py`) | 21 |
	| short-read caller | Manta | ExpansionHunter |
	| long-read truth (label only) | sawfish | TRGT |
	| training cohort | 208 HPRC paired genomes | 208 HPRC |

	## Tiers
	`HIGH CS≥0.70` · `MODERATE 0.50–0.70` · `WARNING 0.30–0.50` · `LOW <0.30`.
	Tiers are buckets of the calibrated CS (no heuristic overrides). HIGH is the
	candidate-triage tier.
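
The tier cut-offs can be sketched as a small lookup. This is a minimal illustration, assuming each tier includes its lower bound (so `CS = 0.50` is MODERATE and `CS = 0.30` is WARNING); the card fixes only the `≥0.70` and `<0.30` ends explicitly, and `assign_tier` is a hypothetical name.

```python
def assign_tier(cs):
    """Bucket a calibrated score into one of the four tiers (lower bound inclusive, by assumption)."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:
        return "MODERATE"
    if cs >= 0.30:
        return "WARNING"
    return "LOW"


example_tiers = [assign_tier(cs) for cs in (0.95, 0.70, 0.55, 0.30, 0.10)]
```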

	## Intended use
	Triage / down-weight emitted short-read SV & STR calls by their probability of
	matching long-read truth, without long-read sequencing every sample. Not a
	variant caller; does not recover missed variants; does not replace long-read
	sequencing for complete discovery.

	## Performance
	Internal — 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes

	| Model | AUROC | AUPRC | median per-sample AUROC |
	|---|---|---|---|
	| SV | 0.950 | 0.951 | 0.981 |
	| STR | 0.834 | 0.886 | 0.835 |
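
Grouping folds by sample means every call from one genome lands in the same fold, so no genome contributes to both training and evaluation within a split. The block below is a minimal pure-Python sketch of that idea (largest samples first, each into the currently lightest fold); it is not the training pipeline's code, and the sample IDs are made up.

```python
from collections import Counter


def group_folds(sample_ids, n_folds=5):
    """Return a fold index per row such that all rows of a sample share one fold."""
    sizes = Counter(sample_ids)
    fold_load = [0] * n_folds
    fold_of_sample = {}
    # Place the largest samples first, each into the fold with the fewest rows so far.
    for sample, n_rows in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = fold_load.index(min(fold_load))
        fold_of_sample[sample] = lightest
        fold_load[lightest] += n_rows
    return [fold_of_sample[s] for s in sample_ids]


# Toy call table (assumption): one entry per variant call, labelled by sample.
sample_ids = ["S1"] * 3 + ["S2"] * 2 + ["S3"] * 2 + ["S4", "S5", "S6"]
folds = group_folds(sample_ids, n_folds=3)

for k in range(3):
    test_rows = [i for i, f in enumerate(folds) if f == k]
    train_rows = [i for i, f in enumerate(folds) if f != k]
    # A model would be fit on train_rows and scored on test_rows (out-of-fold).
```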

	External — 295 ASD genomes, applied unchanged (no retraining)

	| Model | AUROC | observed concordance (LOW / WARNING / MODERATE / HIGH) |
	|---|---|---|
	| SV | 0.891 | 12% / 35% / 50% / 86% |
	| STR | 0.831 | 16% / 40% / 60% / 85% |

	On the external cohort, observed concordance rises monotonically from LOW to
	HIGH for both models, so the calibrated tiers remain meaningful. Isotonic
	calibration lowers the Brier score (SV 0.077→0.074, STR 0.167→0.159).
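
The two summary quantities reported here, the Brier score and the observed concordance per tier, can be computed as in the sketch below. The toy scores and labels are invented for illustration, the helper names are hypothetical, and the tier edges assume lower-bound-inclusive buckets.

```python
from bisect import bisect_right

TIER_NAMES = ["LOW", "WARNING", "MODERATE", "HIGH"]
TIER_EDGES = [0.30, 0.50, 0.70]  # assumption: each tier includes its lower edge


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def concordance_by_tier(cs_values, labels):
    """Fraction of concordant calls (label 1) within each tier that has at least one call."""
    counts = {}
    for cs, y in zip(cs_values, labels):
        tier = TIER_NAMES[bisect_right(TIER_EDGES, cs)]
        hits, total = counts.get(tier, (0, 0))
        counts[tier] = (hits + y, total + 1)
    return {tier: hits / total for tier, (hits, total) in counts.items()}


# Toy data (assumption): calibrated scores and long-read concordance labels.
toy_cs = [0.10, 0.20, 0.40, 0.45, 0.60, 0.65, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]

toy_brier = brier_score(toy_cs, toy_labels)
toy_concordance = concordance_by_tier(toy_cs, toy_labels)
```

A lower Brier score means the probabilities sit closer to the observed outcomes; per-tier concordance shows whether higher tiers really contain a larger share of concordant calls.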

	## Usage
	```bash
	# Install the Hub client and the pinned scikit-learn version
	pip install -U huggingface_hub scikit-learn==1.7.1
	# Download the SV weights, config, and scripts into svstr/
	hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
	# Build features from a single-sample Manta VCF
	python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
	    --fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
	# Score the features with the SV model
	python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
	```
	Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score
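
Once `scored.tsv` exists, downstream triage can be as simple as keeping calls at or above the HIGH cut-off. The sketch below uses only the standard library. The column names `variant_id` and `CS` are assumptions made for illustration, since the card does not document the output header, so check the real file before reusing this.

```python
import csv
import io


def select_high_confidence(tsv_handle, cs_column="CS", threshold=0.70):
    """Return the rows of a tab-separated score table whose CS meets the threshold."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if float(row[cs_column]) >= threshold]


# Toy stand-in for scored.tsv (hypothetical columns and values).
toy_scored = "variant_id\tCS\nsv1\t0.91\nsv2\t0.42\nsv3\t0.70\n"
kept = select_high_confidence(io.StringIO(toy_scored))
kept_ids = [row["variant_id"] for row in kept]

# With a real file: with open("scored.tsv", newline="") as fh: kept = select_high_confidence(fh)
```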

	## Limitations
	- Relies on caller support and breakpoint-confidence fields (PR/SR,
	CIPOS/CIEND, VAF, GQ, depth). On merged or heavily filtered call sets that
	drop these fields, scores are deflated and tiers become unreliable: rank
	discrimination degrades only moderately, but calibration breaks.
	- Strongest for DEL/INS; DUP/BND are mostly down-weighted/triaged.
	- STR scoring is bound to the caller's genotyped locus set.
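
Given the first limitation, a quick pre-flight check on a VCF header can flag call sets that have lost the fields the SV model depends on. This is a minimal sketch, assuming Manta's usual placement of `CIPOS`/`CIEND` in INFO and `PR`/`SR`/`GQ` in FORMAT; VAF and depth are left out because the card does not name their tags, and `missing_fields` is a hypothetical helper, not part of the repository.

```python
import re

# Assumed tag placement for Manta output; adjust for other callers.
REQUIRED = {"INFO": {"CIPOS", "CIEND"}, "FORMAT": {"PR", "SR", "GQ"}}
HEADER_ID = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def missing_fields(header_lines, required=REQUIRED):
    """Return required tags that the VCF header does not declare, keyed by INFO/FORMAT."""
    declared = {"INFO": set(), "FORMAT": set()}
    for line in header_lines:
        match = HEADER_ID.match(line)
        if match:
            declared[match.group(1)].add(match.group(2))
    gaps = {}
    for kind, ids in required.items():
        absent = sorted(ids - declared[kind])
        if absent:
            gaps[kind] = absent
    return gaps


# Toy header (assumption): a filtered call set that kept only some fields.
toy_header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">',
    '##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Spanning paired-read support">',
]
gaps = missing_fields(toy_header)
```

A header declaration does not guarantee that every record carries the field, so an empty result is necessary but not sufficient for reliable tiers.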

	## Citation
	Kim W, Yeom K, et al. SVSTR-Score (manuscript in preparation). Licence: MIT.