metadata
license: mit
library_name: sklearn
tags:
  - genomics
  - structural-variants
  - short-tandem-repeats
  - variant-calling
  - confidence-calibration
  - random-forest
pipeline_tag: tabular-classification

SVSTR-Score (v1.0) — long-read-guided confidence scoring for short-read SV/STR calls

SVSTR-Score provides two models, one per variant class, each a RandomForest classifier followed by an isotonic calibrator. Each model assigns a calibrated confidence score CS ∈ [0,1] and one of four confidence tiers to every short-read structural-variant (SV) or short-tandem-repeat (STR) call. The long-read genotype is used only to build the training label, so inference needs only short-read caller output.

CS = isotonic_calibrator(RandomForest.predict_proba(X)[:, 1]), an estimate of P(the short-read call is concordant with the long-read truth).
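The two-stage formula above can be sketched with scikit-learn on synthetic data. This is a minimal illustration, not the released training code: the feature matrix, labels, forest size, and the choice to fit the calibrator on out-of-fold probabilities are all assumptions, and `confidence_score` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table: 400 calls x 5 features (hypothetical).
X = rng.normal(size=(400, 5))
# Label 1 = call concordant with long-read truth; here driven by feature 0 plus noise.
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold raw probabilities, so the calibrator never sees scores the
# forest produced for its own training rows (an assumed design choice).
raw_oof = cross_val_predict(rf, X, y, cv=5, method="predict_proba")[:, 1]

# Monotone map from raw probability to calibrated probability, clipped to [0, 1].
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_oof, y)

# Final forest trained on all rows, used at inference time.
rf.fit(X, y)


def confidence_score(features):
    """CS = isotonic_calibrator(RandomForest.predict_proba(X)[:, 1])."""
    raw = rf.predict_proba(features)[:, 1]
    return calibrator.predict(raw)


cs = confidence_score(X[:10])
print(np.round(cs, 3))
```

Because isotonic regression is monotone, it changes the probability values but does not reverse the ordering the forest produced; calls can become tied, so ranking metrics such as AUROC shift only slightly, if at all.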

| | SV | STR |
|---|---|---|
| weights | sv_model.joblib + sv_calibrator.joblib | str_model.joblib + str_calibrator.joblib |
| features | 35 (feature_builder.py) | 21 |
| short-read caller | Manta | ExpansionHunter |
| long-read truth (label only) | sawfish | TRGT |
| training cohort | 208 HPRC paired genomes | 208 HPRC |

Tiers

HIGH CS≥0.70 · MODERATE 0.50–0.70 · WARNING 0.30–0.50 · LOW <0.30. Tiers are buckets of the calibrated CS (no heuristic overrides). HIGH is the candidate-triage tier.
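The tier cut-offs can be written as a small lookup. This is a sketch under one assumption: each lower bound is treated as inclusive, which matches "HIGH CS≥0.70" and "LOW <0.30" but is a guess for a score of exactly 0.50. `assign_tier` is a hypothetical name, not a function from score_svstr.py.

```python
def assign_tier(cs: float) -> str:
    """Map a calibrated confidence score to one of the four tiers."""
    # Also rejects NaN, because every comparison with NaN is False.
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs!r}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:  # assumed inclusive lower bound
        return "MODERATE"
    if cs >= 0.30:  # assumed inclusive lower bound
        return "WARNING"
    return "LOW"


for score in (0.95, 0.70, 0.50, 0.31, 0.05):
    print(score, assign_tier(score))
```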

Intended use

Triage or down-weight emitted short-read SV and STR calls by their estimated probability of matching long-read truth, without long-read sequencing every sample. SVSTR-Score is not a variant caller: it does not recover missed variants and does not replace long-read sequencing for complete discovery.

Performance

Internal — 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes

| | AUROC | AUPRC | per-sample AUROC median |
|---|---|---|---|
| SV | 0.950 | 0.951 | 0.981 |
| STR | 0.834 | 0.886 | 0.835 |
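Grouping the folds by sample means every call from one genome lands in the same held-out fold, so a model is never evaluated on calls from a genome it trained on. A minimal sketch with scikit-learn's `GroupKFold` on synthetic data follows; the sample counts, features, and labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Hypothetical cohort: 20 samples with 30 calls each.
n_samples, calls_per_sample = 20, 30
groups = np.repeat(np.arange(n_samples), calls_per_sample)  # sample ID per call
X = rng.normal(size=(groups.size, 4))
y = rng.integers(0, 2, size=groups.size)

# Record which held-out fold each call falls into.
fold_of_row = np.full(groups.size, -1)
splitter = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups)):
    # No sample contributes calls to both sides of a split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    fold_of_row[test_idx] = fold

print(np.bincount(fold_of_row))  # number of held-out calls per fold
```

Out-of-fold predictions are then the scores each call receives from the model trained on the folds that exclude its sample.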

External — 295 ASD genomes, applied unchanged (no retraining)

| | AUROC | observed concordance LOW / WARN / MOD / HIGH |
|---|---|---|
| SV | 0.891 | 12% / 35% / 50% / 86% |
| STR | 0.831 | 16% / 40% / 60% / 85% |

Observed concordance rises monotonically from LOW to HIGH on the external cohort, and isotonic calibration lowers the Brier score (SV 0.077→0.074, STR 0.167→0.159).
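Both quantities reported here are simple to compute from scored calls with known labels. The sketch below uses toy numbers only: the Brier score is the mean squared gap between predicted probability and the 0/1 outcome, and observed concordance is the fraction of concordant calls within each tier. The function names are hypothetical, and the inclusive lower tier bounds are an assumption.

```python
from collections import defaultdict


def brier_score(probs, labels):
    """Mean squared difference between probability and 0/1 outcome (lower is better)."""
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def tier_of(cs):
    # Cut-offs from the Tiers section; lower bounds assumed inclusive.
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:
        return "MODERATE"
    if cs >= 0.30:
        return "WARNING"
    return "LOW"


def concordance_by_tier(scores, labels):
    """Fraction of calls with label 1 inside each tier that contains calls."""
    hits, totals = defaultdict(int), defaultdict(int)
    for cs, y in zip(scores, labels):
        t = tier_of(cs)
        totals[t] += 1
        hits[t] += y
    return {t: hits[t] / totals[t] for t in totals}


# Toy data, not taken from either cohort.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.10, 0.20, 0.05]

print(brier_score(scores, labels))
print(concordance_by_tier(scores, labels))
```

The Brier score rewards both ranking and calibration, so a small drop after isotonic calibration indicates that the score values moved closer to observed frequencies.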

Usage

```bash
pip install -U huggingface_hub scikit-learn==1.7.1
hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
# features from a single-sample VCF, then score:
python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
  --fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
```

Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score

Limitations

  • Relies on the caller's read-support and breakpoint-confidence fields (PR/SR, CIPOS/CIEND, VAF, GQ, depth). On merged or heavily filtered call sets that drop these fields, scores deflate and tiers become unreliable: rank discrimination degrades only moderately, but calibration breaks.
  • Strongest for DEL/INS; DUP/BND are mostly down-weighted/triaged.
  • STR scoring is bound to the caller's genotyped locus set.
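Before scoring a merged or filtered SV call set, it may help to check whether the VCF header still declares the fields named in the first limitation. The sketch below is a hypothetical pre-check, not part of feature_builder.py: it assumes the usual Manta layout (CIPOS/CIEND in INFO; PR, SR, and GQ in FORMAT) and ignores VAF and depth, since this card does not say which tags they are derived from. A declared field can still be absent from individual records, so a clean result is necessary rather than sufficient.

```python
import re

# Assumed placement of the fields, following Manta's usual output.
EXPECTED = {"INFO": ("CIPOS", "CIEND"), "FORMAT": ("PR", "SR", "GQ")}
_DECLARATION = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def undeclared_fields(header_lines):
    """Return expected fields that the VCF header does not declare."""
    declared = {"INFO": set(), "FORMAT": set()}
    for line in header_lines:
        match = _DECLARATION.match(line)
        if match:
            declared[match.group(1)].add(match.group(2))
    return [
        f"{kind}/{name}"
        for kind, names in EXPECTED.items()
        for name in names
        if name not in declared[kind]
    ]


# Toy header from a hypothetical merged call set that dropped SR and CIEND.
header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="CI around POS">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">',
    '##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Paired-read support">',
]
print(undeclared_fields(header))
```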

Citation

Kim W*, Yeom K*, et al. SVSTR-Score (manuscript in preparation). Licence: MIT.