---
license: mit
library_name: sklearn
tags:
- genomics
- structural-variants
- short-tandem-repeats
- variant-calling
- confidence-calibration
- random-forest
pipeline_tag: tabular-classification
---

# SVSTR-Score (v1.0) — long-read-guided confidence scoring for short-read SV/STR calls

Two **RandomForest + isotonic calibrator** models, one per variant class, assign a
**calibrated confidence score** `CS ∈ [0,1]` and one of four confidence tiers to
each short-read **structural-variant (SV)** or **short-tandem-repeat (STR)** call.
The long-read genotype is used **only to build the training label**, so inference
needs short-read caller output only.

`CS = isotonic_calibrator( RandomForest.predict_proba(X)[:, 1] )`, interpreted as
P(the short-read call is concordant with the long-read truth).
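
The two-stage formula above can be sketched on synthetic data. This is a minimal illustration, not the released training code: it assumes the calibrator behaves like scikit-learn's `IsotonicRegression`, and the feature matrix, labels, split, and the helper name `confidence_score` are all made up for the example. How the published calibrator was actually fitted is not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix and concordance labels (not real data).
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

# Hold out half the rows so the calibrator is fitted on data the forest has not seen.
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Fit a monotone map from raw forest probability to observed concordance rate.
raw_cal = forest.predict_proba(X_cal)[:, 1]
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cal, y_cal)


def confidence_score(model, iso, features):
    """CS = iso(model.predict_proba(features)[:, 1]); hypothetical helper name."""
    raw = model.predict_proba(features)[:, 1]
    return iso.predict(raw)


cs = confidence_score(forest, calibrator, X_cal)
```

Because the isotonic map is non-decreasing, it preserves the forest's ranking of calls (up to ties) and only rescales the probabilities.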

| | SV | STR |
|---|---|---|
| weights | `sv_model.joblib` + `sv_calibrator.joblib` | `str_model.joblib` + `str_calibrator.joblib` |
| features | 35 (`feature_builder.py`) | 21 |
| short-read caller | Manta | ExpansionHunter |
| long-read truth (label only) | sawfish | TRGT |
| training cohort | 208 HPRC paired genomes | 208 HPRC paired genomes |

## Tiers
`HIGH CS≥0.70` · `MODERATE 0.50–0.70` · `WARNING 0.30–0.50` · `LOW <0.30`.
Tiers are buckets of the calibrated CS (no heuristic overrides). **HIGH** is the
candidate-triage tier.
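
The bucketing can be sketched as a plain threshold function. The function name `assign_tier` is hypothetical, and the handling of the shared 0.50 boundary is an assumption: the thresholds above fix 0.70 as HIGH and place 0.30 outside LOW, and this sketch treats 0.50 the same way (as MODERATE).

```python
def assign_tier(cs: float) -> str:
    """Map a calibrated confidence score in [0, 1] to a tier name."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs!r}")
    if cs >= 0.70:
        return "HIGH"
    if cs >= 0.50:  # assumption: lower bound inclusive, as for HIGH
        return "MODERATE"
    if cs >= 0.30:
        return "WARNING"
    return "LOW"


tiers = [assign_tier(cs) for cs in (0.05, 0.30, 0.55, 0.70, 0.99)]
```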

## Intended use
Triage / down-weight *emitted* short-read SV & STR calls by their probability of
matching long-read truth, **without long-read sequencing every sample**. Not a
variant caller; does not recover missed variants; does not replace long-read
sequencing for complete discovery.

## Performance
**Internal — 5-fold GroupKFold by sample (out-of-fold), 208 HPRC genomes**

| | AUROC | AUPRC | per-sample AUROC median |
|---|---|---|---|
| SV | 0.950 | 0.951 | 0.981 |
| STR | 0.834 | 0.886 | 0.835 |
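
Grouping folds by sample means that no genome contributes calls to both the training and the test side of a fold, so each out-of-fold score comes from a model that never saw that sample. A minimal sketch with scikit-learn's `GroupKFold` on toy data follows; the sample IDs, features, labels, and forest settings are illustrative assumptions, not the released pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Hypothetical toy layout: 10 samples with 4 calls each (not real data).
sample_ids = np.repeat([f"S{i:02d}" for i in range(10)], 4)
y = np.arange(len(sample_ids)) % 2  # alternating labels so every fold sees both classes
X = (y + rng.normal(scale=0.5, size=len(y))).reshape(-1, 1)

oof = np.full(len(y), np.nan)  # out-of-fold probability for each call
splits = list(GroupKFold(n_splits=5).split(X, y, groups=sample_ids))
for train_idx, test_idx in splits:
    model = RandomForestClassifier(n_estimators=25, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Each call is scored by a model trained without any call from its sample.
    oof[test_idx] = model.predict_proba(X[test_idx])[:, 1]
```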

**External — 295 ASD genomes, applied unchanged (no retraining)**

| | AUROC | observed concordance LOW / WARNING / MODERATE / HIGH |
|---|---|---|
| SV | 0.891 | 12% / 35% / 50% / **86%** |
| STR | 0.831 | 16% / 40% / 60% / **85%** |

Observed concordance rises monotonically from LOW to HIGH on the external cohort,
so the calibrated tiers remain informative there; isotonic calibration also lowers
the Brier score (SV 0.077→0.074, STR 0.167→0.159; lower is better).
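
For readers unfamiliar with the metric, the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome. The sketch below uses made-up toy numbers to show why replacing over-confident probabilities with observed rates lowers it, and checks tier monotonicity on the external concordance values from the table above. The helper names are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Toy example (not real data): three calls scored 0.9, of which only two are concordant.
labels = [1, 1, 0, 0]
raw = [0.9, 0.9, 0.9, 0.1]
# Replace each raw score by the observed concordance rate among calls sharing that score.
calibrated = [2 / 3, 2 / 3, 2 / 3, 0.0]

raw_brier = brier_score(raw, labels)
cal_brier = brier_score(calibrated, labels)

# External-cohort concordance per tier (LOW, WARNING, MODERATE, HIGH), from the table above.
external_concordance = {
    "SV": [0.12, 0.35, 0.50, 0.86],
    "STR": [0.16, 0.40, 0.60, 0.85],
}
tiers_monotone = {k: is_strictly_increasing(v) for k, v in external_concordance.items()}
```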

## Usage
```bash
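# Pinned scikit-learn: pickled models may not load correctly under other versions.
pip install -U huggingface_hub scikit-learn==1.7.1
# Download the SV weights, config, and scripts.
hf download khyeom/SVSTR-Score sv_model.joblib sv_calibrator.joblib sv_config.json score_svstr.py feature_builder.py --local-dir svstr/
# Build features from a single-sample VCF, then score them: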
python svstr/feature_builder.py --vcf sample.manta.vcf.gz --caller manta \
  --fasta GRCh38.fa --giab-dir giab_prepared --repeatmasker rmsk_class.bed.gz -o feats.tsv
python svstr/score_svstr.py --variant sv --model-dir svstr --features feats.tsv --out scored.tsv
```
Code, training pipeline and reproduction scripts: https://github.com/khyeom0608/SVSTR-Score

## Limitations
- Relies on **caller support / breakpoint-confidence fields** (PR/SR, CIPOS/CIEND,
  VAF, GQ, depth). On **merged or heavily filtered call sets that drop these**,
  scores deflate and tiers are unreliable (rank-discrimination degrades only
  moderately, but calibration breaks).
- Strongest for **DEL/INS**; DUP/BND are mostly down-weighted/triaged.
- STR scoring is **bound to the caller's genotyped locus set**.
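
Given the first limitation, it can help to check before scoring whether a call set still carries the support fields. The sketch below is a hypothetical pre-flight check, not part of SVSTR-Score: it reports the fraction of VCF records that contain each requested INFO or FORMAT key. The key lists in the example assume Manta's usual layout (CIPOS/CIEND in INFO; PR, SR, GQ in FORMAT); which fields `feature_builder.py` actually reads is not specified here.

```python
from collections import Counter


def field_coverage(vcf_lines, info_keys, format_keys):
    """Return {key: fraction of records carrying it} for INFO and FORMAT keys."""
    counts = Counter()
    n_records = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        n_records += 1
        # INFO entries are "KEY=value" or bare flags, separated by ';'.
        info_present = {entry.split("=", 1)[0] for entry in cols[7].split(";")}
        # FORMAT is the optional ninth column, with keys separated by ':'.
        format_present = set(cols[8].split(":")) if len(cols) > 8 else set()
        for key in info_keys:
            counts[key] += key in info_present
        for key in format_keys:
            counts[key] += key in format_present
    if n_records == 0:
        return {}
    return {key: counts[key] / n_records for key in (*info_keys, *format_keys)}


# Toy records (not real data): one caller-style record and one stripped, merged-style record.
caller_style = [
    "##fileformat=VCFv4.1",
    "chr1\t1000\tDEL_1\tN\t<DEL>\t.\tPASS\tEND=2000;SVTYPE=DEL;CIPOS=-10,10;CIEND=-5,5"
    "\tGT:FT:GQ:PL:PR:SR\t0/1:PASS:50:100,0,200:10,5:8,4",
]
merged_style = ["chr1\t1000\tmerged_1\tN\t<DEL>\t.\tPASS\tEND=2000;SVTYPE=DEL\tGT\t0/1"]

full = field_coverage(caller_style, ("CIPOS", "CIEND"), ("PR", "SR", "GQ"))
stripped = field_coverage(merged_style, ("CIPOS", "CIEND"), ("PR", "SR", "GQ"))
```

For a bgzipped file, the lines can be read with `gzip.open(path, "rt")`. Low coverage is not by itself an error, since callers may omit some fields for individual records, but coverage near zero across a whole file suggests the fields were dropped upstream.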

## Citation
Kim W\*, Yeom K\*, et al. *SVSTR-Score* (manuscript in preparation). Licence: MIT.