---
license: cc-by-4.0
language:
- en
tags:
- proteomics
- mass-spectrometry
- peptide-sequencing
- de-novo
- calibration
- fdr
---
## Winnow General Probability Calibrator
[**Winnow**](https://github.com/instadeepai/winnow) recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows.
This repository hosts a pretrained, general-purpose calibrator that maps raw [InstaNovo](https://github.com/instadeepai/instanovo) model confidences and complementary features (mass error, retention time, chimericity, beam features, Prosit features) to well-calibrated probabilities.
- Intended inputs: spectrum data and the corresponding MS/MS PSM results produced by InstaNovo.
- Outputs: calibrated per-PSM probabilities in `calibrated_confidence`.
### What’s inside
- `calibrator.pkl`: trained classifier
- `scaler.pkl`: feature standardiser
- `irt_predictor.pkl`: Prosit iRT regressor used by RT features
---
## How to use
### Python
```python
from pathlib import Path
from huggingface_hub import snapshot_download
from winnow.calibration.calibrator import ProbabilityCalibrator
from winnow.datasets.data_loaders import InstaNovoDatasetLoader
from winnow.scripts.main import filter_dataset
from winnow.fdr.nonparametric import NonParametricFDRControl
# 1) Download model files
general_model = Path("general_model")
snapshot_download(
    repo_id="InstaDeepAI/winnow-general-model",
    allow_patterns=["*.pkl"],
    repo_type="model",
    local_dir=general_model,
)
# 2) Load calibrator
calibrator = ProbabilityCalibrator.load(general_model)
# 3) Load your dataset (InstaNovo-style config)
dataset = InstaNovoDatasetLoader().load(
    data_path="path_to_spectrum_data.parquet",
    predictions_path="path_to_instanovo_predictions.csv",
)
dataset = filter_dataset(dataset) # standard Winnow filtering
# 4) Predict calibrated confidences
calibrator.predict(dataset) # adds dataset.metadata["calibrated_confidence"]
# 5) Optional: FDR control on calibrated confidence
fdr = NonParametricFDRControl()
fdr.fit(dataset.metadata["calibrated_confidence"])
cutoff = fdr.get_confidence_cutoff(0.05) # 5% FDR cutoff
dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
```
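To make step 5 more concrete, the sketch below shows one common way to turn calibrated probabilities into an FDR cutoff: rank PSMs by probability and treat the running mean of `1 - p` as the estimated FDR of the accepted set. This is an illustration under that assumption, not a copy of what `NonParametricFDRControl` does internally; the function name `fdr_cutoff` is hypothetical.

```python
def fdr_cutoff(probabilities, target_fdr):
    """Return the lowest probability cutoff whose estimated FDR is <= target_fdr.

    Assumes each probability is calibrated, i.e. it is the chance that the PSM
    is correct, so (1 - p) is its expected contribution to the error count.
    Returns None when no cutoff meets the target. Ties at the cutoff are not
    handled specially.
    """
    ranked = sorted(probabilities, reverse=True)  # most confident PSM first
    expected_errors = 0.0
    cutoff = None
    for k, p in enumerate(ranked, start=1):
        expected_errors += 1.0 - p
        if expected_errors / k <= target_fdr:
            cutoff = p  # the top-k PSMs still satisfy the target
        else:
            break  # the running mean of (1 - p) never decreases from here
    return cutoff


scores = [0.99, 0.95, 0.9, 0.6, 0.2]
print(fdr_cutoff(scores, 0.05))  # 0.95
```

In the example, accepting the top two PSMs gives an estimated FDR of about 0.03, while adding the third raises it above 0.05, so the cutoff is 0.95. The estimate is only as good as the calibration of the input probabilities.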
### CLI
```bash
# After `pip install winnow`
winnow predict \
  --data-source instanovo \
  --dataset-config-path config_with_dataset_paths.yaml \
  --model-folder general_model_folder \
  --method winnow \
  --fdr-threshold 0.05 \
  --confidence-column calibrated_confidence \
  --output-path outputs/winnow_predictions.csv
```
---
## Inputs and outputs
**Required columns for calibration:**
- Spectrum data (`*.parquet`)
  - `spectrum_id` (string): unique spectrum identifier
  - `sequence` (string): ground truth peptide sequence from database search (optional)
  - `retention_time` (float): retention time (seconds)
  - `precursor_mass` (float): mass of the precursor ion (from MS1)
  - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum
  - `intensity_array` (list[float]): intensity values of the MS2 spectrum
  - `precursor_charge` (int): charge of the precursor (from MS1)
- Beam predictions (`*_beams.csv`)
  - `spectrum_id` (string)
  - `sequence` (string): ground truth peptide sequence from database search (optional)
  - `preds` (string): top prediction, untokenised sequence
  - `preds_tokenised` (string): comma-separated tokens for the top prediction
  - `log_probs` (float): top prediction log probability
  - `preds_beam_k` (string): untokenised sequence for beam k (k≥0)
  - `log_probs_beam_k` (float)
  - `token_log_probs_k` (string/list-encoded): per-token log probabilities for beam k
**Output columns (added by Winnow's calibrator on `predict`):**
- `calibrated_confidence`: calibrated probability
- Optional (if requested): `psm_pep`, `psm_fdr`, `psm_qvalue`
- All input columns are retained in-place
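Before running calibration it can help to confirm that both input files carry the non-optional columns listed above. The sketch below is a minimal, dependency-free check; the helper names `missing_columns` and `raw_confidence` and the two constant sets are hypothetical and not part of Winnow. It also shows that a log probability such as `log_probs` maps back to a probability via the exponential, which is a mathematical identity rather than a statement about Winnow's internals.

```python
import math

# Non-optional column names copied from the lists above (per-beam columns omitted).
SPECTRUM_COLUMNS = {
    "spectrum_id", "retention_time", "precursor_mass",
    "mz_array", "intensity_array", "precursor_charge",
}
PREDICTION_COLUMNS = {"spectrum_id", "preds", "preds_tokenised", "log_probs"}


def missing_columns(available, required):
    """Return the required column names that are absent, in sorted order."""
    return sorted(set(required) - set(available))


def raw_confidence(log_prob):
    """Convert a log probability into a probability in (0, 1]."""
    return math.exp(log_prob)


# Example: a predictions table that lacks two required columns.
print(missing_columns(["spectrum_id", "preds"], PREDICTION_COLUMNS))
```

With pandas installed, the same check could be applied to real files by passing `pd.read_parquet(path).columns` or `pd.read_csv(path).columns` as `available`.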
---
## Training data
- The general model was trained on a pooled, labelled set spanning multiple public datasets to encourage cross-dataset generalisation:
  - HeLa single-shot (PXD044934)
  - *Candidatus* Scalindua brodae (PXD044934)
  - Wound exudates (PXD025748)
  - HepG2 (PXD019483)
  - Immunopeptidomics (PXD006939)
  - HeLa degradome (PXD044934)
  - Snake venoms (PXD036161)
- All default features were enabled for the training of this model.
- Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams.
---
## Citation
If you use `winnow` in your research, please cite our preprint: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952)
```bibtex
@article{mabona2025novopeptidesequencingrescoring,
title = {De novo peptide sequencing rescoring and FDR estimation with Winnow},
author = {Amandla Mabona and Jemma Daniel and Henrik Servais Janssen Knudsen and
Rachel Catzel and Kevin Michael Eloff and Erwin M. Schoof and Nicolas
Lopez Carranza and Timothy P. Jenkins and Jeroen Van Goey and
Konstantinos Kalogeropoulos},
year = {2025},
eprint = {2509.24952},
archivePrefix = {arXiv},
primaryClass = {q-bio.QM},
url = {https://arxiv.org/abs/2509.24952},
}
```
If you use this pretrained calibrator, please cite:
```bibtex
@misc{instadeep_ltd_2025,
author = { InstaDeep Ltd },
title = { winnow-general-model (Revision d70bef6) },
year = 2025,
url = { https://huggingface.co/InstaDeepAI/winnow-general-model },
doi = { 10.57967/hf/6611 },
publisher = { Hugging Face }
}
```
If you use the `InstaNovo` model to generate predictions, please also cite: [InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments](https://doi.org/10.1038/s42256-025-01019-5)
```bibtex
@article{eloff_kalogeropoulos_2025_instanovo,
title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
proteomics experiments},
author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
year = 2025,
month = {Mar},
day = 31,
journal = {Nature Machine Intelligence},
doi = {10.1038/s42256-025-01019-5},
issn = {2522-5839},
url = {https://doi.org/10.1038/s42256-025-01019-5}
}
```
## Contact
For issues with this pretrained model or its usage in Winnow, please open an issue on the Winnow GitHub: https://github.com/instadeepai/winnow