| --- |
| license: cc-by-4.0 |
| language: |
| - en |
| tags: |
| - proteomics |
| - mass-spectrometry |
| - peptide-sequencing |
| - de-novo |
| - calibration |
| - fdr |
| --- |
| |
| ## Winnow HeLa Single Shot Probability Calibrator |
|
|
| [**Winnow**](https://github.com/instadeepai/winnow) recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows. |
| This repository contains the calibrator trained on HeLa Single Shot data as referenced in our paper: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952). |
|
|
| - Intended inputs: spectrum data and the corresponding MS/MS PSM results produced by [InstaNovo](https://github.com/instadeepai/instanovo). |
| - Outputs: calibrated per-PSM probabilities in `calibrated_confidence`. |
|
|
| ### What’s inside |
| - `calibrator.pkl`: trained classifier |
| - `scaler.pkl`: feature standardiser |
| - `irt_predictor.pkl`: Prosit iRT regressor used by RT features |
|
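Before loading the calibrator, it can help to confirm that a download contains all three files listed above. The following is a minimal sketch using only the standard library; the helper name `missing_model_files` is hypothetical and not part of Winnow.

```python
from pathlib import Path

# File names as listed in "What's inside".
EXPECTED_FILES = ("calibrator.pkl", "scaler.pkl", "irt_predictor.pkl")


def missing_model_files(model_dir):
    """Return the expected model files that are absent from model_dir.

    Hypothetical helper for illustration; Winnow does not provide it.
    """
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
```

An empty list means the folder holds every file named in this section.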
|
| --- |
|
|
| ## How to use |
|
|
| ### Python |
| ```python |
| from pathlib import Path |
| from huggingface_hub import snapshot_download |
| from winnow.calibration.calibrator import ProbabilityCalibrator |
| from winnow.datasets.data_loaders import InstaNovoDatasetLoader |
| from winnow.scripts.main import filter_dataset |
| from winnow.fdr.nonparametric import NonParametricFDRControl |
| |
| # 1) Download model files |
| helaqc_model = Path("helaqc_model") |
| snapshot_download( |
| repo_id="InstaDeepAI/winnow-helaqc-model", |
| allow_patterns=["*.pkl"], |
| repo_type="model", |
| local_dir=helaqc_model, |
| ) |
| |
| # 2) Load calibrator |
| calibrator = ProbabilityCalibrator.load(helaqc_model) |
| |
| # 3) Load your dataset (InstaNovo-style config) |
| dataset = InstaNovoDatasetLoader().load( |
| data_path="path_to_spectrum_data.parquet", |
| predictions_path="path_to_instanovo_predictions.csv", |
| ) |
| dataset = filter_dataset(dataset) # standard Winnow filtering |
| |
| # 4) Predict calibrated confidences |
| calibrator.predict(dataset) # adds dataset.metadata["calibrated_confidence"] |
| |
| # 5) Optional: FDR control on calibrated confidence |
| fdr = NonParametricFDRControl() |
| fdr.fit(dataset.metadata["calibrated_confidence"]) |
| cutoff = fdr.get_confidence_cutoff(0.05) # 5% FDR cutoff |
| dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff |
| ``` |
|
|
| ### CLI |
| ```bash |
| # After `pip install winnow` |
| winnow predict \ |
| --data-source instanovo \ |
| --dataset-config-path config_with_dataset_paths.yaml \ |
| --model-folder helaqc_model \ |
| --method winnow \ |
| --fdr-threshold 0.05 \ |
| --confidence-column calibrated_confidence \ |
| --output-path outputs/winnow_predictions.csv |
| ``` |
|
|
| --- |
|
|
| ## Inputs and outputs |
| **Expected columns for calibration (required unless marked optional):** |
| - Spectrum data (*.parquet) |
|   - `spectrum_id` (string): unique spectrum identifier |
|   - `sequence` (string): ground truth peptide sequence from database search (optional) |
|   - `retention_time` (float): retention time (seconds) |
|   - `precursor_mass` (float): mass of the precursor ion (from MS1) |
|   - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum |
|   - `intensity_array` (list[float]): intensity values of the MS2 spectrum |
|   - `precursor_charge` (int): charge of the precursor (from MS1) |
| |
| - Beam predictions (*_beams.csv) |
|   - `spectrum_id` (string): unique spectrum identifier |
|   - `sequence` (string): ground truth peptide sequence from database search (optional) |
|   - `preds` (string): top prediction, untokenised sequence |
|   - `preds_tokenised` (string): comma-separated tokens for the top prediction |
|   - `log_probs` (float): top prediction log probability |
|   - `preds_beam_k` (string): untokenised sequence for beam k (k≥0) |
|   - `log_probs_beam_k` (float): log probability for beam k |
|   - `token_log_probs_k` (string/list-encoded): per-token log probabilities for beam k |
|
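A quick pre-flight check on column names can catch malformed inputs before calibration. The sketch below uses only the standard library. The helper names are hypothetical, the required sets mirror the lists above with the optional `sequence` column and the per-beam `*_k` columns left out, and `log_probs` is assumed to be a natural-log probability.

```python
import math

# Columns from the lists above, excluding the optional `sequence`
# column and the per-beam `*_k` columns.
SPECTRUM_COLUMNS = {
    "spectrum_id", "retention_time", "precursor_mass",
    "mz_array", "intensity_array", "precursor_charge",
}
PREDICTION_COLUMNS = {"spectrum_id", "preds", "preds_tokenised", "log_probs"}


def missing_columns(present, required):
    """Return the required column names that are not in `present`, sorted."""
    return sorted(set(required) - set(present))


def split_tokens(preds_tokenised):
    """Split a comma-separated `preds_tokenised` string into a token list."""
    return [tok.strip() for tok in preds_tokenised.split(",") if tok.strip()]


def raw_confidence(log_prob):
    """Convert a log probability to a probability (assumes natural log)."""
    return math.exp(log_prob)
```

For example, `missing_columns(df.columns, PREDICTION_COLUMNS)` on a loaded predictions table returns an empty list when every listed column is present.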
|
| **Output columns (added by Winnow's calibrator on `predict`):** |
| - `calibrated_confidence`: calibrated probability |
| - Optional (if requested): `psm_pep`, `psm_fdr`, `psm_qvalue` |
| - All input columns are retained unchanged |
|
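To make the optional columns more concrete, here is a minimal sketch of one common way such quantities can be derived from calibrated probabilities: treat `1 - calibrated_confidence` as a per-PSM posterior error probability, estimate the FDR of the top-ranked PSMs as their mean error probability, and take a running minimum from the lowest-ranked PSM upward to obtain q-values. This is an illustration under those assumptions, not Winnow's implementation, and the function name is hypothetical.

```python
def pep_fdr_qvalue(confidences):
    """Return (pep, fdr, qvalue) lists aligned with the input order.

    Illustrative only: assumes each confidence is a calibrated
    probability that the PSM is correct.
    """
    n = len(confidences)
    pep = [1.0 - c for c in confidences]

    # Rank PSMs from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    # FDR when accepting the top-ranked PSMs: mean error probability so far.
    fdr = [0.0] * n
    running_error = 0.0
    for rank, i in enumerate(order, start=1):
        running_error += pep[i]
        fdr[i] = running_error / rank

    # q-value: smallest FDR at which this PSM would still be accepted.
    qvalue = [0.0] * n
    best = 1.0
    for i in reversed(order):
        best = min(best, fdr[i])
        qvalue[i] = best
    return pep, fdr, qvalue
```

With this estimator the FDR is already non-decreasing along the ranking, so the running minimum leaves it unchanged; it is kept because it is the general definition of a q-value.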
|
| --- |
|
|
| ## Training data |
|
|
| - This calibrator was trained on the HeLa single-shot dataset (PXD044934). |
| - All default features were enabled during training. |
| - Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams. |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use `winnow` in your research, please cite our preprint: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952) |
|
|
| ```bibtex |
| @article{mabona2025novopeptidesequencingrescoring, |
| title = {De novo peptide sequencing rescoring and FDR estimation with Winnow}, |
| author = {Amandla Mabona and Jemma Daniel and Henrik Servais Janssen Knudsen and |
| Rachel Catzel and Kevin Michael Eloff and Erwin M. Schoof and Nicolas |
| Lopez Carranza and Timothy P. Jenkins and Jeroen Van Goey and |
| Konstantinos Kalogeropoulos}, |
| year = {2025}, |
| eprint = {2509.24952}, |
| archivePrefix = {arXiv}, |
| primaryClass = {q-bio.QM}, |
| url = {https://arxiv.org/abs/2509.24952}, |
| } |
| ``` |
|
|
| If you use this calibrator trained on HeLa Single Shot data, please cite: |
|
|
| ```bibtex |
| @misc{instadeep_ltd_2025, |
| author = { InstaDeep Ltd }, |
| title = { winnow-helaqc-model (Revision b826cbb) }, |
| year = 2025, |
| url = { https://huggingface.co/InstaDeepAI/winnow-helaqc-model }, |
| doi = { 10.57967/hf/6612 }, |
| publisher = { Hugging Face } |
| } |
| ``` |
|
|
| If you use the `InstaNovo` model to generate predictions, please also cite: [InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments](https://doi.org/10.1038/s42256-025-01019-5) |
|
|
| ```bibtex |
| @article{eloff_kalogeropoulos_2025_instanovo, |
| title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale |
| proteomics experiments}, |
| author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell, |
| Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen, |
| Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J. |
| and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars, |
| Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and |
| Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.}, |
| year = 2025, |
| month = {Mar}, |
| day = 31, |
| journal = {Nature Machine Intelligence}, |
| doi = {10.1038/s42256-025-01019-5}, |
| issn = {2522-5839}, |
| url = {https://doi.org/10.1038/s42256-025-01019-5} |
| } |
| ``` |
|
|
| ## Contact |
| For issues with dataset structure or usage in Winnow, please open an issue on the Winnow GitHub: https://github.com/instadeepai/winnow |