JemmaDaniel committed
Commit 515509f · verified · 1 parent: d7224bb

Create model card

Files changed (1)
  1. README.md +128 -0
README.md ADDED
@@ -0,0 +1,128 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - proteomics
+ - mass-spectrometry
+ - peptide-sequencing
+ - de-novo
+ - calibration
+ - fdr
+ ---
+
+ ## Winnow General Probability Calibrator
+
+ **Winnow** recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows.
+ This repository hosts a pretrained, general-purpose calibrator that maps raw InstaNovo model confidences and complementary features (mass error, retention time, chimericity, beam features, Prosit features) to well-calibrated probabilities.
+
+ - Intended inputs: spectrum data and the corresponding MS/MS PSM results produced by InstaNovo.
+ - Outputs: calibrated per-PSM probabilities in `calibrated_confidence`.
+
+ ### What’s inside
+ - `calibrator.pkl`: trained classifier
+ - `scaler.pkl`: feature standardiser
+ - `irt_predictor.pkl`: Prosit iRT regressor used by RT features
+
+ ---
+
+ ## How to use
+
+ ### Python
+ ```python
+ from pathlib import Path
+ from huggingface_hub import snapshot_download
+ from winnow.calibration.calibrator import ProbabilityCalibrator
+ from winnow.datasets.data_loaders import InstaNovoDatasetLoader
+ from winnow.scripts.main import filter_dataset
+ from winnow.fdr.nonparametric import NonParametricFDRControl
+
+ # 1) Download model files
+ general_model = Path("general_model")  # local target folder; any writable path works
+ snapshot_download(
+     repo_id="InstaDeepAI/winnow-general-model",
+     allow_patterns=["*.pkl"],
+     repo_type="model",
+     local_dir=general_model,
+ )
+
+ # 2) Load calibrator
+ calibrator = ProbabilityCalibrator.load(general_model)
+
+ # 3) Load your dataset (InstaNovo-style config)
+ dataset = InstaNovoDatasetLoader().load(
+     "path_to_spectrum_data.parquet",
+     "path_to_instanovo_predictions.csv",
+ )
+ dataset = filter_dataset(dataset)  # standard Winnow filtering
+
+ # 4) Predict calibrated confidences
+ calibrator.predict(dataset)  # adds dataset.metadata["calibrated_confidence"]
+
+ # 5) Optional: FDR control on calibrated confidence
+ fdr = NonParametricFDRControl()
+ fdr.fit(dataset.metadata["calibrated_confidence"])
+ cutoff = fdr.get_confidence_cutoff(0.05)  # 5% FDR cutoff
+ dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
+ ```
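
Step 5 works because a well-calibrated probability can be read as the chance that a PSM is correct, so the expected share of errors among accepted PSMs can be estimated from the scores themselves. The following is a minimal, self-contained sketch of that idea, not Winnow's `NonParametricFDRControl` implementation; the function name `confidence_cutoff` and the toy scores are hypothetical.

```python
def confidence_cutoff(confidences, fdr_threshold):
    """Return the lowest score cutoff whose estimated FDR stays within the threshold.

    Assumes each confidence is a calibrated probability that the PSM is correct,
    so (1 - confidence) is its expected contribution to the error count.
    """
    cutoff = None
    error_sum = 0.0
    # Walk from the most to the least confident PSM, tracking the running FDR estimate.
    for n_accepted, conf in enumerate(sorted(confidences, reverse=True), start=1):
        error_sum += 1.0 - conf
        if error_sum / n_accepted <= fdr_threshold:
            cutoff = conf  # accepting everything down to this score is still within budget
    return cutoff  # None if no cutoff satisfies the threshold


# Toy scores (illustrative only, not real data)
scores = [0.99, 0.98, 0.95, 0.90, 0.60, 0.30]
cutoff = confidence_cutoff(scores, 0.05)
kept = [s for s in scores if cutoff is not None and s >= cutoff]
print(cutoff, kept)
```

With these toy scores the first four PSMs are kept: admitting the fifth would push the estimated error rate among accepted PSMs above 5%.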
+
+ ### CLI
+ ```bash
+ # After `pip install winnow`
+ winnow predict \
+     --data-source instanovo \
+     --dataset-config-path config_with_dataset_paths.yaml \
+     --model-folder general_model_folder \
+     --method winnow \
+     --fdr-threshold 0.05 \
+     --confidence-column calibrated_confidence \
+     --output-path outputs/winnow_predictions.csv
+ ```
+
+ ---
+
+ ## Inputs and outputs
+ **Expected columns for calibration (required unless marked optional):**
+ - Spectrum data (`*.parquet`)
+   - `spectrum_id` (string): unique spectrum identifier
+   - `sequence` (string): ground truth peptide sequence from database search (optional)
+   - `retention_time` (float): retention time (seconds)
+   - `precursor_mass` (float): mass of the precursor ion (from MS1)
+   - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum
+   - `intensity_array` (list[float]): intensity values of the MS2 spectrum
+   - `precursor_charge` (int): charge of the precursor (from MS1)
+
+ - Beam predictions (`*_beams.csv`)
+   - `spectrum_id` (string)
+   - `sequence` (string): ground truth peptide sequence from database search (optional)
+   - `preds` (string): top prediction, untokenised sequence
+   - `preds_tokenised` (string): comma-separated tokens for the top prediction
+   - `log_probs` (float): top prediction log probability
+   - `preds_beam_k` (string): untokenised sequence for beam k (k≥0)
+   - `log_probs_beam_k` (float): log probability for beam k
+   - `token_log_probs_k` (string/list-encoded): per-token log probabilities for beam k
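
Before running calibration, it can help to check that both input tables carry the columns listed above. The sketch below is a hypothetical pre-flight check in plain Python and is not part of Winnow; the helper `missing_columns` and the example header are assumptions for illustration. The beam-indexed columns (`preds_beam_k` and so on) are left out because their number depends on the beam size.

```python
# Column names taken from the lists above; `sequence` is optional and therefore not checked.
SPECTRUM_COLUMNS = {
    "spectrum_id", "retention_time", "precursor_mass",
    "mz_array", "intensity_array", "precursor_charge",
}
BEAM_COLUMNS = {"spectrum_id", "preds", "preds_tokenised", "log_probs"}


def missing_columns(header, expected):
    """Return the expected column names absent from `header`, sorted for stable output."""
    return sorted(expected - set(header))


# Hypothetical header of a predictions file that lacks `log_probs`
beam_header = ["spectrum_id", "preds", "preds_tokenised", "preds_beam_0"]
print(missing_columns(beam_header, BEAM_COLUMNS))
```

In practice the header would come from the loaded table, for example the column names of the parquet or CSV file.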
+
+ **Output columns (added by Winnow's calibrator on `predict`):**
+ - `calibrated_confidence`: calibrated probability
+ - Optional (if requested): `psm_pep`, `psm_fdr`, `psm_qvalue`
+ - All input columns are retained in place
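
For orientation, the optional columns can be related to `calibrated_confidence` under the standard definitions: the posterior error probability (PEP) of a PSM is one minus its calibrated probability, the FDR at a given rank is the mean PEP of all PSMs ranked at or above it, and the q-value is the smallest FDR at which the PSM would still be accepted. The sketch below assumes these textbook definitions; Winnow's own computation of `psm_pep`, `psm_fdr` and `psm_qvalue` may differ in detail, and the function name `pep_fdr_qvalue` is hypothetical.

```python
def pep_fdr_qvalue(confidences):
    """Return (pep, fdr, qvalue) lists aligned with the input order."""
    # Indices of the PSMs from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    pep = [1.0 - c for c in confidences]

    # FDR at each rank: mean PEP of everything accepted down to that rank.
    fdr = [0.0] * len(confidences)
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += pep[i]
        fdr[i] = running / rank

    # q-value: smallest FDR at this rank or any lower-confidence rank.
    qvalue = [0.0] * len(confidences)
    best = float("inf")
    for i in reversed(order):
        best = min(best, fdr[i])
        qvalue[i] = best
    return pep, fdr, qvalue


# Toy confidences (illustrative only)
pep, fdr, qvalue = pep_fdr_qvalue([1.0, 0.5, 0.75])
print(pep, fdr, qvalue)
```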
+
+ ---
+
+ ## Training data
+
+ - The general model was trained on a pooled, labelled set spanning multiple public datasets to encourage cross-dataset generalisation:
+   - HeLa single-shot (PXD044934)
+   - *Candidatus* Scalindua brodae (PXD044934)
+   - Wound exudates (PXD025748)
+   - HepG2 (PXD019483)
+   - Immunopeptidomics (PXD006939)
+   - HeLa degradome (PXD044934)
+   - Snake venoms (PXD036161)
+ - All default features were enabled when training this model.
+ - Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams.
+
+ ---
+
+ ## Citation
+
+ If you use Winnow or this model, please cite: