| --- |
| license: cc-by-nc-sa-4.0 |
| library_name: pytorch |
| tags: |
| - proteomics |
| - mass-spectrometry |
| - peptide-sequencing |
| - de-novo-sequencing |
| - transformer |
| - biology |
| - computational-biology |
| pipeline_tag: text-generation |
| datasets: |
| - InstaDeepAI/ms_ninespecies_benchmark |
| - InstaDeepAI/ms_proteometools |
| --- |
| |
| # InstaNovo: De novo Peptide Sequencing Model |
| ## Model Description |
|
|
| InstaNovo is a transformer-based model for de novo peptide sequencing from tandem mass spectrometry (MS/MS) data. It predicts peptide sequences directly from MS/MS spectra without requiring a protein database, which enables database-free peptide identification in large-scale proteomics experiments. This makes it particularly valuable for discovering novel peptides, identifying post-translational modifications, and sequencing samples from organisms with incomplete genomic databases. |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| import numpy as np |
| import pandas as pd |
| from instanovo.transformer.model import InstaNovo |
| from instanovo.utils import SpectrumDataFrame |
| from instanovo.transformer.dataset import SpectrumDataset, collate_batch |
| from torch.utils.data import DataLoader |
| from instanovo.inference import ScoredSequence |
| from instanovo.inference import BeamSearchDecoder |
| from instanovo.utils.metrics import Metrics |
| from tqdm.notebook import tqdm |
| |
| # Load the model from the Hugging Face Hub |
| model, config = InstaNovo.from_pretrained("InstaDeepAI/instanovo-v1.1.0") |
| |
| # Move the model to the GPU if available |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| model = model.to(device).eval() |
| |
| # Update the residue set with custom modifications |
| model.residue_set.update_remapping( |
|     { |
|         "M(ox)": "M[UNIMOD:35]",  # Oxidation |
|         "M(+15.99)": "M[UNIMOD:35]", |
|         "S(p)": "S[UNIMOD:21]",  # Phosphorylation |
|         "T(p)": "T[UNIMOD:21]", |
|         "Y(p)": "Y[UNIMOD:21]", |
|         "S(+79.97)": "S[UNIMOD:21]", |
|         "T(+79.97)": "T[UNIMOD:21]", |
|         "Y(+79.97)": "Y[UNIMOD:21]", |
|         "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation |
|         "N(+0.98)": "N[UNIMOD:7]", |
|         "Q(+.98)": "Q[UNIMOD:7]", |
|         "N(+.98)": "N[UNIMOD:7]", |
|         "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation |
|         "(+42.01)": "[UNIMOD:1]",  # Acetylation |
|         "(+43.01)": "[UNIMOD:5]",  # Carbamylation |
|         "(-17.03)": "[UNIMOD:385]",  # Ammonia loss |
|     } |
| ) |
| |
| # Load the test data |
| sdf = SpectrumDataFrame.from_huggingface( |
|     "InstaDeepAI/ms_ninespecies_benchmark", |
|     is_annotated=True, |
|     shuffle=False, |
|     split="test[:10%]",  # Use only a subset of the test data for faster inference |
| ) |
| |
| # Create the dataset |
| ds = SpectrumDataset( |
|     sdf, |
|     model.residue_set, |
|     config.get("n_peaks", 200), |
|     return_str=True, |
|     annotated=True, |
| ) |
| |
| # Create the data loader |
| dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch) |
| |
| # Create the decoder |
| decoder = BeamSearchDecoder(model=model) |
| |
| # Initialize lists to store predictions and targets |
| preds = [] |
| targs = [] |
| probs = [] |
| |
| # Iterate over the data loader |
| for _, batch in tqdm(enumerate(dl), total=len(dl)): |
|     spectra, precursors, _, peptides, _ = batch |
|     spectra = spectra.to(device) |
|     precursors = precursors.to(device) |
| |
|     # Perform inference |
|     with torch.no_grad(): |
|         p = decoder.decode( |
|             spectra=spectra, |
|             precursors=precursors, |
|             beam_size=config["n_beams"], |
|             max_length=config["max_length"], |
|         ) |
| |
|     # Collect predictions, log-probabilities, and targets for this batch |
|     preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p] |
|     probs += [ |
|         x.sequence_log_probability if isinstance(x, ScoredSequence) else -float("inf") for x in p |
|     ] |
|     targs += list(peptides) |
| |
| # Initialize metrics |
| metrics = Metrics(model.residue_set, config["isotope_error_range"]) |
| |
| |
| # Compute precision and recall |
| aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall( |
|     targs, preds |
| ) |
| |
| # Compute amino acid error rate and AUC |
| aa_error_rate = metrics.compute_aa_er(targs, preds) |
| auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs))) |
| |
| print(f"amino acid error rate: {aa_error_rate:.5f}") |
| print(f"amino acid precision: {aa_precision:.5f}") |
| print(f"amino acid recall: {aa_recall:.5f}") |
| print(f"peptide precision: {peptide_precision:.5f}") |
| print(f"peptide recall: {peptide_recall:.5f}") |
| print(f"area under the PR curve: {auc:.5f}") |
| ``` |
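
The loop above stores one log-probability per spectrum, using `-inf` for spectra where decoding returned no sequence, and exponentiating maps these values onto a 0 to 1 confidence scale. A minimal sketch of how such scores could be used to keep only high-confidence predictions is shown below. The helper name `filter_by_confidence`, the `0.9` threshold, and the toy inputs are illustrative assumptions and are not part of the InstaNovo API.

```python
import math


def filter_by_confidence(preds, log_probs, threshold=0.9):
    """Keep (prediction, confidence) pairs with confidence >= threshold.

    Hypothetical helper for illustration; not part of the InstaNovo package.
    Confidence is exp(log-probability), so a failed decode (-inf) maps to 0.0.
    """
    kept = []
    for pred, log_prob in zip(preds, log_probs):
        confidence = math.exp(log_prob)  # math.exp(-inf) == 0.0
        if confidence >= threshold:
            kept.append((pred, confidence))
    return kept


# Toy inputs standing in for `preds` and `probs` from the loop above.
toy_preds = [["P", "E", "P"], [], ["T", "I", "D", "E"]]
toy_log_probs = [-0.01, -float("inf"), -2.3]

kept = filter_by_confidence(toy_preds, toy_log_probs, threshold=0.9)
for pred, confidence in kept:
    print("".join(pred), f"{confidence:.3f}")
```

A suitable threshold depends on the dataset and on the precision required, so the value here is only a placeholder.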
|
|
| For a more detailed walkthrough, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository. |
|
|
|
|
| ## Citation |
|
|
| If you use InstaNovo in your research, please cite: |
|
|
| ```bibtex |
| @article{eloff_kalogeropoulos_2025_instanovo, |
|   title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale |
|            proteomics experiments}, |
|   author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell, |
|             Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen, |
|             Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J. |
|             and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars, |
|             Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and |
|             Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.}, |
|   year = {2025}, |
|   month = {Mar}, |
|   day = {31}, |
|   journal = {Nature Machine Intelligence}, |
|   doi = {10.1038/s42256-025-01019-5}, |
|   issn = {2522-5839}, |
|   url = {https://doi.org/10.1038/s42256-025-01019-5} |
| } |
| ``` |
|
|
| ## Resources |
|
|
| - **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo) |
| - **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/) |
| - **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5) |
|
|
| ## License |
|
|
| - **Code**: Licensed under Apache License 2.0 |
| - **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
|
|
| ## Installation |
|
|
| ```bash |
| pip install instanovo |
| ``` |
|
|
| For GPU support, install with CUDA dependencies: |
| ```bash |
| pip install "instanovo[cu126]" |
| ``` |
|
|
| ## Requirements |
|
|
| - Python >= 3.10, < 3.13 |
| - PyTorch >= 1.13.0 |
| - CUDA (optional, for GPU acceleration) |
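
The Python constraint in the list above can be checked before installing. The sketch below uses only the standard library; the function name `python_supported` is hypothetical, and the version bounds are copied from the requirements list, not read from the package metadata.

```python
import sys

# Bounds taken from the requirements list: Python >= 3.10, < 3.13.
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 13)


def python_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper; defaults to the running interpreter's version.
    """
    major_minor = tuple(version)[:2] if version is not None else tuple(sys.version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if python_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```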
|
|
| ## Support |
|
|
| For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/). |
|
|