bilalsm committed on
Commit
21ba6b5
·
verified ·
1 Parent(s): bba543a

Update README.md

Files changed (1)
  1. README.md +82 -4
README.md CHANGED
@@ -10,23 +10,98 @@ tags:
10
  pipeline_tag: feature-extraction
11
  ---
12
 
13
- # ProteoRift Model
14
 
15
- ## Model Description
16
 
17
- ProteoRift is an end-to-end machine learning model for peptide database search in mass spectrometry proteomics. The model predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra, enabling efficient search-space reduction.
 
 
18
 
19
 
20
 
21
  ## Usage
22
 
23
 
24
  ## Training Data
25
 
26
  The model was trained on large-scale mass spectrometry datasets including:
27
  - NIST human peptide libraries
28
  - MassIVE public datasets
29
-
30
 
31
  ## System Requirements
32
 
@@ -34,7 +109,10 @@ The model was trained on large-scale mass spectrometry datasets including:
34
  - **Python:** 3.8+
35
  - **PyTorch:** 1.10+
36
 
 
 
37
 
 
38
 
39
  ## License
40
 
 
10
  pipeline_tag: feature-extraction
11
  ---
12
 
13
+ # ProteoRift
14
 
15
+ [GitHub](https://github.com/pcdslab/ProteoRift) | [Cite](#citation)
16
 
17
+ ## Abstract
18
+
19
+ Mass-based filtering significantly reduces the peptide candidate pool for subsequent scoring in database search algorithms. While useful, filtering on a single property may exclude non-abundant spectra and uncharacterized peptides, potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention-based multitask deep network that predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra, correctly 77.8% of the time. Integrating ProteoRift into an end-to-end pipeline significantly reduces the search space compared to mass-only filtering, delivering 8x to 12x speedups while maintaining peptide deduction accuracy comparable to established algorithmic techniques. We also developed uncertainty estimation metrics that distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict high-scoring mass spectra against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end pipeline available at https://github.com/pcdslab/ProteoRift.
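The search-space reduction described in the abstract can be pictured with a small sketch. This is not ProteoRift's implementation: `Candidate`, `missed_cleavages`, `within_ppm`, and `filter_candidates` are hypothetical names, the masses are made-up placeholders, tryptic cleavage is assumed, and exact matching on the predicted properties is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    sequence: str
    mass: float      # placeholder theoretical mass in Da (not a real value)
    modified: bool


def missed_cleavages(sequence: str) -> int:
    """Count internal tryptic sites: K or R not followed by P, ignoring the last residue."""
    return sum(
        1
        for i, aa in enumerate(sequence[:-1])
        if aa in "KR" and sequence[i + 1] != "P"
    )


def within_ppm(observed: float, theoretical: float, tol_ppm: float) -> bool:
    """True if the relative mass error is within the tolerance, in parts per million."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm


def filter_candidates(candidates, observed_mass, tol_ppm, length, n_missed, modified):
    """Return (mass-only pool, pool narrowed further by the predicted properties)."""
    mass_only = [c for c in candidates if within_ppm(observed_mass, c.mass, tol_ppm)]
    narrowed = [
        c
        for c in mass_only
        if len(c.sequence) == length
        and missed_cleavages(c.sequence) == n_missed
        and c.modified == modified
    ]
    return mass_only, narrowed


# Toy database; the predicted properties stand in for the model's output.
candidates = [
    Candidate("PEPTIDEK", 1000.0000, False),
    Candidate("PEPKTIDER", 1000.0050, False),
    Candidate("SAMPLERK", 1000.0080, True),
    Candidate("ELVISK", 1200.0000, False),
]
mass_only, narrowed = filter_candidates(
    candidates, observed_mass=1000.002, tol_ppm=10.0,
    length=8, n_missed=0, modified=False,
)
print(len(mass_only), "candidates after mass filtering;", len(narrowed), "after property filtering")
```

Only the candidates in the narrowed pool would need to be scored against the spectrum, which is where a speedup over mass-only filtering would come from.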
20
 
21
 
22
 
23
  ## Usage
24
 
25
+ ### Installation
26
+
27
+ ```bash
28
+ pip install proteorift
29
+ ```
30
+
31
+
32
+ ### Using Sample Data
33
+
34
+ ```python
35
+ from proteorift import ProteoRiftSearch
36
+
37
+ # Initialize and run with sample data
38
+ searcher = ProteoRiftSearch()
39
+ results = searcher.search_with_sample_data()
40
+
41
+ print(f"Results saved to: {results['output_dir']}")
42
+ ```
43
+
44
+ ### Using Your Own Data
45
+
46
+ ```python
47
+ from proteorift import ProteoRiftSearch
48
+
49
+ # Initialize search
50
+ searcher = ProteoRiftSearch()
51
+
52
+ # Run peptide database search
53
+ results = searcher.search(
54
+     mgf_dir="path/to/your/spectra",  # Directory with MGF files
55
+     peptide_db="path/to/your/database",  # Directory with FASTA files
56
+     output_dir="./results"
57
+ )
58
+ ```
59
+
60
+ ### Custom Parameters
61
+
62
+ ```python
63
+ searcher = ProteoRiftSearch(
64
+     precursor_tolerance=10,
65
+     precursor_tolerance_type="ppm",
66
+     charge=3,
67
+     length_filter=True,
68
+     device="cuda"  # or "cpu", "auto"
69
+ )
70
+
71
+ results = searcher.search(mgf_dir="...", peptide_db="...")
72
+ ```
73
+
74
+ ### Command Line Interface
75
+
76
+ ```bash
77
+ # Run search with sample data
78
+ proteorift search-sample --output-dir ./results
79
+
80
+ # Run search with your data
81
+ proteorift search \
82
+   --mgf-dir path/to/spectra \
83
+   --peptide-db path/to/database \
84
+   --output-dir ./results \
85
+   --tolerance 10 \
86
+   --charge 3
87
+
88
+ # Download models only
89
+ proteorift download-models
90
+ ```
91
+
92
+ ## Output
93
+
94
+ ProteoRift generates Percolator-compatible PIN files:
95
+ - `target.pin` - Target peptide-spectrum matches
96
+ - `decoy.pin` - Decoy peptide-spectrum matches
97
+
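A minimal sketch of loading such a file for inspection, assuming ProteoRift writes the usual Percolator tab-delimited layout: a header row, fixed columns up to `Peptide`, and one or more tab-separated protein IDs at the end of each row. The `read_pin` helper and the `score` feature column are hypothetical and used only for illustration.

```python
import csv
import io


def read_pin(handle):
    """Yield one dict per PSM from a tab-delimited PIN file handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_fixed = len(header) - 1  # every column before the trailing protein list
    for row in reader:
        # Skip blank lines and the optional DefaultDirection row some PIN files carry.
        if not row or row[0] == "DefaultDirection":
            continue
        record = dict(zip(header[:n_fixed], row[:n_fixed]))
        record[header[-1]] = row[n_fixed:]  # remaining fields are protein IDs
        yield record


# In-memory stand-in for a PIN file; the column set here is an assumption.
SAMPLE = (
    "SpecId\tLabel\tScanNr\tscore\tPeptide\tProteins\n"
    "spec_1\t1\t101\t0.91\tK.PEPTIDEK.A\tprotA\tprotB\n"
    "spec_2\t1\t102\t0.40\tR.SAMPLER.G\tprotC\n"
)
psms = list(read_pin(io.StringIO(SAMPLE)))
for psm in psms:
    print(psm["SpecId"], psm["Peptide"], psm["Proteins"])
```

For a real run, the same function could be given an open file handle instead, for example `open("results/target.pin", newline="")`, assuming the files land directly in the output directory.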
98
 
99
  ## Training Data
100
 
101
  The model was trained on large-scale mass spectrometry datasets including:
102
  - NIST human peptide libraries
103
  - MassIVE public datasets
104
+ - DeepNovo
105
 
106
  ## System Requirements
107
 
 
109
  - **Python:** 3.8+
110
  - **PyTorch:** 1.10+
111
 
112
+ ## Citation
113
+ If you use ProteoRift in your research, please cite the following paper:
114
 
115
+ Tariq, U., Shabbir, B. & Saeed, F. End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43215-2
116
 
117
  ## License
118