SaeedLab/ProteoRift

@@ -10,23 +10,98 @@ tags:
 pipeline_tag: feature-extraction
 ---
-# ProteoRift Model
-## Model Description
-ProteoRift is an end-to-end machine learning model for peptide database search in mass spectrometry proteomics. The model predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra, enabling efficient search-space reduction.
 ## Usage
 ## Training Data
 The model was trained on large-scale mass spectrometry datasets including:
 - NIST human peptide libraries
 - MassIVE public datasets
 ## System Requirements
@@ -34,7 +109,10 @@ The model was trained on large-scale mass spectrometry datasets including:
 - **Python:** 3.8+
 - **PyTorch:** 1.10+
 ## License

 pipeline_tag: feature-extraction
 ---
+# ProteoRift
+[GitHub](https://github.com/pcdslab/ProteoRift) | [Cite](#citation)
+## Abstract
+Mass-based filtering significantly reduces the peptide candidate pool for subsequent scoring in database search algorithms. While useful, filtering based on one property may lead to exclusion of non-abundant spectra and uncharacterized peptides – potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention-based multitask deep network, which can predict multiple peptide properties (length, missed cleavages, and modification status) directly from spectra 77.8% of the time. Integrating ProteoRift into an end-to-end pipeline significantly reduces the search space compared to mass-only filtering. This delivers 8x to 12x speedups while maintaining peptide deduction accuracy comparable to established algorithmic techniques. We also developed uncertainty estimation metrics, which can distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict high-scoring mass spectra against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end pipeline available at https://github.com/pcdslab/ProteoRift.
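The search-space reduction described in the abstract can be illustrated with a toy sketch. This is not ProteoRift's implementation: the `Candidate` class, the helper functions, the candidate masses, and the "predicted" property values are all hypothetical. It only shows how adding length, missed-cleavage, and modification filters on top of a mass window shrinks the candidate pool. Missed cleavages are counted with the usual tryptic rule (K or R not followed by P), which is an assumption about the digestion used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    sequence: str
    mass: float      # toy neutral mass in Da (illustrative values, not computed)
    modified: bool   # whether the candidate carries a modification


def missed_cleavages(sequence: str) -> int:
    """Count internal tryptic sites: K or R not followed by P.

    The C-terminal residue is excluded because it is the cleavage point itself.
    """
    return sum(
        1
        for i in range(len(sequence) - 1)
        if sequence[i] in "KR" and sequence[i + 1] != "P"
    )


def mass_filter(candidates, precursor_mass, tol_da):
    """Keep candidates whose mass lies within +/- tol_da of the precursor mass."""
    return [c for c in candidates if abs(c.mass - precursor_mass) <= tol_da]


def property_filter(candidates, length, missed, modified):
    """Keep candidates that match the (hypothetical) predicted properties."""
    return [
        c
        for c in candidates
        if len(c.sequence) == length
        and missed_cleavages(c.sequence) == missed
        and c.modified == modified
    ]


candidates = [
    Candidate("PEPTIDEK", 1000.50, False),
    Candidate("PEKTIDER", 1000.52, False),     # one missed cleavage (internal K)
    Candidate("AAAAAAAAAAK", 1000.49, False),  # different length
    Candidate("PEPTIDEK", 1000.51, True),      # modified form
    Candidate("GGGGK", 500.00, False),         # outside the mass window
]

by_mass = mass_filter(candidates, precursor_mass=1000.50, tol_da=0.05)
by_properties = property_filter(by_mass, length=8, missed=0, modified=False)

print(len(by_mass), len(by_properties))
```

In this toy pool the mass window alone leaves four candidates to score, while the additional property filters leave one.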
 ## Usage
+### Installation
+```bash
+pip install proteorift
+```
+### Using Sample Data
+```python
+from proteorift import ProteoRiftSearch
+# Initialize and run with sample data
+searcher = ProteoRiftSearch()
+results = searcher.search_with_sample_data()
+print(f"Results saved to: {results['output_dir']}")
+```
+### Using Your Own Data
+```python
+from proteorift import ProteoRiftSearch
+# Initialize search
+searcher = ProteoRiftSearch()
+# Run peptide database search
+results = searcher.search(
+    mgf_dir="path/to/your/spectra",      # Directory with MGF files
+    peptide_db="path/to/your/database",  # Directory with FASTA files
+    output_dir="./results"
+)
+```
+### Custom Parameters
+```python
+from proteorift import ProteoRiftSearch
+# Override the default search settings
+searcher = ProteoRiftSearch(
+    precursor_tolerance=10,
+    precursor_tolerance_type="ppm",
+    charge=3,
+    length_filter=True,
+    device="cuda"  # or "cpu", "auto"
+)
+results = searcher.search(mgf_dir="...", peptide_db="...")
+```
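For readers unfamiliar with ppm tolerances, the settings `precursor_tolerance=10` and `precursor_tolerance_type="ppm"` correspond to a mass window that scales with the precursor mass. The sketch below shows the standard ppm arithmetic only. The `ppm_window` helper is hypothetical and is not part of the `proteorift` package, and whether ProteoRift applies the tolerance to m/z or to neutral mass is not stated here.

```python
def ppm_window(mass: float, tol_ppm: float) -> tuple[float, float]:
    """Return the (low, high) bounds of a +/- tol_ppm window around mass."""
    delta = mass * tol_ppm / 1e6  # parts per million of the reference mass
    return mass - delta, mass + delta


# A 10 ppm tolerance at 1500 Da is a window of +/- 0.015 Da.
low, high = ppm_window(1500.0, 10)
print(f"{low:.3f} to {high:.3f}")
```

Because the window is proportional to mass, the same 10 ppm setting is tighter in absolute terms for light precursors and wider for heavy ones.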
+### Command Line Interface
+```bash
+# Run search with sample data
+proteorift search-sample --output-dir ./results
+# Run search with your data
+proteorift search \
+    --mgf-dir path/to/spectra \
+    --peptide-db path/to/database \
+    --output-dir ./results \
+    --tolerance 10 \
+    --charge 3
+# Download models only
+proteorift download-models
+```
+## Output
+ProteoRift generates Percolator-compatible PIN files:
+- `target.pin` - Target peptide-spectrum matches
+- `decoy.pin` - Decoy peptide-spectrum matches
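The PIN files can be inspected before running Percolator. The following is a minimal sketch that assumes the standard tab-delimited PIN layout: a header row, a `Label` column (1 for targets, -1 for decoys), and a final `Proteins` column that may span several tab-separated fields. The `read_pin` helper and the `Score` feature column in the sample are hypothetical, and the actual feature columns written by ProteoRift may differ.

```python
import csv
import io
from collections import Counter


def read_pin(handle):
    """Parse a tab-delimited PIN file into a list of dicts.

    The last header column (typically Proteins) collects all remaining
    fields, since a PSM can map to several proteins.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header[:-1], fields))
        row[header[-1]] = fields[len(header) - 1:]
        rows.append(row)
    return rows


# In-memory sample standing in for a real file; values are made up.
sample = (
    "SpecId\tLabel\tScanNr\tScore\tPeptide\tProteins\n"
    "s1\t1\t101\t0.92\tK.PEPTIDEK.A\tP1\tP2\n"
    "s2\t-1\t102\t0.31\tR.EDITPEPK.G\tdecoy_P3\n"
)

psms = read_pin(io.StringIO(sample))
label_counts = Counter(row["Label"] for row in psms)
print(label_counts["1"], "target,", label_counts["-1"], "decoy")
```

To read real output, pass an open file instead of the `io.StringIO` sample, for example `open("results/target.pin", newline="")`, adjusting the path to the chosen output directory.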
 ## Training Data
 The model was trained on large-scale mass spectrometry datasets including:
 - NIST human peptide libraries
 - MassIVE public datasets
+- DeepNovo
 ## System Requirements
 - **Python:** 3.8+
 - **PyTorch:** 1.10+
+## Citation
+If you use ProteoRift in your research, please cite the following paper:
+Tariq, U., Shabbir, B. & Saeed, F. End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43215-2
 ## License