Update README.md: Revise table of contents, enhance features section, and clarify installation instructions
README.md
CHANGED
|
@@ -1,110 +1,136 @@
|
|
| 1 |
-
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
## Features
|
| 6 |
-
|
| 7 |
-
- **Multiple Model Support**: Choose from three different prediction models:
|
| 8 |
-
- PROST-T5: Transformer-based protein language model
|
| 9 |
-
- ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
|
| 10 |
-
- ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
- **Flexible Output**: Save predictions with confidence scores in text format
|
| 15 |
-
- **Error Handling**: Comprehensive error handling and user feedback
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
-
|
| 24 |
-
- Transformers library
|
| 25 |
-
- ESM models
|
| 26 |
-
- Scikit-learn
|
| 27 |
-
- BioPython
|
| 28 |
-
- NumPy, Joblib
|
| 29 |
-
- Tkinter (GUI components)
|
| 30 |
|
| 31 |
-
###
|
| 32 |
|
| 33 |
-
|
| 34 |
-
- **Recommended**: 16GB+ RAM, NVIDIA GPU with 8GB+ VRAM
|
| 35 |
-
- **Storage**: ~5GB for model weights and cache
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
-
# Make sure git-lfs is installed (https://git-lfs.com)
|
| 45 |
-
git lfs install
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|
| 49 |
```
|
| 50 |
|
| 51 |
-
|
| 52 |
-
```bash
|
| 53 |
-
# If you want to clone without large files - just their pointers
|
| 54 |
-
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
|
| 55 |
-
```
|
| 56 |
|
| 57 |
-
2. **Navigate to the project directory**:
|
| 58 |
```bash
|
| 59 |
-
|
| 60 |
```
|
| 61 |
|
| 62 |
-
|
| 63 |
-
```bash
|
| 64 |
-
conda env create -n protein-predictor -f environment.yml
|
| 65 |
-
conda activate protein-predictor
|
| 66 |
-
```
|
| 67 |
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
|
| 72 |
-
##
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
-
```bash
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
```
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
|
| 87 |
-
2. **
|
| 88 |
-
- Click "File" → "Load FASTA"
|
| 89 |
-
- Select your protein sequences file (`.fasta`, `.fa`, or `.fas`)
|
| 90 |
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
|
| 96 |
-
|
| 97 |
-
- Click the corresponding prediction button
|
| 98 |
-
- Monitor progress in the progress bar window
|
| 99 |
-
- Select output directory when prompted
|
| 100 |
|
| 101 |
-
|
| 102 |
-
- Choose location and filename for prediction results
|
| 103 |
-
- Results are saved in CSV format with confidence scores for each subcellular location
|
| 104 |
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
|
|
|
|
|
|
| 108 |
|
| 109 |
```
|
| 110 |
>protein_1
|
|
@@ -113,117 +139,57 @@ MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
|
|
| 113 |
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
|
| 114 |
```
|
| 115 |
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
Results are saved as CSV files with predictions for 6 subcellular locations, ranked by probability:
|
| 119 |
|
| 120 |
```csv
|
| 121 |
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
|
| 122 |
-
|
|
|
|
| 123 |
```
|
| 124 |
|
| 125 |
-
**Predicted Locations:**
|
| 126 |
-
- **Cytoplasmic**: Interior of the cell
|
| 127 |
-
- **CytoplasmicMembrane**: Inner membrane
|
| 128 |
-
- **Periplasmic**: Space between inner and outer membranes
|
| 129 |
-
- **Extracellular**: Outside the cell
|
| 130 |
-
- **OuterMembrane**: Outer membrane
|
| 131 |
-
- **Cellwall**: Cell wall structure
|
| 132 |
-
|
| 133 |
## Model Details
|
| 134 |
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
-
|
| 138 |
-
-
|
| 139 |
-
-
|
| 140 |
-
|
| 141 |
-
### ESM-C Models
|
| 142 |
-
- **Base Models**: ESM-C 300M/600M
|
| 143 |
-
- **Embedding Dimension**: Variable (300M: 960, 600M: 1280)
|
| 144 |
-
- **Classifier**: Support Vector Machine (SVM)
|
| 145 |
-
- **Memory Usage**: 300M: ~2GB GPU, 600M: ~4GB GPU
|
| 146 |
-
|
| 147 |
-
## Troubleshooting
|
| 148 |
-
|
| 149 |
-
### Common Issues
|
| 150 |
-
|
| 151 |
-
1. **Out of Memory Errors**
|
| 152 |
-
- Reduce batch size or use CPU-only mode
|
| 153 |
-
- Close other applications to free memory
|
| 154 |
-
- Try smaller model (ESM-C 300M instead of 600M)
|
| 155 |
-
|
| 156 |
-
2. **Model Loading Errors**
|
| 157 |
-
- Ensure model files are in the correct `Models/` directory
|
| 158 |
-
- Check file permissions and integrity
|
| 159 |
-
- Clear Hugging Face cache: `rm -rf ~/.cache/huggingface/`
|
| 160 |
-
|
| 161 |
-
3. **CUDA Errors**
|
| 162 |
-
- Update GPU drivers
|
| 163 |
-
- Ensure CUDA-compatible PyTorch installation
|
| 164 |
-
- Fall back to CPU mode if GPU issues persist
|
| 165 |
-
|
| 166 |
-
### Performance Tips
|
| 167 |
-
|
| 168 |
-
- **GPU Usage**: Models automatically detect and use GPU when available
|
| 169 |
-
- **Memory Management**: CUDA cache is cleared after each prediction
|
| 170 |
-
- **Sequential Processing**: Sequences are processed one at a time with progress tracking
|
| 171 |
|
| 172 |
## Project Structure
|
| 173 |
|
| 174 |
```
|
| 175 |
ProteinLocationPredictor/
|
| 176 |
-
├── gui.py
|
| 177 |
├── src/
|
| 178 |
-
│   └── my_utils.py
|
| 179 |
-
|
| 180 |
-
│   ├──
|
| 181 |
-
│   ├── Prost T5_le_svm.joblib
|
| 182 |
│   ├── ESMC-300m_svm.joblib
|
| 183 |
│   ├── ESMC-600m_svm.joblib
|
| 184 |
│   └── ...
|
| 185 |
-
├── environment.yml
|
| 186 |
-
|
| 187 |
```
|
| 188 |
|
| 189 |
## Contributing
|
| 190 |
|
| 191 |
1. Fork the repository
|
| 192 |
-
2. Create a feature branch
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
author={Juan Diego Puglia},
|
| 209 |
-
year={2025},
|
| 210 |
-
url={https://huggingface.co/jpuglia/ProteinLocationPredictor}
|
| 211 |
-
}
|
| 212 |
-
```
|
| 213 |
-
|
| 214 |
-
## Acknowledgments
|
| 215 |
-
|
| 216 |
-
- [Rostlab](https://rostlab.org/) for the PROST-T5 model
|
| 217 |
-
- [Meta AI](https://ai.meta.com/) for the ESM models
|
| 218 |
-
- [Hugging Face](https://huggingface.co/) for model hosting and transformers library
|
| 219 |
-
- [BioPython](https://biopython.org/) for sequence handling utilities
|
| 220 |
-
|
| 221 |
-
## Contact
|
| 222 |
-
|
| 223 |
-
For questions, issues, or collaborations, please:
|
| 224 |
-
- Visit the [Hugging Face repository](https://huggingface.co/jpuglia/ProteinLocationPredictor)
|
| 225 |
-
- Open a discussion on the Hugging Face platform
|
| 226 |
-
|
| 227 |
-
---
|
| 228 |
-
|
| 229 |
-
**Note**: This tool is for research purposes. Please validate predictions with experimental methods for critical applications.
|
|
|
|
| 1 |
+
## Table of Contents
|
| 2 |
|
| 3 |
+
* [Protein Location Predictor](#protein-location-predictor)
|
| 4 |
|
| 5 |
+
* [Features](#features)
|
| 6 |
+
* [Requirements](#requirements)
|
|
|
|
|
|
|
| 7 |
|
| 8 |
+
* [Supported Python Version](#supported-python-version)
|
| 9 |
+
* [Dependencies (Full environment.yml)](#dependencies-full-environmentyml)
|
| 10 |
+
* [Hardware Requirements](#hardware-requirements)
|
| 11 |
+
* [Installation](#installation)
|
| 12 |
+
* [Usage](#usage)
|
| 13 |
|
| 14 |
+
* [GUI Mode](#gui-mode)
|
| 15 |
+
* [Example Input & Output](#example-input--output)
|
| 16 |
+
* [Model Details](#model-details)
|
| 17 |
+
* [Project Structure](#project-structure)
|
| 18 |
+
* [Contributing](#contributing)
|
| 19 |
|
| 20 |
+
## Protein Location Predictor
|
| 21 |
|
| 22 |
+
A comprehensive GUI application for predicting protein subcellular localization using state-of-the-art machine learning models including PROST-T5 and ESM-C embeddings.
|
| 23 |
|
| 24 |
+
### Features
|
| 25 |
|
| 26 |
+
* **Multiple Model Support**: Choose from three different prediction models:
|
|
|
|
|
|
|
| 27 |
|
| 28 |
+
* PROST-T5: Transformer-based protein language model
|
| 29 |
+
* ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
|
| 30 |
+
* ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
|
| 31 |
+
* **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (see screenshot below)
|
| 32 |
+
* **Sequential Processing**: Process multiple protein sequences from FASTA files
|
| 33 |
+
* **Flexible Output**: Save predictions with confidence scores in text (CSV) format
|
| 34 |
+
* **Error Handling**: Comprehensive error handling and user feedback
|
| 35 |
|
| 36 |
+
### Supported Python Version
|
| 37 |
|
| 38 |
+
This project has been tested on **Python 3.10+**.
|
| 39 |
|
| 40 |
+
## Requirements
|
|
|
|
|
|
|
| 41 |
|
| 42 |
+
### Dependencies (Full environment.yml)
|
| 43 |
+
|
| 44 |
+
The complete environment definition is located in `environment.yml`. This file includes all necessary packages for PyTorch, Transformers, ESM models, and GUI operation. Here is a brief excerpt:
|
| 45 |
+
|
| 46 |
+
```yaml
|
| 47 |
+
name: tesisEnv
|
| 48 |
+
channels:
|
| 49 |
+
- bioconda
|
| 50 |
+
- anaconda
|
| 51 |
+
- conda-forge
|
| 52 |
+
- defaults
|
| 53 |
+
|
| 54 |
+
# Python version and major packages
|
| 55 |
+
dependencies:
|
| 56 |
+
- python=3.10.16
|
| 57 |
+
- pytorch=2.6.0
|
| 58 |
+
- torchvision=0.21.0
|
| 59 |
+
- torchtext=0.18.0
|
| 60 |
+
- transformers=4.46.3
|
| 61 |
+
- scikit-learn=1.6.1
|
| 62 |
+
- biopython=1.85
|
| 63 |
+
- esm=3.1.4
|
| 64 |
+
- numpy=1.26.4
|
| 65 |
+
- joblib=1.4.2
|
| 66 |
+
- tk
|
| 67 |
+
# plus many others (see full file for complete list)
|
| 68 |
```
|
| 69 |
|
| 70 |
+
To ensure exact reproducibility, use:
|
| 71 |
|
|
|
|
| 72 |
```bash
|
| 73 |
+
conda env create -f environment.yml
|
| 74 |
```
|
| 75 |
|
| 76 |
+
### Hardware Requirements
|
| 77 |
|
| 78 |
+
* **Minimum**: 8 GB RAM, CPU-only execution
|
| 79 |
+
* **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
|
| 80 |
+
* **Storage**: \~5 GB for model weights and cache
|
| 81 |
|
| 82 |
+
## Installation
|
| 83 |
|
| 84 |
+
1. **Clone the repository** (with Git LFS for large model files):
|
| 85 |
|
| 86 |
+
```bash
|
| 87 |
+
git lfs install
|
| 88 |
+
git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
|
| 89 |
+
```
|
| 90 |
|
| 91 |
+
If you prefer to skip downloading model weights initially:
|
| 92 |
|
| 93 |
+
```bash
|
| 94 |
+
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
|
| 95 |
+
```
|
| 96 |
|
| 97 |
+
2. **Navigate into the project directory**:
|
|
|
|
|
|
|
| 98 |
|
| 99 |
+
```bash
|
| 100 |
+
cd ProteinLocationPredictor
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
3. **Create and activate the Conda environment**:
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
conda env create -f environment.yml
|
| 107 |
+
conda activate tesisEnv
|
| 108 |
+
```
|
| 109 |
+
|
| 110 |
+
4. **(If skipped above) Download model weights manually**:
|
| 111 |
+
Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:
|
| 112 |
+
|
| 113 |
+
```bash
|
| 114 |
+
git lfs pull
|
| 115 |
+
```
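To confirm that the weights were actually fetched, you can look for leftover Git LFS pointer files. The sketch below is not part of this repository. It assumes the weights live under `Models/` as described above, and it relies only on the fact that a Git LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`. The function names `is_lfs_pointer` and `unfetched_files` are hypothetical.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` starts like an LFS pointer instead of real data."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_files(models_dir: Path) -> list[Path]:
    """List files under `models_dir` that are still LFS pointers."""
    return sorted(
        p for p in models_dir.rglob("*") if p.is_file() and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    # "Models" is the directory named in this README; adjust if yours differs.
    models = Path("Models")
    if models.is_dir():
        for pointer in unfetched_files(models):
            print(f"still a pointer, run `git lfs pull`: {pointer}")
```

If the script prints nothing, every file under `Models/` holds real content rather than a pointer.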
|
| 116 |
+
|
| 117 |
+
## Usage
|
| 118 |
|
| 119 |
+
### GUI Mode
|
|
|
|
|
|
|
|
|
|
| 120 |
|
| 121 |
+
1. Launch the application:
|
|
|
|
|
|
|
| 122 |
|
| 123 |
+
```bash
|
| 124 |
+
python gui.py
|
| 125 |
+
```
|
| 126 |
+
2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
|
| 127 |
+
3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
|
| 128 |
+
4. Click **Run Prediction** and monitor the progress bar.
|
| 129 |
+
5. When complete, you will be prompted to choose an output directory and filename.
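Before loading a file in the GUI, it can help to check that it is well-formed FASTA. The application's own loader is not shown in this README, so the following is only a stand-alone sketch using the standard library. `read_fasta` and `has_fasta_suffix` are hypothetical helper names, and the only detail taken from this document is the list of accepted extensions.

```python
from pathlib import Path

# Extensions listed in the GUI instructions above.
FASTA_SUFFIXES = {".fasta", ".fa", ".fas"}


def has_fasta_suffix(path) -> bool:
    """Return True if the file name ends in one of the accepted extensions."""
    return Path(path).suffix.lower() in FASTA_SUFFIXES


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {sequence_id: sequence}.

    The ID is the first whitespace-delimited token after '>'.
    Multi-line sequences are joined; blank lines are ignored.
    """
    records: dict[str, str] = {}
    current_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without an identifier")
            current_id = parts[0]
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[current_id] += line
    return records


if __name__ == "__main__":
    # Placeholder record, not real data from this project.
    demo = ">demo_protein example\nMKTV\nRQER\n"
    for seq_id, seq in read_fasta(demo).items():
        print(seq_id, len(seq))
```

A file that raises `ValueError` here is likely to cause problems in any FASTA reader, so it is worth fixing before running a prediction.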
|
| 130 |
|
| 131 |
+
## Example Input & Output
|
| 132 |
+
|
| 133 |
+
**Input FASTA (`example/input.fasta`):**
|
| 134 |
|
| 135 |
```
|
| 136 |
>protein_1
|
|
|
|
| 139 |
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
|
| 140 |
```
|
| 141 |
|
| 142 |
+
**Output CSV (`example/output.csv`):**
|
|
|
|
|
|
|
| 143 |
|
| 144 |
```csv
|
| 145 |
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
|
| 146 |
+
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
|
| 147 |
+
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
|
| 148 |
```
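Each result cell combines a location label and its score, for example `Cytoplasmic (0.9860)`, so downstream scripts usually need to split the two. The sketch below shows one way to do that with the standard library. It assumes every cell follows the `Label (score)` pattern seen in the example; `parse_predictions` is a hypothetical helper, not a function shipped with this project.

```python
import csv
import io
import re

# Matches one result cell such as "Cytoplasmic (0.9860)".
CELL_PATTERN = re.compile(r"^\s*(?P<label>.+?)\s*\((?P<score>[0-9.]+)\)\s*$")

# Header and first data row copied from the example output above.
SAMPLE_CSV = (
    "Sequence_ID,Prediction 1,Prediction 2,Prediction 3,"
    "Prediction 4,Prediction 5,Prediction 6\n"
    "protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),"
    "Periplasmic (0.0029),Extracellular (0.0019),"
    "OuterMembrane (0.0007),Cellwall (0.0003)\n"
)


def parse_predictions(csv_text):
    """Return {sequence_id: [(label, score), ...]} in the file's rank order."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    results = {}
    for row in reader:
        if not row:
            continue
        ranked = []
        for cell in row[1:]:
            match = CELL_PATTERN.match(cell)
            if match is None:
                raise ValueError(f"unexpected cell format: {cell!r}")
            ranked.append((match["label"], float(match["score"])))
        results[row[0]] = ranked
    return results


if __name__ == "__main__":
    top_label, top_score = parse_predictions(SAMPLE_CSV)["protein_1"][0]
    print(f"protein_1: {top_label} ({top_score:.4f})")
```

Because the columns are already ranked, the first tuple in each list is the top prediction.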
|
| 149 |
|
| 150 |
## Model Details
|
| 151 |
|
| 152 |
+
| Model | Embedding Dim. | Classifier | GPU VRAM | RAM Usage |
|
| 153 |
+
| ------------ | -------------- | ---------- | -------- | --------- |
|
| 154 |
+
| PROST-T5 | 1024 | SVM | \~4 GB | \~8 GB |
|
| 155 |
+
| ESM-C (300M) | 960 | SVM | \~2 GB | \~6 GB |
|
| 156 |
+
| ESM-C (600M) | 1280 | SVM | \~4 GB | \~10 GB |
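The table can also be read as a rough lookup for which models a given machine can run on GPU. The sketch below encodes the approximate figures from the table and filters by available memory. Treating those figures as hard thresholds, and requiring both the VRAM and RAM columns to fit, are simplifying assumptions. `ModelSpec` and `models_that_fit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    embedding_dim: int
    gpu_vram_gb: float  # approximate, from the table above
    ram_gb: float       # approximate, from the table above


MODEL_SPECS = (
    ModelSpec("PROST-T5", 1024, 4, 8),
    ModelSpec("ESM-C (300M)", 960, 2, 6),
    ModelSpec("ESM-C (600M)", 1280, 4, 10),
)


def models_that_fit(vram_gb: float, ram_gb: float) -> list[str]:
    """Return model names whose listed VRAM and RAM needs are both met."""
    return [
        spec.name
        for spec in MODEL_SPECS
        if spec.gpu_vram_gb <= vram_gb and spec.ram_gb <= ram_gb
    ]


if __name__ == "__main__":
    # Example: a machine with 4 GB of VRAM and 8 GB of RAM.
    print(models_that_fit(vram_gb=4, ram_gb=8))
```

Real memory use depends on sequence length and driver overhead, so leave some headroom beyond the listed values.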
|
| 157 |
|
| 158 |
## Project Structure
|
| 159 |
|
| 160 |
```
|
| 161 |
ProteinLocationPredictor/
|
| 162 |
+
├── gui.py
|
| 163 |
├── src/
|
| 164 |
+
│   └── my_utils.py
|
| 165 |
+
├── Models/
|
| 166 |
+
│   ├── ProstT5_svm.joblib
|
|
|
|
| 167 |
│   ├── ESMC-300m_svm.joblib
|
| 168 |
│   ├── ESMC-600m_svm.joblib
|
| 169 |
│   └── ...
|
| 170 |
+
├── environment.yml
|
| 171 |
+
├── README.md
|
| 172 |
+
└── doc/
|
| 173 |
+
    └── screenshots/
|
| 174 |
+
        └── gui_example.png
|
| 175 |
```
|
| 176 |
|
| 177 |
## Contributing
|
| 178 |
|
| 179 |
1. Fork the repository
|
| 180 |
+
2. Create a feature branch:
|
| 181 |
+
|
| 182 |
+
```bash
|
| 183 |
+
git checkout -b feature/amazing-feature
|
| 184 |
+
```
|
| 185 |
+
3. Commit your changes:
|
| 186 |
+
|
| 187 |
+
```bash
|
| 188 |
+
git commit -m "Add amazing feature"
|
| 189 |
+
```
|
| 190 |
+
4. Push to your branch:
|
| 191 |
+
|
| 192 |
+
```bash
|
| 193 |
+
git push origin feature/amazing-feature
|
| 194 |
+
```
|
| 195 |
+
5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)