jpuglia committed on
Commit
bf0fbb5
·
1 Parent(s): ec4615b

Update README.md: Revise table of contents, enhance features section, and clarify installation instructions

Files changed (1)
  1. README.md +135 -169
README.md CHANGED
@@ -1,110 +1,136 @@
1
- # Protein Location Predictor
2
 
3
- A comprehensive GUI application for predicting protein subcellular localization using state-of-the-art machine learning models including PROST-T5 and ESM-C embeddings.
4
-
5
- ## Features
6
-
7
- - **Multiple Model Support**: Choose from three different prediction models:
8
- - PROST-T5: Transformer-based protein language model
9
- - ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
10
- - ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
11
 
12
- - **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking
13
- - **Sequential Processing**: Process multiple protein sequences from FASTA files
14
- - **Flexible Output**: Save predictions with confidence scores in text format
15
- - **Error Handling**: Comprehensive error handling and user feedback
16
 
17
- ## Requirements
 
 
 
 
18
 
19
- ### Dependencies
 
 
 
 
20
 
21
- The project uses conda for environment management. All dependencies are specified in `environment.yml` and include:
22
 
23
- - PyTorch with CUDA support
24
- - Transformers library
25
- - ESM models
26
- - Scikit-learn
27
- - BioPython
28
- - NumPy, Joblib
29
- - Tkinter (GUI components)
30
 
31
- ### Hardware Requirements
32
 
33
- - **Minimum**: 8GB RAM, CPU-only execution
34
- - **Recommended**: 16GB+ RAM, NVIDIA GPU with 8GB+ VRAM
35
- - **Storage**: ~5GB for model weights and cache
36
 
37
- ## Installation
 
 
 
 
 
 
38
 
39
- **Prerequisites**: Conda must be installed on your system. [Download Conda](https://docs.conda.io/en/latest/miniconda.html)
40
 
41
- 1. **Clone the repository from Hugging Face**:
42
 
43
- ```bash
44
- # Make sure git-lfs is installed (https://git-lfs.com)
45
- git lfs install
46
 
47
- # Clone with all files
48
- git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
49
  ```
50
 
51
- **Optional - Clone without large files** (just pointers):
52
- ```bash
53
- # If you want to clone without large files - just their pointers
54
- GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
55
- ```
56
 
57
- 2. **Navigate to the project directory**:
58
  ```bash
59
- cd ProteinLocationPredictor
60
  ```
61
 
62
- 3. **Create conda environment**:
63
- ```bash
64
- conda env create -n protein-predictor -f environment.yml
65
- conda activate protein-predictor
66
- ```
67
 
68
- 4. **Pre-trained models**:
69
- - Model files are included in the repository via Git LFS
70
- - If you cloned without large files, you'll need to download them separately
71
 
72
- ## Usage
73
 
74
- ### Running the GUI Application
75
 
76
- ```bash
77
- conda activate protein-predictor
78
- python gui.py
79
- ```
80
 
81
- ### Step-by-Step Instructions
82
 
83
- 1. **Launch the application**
84
- - Run the GUI script
85
- - The main window will appear with prediction options
86
 
87
- 2. **Load a FASTA file**
88
- - Click "File" → "Load FASTA"
89
- - Select your protein sequences file (`.fasta`, `.fa`, or `.fas`)
90
 
91
- 3. **Choose a prediction model**
92
- - **PROST-T5**
93
- - **ESM-C 300M**
94
- - **ESM-C 600M**
95
 
96
- 4. **Run prediction**
97
- - Click the corresponding prediction button
98
- - Monitor progress in the progress bar window
99
- - Select output directory when prompted
100
 
101
- 5. **Save results**
102
- - Choose location and filename for prediction results
103
- - Results are saved in CSV format with confidence scores for each subcellular location
104
 
105
- ### Input Format
 
 
 
 
 
 
106
 
107
- FASTA files should contain protein sequences in standard format:
 
 
108
 
109
  ```
110
  >protein_1
@@ -113,117 +139,57 @@ MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
113
  MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
114
  ```
115
 
116
- ### Output Format
117
-
118
- Results are saved as CSV files with predictions for 6 subcellular locations, ranked by probability:
119
 
120
  ```csv
121
  Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
122
- sp|P0A7V8|RS4_ECOLI,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
 
123
  ```
124
 
125
- **Predicted Locations:**
126
- - **Cytoplasmic**: Interior of the cell
127
- - **CytoplasmicMembrane**: Inner membrane
128
- - **Periplasmic**: Space between inner and outer membranes
129
- - **Extracellular**: Outside the cell
130
- - **OuterMembrane**: Outer membrane
131
- - **Cellwall**: Cell wall structure
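The ranked row format described above (one sequence ID followed by six `Label (probability)` cells, highest probability first) can be reproduced with a few lines of Python. This is a minimal sketch, not the project's actual code in `src/my_utils.py`; the function name `format_ranked_row` and the four-decimal formatting are assumptions inferred from the example row.

```python
from typing import Mapping


def format_ranked_row(seq_id: str, probs: Mapping[str, float]) -> str:
    """Build one CSV row: the sequence ID, then 'Label (p)' cells sorted by p."""
    # Sort locations from most to least probable (hypothetical helper).
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Four decimals is an assumption taken from the README example row.
    cells = [f"{label} ({p:.4f})" for label, p in ranked]
    return ",".join([seq_id] + cells)


if __name__ == "__main__":
    example = {
        "Periplasmic": 0.0029,
        "Cytoplasmic": 0.9860,
        "Cellwall": 0.0003,
        "CytoplasmicMembrane": 0.0081,
        "OuterMembrane": 0.0007,
        "Extracellular": 0.0019,
    }
    print(format_ranked_row("sp|P0A7V8|RS4_ECOLI", example))
```

The sketch assumes sequence IDs and labels contain no commas; IDs that do would need proper CSV quoting (for example via the standard `csv` module).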
132
-
133
  ## Model Details
134
 
135
- ### PROST-T5
136
- - **Base Model**: Rostlab/ProstT5
137
- - **Embedding Dimension**: 1024
138
- - **Classifier**: Support Vector Machine (SVM)
139
- - **Memory Usage**: ~4GB GPU/8GB RAM
140
-
141
- ### ESM-C Models
142
- - **Base Models**: ESM-C 300M/600M
143
- - **Embedding Dimension**: Variable (300M: 960, 600M: 1280)
144
- - **Classifier**: Support Vector Machine (SVM)
145
- - **Memory Usage**: 300M: ~2GB GPU, 600M: ~4GB GPU
146
-
147
- ## Troubleshooting
148
-
149
- ### Common Issues
150
-
151
- 1. **Out of Memory Errors**
152
- - Reduce batch size or use CPU-only mode
153
- - Close other applications to free memory
154
- - Try smaller model (ESM-C 300M instead of 600M)
155
-
156
- 2. **Model Loading Errors**
157
- - Ensure model files are in the correct `Models/` directory
158
- - Check file permissions and integrity
159
- - Clear Hugging Face cache: `rm -rf ~/.cache/huggingface/`
160
-
161
- 3. **CUDA Errors**
162
- - Update GPU drivers
163
- - Ensure CUDA-compatible PyTorch installation
164
- - Fall back to CPU mode if GPU issues persist
165
-
166
- ### Performance Tips
167
-
168
- - **GPU Usage**: Models automatically detect and use GPU when available
169
- - **Memory Management**: CUDA cache is cleared after each prediction
170
- - **Sequential Processing**: Sequences are processed one at a time with progress tracking
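The GPU detection and cache clearing mentioned in these tips map onto two standard PyTorch calls, `torch.cuda.is_available()` and `torch.cuda.empty_cache()`. The following is a hedged sketch of how such logic could look; the helper names `pick_device` and `release_gpu_cache` are hypothetical, and the project's real implementation may differ.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch sees a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def release_gpu_cache() -> None:
    """Free cached CUDA memory after a prediction, if a GPU is in use."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


if __name__ == "__main__":
    print(f"Running on: {pick_device()}")
    release_gpu_cache()
```

Falling back to `"cpu"` when the import fails is a design choice for this sketch only; in the real environment PyTorch is a required dependency.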
171
 
172
  ## Project Structure
173
 
174
  ```
175
  ProteinLocationPredictor/
176
- ├── gui.py # Main GUI application
177
  ├── src/
178
- │   └── my_utils.py # Core prediction functions
179
- ├── Models/ # Pre-trained model files (via Git LFS)
180
- │   ├── Prost T5_svm.joblib
181
- │   ├── Prost T5_le_svm.joblib
182
  │   ├── ESMC-300m_svm.joblib
183
  │   ├── ESMC-600m_svm.joblib
184
  │   └── ...
185
- ├── environment.yml # Conda environment specification
186
- └── README.md # This file
 
 
 
187
  ```
188
 
189
  ## Contributing
190
 
191
  1. Fork the repository
192
- 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
193
- 3. Commit your changes (`git commit -m 'Add amazing feature'`)
194
- 4. Push to the branch (`git push origin feature/amazing-feature`)
195
- 5. Open a Pull Request
196
-
197
- ## License
198
-
199
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
200
-
201
- ## Citation
202
-
203
- If you use this tool in your research, please cite:
204
-
205
- ```bibtex
206
- @software{protein_location_predictor,
207
- title={Protein Location Predictor},
208
- author={Juan Diego Puglia},
209
- year={2025},
210
- url={https://huggingface.co/jpuglia/ProteinLocationPredictor}
211
- }
212
- ```
213
-
214
- ## Acknowledgments
215
-
216
- - [Rostlab](https://rostlab.org/) for the PROST-T5 model
217
- - [Meta AI](https://ai.meta.com/) for the ESM models
218
- - [Hugging Face](https://huggingface.co/) for model hosting and transformers library
219
- - [BioPython](https://biopython.org/) for sequence handling utilities
220
-
221
- ## Contact
222
-
223
- For questions, issues, or collaborations, please:
224
- - Visit the [Hugging Face repository](https://huggingface.co/jpuglia/ProteinLocationPredictor)
225
- - Open a discussion on the Hugging Face platform
226
-
227
- ---
228
-
229
- **Note**: This tool is for research purposes. Please validate predictions with experimental methods for critical applications.
 
1
+ ## Table of Contents
2
 
3
+ * [Protein Location Predictor](#protein-location-predictor)
 
 
 
 
 
 
 
4
 
5
+ * [Features](#features)
6
+ * [Requirements](#requirements)
 
 
7
 
8
+ * [Supported Python Version](#supported-python-version)
9
+ * [Dependencies (Full environment.yml)](#dependencies-full-environmentyml)
10
+ * [Hardware Requirements](#hardware-requirements)
11
+ * [Installation](#installation)
12
+ * [Usage](#usage)
13
 
14
+ * [GUI Mode](#gui-mode)
15
+ * [Example Input & Output](#example-input--output)
16
+ * [Model Details](#model-details)
17
+ * [Project Structure](#project-structure)
18
+ * [Contributing](#contributing)
19
 
20
+ ## Protein Location Predictor
21
 
22
+ A comprehensive GUI application for predicting protein subcellular localization using state-of-the-art machine learning models including PROST-T5 and ESM-C embeddings.
 
 
 
 
 
 
23
 
24
+ ### Features
25
 
26
+ * **Multiple Model Support**: Choose from three different prediction models:
 
 
27
 
28
+ * PROST-T5: Transformer-based protein language model
29
+ * ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
30
+ * ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
31
+ * **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (see screenshot below)
32
+ * **Sequential Processing**: Process multiple protein sequences from FASTA files
33
+ * **Flexible Output**: Save predictions with confidence scores in text (CSV) format
34
+ * **Error Handling**: Comprehensive error handling and user feedback
35
 
36
+ ### Supported Python Version
37
 
38
+ This project has been tested on **Python 3.10+**.
39
 
40
+ ## Requirements
 
 
41
 
42
+ ### Dependencies (Full environment.yml)
43
+
44
+ The complete environment definition is located in `environment.yml`. This file includes all necessary packages for PyTorch, Transformers, ESM models, and GUI operation. Here is a brief excerpt:
45
+
46
+ ```yaml
47
+ name: tesisEnv
48
+ channels:
49
+ - bioconda
50
+ - anaconda
51
+ - conda-forge
52
+ - defaults
53
+
54
+ # Python version and major packages
55
+ dependencies:
56
+ - python=3.10.16
57
+ - pytorch=2.6.0
58
+ - torchvision=0.21.0
59
+ - torchtext=0.18.0
60
+ - transformers=4.46.3
61
+ - scikit-learn=1.6.1
62
+ - biopython=1.85
63
+ - esm=3.1.4
64
+ - numpy=1.26.4
65
+ - joblib=1.4.2
66
+ - tk
67
+ # plus many others (see full file for complete list)
68
  ```
69
 
70
+ To ensure exact reproducibility, use:
 
 
 
 
71
 
 
72
  ```bash
73
+ conda env create -f environment.yml
74
  ```
75
 
76
+ ### Hardware Requirements
 
 
 
 
77
 
78
+ * **Minimum**: 8 GB RAM, CPU-only execution
79
+ * **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
80
+ * **Storage**: \~5 GB for model weights and cache
81
 
82
+ ## Installation
83
 
84
+ 1. **Clone the repository** (with Git LFS for large model files):
85
 
86
+ ```bash
87
+ git lfs install
88
+ git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
89
+ ```
90
 
91
+ If you prefer to skip downloading model weights initially:
92
 
93
+ ```bash
94
+ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
95
+ ```
96
 
97
+ 2. **Navigate into the project directory**:
 
 
98
 
99
+ ```bash
100
+ cd ProteinLocationPredictor
101
+ ```
102
+
103
+ 3. **Create and activate the Conda environment**:
104
+
105
+ ```bash
106
+ conda env create -f environment.yml
107
+ conda activate tesisEnv
108
+ ```
109
+
110
+ 4. **(If skipped above) Download model weights manually**:
111
+ Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:
112
+
113
+ ```bash
114
+ git lfs pull
115
+ ```
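After a clone made with `GIT_LFS_SKIP_SMUDGE=1`, the files under `Models/` are small text pointers rather than real weights. Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`, so a quick check before launching the GUI could look like the sketch below. The helper names and the `*.joblib` glob are assumptions for illustration and are not part of the project.

```python
from pathlib import Path
from typing import List

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """True if the leading bytes of a file match the Git LFS pointer header."""
    return head.startswith(LFS_POINTER_PREFIX)


def find_lfs_pointers(models_dir: str = "Models") -> List[Path]:
    """List .joblib files in models_dir that are still LFS pointers (hypothetical helper)."""
    pointers = []
    for path in sorted(Path(models_dir).glob("*.joblib")):
        with open(path, "rb") as handle:
            # Only the first few bytes are needed to recognise a pointer.
            if looks_like_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX))):
                pointers.append(path)
    return pointers


if __name__ == "__main__":
    missing = find_lfs_pointers()
    if missing:
        print("Run `git lfs pull`; these files are still pointers:")
        for path in missing:
            print(f"  {path}")
```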
116
+
117
+ ## Usage
118
 
119
+ ### GUI Mode
 
 
 
120
 
121
+ 1. Launch the application:
 
 
122
 
123
+ ```bash
124
+ python gui.py
125
+ ```
126
+ 2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
127
+ 3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
128
+ 4. Click **Run Prediction** and monitor the progress bar.
129
+ 5. When complete, you will be prompted to choose an output directory and filename.
130
 
131
+ ## Example Input & Output
132
+
133
+ **Input FASTA (`example/input.fasta`):**
134
 
135
  ```
136
  >protein_1
 
139
  MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
140
  ```
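To make the expected input concrete, here is a minimal pure-Python sketch of how a FASTA file of this shape can be read into ID and sequence pairs. The project lists BioPython as a dependency and may well use it instead; the function `read_fasta` below is a hypothetical stand-in that only illustrates the format (a `>` header line followed by one or more sequence lines).

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a {record_id: sequence} dictionary."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("FASTA header without an identifier")
            # The ID is the first whitespace-delimited token of the header.
            current_id = header.split()[0]
            if current_id in records:
                raise ValueError(f"duplicate record id: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[current_id] += line
    return records


if __name__ == "__main__":
    text = ">protein_1\nMKTVRQ\nERLKSI\n>protein_2\nMKTIIA\n"
    for seq_id, seq in read_fasta(text.splitlines()).items():
        print(seq_id, len(seq))
```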
141
 
142
+ **Output CSV (`example/output.csv`):**
 
 
143
 
144
  ```csv
145
  Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
146
+ protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
147
+ protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
148
  ```
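For downstream analysis, each `Label (probability)` cell in this CSV has to be split back into a label and a number. The sketch below shows one way to do that with the standard library; the function name `parse_predictions` and the regular expression are assumptions based only on the example rows shown here.

```python
import csv
import io
import re
from typing import Dict, List, Tuple

# Matches cells such as "Cytoplasmic (0.9860)".
_CELL = re.compile(r"^\s*(?P<label>.+?)\s*\((?P<prob>[0-9]*\.?[0-9]+)\)\s*$")


def parse_predictions(csv_text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Map each Sequence_ID to its ranked list of (location, probability)."""
    results: Dict[str, List[Tuple[str, float]]] = {}
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        seq_id, cells = row[0], row[1:]
        ranked = []
        for cell in cells:
            match = _CELL.match(cell)
            if match is None:
                raise ValueError(f"unexpected cell format: {cell!r}")
            ranked.append((match.group("label"), float(match.group("prob"))))
        results[seq_id] = ranked
    return results


if __name__ == "__main__":
    sample = (
        "Sequence_ID,Prediction 1,Prediction 2\n"
        "protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081)\n"
    )
    top_label, top_prob = parse_predictions(sample)["protein_1"][0]
    print(top_label, top_prob)
```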
149
 
 
 
 
 
 
 
 
 
150
  ## Model Details
151
 
152
+ | Model | Embedding Dim. | Classifier | GPU VRAM | RAM Usage |
153
+ | ------------ | -------------- | ---------- | -------- | --------- |
154
+ | PROST-T5 | 1024 | SVM | \~4 GB | \~8 GB |
155
+ | ESM-C (300M) | 960 | SVM | \~2 GB | \~6 GB |
156
+ | ESM-C (600M) | 1280 | SVM | \~4 GB | \~10 GB |
157
 
158
  ## Project Structure
159
 
160
  ```
161
  ProteinLocationPredictor/
162
+ ├── gui.py
163
  ├── src/
164
+ │   └── my_utils.py
165
+ ├── Models/
166
+ │   ├── ProstT5_svm.joblib
 
167
  │   ├── ESMC-300m_svm.joblib
168
  │   ├── ESMC-600m_svm.joblib
169
  │   └── ...
170
+ ├── environment.yml
171
+ ├── README.md
172
+ └── doc/
173
+     └── screenshots/
174
+         └── gui_example.png
175
  ```
176
 
177
  ## Contributing
178
 
179
  1. Fork the repository
180
+ 2. Create a feature branch:
181
+
182
+ ```bash
183
+ git checkout -b feature/amazing-feature
184
+ ```
185
+ 3. Commit your changes:
186
+
187
+ ```bash
188
+ git commit -m "Add amazing feature"
189
+ ```
190
+ 4. Push to your branch:
191
+
192
+ ```bash
193
+ git push origin feature/amazing-feature
194
+ ```
195
+ 5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)