---
license: mit
language:
- en
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- EvolutionaryScale/esmc-600m-2024-12
- Rostlab/ProstT5
---
## Table of Contents
* [Protein Location Predictor](#protein-location-predictor)
* [Features](#features)
* [Requirements](#requirements)
* [Supported Python Version](#supported-python-version)
* [Dependencies](#dependencies-full-environmentyml)
* [Hardware Requirements](#hardware-requirements)
* [Installation](#installation)
* [Usage](#usage)
* [GUI Mode](#gui-mode)
* [Example Input & Output](#example-input--output)
* [Project Structure](#project-structure)
* [Contributing](#contributing)
## Protein Location Predictor
A GUI application for predicting protein subcellular localization. It uses SVM and Random Forest classifiers trained on embeddings from protein language models, including PROST-T5 and ESM-C.
### Features
* **Multiple Model Support**: Choose from three different prediction models:
* PROST-T5: Transformer-based protein language model
* ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
* ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
* **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (see `doc/screenshots/gui_example.png`)
* **Sequential Processing**: Process multiple protein sequences from FASTA files
* **Flexible Output**: Save predictions with confidence scores in text (CSV) format
* **Error Handling**: Comprehensive error handling and user feedback
## Requirements
### Supported Python Version
This project has been tested on **Python 3.10+**.
### Dependencies (Full environment.yml)
The complete environment definition is located in `environment.yml`. This file includes all necessary packages for PyTorch, Transformers, ESM models, and GUI operation. Here is a brief excerpt:
```yaml
name: tesisEnv
channels:
- bioconda
- anaconda
- conda-forge
- defaults
# Python version and major packages
dependencies:
- python=3.10.16
- pytorch=2.6.0
- torchvision=0.21.0
- torchtext=0.18.0
- transformers=4.46.3
- scikit-learn=1.6.1
- biopython=1.85
- esm=3.1.4
- numpy=1.26.4
- joblib=1.4.2
- tk
# plus many others (see full file for complete list)
```
To ensure exact reproducibility, use:
```bash
conda env create -f environment.yml
```
### Hardware Requirements
* **Minimum**: 8 GB RAM, CPU-only execution
* **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
* **Storage**: ~5 GB for model weights and cache
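To see which of the two hardware profiles a machine falls under, a short check like the one below may help. It is a minimal sketch: the helper name `pick_device` is hypothetical and not part of this repository, and it only relies on `torch.cuda.is_available()` from the PyTorch package listed in the environment. If PyTorch is not importable, it falls back to reporting CPU.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is visible to PyTorch, else "cpu".

    Hypothetical helper for illustration; the application may select its
    device differently.
    """
    try:
        import torch  # installed by environment.yml
    except ImportError:
        # Without PyTorch there is nothing to query, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Embeddings would run on: {pick_device()}")
```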
## Installation
1. **Clone the repository** (with Git LFS for large model files):
```bash
git lfs install
git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
```
If you prefer to skip downloading model weights initially:
```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
```
2. **Navigate into the project directory**:
```bash
cd ProteinLocationPredictor
```
3. **Create and activate the Conda environment**:
```bash
conda env create -f environment.yml
conda activate tesisEnv
```
4. **(If skipped above) Download model weights manually**:
Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:
```bash
git lfs pull
```
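If you are unsure whether the weights were actually downloaded, one way to check is to look for leftover Git LFS pointer files. A pointer file is a small text stub whose first line begins with `version https://git-lfs.github.com/spec/v1`. The sketch below scans the `Models/` directory for such stubs; the function names are hypothetical and the script is not part of this repository.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line (Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is an LFS pointer stub rather than real content."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched_models(models_dir: str = "Models") -> list[Path]:
    """List files under models_dir that are still LFS pointer stubs."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and is_lfs_pointer(p))


if __name__ == "__main__":
    unfetched = find_unfetched_models()
    if unfetched:
        print("Still LFS pointers (run `git lfs pull`):")
        for path in unfetched:
            print(f"  {path}")
    else:
        print("No LFS pointer stubs found under Models/.")
```

Run it from the repository root. Any file it lists still needs `git lfs pull`.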
## Usage
### GUI Mode
1. Launch the application:
```bash
python gui.py
```
2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
4. Click **Run Prediction** and monitor the progress bar.
5. When complete, you will be prompted to choose an output directory and filename.
## Example Input & Output
**Input FASTA (`example/input.fasta`):**
```
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
```
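To sanity-check an input file before loading it in the GUI, the FASTA layout above can be read with a few lines of standard-library Python. This is a minimal sketch, not the parser the application uses; `read_fasta` is a hypothetical name, and Biopython's `SeqIO.parse(handle, "fasta")` (Biopython is listed in the environment) is an alternative.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from a FASTA-formatted text stream.

    The ID is the first whitespace-delimited token after ">". Sequence lines
    that span several rows are joined into one string.
    """
    seq_id = None
    chunks: list[str] = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].strip()
            seq_id = header.split()[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


if __name__ == "__main__":
    example = io.StringIO(
        ">protein_1\n"
        "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG\n"
        ">protein_2\n"
        "MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF\n"
    )
    for seq_id, sequence in read_fasta(example):
        print(seq_id, len(sequence))
```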
**Output CSV (`example/output.csv`):**
```csv
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
```
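Each prediction cell pairs a location label with its confidence score in parentheses, ordered from most to least likely. For downstream analysis it can be convenient to split those cells back into `(label, score)` tuples. The sketch below assumes the exact layout shown in the example output; `parse_predictions` is a hypothetical helper, not part of this repository.

```python
import csv
import io
import re
from typing import TextIO

# Matches cells such as "Cytoplasmic (0.9860)".
CELL_PATTERN = re.compile(r"^\s*(?P<label>.+?)\s*\((?P<score>[0-9.]+)\)\s*$")


def parse_predictions(handle: TextIO) -> dict[str, list[tuple[str, float]]]:
    """Map each sequence ID to its ranked list of (label, score) pairs."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    results: dict[str, list[tuple[str, float]]] = {}
    for row in reader:
        if not row:
            continue
        seq_id, cells = row[0], row[1:]
        ranked = []
        for cell in cells:
            match = CELL_PATTERN.match(cell)
            if match:  # cells that do not fit the pattern are skipped
                ranked.append((match.group("label"), float(match.group("score"))))
        results[seq_id] = ranked
    return results


if __name__ == "__main__":
    example = io.StringIO(
        "Sequence_ID,Prediction 1,Prediction 2\n"
        "protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081)\n"
    )
    for seq_id, ranked in parse_predictions(example).items():
        top_label, top_score = ranked[0]
        print(f"{seq_id}: {top_label} ({top_score:.4f})")
```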
## Project Structure
```
ProteinLocationPredictor/
├── gui.py
├── src/
│   └── my_utils.py
├── Models/
│   ├── ProstT5_svm.joblib
│   ├── ESMC-300m_svm.joblib
│   ├── ESMC-600m_svm.joblib
│   └── ...
├── environment.yml
├── README.md
└── doc/
    └── screenshots/
        └── gui_example.png
```
## Contributing
1. Fork the repository
2. Create a feature branch:
```bash
git checkout -b feature/amazing-feature
```
3. Commit your changes:
```bash
git commit -m "Add amazing feature"
```
4. Push to your branch:
```bash
git push origin feature/amazing-feature
```
5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)