---
license: mit
language:
- en
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- EvolutionaryScale/esmc-600m-2024-12
- Rostlab/ProstT5
---

## Table of Contents

* [Protein Location Predictor](#protein-location-predictor)
  * [Features](#features)
  * [Requirements](#requirements)
    * [Supported Python Version](#supported-python-version)
    * [Dependencies](#dependencies-full-environmentyml)
    * [Hardware Requirements](#hardware-requirements)
  * [Installation](#installation)
  * [Usage](#usage)
    * [GUI Mode](#gui-mode)
  * [Example Input & Output](#example-input--output)
  * [Project Structure](#project-structure)
  * [Contributing](#contributing)

## Protein Location Predictor

A GUI application for predicting protein subcellular localization. It uses SVM and Random Forest classifiers trained on embeddings from protein language models (PROST-T5 and ESM-C).

### Features

* **Multiple Model Support**: Choose from three different prediction models:
  * PROST-T5: Transformer-based protein language model
  * ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
  * ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
* **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (screenshot: `doc/screenshots/gui_example.png`)
* **Sequential Processing**: Process multiple protein sequences from FASTA files
* **Flexible Output**: Save predictions with confidence scores as a CSV text file
* **Error Handling**: Error handling with user feedback

## Requirements

### Supported Python Version

This project has been tested on **Python 3.10+** (the provided `environment.yml` pins Python 3.10.16).

### Dependencies (Full environment.yml)

The complete environment definition is in `environment.yml`. It includes all packages needed for PyTorch, Transformers, the ESM models, and the GUI. A brief excerpt:

```yaml
name: tesisEnv
channels:
  - bioconda
  - anaconda
  - conda-forge
  - defaults

# Python version and major packages
dependencies:
  - python=3.10.16
  - pytorch=2.6.0
  - torchvision=0.21.0
  - torchtext=0.18.0
  - transformers=4.46.3
  - scikit-learn=1.6.1
  - biopython=1.85
  - esm=3.1.4
  - numpy=1.26.4
  - joblib=1.4.2
  - tk
  # plus many others (see full file for complete list)
```

To reproduce this environment exactly, run:

```bash
conda env create -f environment.yml
```

### Hardware Requirements

* **Minimum**: 8 GB RAM, CPU-only execution
* **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
* **Storage**: ~5 GB for model weights and cache

## Installation

1. **Clone the repository** (with Git LFS for large model files):

   ```bash
   git lfs install
   git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
   ```

   If you prefer to skip downloading model weights initially:

   ```bash
   GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
   ```

2. **Navigate into the project directory**:

   ```bash
   cd ProteinLocationPredictor
   ```

3. **Create and activate the Conda environment**:

   ```bash
   conda env create -f environment.yml
   conda activate tesisEnv
   ```

4. **(If skipped above) Download model weights manually**:
   Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:

   ```bash
   git lfs pull
   ```

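After a clone with `GIT_LFS_SKIP_SMUDGE=1`, the files in `Models/` are small Git LFS pointer files rather than real weights. The sketch below is one way to check whether `git lfs pull` is still needed. It is not part of the repository: the function names are hypothetical, and it relies only on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file, as defined by the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like a Git LFS pointer rather than real data."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unpulled_models(models_dir: Path) -> list[Path]:
    """List files under `models_dir` that are still LFS pointers (hypothetical helper)."""
    return sorted(
        p for p in models_dir.rglob("*") if p.is_file() and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    models_dir = Path("Models")  # directory name taken from this README
    if not models_dir.is_dir():
        print("Models/ not found; run this from the repository root.")
    else:
        unpulled = find_unpulled_models(models_dir)
        if unpulled:
            print("Still LFS pointers; run `git lfs pull`:")
            for path in unpulled:
                print(f"  {path}")
        else:
            print("All model files appear to be downloaded.")
```

Run it from the repository root after cloning. An empty result suggests the weights are present, although it does not verify their integrity.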
## Usage

### GUI Mode

1. Launch the application:

   ```bash
   python gui.py
   ```

2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
4. Click **Run Prediction** and monitor the progress bar.
5. When the run completes, you will be prompted to choose an output directory and filename.

## Example Input & Output

**Input FASTA (`example/input.fasta`):**

```
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
```

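To sanity-check an input file before loading it in the GUI, a small FASTA reader can be enough. The following is a minimal sketch using only the standard library. It is not the parser the application uses (the environment ships Biopython, which may be what the project relies on), and `read_fasta` is a hypothetical name.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from a FASTA text stream."""
    seq_id = None
    chunks: list[str] = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            chunks = []
        elif seq_id is not None:
            # Sequences may be wrapped over several lines; join them later.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


if __name__ == "__main__":
    example = (
        ">protein_1\n"
        "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG\n"
        ">protein_2\n"
        "MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF\n"
    )
    for seq_id, sequence in read_fasta(io.StringIO(example)):
        print(seq_id, len(sequence))
```

For a file on disk, pass an open text handle, for example `open("example/input.fasta")`.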
**Output CSV (`example/output.csv`):**

```csv
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
```

Each row lists the predicted locations for one sequence in descending order of confidence, formatted as `Label (score)`.

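For downstream analysis, the `Label (score)` cells usually need to be split back into a label and a number. The sketch below shows one way to do that with the standard library. It assumes the cell format shown in the example output, and the names `parse_cell` and `read_predictions` are hypothetical.

```python
import csv
import io
import re

# Matches cells such as "Cytoplasmic (0.9860)".
CELL_PATTERN = re.compile(r"^\s*(?P<label>.+?)\s*\((?P<score>[0-9.]+)\)\s*$")


def parse_cell(cell: str) -> tuple[str, float]:
    """Split one 'Label (score)' cell into (label, score)."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"Unexpected prediction cell: {cell!r}")
    return match.group("label"), float(match.group("score"))


def read_predictions(handle) -> dict[str, list[tuple[str, float]]]:
    """Map each Sequence_ID to its ranked (label, score) predictions."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    return {row[0]: [parse_cell(cell) for cell in row[1:]] for row in reader if row}


if __name__ == "__main__":
    sample = (
        "Sequence_ID,Prediction 1,Prediction 2\n"
        "protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081)\n"
    )
    predictions = read_predictions(io.StringIO(sample))
    top_label, top_score = predictions["protein_1"][0]
    print(top_label, top_score)  # Cytoplasmic 0.986
```

Because the rows are already sorted by confidence, the first parsed entry is the top prediction.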
## Project Structure

```
ProteinLocationPredictor/
├── gui.py
├── src/
│   └── my_utils.py
├── Models/
│   ├── ProstT5_svm.joblib
│   ├── ESMC-300m_svm.joblib
│   ├── ESMC-600m_svm.joblib
│   └── ...
├── environment.yml
├── README.md
└── doc/
    └── screenshots/
        └── gui_example.png
```

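When scripting around the repository, it can help to map a model choice to its classifier file. The sketch below does this with `pathlib`. The file names come from the project tree above, but the display names used as dictionary keys and the function `resolve_model_path` are assumptions, not the application's actual code.

```python
from pathlib import Path

# File names are taken from the project tree in this README. The keys are an
# assumption about how each model is labelled; adjust them as needed.
MODEL_FILES = {
    "PROST-T5": "ProstT5_svm.joblib",
    "ESM-C 300M": "ESMC-300m_svm.joblib",
    "ESM-C 600M": "ESMC-600m_svm.joblib",
}


def resolve_model_path(model_name: str, repo_root: Path = Path(".")) -> Path:
    """Return the expected path of the SVM file for `model_name` (hypothetical helper)."""
    try:
        filename = MODEL_FILES[model_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_FILES))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None
    return repo_root / "Models" / filename


if __name__ == "__main__":
    for name in MODEL_FILES:
        path = resolve_model_path(name)
        status = "found" if path.is_file() else "missing"
        print(f"{name}: {path} ({status})")
```

A resolved path could then be passed to `joblib.load`, assuming the classifiers were saved with joblib as the `.joblib` extension suggests.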
## Contributing

1. Fork the repository
2. Create a feature branch:

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. Commit your changes:

   ```bash
   git commit -m "Add amazing feature"
   ```

4. Push to your branch:

   ```bash
   git push origin feature/amazing-feature
   ```

5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)