---
license: mit
language:
- en
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- EvolutionaryScale/esmc-600m-2024-12
- Rostlab/ProstT5
---
## Table of Contents
* [Protein Location Predictor](#protein-location-predictor)
  * [Features](#features)
* [Requirements](#requirements)
  * [Supported Python Version](#supported-python-version)
  * [Dependencies](#dependencies-full-environmentyml)
  * [Hardware Requirements](#hardware-requirements)
* [Installation](#installation)
* [Usage](#usage)
  * [GUI Mode](#gui-mode)
* [Example Input & Output](#example-input--output)
* [Project Structure](#project-structure)
* [Contributing](#contributing)
## Protein Location Predictor
A GUI application for predicting protein subcellular localization. Sequences are classified with SVM and Random Forest classifiers trained on embeddings from protein language models (PROST-T5 and ESM-C).
### Features
* **Multiple Model Support**: Choose from three different prediction models:
  * PROST-T5: Transformer-based protein language model
  * ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
  * ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
* **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (screenshot: `doc/screenshots/gui_example.png`)
* **Sequential Processing**: Process multiple protein sequences from FASTA files
* **Flexible Output**: Save predictions with confidence scores in text (CSV) format
* **Error Handling**: Comprehensive error handling and user feedback
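
The classification step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the 16-dimensional synthetic vectors stand in for real language-model embeddings, the two class labels are borrowed from the example output further down, and `rank_locations` is a hypothetical helper. It uses a Random Forest from scikit-learn, one of the two classifier families named in the description.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for embeddings (assumption): two well-separated
# clusters of 16-dimensional vectors. Real embeddings are far larger.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=-2.0, size=(40, 16)),
    rng.normal(loc=2.0, size=(40, 16)),
])
y_train = np.array(["Cytoplasmic"] * 40 + ["Extracellular"] * 40)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)


def rank_locations(model, embedding):
    """Return (label, probability) pairs, most likely location first."""
    probs = model.predict_proba(embedding.reshape(1, -1))[0]
    order = np.argsort(probs)[::-1]
    return [(str(model.classes_[i]), float(probs[i])) for i in order]


# A query vector placed at the centre of the second cluster.
query = np.full(16, 2.0)
ranked = rank_locations(clf, query)
for label, prob in ranked:
    print(f"{label} ({prob:.4f})")
```

Sorting the class probabilities in descending order gives the same kind of ranked list that the output file stores per sequence.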
## Requirements
### Supported Python Version
This project has been tested on **Python 3.10+**.
### Dependencies (Full environment.yml)
The complete environment definition is located in `environment.yml`. This file includes all necessary packages for PyTorch, Transformers, ESM models, and GUI operation. Here is a brief excerpt:
```yaml
name: tesisEnv
channels:
- bioconda
- anaconda
- conda-forge
- defaults
# Python version and major packages
dependencies:
- python=3.10.16
- pytorch=2.6.0
- torchvision=0.21.0
- torchtext=0.18.0
- transformers=4.46.3
- scikit-learn=1.6.1
- biopython=1.85
- esm=3.1.4
- numpy=1.26.4
- joblib=1.4.2
- tk
# plus many others (see full file for complete list)
```
To ensure exact reproducibility, use:
```bash
conda env create -f environment.yml
```
### Hardware Requirements
* **Minimum**: 8 GB RAM, CPU-only execution
* **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
* **Storage**: \~5 GB for model weights and cache
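
To see which of the two configurations above a machine falls into, a quick check along these lines may help. It is a minimal sketch: `pick_device` is a hypothetical helper, and whether the application selects its device this way is an assumption. The only PyTorch call used is `torch.cuda.is_available()`.

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this interpreter, so no GPU is usable.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Embeddings would be computed on: {device}")
```

Run it inside the activated Conda environment so that it reports on the PyTorch build the application would use.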
## Installation
1. **Clone the repository** (with Git LFS for large model files):
```bash
git lfs install
git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
```
If you prefer to skip downloading model weights initially:
```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
```
2. **Navigate into the project directory**:
```bash
cd ProteinLocationPredictor
```
3. **Create and activate the Conda environment**:
```bash
conda env create -f environment.yml
conda activate tesisEnv
```
4. **(If skipped above) Download model weights manually**:
Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:
```bash
git lfs pull
```
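
After a clone with `GIT_LFS_SKIP_SMUDGE=1`, the files in `Models/` are small Git LFS pointer files rather than real weights. The sketch below is one way to check whether `git lfs pull` is still needed. It relies on the fact that LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`; `find_lfs_pointers` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(models_dir):
    """Return names of files in models_dir that are still LFS pointers."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    pointers = []
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path.name)
    return pointers


pending = find_lfs_pointers("Models")
if pending:
    print("Still LFS pointers, run `git lfs pull`:", ", ".join(pending))
else:
    print("No LFS pointer files found in Models/")
```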
## Usage
### GUI Mode
1. Launch the application:
```bash
python gui.py
```
2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
4. Click **Run Prediction** and monitor the progress bar.
5. When complete, you will be prompted to choose an output directory and filename.
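
Before loading a file in the GUI, it can be useful to confirm that it parses as FASTA. The following is a minimal sketch in plain Python, not the application's own loader: `read_fasta` and `invalid_residues` are hypothetical helpers, and the accepted alphabet (the 20 standard amino acids plus `X`, `B`, `Z`, `U`, `O`) is an assumption. Biopython's `SeqIO.parse(handle, "fasta")`, available in the project environment, is an alternative for the parsing part.

```python
# Assumed alphabet: 20 standard residues plus common ambiguity/rare codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "XBZUO")


def read_fasta(lines):
    """Parse FASTA text lines into a list of (sequence_id, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the set of characters that are not in the assumed alphabet."""
    return set(sequence) - VALID_RESIDUES


example = """>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
"""

records = read_fasta(example.splitlines())
for seq_id, seq in records:
    bad = invalid_residues(seq)
    status = "ok" if not bad else f"unexpected characters: {sorted(bad)}"
    print(f"{seq_id}: {len(seq)} residues, {status}")
```

To check a file on disk, pass an open file object instead of `example.splitlines()`.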
## Example Input & Output
**Input FASTA (`example/input.fasta`):**
```
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
```
**Output CSV (`example/output.csv`):**
```csv
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
```
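
Each prediction cell combines a label and a score in the form `Label (0.1234)`, so downstream analysis needs a small parsing step. The sketch below reads the example output with Python's standard `csv` and `re` modules. `parse_predictions` is a hypothetical helper, and the cell pattern is inferred from the example above rather than from a format specification.

```python
import csv
import io
import re

# Matches cells such as "Cytoplasmic (0.9860)".
CELL_PATTERN = re.compile(r"^(?P<label>.+?) \((?P<score>\d*\.?\d+)\)$")


def parse_predictions(handle):
    """Map each sequence ID to its ranked list of (label, score) pairs."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    results = {}
    for row in reader:
        if not row:
            continue
        seq_id, cells = row[0], row[1:]
        ranked = []
        for cell in cells:
            match = CELL_PATTERN.match(cell.strip())
            if match is None:
                raise ValueError(f"unexpected cell format: {cell!r}")
            ranked.append((match["label"], float(match["score"])))
        results[seq_id] = ranked
    return results


sample = """Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
"""

predictions = parse_predictions(io.StringIO(sample))
for seq_id, ranked in predictions.items():
    top_label, top_score = ranked[0]
    print(f"{seq_id}: {top_label} ({top_score:.4f})")
```

For a real file, replace `io.StringIO(sample)` with `open("example/output.csv", newline="")`.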
## Project Structure
```
ProteinLocationPredictor/
├── gui.py
├── src/
│   └── my_utils.py
├── Models/
│   ├── ProstT5_svm.joblib
│   ├── ESMC-300m_svm.joblib
│   ├── ESMC-600m_svm.joblib
│   └── ...
├── environment.yml
├── README.md
└── doc/
    └── screenshots/
        └── gui_example.png
```
## Contributing
1. Fork the repository
2. Create a feature branch:
```bash
git checkout -b feature/amazing-feature
```
3. Commit your changes:
```bash
git commit -m "Add amazing feature"
```
4. Push to your branch:
```bash
git push origin feature/amazing-feature
```
5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)