---
license: mit
language:
- en
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- EvolutionaryScale/esmc-600m-2024-12
- Rostlab/ProstT5
---

## Table of Contents

* [Protein Location Predictor](#protein-location-predictor)
  * [Features](#features)
* [Requirements](#requirements)
  * [Supported Python Version](#supported-python-version)
  * [Dependencies](#dependencies-full-environmentyml)
  * [Hardware Requirements](#hardware-requirements)
* [Installation](#installation)
* [Usage](#usage)
  * [GUI Mode](#gui-mode)
* [Example Input & Output](#example-input--output)
* [Project Structure](#project-structure)
* [Contributing](#contributing)

## Protein Location Predictor

A GUI application for predicting protein subcellular localization. It uses SVM and Random Forest classifiers trained on embeddings from protein language models (PROST-T5 and ESM-C).

### Features

* **Multiple Model Support**: Choose from three prediction models:
  * PROST-T5: Transformer-based protein language model
  * ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
  * ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
* **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (screenshot: `doc/screenshots/gui_example.png`)
* **Sequential Processing**: Process multiple protein sequences from FASTA files
* **Flexible Output**: Save predictions with confidence scores in text (CSV) format
* **Error Handling**: Comprehensive error handling and user feedback

## Requirements

### Supported Python Version

This project has been tested on **Python 3.10+**.

### Dependencies (Full environment.yml)

The complete environment definition is in `environment.yml`. It includes all packages needed for PyTorch, Transformers, the ESM models, and the GUI.

Here is a brief excerpt:

```yaml
name: tesisEnv
channels:
  - bioconda
  - anaconda
  - conda-forge
  - defaults
# Python version and major packages
dependencies:
  - python=3.10.16
  - pytorch=2.6.0
  - torchvision=0.21.0
  - torchtext=0.18.0
  - transformers=4.46.3
  - scikit-learn=1.6.1
  - biopython=1.85
  - esm=3.1.4
  - numpy=1.26.4
  - joblib=1.4.2
  - tk
  # plus many others (see full file for complete list)
```

To ensure exact reproducibility, use:

```bash
conda env create -f environment.yml
```

### Hardware Requirements

* **Minimum**: 8 GB RAM, CPU-only execution
* **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
* **Storage**: ~5 GB for model weights and cache

## Installation

1. **Clone the repository** (with Git LFS for large model files):

   ```bash
   git lfs install
   git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
   ```

   If you prefer to skip downloading model weights initially:

   ```bash
   GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
   ```

2. **Navigate into the project directory**:

   ```bash
   cd ProteinLocationPredictor
   ```

3. **Create and activate the Conda environment**:

   ```bash
   conda env create -f environment.yml
   conda activate tesisEnv
   ```

4. **(If skipped above) Download model weights manually**:

   Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:

   ```bash
   git lfs pull
   ```

## Usage

### GUI Mode

1. Launch the application:

   ```bash
   python gui.py
   ```

2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
4. Click **Run Prediction** and monitor the progress bar.
5. When complete, you will be prompted to choose an output directory and filename.

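
Once the output file is saved, it can be post-processed outside the GUI. The following is a minimal sketch, not part of this repository, that assumes the saved CSV matches the layout shown under Example Input & Output: a header row, then one row per sequence whose prediction cells read `Label (score)`. The helper names `parse_cell` and `read_predictions` are hypothetical and chosen only for illustration.

```python
import csv
import io
import re

# Matches one prediction cell such as "Cytoplasmic (0.9860)".
# Assumption: every cell follows the "Label (score)" layout from the example output.
CELL_PATTERN = re.compile(r"^(?P<label>.+?)\s*\((?P<score>[0-9.]+)\)$")


def parse_cell(cell):
    """Split a 'Label (score)' cell into (label, score as float)."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        raise ValueError(f"Unexpected prediction cell: {cell!r}")
    return match.group("label"), float(match.group("score"))


def read_predictions(handle):
    """Return {sequence_id: [(label, score), ...]} keeping the column order."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    results = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        sequence_id, *cells = row
        results[sequence_id] = [parse_cell(cell) for cell in cells if cell]
    return results


# Rows copied from the example output in this README.
EXAMPLE_CSV = """\
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
"""

predictions = read_predictions(io.StringIO(EXAMPLE_CSV))
top_label, top_score = predictions["protein_1"][0]
print(top_label, top_score)
```

For a file on disk, pass an open handle instead of the in-memory string, for example `open(path, newline="")`, where `path` is whatever filename was chosen in step 5.
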
## Example Input & Output

**Input FASTA (`example/input.fasta`):**

```
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
```

**Output CSV (`example/output.csv`):**

```csv
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
```

## Project Structure

```
ProteinLocationPredictor/
├── gui.py
├── src/
│   └── my_utils.py
├── Models/
│   ├── ProstT5_svm.joblib
│   ├── ESMC-300m_svm.joblib
│   ├── ESMC-600m_svm.joblib
│   └── ...
├── environment.yml
├── README.md
└── doc/
    └── screenshots/
        └── gui_example.png
```

## Contributing

1. Fork the repository
2. Create a feature branch:

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. Commit your changes:

   ```bash
   git commit -m "Add amazing feature"
   ```

4. Push to your branch:

   ```bash
   git push origin feature/amazing-feature
   ```

5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)