---
license: mit
language:
- en
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- EvolutionaryScale/esmc-600m-2024-12
- Rostlab/ProstT5
---
## Table of Contents

* [Protein Location Predictor](#protein-location-predictor)

  * [Features](#features)
  * [Requirements](#requirements)

    * [Supported Python Version](#supported-python-version)
    * [Dependencies](#dependencies-full-environmentyml)
    * [Hardware Requirements](#hardware-requirements)
  * [Installation](#installation)
  * [Usage](#usage)

    * [GUI Mode](#gui-mode)
  * [Example Input & Output](#example-input--output)
  * [Project Structure](#project-structure)
  * [Contributing](#contributing)

## Protein Location Predictor

A GUI application for predicting protein subcellular localization. It uses SVM and Random Forest classifiers trained on embeddings from protein language models (PROST-T5 and ESM-C).

### Features

* **Multiple Model Support**: Choose from three different prediction models:

  * PROST-T5: Transformer-based protein language model
  * ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
  * ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
* **User-Friendly GUI**: Simple Tkinter-based interface with progress tracking (a screenshot is stored at `doc/screenshots/gui_example.png`)
* **Sequential Processing**: Process multiple protein sequences from FASTA files
* **Flexible Output**: Save predictions with confidence scores in text (CSV) format
* **Error Handling**: Comprehensive error handling and user feedback

## Requirements

### Supported Python Version

This project has been tested on **Python 3.10+**.

### Dependencies (Full environment.yml)

The complete environment definition is located in `environment.yml`. This file includes all necessary packages for PyTorch, Transformers, ESM models, and GUI operation. Here is a brief excerpt:

```yaml
name: tesisEnv
channels:
  - bioconda
  - anaconda
  - conda-forge
  - defaults

# Python version and major packages
dependencies:
  - python=3.10.16
  - pytorch=2.6.0
  - torchvision=0.21.0
  - torchtext=0.18.0
  - transformers=4.46.3
  - scikit-learn=1.6.1
  - biopython=1.85
  - esm=3.1.4
  - numpy=1.26.4
  - joblib=1.4.2
  - tk
  # plus many others (see full file for complete list)
```

To ensure exact reproducibility, use:

```bash
conda env create -f environment.yml
```
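
After creating the environment, it can be useful to confirm that the installed packages match the pins in the excerpt above. The following is a minimal sketch using only the standard library. The `EXPECTED` mapping and the `check_versions` helper are illustrative and not part of this repository, and the mapping from conda package names to pip distribution names (for example `pytorch` to `torch`) is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the environment.yml excerpt. Keys are pip distribution
# names, which are ASSUMED here and can differ from conda package names
# (conda "pytorch" is assumed to install the "torch" distribution).
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.46.3",
    "scikit-learn": "1.6.1",
    "biopython": "1.85",
    "esm": "3.1.4",
    "numpy": "1.26.4",
    "joblib": "1.4.2",
}


def check_versions(expected):
    """Return {name: (wanted, found_or_None)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu124" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Run it inside the activated environment; no output means every listed package matches its pin.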

### Hardware Requirements

* **Minimum**: 8 GB RAM, CPU-only execution
* **Recommended**: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
* **Storage**: \~5 GB for model weights and cache
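
To see which of the configurations above a machine falls into, a short check like the one below may help. It is a sketch only: `pick_device` is a hypothetical helper, and this README does not state how the application itself selects a device. The PyTorch calls used (`torch.cuda.is_available`, `torch.cuda.get_device_properties`) are standard.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is no GPU path to report.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Device: {device}")
    if device == "cuda":
        import torch

        # total_memory is reported in bytes; convert to GiB for readability.
        vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0 VRAM: {vram_gib:.1f} GiB")
```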

## Installation

1. **Clone the repository** (with Git LFS for large model files):

   ```bash
   git lfs install
   git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
   ```

   If you prefer to skip downloading model weights initially:

   ```bash
   GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
   ```

2. **Navigate into the project directory**:

   ```bash
   cd ProteinLocationPredictor
   ```

3. **Create and activate the Conda environment**:

   ```bash
   conda env create -f environment.yml
   conda activate tesisEnv
   ```

4. **(If skipped above) Download model weights manually**:
   Model files live in the `Models/` directory. If you used `GIT_LFS_SKIP_SMUDGE`, run:

   ```bash
   git lfs pull
   ```
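
If you are unsure whether the weights were actually downloaded, note that a file skipped by Git LFS is a small text pointer beginning with the line `version https://git-lfs.github.com/spec/v1`. The sketch below lists `.joblib` files in `Models/` that are still pointers. The helper names are illustrative and not part of this repository.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` is a Git LFS pointer, not real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_HEADER)) == LFS_HEADER


def unfetched_models(models_dir="Models"):
    """Return the sorted names of .joblib files that are still LFS pointers."""
    return sorted(
        p.name for p in Path(models_dir).glob("*.joblib") if is_lfs_pointer(p)
    )


if __name__ == "__main__":
    pending = unfetched_models()
    if pending:
        print("Still LFS pointers (run `git lfs pull`):", ", ".join(pending))
    else:
        print("No LFS pointer files found in Models/.")
```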

## Usage

### GUI Mode

1. Launch the application:

   ```bash
   python gui.py
   ```
2. In the menu, click **File → Load FASTA** and select your input file (`.fasta`, `.fa`, or `.fas`).
3. Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
4. Click **Run Prediction** and monitor the progress bar.
5. When complete, you will be prompted to choose an output directory and filename.

## Example Input & Output

**Input FASTA (`example/input.fasta`):**

```
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
```
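
To make the input format concrete, here is a minimal standard-library sketch that splits FASTA text into `(sequence_id, sequence)` pairs. It is illustrative only: `read_fasta` is a hypothetical helper, and how the application itself parses input is not described in this README (Biopython is listed in the environment and may be what it uses).

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines.

    The ID is the first whitespace-delimited token after ">". Sequences
    wrapped over several lines are joined into one string.
    """
    seq_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split(maxsplit=1)
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


EXAMPLE = """\
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
"""

if __name__ == "__main__":
    for seq_id, seq in read_fasta(EXAMPLE.splitlines()):
        print(seq_id, len(seq))
```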

**Output CSV (`example/output.csv`):**

```csv
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
```
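
Each prediction cell combines a label and a score, as in `Cytoplasmic (0.9860)`, so downstream scripts need to split them. The sketch below does this with the standard library, assuming the cell layout shown in the example above. `parse_predictions` and the `CELL` pattern are illustrative and not part of this repository.

```python
import csv
import io
import re

# Matches cells such as "Cytoplasmic (0.9860)" (layout assumed from the example).
CELL = re.compile(r"^(?P<label>.+?) \((?P<score>\d*\.?\d+)\)$")


def parse_predictions(handle):
    """Map each sequence ID to a list of (label, score) tuples in file order."""
    results = {}
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue
        ranked = []
        for cell in row[1:]:
            match = CELL.match(cell.strip())
            if match:
                ranked.append((match["label"], float(match["score"])))
        results[row[0]] = ranked
    return results


SAMPLE = """\
Sequence_ID,Prediction 1,Prediction 2
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081)
"""

if __name__ == "__main__":
    for seq_id, ranked in parse_predictions(io.StringIO(SAMPLE)).items():
        top_label, top_score = ranked[0]
        print(f"{seq_id}: {top_label} ({top_score:.4f})")
```

For a real file, pass an open handle instead, for example `open("example/output.csv", newline="")`.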

## Project Structure

```
ProteinLocationPredictor/
├── gui.py
├── src/
│   └── my_utils.py
├── Models/
│   ├── ProstT5_svm.joblib
│   ├── ESMC-300m_svm.joblib
│   ├── ESMC-600m_svm.joblib
│   └── ...
├── environment.yml
├── README.md
└── doc/
    └── screenshots/
        └── gui_example.png
```

## Contributing

1. Fork the repository
2. Create a feature branch:

   ```bash
   git checkout -b feature/amazing-feature
   ```
3. Commit your changes:

   ```bash
   git commit -m "Add amazing feature"
   ```
4. Push to your branch:

   ```bash
   git push origin feature/amazing-feature
   ```
5. Open a Pull Request or start a discussion: [Repository Discussions](https://huggingface.co/jpuglia/ProteinLocationPredictor/discussions)