Update README.md: Clarify application description and improve example input/output formatting
Enhance my_utils.py: Add docstrings for randomSVM and randomSearch functions to improve code documentation
Update notebook files: Modify versioning information for 01_EDA_Psort.ipynb and 04_Training.ipynb
- README.md +3 -12
- notebooks/01_EDA_Psort.ipynb +2 -2
- notebooks/04_Training.ipynb +2 -2
- src/my_utils.py +26 -0
README.md
CHANGED
````diff
@@ -22,13 +22,12 @@ base_model:
 
 * [GUI Mode](#gui-mode)
 * [Example Input & Output](#example-input--output)
-* [Model Details](#model-details)
 * [Project Structure](#project-structure)
 * [Contributing](#contributing)
 
 ## Protein Location Predictor
 
-A comprehensive GUI application for predicting protein subcellular localization using state-of-the-art
+A comprehensive GUI application for predicting protein subcellular localization using SVM and Random Forest classifiers using state-of-the-art protein language models including PROST-T5 and ESM-C embeddings as training data.
 
 ### Features
 
@@ -139,7 +138,7 @@ conda env create -f environment.yml
 
 ## Example Input & Output
 
-**Input FASTA (
+**Input FASTA (`example/input.fasta`):**
 
 ```
 >protein_1
@@ -148,7 +147,7 @@ MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
 MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
 ```
 
-**Output CSV (
+**Output CSV (`example/output.csv`):**
 
 ```csv
 Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
@@ -156,14 +155,6 @@ protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029)
 protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
 ```
 
-## Model Details
-
-| Model | Embedding Dim. | Classifier | GPU VRAM | RAM Usage |
-| ------------ | -------------- | ---------- | -------- | --------- |
-| PROST-T5 | 1024 | SVM | \~4 GB | \~8 GB |
-| ESM-C (300M) | 960 | SVM | \~2 GB | \~6 GB |
-| ESM-C (600M) | 1280 | SVM | \~4 GB | \~10 GB |
-
 ## Project Structure
 
 ```
````
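The output CSV shown in the README diff stores each prediction as a label followed by a parenthesised score. As a rough illustration of how a downstream script might read that file, here is a minimal sketch. The function name `parse_predictions` and the regular expression are assumptions made for this example and are not part of the repository; the sample row is the `protein_2` line from the diff.

```python
import csv
import io
import re

# One cell looks like "SignalPeptide (0.7523)": a label, a space, a score in parentheses.
CELL_PATTERN = re.compile(r"^(?P<label>.+?) \((?P<score>[0-9.]+)\)$")


def parse_predictions(csv_text: str) -> dict[str, list[tuple[str, float]]]:
    """Hypothetical helper: map each Sequence_ID to its ranked (label, score) pairs."""
    results = {}
    for record in csv.DictReader(io.StringIO(csv_text)):
        sequence_id = record.pop("Sequence_ID")
        ranked = []
        for cell in record.values():
            if not isinstance(cell, str):
                continue  # skip missing or surplus fields
            match = CELL_PATTERN.match(cell.strip())
            if match:
                ranked.append((match["label"], float(match["score"])))
        results[sequence_id] = ranked
    return results


sample = (
    "Sequence_ID,Prediction 1,Prediction 2,Prediction 3,"
    "Prediction 4,Prediction 5,Prediction 6\n"
    "protein_2,SignalPeptide (0.7523),Extracellular (0.1234),"
    "CytoplasmicMembrane (0.0645),Cellwall (0.0345),"
    "Periplasmic (0.0201),OuterMembrane (0.0052)\n"
)
parsed = parse_predictions(sample)
top_label, top_score = parsed["protein_2"][0]
print(top_label, top_score)
```

The sketch keeps the columns in file order, so the first pair is whatever the application wrote under `Prediction 1`.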
notebooks/01_EDA_Psort.ipynb
CHANGED
```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f99370bb677795f54b6778596ada73015381847b1838e3cc22553ede7a03dbc3
+size 10363061
```
notebooks/04_Training.ipynb
CHANGED
```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:93220b22403b1a7c4d2e062362f86b4aeeba65c024c74f427a9defe76f13cafe
+size 699779
```
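Both notebook entries are Git LFS pointer files, so the diff only shows a new object hash and byte size, not the notebook contents. The following minimal sketch reads such a pointer into a dictionary. The helper name `parse_lfs_pointer` is hypothetical, and the sample text is the new pointer for `notebooks/01_EDA_Psort.ipynb` from the diff above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Hypothetical helper: split a Git LFS pointer file into its fields."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each pointer line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),
    }


pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f99370bb677795f54b6778596ada73015381847b1838e3cc22553ede7a03dbc3\n"
    "size 10363061\n"
)
pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], pointer["size"])
```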
src/my_utils.py
CHANGED
```diff
@@ -375,6 +375,19 @@ def train_svm(title: str, x: np.ndarray, y: np.ndarray, params: dict) -> tuple[P
 
 
 def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
+    """
+    Performs randomized hyperparameter search for an SVM classifier using a pipeline with feature scaling.
+
+    Args:
+        x (np.ndarray): Feature matrix of shape (n_samples, n_features).
+        y (np.ndarray): Target labels of shape (n_samples,).
+
+    Returns:
+        dict: The best hyperparameters found during randomized search.
+
+    The function encodes the target labels, splits the data for training, constructs a pipeline with a StandardScaler and SVM,
+    and performs RandomizedSearchCV over a predefined hyperparameter space using weighted F1 score as the evaluation metric.
+    """
 
     le = LabelEncoder()
     y_encoded = le.fit_transform(y)
@@ -418,6 +431,19 @@ def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
     return random_search.best_params_
 
 def randomSearch(x: np.ndarray, y: np.ndarray) -> dict: #type: ignore
+
+    """
+    Performs a randomized hyperparameter search for a RandomForestClassifier using the provided feature matrix and labels.
+    Args:
+        x (np.ndarray): Feature matrix of shape (n_samples, n_features).
+        y (np.ndarray): Target labels of shape (n_samples,).
+    Returns:
+        dict: The best hyperparameters found during the randomized search.
+    Notes:
+        - The function encodes the labels, splits the data for training, and uses RandomizedSearchCV to optimize hyperparameters.
+        - The search is performed using weighted F1 score and 3-fold cross-validation.
+        - Prints the best parameters found during the search.
+    """
 
     le = LabelEncoder()
     y_encoded = le.fit_transform(y)
```
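The two docstrings describe the same workflow: encode the labels, split the data, run `RandomizedSearchCV` scored by weighted F1, and return the best parameters. The function bodies are not part of this diff, so the following is only a sketch of that workflow under stated assumptions, not the repository's implementation. The name `random_search_sketch`, the `kind` switch, the 80/20 split, both search spaces, and the use of 3 folds for the SVM case are all assumptions; only the 3-fold setting for the Random Forest search is stated in the docstring.

```python
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC


def random_search_sketch(x: np.ndarray, y: np.ndarray, kind: str = "svm",
                         n_iter: int = 5, seed: int = 0) -> dict:
    """Hypothetical stand-in for randomSVM / randomSearch; not the repository code."""
    # Encode string labels as integers, as both docstrings describe.
    y_encoded = LabelEncoder().fit_transform(y)

    # Keep a held-out split; the 80/20 ratio is an assumption.
    x_train, _, y_train, _ = train_test_split(
        x, y_encoded, test_size=0.2, stratify=y_encoded, random_state=seed
    )

    if kind == "svm":
        # Scaling sits inside the pipeline so each CV fold fits its own scaler.
        estimator = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
        # Assumed search space; the real one is not shown in the diff.
        space = {
            "svm__C": loguniform(1e-2, 1e2),
            "svm__gamma": loguniform(1e-4, 1e0),
            "svm__kernel": ["rbf", "linear"],
        }
    else:
        estimator = RandomForestClassifier(random_state=seed)
        # Assumed search space for the Random Forest variant.
        space = {
            "n_estimators": randint(50, 200),
            "max_depth": [None, 10, 20],
            "min_samples_split": randint(2, 10),
        }

    search = RandomizedSearchCV(
        estimator, space, n_iter=n_iter, scoring="f1_weighted",
        cv=3, random_state=seed,
    )
    search.fit(x_train, y_train)
    print("Best parameters:", search.best_params_)
    return search.best_params_


# Synthetic demo data standing in for protein embeddings and location labels.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 8))
y_demo = np.array(["Cytoplasmic", "Periplasmic", "Extracellular"] * 20)
best = random_search_sketch(x_demo, y_demo, kind="svm")
```

Placing `StandardScaler` inside the `Pipeline` means the scaler is refit on the training portion of every fold, which avoids leaking validation statistics into the search. The returned dictionary uses the `svm__` prefix for pipeline parameters, so it can be passed to `Pipeline.set_params` directly.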