jpuglia committed on
Commit
458f017
·
1 Parent(s): c276617

Update README.md: Clarify application description and improve example input/output formatting

Enhance my_utils.py: Add docstrings for randomSVM and randomSearch functions to improve code documentation

Update notebook files: Modify versioning information for 01_EDA_Psort.ipynb and 04_Training.ipynb

README.md CHANGED
@@ -22,13 +22,12 @@ base_model:
22
 
23
  * [GUI Mode](#gui-mode)
24
  * [Example Input & Output](#example-input--output)
25
- * [Model Details](#model-details)
26
  * [Project Structure](#project-structure)
27
  * [Contributing](#contributing)
28
 
29
  ## Protein Location Predictor
30
 
31
- A comprehensive GUI application for predicting protein subcellular localization using state-of-the-art machine learning models including PROST-T5 and ESM-C embeddings.
32
 
33
  ### Features
34
 
@@ -139,7 +138,7 @@ conda env create -f environment.yml
139
 
140
  ## Example Input & Output
141
 
142
- **Input FASTA (********`example/input.fasta`****\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*):**
143
 
144
  ```
145
  >protein_1
@@ -148,7 +147,7 @@ MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
148
  MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
149
  ```
150
 
151
- **Output CSV (********`example/output.csv`****\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*):**
152
 
153
  ```csv
154
  Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
@@ -156,14 +155,6 @@ protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029)
156
  protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
157
  ```
158
 
159
- ## Model Details
160
-
161
- | Model | Embedding Dim. | Classifier | GPU VRAM | RAM Usage |
162
- | ------------ | -------------- | ---------- | -------- | --------- |
163
- | PROST-T5 | 1024 | SVM | \~4 GB | \~8 GB |
164
- | ESM-C (300M) | 960 | SVM | \~2 GB | \~6 GB |
165
- | ESM-C (600M) | 1280 | SVM | \~4 GB | \~10 GB |
166
-
167
  ## Project Structure
168
 
169
  ```
 
22
 
23
  * [GUI Mode](#gui-mode)
24
  * [Example Input & Output](#example-input--output)
 
25
  * [Project Structure](#project-structure)
26
  * [Contributing](#contributing)
27
 
28
  ## Protein Location Predictor
29
 
30
+ A comprehensive GUI application for predicting protein subcellular localization with SVM and Random Forest classifiers trained on embeddings from state-of-the-art protein language models, including PROST-T5 and ESM-C.
31
 
32
  ### Features
33
 
 
138
 
139
  ## Example Input & Output
140
 
141
+ **Input FASTA (`example/input.fasta`):**
142
 
143
  ```
144
  >protein_1
 
147
  MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
148
  ```
149
 
150
+ **Output CSV (`example/output.csv`):**
151
 
152
  ```csv
153
  Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
 
155
  protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
156
  ```
157
 
158
  ## Project Structure
159
 
160
  ```
notebooks/01_EDA_Psort.ipynb CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fae1ec4b018bfc4f3c580f34f7fa8e1e75a78848d2f3064778a2112fd8962fa4
3
- size 10331242
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f99370bb677795f54b6778596ada73015381847b1838e3cc22553ede7a03dbc3
3
+ size 10363061
notebooks/04_Training.ipynb CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7a8413d0e23a5a332be68556dba9115fd4691f4c7fbb1a3572657ca4a9e6b035
3
- size 699750
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:93220b22403b1a7c4d2e062362f86b4aeeba65c024c74f427a9defe76f13cafe
3
+ size 699779
src/my_utils.py CHANGED
@@ -375,6 +375,19 @@ def train_svm(title: str, x: np.ndarray, y: np.ndarray, params: dict) -> tuple[P
375
 
376
 
377
  def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
378
 
379
  le = LabelEncoder()
380
  y_encoded = le.fit_transform(y)
@@ -418,6 +431,19 @@ def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
418
  return random_search.best_params_
419
 
420
  def randomSearch(x: np.ndarray, y: np.ndarray) -> dict: #type: ignore
421
 
422
  le = LabelEncoder()
423
  y_encoded = le.fit_transform(y)
 
375
 
376
 
377
  def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
378
+ """
379
+ Performs randomized hyperparameter search for an SVM classifier using a pipeline with feature scaling.
380
+
381
+ Args:
382
+ x (np.ndarray): Feature matrix of shape (n_samples, n_features).
383
+ y (np.ndarray): Target labels of shape (n_samples,).
384
+
385
+ Returns:
386
+ dict: The best hyperparameters found during randomized search.
387
+
388
+ The function encodes the target labels, splits the data for training, constructs a pipeline with a StandardScaler and SVM,
389
+ and performs RandomizedSearchCV over a predefined hyperparameter space using weighted F1 score as the evaluation metric.
390
+ """
391
 
392
  le = LabelEncoder()
393
  y_encoded = le.fit_transform(y)
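
Review note: the `randomSVM` docstring describes a four-step flow (encode labels, split, build a scaler + SVM pipeline, run `RandomizedSearchCV` scored by weighted F1). The body of the function is not visible in this hunk, so the following is only a minimal sketch of that flow, not the code in `src/my_utils.py`. The function name `random_svm_sketch`, the 80/20 split, the 3-fold CV, the `n_iter` value, and the whole search space are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC


def random_svm_sketch(x: np.ndarray, y: np.ndarray, n_iter: int = 5, seed: int = 0) -> dict:
    """Hypothetical stand-in for randomSVM; mirrors the docstring, not the real body."""
    # Encode string labels (e.g. "Cytoplasmic") as integers.
    y_encoded = LabelEncoder().fit_transform(y)

    # Keep a held-out split; only the training part is searched (assumed 80/20).
    x_train, _, y_train, _ = train_test_split(
        x, y_encoded, test_size=0.2, stratify=y_encoded, random_state=seed
    )

    # Scaling lives inside the pipeline so it is refit on every CV fold.
    pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

    # Assumed search space; the real "predefined" space is not shown in the diff.
    param_distributions = {
        "svm__C": loguniform(1e-2, 1e2),
        "svm__gamma": loguniform(1e-4, 1e0),
        "svm__kernel": ["rbf", "linear"],
    }

    search = RandomizedSearchCV(
        pipeline,
        param_distributions,
        n_iter=n_iter,
        scoring="f1_weighted",  # weighted F1, as stated in the docstring
        cv=3,                   # assumed fold count
        random_state=seed,
    )
    search.fit(x_train, y_train)
    return search.best_params_


# Synthetic stand-in for embedding features and localization labels.
x_demo, y_int = make_classification(
    n_samples=120, n_features=8, n_informative=5, n_classes=3, random_state=0
)
y_demo = np.array(["Cytoplasmic", "Periplasmic", "Extracellular"])[y_int]
best_svm_params = random_svm_sketch(x_demo, y_demo)
print(best_svm_params)
```

Because the scaler is a pipeline step, the returned keys carry the `svm__` prefix; a caller that passes them to a bare `SVC` would need to strip it first.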
 
431
  return random_search.best_params_
432
 
433
  def randomSearch(x: np.ndarray, y: np.ndarray) -> dict: #type: ignore
434
+
435
+ """
436
+ Performs a randomized hyperparameter search for a RandomForestClassifier using the provided feature matrix and labels.
437
+ Args:
438
+ x (np.ndarray): Feature matrix of shape (n_samples, n_features).
439
+ y (np.ndarray): Target labels of shape (n_samples,).
440
+ Returns:
441
+ dict: The best hyperparameters found during the randomized search.
442
+ Notes:
443
+ - The function encodes the labels, splits the data for training, and uses RandomizedSearchCV to optimize hyperparameters.
444
+ - The search is performed using weighted F1 score and 3-fold cross-validation.
445
+ - Prints the best parameters found during the search.
446
+ """
447
 
448
  le = LabelEncoder()
449
  y_encoded = le.fit_transform(y)
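
Review note: the `randomSearch` docstring states that the function encodes labels, splits the data, and runs `RandomizedSearchCV` for a `RandomForestClassifier` with weighted F1 and 3-fold cross-validation, printing the best parameters. The hunk does not show the rest of the body, so the sketch below only illustrates that description. The name `random_forest_search_sketch`, the 80/20 split, `n_iter`, and the search space are assumptions, not the values used in `src/my_utils.py`.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder


def random_forest_search_sketch(
    x: np.ndarray, y: np.ndarray, n_iter: int = 5, seed: int = 0
) -> dict:
    """Hypothetical stand-in for randomSearch; mirrors the docstring only."""
    # Encode class names as integer labels.
    y_encoded = LabelEncoder().fit_transform(y)

    # Search on the training portion only (assumed 80/20 stratified split).
    x_train, _, y_train, _ = train_test_split(
        x, y_encoded, test_size=0.2, stratify=y_encoded, random_state=seed
    )

    # Assumed search space, chosen small so the demo runs quickly.
    param_distributions = {
        "n_estimators": randint(50, 200),
        "max_depth": [None, 5, 10, 20],
        "min_samples_split": randint(2, 11),
        "max_features": ["sqrt", "log2"],
    }

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=seed),
        param_distributions,
        n_iter=n_iter,
        scoring="f1_weighted",  # weighted F1, per the docstring
        cv=3,                   # 3-fold CV, per the docstring
        random_state=seed,
    )
    search.fit(x_train, y_train)

    # The docstring notes that the best parameters are printed.
    print("Best parameters:", search.best_params_)
    return search.best_params_


# Synthetic stand-in for embedding features and localization labels.
x_demo, y_int = make_classification(
    n_samples=120, n_features=8, n_informative=5, n_classes=3, random_state=0
)
y_demo = np.array(["Cytoplasmic", "Periplasmic", "Extracellular"])[y_int]
best_rf_params = random_forest_search_sketch(x_demo, y_demo)
```

Unlike the SVM case, tree ensembles do not need feature scaling, which is presumably why the docstring mentions no pipeline here; the returned keys can be passed straight to `RandomForestClassifier(**best_rf_params)`.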