Update README.md: Clarify application description and improve example input/output formatting

Enhance my_utils.py: Add docstrings for randomSVM and randomSearch functions to improve code documentation

Update notebook files: Modify versioning information for 01_EDA_Psort.ipynb and 04_Training.ipynb

Files changed (4)

README.md +3 -12
notebooks/01_EDA_Psort.ipynb +2 -2
notebooks/04_Training.ipynb +2 -2
src/my_utils.py +26 -0

README.md CHANGED

@@ -22,13 +22,12 @@ base_model:
     * [GUI Mode](#gui-mode)
   * [Example Input & Output](#example-input--output)
-  * [Model Details](#model-details)
   * [Project Structure](#project-structure)
   * [Contributing](#contributing)
 ## Protein Location Predictor
-A comprehensive GUI application for predicting protein subcellular localization using state-of-the-art machine learning models including PROST-T5 and ESM-C embeddings.
 ### Features
@@ -139,7 +138,7 @@ conda env create -f environment.yml
 ## Example Input & Output
-**Input FASTA (********`example/input.fasta`****\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*):**
 ```
 >protein_1
@@ -148,7 +147,7 @@ MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
 MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
 ```
-**Output CSV (********`example/output.csv`****\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*):**
 ```csv
 Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
@@ -156,14 +155,6 @@ protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029)
 protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
 ```
-## Model Details
-| Model        | Embedding Dim. | Classifier | GPU VRAM | RAM Usage |
-| ------------ | -------------- | ---------- | -------- | --------- |
-| PROST-T5     | 1024           | SVM        | \~4 GB   | \~8 GB    |
-| ESM-C (300M) | 960            | SVM        | \~2 GB   | \~6 GB    |
-| ESM-C (600M) | 1280           | SVM        | \~4 GB   | \~10 GB   |
 ## Project Structure
 ```

     * [GUI Mode](#gui-mode)
   * [Example Input & Output](#example-input--output)
   * [Project Structure](#project-structure)
   * [Contributing](#contributing)
 ## Protein Location Predictor
+A comprehensive GUI application for predicting protein subcellular localization using SVM and Random Forest classifiers using state-of-the-art protein language models including PROST-T5 and ESM-C embeddings as training data.
 ### Features
 ## Example Input & Output
+**Input FASTA (`example/input.fasta`):**
 ```
 >protein_1
 MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
 ```
+**Output CSV (`example/output.csv`):**
 ```csv
 Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
 protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
 ```
 ## Project Structure
 ```

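The example output in the README hunk above stores each prediction as a single `Label (probability)` cell, so downstream code has to split the label from the score. A minimal sketch of one way to do that with the standard library follows; the helper names `parse_cell` and `parse_predictions` are hypothetical and are not part of this repository, and the sample row is copied from the README example.

```python
import csv
import io
import re

# Header and one row copied from the README example output.
SAMPLE = (
    "Sequence_ID,Prediction 1,Prediction 2,Prediction 3,"
    "Prediction 4,Prediction 5,Prediction 6\n"
    "protein_2,SignalPeptide (0.7523),Extracellular (0.1234),"
    "CytoplasmicMembrane (0.0645),Cellwall (0.0345),"
    "Periplasmic (0.0201),OuterMembrane (0.0052)\n"
)

# Matches cells such as "SignalPeptide (0.7523)".
CELL = re.compile(r"^\s*(?P<label>.+?)\s*\((?P<prob>[0-9.]+)\)\s*$")


def parse_cell(cell: str) -> tuple[str, float]:
    """Split one 'Label (probability)' cell into its two parts (hypothetical helper)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected prediction cell: {cell!r}")
    return match.group("label"), float(match.group("prob"))


def parse_predictions(text: str) -> dict[str, list[tuple[str, float]]]:
    """Map each Sequence_ID to its ranked (label, probability) pairs."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return {row[0]: [parse_cell(c) for c in row[1:]] for row in reader if row}


predictions = parse_predictions(SAMPLE)
top_label, top_prob = predictions["protein_2"][0]
print(top_label, top_prob)
```

The first entry of each list is the top-ranked localization, since the README example orders the prediction columns by descending probability.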
notebooks/01_EDA_Psort.ipynb CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fae1ec4b018bfc4f3c580f34f7fa8e1e75a78848d2f3064778a2112fd8962fa4
-size 10331242

 version https://git-lfs.github.com/spec/v1
+oid sha256:f99370bb677795f54b6778596ada73015381847b1838e3cc22553ede7a03dbc3
+size 10363061

notebooks/04_Training.ipynb CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7a8413d0e23a5a332be68556dba9115fd4691f4c7fbb1a3572657ca4a9e6b035
-size 699750

 version https://git-lfs.github.com/spec/v1
+oid sha256:93220b22403b1a7c4d2e062362f86b4aeeba65c024c74f427a9defe76f13cafe
+size 699779

src/my_utils.py CHANGED

@@ -375,6 +375,19 @@ def train_svm(title: str, x: np.ndarray, y: np.ndarray, params: dict) -> tuple[P
 def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
     le = LabelEncoder()
     y_encoded = le.fit_transform(y)
@@ -418,6 +431,19 @@ def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
     return random_search.best_params_
 def randomSearch(x: np.ndarray, y: np.ndarray) -> dict: #type: ignore
     le = LabelEncoder()
     y_encoded = le.fit_transform(y)

 def randomSVM(x: np.ndarray, y: np.ndarray) -> dict:
+    """
+    Performs randomized hyperparameter search for an SVM classifier using a pipeline with feature scaling.
+    Args:
+        x (np.ndarray): Feature matrix of shape (n_samples, n_features).
+        y (np.ndarray): Target labels of shape (n_samples,).
+    Returns:
+        dict: The best hyperparameters found during randomized search.
+    The function encodes the target labels, splits the data for training, constructs a pipeline with a StandardScaler and SVM,
+    and performs RandomizedSearchCV over a predefined hyperparameter space using weighted F1 score as the evaluation metric.
+    """
     le = LabelEncoder()
     y_encoded = le.fit_transform(y)
     return random_search.best_params_
 def randomSearch(x: np.ndarray, y: np.ndarray) -> dict: #type: ignore
+    """
+    Performs a randomized hyperparameter search for a RandomForestClassifier using the provided feature matrix and labels.
+    Args:
+        x (np.ndarray): Feature matrix of shape (n_samples, n_features).
+        y (np.ndarray): Target labels of shape (n_samples,).
+    Returns:
+        dict: The best hyperparameters found during the randomized search.
+    Notes:
+        - The function encodes the labels, splits the data for training, and uses RandomizedSearchCV to optimize hyperparameters.
+        - The search is performed using weighted F1 score and 3-fold cross-validation.
+        - Prints the best parameters found during the search.
+    """
     le = LabelEncoder()
     y_encoded = le.fit_transform(y)
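
The diff shows only the first two lines of each function body, so the search itself is not visible here. As a reading aid, the sketch below shows what the `randomSVM` docstring describes: label encoding, a training split, a `StandardScaler` plus SVM pipeline, and `RandomizedSearchCV` scored with weighted F1. It assumes scikit-learn and SciPy are installed. The function name `random_svm_sketch`, the hyperparameter ranges, the 80/20 split, `n_iter`, and the 3-fold setting (borrowed from the `randomSearch` docstring) are all assumptions and may differ from the actual code in `src/my_utils.py`.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC


def random_svm_sketch(x: np.ndarray, y: np.ndarray, n_iter: int = 5, seed: int = 0) -> dict:
    """Illustrative stand-in for randomSVM; not the repository's implementation."""
    # Encode string labels as integers, as the docstring describes.
    y_encoded = LabelEncoder().fit_transform(y)

    # Keep a held-out portion; the 80/20 stratified split is an assumption.
    x_train, _, y_train, _ = train_test_split(
        x, y_encoded, test_size=0.2, stratify=y_encoded, random_state=seed
    )

    # Scale features before the SVM so the search sees standardized inputs.
    pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

    # Assumed search space; the real predefined space is not shown in the diff.
    param_distributions = {
        "svm__C": loguniform(1e-2, 1e2),
        "svm__gamma": loguniform(1e-4, 1e0),
        "svm__kernel": ["rbf", "linear"],
    }

    search = RandomizedSearchCV(
        pipeline,
        param_distributions,
        n_iter=n_iter,
        scoring="f1_weighted",  # weighted F1, per the docstring
        cv=3,                   # assumed; 3-fold is stated only for randomSearch
        random_state=seed,
    )
    search.fit(x_train, y_train)
    return search.best_params_


# Synthetic data: three classes of 30 samples with shifted feature means.
rng = np.random.default_rng(0)
class_ids = np.repeat(np.arange(3), 30)
x = rng.normal(size=(90, 8)) + class_ids[:, None]
y = np.array(["Cytoplasmic", "Periplasmic", "Extracellular"])[class_ids]

best = random_svm_sketch(x, y)
print(best)
```

The returned keys carry the `svm__` prefix because the parameters are addressed through the pipeline step name. A Random Forest variant along the lines of the `randomSearch` docstring would swap the estimator and search space while keeping the same `RandomizedSearchCV` call.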