elsayedelmandoh committed on
Commit
c8cd7f6
·
1 Parent(s): 7ffdb9e

upload results and update path in savefig function

README.md CHANGED
@@ -1,5 +1,14 @@
1
  # Sentiment Sleuth
2
 
 
 
 
 
 
 
 
 
 
3
  ## Table of Contents
4
  - [Overview](#overview)
5
  - [Key Features](#key-features)
@@ -12,9 +21,8 @@
12
  - [Contributing](#contributing)
13
  - [Author](#author)
14
 
15
- ---
16
- ## Overview
17
 
 
18
  This is a project for performing sentiment analysis on Amazon product reviews using classical machine-learning models. The project includes data processing and feature engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI to analyze custom review text.
19
 
20
  Key components in the repository:
@@ -27,13 +35,13 @@ Key components in the repository:
27
 
28
  The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (`Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost and LightGBM`) so you can compare predictions and confidence scores side-by-side.
29
 
30
- ---
31
  ## Key Features
32
  * **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
33
  * **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
34
  * **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering and model training are included under `notebooks/`.
35
 
36
- ---
37
  ## Setup
38
  0. Prerequisites
39
  Before running this project, ensure you have the following installed:
@@ -60,7 +68,7 @@ pip install -r requirements.txt
60
  3. Environment Variables
61
  Create a `.env` file at the project root and add any necessary API keys or configuration variables.
62
 
63
- ---
64
  ## Usage
65
  This project uses Streamlit for the interactive UI. Start the app locally with one of the following commands:
66
 
@@ -84,7 +92,7 @@ The `notebooks/` directory contains step-by-step analysis and model training not
84
 
85
  Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
86
 
87
- ---
88
  ## Contributing
89
  Contributions are welcome! If you'd like to improve this project, please follow these steps:
90
  1. Fork the repository.
 
1
  # Sentiment Sleuth
2
 
3
+ [![github](https://img.shields.io/badge/GitHub-sentiment__sleuth-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/elsayedelmandoh/sentiment-sleuth)
4
+ [![huggingface](https://img.shields.io/badge/Space-Hugging%20Face-yellow?style=for-the-badge&logo=huggingface&logoColor=black)](https://elsayedelmandoh-sentiment-sleuth.hf.space)
5
+
6
+ <p align="center">
7
+ <img src="docs/02_results/pos_results.png" alt="Sentiment Sleuth — pos results" width="45%">
8
+ &nbsp; &nbsp;
9
+ <img src="docs/02_results/neg_results.png" alt="Sentiment Sleuth — neg results" width="45%">
10
+ </p>
11
+
12
  ## Table of Contents
13
  - [Overview](#overview)
14
  - [Key Features](#key-features)
 
21
  - [Contributing](#contributing)
22
  - [Author](#author)
23
 
 
 
24
 
25
+ ## Overview
26
  This is a project for performing sentiment analysis on Amazon product reviews using classical machine-learning models. The project includes data processing and feature engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI to analyze custom review text.
27
 
28
  Key components in the repository:
 
35
 
36
  The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (`Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost and LightGBM`) so you can compare predictions and confidence scores side-by-side.
37
 
38
+
39
  ## Key Features
40
  * **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
41
  * **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
42
  * **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering and model training are included under `notebooks/`.
43
 
44
+
45
  ## Setup
46
  0. Prerequisites
47
  Before running this project, ensure you have the following installed:
 
68
  3. Environment Variables
69
  Create a `.env` file at the project root and add any necessary API keys or configuration variables.
70
 
71
+
72
  ## Usage
73
  This project uses Streamlit for the interactive UI. Start the app locally with one of the following commands:
74
 
 
92
 
93
  Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
94
 
95
+
96
  ## Contributing
97
  Contributions are welcome! If you'd like to improve this project, please follow these steps:
98
  1. Fork the repository.
app.py CHANGED
@@ -28,6 +28,7 @@ def _safe_predict(model, X):
28
  def _try_predict(input_X):
29
  pred = model.predict(input_X)[0]
30
  prob = None
 
31
  if hasattr(model, "predict_proba"):
32
  try:
33
  probs = model.predict_proba(input_X)[0]
@@ -41,6 +42,29 @@ def _safe_predict(model, X):
41
  prob = float(probs.max())
42
  except Exception:
43
  prob = None
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
44
  return pred, prob
45
 
46
  try:
@@ -59,25 +83,50 @@ def _safe_predict(model, X):
59
  return None, None, f"predict failed: {e1}"
60
 
61
 
62
- def map_label(pred):
 
 
 
 
 
 
 
 
63
  if pred is None:
64
  return "Unknown"
65
- # numeric encodings
 
 
 
 
 
 
 
 
 
 
 
 
 
 
66
  try:
67
  p = int(pred)
68
- if p in (1,):
69
  return "Negative"
 
 
70
  if p == 2:
71
  return "Positive"
72
  except Exception:
73
  pass
74
- # string encodings
75
  if isinstance(pred, str):
76
  l = pred.lower()
77
  if "neg" in l:
78
  return "Negative"
79
  if "pos" in l:
80
  return "Positive"
 
81
  return str(pred)
82
 
83
 
@@ -152,7 +201,7 @@ def main():
152
  with col:
153
  st.subheader(name)
154
  raw, prob, err = _safe_predict(model, X)
155
- label = map_label(raw)
156
  if label == "Positive":
157
  st.success(label)
158
  elif label == "Negative":
 
28
  def _try_predict(input_X):
29
  pred = model.predict(input_X)[0]
30
  prob = None
31
+ # First try predict_proba when available
32
  if hasattr(model, "predict_proba"):
33
  try:
34
  probs = model.predict_proba(input_X)[0]
 
42
  prob = float(probs.max())
43
  except Exception:
44
  prob = None
45
+ # If no predict_proba, try decision_function fallback for an approximate confidence
46
+ elif hasattr(model, "decision_function"):
47
+ try:
48
+ score = model.decision_function(input_X)
49
+ # decision_function can return (n_samples,) or (n_samples, n_classes)
50
+ if hasattr(score, '__len__') and getattr(score, 'ndim', 0) == 1:
51
+ score_val = float(score[0])
52
+ # convert distance to a pseudo-probability via a sigmoid
53
+ prob_pos = 1.0 / (1.0 + __import__('math').exp(-score_val))
54
+ # If classes_ available, align probability to predicted class
55
+ if hasattr(model, 'classes_') and len(model.classes_) >= 2:
56
+ # assume classes_[1] corresponds to the positive side of decision_function
57
+ if pred == model.classes_[1]:
58
+ prob = float(prob_pos)
59
+ else:
60
+ prob = float(1.0 - prob_pos)
61
+ else:
62
+ prob = float(max(min(prob_pos, 1.0), 0.0))
63
+ else:
64
+ # multi-dimensional decision function — skip
65
+ prob = None
66
+ except Exception:
67
+ prob = None
68
  return pred, prob
69
 
70
  try:
 
83
  return None, None, f"predict failed: {e1}"
84
 
85
 
86
+ def map_label(pred, model=None):
87
+ """Map a raw model prediction to a human label.
88
+
89
+ Supports both common encodings used in this repo:
90
+ - {0,1} where 0 -> Negative, 1 -> Positive
91
+ - {1,2} where 1 -> Negative, 2 -> Positive
92
+
93
+ If `model` is provided and has `classes_`, we use that to disambiguate.
94
+ """
95
  if pred is None:
96
  return "Unknown"
97
+
98
+ # If model provides classes_, prefer that mapping
99
+ try:
100
+ if model is not None and hasattr(model, 'classes_'):
101
+ classes = tuple(model.classes_)
102
+ if set(classes) == {0, 1}:
103
+ p = int(pred)
104
+ return "Negative" if p == 0 else "Positive"
105
+ if set(classes) == {1, 2}:
106
+ p = int(pred)
107
+ return "Negative" if p == 1 else "Positive"
108
+ except Exception:
109
+ pass
110
+
111
+ # Fallback heuristics
112
  try:
113
  p = int(pred)
114
+ if p == 0:
115
  return "Negative"
116
+ if p == 1:
117
+ return "Positive"
118
  if p == 2:
119
  return "Positive"
120
  except Exception:
121
  pass
122
+
123
  if isinstance(pred, str):
124
  l = pred.lower()
125
  if "neg" in l:
126
  return "Negative"
127
  if "pos" in l:
128
  return "Positive"
129
+
130
  return str(pred)
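Review note: the new `map_label` resolves the `{0,1}` versus `{1,2}` encodings through `model.classes_`. Without a model, a raw `1` is ambiguous, and the fallback now returns "Positive" where the old code returned "Negative". The sketch below shows one possible way to avoid hard-coding either encoding, under the assumption that the smaller class value always means Negative. `label_from_classes` is a hypothetical helper, not part of this repository.

```python
def label_from_classes(pred, classes):
    """Map a binary prediction to a label by its position among the sorted classes.

    Assumption (not confirmed by the repo): the smaller class value is Negative.
    """
    ordered = sorted(classes)
    # Only binary encodings are handled; anything else is reported as unknown.
    if len(ordered) != 2 or pred not in ordered:
        return "Unknown"
    return "Negative" if pred == ordered[0] else "Positive"


if __name__ == "__main__":
    print(label_from_classes(1, [0, 1]))  # 1 is the larger class here
    print(label_from_classes(1, [1, 2]))  # 1 is the smaller class here
```

With this shape the same raw value `1` maps differently depending on the encoding the model was trained with, which is the behaviour the `classes_` branch in the commit is aiming for.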
131
 
132
 
 
201
  with col:
202
  st.subheader(name)
203
  raw, prob, err = _safe_predict(model, X)
204
+ label = map_label(raw, model)
205
  if label == "Positive":
206
  st.success(label)
207
  elif label == "Negative":
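Review note: the `decision_function` fallback added in this commit squashes a signed margin through a sigmoid to get a rough confidence. The value is uncalibrated, so it is better read as a ranking score than as a probability. The sketch below isolates that step in a standalone function. `pseudo_confidence` is a hypothetical name, and the numerically stable branch is an addition for illustration: `math.exp(-score)` raises `OverflowError` for very negative scores, which the commit's version would swallow and report as no confidence.

```python
import math


def pseudo_confidence(score, predicted_positive):
    """Squash a signed decision score into [0, 1] and align it with the predicted class.

    `score` is assumed to be the 1-D decision_function value for one sample,
    with positive values favouring the positive class.
    """
    # Stable sigmoid: never calls math.exp on a large positive argument.
    if score >= 0:
        p_pos = 1.0 / (1.0 + math.exp(-score))
    else:
        e = math.exp(score)
        p_pos = e / (1.0 + e)
    # Report the confidence of whichever class was actually predicted.
    return p_pos if predicted_positive else 1.0 - p_pos


if __name__ == "__main__":
    print(pseudo_confidence(0.0, True))
    print(pseudo_confidence(-1000.0, False))
```

If calibrated probabilities matter, scikit-learn's `CalibratedClassifierCV` is the usual route. Whether it suits the saved models here is not something this diff shows.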
notebooks/02_eda.ipynb CHANGED
@@ -524,7 +524,7 @@
524
  },
525
  {
526
  "cell_type": "code",
527
- "execution_count": 12,
528
  "id": "2da64228",
529
  "metadata": {},
530
  "outputs": [
@@ -552,6 +552,7 @@
552
  "plt.title('Distribution of Target Classes in Sample Train Dataset')\n",
553
  "plt.xlabel('Target Class')\n",
554
  "plt.ylabel('Count')\n",
 
555
  "plt.show()"
556
  ]
557
  },
@@ -577,7 +578,7 @@
577
  },
578
  {
579
  "cell_type": "code",
580
- "execution_count": 13,
581
  "id": "aaa59508",
582
  "metadata": {},
583
  "outputs": [
@@ -607,6 +608,7 @@
607
  "plt.title('Distribution of Target Classes in Balanced Sample')\n",
608
  "plt.xlabel('Target Class')\n",
609
  "plt.ylabel('Count')\n",
 
610
  "plt.show()"
611
  ]
612
  },
@@ -750,7 +752,7 @@
750
  },
751
  {
752
  "cell_type": "code",
753
- "execution_count": 16,
754
  "id": "18e51117",
755
  "metadata": {},
756
  "outputs": [
@@ -769,6 +771,7 @@
769
  "plt.figure(figsize=(8, 5))\n",
770
  "sns.boxplot(x='review_target', y='review_content_char_count', data=balanced_sample_train)\n",
771
  "plt.title('Review Character Count by Review Target for Review Content')\n",
 
772
  "plt.show()"
773
  ]
774
  },
@@ -898,7 +901,7 @@
898
  },
899
  {
900
  "cell_type": "code",
901
- "execution_count": 18,
902
  "id": "6d2ff248",
903
  "metadata": {},
904
  "outputs": [
@@ -917,6 +920,7 @@
917
  "plt.figure(figsize=(8, 5))\n",
918
  "sns.boxplot(x='review_target', y='review_content_word_count', data=balanced_sample_train)\n",
919
  "plt.title('Review Word Count by Review Target for Review Content')\n",
 
920
  "plt.show()"
921
  ]
922
  },
@@ -1047,7 +1051,7 @@
1047
  },
1048
  {
1049
  "cell_type": "code",
1050
- "execution_count": 21,
1051
  "id": "b75a8f64",
1052
  "metadata": {},
1053
  "outputs": [
@@ -1070,6 +1074,7 @@
1070
  "plt.subplot(1,2,2)\n",
1071
  "sns.histplot(balanced_sample_train['review_content_word_count'], bins=50, kde=True)\n",
1072
  "plt.title('Word count distribution for review content')\n",
 
1073
  "plt.show()"
1074
  ]
1075
  },
@@ -1256,7 +1261,7 @@
1256
  ],
1257
  "metadata": {
1258
  "kernelspec": {
1259
- "display_name": ".venv",
1260
  "language": "python",
1261
  "name": "python3"
1262
  },
@@ -1270,7 +1275,7 @@
1270
  "name": "python",
1271
  "nbconvert_exporter": "python",
1272
  "pygments_lexer": "ipython3",
1273
- "version": "3.13.11"
1274
  }
1275
  },
1276
  "nbformat": 4,
 
524
  },
525
  {
526
  "cell_type": "code",
527
+ "execution_count": null,
528
  "id": "2da64228",
529
  "metadata": {},
530
  "outputs": [
 
552
  "plt.title('Distribution of Target Classes in Sample Train Dataset')\n",
553
  "plt.xlabel('Target Class')\n",
554
  "plt.ylabel('Count')\n",
555
+ "plt.savefig('docs/02_results/target_class_distribution.png', dpi=300, bbox_inches='tight')\n",
556
  "plt.show()"
557
  ]
558
  },
 
578
  },
579
  {
580
  "cell_type": "code",
581
+ "execution_count": null,
582
  "id": "aaa59508",
583
  "metadata": {},
584
  "outputs": [
 
608
  "plt.title('Distribution of Target Classes in Balanced Sample')\n",
609
  "plt.xlabel('Target Class')\n",
610
  "plt.ylabel('Count')\n",
611
+ "plt.savefig('docs/02_results/balanced_target_class_distribution.png', dpi=300, bbox_inches='tight')\n",
612
  "plt.show()"
613
  ]
614
  },
 
752
  },
753
  {
754
  "cell_type": "code",
755
+ "execution_count": null,
756
  "id": "18e51117",
757
  "metadata": {},
758
  "outputs": [
 
771
  "plt.figure(figsize=(8, 5))\n",
772
  "sns.boxplot(x='review_target', y='review_content_char_count', data=balanced_sample_train)\n",
773
  "plt.title('Review Character Count by Review Target for Review Content')\n",
774
+ "plt.savefig('docs/02_results/balanced_review_content_char_count.png', dpi=300, bbox_inches='tight')\n",
775
  "plt.show()"
776
  ]
777
  },
 
901
  },
902
  {
903
  "cell_type": "code",
904
+ "execution_count": null,
905
  "id": "6d2ff248",
906
  "metadata": {},
907
  "outputs": [
 
920
  "plt.figure(figsize=(8, 5))\n",
921
  "sns.boxplot(x='review_target', y='review_content_word_count', data=balanced_sample_train)\n",
922
  "plt.title('Review Word Count by Review Target for Review Content')\n",
923
+ "plt.savefig('docs/02_results/balanced_review_content_word_count.png', dpi=300, bbox_inches='tight')\n",
924
  "plt.show()"
925
  ]
926
  },
 
1051
  },
1052
  {
1053
  "cell_type": "code",
1054
+ "execution_count": null,
1055
  "id": "b75a8f64",
1056
  "metadata": {},
1057
  "outputs": [
 
1074
  "plt.subplot(1,2,2)\n",
1075
  "sns.histplot(balanced_sample_train['review_content_word_count'], bins=50, kde=True)\n",
1076
  "plt.title('Word count distribution for review content')\n",
1077
+ "plt.savefig('docs/02_results/balanced_review_content_word_count_char_count.png', dpi=300, bbox_inches='tight')\n",
1078
  "plt.show()"
1079
  ]
1080
  },
 
1261
  ],
1262
  "metadata": {
1263
  "kernelspec": {
1264
+ "display_name": "mlqueens",
1265
  "language": "python",
1266
  "name": "python3"
1267
  },
 
1275
  "name": "python",
1276
  "nbconvert_exporter": "python",
1277
  "pygments_lexer": "ipython3",
1278
+ "version": "3.12.12"
1279
  }
1280
  },
1281
  "nbformat": 4,
notebooks/03_data_preprocessing.ipynb CHANGED
@@ -1034,7 +1034,7 @@
1034
  ],
1035
  "metadata": {
1036
  "kernelspec": {
1037
- "display_name": ".venv",
1038
  "language": "python",
1039
  "name": "python3"
1040
  },
@@ -1048,7 +1048,7 @@
1048
  "name": "python",
1049
  "nbconvert_exporter": "python",
1050
  "pygments_lexer": "ipython3",
1051
- "version": "3.13.11"
1052
  }
1053
  },
1054
  "nbformat": 4,
 
1034
  ],
1035
  "metadata": {
1036
  "kernelspec": {
1037
+ "display_name": "mlqueens",
1038
  "language": "python",
1039
  "name": "python3"
1040
  },
 
1048
  "name": "python",
1049
  "nbconvert_exporter": "python",
1050
  "pygments_lexer": "ipython3",
1051
+ "version": "3.12.12"
1052
  }
1053
  },
1054
  "nbformat": 4,
notebooks/04_feature_engineering.ipynb CHANGED
@@ -1935,7 +1935,7 @@
1935
  ],
1936
  "metadata": {
1937
  "kernelspec": {
1938
- "display_name": ".venv",
1939
  "language": "python",
1940
  "name": "python3"
1941
  },
@@ -1949,7 +1949,7 @@
1949
  "name": "python",
1950
  "nbconvert_exporter": "python",
1951
  "pygments_lexer": "ipython3",
1952
- "version": "3.13.11"
1953
  }
1954
  },
1955
  "nbformat": 4,
 
1935
  ],
1936
  "metadata": {
1937
  "kernelspec": {
1938
+ "display_name": "mlqueens",
1939
  "language": "python",
1940
  "name": "python3"
1941
  },
 
1949
  "name": "python",
1950
  "nbconvert_exporter": "python",
1951
  "pygments_lexer": "ipython3",
1952
+ "version": "3.12.12"
1953
  }
1954
  },
1955
  "nbformat": 4,
notebooks/05_logistic_regression.ipynb CHANGED
@@ -321,7 +321,7 @@
321
  },
322
  {
323
  "cell_type": "code",
324
- "execution_count": 11,
325
  "id": "bea893df",
326
  "metadata": {},
327
  "outputs": [
@@ -354,7 +354,7 @@
354
  "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
355
  "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
356
  "plt.title('Logistic Regression - Validation Confusion Matrix')\n",
357
- "plt.savefig('data/predictions/logistic_regression_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
358
  "plt.show()\n",
359
  "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
360
  "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
@@ -446,7 +446,7 @@
446
  },
447
  {
448
  "cell_type": "code",
449
- "execution_count": 14,
450
  "id": "059822c9",
451
  "metadata": {},
452
  "outputs": [
@@ -479,7 +479,7 @@
479
  "cm_test = confusion_matrix(y_test, y_test_pred)\n",
480
  "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
481
  "plt.title('Logistic Regression - Testing Confusion Matrix')\n",
482
- "plt.savefig('data/predictions/logistic_regression_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
483
  "plt.show()\n",
484
  "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
485
  "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
@@ -506,7 +506,7 @@
506
  },
507
  {
508
  "cell_type": "code",
509
- "execution_count": 15,
510
  "id": "8ffb2e59",
511
  "metadata": {},
512
  "outputs": [
@@ -539,7 +539,7 @@
539
  "axes[1].set_xlabel('Predicted')\n",
540
  "\n",
541
  "plt.tight_layout()\n",
542
- "plt.savefig('data/predictions/logistic_regression_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
543
  "plt.show()"
544
  ]
545
  },
 
321
  },
322
  {
323
  "cell_type": "code",
324
+ "execution_count": null,
325
  "id": "bea893df",
326
  "metadata": {},
327
  "outputs": [
 
354
  "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
355
  "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
356
  "plt.title('Logistic Regression - Validation Confusion Matrix')\n",
357
+ "plt.savefig('docs/02_results/logistic_regression_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
358
  "plt.show()\n",
359
  "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
360
  "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
 
446
  },
447
  {
448
  "cell_type": "code",
449
+ "execution_count": null,
450
  "id": "059822c9",
451
  "metadata": {},
452
  "outputs": [
 
479
  "cm_test = confusion_matrix(y_test, y_test_pred)\n",
480
  "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
481
  "plt.title('Logistic Regression - Testing Confusion Matrix')\n",
482
+ "plt.savefig('docs/02_results/logistic_regression_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
483
  "plt.show()\n",
484
  "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
485
  "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
 
506
  },
507
  {
508
  "cell_type": "code",
509
+ "execution_count": null,
510
  "id": "8ffb2e59",
511
  "metadata": {},
512
  "outputs": [
 
539
  "axes[1].set_xlabel('Predicted')\n",
540
  "\n",
541
  "plt.tight_layout()\n",
542
+ "plt.savefig('docs/02_results/logistic_regression_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
543
  "plt.show()"
544
  ]
545
  },
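Review note: the notebooks now write figures to `docs/02_results/` through a relative path. `plt.savefig` does not create missing directories, and a relative path is resolved against the kernel's working directory, so these cells may fail when run from inside `notebooks/`. The sketch below shows one way to build the target path and create the directory first. `results_path` and its `root` parameter are hypothetical and not part of this repository.

```python
from pathlib import Path


def results_path(filename, root="."):
    """Return <root>/docs/02_results/<filename>, creating the directory if needed.

    `root` is assumed to point at the project root; adjust it to wherever the
    notebook kernel is started.
    """
    out_dir = Path(root) / "docs" / "02_results"
    # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


if __name__ == "__main__":
    print(results_path("logistic_regression_validation_confusion_matrix.png"))
```

A cell could then call `plt.savefig(results_path('logistic_regression_validation_confusion_matrix.png'), dpi=300, bbox_inches='tight')`, passing `root` if the kernel does not start at the project root.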
notebooks/06_naive_bayes.ipynb CHANGED
@@ -619,7 +619,7 @@
619
  },
620
  {
621
  "cell_type": "code",
622
- "execution_count": 26,
623
  "id": "ac4fa4a0",
624
  "metadata": {},
625
  "outputs": [
@@ -652,7 +652,7 @@
652
  "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
653
  "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
654
  "plt.title('Naive Bayes - Validation Confusion Matrix')\n",
655
- "plt.savefig('data/predictions/naive_bayes_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
656
  "plt.show()\n",
657
  "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
658
  "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
@@ -744,7 +744,7 @@
744
  },
745
  {
746
  "cell_type": "code",
747
- "execution_count": 29,
748
  "id": "e14b37a2",
749
  "metadata": {},
750
  "outputs": [
@@ -777,7 +777,7 @@
777
  "cm_test = confusion_matrix(y_test, y_test_pred)\n",
778
  "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
779
  "plt.title('Naive Bayes - Testing Confusion Matrix')\n",
780
- "plt.savefig('data/predictions/naive_bayes_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
781
  "plt.show()\n",
782
  "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
783
  "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
@@ -804,7 +804,7 @@
804
  },
805
  {
806
  "cell_type": "code",
807
- "execution_count": 30,
808
  "id": "0cfe7623",
809
  "metadata": {},
810
  "outputs": [
@@ -836,7 +836,7 @@
836
  "axes[1].set_ylabel('Actual')\n",
837
  "axes[1].set_xlabel('Predicted')\n",
838
  "plt.tight_layout()\n",
839
- "plt.savefig('data/predictions/naive_bayes_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
840
  "plt.show()"
841
  ]
842
  },
 
619
  },
620
  {
621
  "cell_type": "code",
622
+ "execution_count": null,
623
  "id": "ac4fa4a0",
624
  "metadata": {},
625
  "outputs": [
 
652
  "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
653
  "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
654
  "plt.title('Naive Bayes - Validation Confusion Matrix')\n",
655
+ "plt.savefig('docs/02_results/naive_bayes_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
656
  "plt.show()\n",
657
  "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
658
  "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
 
744
  },
745
  {
746
  "cell_type": "code",
747
+ "execution_count": null,
748
  "id": "e14b37a2",
749
  "metadata": {},
750
  "outputs": [
 
777
  "cm_test = confusion_matrix(y_test, y_test_pred)\n",
778
  "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
779
  "plt.title('Naive Bayes - Testing Confusion Matrix')\n",
780
+ "plt.savefig('docs/02_results/naive_bayes_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
781
  "plt.show()\n",
782
  "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
783
  "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
 
804
  },
805
  {
806
  "cell_type": "code",
807
+ "execution_count": null,
808
  "id": "0cfe7623",
809
  "metadata": {},
810
  "outputs": [
 
836
  "axes[1].set_ylabel('Actual')\n",
837
  "axes[1].set_xlabel('Predicted')\n",
838
  "plt.tight_layout()\n",
839
+ "plt.savefig('docs/02_results/naive_bayes_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
840
  "plt.show()"
841
  ]
842
  },
notebooks/07_support_vector_machine.ipynb CHANGED
@@ -471,7 +471,7 @@
471
  },
472
  {
473
  "cell_type": "code",
474
- "execution_count": 17,
475
  "id": "18",
476
  "metadata": {
477
  "colab": {
@@ -520,7 +520,7 @@
520
  "plt.ylabel('True Label')\n",
521
  "plt.xlabel('Predicted Label')\n",
522
  "plt.tight_layout()\n",
523
- "plt.savefig('svm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
524
  "plt.show()\n",
525
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
526
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -541,7 +541,7 @@
541
  },
542
  {
543
  "cell_type": "code",
544
- "execution_count": 24,
545
  "id": "20",
546
  "metadata": {
547
  "colab": {
@@ -624,7 +624,7 @@
624
  "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
625
  "axes[1].set_xlabel('Coefficient Value')\n",
626
  "plt.tight_layout()\n",
627
- "plt.savefig('svm_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
628
  "plt.show()\n",
629
  "print(\"\\nTop 10 Positive Coefficients (Positive Sentiment Indicators):\")\n",
630
  "print(top_positive.to_string(index=False))\n",
@@ -644,7 +644,7 @@
644
  },
645
  {
646
  "cell_type": "code",
647
- "execution_count": 19,
648
  "id": "22",
649
  "metadata": {
650
  "colab": {
@@ -691,7 +691,7 @@
691
  "plt.title('SVM Decision Function Scores Distribution')\n",
692
  "plt.legend()\n",
693
  "plt.tight_layout()\n",
694
- "plt.savefig('svm_decision_scores.png', dpi=300, bbox_inches='tight')\n",
695
  "plt.show()\n",
696
  "print(f\"\\nDecision Function Statistics:\")\n",
697
  "print(f\"Mean score for positive reviews: {decision_scores[y_test == 1].mean():.4f}\")\n",
@@ -827,7 +827,7 @@
827
  "provenance": []
828
  },
829
  "kernelspec": {
830
- "display_name": ".venv",
831
  "language": "python",
832
  "name": "python3"
833
  },
@@ -841,7 +841,7 @@
841
  "name": "python",
842
  "nbconvert_exporter": "python",
843
  "pygments_lexer": "ipython3",
844
- "version": "3.13.11"
845
  }
846
  },
847
  "nbformat": 4,
 
471
  },
472
  {
473
  "cell_type": "code",
474
+ "execution_count": null,
475
  "id": "18",
476
  "metadata": {
477
  "colab": {
 
520
  "plt.ylabel('True Label')\n",
521
  "plt.xlabel('Predicted Label')\n",
522
  "plt.tight_layout()\n",
523
+ "plt.savefig('docs/02_results/svm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
524
  "plt.show()\n",
525
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
526
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
541
  },
542
  {
543
  "cell_type": "code",
544
+ "execution_count": null,
545
  "id": "20",
546
  "metadata": {
547
  "colab": {
 
624
  "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
625
  "axes[1].set_xlabel('Coefficient Value')\n",
626
  "plt.tight_layout()\n",
627
+ "plt.savefig('docs/02_results/svm_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
628
  "plt.show()\n",
629
  "print(\"\\nTop 10 Positive Coefficients (Positive Sentiment Indicators):\")\n",
630
  "print(top_positive.to_string(index=False))\n",
 
644
  },
645
  {
646
  "cell_type": "code",
647
+ "execution_count": null,
648
  "id": "22",
649
  "metadata": {
650
  "colab": {
 
691
  "plt.title('SVM Decision Function Scores Distribution')\n",
692
  "plt.legend()\n",
693
  "plt.tight_layout()\n",
694
+ "plt.savefig('docs/02_results/svm_decision_scores.png', dpi=300, bbox_inches='tight')\n",
695
  "plt.show()\n",
696
  "print(f\"\\nDecision Function Statistics:\")\n",
697
  "print(f\"Mean score for positive reviews: {decision_scores[y_test == 1].mean():.4f}\")\n",
 
827
  "provenance": []
828
  },
829
  "kernelspec": {
830
+ "display_name": "mlqueens",
831
  "language": "python",
832
  "name": "python3"
833
  },
 
841
  "name": "python",
842
  "nbconvert_exporter": "python",
843
  "pygments_lexer": "ipython3",
844
+ "version": "3.12.12"
845
  }
846
  },
847
  "nbformat": 4,
notebooks/08_k_nearest_neighbors.ipynb CHANGED
@@ -412,7 +412,7 @@
412
  },
413
  {
414
  "cell_type": "code",
415
- "execution_count": 14,
416
  "id": "21",
417
  "metadata": {},
418
  "outputs": [
@@ -454,7 +454,7 @@
454
  "plt.ylabel('True Label')\n",
455
  "plt.xlabel('Predicted Label')\n",
456
  "plt.tight_layout()\n",
457
- "plt.savefig('knn_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
458
  "plt.show()\n",
459
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
460
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -475,7 +475,7 @@
475
  },
476
  {
477
  "cell_type": "code",
478
- "execution_count": 15,
479
  "id": "23",
480
  "metadata": {},
481
  "outputs": [
@@ -511,7 +511,7 @@
511
  "plt.legend(title='Distance Metric')\n",
512
  "plt.grid(True, alpha=0.3)\n",
513
  "plt.tight_layout()\n",
514
- "plt.savefig('knn_k_sensitivity.png', dpi=300, bbox_inches='tight')\n",
515
  "plt.show()"
516
  ]
517
  },
@@ -631,7 +631,7 @@
631
  ],
632
  "metadata": {
633
  "kernelspec": {
634
- "display_name": ".venv",
635
  "language": "python",
636
  "name": "python3"
637
  },
@@ -645,7 +645,7 @@
645
  "name": "python",
646
  "nbconvert_exporter": "python",
647
  "pygments_lexer": "ipython3",
648
- "version": "3.13.11"
649
  }
650
  },
651
  "nbformat": 4,
 
412
  },
413
  {
414
  "cell_type": "code",
415
+ "execution_count": null,
416
  "id": "21",
417
  "metadata": {},
418
  "outputs": [
 
454
  "plt.ylabel('True Label')\n",
455
  "plt.xlabel('Predicted Label')\n",
456
  "plt.tight_layout()\n",
457
+ "plt.savefig('docs/02_results/knn_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
458
  "plt.show()\n",
459
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
460
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
475
  },
476
  {
477
  "cell_type": "code",
478
+ "execution_count": null,
479
  "id": "23",
480
  "metadata": {},
481
  "outputs": [
 
511
  "plt.legend(title='Distance Metric')\n",
512
  "plt.grid(True, alpha=0.3)\n",
513
  "plt.tight_layout()\n",
514
+ "plt.savefig('docs/02_results/knn_k_sensitivity.png', dpi=300, bbox_inches='tight')\n",
515
  "plt.show()"
516
  ]
517
  },
 
631
  ],
632
  "metadata": {
633
  "kernelspec": {
634
+ "display_name": "mlqueens",
635
  "language": "python",
636
  "name": "python3"
637
  },
 
645
  "name": "python",
646
  "nbconvert_exporter": "python",
647
  "pygments_lexer": "ipython3",
648
+ "version": "3.12.12"
649
  }
650
  },
651
  "nbformat": 4,
notebooks/09_decision_trees.ipynb CHANGED
@@ -349,7 +349,7 @@
349
  "plt.title('Decision Tree: max_depth vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
350
  "plt.grid(True, alpha=0.3)\n",
351
  "plt.tight_layout()\n",
352
- "plt.savefig('dt_maxdepth_sensitivity.png', dpi=300, bbox_inches='tight')\n",
353
  "plt.show()\n",
354
  "print(\"\\nMax Depth Performance Analysis:\")\n",
355
  "print(depth_performance.to_string(index=False))"
@@ -500,7 +500,7 @@
500
  },
501
  {
502
  "cell_type": "code",
503
- "execution_count": 14,
504
  "id": "18",
505
  "metadata": {
506
  "colab": {
@@ -549,7 +549,7 @@
549
  "plt.ylabel('True Label')\n",
550
  "plt.xlabel('Predicted Label')\n",
551
  "plt.tight_layout()\n",
552
- "plt.savefig('decision_tree_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
553
  "plt.show()\n",
554
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
555
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -570,7 +570,7 @@
570
  },
571
  {
572
  "cell_type": "code",
573
- "execution_count": 25,
574
  "id": "20",
575
  "metadata": {
576
  "colab": {
@@ -638,7 +638,7 @@
638
  "plt.xlabel('Importance Score')\n",
639
  "plt.ylabel('Features')\n",
640
  "plt.tight_layout()\n",
641
- "plt.savefig('decision_tree_feature_importance.png', dpi=300, bbox_inches='tight')\n",
642
  "plt.show()\n",
643
  "print(\"\\nTop 20 Important Features:\")\n",
644
  "print(importance_df.to_string(index=False))"
@@ -755,7 +755,7 @@
755
  "provenance": []
756
  },
757
  "kernelspec": {
758
- "display_name": ".venv",
759
  "language": "python",
760
  "name": "python3"
761
  },
@@ -769,7 +769,7 @@
769
  "name": "python",
770
  "nbconvert_exporter": "python",
771
  "pygments_lexer": "ipython3",
772
- "version": "3.13.11"
773
  }
774
  },
775
  "nbformat": 4,
 
349
  "plt.title('Decision Tree: max_depth vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
350
  "plt.grid(True, alpha=0.3)\n",
351
  "plt.tight_layout()\n",
352
+ "plt.savefig('docs/02_results/dt_maxdepth_sensitivity.png', dpi=300, bbox_inches='tight')\n",
353
  "plt.show()\n",
354
  "print(\"\\nMax Depth Performance Analysis:\")\n",
355
  "print(depth_performance.to_string(index=False))"
 
500
  },
501
  {
502
  "cell_type": "code",
503
+ "execution_count": null,
504
  "id": "18",
505
  "metadata": {
506
  "colab": {
 
549
  "plt.ylabel('True Label')\n",
550
  "plt.xlabel('Predicted Label')\n",
551
  "plt.tight_layout()\n",
552
+ "plt.savefig('docs/02_results/decision_tree_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
553
  "plt.show()\n",
554
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
555
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
570
  },
571
  {
572
  "cell_type": "code",
573
+ "execution_count": null,
574
  "id": "20",
575
  "metadata": {
576
  "colab": {
 
638
  "plt.xlabel('Importance Score')\n",
639
  "plt.ylabel('Features')\n",
640
  "plt.tight_layout()\n",
641
+ "plt.savefig('docs/02_results/decision_tree_feature_importance.png', dpi=300, bbox_inches='tight')\n",
642
  "plt.show()\n",
643
  "print(\"\\nTop 20 Important Features:\")\n",
644
  "print(importance_df.to_string(index=False))"
 
755
  "provenance": []
756
  },
757
  "kernelspec": {
758
+ "display_name": "mlqueens",
759
  "language": "python",
760
  "name": "python3"
761
  },
 
769
  "name": "python",
770
  "nbconvert_exporter": "python",
771
  "pygments_lexer": "ipython3",
772
+ "version": "3.12.12"
773
  }
774
  },
775
  "nbformat": 4,
notebooks/10_random_forest.ipynb CHANGED
@@ -310,7 +310,7 @@
310
  },
311
  {
312
  "cell_type": "code",
313
- "execution_count": 9,
314
  "id": "1309ffb1",
315
  "metadata": {},
316
  "outputs": [
@@ -356,7 +356,7 @@
356
  "plt.title('Random Forest: n_estimators vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
357
  "plt.grid(True, alpha=0.3)\n",
358
  "plt.tight_layout()\n",
359
- "plt.savefig('rf_nestimators_sensitivity.png', dpi=300, bbox_inches='tight')\n",
360
  "plt.show()\n",
361
  "\n",
362
  "print(\"\\nNumber of Estimators Performance Analysis:\")\n",
@@ -508,7 +508,7 @@
508
  },
509
  {
510
  "cell_type": "code",
511
- "execution_count": 14,
512
  "id": "18",
513
  "metadata": {
514
  "colab": {
@@ -557,7 +557,7 @@
557
  "plt.ylabel('True Label')\n",
558
  "plt.xlabel('Predicted Label')\n",
559
  "plt.tight_layout()\n",
560
- "plt.savefig('random_forest_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
561
  "plt.show()\n",
562
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
563
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -578,7 +578,7 @@
578
  },
579
  {
580
  "cell_type": "code",
581
- "execution_count": 15,
582
  "id": "20",
583
  "metadata": {
584
  "colab": {
@@ -646,7 +646,7 @@
646
  "plt.xlabel('Importance Score')\n",
647
  "plt.ylabel('Features')\n",
648
  "plt.tight_layout()\n",
649
- "plt.savefig('random_forest_feature_importance.png', dpi=300, bbox_inches='tight')\n",
650
  "plt.show()\n",
651
  "print(\"\\nTop 20 Important Features:\")\n",
652
  "print(importance_df.to_string(index=False))"
@@ -766,7 +766,7 @@
766
  "provenance": []
767
  },
768
  "kernelspec": {
769
- "display_name": ".venv",
770
  "language": "python",
771
  "name": "python3"
772
  },
@@ -780,7 +780,7 @@
780
  "name": "python",
781
  "nbconvert_exporter": "python",
782
  "pygments_lexer": "ipython3",
783
- "version": "3.13.11"
784
  }
785
  },
786
  "nbformat": 4,
 
310
  },
311
  {
312
  "cell_type": "code",
313
+ "execution_count": null,
314
  "id": "1309ffb1",
315
  "metadata": {},
316
  "outputs": [
 
356
  "plt.title('Random Forest: n_estimators vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
357
  "plt.grid(True, alpha=0.3)\n",
358
  "plt.tight_layout()\n",
359
+ "plt.savefig('docs/02_results/rf_nestimators_sensitivity.png', dpi=300, bbox_inches='tight')\n",
360
  "plt.show()\n",
361
  "\n",
362
  "print(\"\\nNumber of Estimators Performance Analysis:\")\n",
 
508
  },
509
  {
510
  "cell_type": "code",
511
+ "execution_count": null,
512
  "id": "18",
513
  "metadata": {
514
  "colab": {
 
557
  "plt.ylabel('True Label')\n",
558
  "plt.xlabel('Predicted Label')\n",
559
  "plt.tight_layout()\n",
560
+ "plt.savefig('docs/02_results/random_forest_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
561
  "plt.show()\n",
562
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
563
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
578
  },
579
  {
580
  "cell_type": "code",
581
+ "execution_count": null,
582
  "id": "20",
583
  "metadata": {
584
  "colab": {
 
646
  "plt.xlabel('Importance Score')\n",
647
  "plt.ylabel('Features')\n",
648
  "plt.tight_layout()\n",
649
+ "plt.savefig('docs/02_results/random_forest_feature_importance.png', dpi=300, bbox_inches='tight')\n",
650
  "plt.show()\n",
651
  "print(\"\\nTop 20 Important Features:\")\n",
652
  "print(importance_df.to_string(index=False))"
 
766
  "provenance": []
767
  },
768
  "kernelspec": {
769
+ "display_name": "mlqueens",
770
  "language": "python",
771
  "name": "python3"
772
  },
 
780
  "name": "python",
781
  "nbconvert_exporter": "python",
782
  "pygments_lexer": "ipython3",
783
+ "version": "3.12.12"
784
  }
785
  },
786
  "nbformat": 4,
notebooks/11_stochastic_gradient_descent.ipynb CHANGED
@@ -304,7 +304,7 @@
304
  },
305
  {
306
  "cell_type": "code",
307
- "execution_count": 9,
308
  "id": "7c849d0a",
309
  "metadata": {},
310
  "outputs": [
@@ -351,7 +351,7 @@
351
  "plt.xticks(np.arange(len(alpha_performance)), [f'{a:.5f}' for a in alpha_performance['alpha']], rotation=45)\n",
352
  "plt.grid(True, alpha=0.3)\n",
353
  "plt.tight_layout()\n",
354
- "plt.savefig('sgd_alpha_sensitivity.png', dpi=300, bbox_inches='tight')\n",
355
  "plt.show()\n",
356
  "\n",
357
  "print(\"\\nRegularization Strength (alpha) Performance Analysis:\")\n",
@@ -506,7 +506,7 @@
506
  },
507
  {
508
  "cell_type": "code",
509
- "execution_count": 13,
510
  "id": "18",
511
  "metadata": {
512
  "colab": {
@@ -555,7 +555,7 @@
555
  "plt.ylabel('True Label')\n",
556
  "plt.xlabel('Predicted Label')\n",
557
  "plt.tight_layout()\n",
558
- "plt.savefig('sgd_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
559
  "plt.show()\n",
560
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
561
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -576,7 +576,7 @@
576
  },
577
  {
578
  "cell_type": "code",
579
- "execution_count": 14,
580
  "id": "20",
581
  "metadata": {
582
  "colab": {
@@ -652,7 +652,7 @@
652
  "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
653
  "axes[1].set_xlabel('Coefficient Value')\n",
654
  "plt.tight_layout()\n",
655
- "plt.savefig('sgd_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
656
  "plt.show()\n",
657
  "print(\"\\nTop 10 Positive Coefficients:\")\n",
658
  "print(top_positive.to_string(index=False))\n",
@@ -825,7 +825,7 @@
825
  "provenance": []
826
  },
827
  "kernelspec": {
828
- "display_name": ".venv",
829
  "language": "python",
830
  "name": "python3"
831
  },
@@ -839,7 +839,7 @@
839
  "name": "python",
840
  "nbconvert_exporter": "python",
841
  "pygments_lexer": "ipython3",
842
- "version": "3.13.11"
843
  }
844
  },
845
  "nbformat": 4,
 
304
  },
305
  {
306
  "cell_type": "code",
307
+ "execution_count": null,
308
  "id": "7c849d0a",
309
  "metadata": {},
310
  "outputs": [
 
351
  "plt.xticks(np.arange(len(alpha_performance)), [f'{a:.5f}' for a in alpha_performance['alpha']], rotation=45)\n",
352
  "plt.grid(True, alpha=0.3)\n",
353
  "plt.tight_layout()\n",
354
+ "plt.savefig('docs/02_results/sgd_alpha_sensitivity.png', dpi=300, bbox_inches='tight')\n",
355
  "plt.show()\n",
356
  "\n",
357
  "print(\"\\nRegularization Strength (alpha) Performance Analysis:\")\n",
 
506
  },
507
  {
508
  "cell_type": "code",
509
+ "execution_count": null,
510
  "id": "18",
511
  "metadata": {
512
  "colab": {
 
555
  "plt.ylabel('True Label')\n",
556
  "plt.xlabel('Predicted Label')\n",
557
  "plt.tight_layout()\n",
558
+ "plt.savefig('docs/02_results/sgd_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
559
  "plt.show()\n",
560
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
561
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
576
  },
577
  {
578
  "cell_type": "code",
579
+ "execution_count": null,
580
  "id": "20",
581
  "metadata": {
582
  "colab": {
 
652
  "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
653
  "axes[1].set_xlabel('Coefficient Value')\n",
654
  "plt.tight_layout()\n",
655
+ "plt.savefig('docs/02_results/sgd_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
656
  "plt.show()\n",
657
  "print(\"\\nTop 10 Positive Coefficients:\")\n",
658
  "print(top_positive.to_string(index=False))\n",
 
825
  "provenance": []
826
  },
827
  "kernelspec": {
828
+ "display_name": "mlqueens",
829
  "language": "python",
830
  "name": "python3"
831
  },
 
839
  "name": "python",
840
  "nbconvert_exporter": "python",
841
  "pygments_lexer": "ipython3",
842
+ "version": "3.12.12"
843
  }
844
  },
845
  "nbformat": 4,
notebooks/12_xgboost.ipynb CHANGED
@@ -328,7 +328,7 @@
328
  },
329
  {
330
  "cell_type": "code",
331
- "execution_count": 12,
332
  "id": "6fae4f6d",
333
  "metadata": {},
334
  "outputs": [
@@ -373,7 +373,7 @@
373
  "plt.xticks(np.arange(len(lr_performance)), [f'{lr:.3f}' for lr in lr_performance['learning_rate']])\n",
374
  "plt.grid(True, alpha=0.3)\n",
375
  "plt.tight_layout()\n",
376
- "plt.savefig('xgb_learning_rate_sensitivity.png', dpi=300, bbox_inches='tight')\n",
377
  "plt.show()\n",
378
  "print(\"\\nLearning Rate Performance Analysis:\")\n",
379
  "print(lr_performance.to_string(index=False))"
@@ -527,7 +527,7 @@
527
  },
528
  {
529
  "cell_type": "code",
530
- "execution_count": 16,
531
  "id": "18",
532
  "metadata": {
533
  "colab": {
@@ -576,7 +576,7 @@
576
  "plt.ylabel('True Label')\n",
577
  "plt.xlabel('Predicted Label')\n",
578
  "plt.tight_layout()\n",
579
- "plt.savefig('xgboost_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
580
  "plt.show()\n",
581
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
582
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -597,7 +597,7 @@
597
  },
598
  {
599
  "cell_type": "code",
600
- "execution_count": 17,
601
  "id": "20",
602
  "metadata": {
603
  "colab": {
@@ -665,7 +665,7 @@
665
  "plt.xlabel('Importance Score')\n",
666
  "plt.ylabel('Features')\n",
667
  "plt.tight_layout()\n",
668
- "plt.savefig('xgboost_feature_importance.png', dpi=300, bbox_inches='tight')\n",
669
  "plt.show()\n",
670
  "print(\"\\nTop 20 Important Features:\")\n",
671
  "print(importance_df.to_string(index=False))"
@@ -851,7 +851,7 @@
851
  "provenance": []
852
  },
853
  "kernelspec": {
854
- "display_name": ".venv",
855
  "language": "python",
856
  "name": "python3"
857
  },
@@ -865,7 +865,7 @@
865
  "name": "python",
866
  "nbconvert_exporter": "python",
867
  "pygments_lexer": "ipython3",
868
- "version": "3.13.11"
869
  }
870
  },
871
  "nbformat": 4,
 
328
  },
329
  {
330
  "cell_type": "code",
331
+ "execution_count": null,
332
  "id": "6fae4f6d",
333
  "metadata": {},
334
  "outputs": [
 
373
  "plt.xticks(np.arange(len(lr_performance)), [f'{lr:.3f}' for lr in lr_performance['learning_rate']])\n",
374
  "plt.grid(True, alpha=0.3)\n",
375
  "plt.tight_layout()\n",
376
+ "plt.savefig('docs/02_results/xgb_learning_rate_sensitivity.png', dpi=300, bbox_inches='tight')\n",
377
  "plt.show()\n",
378
  "print(\"\\nLearning Rate Performance Analysis:\")\n",
379
  "print(lr_performance.to_string(index=False))"
 
527
  },
528
  {
529
  "cell_type": "code",
530
+ "execution_count": null,
531
  "id": "18",
532
  "metadata": {
533
  "colab": {
 
576
  "plt.ylabel('True Label')\n",
577
  "plt.xlabel('Predicted Label')\n",
578
  "plt.tight_layout()\n",
579
+ "plt.savefig('docs/02_results/xgboost_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
580
  "plt.show()\n",
581
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
582
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
597
  },
598
  {
599
  "cell_type": "code",
600
+ "execution_count": null,
601
  "id": "20",
602
  "metadata": {
603
  "colab": {
 
665
  "plt.xlabel('Importance Score')\n",
666
  "plt.ylabel('Features')\n",
667
  "plt.tight_layout()\n",
668
+ "plt.savefig('docs/02_results/xgboost_feature_importance.png', dpi=300, bbox_inches='tight')\n",
669
  "plt.show()\n",
670
  "print(\"\\nTop 20 Important Features:\")\n",
671
  "print(importance_df.to_string(index=False))"
 
851
  "provenance": []
852
  },
853
  "kernelspec": {
854
+ "display_name": "mlqueens",
855
  "language": "python",
856
  "name": "python3"
857
  },
 
865
  "name": "python",
866
  "nbconvert_exporter": "python",
867
  "pygments_lexer": "ipython3",
868
+ "version": "3.12.12"
869
  }
870
  },
871
  "nbformat": 4,
notebooks/13_lightgbm.ipynb CHANGED
@@ -331,7 +331,7 @@
331
  },
332
  {
333
  "cell_type": "code",
334
- "execution_count": 10,
335
  "id": "4097cd49",
336
  "metadata": {},
337
  "outputs": [
@@ -376,7 +376,7 @@
376
  "plt.xticks(np.arange(len(leaves_performance)), leaves_performance['num_leaves'].astype(int))\n",
377
  "plt.grid(True, alpha=0.3)\n",
378
  "plt.tight_layout()\n",
379
- "plt.savefig('lgb_num_leaves_sensitivity.png', dpi=300, bbox_inches='tight')\n",
380
  "plt.show()\n",
381
  "print(\"\\nTree Complexity (num_leaves) Performance Analysis:\")\n",
382
  "print(leaves_performance.to_string(index=False))"
@@ -530,7 +530,7 @@
530
  },
531
  {
532
  "cell_type": "code",
533
- "execution_count": 14,
534
  "id": "18",
535
  "metadata": {
536
  "colab": {
@@ -579,7 +579,7 @@
579
  "plt.ylabel('True Label')\n",
580
  "plt.xlabel('Predicted Label')\n",
581
  "plt.tight_layout()\n",
582
- "plt.savefig('lightgbm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
583
  "plt.show()\n",
584
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
585
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -600,7 +600,7 @@
600
  },
601
  {
602
  "cell_type": "code",
603
- "execution_count": 15,
604
  "id": "20",
605
  "metadata": {
606
  "colab": {
@@ -668,7 +668,7 @@
668
  "plt.xlabel('Importance Score')\n",
669
  "plt.ylabel('Features')\n",
670
  "plt.tight_layout()\n",
671
- "plt.savefig('lightgbm_feature_importance.png', dpi=300, bbox_inches='tight')\n",
672
  "plt.show()\n",
673
  "print(\"\\nTop 20 Important Features:\")\n",
674
  "print(importance_df.to_string(index=False))"
@@ -857,7 +857,7 @@
857
  "provenance": []
858
  },
859
  "kernelspec": {
860
- "display_name": ".venv",
861
  "language": "python",
862
  "name": "python3"
863
  },
@@ -871,7 +871,7 @@
871
  "name": "python",
872
  "nbconvert_exporter": "python",
873
  "pygments_lexer": "ipython3",
874
- "version": "3.13.11"
875
  }
876
  },
877
  "nbformat": 4,
 
331
  },
332
  {
333
  "cell_type": "code",
334
+ "execution_count": null,
335
  "id": "4097cd49",
336
  "metadata": {},
337
  "outputs": [
 
376
  "plt.xticks(np.arange(len(leaves_performance)), leaves_performance['num_leaves'].astype(int))\n",
377
  "plt.grid(True, alpha=0.3)\n",
378
  "plt.tight_layout()\n",
379
+ "plt.savefig('docs/02_results/lgb_num_leaves_sensitivity.png', dpi=300, bbox_inches='tight')\n",
380
  "plt.show()\n",
381
  "print(\"\\nTree Complexity (num_leaves) Performance Analysis:\")\n",
382
  "print(leaves_performance.to_string(index=False))"
 
530
  },
531
  {
532
  "cell_type": "code",
533
+ "execution_count": null,
534
  "id": "18",
535
  "metadata": {
536
  "colab": {
 
579
  "plt.ylabel('True Label')\n",
580
  "plt.xlabel('Predicted Label')\n",
581
  "plt.tight_layout()\n",
582
+ "plt.savefig('docs/02_results/lightgbm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
583
  "plt.show()\n",
584
  "print(f\"Confusion Matrix:\\n{cm}\")\n",
585
  "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
 
600
  },
601
  {
602
  "cell_type": "code",
603
+ "execution_count": null,
604
  "id": "20",
605
  "metadata": {
606
  "colab": {
 
668
  "plt.xlabel('Importance Score')\n",
669
  "plt.ylabel('Features')\n",
670
  "plt.tight_layout()\n",
671
+ "plt.savefig('docs/02_results/lightgbm_feature_importance.png', dpi=300, bbox_inches='tight')\n",
672
  "plt.show()\n",
673
  "print(\"\\nTop 20 Important Features:\")\n",
674
  "print(importance_df.to_string(index=False))"
 
857
  "provenance": []
858
  },
859
  "kernelspec": {
860
+ "display_name": "mlqueens",
861
  "language": "python",
862
  "name": "python3"
863
  },
 
871
  "name": "python",
872
  "nbconvert_exporter": "python",
873
  "pygments_lexer": "ipython3",
874
+ "version": "3.12.12"
875
  }
876
  },
877
  "nbformat": 4,
src/utils/helpers.py CHANGED
@@ -109,58 +109,57 @@ def apply_balance(df: pd.DataFrame, target_col: str = "target", random_state: in
109
 
110
 
111
  def plot_top_ngrams(corpus, n=1, top_k=20, stop_words='english', max_features=20000, figsize=(10,6), title=None):
112
- """
113
- Compute and plot the top n-grams from a text corpus.
114
-
115
- Parameters
116
- ----------
117
- corpus : iterable-like
118
- Iterable of text documents (e.g., pandas Series).
119
- n : int, optional
120
- The n in n-grams (uses ngram_range=(n,n)). Default is 1 (unigrams).
121
- top_k : int, optional
122
- Number of top n-grams to show. Default is 20.
123
- stop_words : str or list, optional
124
- Stop words parameter forwarded to CountVectorizer. Default 'english'.
125
- max_features : int, optional
126
- Max features for the vectorizer. Default 20000.
127
- figsize : tuple, optional
128
- Figure size for the plot.
129
- title : str, optional
130
- Custom title for the plot. If None, a default title is used.
131
-
132
- Returns
133
- -------
134
- list of (term, count)
135
- The top n-grams and their counts (sorted descending).
136
- """
137
-
138
- vec = CountVectorizer(ngram_range=(n, n), stop_words=stop_words, max_features=max_features)
139
- X = vec.fit_transform(corpus)
140
- sums = np.array(X.sum(axis=0)).ravel()
141
- terms = np.array(vec.get_feature_names_out())
142
-
143
- if terms.size == 0:
144
- print("No terms found for the given corpus/parameters.")
145
- return []
146
-
147
- top_idx = sums.argsort()[::-1][:top_k]
148
- top_terms = terms[top_idx]
149
- top_counts = sums[top_idx]
150
-
151
- # Plot horizontal bar chart with largest on top
152
- plt.figure(figsize=figsize)
153
- plt.barh(top_terms[::-1], top_counts[::-1], color='steelblue')
154
- plt.xlabel("Count")
155
- plt.tight_layout()
156
- if title is None:
157
- title = f"Top {min(top_k, len(top_terms))} {n}-grams"
158
- plt.title(title)
159
- plt.show()
160
-
161
- return list(zip(top_terms, top_counts))
162
 
163
 
164
 
165
  # preprocessing notebook
166
  def clean_text(s):
@@ -292,7 +291,7 @@ def show_top_ngrams_by_class(df, target_col='review_target', text_col='review_cl
292
  plt.title(f"Top {len(terms)} {nname} for class {cls}")
293
  plt.xlabel("Count")
294
  plt.tight_layout()
295
- plt.savefig(f'data/predictions/top_{nname}_for_class_{cls}.png', dpi=300, bbox_inches='tight')
296
  plt.show()
297
 
298
  return results
@@ -391,7 +390,7 @@ def plot_dimensionality_reduction(X, labels, method='PCA', sample=1000, random_s
391
  plt.xlabel('dim1')
392
  plt.ylabel('dim2')
393
  plt.title(f'{method} projection')
394
- plt.savefig(f'data/predictions/{method}_projection_{data_name}_{i}.png', dpi=300, bbox_inches='tight')
395
  plt.show()
396
 
397
  return emb
 
109
 
110
 
111
  def plot_top_ngrams(corpus, n=1, top_k=20, stop_words='english', max_features=20000, figsize=(10,6), title=None):
112
+ """
113
+ Compute and plot the top n-grams from a text corpus.
114
+
115
+ Parameters
116
+ ----------
117
+ corpus : iterable-like
118
+ Iterable of text documents (e.g., pandas Series).
119
+ n : int, optional
120
+ The n in n-grams (uses ngram_range=(n,n)). Default is 1 (unigrams).
121
+ top_k : int, optional
122
+ Number of top n-grams to show. Default is 20.
123
+ stop_words : str or list, optional
124
+ Stop words parameter forwarded to CountVectorizer. Default 'english'.
125
+ max_features : int, optional
126
+ Max features for the vectorizer. Default 20000.
127
+ figsize : tuple, optional
128
+ Figure size for the plot.
129
+ title : str, optional
130
+ Custom title for the plot. If None, a default title is used.
131
+
132
+ Returns
133
+ -------
134
+ list of (term, count)
135
+ The top n-grams and their counts (sorted descending).
136
+ """
137
 
138
+ vec = CountVectorizer(ngram_range=(n, n), stop_words=stop_words, max_features=max_features)
139
+ X = vec.fit_transform(corpus)
140
+ sums = np.array(X.sum(axis=0)).ravel()
141
+ terms = np.array(vec.get_feature_names_out())
142
 
143
+ if terms.size == 0:
144
+ print("No terms found for the given corpus/parameters.")
145
+ return []
146
+
147
+ top_idx = sums.argsort()[::-1][:top_k]
148
+ top_terms = terms[top_idx]
149
+ top_counts = sums[top_idx]
150
+
151
+ # Plot horizontal bar chart with largest on top
152
+ plt.figure(figsize=figsize)
153
+ plt.barh(top_terms[::-1], top_counts[::-1], color='steelblue')
154
+ plt.xlabel("Count")
155
+ plt.tight_layout()
156
+ if title is None:
157
+ title = f"Top {min(top_k, len(top_terms))} {n}-grams"
158
+ plt.title(title)
159
+ plt.savefig(f'docs/02_results/top_{top_k}_{n}grams.png', dpi=300, bbox_inches='tight')
160
+ plt.show()
161
+
162
+ return list(zip(top_terms, top_counts))
163
 
164
  # preprocessing notebook
165
  def clean_text(s):
 
291
  plt.title(f"Top {len(terms)} {nname} for class {cls}")
292
  plt.xlabel("Count")
293
  plt.tight_layout()
294
+ plt.savefig(f'docs/02_results/top_{nname}_for_class_{cls}.png', dpi=300, bbox_inches='tight')
295
  plt.show()
296
 
297
  return results
 
390
  plt.xlabel('dim1')
391
  plt.ylabel('dim2')
392
  plt.title(f'{method} projection')
393
+ plt.savefig(f'docs/02_results/{method}_projection_{data_name}_{i}.png', dpi=300, bbox_inches='tight')
394
  plt.show()
395
 
396
  return emb
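
Review note on the new `docs/02_results/` paths: every `plt.savefig` call in this commit uses a relative path, so the file lands relative to the current working directory of the kernel or script, and `savefig` raises `FileNotFoundError` if the target directory does not exist. The sketch below is one possible way to make the output location independent of where a notebook is launched. It is not part of the commit. The names `find_project_root` and `results_path` are hypothetical, and using `requirements.txt` as the marker for the project root is an assumption based on the README setup step.

```python
from pathlib import Path


def find_project_root(start=None, marker="requirements.txt"):
    """Walk upward from `start` (default: cwd) to the first directory containing `marker`.

    `marker` is an assumption: any file that only exists at the repository root works.
    """
    current = Path(start) if start is not None else Path.cwd()
    current = current.resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no directory at or above {current} contains {marker!r}")


def results_path(filename, root=None, subdir="docs/02_results"):
    """Return <root>/<subdir>/<filename>, creating the directory if it is missing."""
    root = Path(root) if root is not None else find_project_root()
    out_dir = root / subdir
    # parents=True also creates docs/ when needed; exist_ok=True makes repeated calls safe.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename
```

With a helper like this, a call such as `plt.savefig(results_path('sgd_confusion_matrix.png'), dpi=300, bbox_inches='tight')` would write to the same folder whether the notebook is started from the repository root or from `notebooks/`, since `savefig` accepts path-like objects.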