elsayedelmandoh/sentiment-sleuth

elsayedelmandoh committed on Mar 10

Commit c8cd7f6, 1 parent: 7ffdb9e

upload results and update path in savefig function


Files changed (15)

README.md +14 -6
app.py +54 -5
notebooks/02_eda.ipynb +12 -7
notebooks/03_data_preprocessing.ipynb +2 -2
notebooks/04_feature_engineering.ipynb +2 -2
notebooks/05_logistic_regression.ipynb +6 -6
notebooks/06_naive_bayes.ipynb +6 -6
notebooks/07_support_vector_machine.ipynb +8 -8
notebooks/08_k_nearest_neighbors.ipynb +6 -6
notebooks/09_decision_trees.ipynb +7 -7
notebooks/10_random_forest.ipynb +8 -8
notebooks/11_stochastic_gradient_descent.ipynb +8 -8
notebooks/12_xgboost.ipynb +8 -8
notebooks/13_lightgbm.ipynb +8 -8
src/utils/helpers.py +51 -52

README.md CHANGED

@@ -1,5 +1,14 @@
 # Sentiment Sleuth
 ## Table of Contents
 - [Overview](#overview)
 - [Key Features](#key-features)
@@ -12,9 +21,8 @@
 - [Contributing](#contributing)
 - [Author](#author)
----
-## Overview
 This is a project for performing sentiment analysis on Amazon product reviews using classical machine-learning models. The project includes data processing and feature engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI to analyze custom review text.
 Key components in the repository:
@@ -27,13 +35,13 @@ Key components in the repository:
 The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (`Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost and LightGBM`) so you can compare predictions and confidence scores side-by-side.
----
 ## Key Features
 * **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
 * **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
 * **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering and model training are included under `notebooks/`.
----
 ## Setup
 0. Prerequisites
 Before running this project, ensure you have the following installed:
@@ -60,7 +68,7 @@ pip install -r requirements.txt
 3. Environment Variables
 Create a `.env` file at the project root and add any necessary API keys or configuration variables
----
 ## Usage
 This project uses Streamlit for the interactive UI. Start the app locally with one of the following commands:
@@ -84,7 +92,7 @@ The `notebooks/` directory contains step-by-step analysis and model training not
 Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
----
 ## Contributing
 Contributions are welcome! If you'd like to improve this project, please follow these steps:
 1. Fork the repository.

 # Sentiment Sleuth
+[![github](https://img.shields.io/badge/GitHub-sentiment__sleuth-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/elsayedelmandoh/sentiment-sleuth)
+[![huggingface](https://img.shields.io/badge/Space-Hugging%20Face-yellow?style=for-the-badge&logo=huggingface&logoColor=black)](https://elsayedelmandoh-sentiment-sleuth.hf.space)
+<p align="center">
+  <img src="docs/02_results/pos_results.png" alt="Sentiment Sleuth — pos results" width="45%">
+  &nbsp; &nbsp;
+  <img src="docs/02_results/neg_results.png" alt="Sentiment Sleuth — neg results" width="45%">
+</p>
 ## Table of Contents
 - [Overview](#overview)
 - [Key Features](#key-features)
 - [Contributing](#contributing)
 - [Author](#author)
+## Overview
 This is a project for performing sentiment analysis on Amazon product reviews using classical machine-learning models. The project includes data processing and feature engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI to analyze custom review text.
 Key components in the repository:
 The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (`Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost and LightGBM`) so you can compare predictions and confidence scores side-by-side.
 ## Key Features
 * **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
 * **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
 * **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering and model training are included under `notebooks/`.
 ## Setup
 0. Prerequisites
 Before running this project, ensure you have the following installed:
 3. Environment Variables
 Create a `.env` file at the project root and add any necessary API keys or configuration variables
 ## Usage
 This project uses Streamlit for the interactive UI. Start the app locally with one of the following commands:
 Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
 ## Contributing
 Contributions are welcome! If you'd like to improve this project, please follow these steps:
 1. Fork the repository.

app.py CHANGED

@@ -28,6 +28,7 @@ def _safe_predict(model, X):
 	def _try_predict(input_X):
 		pred = model.predict(input_X)[0]
 		prob = None
 		if hasattr(model, "predict_proba"):
 			try:
 				probs = model.predict_proba(input_X)[0]
@@ -41,6 +42,29 @@ def _safe_predict(model, X):
 					prob = float(probs.max())
 			except Exception:
 				prob = None
 		return pred, prob
 	try:
@@ -59,25 +83,50 @@ def _safe_predict(model, X):
 		return None, None, f"predict failed: {e1}"
-def map_label(pred):
 	if pred is None:
 		return "Unknown"
-	# numeric encodings
 	try:
 		p = int(pred)
-		if p in (1,):
 			return "Negative"
 		if p == 2:
 			return "Positive"
 	except Exception:
 		pass
-	# string encodings
 	if isinstance(pred, str):
 		l = pred.lower()
 		if "neg" in l:
 			return "Negative"
 		if "pos" in l:
 			return "Positive"
 	return str(pred)
@@ -152,7 +201,7 @@ def main():
 			with col:
 				st.subheader(name)
 				raw, prob, err = _safe_predict(model, X)
-				label = map_label(raw)
 				if label == "Positive":
 					st.success(label)
 				elif label == "Negative":

 	def _try_predict(input_X):
 		pred = model.predict(input_X)[0]
 		prob = None
+		# First try predict_proba when available
 		if hasattr(model, "predict_proba"):
 			try:
 				probs = model.predict_proba(input_X)[0]
 					prob = float(probs.max())
 			except Exception:
 				prob = None
+		# If no predict_proba, try decision_function fallback for an approximate confidence
+		elif hasattr(model, "decision_function"):
+			try:
+				score = model.decision_function(input_X)
+				# decision_function can return (n_samples,) or (n_samples, n_classes)
+				if hasattr(score, '__len__') and getattr(score, 'ndim', 0) == 1:
+					score_val = float(score[0])
+					# convert distance to a pseudo-probability via a sigmoid
+					prob_pos = 1.0 / (1.0 + __import__('math').exp(-score_val))
+					# If classes_ available, align probability to predicted class
+					if hasattr(model, 'classes_') and len(model.classes_) >= 2:
+						# assume classes_[1] corresponds to the positive side of decision_function
+						if pred == model.classes_[1]:
+							prob = float(prob_pos)
+						else:
+							prob = float(1.0 - prob_pos)
+					else:
+						prob = float(max(min(prob_pos, 1.0), 0.0))
+				else:
+					# multi-dimensional decision function — skip
+					prob = None
+			except Exception:
+				prob = None
 		return pred, prob
 	try:
 		return None, None, f"predict failed: {e1}"
+def map_label(pred, model=None):
+	"""Map a raw model prediction to a human label.
+	Supports both common encodings used in this repo:
+	- {0,1} where 0 -> Negative, 1 -> Positive
+	- {1,2} where 1 -> Negative, 2 -> Positive
+	If `model` is provided and has `classes_`, we use that to disambiguate.
+	"""
 	if pred is None:
 		return "Unknown"
+	# If model provides classes_, prefer that mapping
+	try:
+		if model is not None and hasattr(model, 'classes_'):
+			classes = tuple(model.classes_)
+			if set(classes) == {0, 1}:
+				p = int(pred)
+				return "Negative" if p == 0 else "Positive"
+			if set(classes) == {1, 2}:
+				p = int(pred)
+				return "Negative" if p == 1 else "Positive"
+	except Exception:
+		pass
+	# Fallback heuristics
 	try:
 		p = int(pred)
+		if p == 0:
 			return "Negative"
+		if p == 1:
+			return "Positive"
 		if p == 2:
 			return "Positive"
 	except Exception:
 		pass
 	if isinstance(pred, str):
 		l = pred.lower()
 		if "neg" in l:
 			return "Negative"
 		if "pos" in l:
 			return "Positive"
 	return str(pred)
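
The new `map_label(pred, model)` signature exists because the raw value `1` is ambiguous: it means Positive under the `{0,1}` encoding and Negative under `{1,2}`. A minimal table-driven sketch of the same disambiguation follows. The names `ENCODINGS` and `label_from_classes` are hypothetical and not part of the repository, and unlike the committed fallback heuristics this sketch returns the raw value instead of guessing when no class list is known.

```python
from typing import Iterable, Optional

# Hypothetical lookup table: each known label set maps raw class -> human label.
ENCODINGS = {
    frozenset({0, 1}): {0: "Negative", 1: "Positive"},
    frozenset({1, 2}): {1: "Negative", 2: "Positive"},
}


def label_from_classes(pred, classes: Optional[Iterable] = None) -> str:
    """Map a raw prediction to a label using the model's class list, if given."""
    if pred is None:
        return "Unknown"
    if classes is not None:
        try:
            # Normalise NumPy integers and similar types to plain ints.
            key = frozenset(int(c) for c in classes)
            mapping = ENCODINGS.get(key)
            if mapping is not None:
                return mapping.get(int(pred), str(pred))
        except (TypeError, ValueError):
            # Non-numeric classes (e.g. strings): fall through to the raw value.
            pass
    return str(pred)


print(label_from_classes(1, [0, 1]))  # Positive
print(label_from_classes(1, [1, 2]))  # Negative
```

In the app, the `classes` argument would presumably come from `model.classes_`, as the committed code does.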
 			with col:
 				st.subheader(name)
 				raw, prob, err = _safe_predict(model, X)
+				label = map_label(raw, model)
 				if label == "Positive":
 					st.success(label)
 				elif label == "Negative":
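
The `decision_function` fallback added to `_try_predict` squashes a signed margin through a sigmoid to get a pseudo-probability. As committed, `math.exp(-score_val)` raises `OverflowError` for very negative scores, which the surrounding `except` turns into `prob = None`. The sketch below shows one way to avoid that, assuming a binary model whose positive scores correspond to `classes[1]` (the scikit-learn convention). `sigmoid` and `pseudo_confidence` are illustrative names, not functions from the repository.

```python
import math


def sigmoid(score: float) -> float:
    """Numerically stable logistic function."""
    if score >= 0:
        # exp(-score) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)  # score < 0, so exp(score) is below 1
    return z / (1.0 + z)


def pseudo_confidence(score: float, pred, classes) -> float:
    """Approximate confidence in `pred` from a binary decision score.

    Assumes classes[1] is the class on the positive side of the margin.
    """
    p_pos = sigmoid(score)
    return p_pos if pred == classes[1] else 1.0 - p_pos


print(pseudo_confidence(2.0, 2, [1, 2]))
print(pseudo_confidence(-1000.0, 1, [1, 2]))
```

The result is a monotone rescaling of the margin, not a calibrated probability, so it is best read as a rough ranking signal next to the `predict_proba` values of the other models.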

notebooks/02_eda.ipynb CHANGED

@@ -524,7 +524,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
    "id": "2da64228",
    "metadata": {},
    "outputs": [
@@ -552,6 +552,7 @@
     "plt.title('Distribution of Target Classes in Sample Train Dataset')\n",
     "plt.xlabel('Target Class')\n",
     "plt.ylabel('Count')\n",
     "plt.show()"
    ]
   },
@@ -577,7 +578,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
    "id": "aaa59508",
    "metadata": {},
    "outputs": [
@@ -607,6 +608,7 @@
     "plt.title('Distribution of Target Classes in Balanced Sample')\n",
     "plt.xlabel('Target Class')\n",
     "plt.ylabel('Count')\n",
     "plt.show()"
    ]
   },
@@ -750,7 +752,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
    "id": "18e51117",
    "metadata": {},
    "outputs": [
@@ -769,6 +771,7 @@
     "plt.figure(figsize=(8, 5))\n",
     "sns.boxplot(x='review_target', y='review_content_char_count', data=balanced_sample_train)\n",
     "plt.title('Review Character Count by Review Target for Review Content')\n",
     "plt.show()"
    ]
   },
@@ -898,7 +901,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
    "id": "6d2ff248",
    "metadata": {},
    "outputs": [
@@ -917,6 +920,7 @@
     "plt.figure(figsize=(8, 5))\n",
     "sns.boxplot(x='review_target', y='review_content_word_count', data=balanced_sample_train)\n",
     "plt.title('Review Word Count by Review Target for Review Content')\n",
     "plt.show()"
    ]
   },
@@ -1047,7 +1051,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
    "id": "b75a8f64",
    "metadata": {},
    "outputs": [
@@ -1070,6 +1074,7 @@
     "plt.subplot(1,2,2)\n",
     "sns.histplot(balanced_sample_train['review_content_word_count'], bins=50, kde=True)\n",
     "plt.title('Word count distribution for review content')\n",
     "plt.show()"
    ]
   },
@@ -1256,7 +1261,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": ".venv",
    "language": "python",
    "name": "python3"
   },
@@ -1270,7 +1275,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.11"
   }
  },
  "nbformat": 4,

   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "2da64228",
    "metadata": {},
    "outputs": [
     "plt.title('Distribution of Target Classes in Sample Train Dataset')\n",
     "plt.xlabel('Target Class')\n",
     "plt.ylabel('Count')\n",
+    "plt.savefig('docs/02_results/target_class_distribution.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "aaa59508",
    "metadata": {},
    "outputs": [
     "plt.title('Distribution of Target Classes in Balanced Sample')\n",
     "plt.xlabel('Target Class')\n",
     "plt.ylabel('Count')\n",
+    "plt.savefig('docs/02_results/balanced_target_class_distribution.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "18e51117",
    "metadata": {},
    "outputs": [
     "plt.figure(figsize=(8, 5))\n",
     "sns.boxplot(x='review_target', y='review_content_char_count', data=balanced_sample_train)\n",
     "plt.title('Review Character Count by Review Target for Review Content')\n",
+    "plt.savefig('docs/02_results/balanced_review_content_char_count.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "6d2ff248",
    "metadata": {},
    "outputs": [
     "plt.figure(figsize=(8, 5))\n",
     "sns.boxplot(x='review_target', y='review_content_word_count', data=balanced_sample_train)\n",
     "plt.title('Review Word Count by Review Target for Review Content')\n",
+    "plt.savefig('docs/02_results/balanced_review_content_word_count.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "b75a8f64",
    "metadata": {},
    "outputs": [
     "plt.subplot(1,2,2)\n",
     "sns.histplot(balanced_sample_train['review_content_word_count'], bins=50, kde=True)\n",
     "plt.title('Word count distribution for review content')\n",
+    "plt.savefig('docs/02_results/balanced_review_content_word_count_char_count.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },
  ],
  "metadata": {
   "kernelspec": {
+   "display_name": "mlqueens",
    "language": "python",
    "name": "python3"
   },
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
+   "version": "3.12.12"
   }
  },
  "nbformat": 4,
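
The added `plt.savefig('docs/02_results/...')` calls use a relative path, which resolves against the kernel's working directory and fails if the target directory does not exist. A small sketch of a path helper that sidesteps both issues is shown here. `results_path` and its `project_root` parameter are hypothetical and are not taken from `src/utils/helpers.py`.

```python
from pathlib import Path


def results_path(filename: str, project_root: Path,
                 subdir: str = "docs/02_results") -> Path:
    """Return <project_root>/<subdir>/<filename>, creating the directory."""
    out_dir = Path(project_root) / subdir
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / filename


# Possible use in a notebook cell, assuming the kernel runs from notebooks/:
#   plt.savefig(results_path("target_class_distribution.png", Path.cwd().parent),
#               dpi=300, bbox_inches="tight")
```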

notebooks/03_data_preprocessing.ipynb CHANGED

@@ -1034,7 +1034,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": ".venv",
    "language": "python",
    "name": "python3"
   },
@@ -1048,7 +1048,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.11"
   }
  },
  "nbformat": 4,

  ],
  "metadata": {
   "kernelspec": {
+   "display_name": "mlqueens",
    "language": "python",
    "name": "python3"
   },
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
+   "version": "3.12.12"
   }
  },
  "nbformat": 4,
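
Several notebook hunks in this commit change only `execution_count`, the kernel display name, or the Python version, which is environment noise and not a change to the analysis. One way to keep the counters out of future diffs is to clear them before committing. The sketch below does this with the standard `json` module only. The function names are illustrative, and it assumes nbformat 4 notebooks like the ones shown here.

```python
import json


def strip_execution_counts(nb: dict) -> dict:
    """Clear execution counters on code cells and their execute_result outputs."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        cell["execution_count"] = None
        for output in cell.get("outputs", []):
            if output.get("output_type") == "execute_result":
                output["execution_count"] = None
    return nb


def normalize_notebook_text(text: str) -> str:
    """Parse notebook JSON, strip counters, and serialise it again."""
    nb = strip_execution_counts(json.loads(text))
    return json.dumps(nb, indent=1, ensure_ascii=False) + "\n"
```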

notebooks/04_feature_engineering.ipynb CHANGED

@@ -1935,7 +1935,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": ".venv",
    "language": "python",
    "name": "python3"
   },
@@ -1949,7 +1949,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.11"
   }
  },
  "nbformat": 4,

  ],
  "metadata": {
   "kernelspec": {
+   "display_name": "mlqueens",
    "language": "python",
    "name": "python3"
   },
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
+   "version": "3.12.12"
   }
  },
  "nbformat": 4,

notebooks/05_logistic_regression.ipynb CHANGED

@@ -321,7 +321,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
    "id": "bea893df",
    "metadata": {},
    "outputs": [
@@ -354,7 +354,7 @@
     "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
     "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Logistic Regression - Validation Confusion Matrix')\n",
-    "plt.savefig('data/predictions/logistic_regression_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
     "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
@@ -446,7 +446,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
    "id": "059822c9",
    "metadata": {},
    "outputs": [
@@ -479,7 +479,7 @@
     "cm_test = confusion_matrix(y_test, y_test_pred)\n",
     "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Logistic Regression - Testing Confusion Matrix')\n",
-    "plt.savefig('data/predictions/logistic_regression_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
     "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
@@ -506,7 +506,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
    "id": "8ffb2e59",
    "metadata": {},
    "outputs": [
@@ -539,7 +539,7 @@
     "axes[1].set_xlabel('Predicted')\n",
     "\n",
     "plt.tight_layout()\n",
-    "plt.savefig('data/predictions/logistic_regression_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },

   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "bea893df",
    "metadata": {},
    "outputs": [
     "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
     "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Logistic Regression - Validation Confusion Matrix')\n",
+    "plt.savefig('docs/02_results/logistic_regression_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
     "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "059822c9",
    "metadata": {},
    "outputs": [
     "cm_test = confusion_matrix(y_test, y_test_pred)\n",
     "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Logistic Regression - Testing Confusion Matrix')\n",
+    "plt.savefig('docs/02_results/logistic_regression_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
     "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "8ffb2e59",
    "metadata": {},
    "outputs": [
     "axes[1].set_xlabel('Predicted')\n",
     "\n",
     "plt.tight_layout()\n",
+    "plt.savefig('docs/02_results/logistic_regression_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },
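
The notebook cells above print `cm[0,0]` as "True Negatives", which relies on the confusion-matrix layout scikit-learn documents: rows are actual classes, columns are predicted classes, and classes appear in sorted label order. The pure-Python sketch below reproduces that layout on toy data so the indexing is easy to check. `binary_confusion` is a hypothetical helper, not the scikit-learn function.

```python
from typing import Sequence


def binary_confusion(y_true: Sequence[int], y_pred: Sequence[int],
                     labels=(0, 1)):
    """2x2 matrix with rows = actual, columns = predicted, ordered as `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[index[actual]][index[predicted]] += 1
    return cm


cm = binary_confusion([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
tn, fp = cm[0]  # first row: actual negatives
fn, tp = cm[1]  # second row: actual positives
print(tn, fp, fn, tp)  # 1 1 1 2
```

Because sorted order puts the negative class first under both the `{0,1}` and `{1,2}` encodings, `display_labels=['Negative', 'Positive']` lines up with the matrix in either case.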

notebooks/06_naive_bayes.ipynb CHANGED

@@ -619,7 +619,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 26,
    "id": "ac4fa4a0",
    "metadata": {},
    "outputs": [
@@ -652,7 +652,7 @@
     "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
     "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Naive Bayes - Validation Confusion Matrix')\n",
-    "plt.savefig('data/predictions/naive_bayes_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
     "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
@@ -744,7 +744,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 29,
    "id": "e14b37a2",
    "metadata": {},
    "outputs": [
@@ -777,7 +777,7 @@
     "cm_test = confusion_matrix(y_test, y_test_pred)\n",
     "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Naive Bayes - Testing Confusion Matrix')\n",
-    "plt.savefig('data/predictions/naive_bayes_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
     "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
@@ -804,7 +804,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 30,
    "id": "0cfe7623",
    "metadata": {},
    "outputs": [
@@ -836,7 +836,7 @@
     "axes[1].set_ylabel('Actual')\n",
     "axes[1].set_xlabel('Predicted')\n",
     "plt.tight_layout()\n",
-    "plt.savefig('data/predictions/naive_bayes_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },

   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "ac4fa4a0",
    "metadata": {},
    "outputs": [
     "cm_valid = confusion_matrix(y_valid, y_valid_pred)\n",
     "ConfusionMatrixDisplay(cm_valid, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Naive Bayes - Validation Confusion Matrix')\n",
+    "plt.savefig('docs/02_results/naive_bayes_validation_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_valid}\")\n",
     "print(f\"\\nTrue Negatives: {cm_valid[0,0]}\")\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "e14b37a2",
    "metadata": {},
    "outputs": [
     "cm_test = confusion_matrix(y_test, y_test_pred)\n",
     "ConfusionMatrixDisplay(cm_test, display_labels=['Negative', 'Positive']).plot(cmap='Blues')\n",
     "plt.title('Naive Bayes - Testing Confusion Matrix')\n",
+    "plt.savefig('docs/02_results/naive_bayes_testing_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm_test}\")\n",
     "print(f\"\\nTrue Negatives: {cm_test[0,0]}\")\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "0cfe7623",
    "metadata": {},
    "outputs": [
     "axes[1].set_ylabel('Actual')\n",
     "axes[1].set_xlabel('Predicted')\n",
     "plt.tight_layout()\n",
+    "plt.savefig('docs/02_results/naive_bayes_valid_test_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()"
    ]
   },

notebooks/07_support_vector_machine.ipynb CHANGED

@@ -471,7 +471,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 17,
       "id": "18",
       "metadata": {
         "colab": {
@@ -520,7 +520,7 @@
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('svm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -541,7 +541,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 24,
       "id": "20",
       "metadata": {
         "colab": {
@@ -624,7 +624,7 @@
         "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
         "axes[1].set_xlabel('Coefficient Value')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('svm_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 10 Positive Coefficients (Positive Sentiment Indicators):\")\n",
         "print(top_positive.to_string(index=False))\n",
@@ -644,7 +644,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 19,
       "id": "22",
       "metadata": {
         "colab": {
@@ -691,7 +691,7 @@
         "plt.title('SVM Decision Function Scores Distribution')\n",
         "plt.legend()\n",
         "plt.tight_layout()\n",
-        "plt.savefig('svm_decision_scores.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"\\nDecision Function Statistics:\")\n",
         "print(f\"Mean score for positive reviews: {decision_scores[y_test == 1].mean():.4f}\")\n",
@@ -827,7 +827,7 @@
       "provenance": []
     },
     "kernelspec": {
-      "display_name": ".venv",
       "language": "python",
       "name": "python3"
     },
@@ -841,7 +841,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.13.11"
     }
   },
   "nbformat": 4,

     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "18",
       "metadata": {
         "colab": {
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/svm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "20",
       "metadata": {
         "colab": {
         "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
         "axes[1].set_xlabel('Coefficient Value')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/svm_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 10 Positive Coefficients (Positive Sentiment Indicators):\")\n",
         "print(top_positive.to_string(index=False))\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "22",
       "metadata": {
         "colab": {
         "plt.title('SVM Decision Function Scores Distribution')\n",
         "plt.legend()\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/svm_decision_scores.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"\\nDecision Function Statistics:\")\n",
         "print(f\"Mean score for positive reviews: {decision_scores[y_test == 1].mean():.4f}\")\n",
       "provenance": []
     },
     "kernelspec": {
+      "display_name": "mlqueens",
       "language": "python",
       "name": "python3"
     },
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
+      "version": "3.12.12"
     }
   },
   "nbformat": 4,

notebooks/08_k_nearest_neighbors.ipynb CHANGED

@@ -412,7 +412,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 14,
       "id": "21",
       "metadata": {},
       "outputs": [
@@ -454,7 +454,7 @@
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('knn_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -475,7 +475,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 15,
       "id": "23",
       "metadata": {},
       "outputs": [
@@ -511,7 +511,7 @@
         "plt.legend(title='Distance Metric')\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
-        "plt.savefig('knn_k_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()"
       ]
     },
@@ -631,7 +631,7 @@
   ],
   "metadata": {
     "kernelspec": {
-      "display_name": ".venv",
       "language": "python",
       "name": "python3"
     },
@@ -645,7 +645,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.13.11"
     }
   },
   "nbformat": 4,

     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "21",
       "metadata": {},
       "outputs": [
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/knn_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "23",
       "metadata": {},
       "outputs": [
         "plt.legend(title='Distance Metric')\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/knn_k_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()"
       ]
     },
   ],
   "metadata": {
     "kernelspec": {
+      "display_name": "mlqueens",
       "language": "python",
       "name": "python3"
     },
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
+      "version": "3.12.12"
     }
   },
   "nbformat": 4,

notebooks/09_decision_trees.ipynb CHANGED

@@ -349,7 +349,7 @@
         "plt.title('Decision Tree: max_depth vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
-        "plt.savefig('dt_maxdepth_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nMax Depth Performance Analysis:\")\n",
         "print(depth_performance.to_string(index=False))"
@@ -500,7 +500,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 14,
       "id": "18",
       "metadata": {
         "colab": {
@@ -549,7 +549,7 @@
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('decision_tree_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -570,7 +570,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 25,
       "id": "20",
       "metadata": {
         "colab": {
@@ -638,7 +638,7 @@
         "plt.xlabel('Importance Score')\n",
         "plt.ylabel('Features')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('decision_tree_feature_importance.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 20 Important Features:\")\n",
         "print(importance_df.to_string(index=False))"
@@ -755,7 +755,7 @@
       "provenance": []
     },
     "kernelspec": {
-      "display_name": ".venv",
       "language": "python",
       "name": "python3"
     },
@@ -769,7 +769,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.13.11"
     }
   },
   "nbformat": 4,

         "plt.title('Decision Tree: max_depth vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/dt_maxdepth_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nMax Depth Performance Analysis:\")\n",
         "print(depth_performance.to_string(index=False))"
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "18",
       "metadata": {
         "colab": {
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/decision_tree_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "20",
       "metadata": {
         "colab": {
         "plt.xlabel('Importance Score')\n",
         "plt.ylabel('Features')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/decision_tree_feature_importance.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 20 Important Features:\")\n",
         "print(importance_df.to_string(index=False))"
       "provenance": []
     },
     "kernelspec": {
+      "display_name": "mlqueens",
       "language": "python",
       "name": "python3"
     },
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
+      "version": "3.12.12"
     }
   },
   "nbformat": 4,

notebooks/10_random_forest.ipynb CHANGED

@@ -310,7 +310,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 9,
       "id": "1309ffb1",
       "metadata": {},
       "outputs": [
@@ -356,7 +356,7 @@
         "plt.title('Random Forest: n_estimators vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
-        "plt.savefig('rf_nestimators_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "\n",
         "print(\"\\nNumber of Estimators Performance Analysis:\")\n",
@@ -508,7 +508,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 14,
       "id": "18",
       "metadata": {
         "colab": {
@@ -557,7 +557,7 @@
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('random_forest_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -578,7 +578,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 15,
       "id": "20",
       "metadata": {
         "colab": {
@@ -646,7 +646,7 @@
         "plt.xlabel('Importance Score')\n",
         "plt.ylabel('Features')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('random_forest_feature_importance.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 20 Important Features:\")\n",
         "print(importance_df.to_string(index=False))"
@@ -766,7 +766,7 @@
       "provenance": []
     },
     "kernelspec": {
-      "display_name": ".venv",
       "language": "python",
       "name": "python3"
     },
@@ -780,7 +780,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.13.11"
     }
   },
   "nbformat": 4,

     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "1309ffb1",
       "metadata": {},
       "outputs": [
         "plt.title('Random Forest: n_estimators vs F1-Score (3-Fold CV)', fontsize=14, fontweight='bold')\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/rf_nestimators_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "\n",
         "print(\"\\nNumber of Estimators Performance Analysis:\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "18",
       "metadata": {
         "colab": {
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/random_forest_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "20",
       "metadata": {
         "colab": {
         "plt.xlabel('Importance Score')\n",
         "plt.ylabel('Features')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/random_forest_feature_importance.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 20 Important Features:\")\n",
         "print(importance_df.to_string(index=False))"
       "provenance": []
     },
     "kernelspec": {
+      "display_name": "mlqueens",
       "language": "python",
       "name": "python3"
     },
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
+      "version": "3.12.12"
     }
   },
   "nbformat": 4,

notebooks/11_stochastic_gradient_descent.ipynb CHANGED

@@ -304,7 +304,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 9,
       "id": "7c849d0a",
       "metadata": {},
       "outputs": [
@@ -351,7 +351,7 @@
         "plt.xticks(np.arange(len(alpha_performance)), [f'{a:.5f}' for a in alpha_performance['alpha']], rotation=45)\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
-        "plt.savefig('sgd_alpha_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "\n",
         "print(\"\\nRegularization Strength (alpha) Performance Analysis:\")\n",
@@ -506,7 +506,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 13,
       "id": "18",
       "metadata": {
         "colab": {
@@ -555,7 +555,7 @@
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('sgd_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -576,7 +576,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 14,
       "id": "20",
       "metadata": {
         "colab": {
@@ -652,7 +652,7 @@
         "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
         "axes[1].set_xlabel('Coefficient Value')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('sgd_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 10 Positive Coefficients:\")\n",
         "print(top_positive.to_string(index=False))\n",
@@ -825,7 +825,7 @@
       "provenance": []
     },
     "kernelspec": {
-      "display_name": ".venv",
       "language": "python",
       "name": "python3"
     },
@@ -839,7 +839,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.13.11"
     }
   },
   "nbformat": 4,

     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "7c849d0a",
       "metadata": {},
       "outputs": [
         "plt.xticks(np.arange(len(alpha_performance)), [f'{a:.5f}' for a in alpha_performance['alpha']], rotation=45)\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/sgd_alpha_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "\n",
         "print(\"\\nRegularization Strength (alpha) Performance Analysis:\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "18",
       "metadata": {
         "colab": {
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/sgd_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "20",
       "metadata": {
         "colab": {
         "axes[1].set_title('Top 10 Negative Coefficients (Negative Sentiment)')\n",
         "axes[1].set_xlabel('Coefficient Value')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/sgd_feature_coefficients.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 10 Positive Coefficients:\")\n",
         "print(top_positive.to_string(index=False))\n",
       "provenance": []
     },
     "kernelspec": {
+      "display_name": "mlqueens",
       "language": "python",
       "name": "python3"
     },
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
+      "version": "3.12.12"
     }
   },
   "nbformat": 4,

notebooks/12_xgboost.ipynb CHANGED

@@ -328,7 +328,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
    "id": "6fae4f6d",
    "metadata": {},
    "outputs": [
@@ -373,7 +373,7 @@
     "plt.xticks(np.arange(len(lr_performance)), [f'{lr:.3f}' for lr in lr_performance['learning_rate']])\n",
     "plt.grid(True, alpha=0.3)\n",
     "plt.tight_layout()\n",
-    "plt.savefig('xgb_learning_rate_sensitivity.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(\"\\nLearning Rate Performance Analysis:\")\n",
     "print(lr_performance.to_string(index=False))"
@@ -527,7 +527,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
    "id": "18",
    "metadata": {
     "colab": {
@@ -576,7 +576,7 @@
     "plt.ylabel('True Label')\n",
     "plt.xlabel('Predicted Label')\n",
     "plt.tight_layout()\n",
-    "plt.savefig('xgboost_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm}\")\n",
     "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -597,7 +597,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
    "id": "20",
    "metadata": {
     "colab": {
@@ -665,7 +665,7 @@
     "plt.xlabel('Importance Score')\n",
     "plt.ylabel('Features')\n",
     "plt.tight_layout()\n",
-    "plt.savefig('xgboost_feature_importance.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(\"\\nTop 20 Important Features:\")\n",
     "print(importance_df.to_string(index=False))"
@@ -851,7 +851,7 @@
    "provenance": []
   },
   "kernelspec": {
-   "display_name": ".venv",
    "language": "python",
    "name": "python3"
   },
@@ -865,7 +865,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.13.11"
   }
  },
  "nbformat": 4,

   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "6fae4f6d",
    "metadata": {},
    "outputs": [
     "plt.xticks(np.arange(len(lr_performance)), [f'{lr:.3f}' for lr in lr_performance['learning_rate']])\n",
     "plt.grid(True, alpha=0.3)\n",
     "plt.tight_layout()\n",
+    "plt.savefig('docs/02_results/xgb_learning_rate_sensitivity.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(\"\\nLearning Rate Performance Analysis:\")\n",
     "print(lr_performance.to_string(index=False))"
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "18",
    "metadata": {
     "colab": {
     "plt.ylabel('True Label')\n",
     "plt.xlabel('Predicted Label')\n",
     "plt.tight_layout()\n",
+    "plt.savefig('docs/02_results/xgboost_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(f\"Confusion Matrix:\\n{cm}\")\n",
     "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "20",
    "metadata": {
     "colab": {
     "plt.xlabel('Importance Score')\n",
     "plt.ylabel('Features')\n",
     "plt.tight_layout()\n",
+    "plt.savefig('docs/02_results/xgboost_feature_importance.png', dpi=300, bbox_inches='tight')\n",
     "plt.show()\n",
     "print(\"\\nTop 20 Important Features:\")\n",
     "print(importance_df.to_string(index=False))"
    "provenance": []
   },
   "kernelspec": {
+   "display_name": "mlqueens",
    "language": "python",
    "name": "python3"
   },
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
+   "version": "3.12.12"
   }
  },
  "nbformat": 4,

notebooks/13_lightgbm.ipynb CHANGED

@@ -331,7 +331,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 10,
       "id": "4097cd49",
       "metadata": {},
       "outputs": [
@@ -376,7 +376,7 @@
         "plt.xticks(np.arange(len(leaves_performance)), leaves_performance['num_leaves'].astype(int))\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
-        "plt.savefig('lgb_num_leaves_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTree Complexity (num_leaves) Performance Analysis:\")\n",
         "print(leaves_performance.to_string(index=False))"
@@ -530,7 +530,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 14,
       "id": "18",
       "metadata": {
         "colab": {
@@ -579,7 +579,7 @@
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('lightgbm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
@@ -600,7 +600,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 15,
       "id": "20",
       "metadata": {
         "colab": {
@@ -668,7 +668,7 @@
         "plt.xlabel('Importance Score')\n",
         "plt.ylabel('Features')\n",
         "plt.tight_layout()\n",
-        "plt.savefig('lightgbm_feature_importance.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 20 Important Features:\")\n",
         "print(importance_df.to_string(index=False))"
@@ -857,7 +857,7 @@
       "provenance": []
     },
     "kernelspec": {
-      "display_name": ".venv",
       "language": "python",
       "name": "python3"
     },
@@ -871,7 +871,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.13.11"
     }
   },
   "nbformat": 4,

     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "4097cd49",
       "metadata": {},
       "outputs": [
         "plt.xticks(np.arange(len(leaves_performance)), leaves_performance['num_leaves'].astype(int))\n",
         "plt.grid(True, alpha=0.3)\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/lgb_num_leaves_sensitivity.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTree Complexity (num_leaves) Performance Analysis:\")\n",
         "print(leaves_performance.to_string(index=False))"
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "18",
       "metadata": {
         "colab": {
         "plt.ylabel('True Label')\n",
         "plt.xlabel('Predicted Label')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/lightgbm_confusion_matrix.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(f\"Confusion Matrix:\\n{cm}\")\n",
         "print(f\"\\nTrue Negatives: {cm[0,0]}\")\n",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "id": "20",
       "metadata": {
         "colab": {
         "plt.xlabel('Importance Score')\n",
         "plt.ylabel('Features')\n",
         "plt.tight_layout()\n",
+        "plt.savefig('docs/02_results/lightgbm_feature_importance.png', dpi=300, bbox_inches='tight')\n",
         "plt.show()\n",
         "print(\"\\nTop 20 Important Features:\")\n",
         "print(importance_df.to_string(index=False))"
       "provenance": []
     },
     "kernelspec": {
+      "display_name": "mlqueens",
       "language": "python",
       "name": "python3"
     },
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
+      "version": "3.12.12"
     }
   },
   "nbformat": 4,

src/utils/helpers.py CHANGED

@@ -109,58 +109,57 @@ def apply_balance(df: pd.DataFrame, target_col: str = "target", random_state: in
 def plot_top_ngrams(corpus, n=1, top_k=20, stop_words='english', max_features=20000, figsize=(10,6), title=None):
-    """
-    Compute and plot the top n-grams from a text corpus.
-    Parameters
-    ----------
-    corpus : iterable-like
-        Iterable of text documents (e.g., pandas Series).
-    n : int, optional
-        The n in n-grams (uses ngram_range=(n,n)). Default is 1 (unigrams).
-    top_k : int, optional
-        Number of top n-grams to show. Default is 20.
-    stop_words : str or list, optional
-        Stop words parameter forwarded to CountVectorizer. Default 'english'.
-    max_features : int, optional
-        Max features for the vectorizer. Default 20000.
-    figsize : tuple, optional
-        Figure size for the plot.
-    title : str, optional
-        Custom title for the plot. If None, a default title is used.
-    Returns
-    -------
-    list of (term, count)
-        The top n-grams and their counts (sorted descending).
-    """
-    vec = CountVectorizer(ngram_range=(n, n), stop_words=stop_words, max_features=max_features)
-    X = vec.fit_transform(corpus)
-    sums = np.array(X.sum(axis=0)).ravel()
-    terms = np.array(vec.get_feature_names_out())
-    if terms.size == 0:
-        print("No terms found for the given corpus/parameters.")
-        return []
-    top_idx = sums.argsort()[::-1][:top_k]
-    top_terms = terms[top_idx]
-    top_counts = sums[top_idx]
-    # Plot horizontal bar chart with largest on top
-    plt.figure(figsize=figsize)
-    plt.barh(top_terms[::-1], top_counts[::-1], color='steelblue')
-    plt.xlabel("Count")
-    plt.tight_layout()
-    if title is None:
-        title = f"Top {min(top_k, len(top_terms))} {n}-grams"
-    plt.title(title)
-    plt.show()
-    return list(zip(top_terms, top_counts))
 # preprocessing notebook
 def clean_text(s):
@@ -292,7 +291,7 @@ def show_top_ngrams_by_class(df, target_col='review_target', text_col='review_cl
 				plt.title(f"Top {len(terms)} {nname} for class {cls}")
 				plt.xlabel("Count")
 				plt.tight_layout()
-				plt.savefig(f'data/predictions/top_{nname}_for_class_{cls}.png', dpi=300, bbox_inches='tight')
 				plt.show()
 	return results
@@ -391,7 +390,7 @@ def plot_dimensionality_reduction(X, labels, method='PCA', sample=1000, random_s
 	plt.xlabel('dim1')
 	plt.ylabel('dim2')
 	plt.title(f'{method} projection')
-	plt.savefig(f'data/predictions/{method}_projection_{data_name}_{i}.png', dpi=300, bbox_inches='tight')
 	plt.show()
 	return emb

 def plot_top_ngrams(corpus, n=1, top_k=20, stop_words='english', max_features=20000, figsize=(10,6), title=None):
+	"""
+	Compute and plot the top n-grams from a text corpus.
+	Parameters
+	----------
+	corpus : iterable-like
+		Iterable of text documents (e.g., pandas Series).
+	n : int, optional
+		The n in n-grams (uses ngram_range=(n,n)). Default is 1 (unigrams).
+	top_k : int, optional
+		Number of top n-grams to show. Default is 20.
+	stop_words : str or list, optional
+		Stop words parameter forwarded to CountVectorizer. Default 'english'.
+	max_features : int, optional
+		Max features for the vectorizer. Default 20000.
+	figsize : tuple, optional
+		Figure size for the plot.
+	title : str, optional
+		Custom title for the plot. If None, a default title is used.
+	Returns
+	-------
+	list of (term, count)
+		The top n-grams and their counts (sorted descending).
+	"""
+	vec = CountVectorizer(ngram_range=(n, n), stop_words=stop_words, max_features=max_features)
+	X = vec.fit_transform(corpus)
+	sums = np.array(X.sum(axis=0)).ravel()
+	terms = np.array(vec.get_feature_names_out())
+	if terms.size == 0:
+		print("No terms found for the given corpus/parameters.")
+		return []
+	top_idx = sums.argsort()[::-1][:top_k]
+	top_terms = terms[top_idx]
+	top_counts = sums[top_idx]
+	# Plot horizontal bar chart with largest on top
+	plt.figure(figsize=figsize)
+	plt.barh(top_terms[::-1], top_counts[::-1], color='steelblue')
+	plt.xlabel("Count")
+	plt.tight_layout()
+	if title is None:
+		title = f"Top {min(top_k, len(top_terms))} {n}-grams"
+	plt.title(title)
+	plt.savefig(f'docs/02_results/top_{top_k}_{n}grams.png', dpi=300, bbox_inches='tight')
+	plt.show()
+	return list(zip(top_terms, top_counts))
 # preprocessing notebook
 def clean_text(s):
 				plt.title(f"Top {len(terms)} {nname} for class {cls}")
 				plt.xlabel("Count")
 				plt.tight_layout()
+				plt.savefig(f'docs/02_results/top_{nname}_for_class_{cls}.png', dpi=300, bbox_inches='tight')
 				plt.show()
 	return results
 	plt.xlabel('dim1')
 	plt.ylabel('dim2')
 	plt.title(f'{method} projection')
+	plt.savefig(f'docs/02_results/{method}_projection_{data_name}_{i}.png', dpi=300, bbox_inches='tight')
 	plt.show()
 	return emb
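
Note on the new `savefig` paths: every call in this commit now writes to the relative directory `docs/02_results/`. `plt.savefig` does not create missing directories, so the calls will fail if that folder is absent or if the notebook's working directory is not the repository root. A minimal sketch of one way to guard against this follows. It assumes a helper named `results_path` and a constant `RESULTS_DIR`, both hypothetical and not part of this commit:

```python
from pathlib import Path

# Hypothetical default; mirrors the 'docs/02_results' prefix used in this commit.
RESULTS_DIR = Path("docs/02_results")


def results_path(filename, base_dir=RESULTS_DIR):
    """Return base_dir/filename as a string, creating base_dir if it is missing."""
    base_dir = Path(base_dir)
    # parents=True builds intermediate folders; exist_ok=True makes repeat calls safe.
    base_dir.mkdir(parents=True, exist_ok=True)
    return str(base_dir / filename)


# Possible use inside a notebook cell or helper (illustrative only):
# plt.savefig(results_path('sgd_confusion_matrix.png'), dpi=300, bbox_inches='tight')
```

The path is still resolved against the current working directory, so this only covers the missing-folder case. Notebooks launched from inside `notebooks/` would write to `notebooks/docs/02_results/` unless an absolute `base_dir` is passed.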