elsayedelmandoh committed on
Commit 413d3a1 · 1 Parent(s): 50430e0

update readme, datasets, and structure

README.md CHANGED
@@ -1 +1,114 @@
- #
+ # Sentiment Analysis of Amazon Reviews using Machine Learning
+
+ [![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120/)
+ [![Gradio](https://img.shields.io/badge/UI-Gradio-orange)](https://gradio.app/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ **Live Demo (Hugging Face Space):** [Sentiment Sleuth](https://huggingface.co/spaces/elsayedelmandoh/project-name)
+ **GitHub Repository:** [Sentiment Sleuth](https://github.com/elsayedelmandoh/sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens)
+
+ ## Table of Contents
+ - [Overview](#overview)
+ - [Key Features](#key-features)
+ - [Setup](#setup)
+   - [0. Prerequisites](#0-prerequisites)
+   - [1. Clone the Repository](#1-clone-the-repository)
+   - [2. Create Conda Environment](#2-create-conda-environment)
+   - [3. Environment Variables](#3-environment-variables)
+ - [Usage](#usage)
+ - [Contributing](#contributing)
+ - [Author](#author)
+
+ ---
+ ## Overview
+
+ This project performs sentiment analysis on Amazon product reviews using classical machine-learning models. It includes data-processing and feature-engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI for analyzing custom review text.
+
+ Key components in the repository:
+ - Interactive app: `app.py` (Streamlit)
+ - Saved models: `data/models/*.joblib`
+ - Vectorizer and precomputed TF-IDF sparse matrices: `data/vectorizers/`
+ - Processed datasets and samples: `data/processed/` and `data/samples/`
+ - Notebooks: `notebooks/` (EDA, preprocessing, feature engineering, and model notebooks)
+ - Documentation: `docs/` (research notes, project definition, workflow, and report)
+
+ The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost, and LightGBM) so you can compare predictions and confidence scores side by side.
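The side-by-side comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `app.py`: the function name `compare_models` and the label names are hypothetical, and the models are treated as any objects exposing `predict_proba` or `predict` (in the real app they would presumably be restored from `data/models/*.joblib`).

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Assumed label order; the README does not state how classes are encoded.
LABELS: Sequence[str] = ("negative", "positive")


def compare_models(
    models: Dict[str, Any], features: Any
) -> List[Tuple[str, str, Optional[float]]]:
    """Return (model name, predicted label, confidence) for one vectorized review."""
    rows: List[Tuple[str, str, Optional[float]]] = []
    for name, model in models.items():
        if hasattr(model, "predict_proba"):
            # Probabilistic models: confidence is the highest class probability.
            probs = [float(p) for p in model.predict_proba(features)[0]]
            best = max(range(len(probs)), key=probs.__getitem__)
            rows.append((name, LABELS[best], probs[best]))
        else:
            # Models without predict_proba: report the predicted label only.
            best = int(model.predict(features)[0])
            rows.append((name, LABELS[best], None))
    return rows
```

Models that expose no probabilities return `None` as the confidence, so a UI can show the label alone instead of a misleading score.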
+
+ ---
+ ## Key Features
+ * **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
+ * **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
+ * **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering, and model training are included under `notebooks/`.
+
+ ---
+ ## Setup
+ 0. Prerequisites
+ Before running this project, ensure you have the following installed:
+ * [Git](https://git-scm.com/)
+ * [Anaconda](https://www.anaconda.com/) or Miniconda
+ * Python 3.12 (recommended)
+
+ 1. Clone the Repository
+ ```bash
+ git clone https://github.com/elsayedelmandoh/sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens
+ cd sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens
+ ```
+ 2. Create Conda Environment
+ ```bash
+ # Create & activate the environment
+ conda create -n env-name python=3.12 -y
+ conda activate env-name
+
+ # Install pip and project dependencies
+ conda install pip -y
+ pip install -r requirements.txt
+ ```
+
+ 3. Environment Variables
+ Create a `.env` file at the project root and add any necessary API keys or configuration variables.
+
+ ---
+ ## Usage
+ This project uses Streamlit for the interactive UI. Start the app locally with the following command:
+
+ ```bash
+ # Run via Streamlit
+ streamlit run app.py
+ ```
+
+ When the app starts, open the local URL printed in your terminal (usually http://localhost:8501) and paste an Amazon review into the text area to see per-model sentiment predictions and confidence scores.
+
+ Model artifacts and vectorizers are loaded from `data/models/` and `data/vectorizers/`. If the vectorizer or model files are missing, the app will show an error message pointing to the expected files.
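The missing-file check mentioned above could look roughly like this. It is a sketch only: `missing_artifacts` and the `EXPECTED` list are hypothetical names, and the model file name is an assumption (only `tfidf_vectorizer.joblib` is named in the docs).

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical list of required artifacts, relative to the project root.
EXPECTED = [
    "data/vectorizers/tfidf_vectorizer.joblib",
    "data/models/logistic_regression.joblib",  # assumed file name
]


def missing_artifacts(root: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that do not exist as files under root."""
    return [rel for rel in expected if not (root / rel).is_file()]
```

An app could call `missing_artifacts(Path("."), EXPECTED)` at startup and list the returned paths in its error message.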
+
+ ---
+ ## Reproducibility & Notebooks
+ The `notebooks/` directory contains step-by-step analysis and model training notebooks. Key notebooks:
+ - `01_data_acquisition.ipynb` - dataset loading and brief description
+ - `02_eda.ipynb` - exploratory data analysis
+ - `03_data_preprocessing.ipynb` - cleaning and preprocessing
+ - `04_feature_engineering.ipynb` - TF-IDF vectorization and feature prep
+ - `05_logistic_regression.ipynb` through `13_lightgbm.ipynb` - one notebook per model
+ - `14_comparsion.ipynb` - model comparison and summary
+
+ Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
+
+ ---
+ ## Contributing
+ Contributions are welcome! If you'd like to improve this project, please follow these steps:
+ 1. Fork the repository.
+ 2. Create a branch for your feature or bug fix (`git checkout -b feature/my-new-feature`).
+ 3. Commit your changes with clear messages (`git commit -m 'add some feature'`).
+ 4. Push to your fork (`git push origin feature/my-new-feature`).
+ 5. Open a pull request.
+
+ Please include reproducible steps and, if applicable, updated notebooks or scripts to regenerate models.
+
+ ## License
+ This project is provided under the MIT license. See the `LICENSE` file for details.
+
+ ## Author
+ Elsayed Elmandoh - NLP Engineer
+ * Connect on LinkedIn and X via [Linktree](https://linktr.ee/elsayedelmandoh)
+
+ Mohamed Kamal - AI Engineer
+ * Connect on [LinkedIn](https://www.linkedin.com/in/mohamed-kamal-has/?utm_source=share_via&utm_content=profile&utm_medium=member_android)
docs/00_research/datasets.md CHANGED
@@ -12,4 +12,15 @@ Description: The dataset consists of approximately 1.8M training samples and 200
  - Title: The review summary.
  - Text: The full review body.
 
- Scale: Given the constraints of the Anaconda/Jupyter environment and the 4-week timeline, we will utilize a stratified subset of this data. This ensures we maintain a balanced distribution of classes while keeping training times feasible for iterative experimentation.
+ Scale: Given the constraints of the Anaconda/Jupyter environment and the 4-week timeline, we will utilize a stratified subset of this data. This ensures we maintain a balanced distribution of classes while keeping training times feasible for iterative experimentation.
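One way to draw such a subset is equal-allocation stratified sampling: group rows by label and take the same number from each group. The sketch below uses only the standard library; the function name `balanced_sample`, the `(label, text)` row shape, and the seed are assumptions for illustration, not the project's actual sampling code.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, str]  # assumed shape: (label, review text)


def balanced_sample(rows: Sequence[Row], per_class: int, seed: int = 42) -> List[Row]:
    """Draw up to per_class rows from each label so the subset is class-balanced."""
    by_label: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sample: List[Row] = []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))

    rng.shuffle(sample)  # avoid leaving the rows grouped by label
    return sample
```

If the source data is already balanced, this gives the same class proportions as proportional stratification; otherwise it deliberately equalizes the classes.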
+
+ Current project files (workspace snapshot):
+
+ - `data/raw/`: `train.csv`, `test.csv`, `readme.txt`
+ - `data/processed/`: `processed_train.csv`, `processed_valid.csv`, `processed_test.csv`, `feat_eng_train.csv`, `balanced_sample_train.csv`, `y_train.csv`, `y_valid.csv`, `y_test.csv`
+ - `data/vectorizers/`: `tfidf_vectorizer.joblib`, `X_train_tfidf.npz`, `X_valid_tfidf.npz`, `X_test_tfidf.npz`
+ - `data/models/`: several pre-trained model artifacts (see `docs/01_project_definition/07_structure.md` for the full tree)
+
+ Notes:
+ - The workspace stores preprocessed train/valid/test splits under `data/processed` to allow reproducible training and evaluation without re-running heavy preprocessing steps.
+ - TF-IDF artifacts are persisted under `data/vectorizers` and are used to transform text into sparse matrices for model training and inference.
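For reference, if the persisted vectorizer is scikit-learn's `TfidfVectorizer` with its default settings (an assumption; the docs only name the file `tfidf_vectorizer.joblib`), each entry of the sparse matrix would be the smoothed, L2-normalized weight below, where $n$ is the number of training documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in document $d$:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
```

Non-default options such as `sublinear_tf=True` or `norm=None` would change this formula, so it should be checked against the notebook that fits the vectorizer.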
docs/01_project_definition/00_quickstart.md CHANGED
@@ -1,2 +1,34 @@
1
- # Project Definition - quickstart
+ # Project Definition - Quickstart
+
+ This quickstart explains how to prepare the environment and reproduce the core experiments and inference from the repository.
+
+ 1) Create a Python environment and install dependencies:
+
+ ```
+ python -m venv .venv
+ source .venv/Scripts/activate  # Windows (Git Bash); on Linux/macOS use: source .venv/bin/activate
+ pip install -r requirements.txt
+ ```
+
+ 2) Inspect processed data and vectorizers (already available in the repo):
+
+ - `data/processed/` contains prepared train/valid/test CSVs and labels.
+ - `data/vectorizers/` contains the fitted TF-IDF vectorizer and sparse matrices.
+
+ 3) Run notebooks (recommended order):
+
+ - `notebooks/01_data_acquisition.ipynb`
+ - `notebooks/02_eda.ipynb`
+ - `notebooks/03_data_preprocessing.ipynb`
+ - `notebooks/04_feature_engineering.ipynb`
+ - modeling notebooks `05_*.ipynb` through `14_comparsion.ipynb`
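Because the notebooks carry numeric prefixes, the recommended order can be recovered programmatically. The helper below is a small sketch, assuming every notebook to run is named `<number>_<name>.ipynb`; the function name `ordered_notebooks` is hypothetical and not part of the repository.

```python
import re
from pathlib import Path
from typing import List, Tuple


def ordered_notebooks(folder: Path) -> List[Path]:
    """Return notebooks with a numeric prefix, sorted by that prefix."""
    pattern = re.compile(r"^(\d+)_.*\.ipynb$")
    numbered: List[Tuple[int, Path]] = []
    for path in folder.glob("*.ipynb"):
        match = pattern.match(path.name)
        if match:
            # Compare prefixes as integers so 10 sorts after 9, not after 1.
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]
```

Calling `ordered_notebooks(Path("notebooks"))` would list the files in the order given above, skipping any notebook without a numeric prefix.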
+
+ 4) Run the demo/app:
+
+ ```
+ streamlit run app.py
+ ```
+
+ Notes:
+ - Preprocessed artifacts and trained model joblib files are stored under `data/processed`, `data/vectorizers`, and `data/models` to speed up reproduction.
 
docs/01_project_definition/07_structure.md CHANGED
@@ -3,47 +3,69 @@
  ```text
  sentiment-analysis-of-amazon-reviews-using-machine-learning/
  ├── data/
- │   ├── raw/                  # Original, immutable Kaggle dataset
- │   ├── processed/            # Cleaned data ready for modeling
- │   ├── predictions/          # Model predictions on test set
- │   └── models/               # Saved model files
+ │   ├── models/               # Saved model files (.joblib)
+ │   ├── predictions/          # Model prediction outputs (CSV)
+ │   ├── processed/            # Cleaned & feature-engineered datasets
+ │   │   ├── processed_train.csv
+ │   │   ├── processed_valid.csv
+ │   │   ├── processed_test.csv
+ │   │   └── feat_eng_train.csv
+ │   ├── raw/                  # Original immutable dataset
+ │   │   ├── train.csv
+ │   │   └── test.csv
+ │   ├── samples/              # Small sample files for quick testing
+ │   └── vectorizers/          # Saved vectorizers and sparse matrices (TF-IDF)
+ │       ├── tfidf_vectorizer.joblib
+ │       ├── X_train_tfidf.npz
+ │       └── X_test_tfidf.npz
  |
  ├── docs/
- │   └── 00_research/ #
- |       ├── datasets.md
- |       ├── references.md
- |       └── related_projects.md
+ │   ├── 00_research/
+ │   │   ├── datasets.md
+ │   │   ├── references.md
+ │   │   └── related_projects.md
+ │   └── 01_project_definition/
+ │       ├── 00_quickstart.md
+ │       ├── 01_problem.md
+ │       ├── 02_goal.md
+ │       ├── 03_solution.md
+ │       ├── 04_stack.md
+ │       ├── 05_architecture.md
+ │       ├── 06_workflow.md
+ │       ├── 07_structure.md
+ │       └── 08_report.md
  |
- │   └── 01_project_definition/ #
- |       ├── 00_quickstart.md
- |       ├── 01_problem.md
- |       ├── 08_report.md
- |       └── proposal_sentiment_analysis.pdf
+ ├── notebooks/
+ │   ├── 00_quickstart.ipynb
+ │   ├── 01_data_acquisition.ipynb
+ │   ├── 02_eda.ipynb
+ │   ├── 03_data_preprocessing.ipynb
+ │   ├── 04_feature_engineering.ipynb
+ │   ├── 05_logistic_regression.ipynb
+ │   ├── 06_naive_bayes.ipynb
+ │   ├── 07_support_vector_machine.ipynb
+ │   ├── 08_k_nearest_neighbors.ipynb
+ │   ├── 09_decision_trees.ipynb
+ │   ├── 10_random_forest.ipynb
+ │   ├── 11_stochastic_gradient_descent.ipynb
+ │   ├── 12_xgboost.ipynb
+ │   ├── 13_lightgbm.ipynb
+ │   └── 14_comparsion.ipynb
  |
- ├── notebooks/
- │   ├── 00_quickstart.ipynb #
- │   ├── 01_logistic_regression.ipynb #
- │   ├── 02_naive_bayes.ipynb #
- │   ├── 03_support_vector_machines.ipynb #
- │   ├── 04_k_nearest_neighbors.ipynb #
- │   ├── 05_decision_trees.ipynb #
- │   ├── 06_random_forest.ipynb #
- │   ├── 07_stochastic_gradient_descent.ipynb #
- │   └── 08_comparsion.ipynb #
- |
- ├── src/                      # Production-style source code
+ ├── src/                      # Production-style source code and helpers
  │   ├── config/
  │   │   ├── __init__.py
- │   │   └── settings          # Configuration files
- │   ├── utils/
- │   |   ├── __init__.py
- │   |   └── helpers.py
- │   └── __init__.py
+ │   │   └── settings.py       # Configuration values and constants
+ │   └── utils/
+ │       ├── __init__.py
+ │       └── helpers.py        # Helper functions used by notebooks and app
  |
- ├── .env #
- ├── .env.example #
- ├── .gitattributes #
- ├── .gitignore #
- ├── appy.py #
+ ├── .env                      # Environment variables
+ ├── .env.example              # Example of environment variables
+ ├── .gitattributes
+ ├── .gitignore                # List of files to ignore by git
+ ├── app.py                    # App/runner for model inference or demo
  ├── README.md                 # Project overview and instructions to run
- └── requirements.txt          # List of dependencies (pandas, scikit-learn, etc.)
+ └── requirements.txt          # List of dependencies (pandas, scikit-learn, etc.)
+ ```