Commit 413d3a1
Parent(s): 50430e0
update readme, datasets, and structure
- README.md +114 -1
- docs/00_research/datasets.md +12 -1
- docs/01_project_definition/00_quickstart.md +33 -1
- docs/01_project_definition/07_structure.md +58 -36
README.md
CHANGED
|
@@ -1 +1,114 @@
|
|
| 1 |
-
#
|
| 1 |
+
# Sentiment Analysis of Amazon Reviews using Machine Learning
|
| 2 |
+
|
| 3 |
+
[Python 3.12](https://www.python.org/downloads/release/python-3120/)
|
| 4 |
+
[Gradio](https://gradio.app/)
|
| 5 |
+
[License: MIT](https://opensource.org/licenses/MIT)
|
| 6 |
+
|
| 7 |
+
**Live Demo (Hugging Face Space):** [Sentiment Sleuth](https://huggingface.co/spaces/elsayedelmandoh/project-name)
|
| 8 |
+
**GitHub Repository:** [Sentiment Sleuth](https://github.com/elsayedelmandoh/sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens)
|
| 9 |
+
|
| 10 |
+
## Table of Contents
|
| 11 |
+
- [Overview](#overview)
|
| 12 |
+
- [Key Features](#key-features)
|
| 13 |
+
- [Setup](#setup)
|
| 14 |
+
- [0. Prerequisites](#0-prerequisites)
|
| 15 |
+
- [1. Clone the Repository](#1-clone-the-repository)
|
| 16 |
+
- [2. Create Conda Environment](#2-create-conda-environment)
|
| 17 |
+
- [3. Environment Variables](#3-environment-variables)
|
| 18 |
+
- [Usage](#usage)
|
| 19 |
+
- [Contributing](#contributing)
|
| 20 |
+
- [Author](#author)
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
## Overview
|
| 24 |
+
|
| 25 |
+
This project performs sentiment analysis on Amazon product reviews using classical machine-learning models. It includes data-processing and feature-engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI for analyzing custom review text.
|
| 26 |
+
|
| 27 |
+
Key components in the repository:
|
| 28 |
+
- Interactive app: `app.py` (Streamlit)
|
| 29 |
+
- Saved models: `data/models/*.joblib`
|
| 30 |
+
- Vectorizer and precomputed TF-IDF sparse matrices: `data/vectorizers/`
|
| 31 |
+
- Processed datasets and samples: `data/processed/` and `data/samples/`
|
| 32 |
+
- Notebooks: `notebooks/` (EDA, preprocessing, feature engineering, and model notebooks)
|
| 33 |
+
- Documentation: `docs/` (research notes, project definition, workflow, and report)
|
| 34 |
+
|
| 35 |
+
The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (`Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost and LightGBM`) so you can compare predictions and confidence scores side-by-side.
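The side-by-side comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src.utils.helpers` (which is not shown here): `compare_models` is a hypothetical helper, and it assumes each loaded model follows scikit-learn's estimator conventions (`predict_proba` and `classes_`). Models without `predict_proba`, such as some SVM variants, would need a different confidence measure. The stub classes only stand in for real loaded artifacts.

```python
from typing import Any, Dict, List, Tuple


def compare_models(models: Dict[str, Any], vectorizer: Any, text: str) -> List[Tuple[str, str, float]]:
    """Return (model name, predicted label, confidence) for one review text."""
    features = vectorizer.transform([text])  # one-row feature matrix
    rows = []
    for name, model in models.items():
        probabilities = list(model.predict_proba(features)[0])
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        rows.append((name, str(model.classes_[best]), float(probabilities[best])))
    # Most confident prediction first
    return sorted(rows, key=lambda row: row[2], reverse=True)


class StubVectorizer:
    """Stand-in for a fitted vectorizer (hypothetical, for the demo only)."""

    def transform(self, texts):
        return [[len(t.split())] for t in texts]


class StubModel:
    """Stand-in for a trained classifier (hypothetical, for the demo only)."""

    classes_ = ["negative", "positive"]

    def __init__(self, p_positive: float):
        self.p_positive = p_positive

    def predict_proba(self, features):
        return [[1.0 - self.p_positive, self.p_positive] for _ in features]


if __name__ == "__main__":
    demo = {"model_a": StubModel(0.9), "model_b": StubModel(0.2)}
    for name, label, confidence in compare_models(demo, StubVectorizer(), "works great"):
        print(f"{name}: {label} ({confidence:.2f})")
```

In a real session the dictionary would hold the objects loaded from `data/models/*.joblib`, and the vectorizer would be the one stored under `data/vectorizers/`.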
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
## Key Features
|
| 39 |
+
* **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
|
| 40 |
+
* **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
|
| 41 |
+
* **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering and model training are included under `notebooks/`.
|
| 42 |
+
|
| 43 |
+
---
|
| 44 |
+
## Setup
|
| 45 |
+
### 0. Prerequisites
|
| 46 |
+
Before running this project, ensure you have the following installed:
|
| 47 |
+
* [Git](https://git-scm.com/)
|
| 48 |
+
* [Anaconda](https://www.anaconda.com/) or Miniconda
|
| 49 |
+
* Python 3.12 (recommended)
|
| 50 |
+
|
| 51 |
+
### 1. Clone the Repository
|
| 52 |
+
```bash
|
| 53 |
+
git clone https://github.com/elsayedelmandoh/sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens
|
| 54 |
+
cd sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens
|
| 55 |
+
```
|
| 56 |
+
### 2. Create Conda Environment
|
| 57 |
+
```bash
|
| 58 |
+
# Create & activate the environment
|
| 59 |
+
conda create -n env-name python=3.12 -y
|
| 60 |
+
conda activate env-name
|
| 61 |
+
|
| 62 |
+
# Install pip and project dependencies
|
| 63 |
+
conda install pip -y
|
| 64 |
+
pip install -r requirements.txt
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
### 3. Environment Variables
|
| 68 |
+
Create a `.env` file at the project root and add any necessary API keys or configuration variables.
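As a rough illustration of how such a file can be read, here is a small standard-library parser. It is only a sketch: the README does not say which variables the project expects or whether it relies on a package such as `python-dotenv`, so `read_env_file` and the variable names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Dict


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("\"'")
    return values
```

For example, a file containing `API_KEY='abc123'` and `MODEL_DIR = data/models` (both names invented for this example) would be returned as a dictionary with those two keys.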
|
| 69 |
+
|
| 70 |
+
---
|
| 71 |
+
## Usage
|
| 72 |
+
This project uses Streamlit for the interactive UI. Start the app locally with the following command:
|
| 73 |
+
|
| 74 |
+
```bash
|
| 75 |
+
# Run via Streamlit
|
| 76 |
+
streamlit run app.py
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
When the app starts, open the local URL printed in your terminal (usually http://localhost:8501) and paste an Amazon review into the text area to see per-model sentiment predictions and confidence scores.
|
| 80 |
+
|
| 81 |
+
Model artifacts and vectorizers are loaded from `data/models/` and `data/vectorizers/`. If the vectorizer or model files are missing, the app will show an error message pointing to the expected files.
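A pre-flight check for those files could look like the sketch below. It is an assumption about how such a check might be written, not the app's actual logic: the vectorizer path matches the repository layout, while the model file name is a hypothetical placeholder for whatever is stored under `data/models/`.

```python
from pathlib import Path
from typing import Iterable, List

# The vectorizer path appears in the repository layout; the model file name
# is a hypothetical placeholder for whatever is stored under data/models/.
EXPECTED_ARTIFACTS = [
    "data/vectorizers/tfidf_vectorizer.joblib",
    "data/models/logistic_regression.joblib",  # hypothetical name
]


def missing_artifacts(root: str, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that are not files under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]


def describe_missing(missing: List[str]) -> str:
    """Build a message listing the missing files, or "" if none are missing."""
    if not missing:
        return ""
    return "Missing artifact files:\n" + "\n".join(f"  - {rel}" for rel in missing)


if __name__ == "__main__":
    message = describe_missing(missing_artifacts(".", EXPECTED_ARTIFACTS))
    print(message or "All artifacts found.")
```

Running the script from the project root before `streamlit run app.py` gives a quick list of anything that still has to be regenerated from the notebooks.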
|
| 82 |
+
|
| 83 |
+
---
|
| 84 |
+
## Reproducibility & Notebooks
|
| 85 |
+
The `notebooks/` directory contains step-by-step analysis and model training notebooks. Key notebooks:
|
| 86 |
+
- `01_data_acquisition.ipynb`: dataset loading and brief description
|
| 87 |
+
- `02_eda.ipynb`: exploratory data analysis
|
| 88 |
+
- `03_data_preprocessing.ipynb`: cleaning and preprocessing
|
| 89 |
+
- `04_feature_engineering.ipynb`: TF-IDF vectorization and feature prep
|
| 90 |
+
- `05_logistic_regression.ipynb` through `13_lightgbm.ipynb`: one notebook per model
|
| 91 |
+
- `14_comparsion.ipynb`: model comparison and summary
|
| 92 |
+
|
| 93 |
+
Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
## Contributing
|
| 97 |
+
Contributions are welcome! If you'd like to improve this project, please follow these steps:
|
| 98 |
+
1. Fork the repository.
|
| 99 |
+
2. Create a branch for your feature or bug fix (`git checkout -b feature/my-new-feature`).
|
| 100 |
+
3. Commit your changes with clear messages (`git commit -m 'add some feature'`).
|
| 101 |
+
4. Push to your fork (`git push origin feature/my-new-feature`).
|
| 102 |
+
5. Open a pull request.
|
| 103 |
+
|
| 104 |
+
Please include reproducible steps and, if applicable, updated notebooks or scripts to regenerate models.
|
| 105 |
+
|
| 106 |
+
## License
|
| 107 |
+
This project is provided under the MIT license. See the `LICENSE` file for details.
|
| 108 |
+
|
| 109 |
+
## Author
|
| 110 |
+
Elsayed Elmandoh - NLP Engineer
|
| 111 |
+
* Connect on LinkedIn & X [Linktree](https://linktr.ee/elsayedelmandoh)
|
| 112 |
+
|
| 113 |
+
Mohamed Kamal - AI Engineer
|
| 114 |
+
* Connect on [LinkedIn](https://www.linkedin.com/in/mohamed-kamal-has/?utm_source=share_via&utm_content=profile&utm_medium=member_android)
|
docs/00_research/datasets.md
CHANGED
|
@@ -12,4 +12,15 @@ Description: The dataset consists of approximately 1.8M training samples and 200
|
|
| 12 |
- Title: The review summary.
|
| 13 |
- Text: The full review body.
|
| 14 |
|
| 15 |
-
Scale: Given the constraints of the Anaconda/Jupyter environment and the 4-week timeline, we will utilize a stratified subset of this data. This ensures we maintain a balanced distribution of classes while keeping training times feasible for iterative experimentation.
|
| 12 |
- Title: The review summary.
|
| 13 |
- Text: The full review body.
|
| 14 |
|
| 15 |
+
Scale: Given the constraints of the Anaconda/Jupyter environment and the 4-week timeline, we will utilize a stratified subset of this data. This ensures we maintain a balanced distribution of classes while keeping training times feasible for iterative experimentation.
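One way to draw such a subset is sketched below using only the standard library. It is an illustration under assumptions, not the procedure used in the notebooks: `stratified_sample` is a hypothetical helper that takes an equal number of rows per class, which keeps the classes balanced and coincides with proportional stratification when the full dataset is itself balanced.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(rows: Sequence[Dict], label_key: str, per_class: int, seed: int = 42) -> List[Dict]:
    """Draw up to per_class rows from each label so the classes stay balanced."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sample: List[Dict] = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sample)  # avoid leaving the rows grouped by label
    return sample
```

Calling it with the same seed returns the same rows, which matters when several notebooks need to work from an identical subset.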
|
| 16 |
+
|
| 17 |
+
Current project files (workspace snapshot):
|
| 18 |
+
|
| 19 |
+
- data/raw/: `train.csv`, `test.csv`, `readme.txt`
|
| 20 |
+
- data/processed/: `processed_train.csv`, `processed_valid.csv`, `processed_test.csv`, `feat_eng_train.csv`, `balanced_sample_train.csv`, `y_train.csv`, `y_valid.csv`, `y_test.csv`
|
| 21 |
+
- data/vectorizers/: `tfidf_vectorizer.joblib`, `X_train_tfidf.npz`, `X_valid_tfidf.npz`, `X_test_tfidf.npz`
|
| 22 |
+
- data/models/: several pre-trained model artifacts (see docs/01_project_definition/07_structure.md for full tree)
|
| 23 |
+
|
| 24 |
+
Notes:
|
| 25 |
+
- The workspace stores preprocessed train/valid/test splits under `data/processed` to allow reproducible training and evaluation without re-running heavy preprocessing steps.
|
| 26 |
+
- TF-IDF artifacts are persisted under `data/vectorizers` and are used to transform text into sparse matrices for model training and inference.
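To make the text-to-sparse-matrix step concrete, the sketch below computes TF-IDF weights in plain Python and stores each document as a dictionary of its non-zero terms. It is a simplified illustration, not the persisted vectorizer: tokenization is a bare whitespace split, and the weighting (smoothed IDF followed by L2 normalization) is chosen to resemble scikit-learn's `TfidfVectorizer` defaults. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_rows(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse row (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so every non-empty row has unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows
```

Terms that appear in fewer documents receive higher weights, and terms absent from a document are simply not stored, which is what makes the representation sparse.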
|
docs/01_project_definition/00_quickstart.md
CHANGED
|
@@ -1,2 +1,34 @@
|
|
| 1 |
-
# Project Definition -
|
| 2 |
|
|
|
|
| 1 |
+
# Project Definition - Quickstart
|
| 2 |
+
|
| 3 |
+
This quickstart explains how to prepare the environment and reproduce core experiments and inference from the repository.
|
| 4 |
+
|
| 5 |
+
1) Create a Python environment and install dependencies:
|
| 6 |
+
|
| 7 |
+
```
|
| 8 |
+
python -m venv .venv
|
| 9 |
+
source .venv/Scripts/activate  # Git Bash on Windows; on Linux/macOS use: source .venv/bin/activate
|
| 10 |
+
pip install -r requirements.txt
|
| 11 |
+
```
|
| 12 |
+
|
| 13 |
+
2) Inspect processed data and vectorizers (already available in repo):
|
| 14 |
+
|
| 15 |
+
- `data/processed/` contains prepared train/valid/test CSVs and labels.
|
| 16 |
+
- `data/vectorizers/` contains the fitted TF-IDF vectorizer and sparse matrices.
|
| 17 |
+
|
| 18 |
+
3) Run notebooks (recommended order):
|
| 19 |
+
|
| 20 |
+
- `notebooks/01_data_acquisition.ipynb`
|
| 21 |
+
- `notebooks/02_eda.ipynb`
|
| 22 |
+
- `notebooks/03_data_preprocessing.ipynb`
|
| 23 |
+
- `notebooks/04_feature_engineering.ipynb`
|
| 24 |
+
- modeling notebooks `05_*.ipynb` through `14_comparsion.ipynb`
|
| 25 |
+
|
| 26 |
+
4) Run the demo/app:
|
| 27 |
+
|
| 28 |
+
```
|
| 29 |
+
streamlit run app.py
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
Notes:
|
| 33 |
+
- Preprocessed artifacts and trained model joblib files are stored under `data/processed`, `data/vectorizers`, and `data/models` to speed up reproduction.
|
| 34 |
|
docs/01_project_definition/07_structure.md
CHANGED
|
@@ -3,47 +3,69 @@
|
|
| 3 |
```text
|
| 4 |
sentiment-analysis-of-amazon-reviews-using-machine-learning/
|
| 5 |
├── data/
|
| 6 |
-
│   ├──
|
| 7 |
-
│   ├──
|
| 8 |
-
│   ├──
|
| 9 |
-
│
|
| 10 |
|
|
| 11 |
├── docs/
|
| 12 |
-
│
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
|
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
|
|
| 23 |
-
├──
|
| 24 |
-
│   ├── 00_quickstart.ipynb #
|
| 25 |
-
│   ├── 01_logistic_regression.ipynb #
|
| 26 |
-
│   ├── 02_naive_bayes.ipynb #
|
| 27 |
-
│   ├── 03_support_vector_machines.ipynb #
|
| 28 |
-
│   ├── 04_k_nearest_neighbors.ipynb #
|
| 29 |
-
│   ├── 05_decision_trees.ipynb #
|
| 30 |
-
│   ├── 06_random_forest.ipynb #
|
| 31 |
-
│   ├── 07_stochastic_gradient_descent.ipynb #
|
| 32 |
-
│   └── 08_comparsion.ipynb #
|
| 33 |
-
|
|
| 34 |
-
├── src/ # Production-style source code
|
| 35 |
│   ├── config/
|
| 36 |
│   │   ├── __init__.py
|
| 37 |
-
│   │   └── settings
|
| 38 |
-
│
|
| 39 |
-
│
|
| 40 |
-
|
| 41 |
-
│   ├── __init__.py
|
| 42 |
|
|
| 43 |
-
├── .env
|
| 44 |
-
├── .
|
| 45 |
-
├── .
|
| 46 |
-
├── .
|
| 47 |
-
├──
|
| 48 |
├── README.md # Project overview and instructions to run
|
| 49 |
-
└── requirements.txt # List of dependencies (pandas, scikit-learn, etc.)
|
|
|
|
|
|
| 3 |
```text
|
| 4 |
sentiment-analysis-of-amazon-reviews-using-machine-learning/
|
| 5 |
├── data/
|
| 6 |
+
│   ├── models/ # Saved model files (.joblib)
|
| 7 |
+
│   ├── predictions/ # Model prediction outputs (CSV)
|
| 8 |
+
│   ├── processed/ # Cleaned & feature-engineered datasets
|
| 9 |
+
│   │   ├── processed_train.csv
|
| 10 |
+
│   │   ├── processed_valid.csv
|
| 11 |
+
│   │   ├── processed_test.csv
|
| 12 |
+
│   │   └── feat_eng_train.csv
|
| 13 |
+
│   ├── raw/ # Original immutable dataset
|
| 14 |
+
│   │   ├── train.csv
|
| 15 |
+
│   │   └── test.csv
|
| 16 |
+
│   ├── samples/ # Small sample files for quick testing
|
| 17 |
+
│   └── vectorizers/ # Saved vectorizers and sparse matrices (TF-IDF)
|
| 18 |
+
│       ├── tfidf_vectorizer.joblib
|
| 19 |
+
│       ├── X_train_tfidf.npz
|
| 20 |
+
│       └── X_test_tfidf.npz
|
| 21 |
|
|
| 22 |
├── docs/
|
| 23 |
+
│   ├── 00_research/
|
| 24 |
+
│   │   ├── datasets.md
|
| 25 |
+
│   │   ├── references.md
|
| 26 |
+
│   │   └── related_projects.md
|
| 27 |
+
│   └── 01_project_definition/
|
| 28 |
+
│       ├── 00_quickstart.md
|
| 29 |
+
│       ├── 01_problem.md
|
| 30 |
+
│       ├── 02_goal.md
|
| 31 |
+
│       ├── 03_solution.md
|
| 32 |
+
│       ├── 04_stack.md
|
| 33 |
+
│       ├── 05_architecture.md
|
| 34 |
+
│       ├── 06_workflow.md
|
| 35 |
+
│       ├── 07_structure.md
|
| 36 |
+
│       └── 08_report.md
|
| 37 |
|
|
| 38 |
+
├── notebooks/
|
| 39 |
+
│   ├── 00_quickstartt.ipynb
|
| 40 |
+
│   ├── 01_data_acquisition.ipynb
|
| 41 |
+
│   ├── 02_eda.ipynb
|
| 42 |
+
│   ├── 03_data_preprocessing.ipynb
|
| 43 |
+
│   ├── 04_feature_engineering.ipynb
|
| 44 |
+
│   ├── 05_logistic_regression.ipynb
|
| 45 |
+
│   ├── 06_naive_bayes.ipynb
|
| 46 |
+
│   ├── 07_support_vector_machine.ipynb
|
| 47 |
+
│   ├── 08_k_nearest_neighbors.ipynb
|
| 48 |
+
│   ├── 09_decision_trees.ipynb
|
| 49 |
+
│   ├── 10_random_forest.ipynb
|
| 50 |
+
│   ├── 11_stochastic_gradient_descent.ipynb
|
| 51 |
+
│   ├── 12_xgboost.ipynb
|
| 52 |
+
│   ├── 13_lightgbm.ipynb
|
| 53 |
+
│   └── 14_comparsion.ipynb
|
| 54 |
|
|
| 55 |
+
├── src/ # Production-style source code and helpers
|
| 56 |
│   ├── config/
|
| 57 |
│   │   ├── __init__.py
|
| 58 |
+
│   │   └── settings.py # configuration values and constants
|
| 59 |
+
│   └── utils/
|
| 60 |
+
│       ├── __init__.py
|
| 61 |
+
│       └── helpers.py # Helper functions used by notebooks and app
|
| 62 |
|
|
| 63 |
+
├── .env # Environment variables
|
| 64 |
+
├── .gitignore # List of files to ignore by git
|
| 65 |
+
├── .env.example # Example of environment variables
|
| 66 |
+
├── .gitattributes
|
| 67 |
+
├── .gitignore
|
| 68 |
+
├── app.py # App/runner for model inference or demo
|
| 69 |
├── README.md # Project overview and instructions to run
|
| 70 |
+
└── requirements.txt # List of dependencies (pandas, scikit-learn, etc.)
|
| 71 |
+
```
|