Spaces: elsayedelmandoh/sentiment-sleuth

elsayedelmandoh committed on Mar 9

Commit

413d3a1

1 Parent(s): 50430e0

update readme, datasets, and structure


Files changed (4)

README.md +114 -1
docs/00_research/datasets.md +12 -1
docs/01_project_definition/00_quickstart.md +33 -1
docs/01_project_definition/07_structure.md +58 -36

README.md CHANGED

	@@ -1 +1,114 @@
-#

+# Sentiment Analysis of Amazon Reviews using Machine Learning
+[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120/)
+[![Streamlit](https://img.shields.io/badge/UI-Streamlit-orange)](https://streamlit.io/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+**Live Demo (Hugging Face Space):** [Sentiment Sleuth](https://huggingface.co/spaces/elsayedelmandoh/sentiment-sleuth)
+**GitHub Repository:** [Sentiment Sleuth](https://github.com/elsayedelmandoh/sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens)
+## Table of Contents
+- [Overview](#overview)
+- [Key Features](#key-features)
+- [Setup](#setup)
+	- [0. Prerequisites](#0-prerequisites)
+	- [1. Clone the Repository](#1-clone-the-repository)
+	- [2. Create Conda Environment](#2-create-conda-environment)
+	- [3. Environment Variables](#3-environment-variables)
+- [Usage](#usage)
+- [Reproducibility & Notebooks](#reproducibility--notebooks)
+- [Contributing](#contributing)
+- [License](#license)
+- [Author](#author)
+---
+## Overview
+This project performs sentiment analysis on Amazon product reviews using classical machine-learning models. It includes data-processing and feature-engineering notebooks, multiple trained classifiers saved as joblib artifacts, a TF-IDF vectorizer, and a Streamlit UI for analyzing custom review text.
+Key components in the repository:
+- Interactive app: `app.py` (Streamlit)
+- Saved models: `data/models/*.joblib`
+- Vectorizer and precomputed TF-IDF sparse matrices: `data/vectorizers/`
+- Processed datasets and samples: `data/processed/` and `data/samples/`
+- Notebooks: `notebooks/` (EDA, preprocessing, feature engineering, and model notebooks)
+- Documentation: `docs/` (research notes, project definition, workflow, and report)
+The Streamlit app loads saved artifacts via `src.utils.helpers` and exposes multiple classifiers (Logistic Regression, Naive Bayes, SVM variants, KNN, Decision Trees, Random Forest, SGD, XGBoost, and LightGBM) so you can compare predictions and confidence scores side by side.
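
To illustrate the side-by-side comparison described above, here is a minimal sketch. It is not the repository's actual `src.utils.helpers` code, which this commit does not show. The function names `load_artifacts` and `compare_predictions` are hypothetical, the vectorizer file name is taken from `docs/00_research/datasets.md`, and the models are assumed to follow the scikit-learn `predict` / `predict_proba` convention.

```python
from pathlib import Path


def load_artifacts(root="data"):
    """Load the TF-IDF vectorizer and every saved model under `root` (assumed layout)."""
    import joblib  # third-party; imported here so compare_predictions has no hard dependency

    root = Path(root)
    vectorizer = joblib.load(root / "vectorizers" / "tfidf_vectorizer.joblib")
    # One entry per saved model, keyed by file name without the extension.
    models = {p.stem: joblib.load(p) for p in sorted((root / "models").glob("*.joblib"))}
    return vectorizer, models


def compare_predictions(text, vectorizer, models):
    """Return {model_name: (label, confidence or None)} for a single review."""
    features = vectorizer.transform([text])
    results = {}
    for name, model in models.items():
        label = model.predict(features)[0]
        confidence = None
        # Only some classifiers expose class probabilities.
        if hasattr(model, "predict_proba"):
            confidence = float(max(model.predict_proba(features)[0]))
        results[name] = (label, confidence)
    return results
```

In this sketch, models without `predict_proba` (for example a linear SVM trained without probability estimates) report `None` instead of an invented confidence value.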
+---
+## Key Features
+* **Multiple Models:** Compare results from several traditional classifiers (Logistic Regression, Naive Bayes, SVMs, KNN, Decision Trees, Random Forests, SGD, XGBoost, LightGBM).
+* **Reusable Artifacts:** TF-IDF vectorizer and trained models are persisted under `data/vectorizers/` and `data/models/` for fast local inference.
+* **Notebooks for Reproducibility:** Step-by-step Jupyter notebooks for data acquisition, EDA, preprocessing, feature engineering and model training are included under `notebooks/`.
+---
+## Setup
+### 0. Prerequisites
+Before running this project, ensure you have the following installed:
+* [Git](https://git-scm.com/)
+* [Anaconda](https://www.anaconda.com/) or Miniconda
+* Python 3.12 (recommended)
+### 1. Clone the Repository
+```bash
+git clone https://github.com/elsayedelmandoh/sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens
+cd sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens
+```
+### 2. Create Conda Environment
+```bash
+# Create & activate the environment
+conda create -n env-name python=3.12 -y
+conda activate env-name
+# Install pip and project dependencies
+conda install pip -y
+pip install -r requirements.txt
+```
+### 3. Environment Variables
+Create a `.env` file at the project root (the repository's `.env.example` can serve as a starting point) and add any required API keys or configuration variables.
+---
+## Usage
+This project uses Streamlit for the interactive UI. Start the app locally with the following command:
+```bash
+# Run via Streamlit
+streamlit run app.py
+```
+When the app starts, open the local URL printed in your terminal (usually http://localhost:8501) and paste an Amazon review into the text area to see per-model sentiment predictions and confidence scores.
+Model artifacts and vectorizers are loaded from `data/models/` and `data/vectorizers/`. If the vectorizer or model files are missing, the app will show an error message pointing to the expected files.
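
The missing-file behaviour described above could be implemented with a small pre-flight check. The sketch below is an assumption about how such a check might look, not the app's actual code: `find_artifact_problems` is a hypothetical name, and the vectorizer file name comes from `docs/00_research/datasets.md`.

```python
from pathlib import Path

# Assumed layout, based on the directories named in the README.
VECTORIZER_PATH = Path("data/vectorizers/tfidf_vectorizer.joblib")
MODELS_DIR = Path("data/models")


def find_artifact_problems(root="."):
    """Return human-readable messages for missing artifacts (empty list if all present)."""
    root = Path(root)
    problems = []

    vectorizer = root / VECTORIZER_PATH
    if not vectorizer.is_file():
        problems.append(f"Missing vectorizer: expected {vectorizer}")

    models_dir = root / MODELS_DIR
    # At least one saved model is needed for the app to be useful.
    if not models_dir.is_dir() or not any(models_dir.glob("*.joblib")):
        problems.append(f"No model files found: expected *.joblib in {models_dir}")

    return problems
```

In a Streamlit app, each returned message could be passed to `st.error` before any model is loaded.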
+---
+## Reproducibility & Notebooks
+The `notebooks/` directory contains step-by-step analysis and model training notebooks. Key notebooks:
+- `01_data_acquisition.ipynb` — dataset loading and brief description
+- `02_eda.ipynb` — exploratory data analysis
+- `03_data_preprocessing.ipynb` — cleaning and preprocessing
+- `04_feature_engineering.ipynb` — TF-IDF vectorization and feature prep
+- `05_logistic_regression.ipynb` through `13_lightgbm.ipynb` — one notebook per model
+- `14_comparsion.ipynb` — model comparison and summary
+Use these notebooks to retrain or refine models and regenerate the `joblib` artifacts saved in `data/models/`.
+---
+## Contributing
+Contributions are welcome! If you'd like to improve this project, please follow these steps:
+1. Fork the repository.
+2. Create a branch for your feature or bug fix (`git checkout -b feature/my-new-feature`).
+3. Commit your changes with clear messages (`git commit -m 'add some feature'`).
+4. Push to your fork (`git push origin feature/my-new-feature`).
+5. Open a pull request.
+Please include reproducible steps and, if applicable, updated notebooks or scripts to regenerate models.
+## License
+This project is provided under the MIT license. See the `LICENSE` file for details.
+## Author
+Elsayed Elmandoh - NLP Engineer
+* Connect on LinkedIn and X via [Linktree](https://linktr.ee/elsayedelmandoh)
+Mohamed Kamal - AI Engineer
+* Connect on [LinkedIn](https://www.linkedin.com/in/mohamed-kamal-has/?utm_source=share_via&utm_content=profile&utm_medium=member_android)

docs/00_research/datasets.md CHANGED

@@ -12,4 +12,15 @@ Description: The dataset consists of approximately 1.8M training samples and 200
 - Title: The review summary.
 - Text: The full review body.
-Scale: Given the constraints of the Anaconda/Jupyter environment and the 4-week timeline, we will utilize a stratified subset of this data. This ensures we maintain a balanced distribution of classes while keeping training times feasible for iterative experimentation.

 - Title: The review summary.
 - Text: The full review body.
+Scale: Given the constraints of the Anaconda/Jupyter environment and the 4-week timeline, we will utilize a stratified subset of this data. This ensures we maintain a balanced distribution of classes while keeping training times feasible for iterative experimentation.
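
The stratified subset mentioned in the Scale note can be sketched with the standard library alone. This is an illustrative assumption rather than the project's preprocessing code: `stratified_subset`, the `label` key, and the sampling fraction are hypothetical, and rows are modelled as plain dictionaries.

```python
import random
from collections import defaultdict


def stratified_subset(rows, label_key, fraction, seed=42):
    """Sample the same fraction from every class so class proportions are preserved."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        k = round(fraction * len(group))
        subset.extend(rng.sample(group, k))

    # Shuffle so the subset is not ordered by class.
    rng.shuffle(subset)
    return subset
```

Because every class is sampled at the same rate, a balanced source dataset yields a balanced subset, and a fixed `seed` keeps the subset reproducible across runs.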
+Current project files (workspace snapshot):
+- data/raw/: `train.csv`, `test.csv`, `readme.txt`
+- data/processed/: `processed_train.csv`, `processed_valid.csv`, `processed_test.csv`, `feat_eng_train.csv`, `balanced_sample_train.csv`, `y_train.csv`, `y_valid.csv`, `y_test.csv`
+- data/vectorizers/: `tfidf_vectorizer.joblib`, `X_train_tfidf.npz`, `X_valid_tfidf.npz`, `X_test_tfidf.npz`
+- data/models/: several pre-trained model artifacts (see docs/01_project_definition/07_structure.md for full tree)
+Notes:
+- The workspace stores preprocessed train/valid/test splits under `data/processed` to allow reproducible training and evaluation without re-running heavy preprocessing steps.
+- TF-IDF artifacts are persisted under `data/vectorizers` and are used to transform text into sparse matrices for model training and inference.

docs/01_project_definition/00_quickstart.md CHANGED

	@@ -1,2 +1,34 @@
-# Project Definition - quickstart

+# Project Definition - Quickstart
+This quickstart explains how to prepare the environment and reproduce core experiments and inference from the repository.
+1) Create a Python environment and install dependencies:
+```bash
+python -m venv .venv
+# Windows (Git Bash); on Linux/macOS use: source .venv/bin/activate
+source .venv/Scripts/activate
+pip install -r requirements.txt
+```
+2) Inspect processed data and vectorizers (already available in repo):
+- `data/processed/` contains prepared train/valid/test CSVs and labels.
+- `data/vectorizers/` contains the fitted TF-IDF vectorizer and sparse matrices.
+3) Run notebooks (recommended order):
+- `notebooks/01_data_acquisition.ipynb`
+- `notebooks/02_eda.ipynb`
+- `notebooks/03_data_preprocessing.ipynb`
+- `notebooks/04_feature_engineering.ipynb`
+- modeling notebooks, from `notebooks/05_logistic_regression.ipynb` through `notebooks/14_comparsion.ipynb`
+4) Run the demo/app:
+```
+streamlit run app.py
+```
+Notes:
+- Preprocessed artifacts and trained model joblib files are stored under `data/processed`, `data/vectorizers`, and `data/models` to speed up reproduction.
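
The recommended notebook order in step 3 follows the zero-padded numeric prefixes, so it can be derived from the file names. The sketch below assumes Jupyter with `nbconvert` is installed and that the notebooks live in `notebooks/`. The function names are hypothetical, and the glob also picks up the `00_` notebook listed in `docs/01_project_definition/07_structure.md`.

```python
import subprocess
from pathlib import Path


def ordered_notebooks(notebook_dir="notebooks"):
    """Return notebook paths sorted by their two-digit numeric prefix."""
    return sorted(Path(notebook_dir).glob("[0-9][0-9]_*.ipynb"))


def run_notebooks(notebook_dir="notebooks"):
    """Execute each notebook in place, stopping at the first failure."""
    for path in ordered_notebooks(notebook_dir):
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(path)],
            check=True,  # raise CalledProcessError if a notebook fails
        )
```

Re-running every modeling notebook retrains every model, which may take a long time; calling `ordered_notebooks()` first shows what would be executed.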

docs/01_project_definition/07_structure.md CHANGED

@@ -3,47 +3,69 @@
 ```text
 sentiment-analysis-of-amazon-reviews-using-machine-learning/
 ├── data/
-│   ├── raw/             # Original, immutable Kaggle dataset
-│   ├── processed/       # Cleaned data ready for modeling
-│   ├── predictions/     # Model predictions on test set
-│   └── models/          # Saved model files
 |
 ├── docs/
-│   └── 00_research/     #
-|       ├── datasets.md
-|       ├── references.md
-|       └── related_projects.md
 |
-│   └── 01_project_definition/          #
-|       ├── 00_quickstart.md
-|       ├── 01_problem.md
-|       ├── 08_report.md
-|       └── proposal_sentiment_analysis.pdf
 |
-├── notebooks/
-│   ├── 00_quickstart.ipynb     #
-│   ├── 01_logistic_regression.ipynb     #
-│   ├── 02_naive_bayes.ipynb     #
-│   ├── 03_support_vector_machines.ipynb     #
-│   ├── 04_k_nearest_neighbors.ipynb     #
-│   ├── 05_decision_trees.ipynb     #
-│   ├── 06_random_forest.ipynb     #
-│   ├── 07_stochastic_gradient_descent.ipynb     #
-│   └── 08_comparsion.ipynb #
-|
-├── src/                 # Production-style source code
 │   ├── config/
 │   │   ├── __init__.py
-│   │   └── settings     # Configuration files
-│   ├── utils/
-│   |   ├── __init__.py
-│   |   └── helpers.py
-│   └── __init__.py
 |
-├── .env             #
-├── .env.example             #
-├── .gitattributes             #
-├── .gitignore             #
-├── appy.py             #
 ├── README.md            # Project overview and instructions to run
-└── requirements.txt     # List of dependencies (pandas, scikit-learn, etc.)

 ```text
 sentiment-analysis-of-amazon-reviews-using-machine-learning/
 ├── data/
+│   ├── models/          # Saved model files (.joblib)
+│   ├── predictions/     # Model prediction outputs (CSV)
+│   ├── processed/       # Cleaned & feature-engineered datasets
+│   │   ├── processed_train.csv
+│   │   ├── processed_valid.csv
+│   │   ├── processed_test.csv
+│   │   └── feat_eng_train.csv
+│   ├── raw/             # Original immutable dataset
+│   │   ├── train.csv
+│   │   └── test.csv
+│   ├── samples/         # Small sample files for quick testing
+│   └── vectorizers/     # Saved vectorizers and sparse matrices (TF-IDF)
+│       ├── tfidf_vectorizer.joblib
+│       ├── X_train_tfidf.npz
+│       └── X_test_tfidf.npz
 |
 ├── docs/
+│   ├── 00_research/
+│   │   ├── datasets.md
+│   │   ├── references.md
+│   │   └── related_projects.md
+│   └── 01_project_definition/
+│       ├── 00_quickstart.md
+│       ├── 01_problem.md
+│       ├── 02_goal.md
+│       ├── 03_solution.md
+│       ├── 04_stack.md
+│       ├── 05_architecture.md
+│       ├── 06_workflow.md
+│       ├── 07_structure.md
+│       └── 08_report.md
 |
+├── notebooks/
+│   ├── 00_quickstart.ipynb
+│   ├── 01_data_acquisition.ipynb
+│   ├── 02_eda.ipynb
+│   ├── 03_data_preprocessing.ipynb
+│   ├── 04_feature_engineering.ipynb
+│   ├── 05_logistic_regression.ipynb
+│   ├── 06_naive_bayes.ipynb
+│   ├── 07_support_vector_machine.ipynb
+│   ├── 08_k_nearest_neighbors.ipynb
+│   ├── 09_decision_trees.ipynb
+│   ├── 10_random_forest.ipynb
+│   ├── 11_stochastic_gradient_descent.ipynb
+│   ├── 12_xgboost.ipynb
+│   ├── 13_lightgbm.ipynb
+│   └── 14_comparsion.ipynb
 |
+├── src/                 # Production-style source code and helpers
 │   ├── config/
 │   │   ├── __init__.py
+│   │   └── settings.py   # Configuration values and constants
+│   └── utils/
+│       ├── __init__.py
+│       └── helpers.py     # Helper functions used by notebooks and app
 |
+├── .env                 # Environment variables
+├── .env.example         # Example of environment variables
+├── .gitattributes
+├── .gitignore           # Files ignored by Git
+├── app.py               # App/runner for model inference or demo
 ├── README.md            # Project overview and instructions to run
+└── requirements.txt     # List of dependencies (pandas, scikit-learn, etc.)
+```