Commit dd38019 — first commit
Author: msi
Parent: 1e2bf10
Files changed:
- .gitignore       +1   -0
- README.md        +128 -13
- app.py           +103 -0
- data_clean.py    +29  -0
- data_process.py  +40  -0
- requirements.txt +9   -0
.gitignore
ADDED
@@ -0,0 +1 @@
.env
README.md
CHANGED
@@ -1,13 +1,128 @@
# Sentiment-Analysis

A lightweight sentiment-analysis project that demonstrates data preprocessing, model training, evaluation, and inference for text sentiment classification. The repository contains code, example datasets, and utility scripts for building and experimenting with machine-learning and deep-learning approaches to classifying text (e.g., positive, negative, neutral).

## Table of contents

- [Project Overview](#project-overview)
- [Features](#features)
- [Repository structure](#repository-structure)
- [Requirements](#requirements)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
  - [Training a model](#training-a-model)
  - [Evaluating a model](#evaluating-a-model)
  - [Running inference](#running-inference)
- [Modeling notes](#modeling-notes)
- [Best practices & tips](#best-practices--tips)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)

## Project Overview

This project aims to provide a clear, reproducible example of building a sentiment-analysis pipeline:

- load and clean text data,
- convert text into features (tokenization, embeddings, TF-IDF),
- train classification models (baseline and neural),
- evaluate performance with standard metrics,
- run inference on new texts.

It is suitable for learning, experimentation, classroom demos, and small production prototypes.

## Features

- Data-preprocessing utilities (cleaning, tokenization, train/test split).
- Feature-extraction options (TF-IDF, pre-trained embeddings).
- Example classifiers: logistic regression, SVM, and a simple neural network (PyTorch/Keras/TensorFlow, depending on the supplied code).
- Training and evaluation scripts with standard metrics: accuracy, precision, recall, F1, confusion matrix.
- Inference script to classify individual sentences or batch inputs.

## Repository structure

(Adjust paths if your code differs.)

- data/ — example datasets and `.csv` samples (do NOT store large proprietary datasets here).
- src/
  - data_processing.py — cleaning and preprocessing utilities.
  - features.py — TF-IDF and embedding feature builders.
  - models.py — model definitions and wrappers.
  - train.py — training entry point.
  - evaluate.py — evaluation scripts and metrics.
  - predict.py — inference script for new text.
- notebooks/ — exploratory notebooks and experiments.
- requirements.txt — Python dependencies.
- README.md — this file.

## Requirements

- Python 3.8+
- Typical libraries: numpy, pandas, scikit-learn, nltk, transformers (optional), torch or tensorflow (optional)
- See `requirements.txt` for the exact list.

Install with:

```bash
pip install -r requirements.txt
```

## Installation

1. Clone the repo:

   ```bash
   git clone https://github.com/missaouimedamine/Sentiment-Analysis.git
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate   # macOS / Linux
   venv\Scripts\activate      # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Dataset

Provide your dataset in `data/` as a CSV with at least two columns:

- text — the text to classify
- label — the sentiment label (e.g., "positive", "negative", "neutral", or 1/0)

If you plan to use external datasets (e.g., IMDb, SST, Twitter Sentiment), add instructions or scripts to download them into `data/`.

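The expected two-column layout can be sanity-checked with the standard library alone; a small sketch (the sample rows are illustrative, not project data):

```python
import csv
import io

# In-memory sample in the expected text/label layout.
SAMPLE = """text,label
I love this product!,positive
This was a waste of money.,negative
It arrived on time.,neutral
"""

def load_dataset(fp):
    """Read (text, label) pairs from a CSV file object."""
    reader = csv.DictReader(fp)
    return [(row["text"], row["label"]) for row in reader]

rows = load_dataset(io.StringIO(SAMPLE))
texts = [t for t, _ in rows]
labels = [l for _, l in rows]
print(len(rows), sorted(set(labels)))
```

In a real run you would pass `open("data/train.csv")` instead of the `StringIO` sample.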
## Usage

### Training a model

Example (adjust the flags to match your code's CLI, if it differs):

```bash
python src/train.py --data data/train.csv --model-dir models/ --epochs 10 --batch-size 32 --feature tfidf
```

This will:

- load and preprocess the data,
- extract features,
- train the selected model,
- save the trained model and preprocessing artifacts to `models/`.

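The repository's actual `train.py` is not shown in this commit; a minimal baseline in the spirit of the `--feature tfidf` option could look like the following (scikit-learn assumed; the toy corpus is illustrative, not the project's data):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus; a real run would read data/train.csv.
texts = [
    "I love this product, it works perfectly",
    "Absolutely fantastic experience, highly recommend",
    "Terrible quality, it broke after one day",
    "Awful service, very disappointed",
]
labels = ["positive", "positive", "negative", "negative"]

# Bundle vectorizer + classifier so preprocessing travels with the model.
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Persist the whole pipeline, as the --model-dir step would.
blob = pickle.dumps(model)
print(model.predict(["fantastic product, highly recommend"])[0])
```

Pickling the whole `Pipeline` (rather than the classifier alone) is what makes the "save preprocessing artifacts" step above a single object.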
### Evaluating a model

```bash
python src/evaluate.py --data data/test.csv --model models/latest_model.pkl --output results/eval.json
```

This generates metrics (accuracy, precision, recall, F1) and a confusion matrix, saved to the output path.

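For reference, in the binary case these metrics reduce to simple counts over the confusion matrix; a dependency-free sketch (the labels and predictions are made up):

```python
y_true = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]

# Confusion counts for the "pos" class: TP, FP, FN, TN.
tp = sum(t == "pos" and p == "pos" for t, p in zip(y_true, y_pred))
fp = sum(t == "neg" and p == "pos" for t, p in zip(y_true, y_pred))
fn = sum(t == "pos" and p == "neg" for t, p in zip(y_true, y_pred))
tn = sum(t == "neg" and p == "neg" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```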
### Running inference

Single sentence:

```bash
python src/predict.py --model models/latest_model.pkl --text "I love this product!"
```

Batch mode (CSV input):

```bash
python src/predict.py --model models/latest_model.pkl --input data/new_texts.csv --output predictions.csv
```

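Batch mode is plain CSV in, CSV out; a stdlib sketch with a stand-in keyword scorer in place of the real pickled model (`toy_predict` is hypothetical, purely for illustration):

```python
import csv
import io

def toy_predict(text):
    """Stand-in for a loaded model: crude keyword rule, illustration only."""
    return "positive" if "love" in text.lower() else "negative"

incoming = """text
I love this phone
Never buying this again
"""

out = io.StringIO()
reader = csv.DictReader(io.StringIO(incoming))
writer = csv.DictWriter(out, fieldnames=["text", "prediction"])
writer.writeheader()
for row in reader:
    writer.writerow({"text": row["text"], "prediction": toy_predict(row["text"])})

print(out.getvalue())
```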
## Modeling notes

- Baselines: TF-IDF + logistic regression or SVM often make strong baselines for sentiment tasks.
- For higher performance, fine-tune pre-trained transformer encoders (BERT variants).
- Pay attention to class imbalance; consider stratified splitting, class weights, or resampling.
- Monitor overfitting with validation curves and apply regularization / dropout as needed.

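One concrete handle on class imbalance is the standard "balanced" weighting heuristic, weight = n_samples / (n_classes * class_count), which is easy to compute by hand:

```python
from collections import Counter

# Imbalanced toy label set: many negatives, few positives.
labels = ["neg"] * 8 + ["pos"] * 2
counts = Counter(labels)
n_classes = len(counts)

# weight[c] = n_samples / (n_classes * count[c]); rare classes get larger weights.
weights = {c: len(labels) / (n_classes * n) for c, n in counts.items()}
print(weights)
```

This is the same formula scikit-learn applies when a classifier is given `class_weight="balanced"`.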
## Best practices & tips

- Clean and normalize text (lowercasing, trimming extra whitespace, handling emojis where relevant).
- Preserve negation tokens ("not", "never"): they strongly affect sentiment.
- Use consistent label encoding and save the label-to-index mapping with the model.
- Version models and preprocessing steps so results are reproducible.

## Contributing

Contributions are welcome. Typical ways to help:

- Open issues for bugs or feature requests.
- Submit pull requests with bug fixes, additional models, or improved preprocessing.
- Add example notebooks showing experiments and model comparisons.

Before submitting a PR, run the linters / tests if available.

## License

Specify your license here (e.g., MIT). If absent, add a LICENSE file to the repository.

## Contact

Maintainer: missaouimedamine
Project: https://github.com/missaouimedamine/Sentiment-Analysis
app.py
ADDED
@@ -0,0 +1,103 @@
```python
import gradio as gr
from data_clean import clean_text
from data_process import query_hf_sentiment, extract_themes, create_openai_client
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
API_KEY = os.getenv("API_KEY")
HF_MODEL_URL = "https://api-inference.huggingface.co/models/tabularisai/multilingual-sentiment-analysis"


def analyze_post_gradio(post):
    """Gradio handler: clean the post, score its sentiment, and extract themes."""
    try:
        # Clean the text
        cleaned_post = clean_text(post)

        # Get sentiment analysis (first entry of the HF model output)
        sentiment_result = query_hf_sentiment(cleaned_post, API_KEY, HF_MODEL_URL)
        sentiment = sentiment_result['raw_output'][0]

        # Extract themes
        themes = extract_themes(cleaned_post, create_openai_client(API_KEY))

        # Prepare data for storage and display
        data = {
            "post": post,
            "cleaned_post": cleaned_post,
            "sentiment": {
                "label": sentiment['label'],
                "score": sentiment['score']
            },
            "theme": themes
        }

        # Store in MongoDB (disabled for now)
        # insert_post(data)

        # Format output for display
        output = f"""
**Original Post:** {post}

**Cleaned Post:** {cleaned_post}

**Sentiment Analysis:**
- Label: {sentiment['label']}
- Score: {sentiment['score']:.4f}

**Extracted Themes:** {themes}

**Status:** ✅ Post analyzed successfully
"""
        return output

    except Exception as e:
        return f"❌ Error analyzing post: {str(e)}"


# Create Gradio interface
with gr.Blocks(title="Post Analyzer", theme=gr.themes.Soft()) as demo:
    gr.Markdown("# 📊 Post Analysis Tool")
    gr.Markdown("Enter a post to analyze its sentiment and extract themes")

    with gr.Row():
        with gr.Column():
            post_input = gr.Textbox(
                label="Enter your post",
                placeholder="Type your post here...",
                lines=4,
                max_lines=10
            )
            analyze_btn = gr.Button("Analyze Post", variant="primary")

        with gr.Column():
            output = gr.Markdown(label="Analysis Results")

    # Set up the interaction
    analyze_btn.click(
        fn=analyze_post_gradio,
        inputs=post_input,
        outputs=output
    )

    # Add examples
    gr.Examples(
        examples=[
            "I absolutely love this product! It's amazing and works perfectly.",
            "I'm really disappointed with the service. It was slow and unhelpful.",
            "The weather today is nice, but I'm concerned about climate change."
        ],
        inputs=post_input
    )

# Launch the app
if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=True
    )
```
data_clean.py
ADDED
@@ -0,0 +1,29 @@
```python
import re
import emoji
import unicodedata

def clean_text(text):
    # 1. Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)

    # 2. Remove @ mentions
    text = re.sub(r"@\w+", "", text)

    # 3. Strip hashtags but keep the word itself
    text = re.sub(r"#(\w+)", r"\1", text)

    # 4. Remove emojis
    text = emoji.replace_emoji(text, replace="")

    # 5. Remove special characters (keep letters, digits, and basic punctuation)
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,!?;:()\"'-]", " ", text)

    # 6. Normalize to lowercase
    text = text.lower()

    # 7. Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()

    return text
```
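The regex steps of `clean_text` (minus the third-party emoji removal) can be exercised with the standard library alone; a sketch of the same pipeline:

```python
import re

def clean_text_no_emoji(text):
    """Same steps as clean_text, skipping the emoji step (stdlib only)."""
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)                      # @ mentions
    text = re.sub(r"#(\w+)", r"\1", text)                 # hashtag -> bare word
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,!?;:()\"'-]", " ", text)  # special chars
    text = text.lower()                                   # lowercase
    text = re.sub(r"\s+", " ", text).strip()              # collapse spaces
    return text

print(clean_text_no_emoji("Check https://example.com NOW!! @user #GreatDeal"))
```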
data_process.py
ADDED
@@ -0,0 +1,40 @@
```python
import requests
from openai import OpenAI

# === Hugging Face Sentiment Analysis ===

def create_hf_headers(api_key):
    """Create headers for Hugging Face API requests."""
    return {"Authorization": f"Bearer {api_key}"}

def query_hf_sentiment(post, api_key, model_url):
    """Query the Hugging Face sentiment-analysis API for a single post."""
    headers = create_hf_headers(api_key)
    response = requests.post(model_url, headers=headers, json={"inputs": post})
    return {
        "text": post,
        "raw_output": response.json()[0]
    }


# === OpenAI / Hugging Face GPT-based Theme Extraction ===

def create_openai_client(api_key, base_url="https://router.huggingface.co/v1"):
    """Create an OpenAI client pointed at the Hugging Face router."""
    return OpenAI(api_key=api_key, base_url=base_url)

def extract_themes(post, client, prompt=None):
    """Extract the main themes from a post using a GPT model."""
    if prompt is None:
        # French prompt: "Give only the main themes, in one or two words, for the
        # following text, without explanation (e.g., content related to loneliness,
        # psychological support, symptoms, etc.)."
        prompt = ("Donne uniquement les thèmes principaux en un ou deux mots pour le texte suivant, "
                  "sans explication. (par exemple, contenus liés à la solitude, "
                  "au soutien psychologique, aux symptômes, etc.).")

    completion = client.chat.completions.create(
        model="openai/gpt-oss-20b:fireworks-ai",
        messages=[{"role": "user", "content": f"{prompt}\n\nTexte : {post}"}],
    )
    return completion.choices[0].message.content


# === Example Usage ===
```
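As `response.json()[0]` suggests, the sentiment endpoint returns a list of {label, score} dicts per input. `app.py` takes `raw_output[0]`, which assumes the API lists labels sorted by score; an explicit max over scores is slightly more defensive. A stdlib sketch with made-up scores:

```python
# Shape of one element of response.json() from the HF inference API
# (scores here are made up for illustration).
raw_output = [
    {"label": "Positive", "score": 0.91},
    {"label": "Neutral", "score": 0.06},
    {"label": "Negative", "score": 0.03},
]

# Pick the highest-scoring label regardless of ordering.
top = max(raw_output, key=lambda d: d["score"])
print(top["label"], round(top["score"], 2))
```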
requirements.txt
ADDED
@@ -0,0 +1,9 @@
```text
emoji==2.6.0
regex==2023.10.11
requests==2.31.0
openai>=1.0.0
fastapi==0.95.2
pydantic==1.10.12
uvicorn==0.22.0
python-dotenv==1.0.0
pymongo==4.4.0
```