msi committed on
Commit dd38019 · 1 Parent(s): 1e2bf10

first commit

Files changed (6)
  1. .gitignore +1 -0
  2. README.md +128 -13
  3. app.py +103 -0
  4. data_clean.py +29 -0
  5. data_process.py +40 -0
  6. requirements.txt +9 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ .env
README.md CHANGED
@@ -1,13 +1,128 @@
- ---
- title: Sentiment Analysis
- emoji: 📈
- colorFrom: red
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Sentiment-Analysis
+
+ A lightweight sentiment analysis project that demonstrates data preprocessing, model training, evaluation, and inference for text sentiment classification. This repository contains code, example datasets, and utility scripts for building and experimenting with machine-learning and deep-learning approaches to text classification (e.g., positive, negative, neutral).
+
+ ## Table of contents
+ - [Project Overview](#project-overview)
+ - [Features](#features)
+ - [Repository structure](#repository-structure)
+ - [Requirements](#requirements)
+ - [Installation](#installation)
+ - [Dataset](#dataset)
+ - [Usage](#usage)
+   - [Training a model](#training-a-model)
+   - [Evaluating a model](#evaluating-a-model)
+   - [Running inference](#running-inference)
+ - [Modeling notes](#modeling-notes)
+ - [Best practices & tips](#best-practices--tips)
+ - [Contributing](#contributing)
+ - [License](#license)
+ - [Contact](#contact)
+
+ ## Project Overview
+ This project aims to provide a clear, reproducible example of building a sentiment analysis pipeline:
+ - load and clean text data,
+ - convert text into features (tokenization, embeddings, TF-IDF),
+ - train classification models (baseline and neural),
+ - evaluate performance with standard metrics,
+ - run inference on new texts.
+
+ It is suitable for learning, experimentation, classroom demos, and small production prototypes.
+
+ ## Features
+ - Data preprocessing utilities (cleaning, tokenization, train/test split).
+ - Feature extraction options (TF-IDF, pre-trained embeddings).
+ - Example classifiers: logistic regression, SVM, and a simple neural network (PyTorch, Keras, or TensorFlow, depending on the supplied code).
+ - Training and evaluation scripts with standard metrics: accuracy, precision, recall, F1, confusion matrix.
+ - Inference script to classify individual sentences or batch inputs.
+
+ ## Repository structure
+ (Adjust paths if your code differs.)
+ - data/ — example datasets, `.csv` samples (do NOT store large proprietary datasets here).
+ - src/
+   - data_processing.py — cleaning and preprocessing utilities.
+   - features.py — TF-IDF and embedding feature builders.
+   - models.py — model definitions and wrappers.
+   - train.py — training entrypoint.
+   - evaluate.py — evaluation scripts and metrics.
+   - predict.py — inference script for new text.
+ - notebooks/ — exploratory notebooks and experiments.
+ - requirements.txt — Python dependencies.
+ - README.md — this file.
+
+ ## Requirements
+ - Python 3.8+
+ - Typical libraries: numpy, pandas, scikit-learn, nltk, transformers (optional), torch or tensorflow (optional)
+ - See `requirements.txt` for the exact list.
+
+ Install with:
+ pip install -r requirements.txt
+
+ ## Installation
+ 1. Clone the repo:
+    git clone https://github.com/missaouimedamine/Sentiment-Analysis.git
+ 2. Create and activate a virtual environment (recommended):
+    python -m venv venv
+    source venv/bin/activate  # macOS / Linux
+    venv\Scripts\activate     # Windows
+ 3. Install dependencies:
+    pip install -r requirements.txt
+
+ ## Dataset
+ Provide your dataset in data/ as a CSV with at least two columns:
+ - text — the text to classify
+ - label — the sentiment label (e.g., "positive", "negative", "neutral", or 1/0)
+
+ If you plan to use external datasets (e.g., IMDb, SST, Twitter Sentiment), add instructions or scripts to download them into `data/`.
+
+ ## Usage
+
+ ### Training a model
+ Example (replace the flags with your code's actual CLI options):
+ python src/train.py --data data/train.csv --model-dir models/ --epochs 10 --batch-size 32 --feature tfidf
+
+ This will:
+ - load and preprocess the data,
+ - extract features,
+ - train the selected model,
+ - save the trained model and preprocessing artifacts to `models/`.
+
+ ### Evaluating a model
+ python src/evaluate.py --data data/test.csv --model models/latest_model.pkl --output results/eval.json
+
+ Generates metrics (accuracy, precision, recall, F1) and a confusion matrix, saved to the output path.
+
+ ### Running inference
+ Single sentence:
+ python src/predict.py --model models/latest_model.pkl --text "I love this product!"
+
+ Batch mode (CSV input):
+ python src/predict.py --model models/latest_model.pkl --input data/new_texts.csv --output predictions.csv
+
+ ## Modeling notes
+ - Baselines: TF-IDF + logistic regression or SVM often give strong baselines for sentiment tasks.
+ - For higher performance, fine-tune pre-trained transformer encoders (BERT variants).
+ - Pay attention to class imbalance; consider stratified splitting, class weights, or resampling.
+ - Monitor overfitting with validation curves and apply regularization / dropout as needed.
+
+ ## Best practices & tips
+ - Clean and normalize text (lowercasing, removing extra whitespace, handling emojis if relevant).
+ - Preserve tokens like negations ("not", "never") because they strongly affect sentiment.
+ - Use consistent label encoding and save the label-to-index mapping with the model.
+ - Version models and preprocessing steps so results are reproducible.
+
+ ## Contributing
+ Contributions are welcome. Typical ways to help:
+ - Open issues for bugs or feature requests.
+ - Submit pull requests with bug fixes, added models, or improved preprocessing.
+ - Add example notebooks showing experiments and model comparisons.
+
+ Before submitting PRs, run linters / tests if available.
+
+ ## License
+ Specify your license here (e.g., MIT). If no license is specified, add a LICENSE file to the repository.
+
+ ## Contact
+ Maintainer: missaouimedamine
+ Project: https://github.com/missaouimedamine/Sentiment-Analysis
+
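The README's "Modeling notes" recommend TF-IDF plus logistic regression as a baseline. A minimal sketch of that baseline with scikit-learn (the toy texts and labels below are illustrative, not part of this repository):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset; a real run would load data/train.csv instead.
texts = [
    "I love this product",
    "great product, works perfectly",
    "awful service, very slow",
    "I hate it, really disappointed",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features (unigrams + bigrams) feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["I love this product"])[0])
```

Pickling this pipeline (feature extractor and classifier together) is one way to produce the single `models/latest_model.pkl` artifact the usage examples refer to.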
app.py ADDED
@@ -0,0 +1,103 @@
+ import gradio as gr
+ from data_clean import clean_text
+ from data_process import query_hf_sentiment, extract_themes, create_openai_client
+ import os
+ from dotenv import load_dotenv
+
+ # Load environment variables
+ load_dotenv()
+ API_KEY = os.getenv("API_KEY")
+ HF_MODEL_URL = "https://api-inference.huggingface.co/models/tabularisai/multilingual-sentiment-analysis"
+
+
+ def analyze_post_gradio(post):
+     """
+     Gradio version of the analyze_post function
+     """
+     try:
+         # Clean the text
+         cleaned_post = clean_text(post)
+
+         # Get sentiment analysis
+         sentiment_result = query_hf_sentiment(cleaned_post, API_KEY, HF_MODEL_URL)
+         sentiment = sentiment_result['raw_output'][0]
+
+         # Extract themes
+         themes = extract_themes(cleaned_post, create_openai_client(API_KEY))
+
+         # Prepare data for storage and display
+         data = {
+             "post": post,
+             "cleaned_post": cleaned_post,
+             "sentiment": {
+                 "label": sentiment['label'],
+                 "score": sentiment['score']
+             },
+             "theme": themes
+         }
+
+         # Store in MongoDB (disabled in this commit)
+         # insert_post(data)
+
+         # Format output for display
+         output = f"""
+ **Original Post:** {post}
+
+ **Cleaned Post:** {cleaned_post}
+
+ **Sentiment Analysis:**
+ - Label: {sentiment['label']}
+ - Score: {sentiment['score']:.4f}
+
+ **Extracted Themes:** {themes}
+
+ **Status:** ✅ Post analyzed successfully (MongoDB storage is currently disabled)
+ """
+
+         return output
+
+     except Exception as e:
+         return f"❌ Error analyzing post: {str(e)}"
+
+ # Create Gradio interface
+ with gr.Blocks(title="Post Analyzer", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("# 📊 Post Analysis Tool")
+     gr.Markdown("Enter a post to analyze its sentiment and extract themes")
+
+     with gr.Row():
+         with gr.Column():
+             post_input = gr.Textbox(
+                 label="Enter your post",
+                 placeholder="Type your post here...",
+                 lines=4,
+                 max_lines=10
+             )
+             analyze_btn = gr.Button("Analyze Post", variant="primary")
+
+         with gr.Column():
+             output = gr.Markdown(label="Analysis Results")
+
+     # Set up the interaction
+     analyze_btn.click(
+         fn=analyze_post_gradio,
+         inputs=post_input,
+         outputs=output
+     )
+
+     # Add examples
+     gr.Examples(
+         examples=[
+             "I absolutely love this product! It's amazing and works perfectly.",
+             "I'm really disappointed with the service. It was slow and unhelpful.",
+             "The weather today is nice, but I'm concerned about climate change."
+         ],
+         inputs=post_input
+     )
+
+ # Launch the app
+ if __name__ == "__main__":
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=True
+     )
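One fragile spot in app.py: `os.getenv("API_KEY")` silently returns `None` when the variable is missing, which only surfaces later as an authentication error from the API. A small fail-fast helper (a sketch, not part of the app; `require_env` is a hypothetical name) makes that failure mode explicit:

```python
import os

def require_env(name):
    """Return an environment variable's value, failing fast if it is unset."""
    value = os.getenv(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Illustrative only: simulate a .env-provided key for the demo below.
os.environ["API_KEY"] = "dummy-key-for-demo"
print(require_env("API_KEY"))
```

Calling `API_KEY = require_env("API_KEY")` after `load_dotenv()` would turn a confusing remote 401 into an immediate, local error message.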
data_clean.py ADDED
@@ -0,0 +1,29 @@
+
+ import re
+ import emoji
+ import unicodedata
+
+ def clean_text(text):
+     # 1. Remove URLs
+     text = re.sub(r"http\S+|www\S+|https\S+", "", text)
+
+     # 2. Remove @mentions
+     text = re.sub(r"@\w+", "", text)
+
+     # 3. Remove hashtags (keep the word itself, or drop it entirely? Here we keep the word)
+     text = re.sub(r"#(\w+)", r"\1", text)
+
+     # 4. Remove emojis
+     text = emoji.replace_emoji(text, replace="")
+
+     # 5. Remove special characters (keep letters, digits, and basic punctuation)
+     text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,!?;:()\"'-]", " ", text)
+
+     # 6. Normalize to lowercase
+     text = text.lower()
+
+     # 7. Collapse multiple spaces
+     text = re.sub(r"\s+", " ", text).strip()
+
+     return text
+
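The steps in `clean_text` can be sketched end-to-end in a standalone version; this sketch substitutes the step-5 character-class filter for `emoji.replace_emoji` (the filter already strips emoji characters), so it runs without the `emoji` package:

```python
import re

def clean_text_sketch(text):
    # Strip URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # Strip @mentions
    text = re.sub(r"@\w+", "", text)
    # Keep the word of a hashtag, drop the '#'
    text = re.sub(r"#(\w+)", r"\1", text)
    # Drop characters outside letters, digits, and basic punctuation
    # (this also removes emojis, standing in for emoji.replace_emoji)
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,!?;:()\"'-]", " ", text)
    # Lowercase and collapse whitespace
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text_sketch("Check https://x.co @user #Great product 🎉!!"))
```

Note the character class `À-ÿ` keeps accented Latin letters, which matters for the French-language posts this project targets.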
data_process.py ADDED
@@ -0,0 +1,40 @@
+ import requests
+ from openai import OpenAI
+
+ # === Hugging Face Sentiment Analysis ===
+
+ def create_hf_headers(api_key):
+     """Create headers for Hugging Face API requests."""
+     return {"Authorization": f"Bearer {api_key}"}
+
+ def query_hf_sentiment(post, api_key, model_url):
+     """Query the Hugging Face sentiment analysis API for a single post."""
+     headers = create_hf_headers(api_key)
+     response = requests.post(model_url, headers=headers, json={"inputs": post})
+     return {
+         "text": post,
+         "raw_output": response.json()[0]
+     }
+
+
+ # === OpenAI / Hugging Face GPT-based Theme Extraction ===
+
+ def create_openai_client(api_key, base_url="https://router.huggingface.co/v1"):
+     """Create an OpenAI client pointed at the Hugging Face router."""
+     return OpenAI(api_key=api_key, base_url=base_url)
+
+ def extract_themes(post, client, prompt=None):
+     """Extract the main themes of a post using a GPT model."""
+     if prompt is None:
+         # French prompt, roughly: "Give only the main themes, in one or two words,
+         # for the following text, without explanation (e.g., content related to
+         # loneliness, psychological support, symptoms, etc.)."
+         prompt = ("Donne uniquement les thèmes principaux en un ou deux mots pour le texte suivant, "
+                   "sans explication. (par exemple, contenus liés à la solitude, "
+                   "au soutien psychologique, aux symptômes, etc.).")
+
+     completion = client.chat.completions.create(
+         model="openai/gpt-oss-20b:fireworks-ai",
+         messages=[{"role": "user", "content": f"{prompt}\n\nTexte : {post}"}],
+     )
+     return completion.choices[0].message.content
+
+
+ # === Example Usage ===
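To see why `query_hf_sentiment` indexes into `response.json()[0]` and why app.py then takes `raw_output[0]`: text-classification endpoints return a nested list of label/score dicts per input. A sketch with a hypothetical sample response (the labels and scores below are made up for illustration):

```python
# Hypothetical body returned by response.json() for a single input,
# following the usual HF text-classification response shape.
response_json = [[
    {"label": "Positive", "score": 0.91},
    {"label": "Neutral", "score": 0.06},
    {"label": "Negative", "score": 0.03},
]]

# query_hf_sentiment stores response.json()[0] (the per-input label list)
# under "raw_output"...
sentiment_result = {"text": "great product", "raw_output": response_json[0]}

# ...and app.py takes raw_output[0], the highest-scoring label, as the prediction.
sentiment = sentiment_result["raw_output"][0]
print(sentiment["label"], sentiment["score"])
```

This double indexing is a common source of `KeyError`/`IndexError` when the API instead returns an error object (e.g., while the model is loading), so a production version would check the response before indexing.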
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ emoji==2.6.0
+ regex==2023.10.11
+ requests==2.31.0
+ openai>=1.0.0  # data_process.py uses the OpenAI() client, which requires openai>=1.0
+ fastapi==0.95.2
+ pydantic==1.10.12
+ uvicorn==0.22.0
+ python-dotenv==1.0.0
+ pymongo==4.4.0