---
license: apache-2.0
datasets:
- SayedShaun/sentigold
language:
- bn
metrics:
- accuracy
- f1
base_model:
- csebuetnlp/banglabert
pipeline_tag: text-classification
tags:
- sentiment-analysis
- bengali
- bangla
- multiclass-classification
---

# BanglaBERT Fine-tuned for Bangla Sentiment Analysis

## Model Description

This model is a fine-tuned version of [`csebuetnlp/banglabert`](https://huggingface.co/csebuetnlp/banglabert) on the SentiGOLD dataset for 5-class sentiment analysis in Bengali. It classifies text into one of five sentiment classes:

1. 😠 Very Negative (SN)
2. 😞 Negative (WN)
3. 😐 Neutral (N)
4. 😊 Positive (WP)
5. 😍 Very Positive (SP)

**Key Features:**
- Builds on BanglaBERT, a state-of-the-art pretrained Bangla language model
- Handles both formal and informal Bengali text
- Optimized for social media posts, reviews, and customer feedback
- Requires text normalization with the [Bangla Normalizer](https://github.com/csebuetnlp/normalizer) before inference

## Intended Uses & Limitations

### Primary Use
- Sentiment analysis of Bengali text
- Social media monitoring
- Customer feedback analysis
- Product review classification

### Limitations
- Performance may degrade on code-mixed (Bengali-English) text
- May struggle with sarcasm and highly contextual expressions
- Best suited to short and medium-length texts (up to 512 tokens)

## Training Data

The model was fine-tuned on [**SentiGOLD**](https://arxiv.org/pdf/2306.06147), the largest gold-standard Bangla sentiment analysis dataset:

| Feature | Value |
| --- | --- |
| Total Samples | 70,000 |
| Domains Covered | 30+ |
| Source Diversity | Social media, news, blogs, reviews |
| Class Distribution | Balanced across 5 classes |
| Annotation Quality | Fleiss' kappa = 0.88 |

## Training Procedure

### Hyperparameters

| Parameter | Value |
| --- | --- |
| Learning Rate | 2e-5 → 1.05e-6 |
| Batch Size | 48 |
| Epochs | 5 |
| Optimizer | AdamW |
| Scheduler | ReduceLROnPlateau |
| Weight Decay | 0.01 |
| Gradient Accumulation | 4 steps |
| Warmup Ratio | 5% |

With a per-device batch size of 48 and gradient accumulation over 4 steps, the effective batch size is 192.

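The ReduceLROnPlateau schedule above can be illustrated with a minimal sketch of its logic: the learning rate is cut only when validation loss stops improving for a number of epochs. The `factor`, `patience`, and `min_lr` values below are illustrative assumptions, not the values used in training.

```python
# Illustrative sketch (not the training script): reduce-on-plateau lowers the
# learning rate when validation loss stops improving for `patience` epochs.
def reduce_lr_on_plateau(losses, lr=2e-5, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate after stepping through `losses` epoch by epoch."""
    best = float("inf")
    bad_epochs = 0
    for loss in losses:
        if loss < best:
            best = loss          # improvement: reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs > patience:
                lr = max(lr * factor, min_lr)  # cut the learning rate
                bad_epochs = 0
    return lr

# Validation loss plateaus after epoch 2, so the rate is halved once: 2e-5 -> 1e-5.
print(reduce_lr_on_plateau([0.9, 0.7, 0.7, 0.71, 0.72]))
```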
### Techniques

* Class-weighted loss to handle class imbalance
* Early stopping (patience = 3)
* Mixed-precision (FP16) training
* Gradient checkpointing
* Text normalization using the Bangla Normalizer

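The class-weighted loss can be sketched as follows. This assumes a standard inverse-frequency weighting scheme, which is not necessarily the exact formula used for this model; in a PyTorch setup the resulting weights would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`.

```python
# Assumed inverse-frequency class weighting: rare classes get larger weights
# so the loss does not favour frequent classes.
from collections import Counter

def class_weights(labels, num_classes=5):
    counts = Counter(labels)
    total = len(labels)
    # weight_c = total / (num_classes * count_c)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]

# Toy label distribution (0 = Very Negative ... 4 = Very Positive)
labels = [0] * 10 + [1] * 20 + [2] * 40 + [3] * 20 + [4] * 10
print(class_weights(labels))  # [2.0, 1.0, 0.5, 1.0, 2.0]
```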
## Evaluation Results

### Validation Performance

| Epoch | F1 (Macro) | Accuracy | Very Neg F1 | Neg F1 | Neu F1 | Pos F1 | Very Pos F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.6334 | 0.6331 | 0.6789 | 0.5834 | 0.6407 | 0.5635 | 0.7004 |
| 5 | 0.6537 | 0.6551 | 0.7081 | 0.6157 | 0.6421 | 0.5789 | 0.7236 |

### Final Test Performance

| Metric | Score |
| --- | --- |
| Macro F1 | 0.6660 |
| Accuracy | 0.6671 |

## How to Use

### Direct Inference

```python
from transformers import pipeline
from normalizer import normalize  # https://github.com/csebuetnlp/normalizer

# Load model
classifier = pipeline(
    "text-classification",
    model="ahs95/banglabert-sentiment-analysis",
    tokenizer="ahs95/banglabert-sentiment-analysis"
)

# Prepare text
text = "আপনার পণ্যটি অসাধারণ! আমি খুবই সন্তুষ্ট।"
normalized_text = normalize(text)  # Important for BanglaBERT

# Classify
result = classifier(normalized_text)
print(f"Sentiment: {result[0]['label']} (Confidence: {result[0]['score']:.2f})")
```

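Depending on whether `id2label` was set in the model's config, the pipeline may return generic labels such as `LABEL_0` instead of readable names. A small, hypothetical mapping (it assumes label ids follow the five-class order listed above; verify against the model's `config.json` before relying on it) can translate them:

```python
# Hypothetical mapping: assumes label ids follow the 5-class order above
# (0 = Very Negative ... 4 = Very Positive); check config.json (id2label).
ID2SENTIMENT = {
    "LABEL_0": "Very Negative",
    "LABEL_1": "Negative",
    "LABEL_2": "Neutral",
    "LABEL_3": "Positive",
    "LABEL_4": "Very Positive",
}

def readable_label(pipeline_label):
    # Fall back to the raw label if the model already emits readable names
    return ID2SENTIMENT.get(pipeline_label, pipeline_label)

print(readable_label("LABEL_3"))  # Positive
```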
### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from normalizer import normalize

# Load model and tokenizer
model_name = "ahs95/banglabert-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare inputs
texts = [
    "সেবা খুব খারাপ ছিল। আমি কখনো ফিরে আসব না।",
    "পণ্যটির গুণগত মান মোটামুটি ভাল"
]
normalized_texts = [normalize(t) for t in texts]

# Tokenize and predict
inputs = tokenizer(normalized_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predictions
sentiment_labels = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
predictions = [sentiment_labels[p] for p in probabilities.argmax(dim=1)]

for text, pred in zip(texts, predictions):
    print(f"Text: {text}\nPredicted Sentiment: {pred}\n")
```

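As an optional post-processing idea (not part of the released model), the 5-way probabilities can be collapsed into a coarse negative/neutral/positive verdict by summing adjacent classes:

```python
# Post-processing sketch: collapse 5-way probabilities into a coarse
# negative / neutral / positive verdict by summing adjacent classes.
def coarse_sentiment(probs):
    """probs: [very_neg, neg, neu, pos, very_pos], summing to 1."""
    scores = {
        "negative": probs[0] + probs[1],
        "neutral": probs[2],
        "positive": probs[3] + probs[4],
    }
    return max(scores, key=scores.get)

print(coarse_sentiment([0.05, 0.10, 0.20, 0.40, 0.25]))  # positive
```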
## Ethical Considerations

- **Bias:** While SentiGOLD reduces bias through synthetic data, real-world validation is recommended
- **Use Cases:** Suitable for:
  * Product feedback analysis
  * Social media monitoring
  * Market research
- **Avoid:** Critical decision systems without human oversight

## Citation

If you use this model, please cite:

```bibtex
@misc{banglabert-sentiment,
  author = {Arshadul Hoque},
  title = {Fine-tuned BanglaBERT for Bengali Sentiment Analysis},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ahs95/banglabert-sentiment-analysis}}
}
```

## Contact

For questions and support: ahsbd95@gmail.com