---
license: apache-2.0
datasets:
- SayedShaun/sentigold
language:
- bn
metrics:
- accuracy
- f1
base_model:
- csebuetnlp/banglabert
pipeline_tag: text-classification
tags:
- sentiment-analysis
- bengali
- bangla
- multiclass-classification
library_name: transformers
---

# BanglaBERT Fine-tuned for Bangla Sentiment Analysis

## Model Description

This model is a fine-tuned version of [`csebuetnlp/banglabert`](https://huggingface.co/csebuetnlp/banglabert) on the [SentiGOLD](https://arxiv.org/pdf/2306.06147) dataset for 5-class sentiment analysis in Bengali. It classifies text into:

1. 😠 Very Negative (SN)
2. 😞 Negative (WN)
3. 😐 Neutral (N)
4. 😊 Positive (WP)
5. 😍 Very Positive (SP)

**Key Features:**

- State-of-the-art Bangla language understanding
- Handles both formal and informal Bengali text
- Optimized for social media, reviews, and customer feedback
- Requires text normalization with the [Bangla Normalizer](https://github.com/csebuetnlp/normalizer)

## Intended Uses & Limitations

### Primary Use

- Sentiment analysis of Bengali text
- Social media monitoring
- Customer feedback analysis
- Product review classification

### Limitations

- Performance may degrade on code-mixed (Bengali-English) text
- May struggle with sarcasm and highly contextual expressions
- Best suited to short and medium-length texts (up to 512 tokens)

## Training Data

The model was fine-tuned on **SentiGOLD**, the largest gold-standard Bangla sentiment analysis dataset:

| Feature | Value |
|---|---|
| Total Samples | 70,000 |
| Domains Covered | 30+ |
| Source Diversity | Social media, news, blogs, reviews |
| Class Distribution | Balanced across 5 classes |
| Annotation Quality | Fleiss' kappa = 0.88 |

## Training Procedure

### Hyperparameters

| Parameter | Value |
| --- | --- |
| Learning Rate | 2e-5 → 1.05e-6 |
| Batch Size | 48 |
| Epochs | 5 |
| Optimizer | AdamW |
| Scheduler | ReduceLROnPlateau |
| Weight Decay | 0.01 |
| Gradient Accumulation | 4 steps |
| Warmup Ratio | 5% |

### Techniques

* Class-weighted loss to handle residual class imbalance
* Early stopping (patience = 3)
* Mixed-precision (FP16) training
* Gradient checkpointing
* Text normalization with the Bangla Normalizer

## Evaluation Results

### Validation Performance

| Epoch | F1 (Macro) | Accuracy | Very Neg F1 | Neg F1 | Neu F1 | Pos F1 | Very Pos F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.6334 | 0.6331 | 0.6789 | 0.5834 | 0.6407 | 0.5635 | 0.7004 |
| 5 | 0.6537 | 0.6551 | 0.7081 | 0.6157 | 0.6421 | 0.5789 | 0.7236 |

### Final Test Performance

| Metric | Score |
| --- | --- |
| Macro F1 | 0.6660 |
| Accuracy | 0.6671 |

## How to Use

### Direct Inference

```python
from transformers import pipeline
from normalizer import normalize

# Load the fine-tuned model
classifier = pipeline(
    "text-classification",
    model="ahs95/banglabert-sentiment-analysis",
    tokenizer="ahs95/banglabert-sentiment-analysis"
)

# Prepare text
text = "আপনার পণ্যটি অসাধারণ! আমি খুবই সন্তুষ্ট।"  # "Your product is amazing! I am very satisfied."
normalized_text = normalize(text)  # Important for BanglaBERT

# Classify
result = classifier(normalized_text)
print(f"Sentiment: {result[0]['label']} (Confidence: {result[0]['score']:.2f})")
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from normalizer import normalize

# Load model and tokenizer
model_name = "ahs95/banglabert-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare inputs
texts = [
    "সেবা খুব খারাপ ছিল। আমি কখনো ফিরে আসব না।",  # "The service was very bad. I will never come back."
    "পণ্যটির গুণগত মান মোটামুটি ভাল"  # "The product's quality is fairly good."
]
normalized_texts = [normalize(t) for t in texts]

# Tokenize and predict
inputs = tokenizer(normalized_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Map predictions to human-readable labels
sentiment_labels = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
predictions = [sentiment_labels[p] for p in probabilities.argmax(dim=1)]

for text, pred in zip(texts, predictions):
    print(f"Text: {text}\nPredicted Sentiment: {pred}\n")
```

### Ethical Considerations

- **Bias:** While SentiGOLD reduces bias through synthetic data, real-world validation is recommended
- **Use Cases:** Suitable for:
  * Product feedback analysis
  * Social media monitoring
  * Market research
- **Avoid:** Critical decision systems without human oversight

### Citation

If you use this model, please cite:

```bibtex
@misc{banglabert-sentiment,
  author = {Arshadul Hoque},
  title = {Fine-tuned BanglaBERT for Bengali Sentiment Analysis},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ahs95/banglabert-sentiment-analysis}}
}
```

### Contact

For questions and support: ahsbd95@gmail.com
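### Appendix: Class-Weighted Loss Sketch

The Techniques section above mentions a class-weighted loss but does not show it. The snippet below is a minimal illustrative sketch of the common inverse-frequency weighting scheme with PyTorch's `CrossEntropyLoss`; the per-class sample counts are hypothetical (SentiGOLD is roughly balanced, so real weights would sit close to 1.0), and this is not the actual training code for this model.

```python
import torch
import torch.nn as nn

# Hypothetical per-class sample counts for the 5 sentiment classes
# (illustrative only; not taken from the real training run)
class_counts = torch.tensor([14000.0, 13500.0, 14500.0, 13000.0, 15000.0])

# Inverse-frequency weighting: total_samples / (num_classes * count_per_class),
# so under-represented classes receive weights above 1.0
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Plug the weights into the cross-entropy loss used for fine-tuning
loss_fn = nn.CrossEntropyLoss(weight=weights)

# Dummy batch: logits for 4 samples over 5 classes, with integer labels
logits = torch.randn(4, 5)
labels = torch.tensor([0, 2, 4, 1])
loss = loss_fn(logits, labels)

print(f"Class weights: {[round(w, 4) for w in weights.tolist()]}")
print(f"Weighted loss: {loss.item():.4f}")
```

With balanced counts all weights would equal 1.0 and this reduces to plain cross-entropy; the weighting only matters when the label distribution drifts from uniform.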