# Bengali-English Code-Mixed Sentiment Model

## Model Summary
This model is a fine-tuned version of xlm-roberta-base for sentiment analysis on Bengali–English code-mixed text (social media posts, comments, and tweets).
- Task: Text Classification (Sentiment Analysis)
- Languages: Bengali (Romanized) + English
- Classes: 0, 1, 2, 3 (four sentiment labels)
- Fine-tuning method: Full fine-tuning
- Dataset: Bengali-English Code-Mixed Sentiment Dataset
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.
## How to Use

### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class (labels 0-3).
pred = torch.argmax(logits, dim=-1).item()
labels = ["0", "1", "2", "3"]
print("Predicted label:", labels[pred])
```
## Training Details
- Base model: xlm-roberta-base
- Method: Full fine-tuning (all parameters updated); a minimal training sketch follows this list
- Optimizer: AdamW
- Learning Rate: 2e-5
- Epochs: 3
- Batch Size: 16 (train), 32 (eval)
- Hardware: Trained on a single GPU (Colab T4 / equivalent)
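The original training script is not included in this card. The following is a hedged sketch of an equivalent run with the `Trainer` API using the hyperparameters listed above; the in-memory dataset, its `text`/`label` column names, and the example label values are illustrative assumptions, not the actual training data.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_id = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForSequenceClassification.from_pretrained(base_id, num_labels=4)

# Tiny in-memory stand-in for the real code-mixed dataset (labels 0-3 are illustrative).
train_ds = Dataset.from_dict({
    "text": ["Aaj match ta khub bhalo chilo! Loved it.", "Ei service ta ekdom bhalo na."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bengali-code-mix-sentiment",
    learning_rate=2e-5,               # AdamW is the Trainer default optimizer
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```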
## Evaluation

### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.80 | 0.73 | 0.77 | 528 |
| 1 | 0.73 | 0.73 | 0.73 | 617 |
| 2 | 0.69 | 0.76 | 0.72 | 675 |
| 3 | 0.67 | 0.57 | 0.62 | 182 |
### Overall Metrics
- Accuracy: 0.73
- Macro Avg: Precision = 0.72, Recall = 0.70, F1 = 0.71
- Weighted Avg: Precision = 0.73, Recall = 0.73, F1 = 0.73
- Total Samples: 2002
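Per-class metrics like those above can be recomputed from model predictions on a held-out split with scikit-learn. This is a minimal sketch; the two placeholder texts and their labels are assumptions standing in for the real evaluation data.

```python
import torch
from sklearn.metrics import classification_report
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Placeholder evaluation data; replace with the actual held-out split.
texts = ["Aaj match ta khub bhalo chilo! Loved it.", "Ei service ta ekdom bhalo na."]
true_labels = [1, 0]  # illustrative labels in the 0-3 scheme

with torch.no_grad():
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
    preds = model(**inputs).logits.argmax(dim=-1).tolist()

# Prints per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(true_labels, preds, digits=2))
```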
## Applications
- Sentiment classification of Bengali-English social media text
- Research in code-mixed NLP for Indic languages
- Benchmark for parameter-efficient fine-tuning (compare with LoRA model)
## Limitations
- Non-standard romanization or heavy slang in Bengali may reduce accuracy
- Trained primarily on short-form text (tweets, comments, reviews)
- Not designed for abusive/toxic content moderation or safety-critical use cases
## Ethical Considerations
- Data reflects natural biases from social media sources
- Misclassifications may occur in sarcasm or offensive text
- Should not be the sole basis for critical decision-making
## Citation
If you use this model, please cite:
```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author = {Swarnadeep Das},
  title  = {Bengali-English Code-Mixed Sentiment Model},
  year   = {2025},
  url    = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```
## Acknowledgements
- Dataset: Bengali-English Code-Mixed Sentiment Dataset
- Base model: xlm-roberta-base
- Frameworks: Transformers, Datasets