---
library_name: transformers
license: apache-2.0
base_model: xlm-roberta-base
tags:
- generated_from_trainer
model-index:
- name: bengali-code-mix-sentiment
results: []
datasets:
- Swarnadeep-28/bn_code_mix_sentiment_dataset
metrics:
- accuracy
- f1
- precision
- recall
---
# Bengali-English Code-Mixed Sentiment Model
## Model Summary
This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets).
- **Task**: Text Classification (Sentiment Analysis)
- **Languages**: Bengali (Romanized) + English
- **Classes**: `0`, `1`, `2`, `3`
- **Fine-tuning method**: Full fine-tuning
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.
---
## How to Use
### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for deterministic inference

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():  # no gradients needed at inference time
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()

labels = ["0", "1", "2", "3"]
print("Predicted label:", labels[pred])
```
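Alternatively, the Transformers `pipeline` API wraps tokenization, the forward pass, and argmax in one call (a minimal sketch; the exact label strings in the output, e.g. `LABEL_0`, depend on the model's `id2label` config):

```python
from transformers import pipeline

# "text-classification" loads the tokenizer and model from the Hub
classifier = pipeline(
    "text-classification",
    model="Swarnadeep-28/bengali-code-mix-sentiment",
)

# Returns a list of dicts, each with a "label" and a "score" field
result = classifier("Aaj match ta khub bhalo chilo! Loved it.")
print(result)
```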
---
## Training Details
- **Base model**: `xlm-roberta-base`
- **Method**: Full fine-tuning (all parameters updated)
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Epochs**: 3
- **Batch Size**: 16 (train), 32 (eval)
- **Hardware**: Trained on a single GPU (Colab T4 / equivalent)
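The exact training script is not published; the hyperparameters above map roughly onto a standard `transformers.Trainer` configuration like the following (a sketch under those assumptions; `output_dir` and the dataset wiring are placeholders):

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Four output classes, matching the labels 0-3 described above
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=4
)

args = TrainingArguments(
    output_dir="bengali-code-mix-sentiment",  # placeholder path
    learning_rate=2e-5,            # AdamW is the Trainer's default optimizer
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=..., eval_dataset=...)
# trainer.train()
```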
---
## Evaluation
### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0 | 0.80 | 0.73 | 0.77 | 528 |
| 1 | 0.73 | 0.73 | 0.73 | 617 |
| 2 | 0.69 | 0.76 | 0.72 | 675 |
| 3 | 0.67 | 0.57 | 0.62 | 182 |
### Overall Metrics
- **Accuracy**: 0.73
- **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71
- **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73
- **Total Samples**: 2,002
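The macro and weighted averages follow directly from the per-class table: macro averaging weights every class equally, while weighted averaging scales each class by its support. A quick check using the reported F1 scores and support counts:

```python
# Per-class F1 scores and support counts from the classification report above
f1 = {"0": 0.77, "1": 0.73, "2": 0.72, "3": 0.62}
support = {"0": 528, "1": 617, "2": 675, "3": 182}

total = sum(support.values())            # 2002 evaluation samples
macro_f1 = sum(f1.values()) / len(f1)    # unweighted mean over classes
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.2f}")   # 0.71
print(f"weighted F1 = {weighted_f1:.2f}")  # 0.73
```

The gap between the two (0.71 vs. 0.73) reflects the weaker, under-represented class `3` pulling down the macro average.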
---
## Applications
- Sentiment classification of Bengali-English social media text
- Research in **code-mixed NLP for Indic languages**
- Full fine-tuning baseline for comparison with parameter-efficient methods (e.g., a LoRA-tuned variant)
---
## Limitations
- Heavily Romanized or slang-heavy Bengali may reduce accuracy
- Trained primarily on short-form text (tweets, comments, reviews)
- Not designed for abusive/toxic content moderation or safety-critical use cases
---
## Ethical Considerations
- Data reflects natural biases from social media sources
- Misclassifications may occur in sarcasm or offensive text
- Should not be the sole basis for critical decision-making
---
## Citation
If you use this model, please cite:
```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author = {Swarnadeep Das},
  title  = {Bengali-English Code-Mixed Sentiment Model},
  year   = {2025},
  url    = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```
---
## Acknowledgements
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets)