--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
base_model: xlm-roberta-base |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
model-index: |
|
|
- name: bengali-code-mix-sentiment |
|
|
results: [] |
|
|
datasets: |
|
|
- Swarnadeep-28/bn_code_mix_sentiment_dataset |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
- precision |
|
|
- recall |
|
|
--- |
|
|
# Bengali-English Code-Mixed Sentiment Model |
|
|
|
|
|
## Model Summary |
|
|
This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets). |
|
|
|
|
|
- **Task**: Text Classification (Sentiment Analysis) |
|
|
- **Languages**: Bengali (Romanized) + English |
|
|
- **Classes**: 4 sentiment labels (`0`, `1`, `2`, `3`), following the dataset's integer label scheme
|
|
- **Fine-tuning method**: Full fine-tuning |
|
|
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset) |
|
|
|
|
|
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Inference Example |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
model_id = "Swarnadeep-28/bengali-code-mix-sentiment" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_id) |
|
|
|
|
|
text = "Aaj match ta khub bhalo chilo! Loved it." |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |
|
|
|
|
|
with torch.no_grad(): |
|
|
logits = model(**inputs).logits |
|
|
pred = torch.argmax(logits, dim=-1).item() |
|
|
labels = ["0", "1", "2", "3"]  # integer class names as defined in the dataset
|
|
print("Predicted label:", labels[pred]) |
|
|
``` |
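
If class probabilities are needed rather than a single label, the logits can be converted with a softmax. The sketch below uses a dummy logits tensor so it runs without downloading the model; with the real model, `logits = model(**inputs).logits` from the example above drops in directly.

```python
import torch
import torch.nn.functional as F

# Stand-in for `model(**inputs).logits` from the example above
# (shape: batch_size x num_labels, here 1 x 4).
logits = torch.tensor([[1.2, -0.3, 2.1, 0.4]])

probs = F.softmax(logits, dim=-1)          # per-class probabilities
pred = torch.argmax(probs, dim=-1).item()  # index of the most likely class

print("Probabilities:", probs.squeeze().tolist())
print("Predicted label:", pred)
```

The probability vector is useful for thresholding low-confidence predictions before acting on them.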
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base model**: `xlm-roberta-base` |
|
|
- **Method**: Full fine-tuning (all parameters updated) |
|
|
- **Optimizer**: AdamW |
|
|
- **Learning Rate**: 2e-5 |
|
|
- **Epochs**: 3 |
|
|
- **Batch Size**: 16 (train), 32 (eval) |
|
|
- **Hardware**: Trained on a single GPU (Colab T4 / equivalent) |
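
As a minimal illustration of what "full fine-tuning" means here (every parameter receives gradients and is updated by AdamW at the stated learning rate), the sketch below runs one optimization step on a stand-in linear classifier; the real setup updates all of `xlm-roberta-base` plus the 4-way classification head in the same way.

```python
import torch

# Stand-in for the full model; in practice this is xlm-roberta-base
# plus a 4-way classification head, with all parameters trainable.
model = torch.nn.Linear(768, 4)
assert all(p.requires_grad for p in model.parameters())  # nothing is frozen

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr from the list above

# One step on a dummy batch of 16 (the training batch size).
x = torch.randn(16, 768)
y = torch.randint(0, 4, (16,))

before = model.weight.detach().clone()
loss = torch.nn.functional.cross_entropy(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()  # every parameter moves, since none are frozen
```

This contrasts with parameter-efficient methods, where most base-model weights would stay frozen.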
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Classification Report |
|
|
| Label | Precision | Recall | F1-Score | Support | |
|
|
|-------|-----------|--------|----------|---------| |
|
|
| 0 | 0.80 | 0.73 | 0.77 | 528 | |
|
|
| 1 | 0.73 | 0.73 | 0.73 | 617 | |
|
|
| 2 | 0.69 | 0.76 | 0.72 | 675 | |
|
|
| 3 | 0.67 | 0.57 | 0.62 | 182 | |
|
|
|
|
|
### Overall Metrics |
|
|
- **Accuracy**: 0.73 |
|
|
- **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71 |
|
|
- **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73 |
|
|
- **Total Samples**: 2002 |
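
The macro and weighted averages follow directly from the per-class table: the macro average is the unweighted mean over the four classes, while the weighted average weights each class by its support. A quick check in Python, with values copied from the table above:

```python
# Per-class F1 scores and supports from the classification report.
f1 = [0.77, 0.73, 0.72, 0.62]
support = [528, 617, 675, 182]

macro_f1 = sum(f1) / len(f1)                                      # unweighted mean
weighted_f1 = sum(s * v for s, v in zip(support, f1)) / sum(support)  # support-weighted

print(round(macro_f1, 2))     # 0.71
print(round(weighted_f1, 2))  # 0.73
```

The gap between the two reflects the small, weaker class `3` (support 182), which pulls the macro average down more than the weighted one.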
|
|
|
|
|
--- |
|
|
|
|
|
## Applications |
|
|
- Sentiment classification of Bengali-English social media text |
|
|
- Research in **code-mixed NLP for Indic languages** |
|
|
- A full fine-tuning baseline for comparison with parameter-efficient approaches (e.g., a LoRA-tuned variant)
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
- Accuracy may degrade on heavily Romanized or slang-heavy Bengali
|
|
- Trained primarily on short-form text (tweets, comments, reviews) |
|
|
- Not designed for abusive/toxic content moderation or safety-critical use cases |
|
|
|
|
|
--- |
|
|
|
|
|
## Ethical Considerations |
|
|
- The training data reflects biases naturally present in its social media sources
|
|
- Misclassifications are more likely on sarcastic or offensive text
|
|
- Should not be the sole basis for critical decision-making |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{das2025_bn_code_mix_sentiment,
|
|
author = {Swarnadeep Das}, |
|
|
title = {Bengali-English Code-Mixed Sentiment Model}, |
|
|
year = {2025}, |
|
|
url = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements |
|
|
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset) |
|
|
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) |
|
|
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets) |