---
library_name: transformers
license: apache-2.0
base_model: xlm-roberta-base
tags:
- generated_from_trainer
model-index:
- name: bengali-code-mix-sentiment
results: []
datasets:
- Swarnadeep-28/bn_code_mix_sentiment_dataset
metrics:
- accuracy
- f1
- precision
- recall
---
# Bengali-English Code-Mixed Sentiment Model
## Model Summary
This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets).
- **Task**: Text Classification (Sentiment Analysis)
- **Languages**: Bengali (Romanized) + English
- **Classes**: `0`, `1`, `2`, `3`
- **Fine-tuning method**: Full fine-tuning
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.
---
## How to Use
### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for deterministic inference

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():  # no gradients needed at inference time
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()

labels = ["0", "1", "2", "3"]
print("Predicted label:", labels[pred])
```
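Alternatively, the Transformers `pipeline` API wraps tokenization, the forward pass, and argmax in one call (a minimal sketch; the exact label strings in the output, e.g. `LABEL_0`, depend on the model's `id2label` config):

```python
from transformers import pipeline

# "text-classification" loads the tokenizer and model from the Hub
classifier = pipeline(
    "text-classification",
    model="Swarnadeep-28/bengali-code-mix-sentiment",
)

# Returns a list of dicts, each with a "label" and a "score" field
result = classifier("Aaj match ta khub bhalo chilo! Loved it.")
print(result)
```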
---
## Training Details
- **Base model**: `xlm-roberta-base`
- **Method**: Full fine-tuning (all parameters updated)
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Epochs**: 3
- **Batch Size**: 16 (train), 32 (eval)
- **Hardware**: Trained on a single GPU (Colab T4 / equivalent)
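The exact training script is not published; the hyperparameters above map roughly onto a standard `transformers.Trainer` configuration like the following (a sketch under those assumptions; `output_dir` and the dataset wiring are placeholders):

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Four output classes, matching the labels 0-3 described above
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=4
)

args = TrainingArguments(
    output_dir="bengali-code-mix-sentiment",  # placeholder path
    learning_rate=2e-5,            # AdamW is the Trainer's default optimizer
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=..., eval_dataset=...)
# trainer.train()
```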
---
## Evaluation
### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0 | 0.80 | 0.73 | 0.77 | 528 |
| 1 | 0.73 | 0.73 | 0.73 | 617 |
| 2 | 0.69 | 0.76 | 0.72 | 675 |
| 3 | 0.67 | 0.57 | 0.62 | 182 |
### Overall Metrics
- **Accuracy**: 0.73
- **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71
- **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73
- **Total Samples**: 2,002
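The macro and weighted averages follow directly from the per-class table: macro averaging weights every class equally, while weighted averaging scales each class by its support. A quick check using the reported F1 scores and support counts:

```python
# Per-class F1 scores and support counts from the classification report above
f1 = {"0": 0.77, "1": 0.73, "2": 0.72, "3": 0.62}
support = {"0": 528, "1": 617, "2": 675, "3": 182}

total = sum(support.values())            # 2002 evaluation samples
macro_f1 = sum(f1.values()) / len(f1)    # unweighted mean over classes
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.2f}")   # 0.71
print(f"weighted F1 = {weighted_f1:.2f}")  # 0.73
```

The gap between the two (0.71 vs. 0.73) reflects the weaker, under-represented class `3` pulling down the macro average.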
---
## Applications
- Sentiment classification of Bengali-English social media text
- Research in **code-mixed NLP for Indic languages**
- Full fine-tuning baseline for comparison with parameter-efficient methods (e.g., a LoRA-tuned variant)
---
## Limitations
- Heavily Romanized or slang-heavy Bengali may reduce accuracy
- Trained primarily on short-form text (tweets, comments, reviews)
- Not designed for abusive/toxic content moderation or safety-critical use cases
---
## Ethical Considerations
- Data reflects natural biases from social media sources
- Misclassifications may occur in sarcasm or offensive text
- Should not be the sole basis for critical decision-making
---
## Citation
If you use this model, please cite:
```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author = {Swarnadeep Das},
  title  = {Bengali-English Code-Mixed Sentiment Model},
  year   = {2025},
  url    = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```
---
## Acknowledgements
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets)