---
library_name: transformers
license: apache-2.0
base_model: xlm-roberta-base
tags:
- generated_from_trainer
model-index:
- name: bengali-code-mix-sentiment
results: []
datasets:
- Swarnadeep-28/bn_code_mix_sentiment_dataset
metrics:
- accuracy
- f1
- precision
- recall
---
# Bengali-English Code-Mixed Sentiment Model
## Model Summary
This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets).
- **Task**: Text Classification (Sentiment Analysis)
- **Languages**: Bengali (Romanized) + English
- **Classes**: `0`, `1`, `2`, `3` (integer labels as annotated in the dataset)
- **Fine-tuning method**: Full fine-tuning
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.
---
## How to Use
### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()
print("Predicted label:", pred)  # integer class id in {0, 1, 2, 3}
```
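The example above takes the argmax of the raw logits; if you need a probability per class instead, apply a softmax first. The sketch below shows the computation in plain Python on illustrative logits (the numbers are made up, not actual model output):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the four classes (not real model output)
logits = [2.1, -0.3, 0.8, -1.2]
probs = softmax(logits)

# The predicted class is the index with the highest probability
pred = max(range(len(probs)), key=probs.__getitem__)
print("Probabilities:", probs)
print("Predicted class:", pred)
```

With the real model, the same transformation applies to `model(**inputs).logits` (e.g. `torch.softmax(logits, dim=-1)`).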
---
## Training Details
- **Base model**: `xlm-roberta-base`
- **Method**: Full fine-tuning (all parameters updated)
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Epochs**: 3
- **Batch Size**: 16 (train), 32 (eval)
- **Hardware**: Trained on a single GPU (Colab T4 / equivalent)
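The setup above can be sketched with the `Trainer` API. This is an illustrative reconstruction from the listed hyperparameters, not the exact training script; the column name `"text"` and the split names are assumptions — check the dataset card before running:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

base = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=4)

# Column/split names here are assumptions; see the dataset card
ds = load_dataset("Swarnadeep-28/bn_code_mix_sentiment_dataset")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="bengali-code-mix-sentiment",
    learning_rate=2e-5,                 # hyperparameters as reported above
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    optim="adamw_torch",                # AdamW optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],            # split name is an assumption
    tokenizer=tokenizer,                # enables dynamic padding via the default collator
)
trainer.train()
```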
---
## Evaluation
### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0 | 0.80 | 0.73 | 0.77 | 528 |
| 1 | 0.73 | 0.73 | 0.73 | 617 |
| 2 | 0.69 | 0.76 | 0.72 | 675 |
| 3 | 0.67 | 0.57 | 0.62 | 182 |
### Overall Metrics
- **Accuracy**: 0.73
- **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71
- **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73
- **Total Samples**: 2002
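The macro and weighted averages follow directly from the per-class table; the check below recomputes them in plain Python (numbers copied from the table above):

```python
# Per-class (precision, recall, f1, support) copied from the table above
report = {
    0: (0.80, 0.73, 0.77, 528),
    1: (0.73, 0.73, 0.73, 617),
    2: (0.69, 0.76, 0.72, 675),
    3: (0.67, 0.57, 0.62, 182),
}

total = sum(s for *_, s in report.values())  # total evaluation samples

# Macro average: unweighted mean over the four classes
macro = [sum(row[i] for row in report.values()) / len(report)
         for i in range(3)]

# Weighted average: each class weighted by its support
weighted = [sum(row[i] * row[3] for row in report.values()) / total
            for i in range(3)]

print("Total samples:", total)
print(f"Macro    P/R/F1: {macro[0]:.2f} {macro[1]:.2f} {macro[2]:.2f}")
print(f"Weighted P/R/F1: {weighted[0]:.2f} {weighted[1]:.2f} {weighted[2]:.2f}")
```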
---
## Applications
- Sentiment classification of Bengali-English social media text
- Research in **code-mixed NLP for Indic languages**
- Full fine-tuning baseline for comparison with parameter-efficient methods (e.g., a LoRA-adapted variant)
---
## Limitations
- Accuracy may drop on nonstandard Romanization or slang-heavy Bengali
- Trained primarily on short-form text (tweets, comments, reviews)
- Not designed for abusive/toxic content moderation or safety-critical use cases
---
## Ethical Considerations
- Data reflects natural biases from social media sources
- Misclassifications may occur in sarcasm or offensive text
- Should not be the sole basis for critical decision-making
---
## Citation
If you use this model, please cite:
```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author = {Swarnadeep Das},
  title  = {Bengali-English Code-Mixed Sentiment Model},
  year   = {2025},
  url    = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```
---
## Acknowledgements
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets)