---
library_name: transformers
license: apache-2.0
base_model: xlm-roberta-base
tags:
- generated_from_trainer
model-index:
- name: bengali-code-mix-sentiment
  results: []
datasets:
- Swarnadeep-28/bn_code_mix_sentiment_dataset
metrics:
- accuracy
- f1
- precision
- recall
---
# Bengali-English Code-Mixed Sentiment Model

## Model Summary
This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets).  

- **Task**: Text Classification (Sentiment Analysis)  
- **Languages**: Bengali (Romanized) + English  
- **Classes**: `0`, `1`, `2`, `3`  
- **Fine-tuning method**: Full fine-tuning  
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)  

This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.

---

## How to Use

### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
labels = ["0", "1", "2", "3"]
print("Predicted label:", labels[pred])
```

---

## Training Details

- **Base model**: `xlm-roberta-base`  
- **Method**: Full fine-tuning (all parameters updated)  
- **Optimizer**: AdamW  
- **Learning Rate**: 2e-5  
- **Epochs**: 3  
- **Batch Size**: 16 (train), 32 (eval)  
- **Hardware**: Trained on a single GPU (Colab T4 / equivalent)  
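
The hyperparameters above map directly onto `transformers`' `TrainingArguments` keyword names. The sketch below is illustrative, not the exact training script; `output_dir` is an assumption, and AdamW is the `Trainer` default optimizer, so it needs no explicit setting.

```python
# Illustrative mapping of the listed hyperparameters to TrainingArguments
# keywords (pass as TrainingArguments(**training_args) together with a Trainer).
training_args = dict(
    output_dir="bengali-code-mix-sentiment",  # assumption: any local path works
    learning_rate=2e-5,                       # AdamW (Trainer default optimizer)
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
)
```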

---

## Evaluation

### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.80      | 0.73   | 0.77     | 528     |
| 1     | 0.73      | 0.73   | 0.73     | 617     |
| 2     | 0.69      | 0.76   | 0.72     | 675     |
| 3     | 0.67      | 0.57   | 0.62     | 182     |

### Overall Metrics
- **Accuracy**: 0.73  
- **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71  
- **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73  
- **Total Samples**: 2002
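
The macro and weighted averages above follow directly from the per-class table: macro averaging treats every class equally, while weighted averaging scales each class's score by its support. This short sketch reproduces both from the table's values.

```python
# Per-class scores from the classification report:
# label -> (precision, recall, f1, support)
per_class = {
    0: (0.80, 0.73, 0.77, 528),
    1: (0.73, 0.73, 0.73, 617),
    2: (0.69, 0.76, 0.72, 675),
    3: (0.67, 0.57, 0.62, 182),
}

total = sum(row[3] for row in per_class.values())  # 2002 evaluation samples

# Macro average: unweighted mean of each metric across the four classes.
macro = [sum(row[i] for row in per_class.values()) / len(per_class)
         for i in range(3)]

# Weighted average: each class's metric weighted by its support.
weighted = [sum(row[i] * row[3] for row in per_class.values()) / total
            for i in range(3)]

print(f"macro    P={macro[0]:.2f} R={macro[1]:.2f} F1={macro[2]:.2f}")
print(f"weighted P={weighted[0]:.2f} R={weighted[1]:.2f} F1={weighted[2]:.2f}")
```

The supports sum to the 2,002 total samples, and the computed averages match the Overall Metrics reported above (up to two-decimal rounding of the per-class scores).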

---

## Applications
- Sentiment classification of Bengali-English social media text  
- Research in **code-mixed NLP for Indic languages**  
- Baseline for comparing parameter-efficient fine-tuning methods (e.g., a LoRA-tuned variant of the same base model)  

---

## Limitations
- Nonstandard romanization or slang-heavy Bengali may reduce accuracy  
- Trained primarily on short-form text (tweets, comments, reviews)  
- Not designed for abusive/toxic content moderation or safety-critical use cases  

---

## Ethical Considerations
- Data reflects natural biases from social media sources  
- Misclassifications are more likely on sarcastic or offensive text  
- Should not be the sole basis for critical decision-making  

---

## Citation
If you use this model, please cite:

```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author    = {Swarnadeep Das},
  title     = {Bengali-English Code-Mixed Sentiment Model},
  year      = {2025},
  url       = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```

---

## Acknowledgements
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)  
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)  
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets)