# Bengali-English Code-Mixed Sentiment Model

## Model Summary
This model is a fine-tuned version of xlm-roberta-base for sentiment analysis on Bengali–English code-mixed text (social media posts, comments, and tweets).
- Task: Text Classification (Sentiment Analysis)
- Languages: Bengali (Romanized) + English
- Classes: 0, 1, 2, 3 (four sentiment labels)
- Fine-tuning method: Full fine-tuning
- Dataset: Bengali-English Code-Mixed Sentiment Dataset
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.
## How to Use

### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class (labels 0-3).
pred = torch.argmax(logits, dim=-1).item()
labels = ["0", "1", "2", "3"]
print("Predicted label:", labels[pred])
```
## Training Details
- Base model: xlm-roberta-base
- Method: Full fine-tuning (all parameters updated); a minimal training sketch follows this list
- Optimizer: AdamW
- Learning Rate: 2e-5
- Epochs: 3
- Batch Size: 16 (train), 32 (eval)
- Hardware: Trained on a single GPU (Colab T4 / equivalent)
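The original training script is not included in this card. The following is a hedged sketch of an equivalent run with the `Trainer` API using the hyperparameters listed above; the in-memory dataset, its `text`/`label` column names, and the example label values are illustrative assumptions, not the actual training data.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_id = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForSequenceClassification.from_pretrained(base_id, num_labels=4)

# Tiny in-memory stand-in for the real code-mixed dataset (labels 0-3 are illustrative).
train_ds = Dataset.from_dict({
    "text": ["Aaj match ta khub bhalo chilo! Loved it.", "Ei service ta ekdom bhalo na."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bengali-code-mix-sentiment",
    learning_rate=2e-5,               # AdamW is the Trainer default optimizer
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```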
## Evaluation

### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.80 | 0.73 | 0.77 | 528 |
| 1 | 0.73 | 0.73 | 0.73 | 617 |
| 2 | 0.69 | 0.76 | 0.72 | 675 |
| 3 | 0.67 | 0.57 | 0.62 | 182 |
### Overall Metrics
- Accuracy: 0.73
- Macro Avg: Precision = 0.72, Recall = 0.70, F1 = 0.71
- Weighted Avg: Precision = 0.73, Recall = 0.73, F1 = 0.73
- Total Samples: 2002
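Per-class metrics like those above can be recomputed from model predictions on a held-out split with scikit-learn. This is a minimal sketch; the two placeholder texts and their labels are assumptions standing in for the real evaluation data.

```python
import torch
from sklearn.metrics import classification_report
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Placeholder evaluation data; replace with the actual held-out split.
texts = ["Aaj match ta khub bhalo chilo! Loved it.", "Ei service ta ekdom bhalo na."]
true_labels = [1, 0]  # illustrative labels in the 0-3 scheme

with torch.no_grad():
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
    preds = model(**inputs).logits.argmax(dim=-1).tolist()

# Prints per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(true_labels, preds, digits=2))
```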
## Applications
- Sentiment classification of Bengali-English social media text
- Research in code-mixed NLP for Indic languages
- Benchmark for parameter-efficient fine-tuning (compare with LoRA model)
## Limitations
- Non-standard romanization or heavy slang in Bengali may reduce accuracy
- Trained primarily on short-form text (tweets, comments, reviews)
- Not designed for abusive/toxic content moderation or safety-critical use cases
## Ethical Considerations
- Data reflects natural biases from social media sources
- Misclassifications may occur in sarcasm or offensive text
- Should not be the sole basis for critical decision-making
## Citation
If you use this model, please cite:
```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author = {Swarnadeep Das},
  title  = {Bengali-English Code-Mixed Sentiment Model},
  year   = {2025},
  url    = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```
## Acknowledgements
- Dataset: Bengali-English Code-Mixed Sentiment Dataset
- Base model: xlm-roberta-base
- Frameworks: Transformers, Datasets