Bengali-English Code-Mixed Sentiment Model

Model Summary

This model is a fine-tuned version of xlm-roberta-base for sentiment analysis on Bengali–English code-mixed text (social media posts, comments, and tweets).

This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.


How to Use

Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
# Map the predicted class index to its label name via the model config
print("Predicted label:", model.config.id2label[pred])

Training Details

  • Base model: xlm-roberta-base
  • Method: Full fine-tuning (all parameters updated)
  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Epochs: 3
  • Batch Size: 16 (train), 32 (eval)
  • Hardware: Trained on a single GPU (Colab T4 / equivalent)
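The hyperparameters above map onto a `transformers` `TrainingArguments` configuration roughly as follows. This is a sketch, not the exact training script; `output_dir` is an illustrative placeholder, and any settings not listed above (logging, checkpointing, etc.) are omitted.

```python
from transformers import TrainingArguments

# Hyperparameters taken from the Training Details list above.
# output_dir is a hypothetical placeholder, not the original path.
training_args = TrainingArguments(
    output_dir="bengali-code-mix-sentiment",
    learning_rate=2e-5,                 # AdamW learning rate
    num_train_epochs=3,
    per_device_train_batch_size=16,     # train batch size
    per_device_eval_batch_size=32,      # eval batch size
    optim="adamw_torch",                # AdamW optimizer
)
```

These arguments would then be passed to a `Trainer` along with the model, tokenizer, and dataset splits.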

Evaluation

Classification Report

Label    Precision    Recall    F1-Score    Support
0        0.80         0.73      0.77        528
1        0.73         0.73      0.73        617
2        0.69         0.76      0.72        675
3        0.67         0.57      0.62        182

Overall Metrics

  • Accuracy: 0.73
  • Macro Avg: Precision = 0.72, Recall = 0.70, F1 = 0.71
  • Weighted Avg: Precision = 0.73, Recall = 0.73, F1 = 0.73
  • Total Samples: 2002
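The macro and weighted averages above follow directly from the per-class classification report: the macro average is the unweighted mean over the four classes, while the weighted average weights each class by its support. A quick check, shown here for F1:

```python
# Per-class F1 scores and supports from the classification report above.
f1      = [0.77, 0.73, 0.72, 0.62]
support = [528, 617, 675, 182]

total = sum(support)  # 2002 samples in the evaluation set

# Macro average: unweighted mean over the four classes.
macro_f1 = sum(f1) / len(f1)

# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

print(round(macro_f1, 2))     # 0.71
print(round(weighted_f1, 2))  # 0.73
```

Both values match the reported overall metrics.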

Applications

  • Sentiment classification of Bengali-English social media text
  • Research in code-mixed NLP for Indic languages
  • Benchmark for comparing full fine-tuning against parameter-efficient methods such as LoRA

Limitations

  • Heavily Romanized or slang-heavy Bengali may reduce accuracy
  • Trained primarily on short-form text (tweets, comments, reviews)
  • Not designed for abusive/toxic content moderation or safety-critical use cases

Ethical Considerations

  • Data reflects natural biases from social media sources
  • Misclassifications may occur in sarcasm or offensive text
  • Should not be the sole basis for critical decision-making

Citation

If you use this model, please cite:

@misc{das2025_bn_code_mix_sentiment,
  author    = {Swarnadeep Das},
  title     = {Bengali-English Code-Mixed Sentiment Model},
  year      = {2025},
  url       = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
