---
library_name: transformers
license: apache-2.0
base_model: xlm-roberta-base
tags:
- generated_from_trainer
model-index:
- name: bengali-code-mix-sentiment
  results: []
datasets:
- Swarnadeep-28/bn_code_mix_sentiment_dataset
metrics:
- accuracy
- f1
- precision
- recall
---

# Bengali-English Code-Mixed Sentiment Model

## Model Summary

This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets).

- **Task**: Text Classification (Sentiment Analysis)
- **Languages**: Bengali (Romanized) + English
- **Classes**: `0`, `1`, `2`, `3`
- **Fine-tuning method**: Full fine-tuning
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)

This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research.

---

## How to Use

### Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Swarnadeep-28/bengali-code-mix-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Aaj match ta khub bhalo chilo! Loved it."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()
labels = ["0", "1", "2", "3"]
print("Predicted label:", labels[pred])
```

---

## Training Details

- **Base model**: `xlm-roberta-base`
- **Method**: Full fine-tuning (all parameters updated)
- **Optimizer**: AdamW
- **Learning rate**: 2e-5
- **Epochs**: 3
- **Batch size**: 16 (train), 32 (eval)
- **Hardware**: single GPU (Colab T4 or equivalent)

---

## Evaluation

### Classification Report

| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.80      | 0.73   | 0.77     | 528     |
| 1     | 0.73      | 0.73   | 0.73     | 617     |
| 2     | 0.69      | 0.76   | 0.72     | 675     |
| 3     | 0.67      | 0.57   | 0.62     | 182     |

### Overall Metrics

- **Accuracy**: 0.73
- **Macro avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71
- **Weighted avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73
- **Total samples**: 2002

---

## Applications

- Sentiment classification of Bengali-English social media text
- Research in **code-mixed NLP for Indic languages**
- Benchmark for parameter-efficient fine-tuning (compare with the LoRA model)

---

## Limitations

- Heavily Romanized or slang-heavy Bengali may reduce accuracy
- Trained primarily on short-form text (tweets, comments, reviews)
- Not designed for abusive/toxic content moderation or safety-critical use cases

---

## Ethical Considerations

- The data reflects natural biases present in its social media sources
- Misclassifications are more likely on sarcastic or offensive text
- The model should not be the sole basis for critical decision-making

---

## Citation

If you use this model, please cite:

```bibtex
@misc{das2025_bn_code_mix_sentiment,
  author = {Swarnadeep Das},
  title  = {Bengali-English Code-Mixed Sentiment Model},
  year   = {2025},
  url    = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment}
}
```

---

## Acknowledgements

- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset)
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets)
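
---

## Appendix: From Logits to Probabilities

The inference example above takes an argmax over raw logits. If you also want class probabilities (e.g., to filter out low-confidence predictions), apply a softmax first. A minimal stdlib-only sketch of that post-processing step, using made-up logit values rather than real model output:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the four classes 0..3 (not real model output).
logits = [-1.2, 0.3, 2.1, -0.5]
probs = softmax(logits)

pred = max(range(len(probs)), key=probs.__getitem__)
print("Predicted class:", pred)  # prints 2, the class with the highest logit
```

In the torch-based snippet above, the equivalent one-liner is `torch.softmax(logits, dim=-1)`.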
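
## Appendix: Recomputing the Overall Metrics

As a sanity check, the macro and weighted averages reported under Overall Metrics can be recomputed directly from the per-class rows of the classification report. The script below uses only the numbers from the table:

```python
# Per-class (precision, recall, f1, support) rows from the classification report.
report = {
    0: (0.80, 0.73, 0.77, 528),
    1: (0.73, 0.73, 0.73, 617),
    2: (0.69, 0.76, 0.72, 675),
    3: (0.67, 0.57, 0.62, 182),
}

total = sum(s for *_, s in report.values())

# Macro average: unweighted mean over the four classes.
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)

# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(f1 * s for _, _, f1, s in report.values()) / total

print(f"Total samples: {total}")          # 2002
print(f"Macro F1:      {macro_f1:.2f}")   # 0.71
print(f"Weighted F1:   {weighted_f1:.2f}")  # 0.73
```

Both recomputed values match the reported Macro F1 (0.71) and Weighted F1 (0.73), and the supports sum to the reported 2002 samples.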