# Bhojpuri Sentiment Analysis Model
Author: Abhimanyu Prasad | @abhiprd20
Fine-tuned XLM-RoBERTa model for 3-class sentiment analysis on Bhojpuri text in Devanagari script. This is the first publicly available sentiment model for the Bhojpuri language.
## Model Description
This model is part of a cross-lingual transfer study comparing sentiment analysis across English, Hindi, Maithili, and Bhojpuri — four languages spanning high-resource to extremely low-resource.
- **Base model:** `cardiffnlp/twitter-xlm-roberta-base-sentiment`
- **Task:** 3-class sentiment classification (Positive, Negative, Neutral)
- **Language:** Bhojpuri (भोजपुरी), Devanagari script
- **Training data:** 18,049 unique Bhojpuri sentences, balanced across the 3 classes
## Performance
| Model | Accuracy | F1 (Macro) |
|---|---|---|
| English BERT (zero-shot) | 33.13% | 0.1659 |
| XLM-RoBERTa (zero-shot) | 76.45% | 0.7630 |
| mBERT (fine-tuned) | 94.81% | 0.9481 |
| XLM-RoBERTa (fine-tuned) ← this model | 97.60% | 0.9761 |
All rows are evaluated on a fixed balanced test set of 501 sentences (167 per class). On a separate out-of-distribution set of 30 new sentences, the fine-tuned model reaches 70.00% accuracy (macro F1 0.6777).
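As a concrete reference for how these numbers are computed, here is a minimal, dependency-free sketch of accuracy and macro F1. The function names are illustrative; the original evaluation presumably used a library such as scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Per-class F1 averaged over all classes with equal weight (macro)."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

On a balanced test set like the 501-sentence one used here, macro F1 and accuracy tend to track each other closely, which is what the table shows.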
## Cross-Lingual Findings
The zero-shot results reveal a clear pattern: English BERT fails on all three Indic languages at nearly identical rates (~33%), while multilingual models recover significantly, with Bhojpuri showing the strongest zero-shot transfer (76.45%) — likely due to its closer lexical proximity to Hindi compared to Maithili.
| Language | English BERT (zero-shot) | XLM-RoBERTa (zero-shot) | Fine-tuned |
|---|---|---|---|
| Maithili | 33.33% | 69.86% | 85.63% |
| Bhojpuri | 33.13% | 76.45% | 97.60% |
## Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="abhiprd20/bhojpuri-sentiment-model"
)

# Example Bhojpuri sentences
texts = [
    "ई खाना बहुत स्वादिष्ट बा।",  # positive
    "आज बहुत थकान लागत बा।",  # negative
    "हम कल पटना जाइब।",  # neutral
]

for text in texts:
    result = classifier(text)[0]
    print(text)
    print(f"  → {result['label']} ({result['score']*100:.1f}%)\n")
```
Output:

```
ई खाना बहुत स्वादिष्ट बा।
  → positive (97.2%)
आज बहुत थकान लागत बा।
  → negative (95.8%)
हम कल पटना जाइब।
  → neutral (91.4%)
```
## Labels
| Label | Integer | Meaning |
|---|---|---|
| negative | 0 | Negative sentiment |
| neutral | 1 | Neutral / factual |
| positive | 2 | Positive sentiment |
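If you work with raw model outputs instead of the `pipeline`, the logits map to these labels by argmax over a softmax. A minimal pure-Python sketch (`decode` and `ID2LABEL` are illustrative names, but the id-to-label mapping matches the table above):

```python
import math

ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}  # per the label table

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Mirror the pipeline's output format: top label plus its probability."""
    probs = softmax(logits)
    i = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[i], "score": probs[i]}
```

`decode([0.1, 0.2, 3.0])` would report `positive` with a high score, matching what the pipeline prints.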
## Training Details
| Parameter | Value |
|---|---|
| Base model | cardiffnlp/twitter-xlm-roberta-base-sentiment |
| Epochs | 3 |
| Batch size | 16 |
| Max sequence length | 128 |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Mixed precision | fp16 |
| Best model metric | F1 macro |
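For orientation, the table's hyperparameters can be expressed as a `TrainingArguments`-style config. This is a sketch only: the key names mirror the Hugging Face `Trainer` API but it is a plain dict, and the learning rate is not stated in this card, so it is omitted rather than guessed.

```python
import math

# Sketch: values from the table above; key names follow the transformers
# TrainingArguments API, but this is not the author's actual training script.
TRAINING_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "warmup_steps": 200,
    "weight_decay": 0.01,
    "fp16": True,
    "metric_for_best_model": "f1_macro",
}
MAX_SEQ_LENGTH = 128  # applied at tokenization time, not a TrainingArguments key

# With 18,049 training sentences and batch size 16, one epoch is ~1,129
# optimizer steps, so 200 warmup steps cover roughly the first 6% of
# training (assuming no gradient accumulation).
steps_per_epoch = math.ceil(18_049 / 16)
warmup_fraction = 200 / (steps_per_epoch * TRAINING_CONFIG["num_train_epochs"])
```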
## Dataset
**Training data:** 18,049 unique Bhojpuri sentences in Devanagari script with balanced 3-class sentiment labels. Note: the dataset contains content translated from English, which is acknowledged as a limitation.

**Test set:** a fixed balanced set of 501 sentences (167 per class), held out before training; zero overlap with the training set was verified.
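The leakage and balance checks described above can be sketched as a simple set-intersection and label-count test. Function and parameter names here are hypothetical, not the author's actual verification script.

```python
from collections import Counter

def verify_split(train_texts, test_texts, test_labels,
                 num_classes=3, per_class=167):
    """Check the two properties claimed for the test set:
    (1) no test sentence appears in the training data, and
    (2) the test set is exactly balanced across classes."""
    overlap = set(train_texts) & set(test_texts)
    if overlap:
        raise ValueError(f"{len(overlap)} test sentences also appear in training")
    counts = Counter(test_labels)
    if len(counts) != num_classes or any(n != per_class for n in counts.values()):
        raise ValueError(f"test set is not balanced: {dict(counts)}")
    return True
```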
## Related Models
- `abhiprd20/nlp-sentiment-model`: English baseline
- `abhiprd20/maithili-sentiment-model`: Maithili
- `abhiprd20/hindi-sentiment-model`: Hindi
## Citation
If you use this model, please cite:
```bibtex
@misc{prasad2026bhojpuri,
  author    = {Abhimanyu Prasad},
  title     = {Bhojpuri Sentiment Analysis: Cross-Lingual Transfer Study},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/abhiprd20/bhojpuri-sentiment-model}
}
```
## 📊 Cross-Language Evaluation
Each model was evaluated on all 4 languages (300 sentences per language, 100 per class). This shows how well models trained on one language transfer to others.
### Accuracy Matrix
| Model | English | Hindi | Maithili | Bhojpuri |
|---|---|---|---|---|
| English model | 79.5% ✓ | 34.0% | 33.3% | 33.0% |
| Hindi model | 60.0% | 68.0% ✓ | 63.3% | 61.7% |
| Maithili model | 63.0% | 59.0% | 90.3% ✓ | 75.0% |
| ⭐ Bhojpuri model (this model) | 59.0% | 47.3% | 47.3% | 98.0% ✓ |
### F1 Matrix (Macro)
| Model | English | Hindi | Maithili | Bhojpuri |
|---|---|---|---|---|
| English model | 0.5424 ✓ | 0.1912 | 0.1667 | 0.1654 |
| Hindi model | 0.4362 | 0.6778 ✓ | 0.6319 | 0.6042 |
| Maithili model | 0.4443 | 0.5757 | 0.9035 ✓ | 0.7458 |
| ⭐ Bhojpuri model (this model) | 0.4250 | 0.4166 | 0.4114 | 0.9801 ✓ |
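The procedure that produces these matrices can be sketched generically: score every model on every language's labeled set. `cross_eval` below is an illustrative implementation, not the study's actual code; it takes any mapping of model names to predict functions.

```python
def cross_eval(models, datasets):
    """Build an accuracy matrix.

    models:   {model_name: fn(text) -> predicted_label}
    datasets: {language: [(text, gold_label), ...]}  (e.g. 300 per language)
    Returns   {model_name: {language: accuracy}}.
    """
    matrix = {}
    for mname, predict in models.items():
        matrix[mname] = {}
        for lang, data in datasets.items():
            correct = sum(predict(text) == gold for text, gold in data)
            matrix[mname][lang] = correct / len(data)
    return matrix
```

The diagonal of the returned matrix is each model's in-language score (the ✓ cells above); off-diagonal cells measure transfer.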
### Key Findings
- Excellent in-language performance (98%) but weak cross-lingual transfer.
- Bhojpuri → Maithili transfer is only 47.3%, worse than the reverse direction (Maithili → Bhojpuri: 75%).
- Asymmetric transfer between Maithili and Bhojpuri is a key finding of this research: despite their linguistic similarity, transfer strength is not symmetric between the two directions.
Full paper: This cross-evaluation is part of a research study on cross-lingual transfer for low-resource Bihari languages. See the companion datasets and models: Maithili | Bhojpuri | Hindi | English