Bhojpuri Sentiment Analysis Model

Author: Abhimanyu Prasad | @abhiprd20

A fine-tuned XLM-RoBERTa model for 3-class sentiment analysis of Bhojpuri text in Devanagari script. This is the first publicly available sentiment model for the Bhojpuri language.


Model Description

This model is part of a cross-lingual transfer study comparing sentiment analysis across English, Hindi, Maithili, and Bhojpuri — four languages spanning high-resource to extremely low-resource.

Base model: cardiffnlp/twitter-xlm-roberta-base-sentiment

Task: 3-class sentiment classification — Positive, Negative, Neutral

Language: Bhojpuri (भोजपुरी) — Devanagari script

Training data: 18,049 unique Bhojpuri sentences (balanced across 3 classes)


Performance

| Model | Accuracy | F1 (Macro) |
|---|---|---|
| English BERT (zero-shot) | 33.13% | 0.1659 |
| XLM-RoBERTa (zero-shot) | 76.45% | 0.7630 |
| mBERT (fine-tuned) | 94.81% | 0.9481 |
| XLM-RoBERTa (fine-tuned) ← this model | 97.60% | 0.9761 |
| This model, out-of-distribution (30 new sentences) | 70.00% | 0.6777 |

Evaluated on a fixed balanced test set of 501 sentences (167 per class).


Cross-Lingual Findings

The zero-shot results reveal a clear pattern: English BERT fails on all three Indic languages at nearly identical rates (~33%), while multilingual models recover significantly, with Bhojpuri showing the strongest zero-shot transfer (76.45%) — likely due to its closer lexical proximity to Hindi compared to Maithili.

| Language | English Zero-Shot | XLM Zero-Shot | Fine-tuned |
|---|---|---|---|
| Maithili | 33.33% | 69.86% | 85.63% |
| Bhojpuri | 33.13% | 76.45% | 97.60% |

Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="abhiprd20/bhojpuri-sentiment-model"
)

# Example Bhojpuri sentences (English glosses in comments)
texts = [
    "ई खाना बहुत स्वादिष्ट बा।",   # "This food is very tasty." — positive
    "आज बहुत थकान लागत बा।",      # "I feel very tired today." — negative
    "हम कल पटना जाइब।",           # "I will go to Patna tomorrow." — neutral
]

for text in texts:
    result = classifier(text)[0]
    print(text)
    print(f"  → {result['label']} ({result['score']*100:.1f}%)\n")
```

Output:

```
ई खाना बहुत स्वादिष्ट बा।
  → positive (97.2%)

आज बहुत थकान लागत बा।
  → negative (95.8%)

हम कल पटना जाइब।
  → neutral (91.4%)
```

Labels

| Label | Integer | Meaning |
|---|---|---|
| negative | 0 | Negative sentiment |
| neutral | 1 | Neutral / factual |
| positive | 2 | Positive sentiment |
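
The integer IDs above correspond, index for index, to the model's output logits. As a minimal sketch (pure Python, no model download; the logit values are hypothetical), this is how raw logits map to a label and confidence score:

```python
import math

# Label mapping from the table above
id2label = {0: "negative", 1: "neutral", 2: "positive"}

def decode(logits):
    """Numerically stable softmax over raw logits, then argmax."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]

# Hypothetical logits for a clearly positive sentence
label, score = decode([-1.2, 0.3, 4.1])
print(f"{label} ({score*100:.1f}%)")  # → positive (97.3%)
```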

Training Details

| Parameter | Value |
|---|---|
| Base model | cardiffnlp/twitter-xlm-roberta-base-sentiment |
| Epochs | 3 |
| Batch size | 16 |
| Max sequence length | 128 |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Mixed precision | fp16 |
| Best model metric | F1 (macro) |
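
For reference, a sketch of how the table above would map onto `transformers.TrainingArguments`. The learning rate is not reported in the card, so the value below is an assumption (a common default for XLM-R fine-tuning); `output_dir` is a placeholder, and the max sequence length of 128 is applied at tokenization time rather than here:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bhojpuri-sentiment",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_steps=200,
    weight_decay=0.01,
    fp16=True,                         # mixed precision
    learning_rate=2e-5,                # assumed; not stated in the card
    eval_strategy="epoch",             # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    metric_for_best_model="f1_macro",  # best checkpoint selected by macro F1
    load_best_model_at_end=True,
)
```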

Dataset

Training data: 18,049 unique Bhojpuri sentences in Devanagari script with balanced 3-class sentiment labels. Note: the dataset contains content translated from English, which the author acknowledges as a limitation.

Test set: a fixed balanced set of 501 sentences (167 per class), held out before training; the absence of train–test leakage was verified.
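
A leakage check of the kind described can be done by intersecting the exact sentence strings of the two splits. A minimal sketch (the sentence lists are toy stand-ins for the real splits):

```python
def assert_no_leakage(train_sentences, test_sentences):
    """Fail if any test sentence also appears verbatim in the training split."""
    overlap = set(train_sentences) & set(test_sentences)
    assert not overlap, f"{len(overlap)} test sentences leak into train"

# Toy stand-ins for the 18,049 training and 501 test sentences
train = ["ई खाना बहुत स्वादिष्ट बा।", "आज बहुत थकान लागत बा।"]
test = ["हम कल पटना जाइब।"]
assert_no_leakage(train, test)
print("no leakage")  # reached only if the splits are disjoint
```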



Citation

If you use this model, please cite:

```bibtex
@misc{prasad2026bhojpuri,
  author    = {Abhimanyu Prasad},
  title     = {Bhojpuri Sentiment Analysis: Cross-Lingual Transfer Study},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/abhiprd20/bhojpuri-sentiment-model}
}
```

Cross-Language Evaluation

Each model was evaluated on all four languages (300 sentences per language, 100 per class), showing how well a model trained on one language transfers to the others.

Accuracy Matrix

| Model | English | Hindi | Maithili | Bhojpuri |
|---|---|---|---|---|
| English model | 79.5% | 34.0% | 33.3% | 33.0% |
| Hindi model | 60.0% | 68.0% | 63.3% | 61.7% |
| Maithili model | 63.0% | 59.0% | 90.3% | 75.0% |
| Bhojpuri model (this model) | 59.0% | 47.3% | 47.3% | 98.0% |

F1 Matrix (macro)

| Model | English | Hindi | Maithili | Bhojpuri |
|---|---|---|---|---|
| English model | 0.5424 | 0.1912 | 0.1667 | 0.1654 |
| Hindi model | 0.4362 | 0.6778 | 0.6319 | 0.6042 |
| Maithili model | 0.4443 | 0.5757 | 0.9035 | 0.7458 |
| Bhojpuri model (this model) | 0.4250 | 0.4166 | 0.4114 | 0.9801 |
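
The two matrices report standard accuracy and macro-averaged F1. A minimal plain-Python sketch of the scoring step, with toy gold/predicted labels standing in for real model outputs:

```python
LABELS = ("negative", "neutral", "positive")

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_macro(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 — the F1 (macro) reported above."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: 6 sentences, one negative→neutral confusion
gold = ["negative", "neutral", "positive", "negative", "neutral", "positive"]
pred = ["negative", "neutral", "positive", "neutral", "neutral", "positive"]
print(f"acc={accuracy(gold, pred):.4f}  f1={f1_macro(gold, pred):.4f}")
```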

Key Findings

  • Excellent in-language performance (98%) but weak cross-lingual transfer.
  • Bhojpuri → Maithili transfer is only 47.3%, worse than the reverse direction (Maithili → Bhojpuri: 75%).
  • Asymmetric transfer between Maithili and Bhojpuri is a key finding of this research — despite linguistic similarity, transfer is not bidirectional.

Full paper: This cross-evaluation is part of a research study on cross-lingual transfer for low-resource Bihari languages. See the companion datasets and models: Maithili | Bhojpuri | Hindi | English

Model size: 0.3B parameters (F32 tensors, Safetensors format)