# Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language

## Overview
This repository contains the implementation and experiments for benchmarking various BERT-based transformer models on sentence-level topic classification in Nepali, a low-resource language.
We evaluate multilingual, Indic, Hindi, and Nepali-specific models to understand their effectiveness in capturing linguistic nuances of Nepali text.
## Objectives
- Benchmark multiple BERT-based models on Nepali text classification
- Analyze performance differences across multilingual, Indic, and monolingual models
- Establish a strong baseline for future Nepali NLP tasks
- Provide insights into low-resource language modeling
## Dataset
The dataset consists of 25,006 Nepali sentences categorized into five domains:
- Agriculture
- Health
- Education & Technology
- Culture & Tourism
- General Communication
The dataset is balanced across all categories.
Dataset Link: https://huggingface.co/datasets/ilprl-docse/NepSen-Nepali-Categorical-Sentences-Corpus
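For five-way sentence classification, the categories above are typically mapped to integer labels before fine-tuning. A minimal sketch of that mapping (the English category names are taken from the list above; the ordering is an assumption, so check the dataset card for the canonical label ids):

```python
# Map the five NepSen topic categories to integer ids for classification.
# NOTE: the ordering below is illustrative; the dataset card defines the
# authoritative label ids.
LABELS = [
    "Agriculture",
    "Health",
    "Education & Technology",
    "Culture & Tourism",
    "General Communication",
]

# Forward and reverse mappings, as expected by most model configs.
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}
```

These two dictionaries can be passed to a model configuration so that predictions decode back to human-readable category names.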
## Models Evaluated
We benchmarked the following transformer-based models:
### Multilingual Models
- mBERT
- XLM-RoBERTa
- mDeBERTa
### Indic Models
- MuRIL (base & large)
- IndicBERT
- DevBERT
### Language-Specific Models
- HindiBERT
- NepBERTa
### English Model
- RoBERTa
Model Links: https://hf.co/collections/ilprl-docse/benchmarking-bert-based-models-for-topic-classification
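Comparing these models on a balanced five-class dataset usually comes down to held-out accuracy and macro-averaged F1. A minimal pure-Python sketch of macro-F1 (the metric definition is standard; the example predictions are illustrative only):

```python
def macro_f1(y_true, y_pred, num_classes):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight.

    Suitable for a balanced multi-class setup like the five NepSen
    categories, where every class should count equally.
    """
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

In practice the same number is available via `sklearn.metrics.f1_score(..., average="macro")`; the explicit loop above just makes the per-class computation visible.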
## Citation
Paper Link: https://arxiv.org/abs/2602.23940
If you use this work, please cite:
```bibtex
@inproceedings{karki2026benchmarking,
  title={Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language},
  author={Karki, Nischal and Subedi, Bipesh and Poudyal, Prakash and Ghimire, Rupak Raj and Bal, Bal Krishna},
  booktitle={Proceedings of the Regional International Conference on Natural Language Processing (RegICON 2025)},
  year={2026},
  address={Guwahati, India},
  note={Gauhati University, November 27--29, 2025},
  url={https://arxiv.org/abs/2602.23940}
}
```