
Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language

Overview

This repository contains the implementation and experiments for benchmarking various BERT-based transformer models on sentence-level topic classification in Nepali, a low-resource language.

We evaluate multilingual, Indic, Hindi-, and Nepali-specific models to assess how well they capture the linguistic nuances of Nepali text.


Objectives

  • Benchmark multiple BERT-based models on Nepali text classification
  • Analyze performance differences across multilingual, Indic, and monolingual models
  • Establish a strong baseline for future Nepali NLP tasks
  • Provide insights into low-resource language modeling

Dataset

The dataset consists of 25,006 Nepali sentences categorized into five domains:

  • 🌾 Agriculture
  • πŸ₯ Health
  • πŸŽ“ Education & Technology
  • πŸ”οΈ Culture & Tourism
  • πŸ’¬ General Communication

The dataset is balanced across all categories.

🔗 Dataset Link: https://huggingface.co/datasets/ilprl-docse/NepSen-Nepali-Categorical-Sentences-Corpus
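Since this page does not state how the corpus was partitioned for the experiments, the sketch below shows one common way to build a balanced 80/10/10 split for a five-domain dataset like this one. The function name, toy data, and split ratios are illustrative assumptions, not part of the released corpus or the paper's protocol:

```python
import random
from collections import defaultdict

# The five domains from the dataset card.
LABELS = ["Agriculture", "Health", "Education & Technology",
          "Culture & Tourism", "General Communication"]

def stratified_split(examples, train=0.8, dev=0.1, seed=42):
    """Split (sentence, label) pairs per class so every split stays balanced."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = int(len(items) * train)
        n_dev = int(len(items) * dev)
        splits["train"] += items[:n_train]
        splits["dev"] += items[n_train:n_train + n_dev]
        splits["test"] += items[n_train + n_dev:]
    return splits

# Illustrative toy data: 100 placeholder sentences per domain.
data = [(f"sentence {i}", lab) for lab in LABELS for i in range(100)]
parts = stratified_split(data)
print({k: len(v) for k, v in parts.items()})
# → {'train': 400, 'dev': 50, 'test': 50}
```

Splitting per class (rather than over the shuffled pool) guarantees each split keeps the same class balance as the full corpus, which matters when comparing models by accuracy.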


Models Evaluated

We benchmarked the following transformer-based models:

Multilingual Models

  • mBERT
  • XLM-RoBERTa
  • mDeBERTa

Indic Models

  • MuRIL (base & large)
  • IndicBERT
  • DevBERT

Language-Specific Models

  • HindiBERT
  • NepBERTa

English Model

  • RoBERTa

🔗 Model Links: https://hf.co/collections/ilprl-docse/benchmarking-bert-based-models-for-topic-classification

Visit: https://github.com/ilprl/Benchmarking-BERT-based-Models-for-Sentence-level-Topic-Classification-in-Nepali-Language
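Each fine-tuned model ends in a five-way classification head that emits one logit per domain. The sketch below shows how such logits decode to a predicted label and probability; the label order here is an assumption for illustration (the authoritative mapping lives in each released model's `id2label` config):

```python
import math

# Assumed label order; check id2label in the actual model config before relying on it.
LABELS = ["Agriculture", "Health", "Education & Technology",
          "Culture & Tourism", "General Communication"]

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Example: logits strongly favoring the second class.
label, prob = decode([0.1, 3.2, -1.0, 0.4, 0.2])
print(label, round(prob, 3))
# → Health 0.854
```

In practice you would feed a Nepali sentence through the model's tokenizer and encoder to obtain the logits; this decoding step is the same for every model in the benchmark.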


Citation

Paper Link: https://arxiv.org/abs/2602.23940

If you use this work, please cite:

@inproceedings{karki2026benchmarking,
  title={Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language},
  author={Karki, Nischal and Subedi, Bipesh and Poudyal, Prakash and Ghimire, Rupak Raj and Bal, Bal Krishna},
  booktitle={Proceedings of the Regional International Conference on Natural Language Processing (RegICON 2025)},
  year={2026},
  address={Guwahati, India},
  note={Gauhati University, November 27--29, 2025},
  url={https://arxiv.org/abs/2602.23940}
}