# Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language

## Overview
This repository contains the implementation and experiments for benchmarking various BERT-based transformer models on sentence-level topic classification in Nepali, a low-resource language.
We evaluate multilingual, Indic, Hindi, and Nepali-specific models to understand their effectiveness in capturing linguistic nuances of Nepali text.
## Objectives
- Benchmark multiple BERT-based models on Nepali text classification
- Analyze performance differences across multilingual, Indic, and monolingual models
- Establish a strong baseline for future Nepali NLP tasks
- Provide insights into low-resource language modeling
## Dataset
The dataset consists of 25,006 Nepali sentences categorized into five domains:
- Agriculture
- Health
- Education & Technology
- Culture & Tourism
- General Communication
The dataset is balanced across all categories.
Dataset Link: https://huggingface.co/datasets/ilprl-docse/NepSen-Nepali-Categorical-Sentences-Corpus
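For five-way sentence classification, the categories above are typically mapped to integer labels before fine-tuning. A minimal sketch of that mapping (the English category names are taken from the list above; the ordering is an assumption, so check the dataset card for the canonical label ids):

```python
# Map the five NepSen topic categories to integer ids for classification.
# NOTE: the ordering below is illustrative; the dataset card defines the
# authoritative label ids.
LABELS = [
    "Agriculture",
    "Health",
    "Education & Technology",
    "Culture & Tourism",
    "General Communication",
]

# Forward and reverse mappings, as expected by most model configs.
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}
```

These two dictionaries can be passed to a model configuration so that predictions decode back to human-readable category names.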
## Models Evaluated
We benchmarked the following transformer-based models:
### Multilingual Models
- mBERT
- XLM-RoBERTa
- mDeBERTa
### Indic Models
- MuRIL (base & large)
- IndicBERT
- DevBERT
### Language-Specific Models
- HindiBERT
- NepBERTa
### English Model
- RoBERTa
Model Links: https://hf.co/collections/ilprl-docse/benchmarking-bert-based-models-for-topic-classification
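Comparing these models on a balanced five-class dataset usually comes down to held-out accuracy and macro-averaged F1. A minimal pure-Python sketch of macro-F1 (the metric definition is standard; the example predictions are illustrative only):

```python
def macro_f1(y_true, y_pred, num_classes):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight.

    Suitable for a balanced multi-class setup like the five NepSen
    categories, where every class should count equally.
    """
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

In practice the same number is available via `sklearn.metrics.f1_score(..., average="macro")`; the explicit loop above just makes the per-class computation visible.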
## Citation
Paper Link: https://arxiv.org/abs/2602.23940
If you use this work, please cite:
```bibtex
@inproceedings{karki2026benchmarking,
  title={Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language},
  author={Karki, Nischal and Subedi, Bipesh and Poudyal, Prakash and Ghimire, Rupak Raj and Bal, Bal Krishna},
  booktitle={Proceedings of the Regional International Conference on Natural Language Processing (RegICON 2025)},
  year={2026},
  address={Guwahati, India},
  note={Gauhati University, November 27--29, 2025},
  url={https://arxiv.org/abs/2602.23940}
}
```