BanglaBERT Toxicity Classifier

BanglaBERT Toxicity Classifier is an ELECTRA-based model for judging the toxicity level of a given Bengali text string. It was trained on the Polygl0t/bengali-toxicity-qwen-annotations dataset.

Details

For training, we added a classification head with a single regression output to csebuetnlp/banglabert_generator. Only the classification head was trained, i.e., the rest of the model was frozen.

  • Dataset: Polygl0t/bengali-toxicity-qwen-annotations
  • Language: Bengali
  • Number of Training Epochs: 20
  • Batch size: 256
  • Optimizer: torch.optim.AdamW
  • Learning Rate: 3e-4 (linear decay with no warmup steps)
  • Eval Metric: macro F1 score

This repository contains the source code used to train this model.
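The head-only recipe above can be sketched in plain PyTorch. To keep the sketch self-contained and runnable, TinyEncoder below is a hypothetical stand-in for the BanglaBERT backbone; the actual run loads csebuetnlp/banglabert_generator via transformers and freezes its weights in the same way.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the BanglaBERT encoder, so the sketch runs
# without downloading weights; the real script loads
# csebuetnlp/banglabert_generator through transformers.
class TinyEncoder(nn.Module):
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)  # mean-pooled sentence vector

encoder = TinyEncoder()
head = nn.Linear(32, 1)  # classification head with a single regression output

# Freeze the backbone: only the head receives gradient updates.
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

# One dummy training step on random token ids and toxicity targets in [0, 4].
input_ids = torch.randint(0, 100, (4, 8))
targets = torch.tensor([0.0, 1.0, 2.0, 3.0])

pred = head(encoder(input_ids)).squeeze(-1)
loss = loss_fn(pred, targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The linear learning-rate decay with no warmup from the list above would sit on top of the optimizer as a scheduler (e.g., transformers' get_linear_schedule_with_warmup with zero warmup steps).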

Evaluation Results

Confusion Matrix

        1      2      3      4      5
1   12161   1035    160     12      0
2     912   1441    514     33      1
3     196    743   1436    296      3
4      18     76    323    379     33
5       3      1     15     77    132
  • Precision: 0.6437
  • Recall: 0.5963
  • F1 Macro: 0.6132
  • Accuracy: 0.7768
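The macro metrics can be recomputed directly from the confusion matrix. This sketch assumes the usual layout (rows = true scores, columns = predicted scores); small deviations from the reported numbers may stem from rounding or evaluation details.

```python
import numpy as np

# Confusion matrix from above; assumed layout: rows = true score 1-5,
# columns = predicted score 1-5.
cm = np.array([
    [12161, 1035,  160,  12,   0],
    [  912, 1441,  514,  33,   1],
    [  196,  743, 1436, 296,   3],
    [   18,   76,  323, 379,  33],
    [    3,    1,   15,  77, 132],
], dtype=float)

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # per-class precision
recall = tp / cm.sum(axis=1)     # per-class recall
f1 = 2 * precision * recall / (precision + recall)

print("macro precision:", round(precision.mean(), 4))
print("macro recall:   ", round(recall.mean(), 4))
print("macro F1:       ", round(f1.mean(), 4))
print("accuracy:       ", round(tp.sum() / cm.sum(), 4))
```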

Usage

Here's an example of how to use this classifier with the transformers library:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("Polygl0t/bengali-banglabert-toxicity-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Polygl0t/bengali-banglabert-toxicity-classifier")
model.to(device)

text = "এটি একটি নমুনা গ্রন্থ।"
encoded_input = tokenizer(text, return_tensors="pt", padding="longest", truncation=True).to(device)

with torch.no_grad():
    model_output = model(**encoded_input)
    logits = model_output.logits.squeeze(-1).float().cpu().numpy()

# Scores are produced in the range [0, 4]; add 1 to map them to [1, 5].
float_score = logits.tolist()[0] + 1

print({
    "text": text,
    "score": float_score,
    "int_score": int(round(max(0.0, min(float(logits[0]), 4.0)))) + 1,
})
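The clip-and-round mapping used for the integer score can be isolated into a small helper. to_int_score is a hypothetical name for illustration, not part of the released code:

```python
def to_int_score(logit: float) -> int:
    """Map a raw regression output to an integer toxicity score in [1, 5]."""
    clipped = max(0.0, min(float(logit), 4.0))  # clamp to the model's [0, 4] range
    return int(round(clipped)) + 1

print(to_int_score(-0.3))  # out-of-range output clipped to 0 -> score 1
print(to_int_score(3.6))   # rounds to 4 -> score 5
```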

Cite as 🤗

@misc{fatimah2026liltii,
  title={{LilTii: A 0.6B Bengali Language Model that Outperforms Qwen}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  howpublished={\url{https://hf.co/blog/Polygl0t/liltii}}
}

Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the Marvin cluster hosted by the University of Bonn, along with the support provided by its High Performance Computing & Analytics Lab.

License

According to the paper tied to BanglaBERT, all models are released under a non-commercial license (the license of BanglaBERT itself is not stated explicitly). We therefore urge users to use this model for non-commercial purposes only. For any queries, please contact the authors of the original BanglaBERT paper.
