# IndicBERT Multilingual Toxicity Detector

Fine-tuned version of ai4bharat/IndicBERTv2-MLM-only for toxicity detection in multilingual text (English, Hinglish, Hindi, Tamil).
## Model Description

This model classifies text as either toxic or non-toxic. It was fine-tuned on a roughly balanced dataset (53% non-toxic, 47% toxic), with class weights applied during training to offset the residual imbalance.
Languages Supported:
- English
- Hinglish (Hindi-English code-mixed)
- Hindi
- Tamil
## Training Details
- Base Model: ai4bharat/IndicBERTv2-MLM-only (278M parameters)
- Training Data: 569 samples (balanced: 53% non-toxic, 47% toxic)
- Training Split: 80/20 train/validation
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Class Weighting: Applied to handle imbalance
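The class-weighting step above can be sketched as follows. This is a minimal sketch, not the exact training code: the per-class counts below are hypothetical, back-derived from the 53% / 47% split over 569 samples, and the inverse-frequency weighting scheme is an assumption.

```python
import torch
import torch.nn as nn

# Hypothetical label counts [non-toxic, toxic], back-derived from
# the 53% / 47% split over 569 samples
counts = torch.tensor([302.0, 267.0])

# Inverse-frequency weights: total / (num_classes * count_per_class),
# so the rarer (toxic) class gets a weight above 1.0
weights = counts.sum() / (len(counts) * counts)

# Weighted cross-entropy used in place of the default uniform loss
loss_fn = nn.CrossEntropyLoss(weight=weights)

# One dummy training example: logits favour class 0 but the true
# label is 1, so the loss is scaled up by weights[1]
logits = torch.tensor([[2.0, -1.0]])
labels = torch.tensor([1])
loss = loss_fn(logits, labels)
```

With counts this close to balanced the weights stay near 1.0, which is why the card describes the data as balanced while still applying weighting.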
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("indic-toxicity-detector")
tokenizer = AutoTokenizer.from_pretrained("indic-toxicity-detector")
model.eval()

def predict_toxicity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities[0][predicted_class].item()
    label = model.config.id2label[predicted_class]
    return {"label": label, "confidence": confidence}

# Example
result = predict_toxicity("You are amazing!")
print(result)  # e.g. {'label': 'non-toxic', 'confidence': 0.95}
```
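The decoding step inside `predict_toxicity` (softmax over the logits, argmax, then the `id2label` lookup) can be checked in isolation with dummy logits. The `id2label` mapping below is an assumption about the fine-tuned config; the real mapping lives in `model.config.id2label`.

```python
import torch

# Assumed label mapping; the real one is model.config.id2label
id2label = {0: "non-toxic", 1: "toxic"}

# Dummy logits standing in for model(**inputs).logits
logits = torch.tensor([[1.2, -0.8]])

probabilities = torch.softmax(logits, dim=-1)         # each row sums to 1
predicted_class = torch.argmax(probabilities, dim=-1).item()
confidence = probabilities[0][predicted_class].item()
result = {"label": id2label[predicted_class], "confidence": confidence}
```

Because softmax is monotonic, the argmax over probabilities always matches the argmax over the raw logits; the softmax is only needed to report a confidence score.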
## Performance

- Validation Accuracy: see `training_metrics.csv`
- F1 Score: see `training_metrics.csv`
## Limitations

- Trained on a small dataset (569 samples)
- May not generalize well to all types of toxic content
- Performance varies across languages
- Accuracy on code-mixed (Hinglish) text depends on how well such text is represented in the training data
## Citation

```bibtex
@misc{indic-toxicity-detector,
  author    = {Your Name},
  title     = {IndicBERT Multilingual Toxicity Detector},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/indic-toxicity-detector}
}
```
## License

Apache 2.0