IndicBERT Multilingual Toxicity Detector

Fine-tuned version of ai4bharat/IndicBERTv2-MLM-only for toxicity detection in multilingual text (English, Hinglish, Hindi, Tamil).

Model Description

This model classifies text as either toxic or non-toxic. It was trained on a roughly balanced dataset (53% non-toxic, 47% toxic), with class weights applied during training to correct the residual imbalance.

Languages Supported:

  • English
  • Hinglish (Hindi-English code-mixed)
  • Hindi
  • Tamil

Training Details

  • Base Model: ai4bharat/IndicBERTv2-MLM-only (278M parameters)
  • Training Data: 569 samples (balanced: 53% non-toxic, 47% toxic)
  • Training Split: 80/20 train/validation
  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Class Weighting: Applied to handle imbalance
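The class weighting above can be sketched as follows. This is a minimal illustration, not the exact training code: the class counts are back-computed from the stated 53%/47% split of 569 samples, and inverse-frequency weighting is one common scheme (the card does not specify which was used).

```python
import torch
from torch import nn

# Illustrative class counts matching the stated split: 53% of 569 ≈ 302
# non-toxic, 47% ≈ 267 toxic
counts = torch.tensor([302.0, 267.0])  # [non-toxic, toxic]

# Inverse-frequency weighting, normalized so the weights average to 1.0:
# the minority class (toxic) gets a weight slightly above 1
weights = counts.sum() / (len(counts) * counts)

# The weights plug directly into the classification loss
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[0.2, -0.1]])  # pretend model output for one sample
labels = torch.tensor([1])            # ground-truth class: toxic
loss = loss_fn(logits, labels)
```

With this scheme, misclassifying the rarer toxic class costs slightly more, which discourages the model from defaulting to the majority label.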

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("indic-toxicity-detector")
tokenizer = AutoTokenizer.from_pretrained("indic-toxicity-detector")
model.eval()  # disable dropout for inference

# Predict
def predict_toxicity(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():  # no gradients needed at inference time
        outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities[0][predicted_class].item()

    label = model.config.id2label[predicted_class]
    return {"label": label, "confidence": confidence}

# Example
result = predict_toxicity("You are amazing!")
print(result)  # {'label': 'non-toxic', 'confidence': 0.95}
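For scoring many texts at once, the same softmax/argmax post-processing generalizes to a batch of logits. The sketch below isolates that step with hand-made logits so it runs standalone; the `id2label` mapping shown is illustrative (the real one comes from `model.config.id2label`), and in practice the logits would come from `model(**tokenizer(texts, padding=True, truncation=True, return_tensors="pt"))`.

```python
import torch

# Illustrative label mapping; the deployed model supplies this via its config
id2label = {0: "non-toxic", 1: "toxic"}

# Pretend logits for a batch of two texts (rows = samples, cols = classes)
logits = torch.tensor([[2.0, -1.0],
                       [-0.5, 1.5]])

probs = torch.softmax(logits, dim=-1)       # per-row probabilities
preds = torch.argmax(probs, dim=-1)         # winning class per row
results = [
    {"label": id2label[p.item()], "confidence": probs[i, p].item()}
    for i, p in enumerate(preds)
]
```

Batching with `padding=True` is noticeably faster than calling `predict_toxicity` in a loop, since the model does one forward pass per batch instead of one per text.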

Performance

  • Validation Accuracy: See training_metrics.csv
  • F1 Score: See training_metrics.csv

Limitations

  • Trained on limited dataset (569 samples)
  • May not generalize well to all types of toxic content
  • Performance varies across languages
  • Code-mixed text performance depends on training data representation

Citation

@misc{indic-toxicity-detector,
  author = {Your Name},
  title = {IndicBERT Multilingual Toxicity Detector},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/indic-toxicity-detector}
}

License

Apache 2.0
