Multilingual Toxicity Classifier (XLM-RoBERTa-base V3)

Binary toxicity classifier for Turkish, Arabic, and English text, built on XLM-RoBERTa-base.

Model Details

| Property | Value |
|----------|-------|
| Base Model | xlm-roberta-base (280M params) |
| Task | Binary text classification (toxic / not-toxic) |
| Languages | Turkish, Arabic, English |
| Training Data | 105K balanced samples |
| Training | Focal loss, bf16, 15 epochs |

Performance

| Metric | Score |
|--------|-------|
| F1 | 91.3% |
| Accuracy | 91.2% |
| Stress Test | 97.7% (260/266) |
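The F1 and accuracy figures above can be reproduced from model predictions with scikit-learn. A minimal sketch — the label arrays here are placeholders, not the actual evaluation set, and binary F1 is assumed:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder arrays standing in for the real test labels and predictions
# (0 = not-toxic, 1 = toxic).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # → 0.667
print(f"F1:       {f1_score(y_true, y_pred):.3f}")        # → 0.667
```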

Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "gorkem371/toxicity-classifier-xlmr-base-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "You are a wonderful person!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()

labels = {0: "not-toxic", 1: "toxic"}
print(f"Text: {text}")
print(f"Prediction: {labels[prediction]} (confidence: {probs[0][prediction]:.3f})")
```
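For batch inference the same softmax/argmax step applies row-wise. A minimal sketch of the post-processing — the logits below are made up for illustration; in practice they would come from `model(**inputs).logits` on a padded batch:

```python
import torch

labels = {0: "not-toxic", 1: "toxic"}

# Stand-in for model(**inputs).logits on a batch of 3 texts.
logits = torch.tensor([[2.1, -1.3],
                       [-0.4, 0.9],
                       [1.5, 1.6]])

probs = torch.softmax(logits, dim=-1)  # (batch, 2) class probabilities
preds = torch.argmax(probs, dim=-1)    # per-row predicted class index
for p, conf in zip(preds.tolist(), probs.max(dim=-1).values.tolist()):
    print(f"{labels[p]} (confidence: {conf:.3f})")
```

When batching real text, pass `padding=True` to the tokenizer so all sequences share one tensor shape.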

Training Details

  • Architecture: XLM-RoBERTa-base with a classification head (2 labels)
  • Loss Function: Focal loss (gamma=2) with inverse class weights
  • Precision: bf16 (bfloat16)
  • Epochs: 15
  • Dataset: 105K balanced samples across Turkish, Arabic, and English
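The card does not spell out the exact focal loss recipe. A common formulation matching the settings above (gamma=2, per-class weights) can be sketched in PyTorch as follows — this is an assumption about the recipe, not the verified training code, and the class weights shown are hypothetical:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, weight=None):
    """Focal loss: scales cross-entropy by (1 - p_t)^gamma so easy,
    confidently classified samples contribute less to the gradient."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-sample cross-entropy, optionally class-weighted (e.g. inverse frequency).
    ce = F.nll_loss(log_probs, targets, weight=weight, reduction="none")
    # p_t: predicted probability of the true class for each sample.
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - pt) ** gamma * ce).mean()

# Dummy batch: 4 samples, 2 classes (not-toxic / toxic).
logits = torch.tensor([[2.0, -1.0], [0.1, 0.2], [-1.5, 1.5], [0.0, 0.0]])
targets = torch.tensor([0, 1, 1, 0])
class_weights = torch.tensor([1.0, 1.2])  # hypothetical inverse class weights
print(focal_loss(logits, targets, gamma=2.0, weight=class_weights))
```

With `gamma=0` and no weights this reduces to plain cross-entropy; raising gamma increasingly down-weights examples the model already classifies confidently.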

License

Apache 2.0

