XLM-RoBERTa Toxicity Detection (Multilingual)
Overview
This model detects toxic comments across 13 languages using a single, unified XLM-RoBERTa model.
Supported Languages
German, Spanish, French, Italian, Russian, Chinese, Japanese, Arabic, Hebrew, Amharic, Tatar, Ukrainian, Hindi
Model Details
- Architecture: XLM-RoBERTa-Large (~550M parameters)
- Task: Binary toxicity classification
- Training Data: TextDetox Multilingual Toxicity Dataset
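Note that MultiTaskXLMR, imported in the usage example below, is a custom wrapper defined in the repo's model_utils.py, not a stock transformers class. The following is a minimal sketch of what such a wrapper might look like, inferred only from the call signature in the usage code (forward takes input_ids and attention_mask and returns toxicity logits plus auxiliary intent logits, with num_intents=5); the repo's own model_utils.py is the source of truth.

import torch.nn as nn
from transformers import AutoModel

class MultiTaskXLMR(nn.Module):
    """Hypothetical reconstruction: shared XLM-R encoder with a binary
    toxicity head and an auxiliary intent-classification head."""
    def __init__(self, model_name="xlm-roberta-large", num_intents=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size  # 1024 for xlm-roberta-large
        self.toxicity_head = nn.Linear(hidden, 1)          # single logit -> sigmoid
        self.intent_head = nn.Linear(hidden, num_intents)  # auxiliary task

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # <s> (CLS) token representation
        return self.toxicity_head(cls), self.intent_head(cls)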
Per-Language Performance (F1-Score)
- French: 0.9931
- Russian: 0.9877
- Hindi: 0.9403
- Ukrainian: 0.9401
- Japanese: 0.8917
- Spanish: 0.8846
- Italian: 0.8834
- German: 0.8823
- Tatar: 0.8681
- Arabic: 0.8310
- Amharic: 0.8267
- Chinese: 0.8150
- Hebrew: 0.7829
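Each score above is the binary F1 on the toxic class, the harmonic mean of precision and recall. A minimal sketch of how such per-language scores are computed with scikit-learn, using dummy labels rather than the real evaluation data:

from sklearn.metrics import f1_score

# Toy gold labels and model predictions for one language (1 = toxic)
gold  = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 0]
# Precision = 2/2 = 1.0, recall = 2/3, F1 = 2*(1.0 * 2/3)/(1.0 + 2/3) = 0.8
print(f"F1 = {f1_score(gold, preds):.4f}")  # F1 = 0.8000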
Usage
import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from model_utils import MultiTaskXLMR

# Download the checkpoint and load it into the multi-task wrapper
model_path = hf_hub_download("{repo_id_multilingual}", "pytorch_model.bin")
model = MultiTaskXLMR(model_name="xlm-roberta-large", num_intents=5)
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()

# Inference in any supported language
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
text = "Tu es un idiot!"  # French: "You are an idiot!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    tox_logits, _ = model(inputs["input_ids"], inputs["attention_mask"])
tox_prob = torch.sigmoid(tox_logits).item()
print(f"Toxicity: {tox_prob:.4f}")
License
MIT