---
language:
- it
- en
license: mit
library_name: transformers
tags:
- text-classification
- safety
- toxicity
- insults
- xlm-roberta
- nlp
base_model: xlm-roberta-base
pipeline_tag: text-classification
---

# XLM-RoBERTa Safety Classifier (Italian & English)

## Model Description

This is an **XLM-RoBERTa-based** binary text classification model fine-tuned to detect **toxicity and insults** in user queries. It is trained on a bilingual dataset (Italian and English) to distinguish between **SAFE** (benign) and **UNSAFE** (toxic/harmful) inputs.

- **Model Type:** XLM-RoBERTa (fine-tuned)
- **Languages:** Italian (`it`), English (`en`)
- **Task:** Binary classification
- **Training Dataset Size:** 9,035 samples
- **Created by:** [Famezz](https://huggingface.co/Famezz)

## Intended Use

This model is designed to act as a **guardrail** for chatbots and LLMs. It can be used to:

1. Filter out toxic user inputs before they reach a Large Language Model (see the sketch below).
2. Flag offensive content in user-generated text.
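
A minimal pre-LLM filter could look like the sketch below. The helper names (`is_safe`, `guarded_reply`) and the `call_llm` placeholder are illustrative, not part of this model or of `transformers`; the label strings and pipeline usage match the Usage section further down.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

def is_safe(text: str, threshold: float = 0.5) -> bool:
    """Return True when the classifier labels the text SAFE with enough confidence."""
    result = classifier(text)[0]  # e.g. {'label': 'SAFE', 'score': 0.99}
    return result["label"] == "SAFE" and result["score"] >= threshold

def guarded_reply(user_input: str) -> str:
    if not is_safe(user_input):
        # Blocked before the input ever reaches the LLM.
        return "Sorry, I can't help with that."
    return call_llm(user_input)  # hypothetical placeholder for your own LLM call
```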

## Label Mapping

The model is trained to predict the following string labels directly:

| Label | Description |
| :--- | :--- |
| **SAFE** | Benign queries, general knowledge, small talk. |
| **UNSAFE** | Toxic content, insults, offensive language. |
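
If you want to verify the label mapping programmatically, you can inspect the checkpoint's config without running inference. A small sketch, assuming the string labels above are stored in the standard `id2label`/`label2id` config fields (the index order shown is an assumption):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Famezz/roberta_safety_classifier")
print(config.id2label)   # e.g. {0: 'SAFE', 1: 'UNSAFE'} -- index order is an assumption
print(config.label2id)
```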

## Usage

You can use this model directly with the Hugging Face `pipeline`. The pipeline will automatically output the labels "SAFE" or "UNSAFE".

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

# Test with English
print(classifier("How do I bake a cake?"))
# Output: [{'label': 'SAFE', 'score': 0.99}]

# Test with Italian ("Sei un idiota" = "You are an idiot")
print(classifier("Sei un idiota"))
# Output: [{'label': 'UNSAFE', 'score': 0.98}]
```
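
For batched inference, or when you need the probabilities of both classes rather than just the top label, you can drop down to the standard `AutoTokenizer`/`AutoModelForSequenceClassification` API. This is a sketch under the same assumption that the checkpoint stores the SAFE/UNSAFE mapping in `config.id2label`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Famezz/roberta_safety_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = ["How do I bake a cake?", "Sei un idiota"]  # one English and one Italian input
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)  # one row of class probabilities per input
for text, row in zip(texts, probs):
    idx = int(row.argmax())
    print(f"{text!r} -> {model.config.id2label[idx]} ({row[idx].item():.3f})")
```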