---
language:
- it
- en
license: mit
library_name: transformers
tags:
- text-classification
- safety
- toxicity
- insults
- xlm-roberta
- nlp
base_model: xlm-roberta-base
pipeline_tag: text-classification
---

# XLM-RoBERTa Safety Classifier (Italian & English)

## Model Description

This is an **XLM-RoBERTa-based** binary text classification model fine-tuned to detect **toxicity and insults** in user queries. It was trained on a bilingual (Italian and English) dataset to distinguish **SAFE** (benign) from **UNSAFE** (toxic/harmful) inputs.

- **Model Type:** XLM-RoBERTa (Fine-tuned)
- **Languages:** Italian (`it`), English (`en`)
- **Task:** Binary Classification
- **Training Dataset Size:** 9,035 samples
- **Created by:** [Famezz](https://huggingface.co/Famezz)

## Intended Use

This model is designed to act as a **guardrail** for chatbots and LLMs. It can be used to:
1.  Filter out toxic user inputs before they reach a Large Language Model (see the sketch below).
2.  Flag offensive content in user-generated text.
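
As a minimal guardrail sketch (the `is_safe` helper and the 0.5 threshold here are illustrative, not part of the model), the classifier can gate inputs before they are forwarded to an LLM:

```python
from transformers import pipeline

# Load the safety classifier once at startup
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

def is_safe(user_input: str, threshold: float = 0.5) -> bool:
    """Return True if the input is classified as SAFE with sufficient confidence."""
    result = classifier(user_input)[0]
    return result["label"] == "SAFE" and result["score"] >= threshold

user_input = "How do I bake a cake?"
if is_safe(user_input):
    print("Forward to the LLM:", user_input)
else:
    print("Blocked by the guardrail.")
```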

## Label Mapping

The model is trained to predict the following string labels directly:

| Label | Description |
| :--- | :--- |
| **SAFE** | Benign queries, general knowledge, small talk. |
| **UNSAFE** | Toxic content, insults, offensive language. |

## Usage

You can use this model directly with the Hugging Face `pipeline`. The pipeline will automatically output the labels "SAFE" or "UNSAFE".

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

# Test with English
print(classifier("How do I bake a cake?"))
# Output: [{'label': 'SAFE', 'score': 0.99}]

# Test with Italian
print(classifier("Sei un idiota"))
# Output: [{'label': 'UNSAFE', 'score': 0.98}]
```
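
If you prefer to work below the `pipeline` abstraction, a minimal sketch with `AutoTokenizer` and `AutoModelForSequenceClassification` (assuming the model's `id2label` config maps class indices to `SAFE`/`UNSAFE`, as the pipeline output above suggests) looks like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Famezz/roberta_safety_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize and run a forward pass without gradients
inputs = tokenizer("Sei un idiota", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and map the top index to its string label
probs = torch.softmax(logits, dim=-1)[0]
predicted_id = int(probs.argmax())
print(model.config.id2label[predicted_id], float(probs[predicted_id]))
```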