Update README.md

README.md (CHANGED)

@@ -1,3 +1,62 @@
---
language:
- it
- en
license: mit
library_name: transformers
tags:
- text-classification
- safety
- toxicity
- insults
- xlm-roberta
- nlp
base_model: xlm-roberta-base
pipeline_tag: text-classification
---

# XLM-RoBERTa Safety Classifier (Italian & English)

## Model Description

This is an **XLM-RoBERTa-based** binary text classification model fine-tuned to detect **toxicity and insults** in user queries. It is trained on a bilingual dataset (Italian and English) to distinguish between **SAFE** (benign) and **UNSAFE** (toxic/harmful) inputs.

- **Model Type:** XLM-RoBERTa (Fine-tuned)
- **Languages:** Italian (`it`), English (`en`)
- **Task:** Binary Classification
- **Training Dataset Size:** 9,035 samples
- **Created by:** [Famezz](https://huggingface.co/Famezz)

## Intended Use

This model is designed to act as a **guardrail** for chatbots and LLMs (see the sketch after the list below). It can be used to:
1. Filter out toxic user inputs before they reach a Large Language Model.
2. Flag offensive content in user-generated text.
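
For illustration, a minimal guardrail wrapper might look like the sketch below. The helper names (`is_safe`, `guarded_reply`), the refusal message, and the dummy `llm_call` are hypothetical placeholders; the repo id is the one used in the Usage section further down.

```python
from transformers import pipeline

# Load the safety classifier (repo id as used in the Usage section below)
safety_classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

def is_safe(user_input: str) -> bool:
    """Return True when the classifier labels the input as SAFE."""
    result = safety_classifier(user_input)[0]
    return result["label"] == "SAFE"

def guarded_reply(user_input: str, llm_call) -> str:
    """Forward the input to the downstream LLM only when it passes the safety check.

    `llm_call` stands in for whatever chat/LLM function your application uses.
    """
    if not is_safe(user_input):
        return "Sorry, I can't help with that."
    return llm_call(user_input)

# Example with a dummy LLM callable
dummy_llm = lambda q: f"(LLM answer to: {q})"
print(guarded_reply("How do I bake a cake?", dummy_llm))
print(guarded_reply("Sei un idiota", dummy_llm))  # Italian for "You are an idiot"
```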

## Label Mapping

The model is trained to predict the following string labels directly:

| Label | Description |
| :--- | :--- |
| **SAFE** | Benign queries, general knowledge, small talk. |
| **UNSAFE** | Toxic content, insults, offensive language. |
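
To verify the mapping programmatically, the label strings are stored in the model config; a quick check (the exact integer-to-label order is an assumption, so print it rather than relying on it):

```python
from transformers import AutoConfig

# Load only the config to see how class indices map to the label strings
config = AutoConfig.from_pretrained("Famezz/roberta_safety_classifier")
print(config.id2label)   # expected to contain 'SAFE' and 'UNSAFE'
print(config.label2id)
```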

## Usage

You can use this model directly with the Hugging Face `pipeline`. The pipeline will automatically output the labels "SAFE" or "UNSAFE".

```python
from transformers import pipeline

# Load the classifier; the model id must match the repository name on the Hugging Face Hub
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

# Test with English
print(classifier("How do I bake a cake?"))
# Output: [{'label': 'SAFE', 'score': 0.99}]

# Test with Italian ("Sei un idiota" means "You are an idiot")
print(classifier("Sei un idiota"))
# Output: [{'label': 'UNSAFE', 'score': 0.98}]
```
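
If you prefer to work below the `pipeline` abstraction, for example to batch inputs or look at raw probabilities, the sketch below uses `AutoTokenizer` and `AutoModelForSequenceClassification` directly; it assumes PyTorch is installed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Famezz/roberta_safety_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = ["How do I bake a cake?", "Sei un idiota"]  # the second input means "You are an idiot"
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and report the top label for each input
probs = torch.softmax(logits, dim=-1)
for text, prob in zip(texts, probs):
    label_id = int(prob.argmax())
    print(text, "->", model.config.id2label[label_id], round(float(prob[label_id]), 4))
```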