---
language:
- it
- en
license: mit
library_name: transformers
tags:
- text-classification
- safety
- toxicity
- insults
- xlm-roberta
- nlp
base_model: xlm-roberta-base
pipeline_tag: text-classification
---

# XLM-RoBERTa Safety Classifier (Italian & English)

## Model Description

This is an **XLM-RoBERTa-based** binary text classification model fine-tuned to detect **toxicity and insults** in user queries. It is trained on a bilingual dataset (Italian and English) to distinguish between **SAFE** (benign) and **UNSAFE** (toxic/harmful) inputs.

- **Model Type:** XLM-RoBERTa (fine-tuned)
- **Languages:** Italian (`it`), English (`en`)
- **Task:** Binary classification
- **Training Dataset Size:** 9,035 samples
- **Created by:** [Famezz](https://huggingface.co/Famezz)

## Intended Use

This model is designed to act as a **guardrail** for chatbots and LLMs. It can be used to:

1. Filter out toxic user inputs before they reach a Large Language Model (see the sketch below).
2. Flag offensive content in user-generated text.
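
A minimal pre-LLM filter could look like the sketch below. The helper names (`is_safe`, `guarded_reply`) and the `call_llm` placeholder are illustrative, not part of this model or of `transformers`; the label strings and pipeline usage match the Usage section further down.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

def is_safe(text: str, threshold: float = 0.5) -> bool:
    """Return True when the classifier labels the text SAFE with enough confidence."""
    result = classifier(text)[0]  # e.g. {'label': 'SAFE', 'score': 0.99}
    return result["label"] == "SAFE" and result["score"] >= threshold

def guarded_reply(user_input: str) -> str:
    if not is_safe(user_input):
        # Blocked before the input ever reaches the LLM.
        return "Sorry, I can't help with that."
    return call_llm(user_input)  # hypothetical placeholder for your own LLM call
```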

## Label Mapping

The model is trained to predict the following string labels directly:

| Label | Description |
| :--- | :--- |
| **SAFE** | Benign queries, general knowledge, small talk. |
| **UNSAFE** | Toxic content, insults, offensive language. |
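
If you want to verify the label mapping programmatically, you can inspect the checkpoint's config without running inference. A small sketch, assuming the string labels above are stored in the standard `id2label`/`label2id` config fields (the index order shown is an assumption):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Famezz/roberta_safety_classifier")
print(config.id2label)   # e.g. {0: 'SAFE', 1: 'UNSAFE'} -- index order is an assumption
print(config.label2id)
```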

## Usage

You can use this model directly with the Hugging Face `pipeline`. The pipeline will automatically output the labels "SAFE" or "UNSAFE".

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

# Test with English
print(classifier("How do I bake a cake?"))
# Output: [{'label': 'SAFE', 'score': 0.99}]

# Test with Italian ("Sei un idiota" = "You are an idiot")
print(classifier("Sei un idiota"))
# Output: [{'label': 'UNSAFE', 'score': 0.98}]
```
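
For batched inference, or when you need the probabilities of both classes rather than just the top label, you can drop down to the standard `AutoTokenizer`/`AutoModelForSequenceClassification` API. This is a sketch under the same assumption that the checkpoint stores the SAFE/UNSAFE mapping in `config.id2label`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Famezz/roberta_safety_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = ["How do I bake a cake?", "Sei un idiota"]  # one English and one Italian input
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)  # one row of class probabilities per input
for text, row in zip(texts, probs):
    idx = int(row.argmax())
    print(f"{text!r} -> {model.config.id2label[idx]} ({row[idx].item():.3f})")
```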