BERT-based Toxicity Classifier
This is a BERT-based model fine-tuned for multi-label toxicity classification. It can identify toxic language across several dimensions (toxic, obscene, insult, threat, and identity hate). The model is designed to assist in content moderation by providing an automated way to flag harmful text.
Model Details
Model Description
This model is a fine-tuned version of the bert-base-uncased model from the transformers library. It has been trained on a balanced dataset annotated for toxicity and is capable of multi-label classification. Each input text can be simultaneously assigned multiple toxicity labels.
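In multi-label classification, each label head produces an independent probability (typically via a sigmoid over its logit), so several labels can be active for the same text. A minimal illustration, using hypothetical logits rather than this model's actual outputs:

```python
import math

def sigmoid(logit):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical raw logits for one input, one per label head.
logits = {"toxic": 2.1, "obscene": 0.4, "insult": 1.7, "threat": -3.0, "identity hate": -2.2}

# Each label is scored independently, so several can exceed a threshold at once.
probs = {label: sigmoid(z) for label, z in logits.items()}
active = [label for label, p in probs.items() if p >= 0.5]
print(active)  # multiple labels can fire simultaneously
```

This is unlike single-label (softmax) classification, where increasing one class's probability necessarily decreases the others.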
- Developed by: Ujjawal Kumar (ujjawalsah9801@gmail.com). For any issues or inquiries regarding the use of this model, please contact via email.
- Model type: BERT-based sequence classifier
- Language(s) (NLP): English
- Finetuned from model: bert-base-uncased
Model Sources
- Repository: ujjawalsah/bert-toxicity-classifier
- Paper: [More Information Needed]
- Demo: https://digitalcontentdetection.vercel.app/
Uses
Direct Use
This model can be used directly for toxicity classification in text. It is suitable for content moderation applications where identifying harmful language is critical.
Downstream Use
The model can be integrated into larger systems (e.g., social media monitoring tools or chat moderation systems) or further fine-tuned for domain-specific toxicity detection.
Out-of-Scope Use
This model is not intended for use as the sole basis for automated content removal or decisions with legal implications. It should be complemented with human oversight and additional verification tools.
Bias, Risks, and Limitations
The model may inherit biases from the training data and might not generalize perfectly to all forms of language or dialects. It may also produce false positives or negatives, so its predictions should be interpreted with caution.
Recommendations
- Use as Part of a Pipeline: Integrate the model into a broader content moderation system.
- Validate on Your Data: Evaluate the model on your specific dataset before deploying.
- Monitor and Mitigate Bias: Combine automated predictions with human oversight.
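The recommendations above can be sketched as a simple triage step: remove only at very high confidence, route borderline cases to human review, and allow the rest. The `triage` helper and the thresholds below are illustrative assumptions, not part of this model:

```python
def triage(predictions, remove_threshold=0.95, review_threshold=0.5):
    """Route a text based on its per-label toxicity scores.

    predictions: list of {"label": ..., "score": ...} dicts, as returned per text.
    Returns "remove" only at very high confidence, "review" for borderline
    scores that should go to a human moderator, and "allow" otherwise.
    """
    top = max((p["score"] for p in predictions), default=0.0)
    if top >= remove_threshold:
        return "remove"
    if top >= review_threshold:
        return "review"
    return "allow"

# Example: a borderline toxic score is escalated to human review, not auto-removed.
sample = [{"label": "toxic", "score": 0.62}, {"label": "insult", "score": 0.31}]
print(triage(sample))
```

Tuning the two thresholds against a validation set from your own domain is part of the "validate on your data" step.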
How to Get Started with the Model
The snippet below queries the model through the Hugging Face hosted Inference API using the requests library (replace the placeholder with your own access token):
```python
import requests

# Your Hugging Face access token header
headers = {"Authorization": "Bearer *********************"}

# The model endpoint URL
API_URL = "https://api-inference.huggingface.co/models/ujjawalsah/bert-toxicity-classifier"

# Mapping from raw model labels to human-friendly labels
label_mapping = {
    "LABEL_0": "toxic",
    "LABEL_1": "obscene",
    "LABEL_2": "insult",
    "LABEL_3": "threat",
    "LABEL_4": "identity hate",
}

def query_model(text):
    payload = {"inputs": text}
    response = requests.post(API_URL, headers=headers, json=payload)
    # Check for a successful request
    if response.status_code == 200:
        return response.json()
    print("Error:", response.status_code, response.text)
    return None

def print_readable_result(result):
    # The model returns a list of lists; we take the predictions for the first input.
    predictions = result[0] if isinstance(result, list) and len(result) > 0 else []
    if not predictions:
        print("No predictions received.")
        return
    print("Human-friendly Classification Result:")
    for pred in predictions:
        # Convert the label to human-readable form and show the score as a percentage
        human_label = label_mapping.get(pred.get("label"), pred.get("label"))
        score_percentage = pred.get("score", 0) * 100
        print(f"- {human_label.capitalize()}: {score_percentage:.2f}% confidence")

if __name__ == "__main__":
    # Example text input
    text = "You are a wonderful person."
    result = query_model(text)
    if result:
        print_readable_result(result)
```
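The hosted Inference API may respond with HTTP 503 while the model is cold-starting. A hedged sketch of a retry wrapper; the `query_with_retry` helper and its retry parameters are illustrative assumptions, not part of this model card:

```python
import time
import requests

def query_with_retry(text, api_url, headers, retries=3, wait=5.0):
    """Query the endpoint, retrying while the hosted model is still loading (HTTP 503)."""
    for attempt in range(retries):
        response = requests.post(api_url, headers=headers, json={"inputs": text})
        if response.status_code == 200:
            return response.json()
        if response.status_code == 503 and attempt < retries - 1:
            time.sleep(wait)  # model is cold-starting; wait and try again
            continue
        response.raise_for_status()  # any other error (or final 503) is surfaced
    return None
```

Calling `query_with_retry(text, API_URL, headers)` in place of `query_model(text)` makes the example tolerant of cold starts without changing the rest of the script.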
License
This model is licensed under the MIT License.