library_name: transformers
pipeline_tag: text-classification
license: openrail++
tags:
  - text-classification
  - toxicity
  - roberta
  - llada
  - distillation
language:
  - en
datasets:
  - thesofakillers/jigsaw-toxic-comment-classification-challenge
  - google/civil_comments
  - allenai/real-toxicity-prompts
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - roc_auc
  - pr_auc

roberta_toxicity_classifier_LLaDA

Binary toxicity classifier for LLaDA-tokenized text.

This model is a RoBERTa-style sequence classifier that uses the GSAI-ML/LLaDA-8B-Base tokenizer vocabulary. It predicts one of two labels:

neutral
toxic

Usage

This repo includes custom modeling code, so load it with trust_remote_code=True.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "kl1/roberta_toxicity_classifier_LLaDA"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
).eval()

texts = [
    "I hope you have a wonderful day.",
    "You are disgusting and should disappear.",
]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

toxic_id = model.config.label2id["toxic"]
print(probs[:, toxic_id].tolist())

The tokenizer prepends the required [CLS] token by default.
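
The [CLS] default can be sanity-checked before scoring. The helper below is a minimal sketch: the name `starts_with_cls` is hypothetical, and it assumes the tokenizer exposes a `cls_token_id` attribute, which is standard on Hugging Face tokenizers but not confirmed here for this repo's tokenizer.

```python
def starts_with_cls(tokenizer, text):
    """Return True if the first encoded id equals the tokenizer's CLS id."""
    # Assumption: `cls_token_id` is defined on this tokenizer.
    cls_id = tokenizer.cls_token_id
    if cls_id is None:
        raise ValueError("tokenizer does not define a CLS token")
    # Encode a single string without tensors so the ids come back as a list.
    input_ids = tokenizer(text)["input_ids"]
    return len(input_ids) > 0 and input_ids[0] == cls_id
```

With the `tokenizer` loaded in the snippet above, `starts_with_cls(tokenizer, "hello")` should return True if the default behavior holds.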

Training

The student classifier was initialized from s-nlp/roberta_toxicity_classifier and distilled with that model as the teacher.

Objective:

supervised binary toxicity classification
teacher KL distillation with kl_weight=0.2
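
One way the two terms could combine into a per-example loss is sketched below. Only kl_weight=0.2 comes from this card. The additive form (cross-entropy plus kl_weight times KL), the KL direction (teacher to student), and the absence of a temperature are assumptions; the actual recipe is whatever distill_config.yaml and the training code specify. The function names are hypothetical, and plain Python is used to keep the sketch framework-independent.

```python
import math

KL_WEIGHT = 0.2  # value stated in this model card


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def distillation_loss(student_logits, teacher_logits, label, kl_weight=KL_WEIGHT):
    """Supervised cross-entropy plus weighted KL(teacher || student).

    Assumption: the terms are summed as ce + kl_weight * kl.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    # Supervised term: negative log-likelihood of the gold label.
    ce = -log_p_student[label]
    # Distillation term: KL divergence from teacher to student distribution.
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return ce + kl_weight * kl
```

When student and teacher logits are identical, the KL term is zero and the loss reduces to the supervised cross-entropy.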

Training configuration and run metadata are included in:

distill_config.yaml
training_summary.json

Validation Metrics

Checkpoint: step 20000.

metric	value
accuracy	0.9560
F1	0.7445
precision	0.7127
recall	0.7794
ROC-AUC	0.9762
PR-AUC	0.8328
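
As a quick consistency check on the table, F1 can be recomputed from the reported precision and recall as their harmonic mean. This is only arithmetic on the rounded values above, so a difference of about 1e-4 from the reported F1 is expected.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_f1 = 0.7445
recomputed_f1 = f1_from(precision=0.7127, recall=0.7794)

# The inputs are rounded to four decimals, so allow a small tolerance.
print(abs(recomputed_f1 - reported_f1) < 5e-4)
```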

Best validation threshold from sweep: 0.5378.
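
To turn toxic-class probabilities into labels, the swept threshold can replace the default argmax. The sketch below makes two assumptions the card does not state: scores at or above the threshold count as toxic, and the threshold applies to the softmax probability of the toxic class. The function name `classify` is hypothetical. The card also does not say which criterion the sweep optimized or whether the metrics above were computed at this threshold.

```python
BEST_THRESHOLD = 0.5378  # validation-sweep value reported in this card


def classify(toxic_probs, threshold=BEST_THRESHOLD):
    """Map toxic-class probabilities to 'toxic' / 'neutral' labels.

    Assumption: a probability >= threshold is labeled toxic.
    """
    return ["toxic" if p >= threshold else "neutral" for p in toxic_probs]
```

For example, `classify(probs[:, toxic_id].tolist())` would label the batch scored in the Usage snippet.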

License

Model weights are released under OpenRAIL++.

Third-party notices are listed in THIRD_PARTY_NOTICES.md.

Limitations

This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.