---
library_name: transformers
pipeline_tag: text-classification
license: openrail++
tags:
- text-classification
- toxicity
- roberta
- llada
- distillation
language:
- en
datasets:
- thesofakillers/jigsaw-toxic-comment-classification-challenge
- google/civil_comments
- allenai/real-toxicity-prompts
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
- pr_auc
---

# roberta_toxicity_classifier_LLaDA

Binary toxicity classifier for LLaDA-tokenized text.

This model is a RoBERTa-style sequence classifier using the `GSAI-ML/LLaDA-8B-Base` tokenizer vocabulary. It predicts:

- `neutral`
- `toxic`

## Usage

This repo includes custom modeling code, so load with `trust_remote_code=True`.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "kl1/roberta_toxicity_classifier_LLaDA"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
).eval()

texts = [
    "I hope you have a wonderful day.",
    "You are disgusting and should disappear.",
]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

toxic_id = model.config.label2id["toxic"]
print(probs[:, toxic_id].tolist())
```

The tokenizer prepends the required `[CLS]` token by default.

## Training

The student classifier was initialized from and distilled against `s-nlp/roberta_toxicity_classifier`.

Objective:

- supervised binary toxicity classification
- teacher KL distillation with `kl_weight=0.2`

Training configuration and run metadata are included in:

- `distill_config.yaml`
- `training_summary.json`

## Validation Metrics

Checkpoint: step 20000.

| metric | value |
| --- | ---: |
| accuracy | 0.9560 |
| F1 | 0.7445 |
| precision | 0.7127 |
| recall | 0.7794 |
| ROC-AUC | 0.9762 |
| PR-AUC | 0.8328 |

Best validation threshold from sweep: `0.5378`.

## License

Model weights are released under OpenRAIL++. Third-party notices are listed in `THIRD_PARTY_NOTICES.md`.

## Limitations

This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.
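
## Applying the Validation Threshold

The validation sweep reports a best threshold of `0.5378`, which is not the same as taking the argmax over the two classes. The sketch below shows one way to turn a pair of logits into a label with that threshold. It assumes the threshold is compared against the softmax probability of the `toxic` class, which this card does not state explicitly. The helper name `label_from_logits`, the `>=` comparison, and the default `toxic_id=1` are illustrative assumptions; read the real index from `model.config.label2id["toxic"]`.

```python
import math

# Threshold reported by the validation sweep in this card.
BEST_THRESHOLD = 0.5378


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, toxic_id=1, threshold=BEST_THRESHOLD):
    """Return (label, toxic_probability) for one example's logits.

    toxic_id=1 is an assumption for illustration; use
    model.config.label2id["toxic"] with the real model.
    """
    p_toxic = softmax(logits)[toxic_id]
    label = "toxic" if p_toxic >= threshold else "neutral"
    return label, p_toxic


# Made-up logits for illustration, not real model outputs.
for logits in ([2.0, -1.0], [-0.5, 1.5], [0.0, 0.1]):
    print(logits, label_from_logits(logits))
```

The last example has a toxic probability slightly above 0.5 but below the threshold, so it is labeled `neutral` here even though argmax would pick `toxic`.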
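
## Distillation Objective Sketch

The Training section lists a supervised classification term plus a teacher KL term with `kl_weight=0.2`. A minimal sketch of how such terms are often combined follows. It assumes the loss is cross-entropy plus `kl_weight` times KL(teacher || student) with no temperature scaling; the card does not confirm this form, and `distill_config.yaml` remains the authority on the actual setup. The function name `distillation_loss` is hypothetical.

```python
import math

KL_WEIGHT = 0.2  # kl_weight value stated in this card


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, label, kl_weight=KL_WEIGHT):
    """Assumed form: CE(student, label) + kl_weight * KL(teacher || student)."""
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # Supervised term: negative log-likelihood of the gold label.
    ce = -s_logp[label]
    # Distillation term: KL divergence from teacher to student distribution.
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return ce + kl_weight * kl


# Made-up logits for illustration; label 0 stands for one of the two classes.
print(distillation_loss([0.0, 0.0], [0.0, 0.0], label=0))
print(distillation_loss([0.0, 0.0], [2.0, -2.0], label=0))
```

When the student matches the teacher, the KL term vanishes and only the supervised term remains; any disagreement adds a non-negative penalty scaled by `kl_weight`.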