---
library_name: transformers
pipeline_tag: text-classification
license: openrail++
tags:
- text-classification
- toxicity
- roberta
- llada
- distillation
language:
- en
datasets:
- thesofakillers/jigsaw-toxic-comment-classification-challenge
- google/civil_comments
- allenai/real-toxicity-prompts
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
- pr_auc
---
# roberta_toxicity_classifier_LLaDA
Binary toxicity classifier for LLaDA-tokenized text.
This model is a RoBERTa-style sequence classifier that uses the `GSAI-ML/LLaDA-8B-Base` tokenizer vocabulary. It predicts one of two labels:
- `neutral`
- `toxic`
## Usage
This repo includes custom modeling code, so load with `trust_remote_code=True`.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "kl1/roberta_toxicity_classifier_LLaDA"

# trust_remote_code=True is needed because the repo ships custom code.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
).eval()

texts = [
    "I hope you have a wonderful day.",
    "You are disgusting and should disappear.",
]
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# Run inference without autograd and convert logits to class probabilities.
with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Print the probability of the "toxic" class for each input text.
toxic_id = model.config.label2id["toxic"]
print(probs[:, toxic_id].tolist())
```
The tokenizer prepends the required `[CLS]` token by default.
## Training
The student classifier was initialized from `s-nlp/roberta_toxicity_classifier` and distilled against it as the teacher.
Objective:
- supervised binary toxicity classification
- teacher KL distillation with `kl_weight=0.2`
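The two objective terms above could be combined roughly as follows. This is a minimal, framework-free sketch and not the training code from this repo: the additive form `CE + kl_weight * KL`, the KL direction (teacher relative to student), a temperature of 1, the `[neutral, toxic]` logit order, and the helper names `softmax` and `distillation_loss` are all assumptions. Only `kl_weight=0.2` comes from this card.

```python
import math

KL_WEIGHT = 0.2  # value stated in this card; everything else is assumed


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, kl_weight=KL_WEIGHT):
    """Cross-entropy on the gold label plus weighted KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Supervised term: negative log-likelihood of the gold label.
    ce = -math.log(p_student[label])
    # Distillation term: KL divergence from the teacher distribution.
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return ce + kl_weight * kl


# One example with logits ordered as [neutral, toxic] (assumed order).
loss = distillation_loss([0.3, 1.2], [-0.5, 2.0], label=1)
print(round(loss, 4))
```

When the student matches the teacher exactly, the KL term is zero and the loss reduces to plain cross-entropy. The configuration actually used for training is the one recorded in `distill_config.yaml`.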
Training configuration and run metadata are included in:
- `distill_config.yaml`
- `training_summary.json`
## Validation Metrics
Checkpoint: step 20000.

| metric | value |
| --- | ---: |
| accuracy | 0.9560 |
| F1 | 0.7445 |
| precision | 0.7127 |
| recall | 0.7794 |
| ROC-AUC | 0.9762 |
| PR-AUC | 0.8328 |

Best validation threshold from sweep: `0.5378`.
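
To turn toxic-class probabilities into labels with the swept threshold, something like the following could be used. It is a sketch that assumes the threshold applies to the softmax probability of the `toxic` class and that scores at or above it count as toxic. The card states neither detail, and `label_from_probs` is a hypothetical helper.

```python
BEST_THRESHOLD = 0.5378  # validation sweep value from this card


def label_from_probs(toxic_probs, threshold=BEST_THRESHOLD):
    """Map toxic-class probabilities to "toxic" / "neutral" labels."""
    return ["toxic" if p >= threshold else "neutral" for p in toxic_probs]


# Made-up probabilities, e.g. the output of probs[:, toxic_id].tolist().
print(label_from_probs([0.02, 0.53, 0.91]))
# ['neutral', 'neutral', 'toxic']
```

Note that a score of 0.53 is labeled `neutral` here, whereas the default argmax over two classes (equivalent to a 0.5 cutoff) would label it `toxic`.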
## License
Model weights are released under OpenRAIL++.
Third-party notices are listed in `THIRD_PARTY_NOTICES.md`.
## Limitations
This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.