---
library_name: transformers
pipeline_tag: text-classification
license: openrail++
tags:
- text-classification
- toxicity
- roberta
- llada
- distillation
language:
- en
datasets:
- thesofakillers/jigsaw-toxic-comment-classification-challenge
- google/civil_comments
- allenai/real-toxicity-prompts
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
- pr_auc
---

# roberta_toxicity_classifier_LLaDA

Binary toxicity classifier for LLaDA-tokenized text.

This model is a RoBERTa-style sequence classifier using the `GSAI-ML/LLaDA-8B-Base` tokenizer vocabulary. It predicts:

- `neutral`
- `toxic`

## Usage

This repo includes custom modeling code, so pass `trust_remote_code=True` when loading both the tokenizer and the model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "kl1/roberta_toxicity_classifier_LLaDA"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
).eval()

texts = [
    "I hope you have a wonderful day.",
    "You are disgusting and should disappear.",
]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

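# Run without autograd and convert logits to class probabilities.
with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Look up the `toxic` class index from the config instead of hard-coding it.
toxic_id = model.config.label2id["toxic"]
print(probs[:, toxic_id].tolist())  # one toxic probability per input text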
```

The tokenizer prepends the required `[CLS]` token by default.

## Training

The student classifier was initialized from `s-nlp/roberta_toxicity_classifier` and then distilled with that same model as the teacher.

Objective:

- supervised binary toxicity classification
- teacher KL distillation with `kl_weight=0.2`
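
The two objective terms above can be combined in more than one way. The sketch below shows one common form for a single example: cross-entropy on the gold label plus `kl_weight` times KL(teacher || student). The additive weighting, the KL direction, and the absence of a temperature are all assumptions, and `softmax` and `distill_loss` are hypothetical helpers; `distill_config.yaml` is the authority for what the run actually used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, label, kl_weight=0.2):
    """Hypothetical combined loss for one example.

    Assumption: loss = CE(student, label) + kl_weight * KL(teacher || student).
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Supervised term: negative log-likelihood of the gold label.
    ce = -math.log(p_student[label])
    # Distillation term: KL divergence from teacher to student.
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return ce + kl_weight * kl


# Index 0 = neutral, index 1 = toxic (illustrative logits, not model output).
loss = distill_loss([0.3, 1.2], [-0.5, 2.0], label=1)
print(round(loss, 4))
```

When the student matches the teacher exactly, the KL term is zero and the loss reduces to plain cross-entropy.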

Training configuration and run metadata are included in:

- `distill_config.yaml`
- `training_summary.json`

## Validation Metrics

Checkpoint: step 20000.

| metric | value |
| --- | ---: |
| accuracy | 0.9560 |
| F1 | 0.7445 |
| precision | 0.7127 |
| recall | 0.7794 |
| ROC-AUC | 0.9762 |
| PR-AUC | 0.8328 |
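
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two rows in the table. The sketch below uses a hypothetical `f1_score` helper; because the table's precision and recall are already rounded to four decimals, the recomputed value may differ from the reported F1 in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
f1 = f1_score(0.7127, 0.7794)
print(f"{f1:.4f}")
```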

Best decision threshold from the validation sweep: `0.5378`.
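
To turn scores into labels, the swept threshold can replace the default 0.5 cutoff. This is a minimal sketch, assuming the threshold applies to the softmax probability of the `toxic` class and that a score at or above it counts as toxic; `label_from_probs` is a hypothetical helper, not part of this repo.

```python
TOXIC_THRESHOLD = 0.5378  # validation threshold reported in this model card


def label_from_probs(toxic_probs, threshold=TOXIC_THRESHOLD):
    """Map toxic-class probabilities to label strings.

    Assumption: a score at or above the threshold is labeled `toxic`.
    """
    return ["toxic" if p >= threshold else "neutral" for p in toxic_probs]


# Illustrative probabilities, e.g. probs[:, toxic_id].tolist() from the usage example.
labels = label_from_probs([0.02, 0.51, 0.54, 0.97])
print(labels)  # ['neutral', 'neutral', 'toxic', 'toxic']
```

A threshold tuned on this validation set may not transfer to other data distributions, so it is worth re-sweeping on data that matches the target use.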

## License

Model weights are released under OpenRAIL++.

Third-party notices are listed in `THIRD_PARTY_NOTICES.md`.

## Limitations

This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.