	---
	library_name: transformers
	pipeline_tag: text-classification
	license: openrail++
	tags:
	- text-classification
	- toxicity
	- roberta
	- llada
	- distillation
	language:
	- en
	datasets:
	- thesofakillers/jigsaw-toxic-comment-classification-challenge
	- google/civil_comments
	- allenai/real-toxicity-prompts
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	- roc_auc
	- pr_auc
	---

	# roberta_toxicity_classifier_LLaDA

	Binary toxicity classifier for LLaDA-tokenized text.

	This model is a RoBERTa-style sequence classifier that uses the `GSAI-ML/LLaDA-8B-Base` tokenizer vocabulary. It predicts one of two labels:

	- `neutral`
	- `toxic`

	## Usage

	This repo includes custom modeling code, so load with `trust_remote_code=True`.

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model_id = "kl1/roberta_toxicity_classifier_LLaDA"

	# trust_remote_code=True is needed because the repo ships custom modeling code.
	tokenizer = AutoTokenizer.from_pretrained(
	    model_id,
	    trust_remote_code=True,
	    use_fast=True,
	)
	model = AutoModelForSequenceClassification.from_pretrained(
	    model_id,
	    trust_remote_code=True,
	).eval()

	texts = [
	    "I hope you have a wonderful day.",
	    "You are disgusting and should disappear.",
	]

	inputs = tokenizer(
	    texts,
	    padding=True,
	    truncation=True,
	    max_length=512,
	    return_tensors="pt",
	)

	# Convert logits to class probabilities without tracking gradients.
	with torch.inference_mode():
	    probs = torch.softmax(model(**inputs).logits, dim=-1)

	# Look up the index of the "toxic" class and print its probability per text.
	toxic_id = model.config.label2id["toxic"]
	print(probs[:, toxic_id].tolist())
	```

	The tokenizer prepends the required `[CLS]` token by default.

	## Training

	The student classifier was initialized from `s-nlp/roberta_toxicity_classifier` and distilled against that same model as the teacher.

	Objective:

	- supervised binary toxicity classification
	- teacher KL distillation with `kl_weight=0.2`
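
The two objective terms above can be sketched for a single example in plain Python. This is an illustration only, not the training code from this repo: it assumes the terms are simply summed as `CE + kl_weight * KL(teacher || student)`, a softmax temperature of 1, and a label order of `0 = neutral`, `1 = toxic`. The function name `distillation_loss` is hypothetical; the actual settings are in `distill_config.yaml`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, kl_weight=0.2):
    """Supervised cross-entropy plus a weighted teacher KL term (assumed form)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)

    # Supervised term: negative log-likelihood of the gold label.
    ce = -math.log(p_student[label])

    # Distillation term: KL(teacher || student), zero when both agree.
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return ce + kl_weight * kl


# Assumed label order: index 0 = neutral, index 1 = toxic.
loss = distillation_loss(
    student_logits=[0.4, 1.1],
    teacher_logits=[-0.5, 2.0],
    label=1,
)
print(loss)
```

With `kl_weight=0.2`, the supervised term dominates and the teacher term acts as a regularizer that pulls the student's distribution toward the teacher's.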

	Training configuration and run metadata are included in:

	- `distill_config.yaml`
	- `training_summary.json`

	## Validation Metrics

	Checkpoint: step 20000.

	| metric | value |
	| --- | ---: |
	| accuracy | 0.9560 |
	| F1 | 0.7445 |
	| precision | 0.7127 |
	| recall | 0.7794 |
	| ROC-AUC | 0.9762 |
	| PR-AUC | 0.8328 |
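
As a quick consistency check, F1 can be recomputed from the reported precision and recall, since it is their harmonic mean. The snippet below uses only the values from the table; the small gap to the reported F1 is within what four-decimal rounding of the inputs would explain.

```python
precision = 0.7127
recall = 0.7794
reported_f1 = 0.7445

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))                   # 0.745
print(abs(f1 - reported_f1) < 5e-4)   # True
```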

	Best validation threshold from sweep: `0.5378`.
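
A minimal sketch of applying this threshold to the toxic-class probability instead of taking the argmax. The helper name `label_from_prob` and the example probabilities are made up for illustration, and the `>=` comparison is an assumption: the card does not say how the sweep treated ties or which metric it optimized.

```python
BEST_THRESHOLD = 0.5378  # validation threshold reported in this card


def label_from_prob(toxic_prob, threshold=BEST_THRESHOLD):
    """Map a toxic-class probability to a label (assumes >= counts as toxic)."""
    return "toxic" if toxic_prob >= threshold else "neutral"


# Hypothetical toxic-class probabilities, e.g. probs[:, toxic_id].tolist()
toxic_probs = [0.03, 0.52, 0.91]
labels = [label_from_prob(p) for p in toxic_probs]
print(labels)  # ['neutral', 'neutral', 'toxic']
```

The middle example shows the difference from a plain argmax: a probability of 0.52 would be labeled `toxic` at a 0.5 cutoff but is `neutral` at 0.5378.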

	## License

	Model weights are released under OpenRAIL++.

	Third-party notices are listed in `THIRD_PARTY_NOTICES.md`.

	## Limitations

	This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.