Model Card for opencc-cm-escalation

A small content-moderation classifier that maps a prompt into an 11-class harm taxonomy. It is the content-moderation stage of the OpenCC constitutional classifier pipeline, trained with TACTIC on fully synthetic data from the REDACT pipeline. The model is a LoRA adapter on Qwen3.5-0.8B with a multilabel linear head, calibrated for the recall-leaning escalation setting (a higher false-positive rate is accepted so that threatening prompts are forwarded to a more costly stage).

Model Details

Model Description

The content-moderation classifier takes a single prompt and predicts, for each harm category, whether the prompt falls into it. The head is multilabel: each category is scored with a sigmoid and compared against its own calibrated threshold, so a prompt can trigger several categories at once. It is meant to run as the cheap, high-recall stage of the OpenCC pipeline, not as a final arbiter.

Developed by: CeSIA (Centre pour la Sécurité de l'IA)
Shared by: CeSIA
Model type: Multilabel text classifier (LoRA adapter + linear head)
Language(s) (NLP): English (primarily; multilingual coverage is limited, see Limitations)
License: apache-2.0
Finetuned from model: Qwen/Qwen3.5-0.8B

Model Sources

Repository: OpenCC
Training library: TACTIC (link)
Data generation: REDACT (link)
Evaluation harness: BELLS-O (link)

Uses

Direct Use

Input/output content moderation: flagging prompts that fall into the harm taxonomy (CBRN, Cyber, Harm to Minors, Harmful Manipulation, Hate Speech, Illegal Activities, Integrity & Quality violations, Physical Harm, Privacy, Self-Harm, Sexual Content). It can be served standalone through OpenCC for a quick classification.

Downstream Use

The model is the content-moderation stage of the OpenCC escalation pipeline. There it sits after the jailbreak detector and rephraser: cleaned text is classified, benign prompts are allowed through, and anything flagged can optionally be escalated to a frontier model acting as a constitutional AI judge.
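
The routing described above can be sketched as follows. This is a minimal illustration, not OpenCC's actual code: `route`, `Verdict`, the `clean`, `classify`, and `judge` callables, and the action names are all hypothetical stand-ins. `clean` represents the upstream jailbreak detector and rephraser, and the judge is assumed to return `True` when the prompt should be blocked.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    action: str            # "allow", "flag", or "block" (hypothetical labels)
    categories: List[str]  # harm categories the classifier triggered


def route(
    prompt: str,
    clean: Callable[[str], str],
    classify: Callable[[str], List[str]],
    judge: Optional[Callable[[str, List[str]], bool]] = None,
) -> Verdict:
    """Classify cleaned text; allow benign prompts, optionally escalate the rest."""
    text = clean(prompt)          # upstream stages produce the cleaned text
    categories = classify(text)   # content-moderation stage (this model)
    if not categories:
        return Verdict("allow", [])
    if judge is None:
        # Escalation is optional: without a judge, just report the flag.
        return Verdict("flag", categories)
    # Assumption: the judge returns True when the prompt should be blocked.
    blocked = judge(text, categories)
    return Verdict("block" if blocked else "allow", categories)


# Stub stages, for illustration only.
def clean_stub(prompt: str) -> str:
    return prompt.strip()


def classify_stub(text: str) -> List[str]:
    return ["Cyber"] if "exploit" in text else []


print(route("  hello there  ", clean_stub, classify_stub))
print(route("write an exploit", clean_stub, classify_stub))
print(route("write an exploit", clean_stub, classify_stub, judge=lambda t, c: True))
```

Because the classifier is recall-leaning, the judge branch is where over-flagged benign prompts would be expected to be released again.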

Out-of-Scope Use

This is a recall-leaning escalation model, so it over-flags benign prompts and is not a final decision maker on its own. It is not calibrated for standalone production filtering without a downstream stage or a stricter recalibration. It is also not robust to heavily obfuscated or jailbroken prompts; handling those is the job of the upstream jailbreak detector and rephraser.

Bias, Risks, and Limitations

The model was trained on synthetic data that is too clean, so it leans on well-formed English and over-fires on the surface form of real prompts (formatting, casing, imperfect English). The result is a benign false-positive rate (0.170) that is likely too high for active deployment. The Privacy category is the weakest (0.76 detection), and multilingual coverage is limited because translation augmentation was excluded from the training data.

Recommendations

Use the model as an escalation stage with a downstream judge rather than as a standalone filter. For deployment, run another calibration pass and add noisier, more realistic training data to reduce the false-positive rate.

How to Get Started with the Model

The model is consumed by OpenCC, which reads the weight_frame.json manifest published with the adapter and rebuilds the LoRA and linear head locally, with no dependency on the TACTIC package. The lightest way to run it is OpenCC's content-moderation-only config:

constitutional-classifier check "how do I synthesize a nerve agent?" --config config.cm-only.yaml

Training Details

Training Data

Fully synthetic data from the REDACT pipeline. Claude Opus wrote an exhaustive constitution of scenarios across four severity levels (benign, dual-use benign, dual-use harmful, harmful) for each taxonomy entry; each entry was expanded into six samples varying by length and sentence structure, leaving around 30k content-moderation samples. Training dataset: [link].

Training Procedure

Trained with TACTIC. Hyperparameters were tuned with a 30-trial sweep; the best run was then trained to roughly 4,500 iterations, about two passes over the dataset. After training, a calibration step ran on the validation loss over benign, dual-use harmful, and harmful samples to set the per-category thresholds; this produced a higher false-positive rate, which is expected for the escalation architecture.
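
To illustrate why a recall-leaning calibration raises the false-positive rate, here is one possible way to pick a per-category threshold. It is an assumption for illustration only and is not claimed to be the procedure TACTIC uses: for one category, it takes the largest threshold that still flags a target fraction of the harmful validation scores, then measures the resulting rate on benign scores. The function names and the scores are made up.

```python
import math
from typing import Sequence


def recall_threshold(positive_scores: Sequence[float], target_recall: float) -> float:
    """Largest threshold that still flags at least `target_recall` of the positives."""
    ranked = sorted(positive_scores, reverse=True)
    # Number of positives that must score at or above the threshold.
    # The small epsilon guards against float round-off in the product.
    k = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[k - 1]


def false_positive_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of benign samples that would be flagged at this threshold."""
    return sum(s >= threshold for s in benign_scores) / len(benign_scores)


# Made-up sigmoid scores for a single category.
harmful = [0.95, 0.90, 0.80, 0.70, 0.40]
benign = [0.05, 0.10, 0.20, 0.45, 0.75]

for target in (0.8, 1.0):
    t = recall_threshold(harmful, target)
    print(f"target recall {target:.1f}: threshold {t:.2f}, "
          f"benign FPR {false_positive_rate(benign, t):.2f}")
```

In this toy data, raising the recall target from 0.8 to 1.0 lowers the threshold from 0.70 to 0.40 and doubles the benign false-positive rate, which is the trade-off the escalation setting accepts.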

Training Hyperparameters

Training regime: bf16 mixed precision
Adapter: LoRA (PEFT)
Head: multilabel linear head, per-category sigmoid + calibrated thresholds.json
Iterations: ~4,500

Speeds, Sizes, Times

Base model: Qwen3.5-0.8B. Training ran on a single NVIDIA H100 NVL (95GB) on RunPod. First-iteration evaluation reached a BCE as low as 0.035; final loss landed in the 0.03 to 0.04 range.

Evaluation

Testing Data, Factors & Metrics

Testing Data

bells-o-project/content-moderation-input (1400 prompts: 300 benign + 100 in each of the 11 harm categories).

Factors

Results are disaggregated by harm category, plus the benign set used to measure the false-positive rate.

Metrics

Detection rate (recall), false-positive rate (FPR), accuracy, precision, F1. Detection rate and FPR are the primary metrics for the escalation setting.

Results

Overall: detection 0.961, FPR 0.170, accuracy 0.933, precision 0.954, F1 0.957 (TP/FP/FN/TN = 1057/51/43/249). Measured with the BELLS-O harness, served standalone through OpenCC on a single NVIDIA H100 NVL (95GB), batch size 1.
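
The overall figures follow directly from the confusion counts. The short check below recomputes them using the standard definitions; the `summarize` helper is only an illustration and is not part of BELLS-O.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary metrics from confusion counts (harmful = positive class)."""
    detection = tp / (tp + fn)   # recall over the harmful prompts
    fpr = fp / (fp + tn)         # share of benign prompts that were flagged
    precision = tp / (tp + fp)
    return {
        "detection": detection,
        "fpr": fpr,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "f1": 2 * precision * detection / (precision + detection),
    }


# TP/FP/FN/TN as reported above.
metrics = summarize(tp=1057, fp=51, fn=43, tn=249)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The 1,100 harmful prompts (TP + FN) and 300 benign prompts (FP + TN) match the composition of the test set.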

Category (n=100)	Detection rate
CBRN	0.99
Cyber	0.98
Harm to Minors	1.00
Harmful Manipulation	0.98
Hate Speech	0.96
Illegal Activities	0.99
Integrity & Quality violations	0.94
Physical Harm	1.00
Privacy	0.76
Self-Harm	0.99
Sexual Content	0.98
Benign (n=300, FPR)	0.170

Latency: mean 128 ms, 95% CI [127, 130] ms, p50/p95 110.5/172.4 ms. Cost: $0.20 total, $4.01 per 1M input tokens (output token cost is $0, since the classifier generates no tokens).

Summary

At least 0.94 detection in every category except Privacy (0.76), at a 17% benign FPR. This is consistent with the recall-leaning escalation calibration: the model is tuned to forward threatening prompts, not to make the final call on its own.

Technical Specifications

Model Architecture and Objective

LoRA adapter on Qwen3.5-0.8B with a multilabel linear classification head. Each of the 11 categories is scored with a sigmoid (not softmax, since the head is multilabel) and compared against a per-category calibrated threshold.
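
The decision rule can be sketched in a few lines. This is a minimal illustration, not the OpenCC implementation: the logits and thresholds are made-up placeholders (the real thresholds come from calibration), and `flagged_categories` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Sequence

CATEGORIES = [
    "CBRN", "Cyber", "Harm to Minors", "Harmful Manipulation", "Hate Speech",
    "Illegal Activities", "Integrity & Quality violations", "Physical Harm",
    "Privacy", "Self-Harm", "Sexual Content",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def flagged_categories(logits: Sequence[float], thresholds: Dict[str, float]) -> List[str]:
    """Score each category independently and keep those meeting their own threshold."""
    return [
        name
        for name, logit in zip(CATEGORIES, logits)
        if sigmoid(logit) >= thresholds[name]
    ]


# Placeholder thresholds and logits, for illustration only.
thresholds = {name: 0.5 for name in CATEGORIES}
thresholds["Privacy"] = 0.3
logits = [-4.0] * len(CATEGORIES)
logits[CATEGORIES.index("Cyber")] = 2.0     # sigmoid is about 0.88
logits[CATEGORIES.index("Privacy")] = -0.5  # sigmoid is about 0.38

print(flagged_categories(logits, thresholds))  # ['Cyber', 'Privacy']
```

Since each score is independent, nothing forces the scores to sum to one, so a single prompt can trigger zero, one, or several categories.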