CoPE-B-A4B: The COntent Policy Evaluator Model (Text-Only Variant)

Model Overview

CoPE-B-A4B is the 2nd-generation Content Policy Evaluator model from Zentropi, built on the Gemma-4-26B-A4B-it Mixture-of-Experts architecture. It performs accurate content classification based on developer-customizable policies — like its predecessor CoPE-A-9B, but with substantially improved policy steerability and a larger context window (now 256K tokens).

This is the text-only variant. It cannot accept image inputs. For multimodal classification (including images and video), see zentropi-ai/cope-b-a4b-mm — the multimodal companion model that shares CoPE-B's training methodology and adds native image understanding.

Full methodology, training recipe, and evaluation details behind the CoPE model family are described in our paper: "CoPE: A Small Language Model for Steerable and Scalable Content Labeling" (arXiv:2512.18027).

Key Features

Improved steerability and context length vs CoPE-A-9B
Policy-adaptive content classification (no fixed taxonomy)
High-accuracy, low-latency binary labels
Mixture-of-Experts efficiency: 25.2B total / 3.8B active parameters
Frontier-level capability at consumer-GPU inference cost

Getting Started

You can use CoPE-B-A4B in three ways:

Zentropi API — fastest path, with a generous free tier (no infra required)
Self-hosted vLLM — for production-scale serving on your own infrastructure
Direct inference in Python — load via Transformers; see this Colab notebook for a working example

See the Running the Model section below for details on each.

Technical Specifications

Model Architecture

CoPE-B-A4B is built on top of Gemma-4-26B-A4B-it:

25.2 billion total parameters across 128 experts per layer
3.8 billion active parameters per forward pass (top-k=8 of 128 experts)
256K-token context window

CoPE-B-A4B was fine-tuned with LoRA, and the adapter weights were then merged into the base model, so it runs directly without adapters. For efficiency, this text-only variant cannot process image inputs at all.

Input Format

The model accepts input in a structured format:

Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.


POLICY
======

[policy text]


CONTENT
=======

[content text]

This prompt should be passed as the user-turn content and formatted with the Gemma-4 chat template:

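# cope_prompt is the filled-in prompt above; tokenizer is the model's Hugging Face tokenizer
messages = [{"role": "user", "content": cope_prompt}]
# tokenize=False returns the formatted prompt string; add_generation_prompt=True opens the assistant turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)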
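The template above can be filled programmatically. The following is a minimal sketch, not the reference implementation: the helper name `build_cope_prompt` is hypothetical, the blank-line spacing is copied from the template as it appears in this card (check the Colab notebook for the exact whitespace), and the policy and content strings are invented placeholders.

```python
COPE_INSTRUCTION = (
    "Examine the given POLICY and determine if the given CONTENT meets the "
    'criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.'
)


def build_cope_prompt(policy_text: str, content_text: str) -> str:
    """Fill the POLICY / CONTENT template (spacing is an assumption)."""
    return (
        f"{COPE_INSTRUCTION}\n\n\n"
        f"POLICY\n======\n\n{policy_text.strip()}\n\n\n"
        f"CONTENT\n=======\n\n{content_text.strip()}"
    )


# Placeholder policy and content, for illustration only.
cope_prompt = build_cope_prompt(
    "Label: SPAM. Flag unsolicited commercial advertising.",
    "Buy cheap watches now!",
)
messages = [{"role": "user", "content": cope_prompt}]
print(cope_prompt)
```

The resulting `messages` list is what gets handed to `apply_chat_template`.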

Important: Creating high-quality labeling criteria is the key to unlocking superior performance, so we've created the Zentropi system to enable rapid generation, testing, and tuning of policies that are optimized for CoPE interpretability. It is free for anyone to get started.

Output Format

CoPE-B-A4B provides binary classification outputs as a single token:

0: None of the policy labels apply
1: One or more policy labels apply
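Because the answer is a single "0" or "1" token, downstream code can map it to a boolean and reject anything else. A minimal sketch, assuming the generated text has already been decoded to a string; `parse_cope_label` is a hypothetical helper name:

```python
def parse_cope_label(generated_text: str) -> bool:
    """Map the model's answer to True (a label applies) or False (none apply)."""
    answer = generated_text.strip()
    if answer == "1":
        return True
    if answer == "0":
        return False
    # Anything else means the prompt or decoding is misconfigured.
    raise ValueError(f"unexpected CoPE output: {generated_text!r}")


print(parse_cope_label("1"))   # True
print(parse_cope_label("0\n"))  # False
```

Raising on unexpected output, rather than defaulting to 0, keeps template or decoding mistakes from silently passing content through.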

System Requirements

Deployable on a single 80GB GPU (A100, H100, or comparable). bf16 weights ~52 GB.
Inference latency comparable to a 4B-parameter dense model at batch=1, due to MoE's sparse activation
Compatible with vLLM ≥ 0.20.2 for production serving

Training Details

For the full training recipe (hyperparameters, contradictory-policy dataset construction, ablation studies), see our paper: "CoPE: A Small Language Model for Steerable and Scalable Content Labeling". A condensed methodology overview is also available in our research talk.

Training Methodology

CoPE-B-A4B inherits and refines the policy-interpretation training methodology pioneered with CoPE-A-9B:

Contradictory example training: identical content samples with systematically contradictory labels across policy variants, forcing the model to learn policy interpretation rather than pattern memorization
Policy-shape diversity: training corpus spans permissive, moderate, and stringent policy variants per topical area
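To make the idea of contradictory examples concrete, here is an illustrative sketch of how such training rows could be represented and checked. The `Example` structure, the helper name, and the sample rows are all hypothetical and are not drawn from the actual training corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    content: str
    policy: str
    label: int  # 1 = some policy label applies, 0 = none apply


def contradictory_contents(examples):
    """Return contents that carry both labels across the policies they are paired with."""
    labels_by_content = defaultdict(set)
    for ex in examples:
        labels_by_content[ex.content].add(ex.label)
    return {c for c, labels in labels_by_content.items() if labels == {0, 1}}


# Invented rows: the same content flips label when the policy changes.
examples = [
    Example("I could use a beer after today.", "Flag any mention of alcohol.", 1),
    Example("I could use a beer after today.", "Flag only offers to sell alcohol.", 0),
    Example("Nice weather today.", "Flag any mention of alcohol.", 0),
]
print(contradictory_contents(examples))
```

A model that memorized content patterns could not label both of the first two rows correctly; only reading the policy distinguishes them.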

Training Data

Policy texts authored by the CoPE team across multiple topic areas
Content data sourced from publicly-accessible internet forums
Labels produced via a 4-pass LLM-assisted relabeling pipeline

Data Integrity

The training corpus and the evaluation test sets are disjoint splits of Zentropi's internal dataset. The held-out test split shares no content_text or policy_text samples with the training split. Because the test policies are novel, the evaluation measures generalization to unseen policy text rather than policy memorization.
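A disjointness property like this can be verified with a simple set intersection. The sketch below assumes each split is a list of dictionaries keyed by `content_text` and `policy_text`; the row layout, the helper name `overlap`, and the sample rows are assumptions for illustration.

```python
def overlap(train_rows, test_rows, field):
    """Return the values of `field` that appear in both splits."""
    train_values = {row[field] for row in train_rows}
    return {row[field] for row in test_rows} & train_values


# Invented rows, for illustration only.
train = [
    {"content_text": "post A", "policy_text": "policy 1"},
    {"content_text": "post B", "policy_text": "policy 2"},
]
test = [
    {"content_text": "post C", "policy_text": "policy 3"},
]

for field in ("content_text", "policy_text"):
    shared = overlap(train, test, field)
    print(field, "shared samples:", len(shared))
```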

Performance Evaluation

Methodology

CoPE-B-A4B was evaluated on a held-out test set of (content, policy) pairs with relabeled ground-truth labels, against a broad slate of frontier proprietary models, open-weight reasoning models, and fixed-taxonomy safety classifiers. All numbers below are on the relabeled test set. Tables are sorted by F1 descending; CoPE models are in bold.
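For readers reproducing this kind of evaluation on their own data, precision, recall, and F1 for binary labels can be computed as in the sketch below. It uses the standard definitions; the function name and the toy label lists are illustrative and are not the evaluation harness used for the tables.

```python
def precision_recall_f1(y_true, y_pred):
    """Standard binary precision, recall, and F1 (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```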

Benchmark Results

Overview: Average Across Topics

Unweighted mean of the per-category precision, recall, and F1 scores reported below; every category carries equal weight. Per-category results follow.

Model	Precision	Recall	F1 Score	Self-hostable	Single-pass*	Multimodal
CoPE-B-A4B-MM	0.83	0.84	0.82	✓	✓	✓
CoPE-B-A4B	0.74	0.90	0.81	✓	✓
CoPE-A-9B	0.74	0.88	0.80	✓	✓
GPT-5.4 (default reasoning)	0.68	0.95	0.78			✓
Gemini-3.5-Flash	0.69	0.91	0.78		✓	✓
Gemma-4-26B-A4B-it	0.67	0.90	0.76	✓	✓	✓
Claude-Opus-4.6	0.65	0.95	0.75		✓	✓
Gemini-3.1-Flash-Lite	0.69	0.86	0.75		✓	✓
gpt-oss-120b (default reasoning)	0.68	0.88	0.75	✓
gpt-oss-safeguard-20b (default reasoning)	0.70	0.82	0.75	✓
gpt-oss-120b (low reasoning)	0.66	0.86	0.73	✓
gpt-oss-20b (default reasoning)	0.65	0.88	0.72	✓
gpt-oss-20b (low reasoning)	0.63	0.89	0.72	✓
Claude-Sonnet-4.6	0.61	0.89	0.71		✓	✓
GPT-5-mini (default reasoning)	0.56	0.97	0.69			✓
Claude-Haiku-4.5	0.56	0.68	0.60		✓	✓
ShieldGemma-9B	0.54	0.75	0.58	✓	✓
LlamaGuard4-12B	0.50	0.66	0.52	✓	✓	✓

* Single-pass means the model produces its classification in one forward pass, with no internal reasoning chain — enabling lower latency and cost than reasoning-based models that may emit thousands of intermediate tokens per decision.
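As a check on the averaging, the sketch below recomputes the CoPE-B-A4B overview row from its seven per-category rows in the tables that follow. The numbers are copied from this card; the variable names are illustrative.

```python
# (precision, recall, F1) for CoPE-B-A4B, copied from the per-category tables.
cope_b_a4b = {
    "drugs":       (0.66, 0.90, 0.76),
    "harassment":  (0.57, 0.96, 0.72),
    "hate_speech": (0.86, 0.88, 0.87),
    "self_harm":   (0.91, 0.94, 0.93),
    "sexual":      (0.69, 0.98, 0.81),
    "toxic":       (0.76, 0.85, 0.80),
    "violence":    (0.70, 0.79, 0.75),
}

# Unweighted mean per metric: every category counts once.
rows = list(cope_b_a4b.values())
macro = tuple(round(sum(row[i] for row in rows) / len(rows), 2) for i in range(3))
print(macro)  # (0.74, 0.9, 0.81)
```

Note that the averaged F1 is the mean of the per-category F1 scores, not the F1 of the averaged precision and recall.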

Drugs Classification

Model	Precision	Recall	F1 Score
Claude-Opus-4.6	0.78	0.97	0.87
CoPE-B-A4B-MM	0.75	0.90	0.82
Gemini-3.5-Flash	0.70	1.0	0.82
Gemma-4-26B-A4B-it	0.69	0.97	0.81
Claude-Sonnet-4.6	0.65	0.93	0.77
CoPE-B-A4B	0.66	0.90	0.76
GPT-5.4 (default reasoning)	0.61	1.0	0.76
gpt-oss-safeguard-20b (default reasoning)	0.68	0.83	0.75
gpt-oss-120b (default reasoning)	0.59	1.0	0.74
Gemini-3.1-Flash-Lite	0.57	1.0	0.72
gpt-oss-20b (default reasoning)	0.56	0.97	0.71
gpt-oss-120b (low reasoning)	0.53	1.0	0.69
CoPE-A-9B	0.57	0.83	0.68
GPT-5-mini (default reasoning)	0.49	1.0	0.66
gpt-oss-20b (low reasoning)	0.50	0.97	0.66
ShieldGemma-9B	0.42	1.0	0.59
LlamaGuard4-12B	0.39	0.90	0.55
Claude-Haiku-4.5	0.68	0.43	0.53

Harassment Classification

Model	Precision	Recall	F1 Score
Gemini-3.5-Flash	0.63	0.91	0.75
GPT-5.4 (default reasoning)	0.60	0.95	0.74
gpt-oss-120b (low reasoning)	0.63	0.87	0.73
CoPE-B-A4B	0.57	0.96	0.72
CoPE-B-A4B-MM	0.58	0.93	0.72
CoPE-A-9B	0.60	0.88	0.71
Gemini-3.1-Flash-Lite	0.58	0.91	0.71
gpt-oss-120b (default reasoning)	0.58	0.85	0.69
gpt-oss-20b (default reasoning)	0.56	0.90	0.69
gpt-oss-20b (low reasoning)	0.56	0.89	0.69
gpt-oss-safeguard-20b (default reasoning)	0.59	0.79	0.68
Gemma-4-26B-A4B-it	0.49	0.93	0.65
Claude-Opus-4.6	0.44	0.98	0.61
GPT-5-mini (default reasoning)	0.45	0.95	0.61
Claude-Sonnet-4.6	0.39	0.94	0.56
Claude-Haiku-4.5	0.44	0.60	0.51
ShieldGemma-9B	0.32	0.60	0.42
LlamaGuard4-12B	0.25	0.44	0.32

Hate Speech Classification

Model	Precision	Recall	F1 Score
GPT-5.4 (default reasoning)	0.88	0.93	0.91
Gemini-3.1-Flash-Lite	0.92	0.84	0.88
Claude-Opus-4.6	0.78	0.98	0.87
CoPE-B-A4B	0.86	0.88	0.87
CoPE-B-A4B-MM	0.93	0.82	0.87
Gemma-4-26B-A4B-it	0.89	0.84	0.86
gpt-oss-120b (default reasoning)	0.80	0.94	0.86
gpt-oss-safeguard-20b (default reasoning)	0.81	0.87	0.84
Gemini-3.5-Flash	0.77	0.90	0.83
gpt-oss-120b (low reasoning)	0.74	0.94	0.83
gpt-oss-20b (default reasoning)	0.74	0.91	0.82
CoPE-A-9B	0.71	0.94	0.81
GPT-5-mini (default reasoning)	0.68	0.99	0.80
gpt-oss-20b (low reasoning)	0.67	0.93	0.78
Claude-Sonnet-4.6	0.66	0.92	0.77
ShieldGemma-9B	0.56	0.98	0.71
LlamaGuard4-12B	0.56	0.87	0.68
Claude-Haiku-4.5	0.54	0.73	0.62

Self-Harm Content Classification

Model	Precision	Recall	F1 Score
CoPE-B-A4B-MM	0.95	0.92	0.94
GPT-5-mini (default reasoning)	0.91	0.98	0.94
CoPE-B-A4B	0.91	0.94	0.93
GPT-5.4 (default reasoning)	0.95	0.90	0.93
Claude-Sonnet-4.6	0.88	0.97	0.92
Gemini-3.5-Flash	0.86	0.98	0.92
Claude-Opus-4.6	0.85	0.98	0.91
CoPE-A-9B	0.93	0.89	0.91
gpt-oss-120b (default reasoning)	0.95	0.87	0.91
gpt-oss-20b (default reasoning)	0.94	0.88	0.91
Claude-Haiku-4.5	0.88	0.90	0.89
Gemini-3.1-Flash-Lite	0.83	0.96	0.89
gpt-oss-safeguard-20b (default reasoning)	0.93	0.86	0.89
gpt-oss-120b (low reasoning)	0.96	0.79	0.87
gpt-oss-20b (low reasoning)	0.95	0.80	0.87
Gemma-4-26B-A4B-it	0.73	0.97	0.84
ShieldGemma-9B	0.72	0.89	0.80
LlamaGuard4-12B	0.80	0.71	0.75

Sexual Content Classification

Model	Precision	Recall	F1 Score
CoPE-A-9B	0.98	0.93	0.95
CoPE-B-A4B-MM	0.86	0.98	0.92
gpt-oss-120b (default reasoning)	0.94	0.89	0.92
Gemini-3.5-Flash	0.88	0.93	0.90
gpt-oss-safeguard-20b (default reasoning)	0.91	0.89	0.90
Claude-Opus-4.6	0.88	0.89	0.89
gpt-oss-120b (low reasoning)	0.88	0.91	0.89
gpt-oss-20b (low reasoning)	0.94	0.84	0.89
GPT-5.4 (default reasoning)	0.82	0.95	0.88
gpt-oss-20b (default reasoning)	0.94	0.82	0.88
Gemma-4-26B-A4B-it	0.81	0.93	0.87
Gemini-3.1-Flash-Lite	0.80	0.93	0.86
Claude-Sonnet-4.6	0.90	0.79	0.84
ShieldGemma-9B	0.91	0.77	0.83
Claude-Haiku-4.5	0.74	0.91	0.82
GPT-5-mini (default reasoning)	0.72	0.95	0.82
CoPE-B-A4B	0.69	0.98	0.81
LlamaGuard4-12B	0.83	0.36	0.50

Note: The regressions on sexual content classification for CoPE-B-A4B relative to CoPE-A-9B can be addressed by creating a policy that is well-matched to your golden dataset. Tools to do so are available at zentropi.ai.

Toxic Speech Classification

Model	Precision	Recall	F1 Score
CoPE-B-A4B	0.76	0.85	0.80
CoPE-A-9B	0.67	0.91	0.77
CoPE-B-A4B-MM	0.80	0.73	0.76
Gemini-3.5-Flash	0.56	0.94	0.70
Claude-Sonnet-4.6	0.53	0.94	0.67
Gemini-3.1-Flash-Lite	0.51	1.0	0.67
Gemma-4-26B-A4B-it	0.48	1.0	0.65
Claude-Opus-4.6	0.46	0.97	0.63
gpt-oss-safeguard-20b (default reasoning)	0.46	0.94	0.62
ShieldGemma-9B	0.43	0.97	0.60
gpt-oss-20b (low reasoning)	0.43	0.97	0.59
GPT-5.4 (default reasoning)	0.41	1.0	0.58
gpt-oss-120b (default reasoning)	0.41	1.0	0.58
gpt-oss-120b (low reasoning)	0.40	1.0	0.57
gpt-oss-20b (default reasoning)	0.40	0.97	0.57
Claude-Haiku-4.5	0.43	0.73	0.54
GPT-5-mini (default reasoning)	0.32	1.0	0.49
LlamaGuard4-12B	0.37	0.52	0.43

For background on the unique nature of the toxicity policy we tested, see this blog post.

Violence Classification

Model	Precision	Recall	F1 Score
CoPE-A-9B	0.72	0.79	0.76
CoPE-B-A4B	0.70	0.79	0.75
CoPE-B-A4B-MM	0.96	0.56	0.71
GPT-5.4 (default reasoning)	0.52	0.90	0.66
Gemma-4-26B-A4B-it	0.57	0.69	0.63
Gemini-3.5-Flash	0.45	0.69	0.55
gpt-oss-120b (default reasoning)	0.48	0.62	0.54
gpt-oss-safeguard-20b (default reasoning)	0.51	0.56	0.54
gpt-oss-20b (low reasoning)	0.39	0.82	0.53
Claude-Opus-4.6	0.36	0.87	0.51
gpt-oss-120b (low reasoning)	0.48	0.54	0.51
gpt-oss-20b (default reasoning)	0.40	0.69	0.51
GPT-5-mini (default reasoning)	0.33	0.95	0.49
Gemini-3.1-Flash-Lite	0.62	0.41	0.49
Claude-Sonnet-4.6	0.29	0.77	0.42
LlamaGuard4-12B	0.27	0.79	0.40
Claude-Haiku-4.5	0.24	0.44	0.31
ShieldGemma-9B	0.40	0.051	0.091

Performance Analysis

In short, CoPE-B-A4B delivers policy-steerable classification accuracy that matches or exceeds frontier proprietary models — while being a fraction of their size, far faster and cheaper to run, deployable locally for greater security, and with open weights for further fine-tuning.

Specifically, CoPE-B-A4B delivers an unweighted-average F1 of 0.81 — ahead of GPT-5.4 at 0.78 (using default reasoning). The multimodal sibling CoPE-B-A4B-MM leads the field at 0.82.

The fixed-taxonomy safety classifiers (LlamaGuard4-12B, ShieldGemma-9B) trail CoPE-B-A4B by 0.23 or more absolute on overall F1 (0.52 and 0.58 versus 0.81). This is consistent with their built-in-taxonomy design: when asked to evaluate against a user-supplied policy, they tend to over-fire or miss off-taxonomy criteria.

Beyond raw F1, CoPE-B-A4B's primary upgrade over CoPE-A-9B is in policy steerability — the model's ability to follow custom policy stances on the same content rather than apply a fixed harm taxonomy. A dedicated steerability benchmark with full methodology and head-to-head model comparison will be published separately.

Intended Applications

Primary Use Cases

Content Labeling
- Real-time content moderation
- Batch processing of content
- Policy-driven content classification at scale
LLM Guardrails
- Input prompt risk assessment
- Output answer risk assessment
- NB: Not yet optimized for agentic patterns
Content Scoring
- Feature generation for social feed ranking
- Language model training data filtering
- Content quality assessment & measurement

See also these case studies for how other organizations are using CoPE’s powerful classification capabilities to advance their work.

Discouraged Uses

While the Apache 2.0 license permits broad use, the following applications fall outside the intended scope of the model and may produce poor or unsafe results:

Surveillance applications
Use cases beyond the stated technical limitations (see below)
Zero-shot use without human review for high-stakes moderation decisions

License

CoPE-B-A4B is released under the Apache 2.0 license so that all AI-powered platforms can have access to tools that make the digital world more trustworthy. We continue to welcome collaboration with technical researchers working on methods to mitigate the dual-use risks of open trust and safety (T&S) technology.

Running the Model

Sample Policies

CoPE-B-A4B works well with Zentropi's seven public reference policies covering the harm areas the model was trained on. These are ready to use and serve as good starting points for custom policy authoring:

Important: The strength of the CoPE system is that it interprets your rules, so you are not stuck with anybody else's definitions, including ours. Treat the reference policies as examples and adapt them to your platform's specific needs. For custom policies, Zentropi provides a guided authoring workflow that optimizes policy structure for CoPE given your labeled 'golden' dataset.

via Hosted API

The easiest way to get started with this model is to use it through the Zentropi API, which has a very generous free tier. Just create an account and mint an API key.

via Direct Inference (Python)

To call the model directly via Transformers, see this runnable Colab notebook. It loads CoPE-B-A4B from the Hub in bf16, applies the proper prompt template, and shows a complete worked example end-to-end.
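For orientation, a rough sketch of what direct inference might look like follows; the Colab notebook remains the authoritative example. The sketch assumes the checkpoint loads through `AutoModelForCausalLM`, that `torch`, `transformers`, and `accelerate` are installed, and that a GPU with enough memory is available. The repo id is taken from the vLLM command in this card, and `load_cope` and `classify` are hypothetical helper names.

```python
MODEL_ID = "zentropi-ai/cope-b-a4b"  # repo id as used in the vLLM command in this card


def load_cope(model_id: str = MODEL_ID):
    """Load tokenizer and model in bf16 (assumes AutoModelForCausalLM supports the checkpoint)."""
    import torch  # imported lazily so this file can be read without the GPU stack
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    return tokenizer, model


def classify(tokenizer, model, cope_prompt: str) -> str:
    """Return the generated answer text, expected to be "0" or "1"."""
    import torch

    messages = [{"role": "user", "content": cope_prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        # One new token is enough: the answer is a single "0" or "1".
        output_ids = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Load the model once with `load_cope()` and reuse the returned pair across calls to `classify`.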

via Self-Hosting (vLLM)

As an open model, CoPE-B-A4B can also be self-hosted on your infrastructure under vLLM:

vllm serve zentropi-ai/cope-b-a4b \
  --dtype bfloat16 \
  --max-model-len 256000
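Once the server is running, it can be queried over vLLM's OpenAI-compatible chat endpoint, which applies the chat template server-side. The sketch below assumes the server is on vLLM's default address, `http://localhost:8000`, and uses only the standard library; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(cope_prompt: str, model: str = "zentropi-ai/cope-b-a4b") -> dict:
    """Build a chat-completions payload that asks for a single answer token."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": cope_prompt}],
        "max_tokens": 1,
        "temperature": 0,
    }


def classify_via_vllm(cope_prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to a running vLLM server and return the answer text."""
    body = json.dumps(build_chat_request(cope_prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"].strip()
```

Capping `max_tokens` at 1 matches the single-token output format and keeps latency low.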

Migrating from CoPE-A

If you're currently using CoPE-A-9B and moving to CoPE-B-A4B, there are three things to know:

1. CoPE-B uses the Gemma-4 chat template

CoPE-B's prompt must be passed through apply_chat_template as a user-turn message — the answer comes back as the assistant-turn output. If your CoPE-A code path raw-concatenates the prompt directly, that pattern will not work with CoPE-B. See the Input Format section above or the runnable Colab notebook for the exact pattern.

Note also that the CoPE-B prompt is leaner than CoPE-A's: there is no INSTRUCTIONS header or ANSWER footer to include — the chat template's role markers replace them.

2. Recalibrate confidence thresholds

CoPE-B is on average more confident than CoPE-A — it concentrates more probability mass on its answer token. If you use the output token probability (or logprob) as a confidence signal for downstream routing or thresholding, your CoPE-A thresholds will not transfer directly. Recalibrate against a labeled sample of your own traffic before relying on the old thresholds.
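One way to recalibrate is sketched below, under two assumptions that are not prescribed by this card: the confidence score is the probability of "1" renormalized over the two answer tokens, and the threshold is chosen to maximize F1 on a labeled sample. The function names and the sample scores are illustrative.

```python
import math


def p_positive(logprob_0: float, logprob_1: float) -> float:
    """Probability of "1" after renormalizing over the two answer tokens."""
    m = max(logprob_0, logprob_1)  # subtract the max for numerical stability
    e0, e1 = math.exp(logprob_0 - m), math.exp(logprob_1 - m)
    return e1 / (e0 + e1)


def best_threshold(scores, labels):
    """Pick the score cutoff that maximizes F1 on a labeled sample."""
    def f1_at(threshold):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(sorted(set(scores)), key=f1_at)


# Invented scores and labels standing in for a labeled sample of real traffic.
scores = [0.99, 0.95, 0.90, 0.60, 0.20]
labels = [1, 1, 0, 0, 0]
print(best_threshold(scores, labels))  # 0.95
```

If your routing cares more about precision or recall than F1, swap the objective inside `f1_at` accordingly.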

3. Re-optimize policies for CoPE-B

Policies that were optimized for CoPE-A may not be optimal for CoPE-B. CoPE-B's improved policy interpretation can extract more nuanced criteria from a policy than CoPE-A could, which sometimes changes the optimal phrasing. We recommend running existing CoPE-A policies through the Zentropi platform, which has CoPE-B-aware policy authoring tools, to refresh them against a labeled golden dataset.

Limitations and Constraints

Current Limitations

Scope: Binary classification only (i.e., presence/absence of matching labels). Policies that expect several distinct labels to be assigned in one call will not work, because the output only says whether any label matched.
Text Processing: Limited to 256K tokens (combined policy and content) — a 32x increase over CoPE-A-9B's 8K limit
No Image Support: CoPE-B-A4B is text-only by construction. For image classification, use the multimodal variant zentropi-ai/cope-b-a4b-mm.
Language Support: Currently optimized for US English policy interpretation. Performance may degrade for other languages and locales.
Knowledge Constraints: Cannot make classifications requiring external verification (e.g., misinformation) unless explicitly defined in the provided context

Ethical Considerations

Bias and Fairness

While comprehensive bias evaluation is still ongoing, users should:

Implement careful policy design to mitigate potential biases
Monitor classification patterns across different demographic groups
Contribute problematic examples to our bias assessment efforts
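As one possible starting point for the monitoring suggested above, the sketch below compares the share of flagged content across groups. How group membership is obtained is outside the scope of this card; the `(group, label)` record layout, the function name, and the sample records are assumptions.

```python
from collections import defaultdict


def flag_rate_by_group(records):
    """Return the fraction of items labeled 1 for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for group, label in records:
        counts[group][0] += label
        counts[group][1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}


# Invented records: (group, CoPE label).
records = [("group_a", 1), ("group_a", 0), ("group_b", 1), ("group_b", 1)]
print(flag_rate_by_group(records))
```

A large gap between groups is a prompt for human review of the policy and the flagged samples, not proof of bias on its own.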

Safety Measures

The model's binary classification nature inherently limits certain risks, but users should:

Maintain appropriate human oversight
Regularly audit classification decisions
Implement robust observability systems

Maintenance and Updates

Update Schedule

Annual releases planned
Regular performance improvements
Community-driven feature enhancements

Future Roadmap Focus

Performance optimizations (e.g., quantized variants)
Greater content diversity
Multilingual and locale support

Community and Support

For any technical questions or comments, please join our HuggingFace community forum or the Roost model community. You can share your feedback, suggest new areas, or pick our brains about anything. If you'd prefer a more private discussion, you can also email us at info@zentropi.ai.

About the Developer

CoPE-B-A4B is developed and maintained by Zentropi, a public benefit company focused on making content classification simple and powerful. The project represents a collaborative effort between industry experts and researchers to advance the state of the art in content labeling technology.

Citation

If you use CoPE in your research, please cite our paper:

@article{cope2025,
  title   = {CoPE: A Small Language Model for Steerable and Scalable Content Labeling},
  author  = {Chakrabarti, Willner, et al.},
  journal = {arXiv preprint arXiv:2512.18027},
  year    = {2025},
  url     = {https://arxiv.org/abs/2512.18027}
}

Last Updated: May 27, 2026