RyanStudio committed on
Commit
0adb3cc
·
verified ·
1 Parent(s): 127a8d0

Update README.md

  1. README.md +189 -0
README.md CHANGED
@@ -1,3 +1,192 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - FacebookAI/roberta-base
7
+ pipeline_tag: text-classification
8
+ tags:
9
+ - content-moderation
10
+ - content-guard
11
+ - moderation
12
+ - guard
13
+ - safety
14
  ---
15
+ # Mezzo Content Guard Large
16
+
17
+ <a href="https://discord.gg/sBMqepFV6m"><img src="https://discord.com/api/guilds/1386414999932506197/embed.png" alt="Discord Link" height="20"></a>
18
+
19
+ **Mezzo Content Guard** is a series of RoBERTa-based, English-only Content Moderation Models trained on approximately 14M tokens (360k+ rows) of labelled examples.
20
+
21
+ **Mezzo Content Guard** comes in three sizes, based on RoBERTa Large, RoBERTa Base, and DistilRoBERTa Base:
22
+ - Large (355M Params)
23
+ - Base (125M Params)
24
+ - Small (82.8M Params)
25
+
26
+ Try out the demo at the [`Mezzo Content Guard Demo`](https://huggingface.co/spaces/RyanStudio/Mezzo-Content-Guard-Demo) Space.
27
+
28
+ ## Categories
29
+ **Mezzo Content Guard** covers five categories:
30
+
31
+ - **Hate Speech:** Content that attacks or uses discriminatory and pejorative language toward a person or group based on inherent characteristics such as race, religion, ethnicity, gender, sexual orientation, or disability.
32
+
33
+ - **Self Harm:** Content in which individuals express a desire to harm themselves, or that encourages acts leading to self-harm.
34
+
35
+ - **Sexual:** Content involving references to, descriptions of, or intent to engage in sexual acts.
36
+
37
+ - **Toxic:** Content that attacks, harasses, or discriminates against an individual.
38
+
39
+ - **Violence:** Content that depicts violence and gore, or content that incites acts of violence.
40
+
41
+ ## Benchmarks
42
+ All benchmarks use a decision threshold of 0.5. Raising the threshold trades recall for precision; lowering it trades precision for recall.
43
+
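The precision/recall trade-off mentioned above can be illustrated with a minimal sketch. The scores and labels below are toy values invented for illustration, not benchmark data, and `precision_recall` is a hypothetical helper rather than part of any Mezzo tooling.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when every score >= threshold is flagged."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy model scores for one category and the matching ground-truth labels.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [True, True, False, True, True, False, False]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, moving the threshold from 0.3 to 0.7 raises precision while recall falls, which is the same trade-off to weigh when picking a threshold for a real deployment.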
44
+ ### Sexual
45
+ | Model | Precision | Recall | F1 | ROC-AUC |
46
+ | --- | --- | --- | --- | --- |
47
+ | Mezzo Content Guard Large | 0.8396 | 0.8370 | **0.8383** | **0.9917** |
48
+ | Mezzo Content Guard Base | 0.8190 | 0.8227 | 0.8209 | 0.9895 |
49
+ | Mezzo Content Guard Small | 0.8376 | 0.7740 | 0.8045 | 0.9865 |
50
+ | KoalaAI/Text-Moderation | 0.1503 | **0.8423** | 0.2551 | 0.8770 |
51
+ | ifmain/ModerationBERT-En-02 | **0.8500** | 0.3591 | 0.5049 | 0.9373 |
52
+
53
+ ### Violence
54
+ | Model | Precision | Recall | F1 | ROC-AUC |
55
+ | --- | --- | --- | --- | --- |
56
+ | Mezzo Content Guard Large | 0.7050 | 0.7827 | **0.7418** | 0.9921 |
57
+ | Mezzo Content Guard Base | **0.7330** | 0.7460 | 0.7394 | **0.9924** |
58
+ | Mezzo Content Guard Small | 0.6772 | 0.7269 | 0.7011 | 0.9883 |
59
+ | KoalaAI/Text-Moderation | 0.0136 | **1.0000** | 0.0269 | 0.8737 |
60
+ | ifmain/ModerationBERT-En-02 | 0.5414 | 0.3554 | 0.4291 | 0.9461 |
61
+
62
+ ### Self-Harm
63
+ | Model | Precision | Recall | F1 | ROC-AUC |
64
+ | --- | --- | --- | --- | --- |
65
+ | Mezzo Content Guard Large | 0.8558 | 0.8711 | 0.8634 | **0.9888** |
66
+ | Mezzo Content Guard Base | 0.8524 | 0.8749 | **0.8635** | 0.9868 |
67
+ | Mezzo Content Guard Small | 0.8595 | 0.8401 | 0.8497 | 0.9853 |
68
+ | KoalaAI/Text-Moderation | 0.0923 | **0.8946** | 0.1673 | 0.9178 |
69
+ | ifmain/ModerationBERT-En-02 | **0.9174** | 0.4807 | 0.6309 | 0.9471 |
70
+
71
+ ### Hate Speech
72
+ | Model | Precision | Recall | F1 | ROC-AUC |
73
+ | --- | --- | --- | --- | --- |
74
+ | Mezzo Content Guard Large | 0.8268 | 0.8229 | **0.8248** | **0.9865** |
75
+ | Mezzo Content Guard Base | 0.7991 | 0.8398 | 0.8190 | 0.9855 |
76
+ | Mezzo Content Guard Small | 0.8043 | 0.8055 | 0.8049 | 0.9829 |
77
+ | KoalaAI/Text-Moderation | 0.1000 | **0.9967** | 0.1817 | 0.9172 |
78
+ | ifmain/ModerationBERT-En-02 | **0.9111** | 0.3436 | 0.4990 | 0.9506 |
79
+
80
+ ### Toxic
81
+ | Model | Precision | Recall | F1 | ROC-AUC |
82
+ | --- | --- | --- | --- | --- |
83
+ | Mezzo Content Guard Large | **0.7647** | 0.7459 | **0.7552** | **0.9778** |
84
+ | Mezzo Content Guard Base | 0.7456 | **0.7498** | 0.7477 | 0.9760 |
85
+ | Mezzo Content Guard Small | 0.7394 | 0.7162 | 0.7276 | 0.9720 |
86
+ | KoalaAI/Text-Moderation | 0.4884 | 0.6878 | 0.5712 | 0.9162 |
87
+ | ifmain/ModerationBERT-En-02 | 0.4781 | 0.6406 | 0.5475 | 0.9128 |
88
+
89
+ ### Macro Averages
90
+ | Model | Precision | Recall | F1 | ROC-AUC |
91
+ | --- | --- | --- | --- | --- |
92
+ | Mezzo Content Guard Large | **0.7984** | 0.8119 | **0.8047** | **0.9874** |
93
+ | Mezzo Content Guard Base | 0.7898 | 0.8066 | 0.7981 | 0.9860 |
94
+ | Mezzo Content Guard Small | 0.7836 | 0.7725 | 0.7776 | 0.9830 |
95
+ | KoalaAI/Text-Moderation | 0.1689 | **0.8843** | 0.2404 | 0.9004 |
96
+ | ifmain/ModerationBERT-En-02 | 0.7396 | 0.4359 | 0.5223 | 0.9388 |
97
+
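As a sanity check on how the tables relate, the sketch below recomputes two numbers for Mezzo Content Guard Large from the values reported above: the Sexual F1 from its precision and recall, and the macro F1 as the unweighted mean of the five per-category F1 scores. The helper names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean across categories."""
    values = list(values)
    return sum(values) / len(values)


# Per-category F1 for Mezzo Content Guard Large, copied from the tables above.
f1_large = {
    "sexual": 0.8383,
    "violence": 0.7418,
    "self-harm": 0.8634,
    "hate-speech": 0.8248,
    "toxic": 0.7552,
}

sexual_f1 = f1_score(0.8396, 0.8370)
macro_f1 = macro_average(f1_large.values())

print(round(sexual_f1, 4))
print(round(macro_f1, 4))
```

Both results match the table entries (0.8383 and 0.8047). The macro F1 is the mean of the per-category F1 scores, which is slightly different from the F1 computed from the macro precision and macro recall.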
98
+ # Quickstart
99
+ Introducing our new custom [`mezzo-guard`](https://github.com/RyanStudioo/MezzoGuard) library that supports the Mezzo Prompt Guard and Mezzo Content Guard models. It offers automatic chunking, organized policies, and redactions.
100
+
101
+ Installation:
102
+ ```bash
103
+ pip install mezzo-guard
104
+ ```
105
+
106
+ ```py
107
+ from mezzoguard import CONTENTGUARD
108
+ from mezzoguard.content_guard import ContentPolicy, Category, Guard
109
+
110
+ model = Guard(CONTENTGUARD.MEZZO_CONTENT_GUARD_BASE)
111
+ content_policy = ContentPolicy().add_threshold(Category.SEXUAL, 0.5)
112
+
113
+ sexual_query = "I want to fuck you"
114
+ benign_query = "I want to have a nice day"
115
+ violent_query = "I want to kill you"
116
+
117
+ result_1 = model.scan(text=sexual_query)
118
+ print(content_policy.evaluate(result_1))
119
+ # True
120
+
121
+ result_2 = model.scan(text=benign_query)
122
+ print(content_policy.evaluate(result_2))
123
+ # False
124
+
125
+ result_3 = model.scan(text=violent_query)
126
+ print(content_policy.evaluate(result_3))
127
+ # False (this policy only sets a threshold for Category.SEXUAL)
128
+ ```
129
+
130
+ With `transformers`:
131
+ ```py
132
+ from transformers import pipeline
133
+
134
+ model = pipeline("text-classification", model="RyanStudio/Mezzo-Content-Guard-Base")
135
+
136
+
137
+ safe_prompt = "I love mezzo content guard!!!"
138
+ print(model(safe_prompt))
139
+
140
+ hate_speech_prompt = "I hate faggots"
141
+ print(model(hate_speech_prompt))
142
+
143
+ self_harm_prompt = "I want to kill myself"
144
+ print(model(self_harm_prompt))
145
+
146
+ sexual_prompt = "I want to fuck someone"
147
+ print(model(sexual_prompt))
148
+
149
+ toxic_prompt = "You are a cunt"
150
+ print(model(toxic_prompt))
151
+
152
+ violence_prompt = "I want to kill someone"
153
+ print(model(violence_prompt))
154
+
155
+ violence_hate_speech_toxic = "I want to kill you because you're a gay faggot"
156
+ print(model(violence_hate_speech_toxic, top_k=None))
157
+
158
+ ```
159
+
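When the pipeline is called with `top_k=None`, it returns a score for every label, and those scores can then be turned into a moderation decision. The sketch below shows one way to do that with per-category thresholds. The label names and scores in `example_output` are assumptions made up for illustration (check the model's `id2label` config for the real names), and `flagged_labels` is a hypothetical helper, not part of `transformers` or `mezzo-guard`.

```python
def flagged_labels(predictions, thresholds=None, default_threshold=0.5):
    """Return the sorted labels whose score meets their threshold.

    predictions: list of {"label": str, "score": float} dicts, the shape
    returned by a text-classification pipeline called with top_k=None.
    thresholds: optional per-label overrides of default_threshold.
    """
    thresholds = thresholds or {}
    return sorted(
        p["label"]
        for p in predictions
        if p["score"] >= thresholds.get(p["label"], default_threshold)
    )


# Illustrative stand-in for `model(text, top_k=None)`; not real model output.
example_output = [
    {"label": "violence", "score": 0.97},
    {"label": "hate-speech", "score": 0.91},
    {"label": "toxic", "score": 0.62},
    {"label": "sexual", "score": 0.03},
    {"label": "self-harm", "score": 0.01},
]

print(flagged_labels(example_output))
print(flagged_labels(example_output, thresholds={"toxic": 0.8}))
```

Keeping the thresholds in a dictionary makes it easy to be stricter on one category than another without touching the model call.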
160
+ # Training
161
+ The training data combines various open-source datasets with synthetic examples generated by LLMs such as Deepseek v4 Pro, Claude Sonnet 4.6, and Kimi K2.6.
162
+
163
+ Because labelling and category definitions were inconsistent across these datasets, the data was relabelled using Qwen3Guard-4B and Qwen3.5-4B to fit the category definitions above.
164
+
165
+ The following table shows the data distribution:
166
+
167
+ | Label | Positives | % of Data |
168
+ |-------------|----------:|----------:|
169
+ | sexual | 18,233 | 4.95% |
170
+ | violence | 5,440 | 1.48% |
171
+ | self-harm | 7,826 | 2.13% |
172
+ | hate-speech | 31,597 | 8.59% |
173
+ | toxic | 33,088 | 8.99% |
174
+
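The table above can be cross-checked against the stated dataset size. The sketch below divides each positive count by its percentage to estimate the total number of rows each row of the table implies. The arithmetic uses only the numbers in the table; the variable names are illustrative.

```python
# (positives, percent of data) per label, copied from the table above.
distribution = {
    "sexual": (18_233, 4.95),
    "violence": (5_440, 1.48),
    "self-harm": (7_826, 2.13),
    "hate-speech": (31_597, 8.59),
    "toxic": (33_088, 8.99),
}

implied_totals = {
    label: positives / (percent / 100)
    for label, (positives, percent) in distribution.items()
}

for label, total in implied_totals.items():
    print(f"{label}: ~{total:,.0f} rows")
```

Every label implies a total of roughly 368k rows, which agrees with the "360k+ rows" figure given at the top of the README. It also shows how imbalanced the data is: each category is positive in fewer than 9% of rows.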
175
+ Mezzo Content Guard Large was trained first and then distilled into the Base and Small models. All models were trained with a maximum sequence length of 256 tokens, which filtered out less than 1% of the dataset.
176
+
177
+ In initial experiments without distillation, RoBERTa-base reached only a 71% macro F1 score. With distillation, it punches above its weight and reaches a 79% macro F1 score.
178
+
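The README does not describe how the distillation was done, so the following is only a generic sketch of one common approach for multi-label classifiers: the student is trained against a blend of the teacher's predicted probabilities (soft targets) and the ground-truth labels (hard targets). The function names, the `alpha` weighting, and the example values are all assumptions for illustration, not the author's training recipe.

```python
import math


def binary_cross_entropy(target, prob, eps=1e-7):
    """BCE between a target in [0, 1] and a predicted probability."""
    prob = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def distillation_loss(student_probs, teacher_probs, hard_labels, alpha=0.5):
    """Blend of soft-target loss (teacher) and hard-target loss (labels).

    alpha is a hypothetical weighting: 1.0 uses only the teacher,
    0.0 uses only the ground-truth labels.
    """
    n = len(student_probs)
    soft = sum(binary_cross_entropy(t, s) for t, s in zip(teacher_probs, student_probs)) / n
    hard = sum(binary_cross_entropy(y, s) for y, s in zip(hard_labels, student_probs)) / n
    return alpha * soft + (1 - alpha) * hard


# Illustrative per-category probabilities for a single example.
teacher = [0.9, 0.1]
labels = [1, 0]

close_student = distillation_loss([0.9, 0.1], teacher, labels)
far_student = distillation_loss([0.5, 0.5], teacher, labels)
print(close_student, far_student)
```

A student whose outputs track the teacher gets a lower loss than one that stays undecided, which is the signal that lets a smaller model learn from the larger one's judgments.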
179
+ The Preview model included a "Divisive" category targeting political and religious speech. It was dropped because it was deemed unnecessary and harmed the model's overall performance.
180
+
181
+ # Limitations
182
+ - **Re-labelling**: Because the training data was relabelled by Qwen3Guard and Qwen3.5, any inaccuracies in those models' judgments may be passed on to this model.
183
+ - **Context Length**: The models were trained with a maximum sequence length of 256 tokens. This is more than enough for most applications, but accuracy may degrade on longer inputs. RoBERTa itself accepts at most 512 tokens, so longer texts must be chunked.
184
+ - **Edge Cases**: Most of the open-source datasets used are dated and may not cover modern slang or more subtle bypasses. We recommend fine-tuning the model on your own use case.
185
+ - **English Only**: The underlying RoBERTa models are trained primarily on English text, so performance will suffer in multilingual contexts.
186
+
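The chunking requirement in the Context Length limitation can be sketched as follows. This is a minimal illustration of the idea, not the `mezzo-guard` library's implementation: it splits on whitespace as a stand-in for the real tokenizer, uses overlapping windows, and keeps the maximum score per category across chunks. `chunk_tokens`, `scan_long_text`, `toy_score_fn`, and the window sizes are all hypothetical.

```python
def chunk_tokens(tokens, max_len=256, stride=192):
    """Split tokens into windows of at most max_len, overlapping by max_len - stride."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def scan_long_text(text, score_fn, max_len=256, stride=192):
    """Score each chunk and keep the highest score seen per category."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    chunk_scores = [score_fn(" ".join(c)) for c in chunk_tokens(tokens, max_len, stride)]
    return {
        category: max(scores[category] for scores in chunk_scores)
        for category in chunk_scores[0]
    }


def toy_score_fn(chunk):
    """Stand-in for a model call; returns a fake per-category score dict."""
    return {"violence": 1.0 if "kill" in chunk else 0.0}


long_text = "hello " * 500 + "kill"
print(scan_long_text(long_text, toy_score_fn))
```

Taking the maximum across chunks means a single violating passage flags the whole text. The overlap reduces the chance that a phrase is split across a window boundary and missed.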
187
+ # Future Iterations
188
+ While suitable for most casual applications, this model can still be significantly improved.
189
+
190
+ Future Content Guard models may employ:
191
+ - newer BERT-style encoders such as ModernBERT or Ettin-Encoder, to support longer contexts and improve general performance
192
+ - improvements to the base dataset to account for slang and edge cases, reducing false positives and false negatives