# GreesyGuard (GreesyGPT)
GreesyGuard is a lightweight reasoning-based content moderation model designed to analyze user messages, evaluate harm potential, and produce structured moderation verdicts.
Unlike traditional classifiers, GreesyGuard performs step-by-step analysis inside `<think>` blocks before generating the final moderation decision.
This improves transparency and makes moderation decisions easier to audit.
## Model Overview
GreesyGuard is a Transformer model specialized for safety classification tasks such as:
- harassment detection
- hate speech detection
- spam detection
- misinformation identification
- crisis detection
Instead of directly outputting a label, the model:
- Analyzes the message
- Evaluates context and intent
- Identifies policy violations
- Outputs a final moderation verdict
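In a downstream pipeline, the reasoning and the verdict portions of a completion can be separated with a small parser. A minimal sketch, assuming the `<think>`-then-verdict layout shown on this card; `parse_moderation_output` is an illustrative helper, not part of the released API:

```python
import re

def parse_moderation_output(text: str) -> dict:
    """Split a completion into its reasoning and verdict parts.

    Hypothetical helper: assumes the model emits a <think>...</think>
    block followed by a markdown verdict section.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Everything after the closing </think> tag is the verdict section.
    verdict_section = text.split("</think>")[-1].strip()
    return {"reasoning": reasoning, "verdict": verdict_section}

output = "<think>The message insults the user directly.</think>\n## Verdict\n**HARASSMENT**"
parsed = parse_moderation_output(output)
```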
## Moderation Labels
The model produces the following moderation categories:
- `SAFE`
- `SPAM`
- `MISINFORMATION`
- `HARASSMENT`
- `HATE_SPEECH`
- `CRISIS_REFERRAL`
- `UNSAFE`
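For type-safe handling downstream, the categories can be mirrored as an enum. A sketch; this `ModerationLabel` enum is illustrative and is not shipped with the model:

```python
from enum import Enum

class ModerationLabel(str, Enum):
    """The moderation categories listed on this card."""
    SAFE = "SAFE"
    SPAM = "SPAM"
    MISINFORMATION = "MISINFORMATION"
    HARASSMENT = "HARASSMENT"
    HATE_SPEECH = "HATE_SPEECH"
    CRISIS_REFERRAL = "CRISIS_REFERRAL"
    UNSAFE = "UNSAFE"

# Parsing a raw verdict string back into the enum.
label = ModerationLabel("HARASSMENT")
```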
Example output:

```
## Verdict
**HARASSMENT**
```
## Model Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Heads | 12 |
| Embedding Dimension | 768 |
| Context Window | 12,000 tokens |
| Tokenizer | o200k_base (extended) |
| Vocabulary Size | 8192 |
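The table above translates into a small configuration object. Illustrative only; the field names are assumptions, not the model's actual config class:

```python
from dataclasses import dataclass

@dataclass
class GreesyGuardConfig:
    """Hypothetical config mirroring the architecture table."""
    n_layers: int = 12
    n_heads: int = 12
    d_model: int = 768
    context_window: int = 12_000
    vocab_size: int = 8192
    tokenizer: str = "o200k_base (extended)"

cfg = GreesyGuardConfig()
# Per-head dimension follows from d_model / n_heads.
head_dim = cfg.d_model // cfg.n_heads
```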
Key architectural features:
- Transformer decoder architecture
- Rotary Positional Embeddings (RoPE)
- KV‑Cache optimized inference
- Structured chat‑template training
- Markdown reasoning output
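RoPE encodes position by rotating pairs of query/key dimensions through position-dependent angles, so relative offsets are preserved under attention. A minimal NumPy sketch of the general technique, not GreesyGuard's exact implementation:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary positional embeddings to x of shape (seq_len, head_dim).

    Sketch of the rotate-half RoPE variant; head_dim must be even.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)        # (half,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.ones((4, 8))
rotated = rope(x, np.arange(4))
```

Position 0 is rotated by a zero angle, and rotation preserves vector norms, which is why RoPE leaves attention magnitudes intact.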
## Reasoning Modes
The model supports configurable reasoning budgets:
| Mode | Think Tokens | Purpose |
|---|---|---|
| NONE | 200 | Fast moderation |
| LOW | 512 | Balanced reasoning |
| MEDIUM | 1536 | Detailed analysis |
| HIGH | 3072 | Maximum review depth |
Higher modes produce more thorough moderation reasoning but increase latency.
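The mode-to-budget mapping can be expressed as an enum whose values are the token budgets from the table above. A sketch; the released `ReasoningMode` may be defined differently:

```python
from enum import Enum

class ReasoningMode(Enum):
    """Assumed mapping from reasoning mode to <think> token budget."""
    NONE = 200
    LOW = 512
    MEDIUM = 1536
    HIGH = 3072

def think_budget(mode: ReasoningMode) -> int:
    # The enum value doubles as the maximum number of think tokens.
    return mode.value

budget = think_budget(ReasoningMode.MEDIUM)
```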
## Example Usage
```python
from model import GreesyGPT, generate_moderation, ReasoningMode, OutputFormat

model = GreesyGPT()

result = generate_moderation(
    model,
    prompt="You're worthless and nobody likes you.",
    mode=ReasoningMode.MEDIUM,
    output_format=OutputFormat.JSON,
)

print(result["verdict_fmt"])
```
Example structured output:
```json
{
  "verdict": "HARASSMENT",
  "severity": 3,
  "confidence_hint": "medium"
}
```
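A consumer might route structured verdicts before acting on them. A minimal sketch; the thresholds and the `route_verdict` helper are assumptions, not part of the model's API:

```python
def route_verdict(result: dict, human_review_severity: int = 3) -> str:
    """Decide what to do with a structured moderation result.

    Hypothetical policy: SAFE passes through, high-severity or
    low-confidence results go to a human, the rest are auto-flagged.
    """
    if result["verdict"] == "SAFE":
        return "allow"
    if result["severity"] >= human_review_severity or result["confidence_hint"] == "low":
        return "human_review"
    return "auto_flag"

action = route_verdict({"verdict": "HARASSMENT", "severity": 3, "confidence_hint": "medium"})
```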
## Training Format
Training data follows a structured conversation template:
```
<|system|>
moderation instructions
</|system|>
<|user|>
message to review
</|user|>
<|assistant|>
<think>
step-by-step reasoning
</think>
verdict<|endoftext|>
```
Only assistant tokens contribute to the training loss.
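Assistant-only loss masking is commonly implemented by replacing non-assistant label positions with an ignore index that the loss function skips. An illustrative sketch with toy token ids, not the project's actual training code:

```python
# Convention used by common training frameworks: labels equal to this
# value are excluded from the cross-entropy loss.
IGNORE_INDEX = -100

def mask_labels(token_ids, assistant_mask):
    """Copy token ids to labels, ignoring every non-assistant position."""
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(token_ids, assistant_mask)]

token_ids      = [11, 12, 13, 14, 15, 16]
assistant_mask = [False, False, False, True, True, True]  # only the assistant turn
labels = mask_labels(token_ids, assistant_mask)
```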
## Intended Use
GreesyGuard is designed for:
- social media moderation
- comment filtering
- forum safety pipelines
- research in explainable moderation systems
## Limitations
- The reasoning output may appear confident but still be incorrect.
- Sarcasm and cultural context can be misinterpreted.
- The model should not be used for fully automated enforcement without human oversight.
## Safety
Moderation systems should always include human review for high‑impact actions such as account suspension or legal escalation.
## Authors
Created by the GreesyGuard Project
Author: Nicat