SPECS.md: AI Community Moderation & Escalation Environment
Overview
This project defines an OpenEnv-compliant environment that simulates a real-world content moderation and policy enforcement system.
The agent operates as a moderation decision-maker responsible for:
- identifying policy violations
- investigating context
- enforcing appropriate actions
- handling ambiguity under platform rules
Unlike static classification tasks, this environment models moderation as a multi-step decision process with deterministic grading derived from a formal policy document.
Problem Definition
Given:
- a user-generated post
- associated context (user history, reports, geo, thread)
- a predefined policy document
The agent must:
- Investigate the content
- Identify violation type (if any)
- Decide and execute an appropriate moderation action
Core Design Principles
1. Deterministic Evaluation
All decisions are evaluated against:
- a synthetic policy document
- predefined violation taxonomy
- rule-based ground truth
No subjective scoring is allowed.
2. Multi-Step Decision Process
observe → investigate → classify → decide → enforce
Each step contributes to the final reward.
3. Partial Observability
The agent does not initially receive full information and must:
- request additional context
- inspect user behavior
- interpret signals over time
4. Real-World Alignment
The environment reflects real moderation challenges:
- misinformation
- harassment
- context-dependent violations
- geo-specific policy differences
Environment Components
State (Observation)
Each state includes:
- post_content: text of the content
- user_history: last N posts + violation count
- reports: number and type of reports
- engagement: likes, shares, virality score
- geo: region identifier (e.g., IN, EU, US)
- thread_context: previous messages (optional)
- policy_context: relevant policy snippets (optional)
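The fields above could be captured in a typed model along these lines. This is a minimal sketch using standard-library dataclasses; the class names, the shape of `reports` and `engagement` as dictionaries, and the choice to leave `user_history` unset until the agent fetches it are all assumptions made for illustration, not part of the spec.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UserHistory:
    """Hypothetical container for the user_history field."""
    recent_posts: List[str] = field(default_factory=list)  # last N posts
    violation_count: int = 0


@dataclass
class Observation:
    """Hypothetical typed observation mirroring the state fields."""
    post_content: str
    geo: str  # region identifier, e.g. "IN", "EU", "US"
    reports: Dict[str, int] = field(default_factory=dict)        # report type -> count
    engagement: Dict[str, float] = field(default_factory=dict)   # likes, shares, virality score
    # Assumption: these stay None until an investigation action reveals them.
    user_history: Optional[UserHistory] = None
    thread_context: Optional[List[str]] = None
    policy_context: Optional[List[str]] = None


obs = Observation(
    post_content="You are worthless",
    geo="US",
    reports={"harassment": 3},
    engagement={"likes": 12, "shares": 2, "virality_score": 0.1},
)
```

Keeping the optional fields as `None` rather than empty lists lets a grader distinguish "not yet fetched" from "fetched and empty".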
Action Space
Investigation Actions
- fetch_user_history
- fetch_thread_context
- check_policy_clause
Classification Actions
- mark_violation_type (harassment, misinformation, safe, etc.)
Decision Actions
- allow
- flag
- remove
- escalate
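One way to encode this action space is an enum plus a small action record. The sketch below is illustrative only: `ActionType`, `Action`, and the `ends_episode` property are hypothetical names, and treating every decision action (including escalate) as episode-ending is an assumption based on the episode flow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    # Investigation actions
    FETCH_USER_HISTORY = "fetch_user_history"
    FETCH_THREAD_CONTEXT = "fetch_thread_context"
    CHECK_POLICY_CLAUSE = "check_policy_clause"
    # Classification action
    MARK_VIOLATION_TYPE = "mark_violation_type"
    # Decision actions
    ALLOW = "allow"
    FLAG = "flag"
    REMOVE = "remove"
    ESCALATE = "escalate"


# Assumption: any decision action is final and ends the episode.
DECISION_ACTIONS = frozenset(
    {ActionType.ALLOW, ActionType.FLAG, ActionType.REMOVE, ActionType.ESCALATE}
)


@dataclass(frozen=True)
class Action:
    type: ActionType
    violation_type: Optional[str] = None  # used only by mark_violation_type

    def __post_init__(self) -> None:
        is_mark = self.type is ActionType.MARK_VIOLATION_TYPE
        has_label = self.violation_type is not None
        if is_mark != has_label:
            raise ValueError("violation_type must be set only for mark_violation_type")

    @property
    def ends_episode(self) -> bool:
        return self.type in DECISION_ACTIONS
```

For example, `Action(ActionType.MARK_VIOLATION_TYPE, "harassment")` is valid, while `Action(ActionType.MARK_VIOLATION_TYPE)` raises `ValueError`.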
Episode Flow
reset() initializes a moderation scenario
Agent receives initial state
Agent performs a sequence of actions
Environment updates state accordingly
Rewards are assigned incrementally
Episode ends when:
- final decision is made, or
- max steps reached
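The flow above can be sketched as a plain interaction loop. `StubEnv` below is a deliberately tiny stand-in, not the real environment: its step budget, its fixed post, and its reward of 1.0 for "remove" are invented so that the loop and both termination conditions can be run end to end.

```python
from typing import Callable, Dict, List, Tuple

DECISIONS = {"allow", "flag", "remove", "escalate"}


class StubEnv:
    """Hypothetical stand-in environment used only to exercise the loop."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> Dict[str, str]:
        self.steps = 0
        return {"post_content": "I will hurt you"}

    def step(self, action: str) -> Tuple[Dict[str, str], float, bool, dict]:
        self.steps += 1
        reward = 1.0 if action == "remove" else 0.0  # invented reward rule
        # Episode ends on a final decision or when the step budget runs out.
        done = action in DECISIONS or self.steps >= self.max_steps
        return {"post_content": "I will hurt you"}, reward, done, {"steps": self.steps}


def run_episode(env: StubEnv, policy: Callable[[dict], str]) -> Tuple[float, List[str]]:
    """Run one episode and return (total reward, actions taken)."""
    obs = env.reset()
    total, done, actions = 0.0, False, []
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total += reward
        actions.append(action)
    return total, actions


def scripted_policy() -> Callable[[dict], str]:
    """Fixed plan: investigate, classify, then decide."""
    plan = iter(["fetch_user_history", "mark_violation_type", "remove"])
    return lambda _obs: next(plan)


total, actions = run_episode(StubEnv(), scripted_policy())
```

A policy that never issues a decision action is cut off by the step budget, which is the second termination condition in the list.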
Reward Design
Step-level Rewards
- correct investigation step → +0.2
- correct violation classification → +0.5
- unnecessary action → -0.1
Final Decision Rewards
- correct action → +1.0
- acceptable but suboptimal → +0.5
- false positive → -0.5
- false negative → -2.0
Escalation Handling
- correct escalation → +0.5
- unnecessary escalation → -0.2
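The reward values above can be collected into lookup tables. How an outcome is classified is not specified here, so the rules in `final_reward` below are one possible interpretation: acting on content whose expected action is "allow" counts as a false positive, allowing a violation counts as a false negative, and any other mismatch between enforcement actions counts as "acceptable but suboptimal". The function and event names are hypothetical.

```python
from typing import Iterable

STEP_REWARDS = {
    "correct_investigation": 0.2,
    "correct_classification": 0.5,
    "unnecessary_action": -0.1,
}
FINAL_REWARDS = {
    "correct": 1.0,
    "suboptimal": 0.5,
    "false_positive": -0.5,
    "false_negative": -2.0,
}
ESCALATION_REWARDS = {"correct": 0.5, "unnecessary": -0.2}


def final_reward(action: str, expected: str) -> float:
    """Score the final decision (classification rules are assumptions)."""
    if action == expected:
        if action == "escalate":
            return ESCALATION_REWARDS["correct"]
        return FINAL_REWARDS["correct"]
    if action == "escalate":
        return ESCALATION_REWARDS["unnecessary"]
    if expected == "allow":
        return FINAL_REWARDS["false_positive"]   # enforced against safe content
    if action == "allow":
        return FINAL_REWARDS["false_negative"]   # let a violation through
    return FINAL_REWARDS["suboptimal"]           # enforced, but not as expected


def episode_return(step_events: Iterable[str], action: str, expected: str) -> float:
    """Sum step-level rewards and add the final decision reward."""
    return sum(STEP_REWARDS[e] for e in step_events) + final_reward(action, expected)
```

Under these assumptions, one correct investigation step, a correct classification, and a correct removal add up to 0.2 + 0.5 + 1.0 = 1.7, while wrongly allowing the same post after the same steps yields 0.2 + 0.5 - 2.0 = -1.3.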
Tasks
Task 1: Clear Violations (Easy)
- explicit harmful content
- no ambiguity
- minimal context required
Goal: correct identification and action
Task 2: Ambiguous Content (Medium)
- sarcasm, criticism, borderline cases
- requires context and policy reasoning
Goal: balanced moderation decision
Task 3: Contextual & Geo-Aware Moderation (Hard)
- misinformation, repeated offenders, viral posts
- geo-specific policy differences
- multi-step reasoning required
Goal: optimize long-term safety and correctness
Grading Strategy
Each task includes a deterministic grader that:
- compares agent decisions to ground truth derived from policy rules
- assigns a score between 0.0 and 1.0
- accounts for correctness, efficiency, and decision quality
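A grader with these properties might combine the three factors into a weighted score clamped to [0.0, 1.0]. The sketch below is an assumption throughout: the weights, the efficiency formula (fewer steps is better, linear in the step budget), and the function signature are invented to show the shape of a deterministic grader, not taken from the spec.

```python
from typing import Tuple


def grade(
    final_action: str,
    expected_action: str,
    predicted_type: str,
    true_type: str,
    steps_used: int,
    max_steps: int,
    weights: Tuple[float, float, float] = (0.6, 0.25, 0.15),  # hypothetical weights
) -> float:
    """Return a deterministic score in [0.0, 1.0]."""
    w_decision, w_classification, w_efficiency = weights

    decision = 1.0 if final_action == expected_action else 0.0
    classification = 1.0 if predicted_type == true_type else 0.0

    # Efficiency: 1.0 for a single step, 0.0 when the whole budget is used.
    span = max(max_steps - 1, 1)
    efficiency = 1.0 - (steps_used - 1) / span
    efficiency = min(1.0, max(0.0, efficiency))

    score = (
        w_decision * decision
        + w_classification * classification
        + w_efficiency * efficiency
    )
    return min(1.0, max(0.0, score))
```

Because the function depends only on its arguments, identical trajectories always receive identical scores, which is the property the grading strategy requires.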
Data Creation Strategy
To ensure deterministic grading and reproducibility, all data in the environment is synthetically generated using controlled templates and rules.
1. Template-Based Generation
Posts are generated using predefined templates mapped to violation types:
TEMPLATES = {
    "harassment": ["I will hurt you", "You are worthless"],
    "misinformation": ["Vaccines don't work", "This cures all diseases"],
    "safe": ["Had a great day!", "Check out this cool product"],
}
Each template includes:
- known violation label
- severity level
- expected action
2. Controlled Variations
To increase diversity while preserving ground truth:
- tone variation (angry, sarcastic, neutral)
- noise injection (irrelevant text, emojis, typos)
- paraphrasing using rule-based transformations
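These variations can be applied with a seeded random generator so that each variant stays reproducible and keeps its template's label. The sketch below is only illustrative: the tone prefixes, noise suffixes, typo rule, and 30% typo rate are invented placeholders, and `make_variant` is a hypothetical helper.

```python
import random

# Hypothetical tone prefixes and noise suffixes.
TONE_PREFIXES = {"neutral": "", "angry": "Listen. ", "sarcastic": "Oh great. "}
NOISE_SUFFIXES = ["", " lol", " :)", " #random"]


def swap_typo(text: str, rng: random.Random) -> str:
    """Inject a typo by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def make_variant(template: str, label: str, rng: random.Random) -> dict:
    """Vary surface form only; the ground-truth label is copied unchanged."""
    tone = rng.choice(sorted(TONE_PREFIXES))
    body = swap_typo(template, rng) if rng.random() < 0.3 else template
    text = TONE_PREFIXES[tone] + body + rng.choice(NOISE_SUFFIXES)
    return {"post_content": text, "tone": tone, "violation_type": label}


variant = make_variant("You are worthless", "harassment", random.Random(42))
```

Passing an explicit `random.Random(seed)` instead of using the module-level functions keeps generation independent of any other randomness in the process.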
3. Context Generation
Additional signals are generated programmatically:
- user_history: prior violations count, repeated behavior patterns
- reports: number of reports aligned with severity
- engagement: virality score (affects task difficulty)
- geo: region-specific overrides (policy-dependent)
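A programmatic generator for these signals could look like the following sketch. The severity-to-report ranges, the repeat-offender threshold of three prior violations, and the field names inside each signal are assumptions chosen for illustration.

```python
import random

# Hypothetical report-count ranges per severity level (inclusive bounds).
REPORT_RANGE = {"none": (0, 1), "low": (1, 3), "medium": (4, 10), "high": (11, 40)}


def make_context(severity: str, geo: str, rng: random.Random) -> dict:
    """Generate context signals for one scenario from a seeded generator."""
    low, high = REPORT_RANGE[severity]
    prior_violations = rng.randint(0, 5)
    return {
        "user_history": {
            "prior_violations": prior_violations,
            "repeat_offender": prior_violations >= 3,  # invented threshold
        },
        "reports": {"count": rng.randint(low, high)},  # aligned with severity
        "engagement": {"virality_score": round(rng.random(), 3)},
        "geo": geo,
    }


context = make_context("high", "EU", random.Random(7))
```

Drawing report counts from disjoint ranges keeps the signal consistent with the ground-truth severity, so the context never contradicts the label.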
4. Ground Truth Labeling
Each generated scenario includes:
{
"violation_type": "misinformation",
"severity": "high",
"expected_action": "remove"
}
This ensures:
- fully deterministic grading
- no ambiguity in evaluation
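Labels of this shape can be derived from a rule table rather than written by hand, which keeps them consistent with the policy document. In the sketch below, the base policy entries and the single geo override are invented examples, not rules from the actual policy; `label_scenario` is a hypothetical helper.

```python
# Hypothetical base policy: (violation_type, severity) -> expected action.
BASE_POLICY = {
    ("harassment", "high"): "remove",
    ("harassment", "low"): "flag",
    ("misinformation", "high"): "remove",
    ("misinformation", "low"): "flag",
    ("safe", "none"): "allow",
}

# Hypothetical region-specific override: (geo, violation_type, severity) -> action.
GEO_OVERRIDES = {
    ("EU", "misinformation", "low"): "escalate",
}


def label_scenario(violation_type: str, severity: str, geo: str) -> dict:
    """Build the ground-truth record, applying a geo override if one exists."""
    base_action = BASE_POLICY[(violation_type, severity)]
    action = GEO_OVERRIDES.get((geo, violation_type, severity), base_action)
    return {
        "violation_type": violation_type,
        "severity": severity,
        "expected_action": action,
    }


label = label_scenario("misinformation", "high", "US")
```

Here `label` matches the example record above, while the same post at low severity in the EU region would be labeled "escalate" under the invented override.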
5. Reproducibility
- random seeds used for generation
- fixed datasets per task level
- identical inputs produce identical outputs
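One way to check these guarantees is to build a dataset from a seed and compare a content hash across runs. The sketch below reuses the template dictionary from earlier in this section; `build_dataset` and `fingerprint` are hypothetical helpers, and the record layout is an assumption.

```python
import hashlib
import json
import random

TEMPLATES = {
    "harassment": ["I will hurt you", "You are worthless"],
    "misinformation": ["Vaccines don't work", "This cures all diseases"],
    "safe": ["Had a great day!", "Check out this cool product"],
}


def build_dataset(seed: int, size: int) -> list:
    """Generate `size` scenarios from a dedicated, seeded generator."""
    rng = random.Random(seed)
    labels = sorted(TEMPLATES)  # fixed order so choices do not depend on dict order
    dataset = []
    for i in range(size):
        label = rng.choice(labels)
        dataset.append(
            {"id": i, "violation_type": label, "post_content": rng.choice(TEMPLATES[label])}
        )
    return dataset


def fingerprint(dataset: list) -> str:
    """Hash a canonical JSON encoding of the dataset."""
    blob = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


same = fingerprint(build_dataset(7, 20)) == fingerprint(build_dataset(7, 20))
```

Storing the fingerprint alongside each fixed dataset makes accidental changes to templates or generation logic easy to detect.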
RL Framing
This environment models moderation as a sequential decision-making problem:
- actions influence future state
- rewards accumulate over steps
- agents must learn optimal action sequences
OpenEnv Compliance
The environment implements:
- reset() → returns initial observation
- step(action) → returns (observation, reward, done, info)
- state() → returns current state
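The three methods can be sketched as a plain Python class. This does not use or reproduce the actual OpenEnv base classes; `ModerationEnv`, the scenario dictionary layout, the simplified reward, and the choice to have `state()` return only what the agent has revealed so far are assumptions for illustration.

```python
from typing import Optional, Tuple

DECISIONS = {"allow", "flag", "remove", "escalate"}


class ModerationEnv:
    """Hypothetical skeleton exposing reset(), step(action), and state()."""

    def __init__(self, scenario: dict, max_steps: int = 6) -> None:
        self.scenario = scenario
        self.max_steps = max_steps
        self._revealed: Optional[dict] = None
        self._steps = 0
        self._done = True

    def reset(self) -> dict:
        # Partial observability: only the post itself is visible at first.
        self._revealed = {"post_content": self.scenario["post_content"]}
        self._steps = 0
        self._done = False
        return self.state()

    def state(self) -> dict:
        return dict(self._revealed or {})

    def step(self, action: str) -> Tuple[dict, float, bool, dict]:
        if self._done:
            raise RuntimeError("call reset() before step()")
        self._steps += 1
        reward = 0.0
        if action == "fetch_user_history":
            self._revealed["user_history"] = self.scenario["user_history"]
        elif action in DECISIONS:
            # Simplified reward: 1.0 for the expected action, else 0.0.
            reward = 1.0 if action == self.scenario["expected_action"] else 0.0
            self._done = True
        if self._steps >= self.max_steps:
            self._done = True
        return self.state(), reward, self._done, {"steps": self._steps}


env = ModerationEnv(
    {
        "post_content": "This cures all diseases",
        "user_history": {"violation_count": 2},
        "expected_action": "remove",
    }
)
first_obs = env.reset()
```

After `reset()`, the observation contains only `post_content`; `user_history` appears once the agent calls `step("fetch_user_history")`.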
Additionally:
- typed models for Observation, Action, Reward
- openenv.yaml for metadata
- reproducible baseline inference
One-Line Summary
A deterministic, multi-step content moderation environment where agents learn to enforce platform policies under ambiguity, context, and real-world constraints.