ThejasRao

Initial OpenENV hackathon submission

c492c3f 2 months ago


📄 SPECS.md — AI Community Moderation & Escalation Environment

🧠 Overview

This project defines an OpenEnv-compliant environment that simulates a real-world content moderation and policy enforcement system.

The agent operates as a moderation decision-maker responsible for:

identifying policy violations
investigating context
enforcing appropriate actions
handling ambiguity under platform rules

Unlike static classification tasks, this environment models moderation as a multi-step decision process with deterministic grading derived from a formal policy document.

🎯 Problem Definition

Given:

a user-generated post
associated context (user history, reports, geo, thread)
a predefined policy document

The agent must:

Investigate the content
Identify violation type (if any)
Decide and execute an appropriate moderation action

🧩 Core Design Principles

1. Deterministic Evaluation

All decisions are evaluated against:

a synthetic policy document
predefined violation taxonomy
rule-based ground truth

No subjective scoring is allowed.

2. Multi-Step Decision Process

observe → investigate → classify → decide → enforce

Each step contributes to the final reward.

3. Partial Observability

The agent does not initially receive full information and must:

request additional context
inspect user behavior
interpret signals over time

4. Real-World Alignment

The environment reflects real moderation challenges:

misinformation
harassment
context-dependent violations
geo-specific policy differences

🧱 Environment Components

📥 State (Observation)

Each state includes:

post_content: text of the content
user_history: the user's last N posts and prior violation count
reports: number and type of reports
engagement: likes, shares, virality score
geo: region identifier (e.g., IN, EU, US)
thread_context: previous messages (optional)
policy_context: relevant policy snippets (optional)
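
The fields above could be captured in a typed model along these lines. This is a minimal sketch, not the project's actual schema: the class names, the nested `UserHistory` type, and the choice to model not-yet-fetched context as `None` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UserHistory:
    recent_posts: List[str] = field(default_factory=list)  # last N posts
    violation_count: int = 0


@dataclass
class Observation:
    post_content: str
    geo: str  # region identifier, e.g. "IN", "EU", "US"
    reports: Dict[str, int] = field(default_factory=dict)       # report type -> count
    engagement: Dict[str, float] = field(default_factory=dict)  # likes, shares, virality
    # Assumption: context the agent has not requested yet is represented as None.
    user_history: Optional[UserHistory] = None
    thread_context: Optional[List[str]] = None
    policy_context: Optional[List[str]] = None


obs = Observation(post_content="Had a great day!", geo="IN", reports={"spam": 1})
```

Leaving the optional fields unset until an investigation action fills them is one way to express the partial observability described earlier.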

🎮 Action Space

Investigation Actions

fetch_user_history
fetch_thread_context
check_policy_clause

Classification Actions

mark_violation_type (harassment, misinformation, safe, etc.)

Decision Actions

allow
flag
remove
escalate
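
A typed action model for the three groups above might look like the following. The `Action` class, its `argument` field, and the validation rules are hypothetical; only the action names come from this spec.

```python
from dataclasses import dataclass
from typing import Optional

INVESTIGATION = {"fetch_user_history", "fetch_thread_context", "check_policy_clause"}
CLASSIFICATION = {"mark_violation_type"}
DECISION = {"allow", "flag", "remove", "escalate"}


@dataclass(frozen=True)
class Action:
    name: str
    argument: Optional[str] = None  # e.g. the label passed to mark_violation_type

    def __post_init__(self):
        # Reject anything outside the documented action space.
        if self.name not in INVESTIGATION | CLASSIFICATION | DECISION:
            raise ValueError(f"unknown action: {self.name}")
        # Assumption: a classification action must carry a violation label.
        if self.name in CLASSIFICATION and self.argument is None:
            raise ValueError("mark_violation_type requires a violation label")


label_action = Action("mark_violation_type", "harassment")
```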

🔁 Episode Flow

reset() initializes a moderation scenario
Agent receives initial state
Agent performs a sequence of actions
Environment updates state accordingly
Rewards are assigned incrementally
The episode ends when:
- a final decision is made, or
- the maximum number of steps is reached
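
The two termination conditions can be traced with a small helper. This is a sketch only: `run_episode` and the default `max_steps=5` are illustrative, and it assumes that any of the four decision actions counts as the final decision.

```python
DECISION_ACTIONS = {"allow", "flag", "remove", "escalate"}


def run_episode(actions, max_steps=5):
    """Return the actions that are actually executed before the episode ends."""
    executed = []
    for action in actions:
        executed.append(action)
        # Stop on a final decision or when the step budget is exhausted.
        if action in DECISION_ACTIONS or len(executed) >= max_steps:
            break
    return executed


trace = run_episode(["fetch_user_history", "remove", "allow"])
```

Here `trace` stops at `remove`; the trailing `allow` is never executed because the episode has already ended.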

🏆 Reward Design

Step-level Rewards

correct investigation step → +0.2
correct violation classification → +0.5
unnecessary action → -0.1

Final Decision Rewards

correct action → +1.0
acceptable but suboptimal → +0.5
false positive → -0.5
false negative → -2.0

Escalation Handling

correct escalation → +0.5
unnecessary escalation → -0.2
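
As a worked example of how these values add up over an episode, the sketch below sums step-level rewards and one final outcome. The event names are hypothetical labels for the rows above, and treating an escalation outcome as a replacement for the final-decision reward is an assumption; the spec does not say how the two combine.

```python
STEP_REWARDS = {
    "correct_investigation": 0.2,
    "correct_classification": 0.5,
    "unnecessary_action": -0.1,
}

# Assumption: escalation outcomes are scored in place of a final-decision reward.
FINAL_REWARDS = {
    "correct": 1.0,
    "suboptimal": 0.5,
    "false_positive": -0.5,
    "false_negative": -2.0,
    "correct_escalation": 0.5,
    "unnecessary_escalation": -0.2,
}


def episode_return(step_events, final_outcome):
    """Sum the step-level rewards and add the reward for the final outcome."""
    step_total = sum(STEP_REWARDS[event] for event in step_events)
    return step_total + FINAL_REWARDS[final_outcome]


total = episode_return(["correct_investigation", "correct_classification"], "correct")
```

One correct investigation step, a correct classification, and a correct final action give 0.2 + 0.5 + 1.0 = 1.7. The table is deliberately asymmetric: a false negative (-2.0) costs four times as much as a false positive (-0.5).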

🎯 Tasks

🟢 Task 1 — Clear Violations (Easy)

explicit harmful content
no ambiguity
minimal context required

Goal: correct identification and action

🟡 Task 2 — Ambiguous Content (Medium)

sarcasm, criticism, borderline cases
requires context and policy reasoning

Goal: balanced moderation decision

🔴 Task 3 — Contextual & Geo-Aware Moderation (Hard)

misinformation, repeated offenders, viral posts
geo-specific policy differences
multi-step reasoning required

Goal: optimize long-term safety and correctness
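
Geo-specific policy differences could be resolved as a default rule plus regional overrides. The sketch below is purely illustrative: the policy values are invented placeholders, not the contents of the project's policy document, and `expected_action` is a hypothetical helper.

```python
# Placeholder values for illustration only; not taken from the policy document.
DEFAULT_POLICY = {"misinformation": "flag", "harassment": "remove", "safe": "allow"}
GEO_OVERRIDES = {"EU": {"misinformation": "remove"}}


def expected_action(violation_type, geo):
    """Use the regional override when one exists, else fall back to the default."""
    regional = GEO_OVERRIDES.get(geo, {})
    return regional.get(violation_type, DEFAULT_POLICY[violation_type])
```

Because the lookup is a pure function of violation type and region, the ground truth for a geo-aware scenario stays deterministic.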

🧪 Grading Strategy

Each task includes a deterministic grader that:

compares agent decisions to ground truth derived from policy rules
assigns a score between 0.0 and 1.0
accounts for correctness, efficiency, and decision quality
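
One way a grader could fold these three factors into a single score in [0.0, 1.0] is a weighted sum. This is a sketch under stated assumptions: the weights, the linear efficiency term, and the `acceptable_actions` key are not defined anywhere in this spec.

```python
def grade(predicted_label, predicted_action, truth, steps_used, max_steps,
          weights=(0.4, 0.4, 0.2)):
    """Return a score in [0.0, 1.0]; the weights are illustrative, not specified."""
    w_label, w_action, w_efficiency = weights

    # Correctness: did the agent name the right violation type?
    label_score = 1.0 if predicted_label == truth["violation_type"] else 0.0

    # Decision quality: full credit for the expected action, half for an
    # acceptable alternative (hypothetical "acceptable_actions" key).
    if predicted_action == truth["expected_action"]:
        action_score = 1.0
    elif predicted_action in truth.get("acceptable_actions", ()):
        action_score = 0.5
    else:
        action_score = 0.0

    # Efficiency: 1.0 for a one-step episode, falling linearly to 0.0 at max_steps.
    efficiency = 1.0 - (steps_used - 1) / max(max_steps - 1, 1)
    efficiency = min(max(efficiency, 0.0), 1.0)

    return w_label * label_score + w_action * action_score + w_efficiency * efficiency


truth = {"violation_type": "misinformation", "severity": "high", "expected_action": "remove"}
best = grade("misinformation", "remove", truth, steps_used=1, max_steps=5)
worst = grade("safe", "allow", truth, steps_used=5, max_steps=5)
```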

🧬 Data Creation Strategy

To ensure deterministic grading and reproducibility, all data in the environment is synthetically generated using controlled templates and rules.

1. Template-Based Generation

Posts are generated using predefined templates mapped to violation types:

# Base post templates keyed by violation type; variations are applied on top of these.
TEMPLATES = {
  "harassment": ["I will hurt you", "You are worthless"],
  "misinformation": ["Vaccines don't work", "This cures all diseases"],
  "safe": ["Had a great day!", "Check out this cool product"]
}

Each template includes:

known violation label
severity level
expected action
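
To carry the label, severity, and expected action alongside each template string, a small record type could be used. This is a sketch: the `Template` class and `ground_truth` helper are hypothetical, and the severity and action values for the harassment and safe entries are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    text: str
    violation_type: str
    severity: str
    expected_action: str


# Severity and expected_action for the first and last rows are assumptions.
LABELED_TEMPLATES = [
    Template("I will hurt you", "harassment", "high", "remove"),
    Template("Vaccines don't work", "misinformation", "high", "remove"),
    Template("Had a great day!", "safe", "none", "allow"),
]


def ground_truth(template):
    """Build the ground-truth record that a grader would compare against."""
    return {
        "violation_type": template.violation_type,
        "severity": template.severity,
        "expected_action": template.expected_action,
    }
```

Keeping the labels on the template itself means every post derived from it inherits a known ground truth.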

2. Controlled Variations

To increase diversity while preserving ground truth:

tone variation (angry, sarcastic, neutral)
noise injection (irrelevant text, emojis, typos)
paraphrasing using rule-based transformations
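
Two of these variations, typos and noise injection, can be sketched with a seeded random generator so that the same seed always yields the same variant. The helper names and the filler and emoji lists are illustrative assumptions.

```python
import random

FILLER = ["lol", "just saying", "btw"]
EMOJIS = ["🙂", "🔥", "👀"]


def swap_typo(text, rng):
    """Swap two adjacent characters at a random position to simulate a typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def inject_noise(text, rng):
    """Append irrelevant filler text and an emoji."""
    return f"{text} {rng.choice(FILLER)} {rng.choice(EMOJIS)}"


def vary(text, seed):
    """Produce a deterministic surface variation; the label is left untouched."""
    rng = random.Random(seed)
    return inject_noise(swap_typo(text, rng), rng)
```

Only the surface text changes, so the violation label attached to the source template still applies to every variant.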

3. Context Generation

Additional signals are generated programmatically:

user_history: prior violation count and repeated behavior patterns
reports: number of reports, aligned with severity
engagement: virality score (affects task difficulty)
geo: region-specific overrides (policy-dependent)
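
A programmatic context generator might look like this. It is a rough sketch: the report ranges per severity, the violation-count range, and the dictionary layout are assumptions chosen only to show report counts tracking severity.

```python
import random

# Assumed report-count ranges per severity level (inclusive bounds).
REPORTS_BY_SEVERITY = {"none": (0, 1), "low": (1, 5), "high": (10, 50)}


def generate_context(severity, geo, seed):
    """Generate context signals for one scenario from a fixed seed."""
    rng = random.Random(seed)
    low, high = REPORTS_BY_SEVERITY[severity]
    return {
        "user_history": {"violation_count": rng.randint(0, 3)},
        "reports": {"count": rng.randint(low, high)},
        "engagement": {"virality_score": round(rng.random(), 2)},
        "geo": geo,
    }


ctx = generate_context("high", "EU", seed=42)
```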

4. Ground Truth Labeling

Each generated scenario includes:

{
  "violation_type": "misinformation",
  "severity": "high",
  "expected_action": "remove"
}

This ensures:

fully deterministic grading
no ambiguity in evaluation

5. Reproducibility

fixed random seeds are used for generation
fixed datasets per task level
identical inputs produce identical outputs
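
A simple way to check that identical inputs produce identical outputs is to fingerprint a generated dataset and compare hashes across runs. The sketch below assumes a hypothetical `build_dataset` generator; only the idea of seeded, fixed datasets comes from this spec.

```python
import hashlib
import json
import random

POSTS = ["I will hurt you", "Vaccines don't work", "Had a great day!"]


def build_dataset(seed, size):
    """Generate a small dataset whose contents depend only on seed and size."""
    rng = random.Random(seed)
    return [{"id": i, "post_content": rng.choice(POSTS)} for i in range(size)]


def fingerprint(dataset):
    """Hash a canonical JSON encoding so two runs can be compared cheaply."""
    payload = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


same = fingerprint(build_dataset(1, 10)) == fingerprint(build_dataset(1, 10))
```

Storing the fingerprint next to each task level's dataset would make accidental changes to the generator easy to detect.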

🔁 RL Framing

This environment models moderation as a sequential decision-making problem:

actions influence future state
rewards accumulate over steps
agents must learn optimal action sequences

📦 OpenEnv Compliance

The environment implements:

reset() → returns initial observation
step(action) → returns (observation, reward, done, info)
state() → returns current state

Additionally:

typed models for Observation, Action, Reward
openenv.yaml for metadata
reproducible baseline inference
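
The three methods can be illustrated with a plain-Python skeleton. This sketch deliberately does not import or subclass anything from OpenEnv, whose base classes are not shown in this document; the class name, the scenario dictionary, and the simplified 1.0/0.0 reward are assumptions.

```python
DECISIONS = {"allow", "flag", "remove", "escalate"}


class ModerationEnv:
    """Toy environment exposing reset(), step(action), and state()."""

    def __init__(self, scenario, max_steps=5):
        self._scenario = scenario
        self._max_steps = max_steps
        self._steps = 0
        self._done = False

    def reset(self):
        self._steps = 0
        self._done = False
        return self.state()

    def state(self):
        return {"post_content": self._scenario["post_content"], "steps": self._steps}

    def step(self, action):
        if self._done:
            raise RuntimeError("episode finished; call reset()")
        self._steps += 1
        reward = 0.0
        if action in DECISIONS:
            # Simplified reward: the full reward table is omitted in this sketch.
            reward = 1.0 if action == self._scenario["expected_action"] else 0.0
            self._done = True
        elif self._steps >= self._max_steps:
            self._done = True
        return self.state(), reward, self._done, {"steps": self._steps}


env = ModerationEnv({"post_content": "Vaccines don't work", "expected_action": "remove"})
first_obs = env.reset()
obs, reward, done, info = env.step("remove")
```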

🧠 One-Line Summary

A deterministic, multi-step content moderation environment where agents learn to enforce platform policies under ambiguity, context, and real-world constraints.