📄 SPECS.md: AI Community Moderation & Escalation Environment


🧠 Overview

This project defines an OpenEnv-compliant environment that simulates a real-world content moderation and policy enforcement system.

The agent operates as a moderation decision-maker responsible for:

  • identifying policy violations
  • investigating context
  • enforcing appropriate actions
  • handling ambiguity under platform rules

Unlike static classification tasks, this environment models moderation as a multi-step decision process with deterministic grading derived from a formal policy document.


🎯 Problem Definition

Given:

  • a user-generated post
  • associated context (user history, reports, geo, thread)
  • a predefined policy document

The agent must:

  1. Investigate the content
  2. Identify violation type (if any)
  3. Decide and execute an appropriate moderation action

🧩 Core Design Principles

1. Deterministic Evaluation

All decisions are evaluated against:

  • a synthetic policy document
  • predefined violation taxonomy
  • rule-based ground truth

No subjective scoring is allowed.


2. Multi-Step Decision Process

observe → investigate → classify → decide → enforce

Each step contributes to the final reward.


3. Partial Observability

The agent does not initially receive full information and must:

  • request additional context
  • inspect user behavior
  • interpret signals over time
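A minimal sketch of how partial observability could work, assuming each investigation action reveals exactly one hidden field. The `REVEALS` mapping and the helpers `initial_view` and `reveal` are hypothetical names introduced for illustration; the field names come from the State section.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from an investigation action to the field it reveals.
REVEALS = {
    "fetch_user_history": "user_history",
    "fetch_thread_context": "thread_context",
    "check_policy_clause": "policy_context",
}


def initial_view(full_state: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Return the starting observation with investigable fields set to None."""
    hidden = set(REVEALS.values())
    return {key: (None if key in hidden else value)
            for key, value in full_state.items()}


def reveal(view: Dict[str, Any], full_state: Dict[str, Any],
           action: str) -> Dict[str, Any]:
    """Return a copy of the view with the field tied to `action` filled in."""
    updated = dict(view)
    field_name = REVEALS.get(action)
    if field_name is not None:
        updated[field_name] = full_state[field_name]
    return updated


full = {
    "post_content": "You are worthless",
    "user_history": {"prior_violations": 2},
    "thread_context": ["earlier message"],
    "policy_context": "Harassment clause",
}
view = initial_view(full)
view = reveal(view, full, "fetch_user_history")
```

After the single `fetch_user_history` call, only `user_history` is visible; the thread and policy fields stay hidden until the agent asks for them.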

4. Real-World Alignment

The environment reflects real moderation challenges:

  • misinformation
  • harassment
  • context-dependent violations
  • geo-specific policy differences

🧱 Environment Components

📥 State (Observation)

Each state includes:

  • post_content: text of the content
  • user_history: last N posts + violation count
  • reports: number and type of reports
  • engagement: likes, shares, virality score
  • geo: region identifier (e.g., IN, EU, US)
  • thread_context: previous messages (optional)
  • policy_context: relevant policy snippets (optional)
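The fields above could be expressed as a typed model along these lines. This is a sketch using standard-library dataclasses; the nested class names (`UserHistory`, `Reports`, `Engagement`) and the default values are assumptions, not taken from the project's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserHistory:
    recent_posts: List[str] = field(default_factory=list)  # last N posts
    violation_count: int = 0


@dataclass
class Reports:
    count: int = 0
    types: List[str] = field(default_factory=list)


@dataclass
class Engagement:
    likes: int = 0
    shares: int = 0
    virality_score: float = 0.0


@dataclass
class Observation:
    post_content: str
    geo: str  # region identifier such as "IN", "EU", or "US"
    user_history: UserHistory = field(default_factory=UserHistory)
    reports: Reports = field(default_factory=Reports)
    engagement: Engagement = field(default_factory=Engagement)
    # Optional fields stay None until the agent requests them.
    thread_context: Optional[List[str]] = None
    policy_context: Optional[List[str]] = None


obs = Observation(post_content="Had a great day!", geo="EU")
```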

🎮 Action Space

Investigation Actions

  • fetch_user_history
  • fetch_thread_context
  • check_policy_clause

Classification Actions

  • mark_violation_type (harassment, misinformation, safe, etc.)

Decision Actions

  • allow
  • flag
  • remove
  • escalate
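One way to encode the three action groups is a lookup table from action name to kind. The sketch below assumes that every decision action, including `escalate`, ends the episode; `ActionKind`, `ACTION_KINDS`, and `is_terminal` are illustrative names.

```python
from enum import Enum


class ActionKind(Enum):
    INVESTIGATION = "investigation"
    CLASSIFICATION = "classification"
    DECISION = "decision"


# Action names are taken from the lists above; the grouping table is a sketch.
ACTION_KINDS = {
    "fetch_user_history": ActionKind.INVESTIGATION,
    "fetch_thread_context": ActionKind.INVESTIGATION,
    "check_policy_clause": ActionKind.INVESTIGATION,
    "mark_violation_type": ActionKind.CLASSIFICATION,
    "allow": ActionKind.DECISION,
    "flag": ActionKind.DECISION,
    "remove": ActionKind.DECISION,
    "escalate": ActionKind.DECISION,
}


def is_terminal(action_name: str) -> bool:
    """Return True for decision actions; unknown names raise KeyError."""
    return ACTION_KINDS[action_name] is ActionKind.DECISION
```

Rejecting unknown names with a `KeyError` keeps malformed agent output from being silently treated as a no-op.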

🔁 Episode Flow

  1. reset() initializes a moderation scenario

  2. Agent receives initial state

  3. Agent performs a sequence of actions

  4. Environment updates state accordingly

  5. Rewards are assigned incrementally

  6. The episode ends when:

    • a final decision is made, or
    • the maximum number of steps is reached
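The flow above can be sketched as a plain control loop. `ToyModerationEnv` is a hypothetical stand-in with a hard-coded scenario and a simplified reward, used here only to show the two termination conditions; it is not the project's environment.

```python
class ToyModerationEnv:
    """Hypothetical stand-in environment with one fixed scenario."""

    DECISIONS = {"allow", "flag", "remove", "escalate"}

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"post_content": "I will hurt you"}

    def step(self, action):
        self.steps += 1
        # The episode ends on a final decision or when the step budget runs out.
        done = action in self.DECISIONS or self.steps >= self.max_steps
        reward = 1.0 if action == "remove" else 0.0  # simplified reward
        return {"post_content": "I will hurt you"}, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the accumulated reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total += reward
    return total


def scripted_policy():
    """Return a policy that replays a fixed action plan."""
    plan = iter(["fetch_user_history", "mark_violation_type", "remove"])
    return lambda obs: next(plan)


env = ToyModerationEnv()
total = run_episode(env, scripted_policy())
```

A policy that never commits to a decision still terminates, because the step counter reaches `max_steps`.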

🏆 Reward Design

Step-level Rewards

  • correct investigation step → +0.2
  • correct violation classification → +0.5
  • unnecessary action → -0.1

Final Decision Rewards

  • correct action → +1.0
  • acceptable but suboptimal → +0.5
  • false positive → -0.5
  • false negative → -2.0

Escalation Handling

  • correct escalation → +0.5
  • unnecessary escalation → -0.2
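As a worked example of how these values add up over an episode, the sketch below sums step-level and final-decision rewards. The event keys are hypothetical labels for the rows above, and the escalation rewards are left out because the source does not say how they combine with the final-decision reward.

```python
# Reward values are copied from the tables above; the keys are illustrative.
STEP_REWARDS = {
    "correct_investigation": 0.2,
    "correct_classification": 0.5,
    "unnecessary_action": -0.1,
}
FINAL_REWARDS = {
    "correct": 1.0,
    "suboptimal": 0.5,
    "false_positive": -0.5,
    "false_negative": -2.0,
}


def episode_return(step_events, final_outcome):
    """Sum step rewards plus the final-decision reward, rounded to 2 places."""
    total = sum(STEP_REWARDS[event] for event in step_events)
    return round(total + FINAL_REWARDS[final_outcome], 2)


# Two useful lookups, one wasted action, a correct label, a correct decision:
# 0.2 + 0.2 - 0.1 + 0.5 + 1.0 = 1.8
good = episode_return(
    ["correct_investigation", "correct_investigation",
     "unnecessary_action", "correct_classification"],
    "correct",
)

# A correct label followed by a missed violation: 0.5 - 2.0 = -1.5
missed = episode_return(["correct_classification"], "false_negative")
```

The asymmetry is visible in the second case: a false negative costs four times as much as a false positive, so a correct classification alone cannot offset a missed violation.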

🎯 Tasks

🟢 Task 1: Clear Violations (Easy)

  • explicit harmful content
  • no ambiguity
  • minimal context required

Goal: correct identification and action


🟡 Task 2: Ambiguous Content (Medium)

  • sarcasm, criticism, borderline cases
  • requires context and policy reasoning

Goal: balanced moderation decision


🔴 Task 3: Contextual & Geo-Aware Moderation (Hard)

  • misinformation, repeated offenders, viral posts
  • geo-specific policy differences
  • multi-step reasoning required

Goal: optimize long-term safety and correctness


🧪 Grading Strategy

Each task includes a deterministic grader that:

  • compares agent decisions to ground truth derived from policy rules

  • assigns a score between 0.0 and 1.0

  • accounts for:

    • correctness
    • efficiency
    • decision quality
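A deterministic grader of this shape might combine the three factors as a weighted sum clamped to [0.0, 1.0]. The weights (0.4, 0.4, 0.2), the efficiency formula, and the `acceptable_actions` parameter are all assumptions made for this sketch; the source does not specify them.

```python
def grade(predicted_type, predicted_action, truth, steps_used, max_steps,
          acceptable_actions=()):
    """Return a score in [0.0, 1.0]; weights and formulas are illustrative."""
    # Correctness: did the agent label the violation type correctly?
    correctness = 1.0 if predicted_type == truth["violation_type"] else 0.0

    # Decision quality: full credit for the expected action, half credit for
    # an action listed as acceptable but suboptimal.
    if predicted_action == truth["expected_action"]:
        quality = 1.0
    elif predicted_action in acceptable_actions:
        quality = 0.5
    else:
        quality = 0.0

    # Efficiency: each step beyond the first lowers the score linearly.
    efficiency = max(0.0, 1.0 - (steps_used - 1) / max_steps)

    score = 0.4 * correctness + 0.4 * quality + 0.2 * efficiency
    return min(1.0, max(0.0, score))


truth = {"violation_type": "misinformation", "severity": "high",
         "expected_action": "remove"}
perfect = grade("misinformation", "remove", truth, steps_used=1, max_steps=5)
partial = grade("misinformation", "flag", truth, steps_used=3, max_steps=5,
                acceptable_actions=("flag",))
```

Because the function depends only on its arguments, the same trajectory always receives the same score.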

🧬 Data Creation Strategy

To ensure deterministic grading and reproducibility, all data in the environment is synthetically generated using controlled templates and rules.

1. Template-Based Generation

Posts are generated using predefined templates mapped to violation types:

# Seed phrases keyed by violation type; the key doubles as the ground-truth label.
TEMPLATES = {
    "harassment": ["I will hurt you", "You are worthless"],
    "misinformation": ["Vaccines don't work", "This cures all diseases"],
    "safe": ["Had a great day!", "Check out this cool product"],
}

Each template includes:

  • known violation label
  • severity level
  • expected action
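To carry the label, severity, and expected action alongside each phrase, a template could be stored as a small record. The `Template` class and the specific severity and action values below are illustrative assumptions rather than the project's actual data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    text: str
    violation_type: str
    severity: str
    expected_action: str


# Illustrative records; severity and action assignments are assumptions.
TEMPLATE_RECORDS = [
    Template("I will hurt you", "harassment", "high", "remove"),
    Template("Vaccines don't work", "misinformation", "high", "remove"),
    Template("Had a great day!", "safe", "none", "allow"),
]


def ground_truth(template: Template) -> dict:
    """Return the label fields a grader would compare against."""
    return {
        "violation_type": template.violation_type,
        "severity": template.severity,
        "expected_action": template.expected_action,
    }
```

Freezing the dataclass prevents a generation step from mutating a template's label by accident.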

2. Controlled Variations

To increase diversity while preserving ground truth:

  • tone variation (angry, sarcastic, neutral)
  • noise injection (irrelevant text, emojis, typos)
  • paraphrasing using rule-based transformations
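A sketch of noise injection that changes the surface text while leaving the label untouched. It assumes a seeded `random.Random` instance per variant; `add_noise`, `make_variant`, and the filler and emoji lists are hypothetical.

```python
import random

EMOJIS = ["🙂", "🔥", "👀"]
FILLERS = ["btw", "lol", "just saying"]


def add_noise(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters (a typo), then append filler and an emoji."""
    chars = list(text)
    if len(chars) > 3:
        i = rng.randrange(1, len(chars) - 2)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return f"{''.join(chars)} {rng.choice(FILLERS)} {rng.choice(EMOJIS)}"


def make_variant(template_text: str, label: str, seed: int) -> dict:
    """Build a noisy post whose ground-truth label equals the template's."""
    rng = random.Random(seed)
    return {
        "post_content": add_noise(template_text, rng),
        "violation_type": label,
    }


variant = make_variant("You are worthless", "harassment", seed=7)
```

The label is passed through unchanged, so ground truth survives any amount of surface variation.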

3. Context Generation

Additional signals are generated programmatically:

  • user_history

    • prior violations count
    • repeated behavior patterns
  • reports

    • number of reports aligned with severity
  • engagement

    • virality score (affects task difficulty)
  • geo

    • region-specific overrides (policy-dependent)
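The signals above might be generated as follows. The report ranges per severity, the `GEO_OVERRIDES` table, and both function names are assumptions chosen for illustration; only the field names come from this document.

```python
import random

# Hypothetical report-count ranges; higher severity draws more reports.
REPORT_RANGE = {"none": (0, 1), "low": (1, 5), "medium": (5, 20), "high": (20, 100)}

# Hypothetical region-specific override of the default action.
GEO_OVERRIDES = {("misinformation", "EU"): "remove"}


def generate_context(severity: str, geo: str, seed: int) -> dict:
    """Generate context signals deterministically from a seed."""
    rng = random.Random(seed)
    low, high = REPORT_RANGE[severity]
    return {
        "user_history": {"prior_violations": rng.randint(0, 3)},
        "reports": {"count": rng.randint(low, high)},
        "engagement": {"virality_score": round(rng.random(), 3)},
        "geo": geo,
    }


def resolve_action(violation_type: str, geo: str, default_action: str) -> str:
    """Apply a region override if one exists, else keep the default action."""
    return GEO_OVERRIDES.get((violation_type, geo), default_action)


ctx = generate_context("high", "EU", seed=42)
```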

4. Ground Truth Labeling

Each generated scenario includes:

{
  "violation_type": "misinformation",
  "severity": "high",
  "expected_action": "remove"
}

This ensures:

  • fully deterministic grading
  • no ambiguity in evaluation

5. Reproducibility

  • fixed random seeds are used for generation
  • fixed datasets per task level
  • identical inputs produce identical outputs
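One simple way to check reproducibility is to fingerprint a generated dataset and compare hashes across runs. The `build_dataset` generator below is a toy placeholder; the hashing uses the standard-library `hashlib` and `json` modules.

```python
import hashlib
import json
import random


def build_dataset(seed: int, size: int) -> list:
    """Toy generator: draw labels from a seeded RNG."""
    rng = random.Random(seed)
    labels = ["harassment", "misinformation", "safe"]
    return [{"id": i, "violation_type": rng.choice(labels)} for i in range(size)]


def fingerprint(dataset: list) -> str:
    """Hash a canonical JSON encoding so equal datasets give equal digests."""
    blob = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


first = fingerprint(build_dataset(seed=0, size=10))
second = fingerprint(build_dataset(seed=0, size=10))
```

Storing the digest next to each fixed dataset would let a test fail loudly if generation logic changes.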

🔁 RL Framing

This environment models moderation as a sequential decision-making problem:

  • actions influence future state
  • rewards accumulate over steps
  • agents must learn optimal action sequences

📦 OpenEnv Compliance

The environment implements:

  • reset() → returns initial observation
  • step(action) → returns (observation, reward, done, info)
  • state() → returns current state

Additionally:

  • typed models for Observation, Action, Reward
  • openenv.yaml for metadata
  • reproducible baseline inference
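The interface could look roughly like the skeleton below. It deliberately does not import the OpenEnv package, since the base classes and schema that OpenEnv expects are not shown in this document; the class name, the dataclass fields, and the simplified reward are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Action:
    name: str
    argument: Optional[str] = None  # e.g. the label for mark_violation_type


@dataclass
class Reward:
    value: float
    reason: str = ""


@dataclass
class Observation:
    post_content: str
    revealed: Dict[str, Any] = field(default_factory=dict)


class ModerationEnvSketch:
    """Illustrative skeleton of the reset/step/state interface."""

    DECISIONS = {"allow", "flag", "remove", "escalate"}

    def __init__(self) -> None:
        self._obs: Optional[Observation] = None
        self._done = False

    def reset(self) -> Observation:
        self._obs = Observation(post_content="Check out this cool product")
        self._done = False
        return self._obs

    def step(self, action: Action) -> Tuple[Observation, Reward, bool, Dict[str, Any]]:
        if self._obs is None or self._done:
            raise RuntimeError("call reset() before step()")
        self._done = action.name in self.DECISIONS
        # Simplified reward: the fixed scenario is a safe post, so "allow" is correct.
        if action.name == "allow":
            reward = Reward(1.0, "correct action")
        else:
            reward = Reward(0.0)
        return self._obs, reward, self._done, {"action": action.name}

    def state(self) -> Dict[str, Any]:
        return {"observation": self._obs, "done": self._done}


env = ModerationEnvSketch()
first_obs = env.reset()
obs, reward, done, info = env.step(Action("allow"))
```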

🧠 One-Line Summary

A deterministic, multi-step content moderation environment where agents learn to enforce platform policies under ambiguity, context, and real-world constraints.