# 📄 SPECS.md — AI Community Moderation & Escalation Environment
---
## 🧠 Overview
This project defines an **OpenEnv-compliant environment** that simulates a real-world **content moderation and policy enforcement system**.
The agent operates as a moderation decision-maker responsible for:
* identifying policy violations
* investigating context
* enforcing appropriate actions
* handling ambiguity under platform rules
Unlike static classification tasks, this environment models moderation as a **multi-step decision process** with **deterministic grading derived from a formal policy document**.
---
## 🎯 Problem Definition
Given:
* a user-generated post
* associated context (user history, reports, geo, thread)
* a predefined policy document
The agent must:
1. Investigate the content
2. Identify the violation type (if any)
3. Decide and execute an appropriate moderation action
---
## 🧩 Core Design Principles
### 1. Deterministic Evaluation
All decisions are evaluated against:
* a **synthetic policy document**
* predefined violation taxonomy
* rule-based ground truth
No subjective scoring is allowed.
---
### 2. Multi-Step Decision Process
```text
observe → investigate → classify → decide → enforce
```
Each step contributes to the final reward.
---
### 3. Partial Observability
The agent does not initially receive full information and must:
* request additional context
* inspect user behavior
* interpret signals over time
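A minimal sketch of how partial observability could work, assuming that investigation actions unlock fields that start out hidden. The `REVEALS` mapping and the helper names `initial_view` and `reveal` are hypothetical and are not part of this spec.

```python
from typing import Any, Dict

# Hypothetical full scenario; field names follow the State section of this spec.
FULL_SCENARIO: Dict[str, Any] = {
    "post_content": "You are worthless",
    "user_history": {"prior_violations": 3},
    "thread_context": ["earlier message"],
}

# Assumed mapping from an investigation action to the field it reveals.
REVEALS = {
    "fetch_user_history": "user_history",
    "fetch_thread_context": "thread_context",
}


def initial_view(scenario: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first observation, with investigable fields set to None."""
    hidden = set(REVEALS.values())
    return {key: (None if key in hidden else value) for key, value in scenario.items()}


def reveal(view: Dict[str, Any], scenario: Dict[str, Any], action: str) -> Dict[str, Any]:
    """Return a copy of the view with the field unlocked by `action` filled in."""
    updated = dict(view)
    field_name = REVEALS.get(action)
    if field_name is not None:
        updated[field_name] = scenario[field_name]
    return updated
```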
---
### 4. Real-World Alignment
The environment reflects real moderation challenges:
* misinformation
* harassment
* context-dependent violations
* geo-specific policy differences
---
## 🧱 Environment Components
### 📥 State (Observation)
Each state includes:
* `post_content`: text of the content
* `user_history`: last N posts + violation count
* `reports`: number and type of reports
* `engagement`: likes, shares, virality score
* `geo`: region identifier (e.g., IN, EU, US)
* `thread_context`: previous messages (optional)
* `policy_context`: relevant policy snippets (optional)
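One possible typed shape for this state, sketched with plain dataclasses. The nested class names, the `reports` layout (report type mapped to count), and all defaults are assumptions; the project's actual models may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UserHistory:
    recent_posts: List[str] = field(default_factory=list)  # last N posts
    violation_count: int = 0


@dataclass
class Engagement:
    likes: int = 0
    shares: int = 0
    virality_score: float = 0.0


@dataclass
class Observation:
    post_content: str
    geo: str  # region identifier, e.g. "IN", "EU", "US"
    user_history: UserHistory = field(default_factory=UserHistory)
    reports: Dict[str, int] = field(default_factory=dict)  # assumed: report type -> count
    engagement: Engagement = field(default_factory=Engagement)
    thread_context: Optional[List[str]] = None  # optional per the spec
    policy_context: Optional[List[str]] = None  # optional per the spec


obs = Observation(post_content="Had a great day!", geo="EU")
```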
---
### 🎮 Action Space
#### Investigation Actions
* `fetch_user_history`
* `fetch_thread_context`
* `check_policy_clause`
#### Classification Actions
* `mark_violation_type` (harassment, misinformation, safe, etc.)
#### Decision Actions
* `allow`
* `flag`
* `remove`
* `escalate`
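The three action groups above could be encoded as follows. This is a sketch: `ActionKind`, `ACTION_KINDS`, and `is_terminal` are illustrative names, and treating every decision action as episode-ending is an assumption based on the Episode Flow section.

```python
from enum import Enum


class ActionKind(Enum):
    INVESTIGATION = "investigation"
    CLASSIFICATION = "classification"
    DECISION = "decision"


# Action names are taken from the lists above; the grouping mirrors the headings.
ACTION_KINDS = {
    "fetch_user_history": ActionKind.INVESTIGATION,
    "fetch_thread_context": ActionKind.INVESTIGATION,
    "check_policy_clause": ActionKind.INVESTIGATION,
    "mark_violation_type": ActionKind.CLASSIFICATION,
    "allow": ActionKind.DECISION,
    "flag": ActionKind.DECISION,
    "remove": ActionKind.DECISION,
    "escalate": ActionKind.DECISION,
}


def is_terminal(action_name: str) -> bool:
    """Assumed rule: a decision action is the final step of an episode."""
    return ACTION_KINDS[action_name] is ActionKind.DECISION
```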
---
### 🔁 Episode Flow
1. `reset()` initializes a moderation scenario
2. Agent receives initial state
3. Agent performs a sequence of actions
4. Environment updates state accordingly
5. Rewards are assigned incrementally
6. Episode ends when:
   * a final decision is made, or
   * the maximum number of steps is reached
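The flow above can be sketched as a driver loop. `StubEnv` is a hypothetical stand-in that returns zero reward on every step; it exists only so the loop runs. `run_episode` and its `max_steps` default are likewise assumptions.

```python
from typing import Any, Callable, Dict, Tuple

DECISION_ACTIONS = {"allow", "flag", "remove", "escalate"}


class StubEnv:
    """Hypothetical stand-in: zero reward everywhere; a decision ends the episode."""

    def reset(self) -> Dict[str, Any]:
        return {"post_content": "Had a great day!"}

    def step(self, action: str) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        done = action in DECISION_ACTIONS
        return {"post_content": "Had a great day!"}, 0.0, done, {}


def run_episode(
    env: Any,
    policy: Callable[[Dict[str, Any]], str],
    max_steps: int = 5,
) -> Tuple[float, int, str]:
    """Run one episode; return (total_reward, steps_taken, stop_reason)."""
    obs = env.reset()
    total = 0.0
    for step in range(1, max_steps + 1):
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total += reward  # rewards are assigned incrementally
        if done:
            return total, step, "final_decision"
    return total, max_steps, "max_steps"
```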
---
## 🏆 Reward Design
### Step-level Rewards
* correct investigation step → +0.2
* correct violation classification → +0.5
* unnecessary action → -0.1
### Final Decision Rewards
* correct action → +1.0
* acceptable but suboptimal → +0.5
* false positive → -0.5
* false negative → -2.0
### Escalation Handling
* correct escalation → +0.5
* unnecessary escalation → -0.2
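As a worked example, the values above can be held in a lookup table and summed over an episode. The event names are hypothetical labels for the bullets above, and simply adding escalation rewards to the other rewards is an assumption, since the spec does not say how they combine.

```python
from typing import Iterable

# Values copied from the reward lists above; keys are illustrative labels.
REWARDS = {
    "correct_investigation": 0.2,
    "correct_classification": 0.5,
    "unnecessary_action": -0.1,
    "correct_action": 1.0,
    "acceptable_action": 0.5,
    "false_positive": -0.5,
    "false_negative": -2.0,
    "correct_escalation": 0.5,
    "unnecessary_escalation": -0.2,
}


def episode_return(events: Iterable[str]) -> float:
    """Sum the reward for each event in the order it occurred."""
    return sum(REWARDS[event] for event in events)


# One useful investigation, a correct label, then the correct action:
# 0.2 + 0.5 + 1.0, i.e. about 1.7 (up to floating-point rounding).
good_run = episode_return(
    ["correct_investigation", "correct_classification", "correct_action"]
)

# A missed violation after one wasted step: -0.1 + -2.0, i.e. about -2.1.
bad_run = episode_return(["unnecessary_action", "false_negative"])
```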
---
## 🎯 Tasks
### 🟢 Task 1 — Clear Violations (Easy)
* explicit harmful content
* no ambiguity
* minimal context required
**Goal:** correct identification and action
---
### 🟡 Task 2 — Ambiguous Content (Medium)
* sarcasm, criticism, borderline cases
* requires context and policy reasoning
**Goal:** balanced moderation decision
---
### 🔴 Task 3 — Contextual & Geo-Aware Moderation (Hard)
* misinformation, repeated offenders, viral posts
* geo-specific policy differences
* multi-step reasoning required
**Goal:** optimize long-term safety and correctness
---
## 🧪 Grading Strategy
Each task includes a **deterministic grader** that:
* compares agent decisions to ground truth derived from policy rules
* assigns a score between **0.0 and 1.0** (unlike the raw rewards above, which can be negative)
* accounts for:
  * correctness
  * efficiency
  * decision quality
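A hedged sketch of such a grader. The spec does not define how correctness, efficiency, and decision quality are combined, so the weights in `WEIGHTS`, the linear efficiency penalty, and the `acceptable_actions` parameter are all assumptions.

```python
from typing import Dict, Iterable

# Assumed weights; the spec only says the three factors are accounted for.
WEIGHTS = {"correctness": 0.4, "decision": 0.4, "efficiency": 0.2}


def grade(
    predicted_type: str,
    predicted_action: str,
    truth: Dict[str, str],
    steps_used: int,
    max_steps: int,
    acceptable_actions: Iterable[str] = (),
) -> float:
    """Return a deterministic score in [0.0, 1.0] for one episode."""
    correctness = 1.0 if predicted_type == truth["violation_type"] else 0.0

    if predicted_action == truth["expected_action"]:
        decision = 1.0
    elif predicted_action in acceptable_actions:
        decision = 0.5  # acceptable but suboptimal
    else:
        decision = 0.0

    # Efficiency falls linearly from 1.0 (one step) to 0.0 (max_steps used).
    span = max(max_steps - 1, 1)
    extra_steps = min(max(steps_used - 1, 0), span)
    efficiency = 1.0 - extra_steps / span

    score = (
        WEIGHTS["correctness"] * correctness
        + WEIGHTS["decision"] * decision
        + WEIGHTS["efficiency"] * efficiency
    )
    return max(0.0, min(1.0, score))
```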
---
## 🧬 Data Creation Strategy
To ensure deterministic grading and reproducibility, all data in the environment is **synthetically generated using controlled templates and rules**.
### 1. Template-Based Generation
Posts are generated using predefined templates mapped to violation types:
```python
TEMPLATES = {
    "harassment": ["I will hurt you", "You are worthless"],
    "misinformation": ["Vaccines don't work", "This cures all diseases"],
    "safe": ["Had a great day!", "Check out this cool product"],
}
```
Each template includes:
* known violation label
* severity level
* expected action
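The `TEMPLATES` dict above stores bare strings, so the three attributes need a home. One way to attach them is a small record type, sketched below. The `Template` class and the severity and action values for the harassment and safe rows are assumptions; only the misinformation row mirrors a combination shown elsewhere in this spec.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Template:
    text: str
    violation_type: str
    severity: str
    expected_action: str


TEMPLATE_RECORDS = [
    Template("Vaccines don't work", "misinformation", "high", "remove"),
    Template("I will hurt you", "harassment", "high", "remove"),  # assumed values
    Template("Had a great day!", "safe", "none", "allow"),  # assumed values
]


def ground_truth(template: Template) -> Dict[str, str]:
    """Project a template onto the ground-truth label used for grading."""
    return {
        "violation_type": template.violation_type,
        "severity": template.severity,
        "expected_action": template.expected_action,
    }
```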
---
### 2. Controlled Variations
To increase diversity while preserving ground truth:
* tone variation (angry, sarcastic, neutral)
* noise injection (irrelevant text, emojis, typos)
* paraphrasing using rule-based transformations
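A small sketch of noise injection that changes the surface text while leaving the label untouched. The `FILLERS` and `EMOJIS` lists and the adjacent-character swap used as a "typo" are illustrative choices, not the project's actual transformations.

```python
import random
from typing import Dict

FILLERS = ["lol", "just saying", "fyi"]  # assumed irrelevant text
EMOJIS = ["🙂", "🔥", "😡"]  # assumed emoji pool


def add_noise(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters in one word, then append filler and an emoji."""
    words = text.split()
    if words:
        idx = rng.randrange(len(words))
        word = words[idx]
        if len(word) > 3:  # leave very short words intact
            i = rng.randrange(len(word) - 1)
            words[idx] = word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return " ".join(words + [rng.choice(FILLERS), rng.choice(EMOJIS)])


def vary(scenario: Dict[str, str], seed: int) -> Dict[str, str]:
    """Return a noisy copy of the scenario; every field except the text is kept."""
    rng = random.Random(seed)
    varied = dict(scenario)
    varied["post_content"] = add_noise(scenario["post_content"], rng)
    return varied
```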
---
### 3. Context Generation
Additional signals are generated programmatically:
* **user_history**
  * prior violations count
  * repeated behavior patterns
* **reports**
  * number of reports aligned with severity
* **engagement**
  * virality score (affects task difficulty)
* **geo**
  * region-specific overrides (policy-dependent)
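The signals above might be produced by a seeded generator like the one below. The report-count ranges in `REPORTS_BY_SEVERITY`, the cap of five prior violations, and the function name `make_context` are assumptions made for illustration.

```python
import random
from typing import Any, Dict

# Assumed (min, max) report counts per severity level.
REPORTS_BY_SEVERITY = {
    "none": (0, 1),
    "low": (1, 3),
    "medium": (3, 10),
    "high": (10, 50),
}
REGIONS = ["IN", "EU", "US"]  # region identifiers used as examples in this spec


def make_context(severity: str, seed: int) -> Dict[str, Any]:
    """Generate context signals for one scenario, reproducibly from `seed`."""
    rng = random.Random(seed)
    low, high = REPORTS_BY_SEVERITY[severity]
    return {
        "user_history": {"prior_violations": rng.randint(0, 5)},
        "reports": {"count": rng.randint(low, high)},  # aligned with severity
        "engagement": {"virality_score": round(rng.random(), 3)},
        "geo": rng.choice(REGIONS),
    }
```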
---
### 4. Ground Truth Labeling
Each generated scenario includes:
```json
{
  "violation_type": "misinformation",
  "severity": "high",
  "expected_action": "remove"
}
```
This ensures:
* fully deterministic grading
* no ambiguity in evaluation
---
### 5. Reproducibility
* generation uses fixed random seeds
* each task level has a fixed dataset
* identical inputs produce identical outputs
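A minimal illustration of seed-driven reproducibility, assuming generation draws from a private `random.Random` instance rather than the global generator. `build_dataset` is a hypothetical helper; the templates are the ones listed earlier in this spec.

```python
import random
from typing import Dict, List

TEMPLATES = {
    "harassment": ["I will hurt you", "You are worthless"],
    "misinformation": ["Vaccines don't work", "This cures all diseases"],
    "safe": ["Had a great day!", "Check out this cool product"],
}


def build_dataset(seed: int, size: int) -> List[Dict[str, str]]:
    """Build `size` labeled posts; the same seed always gives the same list."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    labels = sorted(TEMPLATES)  # sorted so the order never depends on dict layout
    dataset = []
    for _ in range(size):
        label = rng.choice(labels)
        dataset.append(
            {"violation_type": label, "post_content": rng.choice(TEMPLATES[label])}
        )
    return dataset
```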
---
## 🔁 RL Framing
This environment models moderation as a **sequential decision-making problem**:
* actions influence future state
* rewards accumulate over steps
* agents must learn optimal action sequences
---
## 📦 OpenEnv Compliance
The environment implements:
* `reset()` → returns initial observation
* `step(action)` → returns (observation, reward, done, info)
* `state()` → returns current state
Additionally:
* typed models for Observation, Action, Reward
* `openenv.yaml` for metadata
* reproducible baseline inference
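The sketch below follows the three method signatures as listed in this section; it does not use the OpenEnv base classes, whose exact interfaces are not shown here. `ModerationEnv`, its constructor arguments, and the placeholder reward (1.0 for the expected action, 0.0 otherwise) are assumptions that simplify the Reward Design section.

```python
from typing import Any, Dict, Tuple

DECISIONS = {"allow", "flag", "remove", "escalate"}


class ModerationEnv:
    """Toy environment exposing reset(), step(action), and state()."""

    def __init__(self, scenario: Dict[str, Any], max_steps: int = 5) -> None:
        self._scenario = scenario
        self._max_steps = max_steps
        self._steps = 0
        self._done = False

    def reset(self) -> Dict[str, Any]:
        self._steps = 0
        self._done = False
        return self._observe()

    def step(self, action: str) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        if self._done:
            raise RuntimeError("episode finished; call reset()")
        self._steps += 1
        reward = 0.0
        if action in DECISIONS:
            # Placeholder reward, not the full table from Reward Design.
            reward = 1.0 if action == self._scenario["expected_action"] else 0.0
            self._done = True
        elif self._steps >= self._max_steps:
            self._done = True
        return self._observe(), reward, self._done, {"steps": self._steps}

    def state(self) -> Dict[str, Any]:
        """Full internal state, including the ground truth hidden from the agent."""
        return {"steps": self._steps, "done": self._done, **self._scenario}

    def _observe(self) -> Dict[str, Any]:
        return {"post_content": self._scenario["post_content"]}
```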
---
## 🧠 One-Line Summary
> A deterministic, multi-step content moderation environment where agents learn to enforce platform policies under ambiguity, context, and real-world constraints.
---