# 📄 SPECS.md — AI Community Moderation & Escalation Environment

---

## 🧠 Overview

This project defines an **OpenEnv-compliant environment** that simulates a real-world **content moderation and policy enforcement system**.

The agent operates as a moderation decision-maker responsible for:

* identifying policy violations
* investigating context
* enforcing appropriate actions
* handling ambiguity under platform rules

Unlike static classification tasks, this environment models moderation as a **multi-step decision process** with **deterministic grading derived from a formal policy document**.

---

## 🎯 Problem Definition

Given:

* a user-generated post
* associated context (user history, reports, geo, thread)
* a predefined policy document

The agent must:

1. Investigate the content
2. Identify violation type (if any)
3. Decide and execute an appropriate moderation action

---

## 🧩 Core Design Principles

### 1. Deterministic Evaluation

All decisions are evaluated against:

* a **synthetic policy document**
* predefined violation taxonomy
* rule-based ground truth

No subjective scoring is allowed.

---

### 2. Multi-Step Decision Process

```text
observe → investigate → classify → decide → enforce
```

Each step contributes to the final reward.

---

### 3. Partial Observability

The agent does not initially receive full information and must:

* request additional context
* inspect user behavior
* interpret signals over time

---

### 4. Real-World Alignment

The environment reflects real moderation challenges:

* misinformation
* harassment
* context-dependent violations
* geo-specific policy differences

---

## 🧱 Environment Components

### 📥 State (Observation)

Each state includes:

* `post_content`: text of the content
* `user_history`: the user's last N posts and prior violation count
* `reports`: number and type of reports
* `engagement`: likes, shares, virality score
* `geo`: region identifier (e.g., IN, EU, US)
* `thread_context`: previous messages (optional)
* `policy_context`: relevant policy snippets (optional)
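
The fields above could be typed as a dataclass in which anything the agent has to fetch starts out as `None`. This is only a sketch: the field types, the nested dictionary keys, and the `hidden_fields` helper are assumptions for illustration, and treating `user_history` as initially hidden is inferred from the `fetch_user_history` action rather than stated by the spec.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Observation:
    post_content: str
    geo: str
    # Assumed shapes: report type -> count, metric name -> value.
    reports: Dict[str, int] = field(default_factory=dict)
    engagement: Dict[str, float] = field(default_factory=dict)
    # Fields the agent may need to request; None means "not yet revealed".
    user_history: Optional[dict] = None
    thread_context: Optional[List[str]] = None
    policy_context: Optional[List[str]] = None

    def hidden_fields(self) -> List[str]:
        """Names of fetchable fields that have not been revealed yet."""
        fetchable = ("user_history", "thread_context", "policy_context")
        return [name for name in fetchable if getattr(self, name) is None]


obs = Observation(
    post_content="Had a great day!",
    geo="IN",
    reports={"spam": 1},
    engagement={"likes": 3.0, "shares": 0.0, "virality": 0.01},
)
```

Calling `obs.hidden_fields()` on a fresh observation lists everything the agent could still investigate.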

---

### 🎮 Action Space

#### Investigation Actions

* `fetch_user_history`
* `fetch_thread_context`
* `check_policy_clause`

#### Classification Actions

* `mark_violation_type` (harassment, misinformation, safe, etc.)

#### Decision Actions

* `allow`
* `flag`
* `remove`
* `escalate`
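
A minimal way to encode the three action groups is a lookup table from action name to kind. The action names come from the lists above; the `ActionKind` enum and the `is_terminal` helper are hypothetical names, and treating every decision action as episode-ending follows the Episode Flow section.

```python
from enum import Enum


class ActionKind(Enum):
    INVESTIGATION = "investigation"
    CLASSIFICATION = "classification"
    DECISION = "decision"


# Action names taken from the spec; the grouping mirrors its three headings.
ACTION_KINDS = {
    "fetch_user_history": ActionKind.INVESTIGATION,
    "fetch_thread_context": ActionKind.INVESTIGATION,
    "check_policy_clause": ActionKind.INVESTIGATION,
    "mark_violation_type": ActionKind.CLASSIFICATION,
    "allow": ActionKind.DECISION,
    "flag": ActionKind.DECISION,
    "remove": ActionKind.DECISION,
    "escalate": ActionKind.DECISION,
}


def is_terminal(action_name: str) -> bool:
    """A decision action ends the episode; other actions do not."""
    return ACTION_KINDS[action_name] is ActionKind.DECISION
```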

---

### 🔁 Episode Flow

1. `reset()` initializes a moderation scenario
2. Agent receives initial state
3. Agent performs a sequence of actions
4. Environment updates state accordingly
5. Rewards are assigned incrementally
6. The episode ends when:

   * a final decision is made, or
   * the maximum number of steps is reached
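
The flow above can be sketched as a plain interaction loop. `StubEnv` is a hypothetical stand-in that always returns zero reward; it exists only so the loop runs, and `run_episode` and the scripted policy are likewise illustrative names rather than part of the spec.

```python
from typing import Callable, List, Tuple

DECISION_ACTIONS = {"allow", "flag", "remove", "escalate"}


class StubEnv:
    """Hypothetical stand-in: fixed observation, zero rewards."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> dict:
        self.steps = 0
        return {"post_content": "I will hurt you"}

    def step(self, action: str) -> Tuple[dict, float, bool, dict]:
        self.steps += 1
        # Done on a final decision or when the step budget is exhausted.
        done = action in DECISION_ACTIONS or self.steps >= self.max_steps
        return {"post_content": "I will hurt you"}, 0.0, done, {}


def run_episode(env, policy: Callable[[dict], str]) -> Tuple[float, List[str]]:
    """Run one episode and return (total reward, list of actions taken)."""
    obs = env.reset()
    total, done, trace = 0.0, False, []
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total += reward
        trace.append(action)
    return total, trace


# A scripted policy: investigate, classify, then decide.
script = iter(["fetch_user_history", "mark_violation_type", "remove"])
total, trace = run_episode(StubEnv(), lambda obs: next(script))
```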

---

## 🏆 Reward Design

### Step-level Rewards

* correct investigation step → +0.2
* correct violation classification → +0.5
* unnecessary action → -0.1

### Final Decision Rewards

* correct action → +1.0
* acceptable but suboptimal → +0.5
* false positive → -0.5
* false negative → -2.0

### Escalation Handling

* correct escalation → +0.5
* unnecessary escalation → -0.2
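
As a worked example, an episode with one correct investigation step, a correct classification, and a correct final action accumulates 0.2 + 0.5 + 1.0 = 1.7, while an unnecessary action followed by a false negative yields -0.1 - 2.0 = -2.1. The sketch below encodes the reward values listed above; the event names and the `episode_return` helper are hypothetical, and it assumes the environment has already labeled each step with the event that applies.

```python
STEP_REWARDS = {
    "correct_investigation": 0.2,
    "correct_classification": 0.5,
    "unnecessary_action": -0.1,
}
FINAL_REWARDS = {
    "correct_action": 1.0,
    "suboptimal_action": 0.5,
    "false_positive": -0.5,
    "false_negative": -2.0,
}
ESCALATION_REWARDS = {
    "correct_escalation": 0.5,
    "unnecessary_escalation": -0.2,
}
REWARDS = {**STEP_REWARDS, **FINAL_REWARDS, **ESCALATION_REWARDS}


def episode_return(events):
    """Sum the rewards for a sequence of labeled events (rounded to cents)."""
    return round(sum(REWARDS[event] for event in events), 2)


good = episode_return(
    ["correct_investigation", "correct_classification", "correct_action"]
)
missed = episode_return(["unnecessary_action", "false_negative"])
```

The asymmetry between `false_positive` and `false_negative` means a missed violation costs four times as much as a wrongful enforcement.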

---

## 🎯 Tasks

### 🟢 Task 1 — Clear Violations (Easy)

* explicit harmful content
* no ambiguity
* minimal context required

**Goal:** correct identification and action

---

### 🟡 Task 2 — Ambiguous Content (Medium)

* sarcasm, criticism, borderline cases
* requires context and policy reasoning

**Goal:** balanced moderation decision

---

### 🔴 Task 3 — Contextual & Geo-Aware Moderation (Hard)

* misinformation, repeated offenders, viral posts
* geo-specific policy differences
* multi-step reasoning required

**Goal:** optimize long-term safety and correctness

---

## 🧪 Grading Strategy

Each task includes a **deterministic grader** that:

* compares agent decisions to ground truth derived from policy rules
* assigns a score between **0.0 and 1.0**
* accounts for:

  * correctness
  * efficiency
  * decision quality
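
One possible shape for such a grader is a fixed weighted sum of the three factors, clamped to the unit interval. The spec does not give weights or a formula, so the 0.4 / 0.5 / 0.1 split, the `acceptable` parameter, and the efficiency definition below are all assumptions chosen only to make the sketch concrete.

```python
def grade(predicted_type, action, steps_used, truth, max_steps=8, acceptable=()):
    """Deterministic score in [0.0, 1.0]; the weights are illustrative only."""
    # Correctness: did the classification match the ground-truth label?
    correctness = 1.0 if predicted_type == truth["violation_type"] else 0.0

    # Decision quality: exact match, acceptable alternative, or miss.
    if action == truth["expected_action"]:
        quality = 1.0
    elif action in acceptable:
        quality = 0.5
    else:
        quality = 0.0

    # Efficiency: fewer steps used leaves a larger fraction of the budget.
    efficiency = max(0.0, 1.0 - steps_used / max_steps)

    score = 0.4 * correctness + 0.5 * quality + 0.1 * efficiency
    return min(1.0, max(0.0, score))


truth = {"violation_type": "misinformation", "severity": "high",
         "expected_action": "remove"}
best = grade("misinformation", "remove", steps_used=2, truth=truth)
partial = grade("misinformation", "flag", steps_used=2, truth=truth,
                acceptable=("flag",))
wrong = grade("safe", "allow", steps_used=2, truth=truth)
```

Because the function has no randomness and reads only its arguments, identical episodes always receive identical scores.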

---

## 🧬 Data Creation Strategy

To ensure deterministic grading and reproducibility, all data in the environment is **synthetically generated using controlled templates and rules**.

### 1. Template-Based Generation

Posts are generated using predefined templates mapped to violation types:

```python
# Maps each violation type (the ground-truth label) to its seed post texts.
TEMPLATES = {
    "harassment": ["I will hurt you", "You are worthless"],
    "misinformation": ["Vaccines don't work", "This cures all diseases"],
    "safe": ["Had a great day!", "Check out this cool product"],
}
```

Each template is paired with:

* known violation label
* severity level
* expected action
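
To carry those three attributes alongside the text, each template could be stored as a small immutable record. The `Template` class and `ground_truth` helper are hypothetical, and the severity and action values assigned to each example post are illustrative guesses, not values fixed by the policy document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    text: str
    violation_type: str
    severity: str          # assumed levels: "none", "low", "high"
    expected_action: str   # one of: allow, flag, remove, escalate


# Illustrative records; the severity/action assignments are assumptions.
TEMPLATE_RECORDS = [
    Template("I will hurt you", "harassment", "high", "remove"),
    Template("Vaccines don't work", "misinformation", "high", "remove"),
    Template("Had a great day!", "safe", "none", "allow"),
]


def ground_truth(template: Template) -> dict:
    """Extract the grading label for a scenario built from this template."""
    return {
        "violation_type": template.violation_type,
        "severity": template.severity,
        "expected_action": template.expected_action,
    }


labels = [ground_truth(t) for t in TEMPLATE_RECORDS]
```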

---

### 2. Controlled Variations

To increase diversity while preserving ground truth:

* tone variation (angry, sarcastic, neutral)
* noise injection (irrelevant text, emojis, typos)
* paraphrasing using rule-based transformations
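
A sketch of noise injection that keeps the label attached to every variant is shown below. The emoji list, the filler prefix, and the function names are hypothetical; the point is only that the transformation touches the surface text while the label is passed through unchanged.

```python
import random

EMOJIS = ["🙂", "🔥", "👍"]       # hypothetical noise tokens
FILLER_PREFIX = "btw, "           # hypothetical irrelevant text


def inject_noise(text: str, rng: random.Random) -> str:
    """Optionally add a filler prefix and/or a trailing emoji."""
    variant = text
    if rng.random() < 0.5:
        variant = variant + " " + rng.choice(EMOJIS)
    if rng.random() < 0.5:
        variant = FILLER_PREFIX + variant
    return variant


def make_variants(text: str, label: str, n: int, seed: int):
    """Return n (variant_text, label) pairs; the label is never altered."""
    rng = random.Random(seed)  # local RNG, so the global state is untouched
    return [(inject_noise(text, rng), label) for _ in range(n)]


variants = make_variants("You are worthless", "harassment", n=4, seed=42)
```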

---

### 3. Context Generation

Additional signals are generated programmatically:

* **user_history**

  * prior violations count
  * repeated behavior patterns

* **reports**

  * number of reports aligned with severity

* **engagement**

  * virality score (affects task difficulty)

* **geo**

  * region-specific overrides (policy-dependent)
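
These signals could be produced by a seeded generator keyed on severity, as in the sketch below. The report ranges, the prior-violation range, and the dictionary keys are assumptions; the spec states only that report counts align with severity.

```python
import random

# Hypothetical (min, max) report counts per severity level.
REPORT_RANGES = {"none": (0, 1), "low": (1, 5), "high": (10, 50)}


def generate_context(severity: str, geo: str, seed: int) -> dict:
    """Build context signals deterministically from (severity, geo, seed)."""
    rng = random.Random(seed)
    low, high = REPORT_RANGES[severity]
    prior = 0 if severity == "none" else rng.randint(0, 3)
    return {
        "user_history": {"prior_violations": prior},
        "reports": {"count": rng.randint(low, high)},
        "engagement": {"virality": round(rng.random(), 3)},
        "geo": geo,
    }


context = generate_context("high", "EU", seed=7)
```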

---

### 4. Ground Truth Labeling

Each generated scenario includes:

```json
{
  "violation_type": "misinformation",
  "severity": "high",
  "expected_action": "remove"
}
```

This ensures:

* fully deterministic grading
* no ambiguity in evaluation

---

### 5. Reproducibility

* fixed random seeds are used for generation
* fixed datasets per task level
* identical inputs produce identical outputs
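
One way to check this property is to fingerprint a generated dataset and compare hashes across runs. The generator below is a deliberately tiny, hypothetical stand-in for the real pipeline; only the seeding and hashing pattern is the point.

```python
import hashlib
import json
import random

POSTS = ["I will hurt you", "Vaccines don't work", "Had a great day!"]


def build_dataset(seed: int, size: int) -> list:
    """Generate a toy dataset from a seeded, local random generator."""
    rng = random.Random(seed)
    return [
        {"id": i, "post_content": rng.choice(POSTS),
         "virality": round(rng.random(), 3)}
        for i in range(size)
    ]


def fingerprint(dataset: list) -> str:
    """SHA-256 over a canonical JSON encoding of the dataset."""
    payload = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = fingerprint(build_dataset(seed=123, size=10))
second = fingerprint(build_dataset(seed=123, size=10))
```

If `first` and `second` ever differ, some part of the generator depends on state other than the seed.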

---

## 🔁 RL Framing

This environment models moderation as a **sequential decision-making problem**:

* actions influence future state
* rewards accumulate over steps
* agents must learn optimal action sequences

---

## 📦 OpenEnv Compliance

The environment implements:

* `reset()` → returns initial observation
* `step(action)` → returns `(observation, reward, done, info)`
* `state()` → returns current state

Additionally:

* typed models for Observation, Action, Reward
* `openenv.yaml` for metadata
* reproducible baseline inference
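
The three methods can be illustrated with a plain-Python class. This sketch does not use the OpenEnv library or its base classes; `ModerationEnv`, the scenario dictionary layout, and the reward logic (reduced here to +1.0 for the correct final action and 0.0 otherwise) are simplifying assumptions.

```python
from typing import Tuple

DECISION_ACTIONS = {"allow", "flag", "remove", "escalate"}


class ModerationEnv:
    def __init__(self, scenario: dict, max_steps: int = 8) -> None:
        self._scenario = scenario
        self._max_steps = max_steps
        self._revealed: dict = {}
        self._steps = 0
        self._done = False

    def reset(self) -> dict:
        """Start an episode with only the post and region visible."""
        self._steps, self._done = 0, False
        self._revealed = {"post_content": self._scenario["post_content"],
                          "geo": self._scenario["geo"]}
        return dict(self._revealed)

    def step(self, action: str) -> Tuple[dict, float, bool, dict]:
        if self._done:
            raise RuntimeError("episode finished; call reset()")
        self._steps += 1
        reward = 0.0
        if action == "fetch_user_history":
            # Investigation reveals a previously hidden field.
            self._revealed["user_history"] = self._scenario["user_history"]
        elif action in DECISION_ACTIONS:
            reward = 1.0 if action == self._scenario["expected_action"] else 0.0
            self._done = True
        if self._steps >= self._max_steps:
            self._done = True
        return dict(self._revealed), reward, self._done, {"steps": self._steps}

    def state(self) -> dict:
        """Snapshot of what is currently visible plus episode bookkeeping."""
        return {"observation": dict(self._revealed),
                "steps": self._steps, "done": self._done}


scenario = {"post_content": "Vaccines don't work", "geo": "US",
            "user_history": {"prior_violations": 2},
            "expected_action": "remove"}
env = ModerationEnv(scenario)
initial = env.reset()
after_fetch, _, _, _ = env.step("fetch_user_history")
final_obs, final_reward, done, info = env.step("remove")
```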

---

## 🧠 One-Line Summary

> A deterministic, multi-step content moderation environment where agents learn to enforce platform policies under ambiguity, context, and real-world constraints.

---