Spaces: ANI00/content-moderation-env

ANI00 committed on Apr 8

Commit

eb0a4a1

0 Parent(s):

first commit

Files changed (14)

.gitignore +7 -0
README.md +210 -0
inference.py +238 -0
server/Dockerfile +27 -0
server/__init__.py +0 -0
server/deepfake_model.py +89 -0
server/env.py +121 -0
server/graders.py +94 -0
server/main.py +61 -0
server/models.py +44 -0
server/openenv.yaml +89 -0
server/requirements.txt +12 -0
server/tasks.py +192 -0
test/test.py +243 -0

.gitignore ADDED

	@@ -0,0 +1,7 @@

+__pycache__
+*.pyc
+.env
+.sixth
+.pytest_cache
+.coverage

README.md ADDED

	@@ -0,0 +1,210 @@

+# Content Moderation OpenEnv
+An AI content moderation environment built to the OpenEnv specification. Agents triage real-world content — spam emails, harmful social media posts, and AI-generated deepfakes — using a standard `step()` / `reset()` / `state()` API.
+---
+## Environment Description
+Content moderation is a high-stakes, high-volume real-world task. Human moderators review millions of items daily. This environment simulates that pipeline across three difficulty levels:
+| Task | Difficulty | Content | Description |
+|---|---|---|---|
+| `text_spam` | Easy | Email / SMS | Binary spam vs. legitimate classification |
+| `content_moderation` | Medium | Social media posts | Multi-label violation detection |
+| `deepfake_detection` | Hard | Multimodal | Detect AI-manipulated media + make policy decisions |
+Models used:
+- **Deepfake detection**: `dima806/deepfake_vs_real_image_detection` (ViT, ~300MB)
+- **Agent LLM**: `meta-llama/Llama-3.1-8B-Instruct` (via transformers pipeline or HF API)
+---
+## Action Space
+```json
+{
+  "decision": "approve | reject | escalate | flag",
+  "reason": "string (explanation)",
+  "confidence": 0.0–1.0,
+  "labels": ["spam", "scam", "phishing", "pharmaceutical_spam", "hate_speech",
+             "violence", "harassment", "misinformation", "adult_content",
+             "deepfake", "political_manipulation", "fraud"]
+}
+```
+- **approve** — content is safe
+- **reject** — clear policy violation, remove immediately
+- **flag** — borderline, surface to human review
+- **escalate** — uncertain, route to specialist
+---
+## Observation Space
+```json
+{
+  "content_id": "string",
+  "content_type": "text | multimodal",
+  "text": "string (optional)",
+  "image_description": "string (optional, deepfake task)",
+  "detector_score": 0.0–1.0 (optional, deepfake task),
+  "metadata": { "source": "...", "report_count": 0, ... },
+  "step_num": 1,
+  "total_steps": 5
+}
+```
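+The `detector_score` field (deepfake task only) comes from the `dima806/deepfake_vs_real_image_detection` model; higher values indicate a higher probability of a deepfake. In this commit the model scores a synthetic placeholder image generated from each item's ground truth, and fixed fallback values are used if the model cannot be loaded or scoring fails.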
+---
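+## Reward Functions
+Each step reward is the sum of the components below, clamped to the range 0.0 to 1.0.
+### text_spam (easy)
+- Correct decision: **+0.65**
+- Escalate on harmful content: **+0.30** (partial credit)
+- Flag on legitimate content: **+0.20** (partial credit)
+- Label F1 contribution: up to **+0.20**
+- Confidence calibration: **+0.10 × confidence** if the decision is correct, **−0.15 × confidence** otherwise
+### content_moderation (medium)
+- Correct decision: **+0.50**
+- Partial credit (flag/escalate/reject on harmful content when that is not the expected decision): **+0.25**
+- Escalate on benign content: **+0.15**
+- Label F1 contribution: up to **+0.35**
+- Confidence calibration: **+0.10 × confidence** if correct, **−0.15 × confidence** otherwise
+### deepfake_detection (hard)
+- Correct decision: **+0.40**
+- Partial credit (flag/escalate/reject on harmful content when that is not the expected decision): **+0.20**
+- Deepfake detection accuracy (presence of the `deepfake` label matches ground truth): **+0.30**
+- Detector model alignment bonus (detector score above 0.5 agrees with ground truth; independent of the agent's action): **+0.10**
+- Label F1 contribution: up to **+0.20**
+- Confidence calibration: **+0.10 × confidence** if correct, **−0.15 × confidence** otherwise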
+---
+## API Endpoints
+| Method | Path | Description |
+|---|---|---|
+| POST | `/reset` | Start new episode. Body: `{"task": "text_spam"}` |
+| POST | `/step` | Submit action. Body: action JSON |
+| GET | `/state` | Current episode state |
+| POST | `/close` | End episode and clean up |
+| GET | `/tasks` | List all available tasks |
+| GET | `/health` | Health check |
+---
+## Setup & Usage
+### Requirements
+- Docker
+- Python 3.11+
+- `openenv-core` (`pip install openenv-core`)
+### Run with Docker
+```bash
+cd content-moderation-env
+# Build
+docker build -f server/Dockerfile -t content-moderation-env .
+# Run
+docker run -p 7860:7860 content-moderation-env
+```
+### Run locally
+```bash
+pip install -r server/requirements.txt
+uvicorn server.main:app --host 0.0.0.0 --port 7860
+```
+### Validate
+```bash
+openenv validate   # from project root
+```
+---
+## Inference Script
+```bash
+# API mode (HF inference endpoint)
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
+export HF_TOKEN="hf_your_token_here"
+export SERVER_URL="http://localhost:7860"
+export TASK_NAME="text_spam"
+python inference.py
+# Local transformers pipeline mode
+export USE_LOCAL_MODEL="true"
+python inference.py
+```
+### Output format
+```
+[START] task=text_spam env=content_moderation_env model=meta-llama/Llama-3.1-8B-Instruct
+[STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} reward=0.85 done=false error=null
+[STEP] step=2 action={"decision":"approve","confidence":0.8,"labels":[]} reward=0.75 done=false error=null
+...
+[END] success=true steps=5 score=0.610 rewards=0.85,0.75,0.00,0.80,0.65
+```
+---
+## Run Tests
+```bash
+pip install pytest
+pytest test/test.py -v
+```
+---
+## Baseline Scores (Llama-3.1-8B-Instruct, temperature=0.2)
+| Task | Score | Notes |
+|---|---|---|
+| `text_spam` | ~0.72 | Strong on obvious spam, weaker on phishing |
+| `content_moderation` | ~0.58 | Good decision, weaker multi-label coverage |
+| `deepfake_detection` | ~0.44 | Relies heavily on image description cues |
+---
+## HuggingFace Spaces Deployment
+Create a Space with Docker SDK, push this repo, and set:
+- `HF_TOKEN` (secret)
+- `API_BASE_URL` (variable)
+- `MODEL_NAME` (variable)
+The Space URL becomes your `PING_URL` for the validation script.
+---
+## Project Structure
+```
+content-moderation-env/
+├── server/
+│   ├── __init__.py
+│   ├── main.py          # FastAPI app + endpoints
+│   ├── env.py           # OpenEnv environment (step/reset/state/close)
+│   ├── models.py        # Pydantic action/observation models
+│   ├── tasks.py         # Task datasets + ground truth
+│   ├── graders.py       # Reward functions per task
+│   ├── deepfake_model.py# HF deepfake detection pipeline
+│   ├── openenv.yaml     # OpenEnv metadata spec
+│   ├── requirements.txt
+│   └── Dockerfile
+├── test/
+│   └── test.py          # pytest suite (20+ tests)
+├── inference.py         # Baseline agent script
+└── README.md
+```
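Note on the action space: in this commit the server's `ModerationAction` model only enforces the `confidence` range, so a client may want to check the remaining fields itself before calling `/step`. The sketch below is one way to do that. `validate_action` and its messages are hypothetical names that do not appear in the repository; the decision and label sets are copied from `inference.py`.

```python
from typing import Any, Dict, List

# Values copied from VALID_DECISIONS / VALID_LABELS in inference.py.
VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = {
    "spam", "scam", "phishing", "pharmaceutical_spam",
    "hate_speech", "violence", "harassment", "misinformation",
    "adult_content", "deepfake", "political_manipulation", "fraud",
}


def validate_action(action: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems (empty means valid)."""
    problems: List[str] = []

    decision = action.get("decision")
    if not isinstance(decision, str) or decision not in VALID_DECISIONS:
        problems.append("decision must be one of approve/reject/escalate/flag")

    if not isinstance(action.get("reason"), str):
        problems.append("reason must be a string")

    conf = action.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")

    labels = action.get("labels", [])  # labels are optional, as in models.py
    if not isinstance(labels, list):
        problems.append("labels must be a list")
    else:
        unknown = [l for l in labels if not isinstance(l, str) or l not in VALID_LABELS]
        if unknown:
            problems.append(f"unknown labels: {unknown}")

    return problems
```

Returning a list of problems rather than raising keeps the check usable both for logging and for falling back to a default action.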

inference.py ADDED

	@@ -0,0 +1,238 @@

+"""
+Content Moderation Inference Script
+Env vars: API_BASE_URL, MODEL_NAME, HF_TOKEN, SERVER_URL, TASK_NAME
+USE_LOCAL_MODEL=true  → local transformers pipeline (Llama-3.1-8B-Instruct)
+USE_LOCAL_MODEL=false → OpenAI-compatible HF API (default)
+"""
+import json
+import os
+import textwrap
+from typing import Dict, Any, List, Optional
+from dotenv import load_dotenv
+import requests
+# Load environment variables from .env file
+load_dotenv()
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")
+HF_TOKEN = os.getenv("HF_TOKEN", "")
+API_KEY = HF_TOKEN or os.getenv("API_KEY", "")
+SERVER_URL = os.getenv("SERVER_URL", "http://localhost:7860").rstrip("/")
+TASK_NAME = os.getenv("TASK_NAME", "text_spam")
+BENCHMARK = "content_moderation_env"
+MAX_STEPS = 10
+USE_LOCAL = os.getenv("USE_LOCAL_MODEL", "false").lower() == "true"
+SUCCESS_SCORE_THRESHOLD = 0.5
+VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
+VALID_LABELS = [
+    "spam", "scam", "phishing", "pharmaceutical_spam",
+    "hate_speech", "violence", "harassment", "misinformation",
+    "adult_content", "deepfake", "political_manipulation", "fraud",
+]
+SYSTEM_PROMPT = textwrap.dedent("""
+You are an expert content moderation AI. For each piece of content, analyze it carefully and respond ONLY with a valid JSON object.
+Required JSON fields:
+- "decision": one of "approve", "reject", "escalate", "flag"
+- "reason": brief explanation (max 2 sentences)
+- "confidence": float between 0.0 and 1.0
+- "labels": list of applicable labels from: spam, scam, phishing, pharmaceutical_spam, hate_speech, violence, harassment, misinformation, adult_content, deepfake, political_manipulation, fraud
+Decision guidelines:
+- approve: content is safe and legitimate
+- reject: content clearly violates policy (spam, hate speech, deepfake, etc.)
+- flag: content is borderline or needs human review
+- escalate: uncertain, requires specialist review
+Output ONLY the JSON object, no other text.
+""").strip()
+def log_start(task: str, env: str, model: str):
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]):
+    err = error if error else "null"
+    print(f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}", flush=True)
+def log_end(success: bool, steps: int, score: float, rewards: List[float]):
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+def build_prompt(obs: Dict[str, Any]) -> str:
+    parts = [f"Content ID: {obs.get('content_id', 'unknown')}"]
+    parts.append(f"Type: {obs.get('content_type', 'text')}")
+    if obs.get("text"):
+        parts.append(f"Text: {obs['text']}")
+    if obs.get("image_description"):
+        parts.append(f"Image analysis: {obs['image_description']}")
+    if obs.get("detector_score") is not None:
+        score = obs["detector_score"]
+        parts.append(f"Deepfake detector score (higher = more likely fake): {score:.3f}")
+    meta = obs.get("metadata", {})
+    if meta:
+        meta_str = ", ".join(f"{k}={v}" for k, v in meta.items())
+        parts.append(f"Metadata: {meta_str}")
+    parts.append(f"\nStep {obs.get('step_num', '?')} of {obs.get('total_steps', '?')}")
+    return "\n".join(parts)
+def _default_action() -> Dict:
+    return {"decision": "escalate", "reason": "Unable to analyze content.", "confidence": 0.3, "labels": []}
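+_local_pipe = None  # cached so the model is loaded once per process, not once per step
+def call_local_model(prompt: str) -> Dict:
+    global _local_pipe
+    if _local_pipe is None:
+        from transformers import pipeline
+        _local_pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": prompt},
+    ]
+    output = _local_pipe(messages, max_new_tokens=256, temperature=0.2, do_sample=True)
+    # Chat-style input returns the whole conversation; the last message is the model's reply.
+    text = output[0]["generated_text"]
+    if isinstance(text, list):
+        text = text[-1].get("content", "")
+    return parse_llm_response(text)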
+def call_api_model(prompt: str) -> Dict:
+    from openai import OpenAI
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY or "hf_default")
+    completion = client.chat.completions.create(
+        model=MODEL_NAME,
+        messages=[
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": prompt},
+        ],
+        temperature=0.2,
+        max_tokens=256,
+    )
+    text = (completion.choices[0].message.content or "").strip()
+    return parse_llm_response(text)
+def parse_llm_response(text: str) -> Dict:
+    try:
+        start = text.find("{")
+        end = text.rfind("}") + 1
+        if start >= 0 and end > start:
+            parsed = json.loads(text[start:end])
+            decision = parsed.get("decision", "escalate")
+            if decision not in VALID_DECISIONS:
+                decision = "escalate"
+            return {
+                "decision": decision,
+                "reason": str(parsed.get("reason", ""))[:200],
+                "confidence": float(max(0.0, min(1.0, parsed.get("confidence", 0.5)))),
+                "labels": [l for l in parsed.get("labels", []) if l in VALID_LABELS],
+            }
+    except Exception:
+        pass
+    return _default_action()
+def get_decision(prompt: str) -> Dict:
+    try:
+        if USE_LOCAL:
+            return call_local_model(prompt)
+        return call_api_model(prompt)
+    except Exception as e:
+        print(f"[DEBUG] Model error: {e}", flush=True)
+        return _default_action()
+def server_reset(task: str) -> Optional[Dict]:
+    try:
+        r = requests.post(f"{SERVER_URL}/reset", json={"task": task}, timeout=30)
+        r.raise_for_status()
+        return r.json()
+    except Exception as e:
+        print(f"[DEBUG] reset error: {e}", flush=True)
+        return None
+def server_step(action: Dict) -> Optional[Dict]:
+    try:
+        r = requests.post(f"{SERVER_URL}/step", json=action, timeout=30)
+        r.raise_for_status()
+        return r.json()
+    except Exception as e:
+        print(f"[DEBUG] step error: {e}", flush=True)
+        return None
+def server_close():
+    try:
+        requests.post(f"{SERVER_URL}/close", timeout=10)
+    except Exception:
+        pass
+def run_episode(task: str):
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    obs = None
+    log_start(task=task, env=BENCHMARK, model=MODEL_NAME)
+    try:
+        reset_result = server_reset(task)
+        if reset_result is None:
+            # The finally block below emits the single [END] line.
+            return
+        obs = reset_result.get("observation", {})
+        done = False
+        for step in range(1, MAX_STEPS + 1):
+            if done or obs is None:
+                break
+            prompt = build_prompt(obs)
+            action = get_decision(prompt)
+            action_str = json.dumps({k: v for k, v in action.items() if k != "reason"})
+            result = server_step(action)
+            if result is None:
+                log_step(step, action_str, 0.0, True, "server_error")
+                break
+            reward = float(result.get("reward", 0.0))
+            done = bool(result.get("done", False))
+            error = result.get("info", {}).get("error")
+            rewards.append(reward)
+            steps_taken = step
+            log_step(step, action_str, reward, done, error)
+            obs = result.get("observation")
+        total_steps_in_task = obs.get("total_steps", len(rewards)) if obs else len(rewards)
+        max_possible = float(total_steps_in_task)
+        score = sum(rewards) / max_possible if max_possible > 0 else 0.0
+        score = min(max(score, 0.0), 1.0)
+        success = score >= SUCCESS_SCORE_THRESHOLD
+    finally:
+        server_close()
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+if __name__ == "__main__":
+    run_episode(TASK_NAME)
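Note on scoring: `run_episode` reports the mean reward over the task's total steps, clamped to [0, 1], and compares it with `SUCCESS_SCORE_THRESHOLD`. The following is a minimal standalone sketch of that arithmetic; `episode_score` is a hypothetical helper name, not a function in `inference.py`.

```python
from typing import List

SUCCESS_SCORE_THRESHOLD = 0.5  # same value as in inference.py


def episode_score(rewards: List[float], total_steps: int) -> float:
    """Mean reward over the task's steps, clamped to [0.0, 1.0]."""
    if total_steps <= 0:
        return 0.0
    return min(max(sum(rewards) / total_steps, 0.0), 1.0)


# Rewards taken from the sample output in the README.
rewards = [0.85, 0.75, 0.00, 0.80, 0.65]
score = episode_score(rewards, total_steps=5)
print(f"score={score:.3f} success={score >= SUCCESS_SCORE_THRESHOLD}")
# prints: score=0.610 success=True
```

Because the divisor is the task's total step count rather than the number of rewards collected, an episode that stops early after a server error is scored as if the missing steps earned zero.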

server/Dockerfile ADDED

	@@ -0,0 +1,27 @@

+FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    HF_HOME=/app/.cache/huggingface \
+    TRANSFORMERS_CACHE=/app/.cache/huggingface
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    libgl1 libglib2.0-0 curl \
+    && rm -rf /var/lib/apt/lists/*
+COPY server/requirements.txt .
+RUN pip install --upgrade pip setuptools wheel
+RUN pip install --no-cache-dir --no-build-isolation -r requirements.txt
+COPY . .
+RUN mkdir -p /app/.cache/huggingface
+# Pre-download deepfake model to avoid runtime delays
+RUN python -c "from transformers import pipeline; pipeline('image-classification', model='dima806/deepfake_vs_real_image_detection', device=-1)" 2>&1 || echo "Model download optional"
+EXPOSE 7860
+CMD ["uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "7860"]

server/__init__.py ADDED

File without changes

server/deepfake_model.py ADDED

	@@ -0,0 +1,89 @@

+import logging
+import numpy as np
+logger = logging.getLogger(__name__)
+_pipe = None
+def _load_pipeline():
+    global _pipe
+    if _pipe is not None:
+        return _pipe
+    try:
+        from transformers import pipeline
+        _pipe = pipeline(
+            "image-classification",
+            model="dima806/deepfake_vs_real_image_detection",
+            device=-1,
+        )
+        logger.info("Deepfake detection model loaded.")
+    except Exception as e:
+        logger.warning(f"Could not load deepfake model: {e}. Using heuristic fallback.")
+        _pipe = None
+    return _pipe
+def _make_synthetic_image(is_fake: bool):
+    from PIL import Image
+    rng = np.random.default_rng(seed=1 if is_fake else 99)
+    img = Image.new("RGB", (224, 224))
+    pixels = img.load()
+    for i in range(224):
+        for j in range(224):
+            if is_fake:
+                r = int(128 + 60 * np.sin(i / 9.0) * np.cos(j / 9.0))
+                g = int(128 + 60 * np.cos(i / 7.0) * np.sin(j / 11.0))
+                b = int(128 + 40 * np.sin((i + j) / 14.0))
+            else:
+                base = int(80 + 100 * (i / 224))
+                noise = int(rng.normal(0, 12))
+                r = max(0, min(255, base + noise + 20))
+                g = max(0, min(255, base + noise))
+                b = max(0, min(255, base + noise - 15))
+            pixels[j, i] = (
+                max(0, min(255, r)),
+                max(0, min(255, g)),
+                max(0, min(255, b)),
+            )
+    return img
+def score_deepfake(is_fake: bool) -> float:
+    pipe = _load_pipeline()
+    if pipe is None:
+        return 0.78 if is_fake else 0.22
+    try:
+        img = _make_synthetic_image(is_fake)
+        results = pipe(img)
+        for r in results:
+            label_lower = r["label"].lower()
+            if any(kw in label_lower for kw in ("fake", "deepfake", "manipulat", "ai_gen", "synthetic")):
+                return float(r["score"])
+        top_label = results[0]["label"].lower()
+        top_score = float(results[0]["score"])
+        if any(kw in top_label for kw in ("real", "authentic", "genuine")):
+            return 1.0 - top_score
+        return top_score
+    except Exception as e:
+        logger.warning(f"Deepfake scoring error: {e}")
+        return 0.75 if is_fake else 0.25
+def precompute_detector_scores(items: list) -> list:
+    enriched = []
+    for item in items:
+        is_fake = item.get("ground_truth", {}).get("is_deepfake", False)
+        item = dict(item)
+        item["detector_score"] = score_deepfake(is_fake)
+        enriched.append(item)
+    return enriched
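Note on label handling: `score_deepfake` converts the classifier's label/score list into a single "probability of fake" by keyword matching on the label names. The sketch below reproduces that mapping on mock classifier output so it can be run without `transformers`. `fake_probability` is a hypothetical name, and the label strings in the examples are assumptions for illustration, not values confirmed by this commit.

```python
from typing import Dict, List, Union

FAKE_KEYWORDS = ("fake", "deepfake", "manipulat", "ai_gen", "synthetic")
REAL_KEYWORDS = ("real", "authentic", "genuine")

Result = Dict[str, Union[str, float]]


def fake_probability(results: List[Result]) -> float:
    """Map image-classification output to a fake probability in [0, 1]."""
    # Prefer any entry whose label explicitly names the fake class.
    for r in results:
        if any(kw in str(r["label"]).lower() for kw in FAKE_KEYWORDS):
            return float(r["score"])
    # Otherwise fall back to the top entry; invert it if it names the real class.
    top = results[0]
    if any(kw in str(top["label"]).lower() for kw in REAL_KEYWORDS):
        return 1.0 - float(top["score"])
    return float(top["score"])


# Mock outputs (label names are assumed, not taken from the model card).
print(fake_probability([{"label": "Real", "score": 0.9}, {"label": "Fake", "score": 0.1}]))
print(fake_probability([{"label": "LABEL_1", "score": 0.7}]))
```

The last branch treats an unrecognized top label as the fake class, which matches the source but is worth keeping in mind if the model's label names change.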

server/env.py ADDED

	@@ -0,0 +1,121 @@

+import threading
+from typing import Dict, Any, Optional
+from .models import ContentObservation, StepResult, ResetResult, EnvState, ModerationAction
+from .tasks import TASKS
+from .graders import GRADERS
+class ContentModerationEnv:
+    def __init__(self):
+        self._lock = threading.Lock()
+        self._s: Dict[str, Any] = {}
+        self._clear()
+    def _clear(self):
+        self._s = {
+            "task": None,
+            "items": [],
+            "idx": 0,
+            "total": 0,
+            "reward_sum": 0.0,
+            "done": True,
+            "history": [],
+        }
+    def _obs(self, item: Dict, idx: int, total: int) -> ContentObservation:
+        return ContentObservation(
+            content_id=item["content_id"],
+            content_type=item["content_type"],
+            text=item.get("text"),
+            image_description=item.get("image_description"),
+            detector_score=item.get("detector_score"),
+            metadata=item.get("metadata", {}),
+            step_num=idx,
+            total_steps=total,
+        )
+    def reset(self, task: str = "text_spam") -> ResetResult:
+        if task not in TASKS:
+            raise ValueError(f"Unknown task '{task}'. Valid: {list(TASKS.keys())}")
+        with self._lock:
+            task_cfg = TASKS[task]
+            items = list(task_cfg["items"])
+            if task == "deepfake_detection":
+                from .deepfake_model import precompute_detector_scores
+                items = precompute_detector_scores(items)
+            self._s = {
+                "task": task,
+                "items": items,
+                "idx": 0,
+                "total": len(items),
+                "reward_sum": 0.0,
+                "done": False,
+                "history": [],
+            }
+            return ResetResult(observation=self._obs(items[0], 1, len(items)))
+    def step(self, action: ModerationAction) -> StepResult:
+        with self._lock:
+            if self._s["done"]:
+                return StepResult(
+                    observation=None,
+                    reward=0.0,
+                    done=True,
+                    info={"error": "Episode finished. Call /reset first."},
+                )
+            idx = self._s["idx"]
+            item = self._s["items"][idx]
+            task = self._s["task"]
+            grader = GRADERS[task]
+            action_d = action.model_dump()
+            if task == "deepfake_detection":
+                reward = grader(action_d, item["ground_truth"], item.get("detector_score"))
+            else:
+                reward = grader(action_d, item["ground_truth"])
+            self._s["reward_sum"] += reward
+            self._s["idx"] += 1
+            self._s["history"].append({
+                "step": idx + 1,
+                "content_id": item["content_id"],
+                "action": action_d,
+                "reward": round(reward, 4),
+                "ground_truth": item["ground_truth"],
+            })
+            new_idx = self._s["idx"]
+            done = new_idx >= self._s["total"]
+            self._s["done"] = done
+            next_obs: Optional[ContentObservation] = None
+            if not done:
+                next_item = self._s["items"][new_idx]
+                next_obs = self._obs(next_item, new_idx + 1, self._s["total"])
+            return StepResult(
+                observation=next_obs,
+                reward=round(reward, 4),
+                done=done,
+                info={"content_id": item["content_id"], "step": idx + 1},
+            )
+    def state(self) -> EnvState:
+        with self._lock:
+            return EnvState(
+                task=self._s["task"] or "none",
+                step_num=self._s["idx"],
+                total_steps=self._s["total"],
+                cumulative_reward=round(self._s["reward_sum"], 4),
+                done=self._s["done"],
+                history=list(self._s["history"]),
+            )
+    def close(self):
+        with self._lock:
+            self._clear()
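Note on step numbering: observations carry a 1-based `step_num`, while the internal index is 0-based, and the final `step()` returns no observation with `done=True`. The toy class below illustrates only that bookkeeping. `TinyEpisode` is a hypothetical stand-in that omits grading, locking, and the Pydantic models, and it assumes a non-empty item list.

```python
from typing import Any, Dict, List


class TinyEpisode:
    """Stand-in for the index/done bookkeeping in ContentModerationEnv."""

    def __init__(self, items: List[str]):
        self.items = items  # assumed non-empty
        self.idx = 0
        self.done = True    # like the real env, nothing runs before reset()

    def _obs(self) -> Dict[str, Any]:
        return {
            "content_id": self.items[self.idx],
            "step_num": self.idx + 1,  # 1-based for the agent
            "total_steps": len(self.items),
        }

    def reset(self) -> Dict[str, Any]:
        self.idx = 0
        self.done = False
        return self._obs()

    def step(self) -> Dict[str, Any]:
        if self.done:
            return {"observation": None, "done": True, "error": "Episode finished."}
        self.idx += 1
        self.done = self.idx >= len(self.items)
        return {"observation": None if self.done else self._obs(), "done": self.done}


ep = TinyEpisode(["ts_001", "ts_002"])
print(ep.reset())  # first observation has step_num 1
print(ep.step())   # observation for step 2, done is False
print(ep.step())   # no observation, done is True
```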

server/graders.py ADDED

	@@ -0,0 +1,94 @@

+from typing import Dict, Any, List, Optional
+def _label_f1(predicted: List[str], true: List[str]) -> float:
+    pred_set, true_set = set(predicted), set(true)
+    if not true_set and not pred_set:
+        return 1.0
+    if not true_set:
+        return 0.5
+    if not pred_set:
+        return 0.0
+    tp = len(pred_set & true_set)
+    precision = tp / len(pred_set)
+    recall = tp / len(true_set)
+    if precision + recall == 0:
+        return 0.0
+    return 2 * precision * recall / (precision + recall)
+def _calibration_bonus(confidence: float, correct: bool) -> float:
+    if correct:
+        return 0.1 * confidence
+    return -0.15 * confidence
+def grade_text_spam(action: Dict, ground_truth: Dict) -> float:
+    decision = action.get("decision", "")
+    confidence = float(action.get("confidence", 0.5))
+    correct = decision == ground_truth["decision"]
+    if correct:
+        base = 0.65
+    elif ground_truth["is_harmful"] and decision == "escalate":
+        base = 0.30
+    elif not ground_truth["is_harmful"] and decision == "flag":
+        base = 0.20
+    else:
+        base = 0.0
+    label_score = 0.20 * _label_f1(action.get("labels", []), ground_truth["labels"])
+    return min(1.0, max(0.0, base + label_score + _calibration_bonus(confidence, correct)))
+def grade_content_moderation(action: Dict, ground_truth: Dict) -> float:
+    decision = action.get("decision", "")
+    confidence = float(action.get("confidence", 0.5))
+    correct = decision == ground_truth["decision"]
+    if correct:
+        decision_score = 0.50
+    elif ground_truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
+        decision_score = 0.25
+    elif not ground_truth["is_harmful"] and decision == "escalate":
+        decision_score = 0.15
+    else:
+        decision_score = 0.0
+    label_score = 0.35 * _label_f1(action.get("labels", []), ground_truth["labels"])
+    return min(1.0, max(0.0, decision_score + label_score + _calibration_bonus(confidence, correct)))
+def grade_deepfake(action: Dict, ground_truth: Dict, detector_score: Optional[float] = None) -> float:
+    decision = action.get("decision", "")
+    confidence = float(action.get("confidence", 0.5))
+    labels = action.get("labels", [])
+    is_deepfake = ground_truth.get("is_deepfake", False)
+    correct_decision = decision == ground_truth["decision"]
+    detected_deepfake = "deepfake" in [l.lower() for l in labels]
+    detection_correct = detected_deepfake == is_deepfake
+    if correct_decision:
+        decision_score = 0.40
+    elif ground_truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
+        decision_score = 0.20
+    else:
+        decision_score = 0.0
+    detection_score = 0.30 if detection_correct else 0.0
+    if detector_score is not None:
+        model_aligned = (detector_score > 0.5) == is_deepfake
+        detection_score += 0.10 if model_aligned else 0.0
+    label_score = 0.20 * _label_f1(labels, ground_truth["labels"])
+    return min(1.0, max(0.0, decision_score + detection_score + label_score + _calibration_bonus(confidence, correct_decision)))
+GRADERS = {
+    "text_spam": grade_text_spam,
+    "content_moderation": grade_content_moderation,
+    "deepfake_detection": grade_deepfake,
+}
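Worked example for the graders: the snippet below recomputes the `text_spam` reward for item `ts_001` (ground truth: `reject`, labels `spam` and `scam`) when the agent answers `reject` with confidence 0.9 and only the `spam` label. The two helpers are copies of `_label_f1` and `_calibration_bonus` from `graders.py`, renamed without the leading underscore so the snippet is self-contained.

```python
from typing import List


def label_f1(predicted: List[str], true: List[str]) -> float:
    """Set-based F1 with the same edge cases as graders._label_f1."""
    pred_set, true_set = set(predicted), set(true)
    if not true_set and not pred_set:
        return 1.0
    if not true_set:
        return 0.5
    if not pred_set:
        return 0.0
    tp = len(pred_set & true_set)
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def calibration_bonus(confidence: float, correct: bool) -> float:
    """Reward confident correct answers, penalize confident wrong ones."""
    return 0.1 * confidence if correct else -0.15 * confidence


# precision = 1/1, recall = 1/2, so F1 = 2/3
f1 = label_f1(["spam"], ["spam", "scam"])
# correct decision (0.65) + label share (0.20 * F1) + calibration (0.10 * 0.9)
reward = min(1.0, max(0.0, 0.65 + 0.20 * f1 + calibration_bonus(0.9, correct=True)))
print(round(f1, 4), round(reward, 4))
# prints: 0.6667 0.8733
```

The asymmetric calibration term means a wrong answer at confidence 1.0 costs 0.15, while a correct one gains only 0.10.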

server/main.py ADDED

	@@ -0,0 +1,61 @@

+from fastapi import FastAPI, HTTPException, Request
+from fastapi.responses import JSONResponse, RedirectResponse
+from .models import ModerationAction, StepResult, ResetResult, EnvState, ResetRequest
+from .env import ContentModerationEnv
+from .tasks import TASKS
+app = FastAPI(title="Content Moderation OpenEnv", version="1.0.0")
+_env = ContentModerationEnv()
+@app.get("/")
+async def root():
+    return RedirectResponse(url="/docs")
+@app.post("/reset", response_model=ResetResult)
+async def reset(request: Request):
+    try:
+        body = await request.json()
+    except Exception:
+        body = {}
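+    # A JSON body that is not an object (e.g. a list or string) falls back to the default task.
+    task = body.get("task", "text_spam") if isinstance(body, dict) else "text_spam"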
+    try:
+        return _env.reset(task=task)
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+@app.post("/step", response_model=StepResult)
+def step(action: ModerationAction):
+    return _env.step(action)
+@app.get("/state", response_model=EnvState)
+def state():
+    return _env.state()
+@app.post("/close")
+def close():
+    _env.close()
+    return {"status": "closed"}
+@app.get("/tasks")
+def list_tasks():
+    return {
+        name: {
+            "description": t["description"],
+            "difficulty": t["difficulty"],
+            "num_items": len(t["items"]),
+            "content_type": t["content_type"],
+        }
+        for name, t in TASKS.items()
+    }
+@app.get("/health")
+def health():
+    return {"status": "ok"}

server/models.py ADDED

	@@ -0,0 +1,44 @@

+from pydantic import BaseModel, Field
+from typing import Optional, Dict, Any, List
+class ModerationAction(BaseModel):
+    decision: str
+    reason: str
+    confidence: float = Field(ge=0.0, le=1.0)
+    labels: List[str] = []
+class ContentObservation(BaseModel):
+    content_id: str
+    content_type: str
+    text: Optional[str] = None
+    image_description: Optional[str] = None
+    detector_score: Optional[float] = None
+    metadata: Dict[str, Any] = {}
+    step_num: int
+    total_steps: int
+class StepResult(BaseModel):
+    observation: Optional[ContentObservation] = None
+    reward: float
+    done: bool
+    info: Dict[str, Any] = {}
+class ResetResult(BaseModel):
+    observation: ContentObservation
+class EnvState(BaseModel):
+    task: str
+    step_num: int
+    total_steps: int
+    cumulative_reward: float
+    done: bool
+    history: List[Dict[str, Any]] = []
+class ResetRequest(BaseModel):
+    task: Optional[str] = "text_spam"

server/openenv.yaml ADDED

	@@ -0,0 +1,89 @@

+name: content-moderation-env
+version: "1.0.0"
+description: >
+  AI-powered content moderation environment. Agents triage text, social
+  posts, and multimodal content (including deepfake detection) across
+  three difficulty levels using the standard OpenEnv step/reset/state API.
+author: openenv-participant
+license: MIT
+tasks:
+  - id: text_spam
+    difficulty: easy
+    description: Classify email/message content as spam or legitimate
+    content_type: text
+    num_items: 5
+    score_range: [0.0, 1.0]
+  - id: content_moderation
+    difficulty: medium
+    description: Multi-label social media content moderation
+    content_type: text
+    num_items: 5
+    score_range: [0.0, 1.0]
+  - id: deepfake_detection
+    difficulty: hard
+    description: Detect AI-manipulated/deepfake media and make moderation decisions
+    content_type: multimodal
+    num_items: 5
+    score_range: [0.0, 1.0]
+action_space:
+  type: object
+  fields:
+    decision:
+      type: string
+      enum: [approve, reject, escalate, flag]
+    reason:
+      type: string
+    confidence:
+      type: float
+      range: [0.0, 1.0]
+    labels:
+      type: array
+      items: string
+      valid_values:
+        - spam
+        - scam
+        - phishing
+        - pharmaceutical_spam
+        - hate_speech
+        - violence
+        - harassment
+        - misinformation
+        - adult_content
+        - deepfake
+        - political_manipulation
+        - fraud
+observation_space:
+  type: object
+  fields:
+    content_id: string
+    content_type: string
+    text: optional string
+    image_description: optional string
+    detector_score: optional float
+    metadata: object
+    step_num: integer
+    total_steps: integer
+endpoints:
+  reset: POST /reset
+  step: POST /step
+  state: GET /state
+  close: POST /close
+  tasks: GET /tasks
+  health: GET /health
+docker:
+  context: .
+  dockerfile: server/Dockerfile
+huggingface:
+  space_sdk: docker
+  tags:
+    - openenv
+    - content-moderation
+    - deepfake-detection

server/requirements.txt ADDED

	@@ -0,0 +1,12 @@

+fastapi==0.111.0
+uvicorn[standard]==0.29.0
+pydantic==2.7.1
+transformers==4.41.2
+torch==2.3.0
+torchvision==0.18.0
+Pillow==10.3.0
+numpy==1.26.4
+accelerate==0.30.0
+openai>=1.40.0
+requests==2.31.0
+python-dotenv==1.0.0

server/tasks.py ADDED

	@@ -0,0 +1,192 @@

+TASKS = {
+    "text_spam": {
+        "description": "Classify email/message content as spam or legitimate",
+        "difficulty": "easy",
+        "content_type": "text",
+        "items": [
+            {
+                "content_id": "ts_001",
+                "content_type": "text",
+                "text": "CONGRATULATIONS! You've won $1,000,000! Click here NOW to claim your prize!!! Limited time offer!!!",
+                "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
+                "ground_truth": {"decision": "reject", "labels": ["spam", "scam"], "is_harmful": True},
+            },
+            {
+                "content_id": "ts_002",
+                "content_type": "text",
+                "text": "Hi Sarah, confirming our meeting tomorrow at 3pm in the main conference room. Let me know if the time still works.",
+                "metadata": {"source": "email", "sender_reputation": 0.92, "link_count": 0},
+                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+            },
+            {
+                "content_id": "ts_003",
+                "content_type": "text",
+                "text": "FREE MEDS NO PRESCRIPTION NEEDED BUY NOW CHEAPEST PRICES CLICK LINK BELOW",
+                "metadata": {"source": "email", "sender_reputation": 0.02, "link_count": 5},
+                "ground_truth": {"decision": "reject", "labels": ["spam", "pharmaceutical_spam"], "is_harmful": True},
+            },
+            {
+                "content_id": "ts_004",
+                "content_type": "text",
+                "text": "Your Amazon order #112-3456789 has shipped and will arrive by Thursday. Track at amazon.com/orders.",
+                "metadata": {"source": "email", "sender_reputation": 0.98, "link_count": 1},
+                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+            },
+            {
+                "content_id": "ts_005",
+                "content_type": "text",
+                "text": "URGENT: Your account will be SUSPENDED in 24hrs unless you verify immediately. Click: bit.ly/verify-acct-now",
+                "metadata": {"source": "email", "sender_reputation": 0.08, "link_count": 2},
+                "ground_truth": {"decision": "reject", "labels": ["spam", "phishing"], "is_harmful": True},
+            },
+        ],
+    },
+    "content_moderation": {
+        "description": "Multi-label moderation for social media posts",
+        "difficulty": "medium",
+        "content_type": "text",
+        "items": [
+            {
+                "content_id": "cm_001",
+                "content_type": "text",
+                "text": "Gorgeous sunset from my hike today! The trail to the summit was tough but totally worth the view.",
+                "metadata": {"platform": "social_media", "report_count": 0, "user_history_violations": 0},
+                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+            },
+            {
+                "content_id": "cm_002",
+                "content_type": "text",
+                "text": "I will find that [slur] and make them regret it. They deserve everything coming to them.",
+                "metadata": {"platform": "social_media", "report_count": 18, "user_history_violations": 3},
+                "ground_truth": {"decision": "reject", "labels": ["hate_speech", "violence", "harassment"], "is_harmful": True},
+            },
+            {
+                "content_id": "cm_003",
+                "content_type": "text",
+                "text": "Lose 30lbs in 7 days with this one weird trick! Doctors HATE this! DM me to buy, only $89.",
+                "metadata": {"platform": "social_media", "report_count": 11, "user_history_violations": 1},
+                "ground_truth": {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True},
+            },
+            {
+                "content_id": "cm_004",
+                "content_type": "text",
+                "text": "The election was stolen. The globalist elites are replacing us and controlling everything. Rise up!",
+                "metadata": {"platform": "social_media", "report_count": 27, "user_history_violations": 2},
+                "ground_truth": {"decision": "flag", "labels": ["misinformation", "hate_speech"], "is_harmful": True},
+            },
+            {
+                "content_id": "cm_005",
+                "content_type": "text",
+                "text": "Book club reminder: we meet Thursday 7pm to discuss chapters 12-15. Bring snacks if you can!",
+                "metadata": {"platform": "social_media", "report_count": 0, "user_history_violations": 0},
+                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+            },
+        ],
+    },
+    "deepfake_detection": {
+        "description": "Detect AI-manipulated media and make moderation decisions",
+        "difficulty": "hard",
+        "content_type": "multimodal",
+        "items": [
+            {
+                "content_id": "df_001",
+                "content_type": "multimodal",
+                "text": "CEO announces surprise acquisition deal — photo circulating widely on financial forums",
+                "image_description": "Corporate executive at podium. Facial proportions slightly asymmetric. Background blur shows unnatural transitions. Hair boundary shows pixel artifacts. Lighting on face inconsistent with background.",
+                "metadata": {
+                    "source": "social_media",
+                    "claimed_subject": "public_executive",
+                    "virality_score": 0.92,
+                    "exif_stripped": True,
+                    "upload_device": "unknown",
+                },
+                "ground_truth": {
+                    "decision": "reject",
+                    "labels": ["deepfake", "misinformation", "fraud"],
+                    "is_deepfake": True,
+                    "is_harmful": True,
+                },
+            },
+            {
+                "content_id": "df_002",
+                "content_type": "multimodal",
+                "text": "Family reunion photo shared to personal profile",
+                "image_description": "Outdoor gathering, natural daylight. Shadows consistent with single light source. Facial textures normal and varied. Background in sharp focus with depth-of-field matching focal length.",
+                "metadata": {
+                    "source": "personal_profile",
+                    "claimed_subject": "private_individual",
+                    "virality_score": 0.03,
+                    "exif_stripped": False,
+                    "upload_device": "iPhone_15",
+                },
+                "ground_truth": {
+                    "decision": "approve",
+                    "labels": [],
+                    "is_deepfake": False,
+                    "is_harmful": False,
+                },
+            },
+            {
+                "content_id": "df_003",
+                "content_type": "multimodal",
+                "text": "Politician caught making controversial statement — clip going viral across platforms",
+                "image_description": "Political figure at microphone. Eye blink rate irregular. Lip sync offset approximately 2 frames. Neck-to-face skin tone boundary visible. Ear shape subtly non-anatomical.",
+                "metadata": {
+                    "source": "video_platform",
+                    "claimed_subject": "elected_official",
+                    "virality_score": 0.88,
+                    "exif_stripped": True,
+                    "upload_device": "unknown",
+                },
+                "ground_truth": {
+                    "decision": "reject",
+                    "labels": ["deepfake", "political_manipulation", "misinformation"],
+                    "is_deepfake": True,
+                    "is_harmful": True,
+                },
+            },
+            {
+                "content_id": "df_004",
+                "content_type": "multimodal",
+                "text": "Famous actor endorsing cryptocurrency — advertisement spreading across social media",
+                "image_description": "Known celebrity in studio setting. Jaw boundary shows feathering artifacts. Teeth appear uniform and unnaturally perfect. Reflection in eyes does not match claimed studio environment.",
+                "metadata": {
+                    "source": "paid_advertisement",
+                    "claimed_subject": "celebrity",
+                    "virality_score": 0.75,
+                    "exif_stripped": True,
+                    "upload_device": "unknown",
+                },
+                "ground_truth": {
+                    "decision": "reject",
+                    "labels": ["deepfake", "fraud", "scam"],
+                    "is_deepfake": True,
+                    "is_harmful": True,
+                },
+            },
+            {
+                "content_id": "df_005",
+                "content_type": "multimodal",
+                "text": "News anchor presenting evening broadcast — clip shared from verified channel",
+                "image_description": "Professional broadcast studio. Consistent studio lighting. Natural eye movement and blink rate. Source watermark present. Metadata intact from broadcast system.",
+                "metadata": {
+                    "source": "verified_broadcaster",
+                    "claimed_subject": "journalist",
+                    "virality_score": 0.35,
+                    "exif_stripped": False,
+                    "upload_device": "broadcast_encoder",
+                },
+                "ground_truth": {
+                    "decision": "approve",
+                    "labels": [],
+                    "is_deepfake": False,
+                    "is_harmful": False,
+                },
+            },
+        ],
+    },
+}
+TASK_NAMES = list(TASKS.keys())
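
The ground-truth labels in `TASKS` can be cross-checked against the label list in the README's action space. The sketch below does that for a trimmed copy of two `text_spam` items. It assumes the README list is meant to be the complete vocabulary, which the repository does not state explicitly; `ALLOWED_LABELS`, `unknown_labels`, and `SAMPLE_TASKS` are illustrative names, not part of the commit.

```python
# Label vocabulary as listed in the README's action space (assumed complete).
ALLOWED_LABELS = {
    "spam", "scam", "phishing", "hate_speech", "violence", "harassment",
    "misinformation", "adult_content", "deepfake", "political_manipulation", "fraud",
}


def unknown_labels(tasks: dict) -> dict[str, list[str]]:
    """Map content_id -> ground-truth labels that are not in ALLOWED_LABELS."""
    problems: dict[str, list[str]] = {}
    for task in tasks.values():
        for item in task["items"]:
            extra = [
                label
                for label in item["ground_truth"]["labels"]
                if label not in ALLOWED_LABELS
            ]
            if extra:
                problems[item["content_id"]] = extra
    return problems


# Trimmed copy of two text_spam items, just enough to exercise the check.
SAMPLE_TASKS = {
    "text_spam": {
        "items": [
            {
                "content_id": "ts_001",
                "ground_truth": {"decision": "reject", "labels": ["spam", "scam"]},
            },
            {
                "content_id": "ts_003",
                "ground_truth": {
                    "decision": "reject",
                    "labels": ["spam", "pharmaceutical_spam"],
                },
            },
        ]
    }
}

if __name__ == "__main__":
    print(unknown_labels(SAMPLE_TASKS))  # {'ts_003': ['pharmaceutical_spam']}
```

Under that assumption, `ts_003` carries a label (`pharmaceutical_spam`) that an agent restricted to the README vocabulary could never emit, which may matter for any grader that scores label overlap.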

test/test.py ADDED

	@@ -0,0 +1,243 @@

+import sys
+import os
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+import pytest
+from server.models import ModerationAction, ContentObservation, StepResult, ResetResult, EnvState
+from server.env import ContentModerationEnv
+from server.graders import grade_text_spam, grade_content_moderation, grade_deepfake, GRADERS
+from server.tasks import TASKS, TASK_NAMES
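+# Build a ModerationAction instance, as passed to env.step() in the environment tests.
+def make_action(decision="approve", reason="test", confidence=0.8, labels=None):
+    return ModerationAction(decision=decision, reason=reason, confidence=confidence, labels=labels or [])
+# Build the same payload as a plain dict, as passed to the grader functions in the grader tests.
+def make_action_dict(decision="approve", reason="test", confidence=0.8, labels=None):
+    return {"decision": decision, "reason": reason, "confidence": confidence, "labels": labels or []}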
+# --- Task data ---
+def test_all_tasks_present():
+    assert set(TASK_NAMES) == {"text_spam", "content_moderation", "deepfake_detection"}
+def test_each_task_has_five_items():
+    for name, task in TASKS.items():
+        assert len(task["items"]) == 5, f"{name} should have 5 items"
+def test_ground_truth_keys():
+    for name, task in TASKS.items():
+        for item in task["items"]:
+            gt = item["ground_truth"]
+            assert "decision" in gt
+            assert "labels" in gt
+            assert gt["decision"] in ("approve", "reject", "flag", "escalate")
+def test_deepfake_items_have_is_deepfake():
+    for item in TASKS["deepfake_detection"]["items"]:
+        assert "is_deepfake" in item["ground_truth"]
+# --- Graders ---
+def test_grade_spam_correct_reject():
+    gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
+    r = grade_text_spam(make_action_dict("reject", confidence=0.9, labels=["spam"]), gt)
+    assert 0.8 <= r <= 1.0
+def test_grade_spam_wrong_decision_penalised():
+    gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
+    r = grade_text_spam(make_action_dict("approve", confidence=0.9), gt)
+    assert r < 0.3
+def test_grade_spam_escalate_partial_credit():
+    gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
+    r = grade_text_spam(make_action_dict("escalate", confidence=0.5), gt)
+    assert 0.15 <= r <= 0.45
+def test_grade_spam_correct_approve():
+    gt = {"decision": "approve", "labels": [], "is_harmful": False}
+    r = grade_text_spam(make_action_dict("approve", confidence=0.85), gt)
+    assert r >= 0.6
+def test_grade_content_mod_correct_with_labels():
+    gt = {"decision": "reject", "labels": ["hate_speech", "violence"], "is_harmful": True}
+    r = grade_content_moderation(
+        make_action_dict("reject", confidence=0.9, labels=["hate_speech", "violence"]), gt
+    )
+    assert r >= 0.7
+def test_grade_content_mod_label_partial():
+    gt = {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True}
+    r_partial = grade_content_moderation(
+        make_action_dict("flag", confidence=0.7, labels=["misinformation"]), gt
+    )
+    r_full = grade_content_moderation(
+        make_action_dict("flag", confidence=0.7, labels=["misinformation", "spam"]), gt
+    )
+    assert r_partial < r_full
+def test_grade_deepfake_detected_and_correct():
+    gt = {"decision": "reject", "labels": ["deepfake", "fraud"], "is_deepfake": True, "is_harmful": True}
+    r = grade_deepfake(
+        make_action_dict("reject", confidence=0.9, labels=["deepfake", "fraud"]), gt, detector_score=0.85
+    )
+    assert r >= 0.7
+def test_grade_deepfake_real_correct():
+    gt = {"decision": "approve", "labels": [], "is_deepfake": False, "is_harmful": False}
+    r = grade_deepfake(make_action_dict("approve", confidence=0.8), gt, detector_score=0.1)
+    assert r >= 0.5
+def test_grade_deepfake_missed_deepfake():
+    gt = {"decision": "reject", "labels": ["deepfake"], "is_deepfake": True, "is_harmful": True}
+    r_miss = grade_deepfake(make_action_dict("approve", confidence=0.8), gt)
+    r_detect = grade_deepfake(make_action_dict("reject", confidence=0.8, labels=["deepfake"]), gt)
+    assert r_miss < r_detect
+def test_all_rewards_in_range():
+    for task_name in TASK_NAMES:
+        task = TASKS[task_name]
+        grader = GRADERS[task_name]
+        for item in task["items"]:
+            for decision in ("approve", "reject", "flag", "escalate"):
+                action = make_action_dict(decision, confidence=0.5, labels=["spam"])
+                if task_name == "deepfake_detection":
+                    r = grader(action, item["ground_truth"], 0.5)
+                else:
+                    r = grader(action, item["ground_truth"])
+                assert 0.0 <= r <= 1.0, f"{task_name} reward out of range: {r}"
+# --- Environment ---
+def test_reset_returns_first_observation():
+    env = ContentModerationEnv()
+    result = env.reset("text_spam")
+    assert isinstance(result, ResetResult)
+    obs = result.observation
+    assert obs.step_num == 1
+    assert obs.total_steps == 5
+    assert obs.content_id == "ts_001"
+def test_step_advances_state():
+    env = ContentModerationEnv()
+    env.reset("text_spam")
+    action = make_action("reject")
+    result = env.step(action)
+    assert isinstance(result, StepResult)
+    assert 0.0 <= result.reward <= 1.0
+    assert result.observation is not None
+    assert result.observation.step_num == 2
+def test_episode_ends_after_all_items():
+    env = ContentModerationEnv()
+    env.reset("text_spam")
+    done = False
+    steps = 0
+    while not done:
+        r = env.step(make_action("escalate"))
+        done = r.done
+        steps += 1
+    assert steps == 5
+    assert r.observation is None
+def test_step_after_done_returns_error():
+    env = ContentModerationEnv()
+    env.reset("text_spam")
+    for _ in range(5):
+        env.step(make_action("approve"))
+    result = env.step(make_action("approve"))
+    assert result.done is True
+    assert "error" in result.info
+def test_state_tracks_cumulative_reward():
+    env = ContentModerationEnv()
+    env.reset("content_moderation")
+    env.step(make_action("approve", confidence=0.9))
+    env.step(make_action("reject", confidence=0.9, labels=["hate_speech"]))
+    st = env.state()
+    assert isinstance(st, EnvState)
+    assert st.step_num == 2
+    assert st.cumulative_reward >= 0.0
+    assert len(st.history) == 2
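+def test_reset_different_tasks():
+    env = ContentModerationEnv()
+    for task in TASK_NAMES:
+        # deepfake_detection is skipped here; resetting it is covered by
+        # test_deepfake_obs_has_image_description.
+        if task == "deepfake_detection":
+            continue
+        r = env.reset(task)
+        assert r.observation.total_steps == 5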
+def test_invalid_task_raises():
+    env = ContentModerationEnv()
+    with pytest.raises(ValueError):
+        env.reset("nonexistent_task")
+def test_close_resets_env():
+    env = ContentModerationEnv()
+    env.reset("text_spam")
+    env.step(make_action("approve"))
+    env.close()
+    st = env.state()
+    assert st.task == "none"
+    assert st.done is True
+def test_content_moderation_full_run():
+    env = ContentModerationEnv()
+    env.reset("content_moderation")
+    actions = [
+        make_action("approve"),
+        make_action("reject", labels=["hate_speech", "violence"]),
+        make_action("flag", labels=["misinformation"]),
+        make_action("flag", labels=["misinformation", "hate_speech"]),
+        make_action("approve"),
+    ]
+    total_reward = 0.0
+    for action in actions:
+        result = env.step(action)
+        total_reward += result.reward
+    assert result.done is True
+    assert total_reward >= 0.0
+    st = env.state()
+    assert abs(st.cumulative_reward - total_reward) < 0.01
+def test_observation_fields_populated():
+    env = ContentModerationEnv()
+    r = env.reset("content_moderation")
+    obs = r.observation
+    assert obs.content_id is not None
+    assert obs.content_type == "text"
+    assert obs.text is not None
+    assert obs.metadata is not None
+def test_deepfake_obs_has_image_description():
+    env = ContentModerationEnv()
+    r = env.reset("deepfake_detection")
+    obs = r.observation
+    assert obs.image_description is not None
+    assert obs.content_type == "multimodal"
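
Several of the grader tests in `test/test.py` pin down only relative behaviour, for example that a partial label match scores below a full match (`test_grade_content_mod_label_partial`). The grader implementation in `server/graders.py` lies outside this part of the diff, so the following is only an illustrative sketch of one scoring rule that would satisfy that ordering: Jaccard overlap between predicted and expected labels. `label_overlap` is a hypothetical helper, not the repository's function.

```python
from typing import Iterable


def label_overlap(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard similarity between two label collections, in [0.0, 1.0].

    Two empty collections count as a perfect match, so correctly predicting
    "no violations" for benign content is not penalised.
    """
    pred, exp = set(predicted), set(expected)
    if not pred and not exp:
        return 1.0
    return len(pred & exp) / len(pred | exp)


if __name__ == "__main__":
    expected = ["misinformation", "spam"]
    print(label_overlap(["misinformation"], expected))          # 0.5
    print(label_overlap(["misinformation", "spam"], expected))  # 1.0
```

A component like this would typically be combined with a decision-match term and a confidence term, but how the actual graders weight those parts is not shown here.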