ANI00 committed on
Commit
eb0a4a1
·
0 Parent(s):

first commit

.gitignore ADDED
@@ -0,0 +1,7 @@
+ __pycache__
+ *.pyc
+ .env
+
+ .sixth
+ .pytest_cache
+ .coverage
README.md ADDED
@@ -0,0 +1,210 @@
+ # Content Moderation OpenEnv
+
+ An AI content moderation environment built to the OpenEnv specification. Agents triage real-world content (spam emails, harmful social media posts, and AI-generated deepfakes) using a standard `step()` / `reset()` / `state()` API.
+
+ ---
+
+ ## Environment Description
+
+ Content moderation is a high-stakes, high-volume real-world task. Human moderators review millions of items daily. This environment simulates that pipeline across three difficulty levels:
+
+ | Task | Difficulty | Content | Description |
+ |---|---|---|---|
+ | `text_spam` | Easy | Email / SMS | Binary spam vs. legitimate classification |
+ | `content_moderation` | Medium | Social media posts | Multi-label violation detection |
+ | `deepfake_detection` | Hard | Multimodal | Detect AI-manipulated media + make policy decisions |
+
+ Models used:
+ - **Deepfake detection**: `dima806/deepfake_vs_real_image_detection` (ViT, ~300MB)
+ - **Agent LLM**: `meta-llama/Llama-3.1-8B-Instruct` (via transformers pipeline or HF API)
+
+ ---
+
+ ## Action Space
+
+ ```json
+ {
+   "decision": "approve | reject | escalate | flag",
+   "reason": "string (explanation)",
+   "confidence": 0.0–1.0,
+   "labels": ["spam", "scam", "phishing", "pharmaceutical_spam", "hate_speech",
+              "violence", "harassment", "misinformation", "adult_content",
+              "deepfake", "political_manipulation", "fraud"]
+ }
+ ```
+
+ - **approve**: content is safe
+ - **reject**: clear policy violation, remove immediately
+ - **flag**: borderline, surface to human review
+ - **escalate**: uncertain, route to specialist
+
+ ---
+
+ ## Observation Space
+
+ ```json
+ {
+   "content_id": "string",
+   "content_type": "text | multimodal",
+   "text": "string (optional)",
+   "image_description": "string (optional, deepfake task)",
+   "detector_score": 0.0–1.0 (optional, deepfake task),
+   "metadata": { "source": "...", "report_count": 0, ... },
+   "step_num": 1,
+   "total_steps": 5
+ }
+ ```
+
+ The `detector_score` field (deepfake task only) is the raw output from the `dima806/deepfake_vs_real_image_detection` model. Higher values indicate a higher probability of a deepfake.
+
+ ---
+
+ ## Reward Functions
+
+ Each step's reward is the sum of the components below, clamped to the range 0.0 to 1.0.
+
+ ### text_spam (easy)
+ - Correct decision: **+0.65**
+ - Escalate on harmful content: **+0.30** (partial credit)
+ - Flag on legitimate content: **+0.20** (partial credit)
+ - Label F1 contribution: up to **+0.20**
+ - Confidence calibration: **+0.10 × confidence** when the decision is correct, **−0.15 × confidence** when it is wrong
+
+ ### content_moderation (medium)
+ - Correct decision: **+0.50**
+ - Partial credit (flag/escalate/reject on harmful content when that is not the expected decision): **+0.25**
+ - Escalate on benign content: **+0.15** (partial credit)
+ - Label F1 contribution: up to **+0.35**
+ - Confidence calibration: **+0.10 × confidence** when the decision is correct, **−0.15 × confidence** when it is wrong
+
+ ### deepfake_detection (hard)
+ - Correct decision: **+0.40**
+ - Partial credit (flag/escalate/reject on harmful content when that is not the expected decision): **+0.20**
+ - Deepfake detection accuracy (the `deepfake` label is present exactly when the item is a deepfake): **+0.30**
+ - Detector model alignment bonus: **+0.10**
+ - Label F1 contribution: up to **+0.20**
+ - Confidence calibration: **+0.10 × confidence** when the decision is correct, **−0.15 × confidence** when it is wrong
+
+ ---
+
+ ## API Endpoints
+
+ | Method | Path | Description |
+ |---|---|---|
+ | POST | `/reset` | Start new episode. Body: `{"task": "text_spam"}` |
+ | POST | `/step` | Submit action. Body: action JSON |
+ | GET | `/state` | Current episode state |
+ | POST | `/close` | End episode and clean up |
+ | GET | `/tasks` | List all available tasks |
+ | GET | `/health` | Health check |
+
+ ---
+
+ ## Setup & Usage
+
+ ### Requirements
+ - Docker
+ - Python 3.11+
+ - `openenv-core` (`pip install openenv-core`)
+
+ ### Run with Docker
+
+ ```bash
+ cd content-moderation-env
+
+ # Build
+ docker build -f server/Dockerfile -t content-moderation-env .
+
+ # Run
+ docker run -p 7860:7860 content-moderation-env
+ ```
+
+ ### Run locally
+
+ ```bash
+ pip install -r server/requirements.txt
+ uvicorn server.main:app --host 0.0.0.0 --port 7860
+ ```
+
+ ### Validate
+
+ ```bash
+ openenv validate # from project root
+ ```
+
+ ---
+
+ ## Inference Script
+
+ ```bash
+ # API mode (HF inference endpoint)
+ export API_BASE_URL="https://router.huggingface.co/v1"
+ export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
+ export HF_TOKEN="hf_your_token_here"
+ export SERVER_URL="http://localhost:7860"
+ export TASK_NAME="text_spam"
+
+ python inference.py
+
+ # Local transformers pipeline mode
+ export USE_LOCAL_MODEL="true"
+ python inference.py
+ ```
+
+ ### Output format
+
+ ```
+ [START] task=text_spam env=content_moderation_env model=meta-llama/Llama-3.1-8B-Instruct
+ [STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} reward=0.85 done=false error=null
+ [STEP] step=2 action={"decision":"approve","confidence":0.8,"labels":[]} reward=0.75 done=false error=null
+ ...
+ [END] success=true steps=5 score=0.610 rewards=0.85,0.75,0.00,0.80,0.65
+ ```
+
+ The `score` is the sum of the step rewards divided by the number of items in the task, clamped to the range 0.0 to 1.0.
+
+ ---
+
+ ## Run Tests
+
+ ```bash
+ pip install pytest
+ pytest test/test.py -v
+ ```
+
+ ---
+
+ ## Baseline Scores (Llama-3.1-8B-Instruct, temperature=0.2)
+
+ | Task | Score | Notes |
+ |---|---|---|
+ | `text_spam` | ~0.72 | Strong on obvious spam, weaker on phishing |
+ | `content_moderation` | ~0.58 | Good decision accuracy, weaker multi-label coverage |
+ | `deepfake_detection` | ~0.44 | Relies heavily on image description cues |
+
+ ---
+
+ ## Hugging Face Spaces Deployment
+
+ Create a Space with the Docker SDK, push this repo, and set:
+ - `HF_TOKEN` (secret)
+ - `API_BASE_URL` (variable)
+ - `MODEL_NAME` (variable)
+
+ The Space URL becomes your `PING_URL` for the validation script.
+
+ ---
+
+ ## Project Structure
+
+ ```
+ content-moderation-env/
+ ├── server/
+ │   ├── __init__.py
+ │   ├── main.py           # FastAPI app + endpoints
+ │   ├── env.py            # OpenEnv environment (step/reset/state/close)
+ │   ├── models.py         # Pydantic action/observation models
+ │   ├── tasks.py          # Task datasets + ground truth
+ │   ├── graders.py        # Reward functions per task
+ │   ├── deepfake_model.py # HF deepfake detection pipeline
+ │   ├── openenv.yaml      # OpenEnv metadata spec
+ │   ├── requirements.txt
+ │   └── Dockerfile
+ ├── test/
+ │   └── test.py           # pytest suite (20+ tests)
+ ├── inference.py          # Baseline agent script
+ └── README.md
+ ```
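The endpoint table in the README can be exercised with the Python standard library alone. The following is a minimal sketch, not part of the commit: the helper names `build_reset_request` and `build_step_request` are hypothetical, and the base URL assumes the default port 7860 from the Docker instructions. It only builds the requests, so it can be read and run without a server.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:7860"  # assumption: default port from the README


def _post_json(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given path."""
    return urllib.request.Request(
        url=f"{SERVER_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_reset_request(task: str = "text_spam") -> urllib.request.Request:
    """Hypothetical helper: body matches the /reset row of the endpoint table."""
    return _post_json("/reset", {"task": task})


def build_step_request(decision: str, reason: str, confidence: float,
                       labels: list) -> urllib.request.Request:
    """Hypothetical helper: body follows the Action Space schema."""
    action = {
        "decision": decision,
        "reason": reason,
        "confidence": confidence,
        "labels": labels,
    }
    return _post_json("/step", action)


if __name__ == "__main__":
    req = build_step_request("reject", "Prize scam with urgency cues.", 0.9, ["spam", "scam"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it against a running server:
    #   with urllib.request.urlopen(req, timeout=30) as resp:
    #       print(json.load(resp))
```

Keeping request construction separate from sending makes the payload shape easy to check before a server is available.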
inference.py ADDED
@@ -0,0 +1,238 @@
+ """
+ Content Moderation Inference Script
+ Env vars: API_BASE_URL, MODEL_NAME, HF_TOKEN, SERVER_URL, TASK_NAME
+ USE_LOCAL_MODEL=true → local transformers pipeline (Llama-3.1-8B-Instruct)
+ USE_LOCAL_MODEL=false → OpenAI-compatible HF API (default)
+ """
+
+ import json
+ import os
+ import textwrap
+ from functools import lru_cache
+ from typing import Dict, Any, List, Optional
+
+ from dotenv import load_dotenv
+ import requests
+
+ # Load environment variables from .env file
+ load_dotenv()
+
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")
+ HF_TOKEN = os.getenv("HF_TOKEN", "")
+ API_KEY = HF_TOKEN or os.getenv("API_KEY", "")
+ SERVER_URL = os.getenv("SERVER_URL", "http://localhost:7860").rstrip("/")
+ TASK_NAME = os.getenv("TASK_NAME", "text_spam")
+ BENCHMARK = "content_moderation_env"
+ MAX_STEPS = 10
+ USE_LOCAL = os.getenv("USE_LOCAL_MODEL", "false").lower() == "true"
+ SUCCESS_SCORE_THRESHOLD = 0.5
+
+ VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
+ VALID_LABELS = [
+     "spam", "scam", "phishing", "pharmaceutical_spam",
+     "hate_speech", "violence", "harassment", "misinformation",
+     "adult_content", "deepfake", "political_manipulation", "fraud",
+ ]
+
+ SYSTEM_PROMPT = textwrap.dedent("""
+     You are an expert content moderation AI. For each piece of content, analyze it carefully and respond ONLY with a valid JSON object.
+
+     Required JSON fields:
+     - "decision": one of "approve", "reject", "escalate", "flag"
+     - "reason": brief explanation (max 2 sentences)
+     - "confidence": float between 0.0 and 1.0
+     - "labels": list of applicable labels from: spam, scam, phishing, pharmaceutical_spam, hate_speech, violence, harassment, misinformation, adult_content, deepfake, political_manipulation, fraud
+
+     Decision guidelines:
+     - approve: content is safe and legitimate
+     - reject: content clearly violates policy (spam, hate speech, deepfake, etc.)
+     - flag: content is borderline or needs human review
+     - escalate: uncertain, requires specialist review
+
+     Output ONLY the JSON object, no other text.
+ """).strip()
+
+
+ def log_start(task: str, env: str, model: str):
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]):
+     err = error if error else "null"
+     print(f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}", flush=True)
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]):
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+
+
+ def build_prompt(obs: Dict[str, Any]) -> str:
+     parts = [f"Content ID: {obs.get('content_id', 'unknown')}"]
+     parts.append(f"Type: {obs.get('content_type', 'text')}")
+
+     if obs.get("text"):
+         parts.append(f"Text: {obs['text']}")
+     if obs.get("image_description"):
+         parts.append(f"Image analysis: {obs['image_description']}")
+     if obs.get("detector_score") is not None:
+         score = obs["detector_score"]
+         parts.append(f"Deepfake detector score (higher = more likely fake): {score:.3f}")
+
+     meta = obs.get("metadata", {})
+     if meta:
+         meta_str = ", ".join(f"{k}={v}" for k, v in meta.items())
+         parts.append(f"Metadata: {meta_str}")
+
+     parts.append(f"\nStep {obs.get('step_num', '?')} of {obs.get('total_steps', '?')}")
+     return "\n".join(parts)
+
+
+ def _default_action() -> Dict:
+     return {"decision": "escalate", "reason": "Unable to analyze content.", "confidence": 0.3, "labels": []}
+
+
+ @lru_cache(maxsize=1)
+ def _get_local_pipeline():
+     # Build the pipeline once; recreating it on every step would reload the model weights.
+     from transformers import pipeline
+
+     return pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
+
+
+ def call_local_model(prompt: str) -> Dict:
+     pipe = _get_local_pipeline()
+     messages = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": prompt},
+     ]
+     output = pipe(messages, max_new_tokens=256, temperature=0.2, do_sample=True)
+     text = output[0]["generated_text"]
+     # With chat-style input the pipeline returns the whole conversation; take the last message.
+     if isinstance(text, list):
+         text = text[-1].get("content", "")
+     return parse_llm_response(text)
+
+
+ def call_api_model(prompt: str) -> Dict:
+     from openai import OpenAI
+
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY or "hf_default")
+     completion = client.chat.completions.create(
+         model=MODEL_NAME,
+         messages=[
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": prompt},
+         ],
+         temperature=0.2,
+         max_tokens=256,
+     )
+     text = (completion.choices[0].message.content or "").strip()
+     return parse_llm_response(text)
+
+
+ def parse_llm_response(text: str) -> Dict:
+     # Extract the outermost {...} span and validate each field; fall back to a safe default.
+     try:
+         start = text.find("{")
+         end = text.rfind("}") + 1
+         if start >= 0 and end > start:
+             parsed = json.loads(text[start:end])
+             decision = parsed.get("decision", "escalate")
+             if decision not in VALID_DECISIONS:
+                 decision = "escalate"
+             # Convert before clamping so a numeric string such as "0.9" is accepted.
+             confidence = float(parsed.get("confidence", 0.5))
+             return {
+                 "decision": decision,
+                 "reason": str(parsed.get("reason", ""))[:200],
+                 "confidence": max(0.0, min(1.0, confidence)),
+                 "labels": [label for label in parsed.get("labels", []) if label in VALID_LABELS],
+             }
+     except Exception:
+         pass
+     return _default_action()
+
+
+ def get_decision(prompt: str) -> Dict:
+     try:
+         if USE_LOCAL:
+             return call_local_model(prompt)
+         return call_api_model(prompt)
+     except Exception as e:
+         print(f"[DEBUG] Model error: {e}", flush=True)
+         return _default_action()
+
+
+ def server_reset(task: str) -> Optional[Dict]:
+     try:
+         r = requests.post(f"{SERVER_URL}/reset", json={"task": task}, timeout=30)
+         r.raise_for_status()
+         return r.json()
+     except Exception as e:
+         print(f"[DEBUG] reset error: {e}", flush=True)
+         return None
+
+
+ def server_step(action: Dict) -> Optional[Dict]:
+     try:
+         r = requests.post(f"{SERVER_URL}/step", json=action, timeout=30)
+         r.raise_for_status()
+         return r.json()
+     except Exception as e:
+         print(f"[DEBUG] step error: {e}", flush=True)
+         return None
+
+
+ def server_close():
+     try:
+         requests.post(f"{SERVER_URL}/close", timeout=10)
+     except Exception:
+         pass
+
+
+ def run_episode(task: str):
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+     obs = None
+
+     log_start(task=task, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         reset_result = server_reset(task)
+         if reset_result is None:
+             # The finally block logs [END]; logging here as well would print it twice.
+             return
+
+         obs = reset_result.get("observation", {})
+         done = False
+
+         for step in range(1, MAX_STEPS + 1):
+             if done or obs is None:
+                 break
+
+             prompt = build_prompt(obs)
+             action = get_decision(prompt)
+             action_str = json.dumps({k: v for k, v in action.items() if k != "reason"})
+
+             result = server_step(action)
+             if result is None:
+                 log_step(step, action_str, 0.0, True, "server_error")
+                 break
+
+             reward = float(result.get("reward", 0.0))
+             done = bool(result.get("done", False))
+             error = result.get("info", {}).get("error")
+
+             rewards.append(reward)
+             steps_taken = step
+
+             log_step(step, action_str, reward, done, error)
+
+             obs = result.get("observation")
+
+         # Normalize by the task length so unfinished steps count as zero reward.
+         total_steps_in_task = obs.get("total_steps", len(rewards)) if obs else len(rewards)
+         max_possible = float(total_steps_in_task)
+         score = sum(rewards) / max_possible if max_possible > 0 else 0.0
+         score = min(max(score, 0.0), 1.0)
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         server_close()
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+
+ if __name__ == "__main__":
+     run_episode(TASK_NAME)
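The episode score that `run_episode` reports is the mean step reward over the task length, clamped to the unit interval. A minimal standalone sketch of that arithmetic follows; the function name `episode_score` is hypothetical and the reward list is the illustrative one from the README's output example.

```python
from typing import List

SUCCESS_SCORE_THRESHOLD = 0.5  # same threshold value as in inference.py


def episode_score(rewards: List[float], total_steps: int) -> float:
    """Sum of step rewards divided by the task length, clamped to [0, 1]."""
    if total_steps <= 0:
        return 0.0
    return min(max(sum(rewards) / total_steps, 0.0), 1.0)


if __name__ == "__main__":
    rewards = [0.85, 0.75, 0.00, 0.80, 0.65]
    score = episode_score(rewards, total_steps=5)
    # 3.05 / 5 = 0.61, which clears the 0.5 success threshold.
    print(f"score={score:.3f} success={score >= SUCCESS_SCORE_THRESHOLD}")
```

Dividing by the task length rather than by the number of completed steps means an episode that aborts early is penalized for the items it never reached.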
server/Dockerfile ADDED
@@ -0,0 +1,27 @@
+ FROM python:3.11-slim
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     HF_HOME=/app/.cache/huggingface \
+     TRANSFORMERS_CACHE=/app/.cache/huggingface
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     libgl1 libglib2.0-0 curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ COPY server/requirements.txt .
+ RUN pip install --upgrade pip setuptools wheel
+ RUN pip install --no-cache-dir --no-build-isolation -r requirements.txt
+
+ COPY . .
+
+ RUN mkdir -p /app/.cache/huggingface
+
+ # Pre-download deepfake model to avoid runtime delays
+ RUN python -c "from transformers import pipeline; pipeline('image-classification', model='dima806/deepfake_vs_real_image_detection', device=-1)" 2>&1 || echo "Model download optional"
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "7860"]
server/__init__.py ADDED
File without changes
server/deepfake_model.py ADDED
@@ -0,0 +1,89 @@
+ import logging
+
+ import numpy as np
+
+ logger = logging.getLogger(__name__)
+ _pipe = None
+
+
+ def _load_pipeline():
+     global _pipe
+     if _pipe is not None:
+         return _pipe
+     try:
+         from transformers import pipeline
+         _pipe = pipeline(
+             "image-classification",
+             model="dima806/deepfake_vs_real_image_detection",
+             device=-1,
+         )
+         logger.info("Deepfake detection model loaded.")
+     except Exception as e:
+         logger.warning(f"Could not load deepfake model: {e}. Using heuristic fallback.")
+         _pipe = None
+     return _pipe
+
+
+ def _make_synthetic_image(is_fake: bool):
+     from PIL import Image
+
+     rng = np.random.default_rng(seed=1 if is_fake else 99)
+     img = Image.new("RGB", (224, 224))
+     pixels = img.load()
+
+     for i in range(224):
+         for j in range(224):
+             if is_fake:
+                 r = int(128 + 60 * np.sin(i / 9.0) * np.cos(j / 9.0))
+                 g = int(128 + 60 * np.cos(i / 7.0) * np.sin(j / 11.0))
+                 b = int(128 + 40 * np.sin((i + j) / 14.0))
+             else:
+                 base = int(80 + 100 * (i / 224))
+                 noise = int(rng.normal(0, 12))
+                 r = max(0, min(255, base + noise + 20))
+                 g = max(0, min(255, base + noise))
+                 b = max(0, min(255, base + noise - 15))
+             pixels[j, i] = (
+                 max(0, min(255, r)),
+                 max(0, min(255, g)),
+                 max(0, min(255, b)),
+             )
+     return img
+
+
+ def score_deepfake(is_fake: bool) -> float:
+     pipe = _load_pipeline()
+
+     if pipe is None:
+         return 0.78 if is_fake else 0.22
+
+     try:
+         img = _make_synthetic_image(is_fake)
+         results = pipe(img)
+
+         for r in results:
+             label_lower = r["label"].lower()
+             if any(kw in label_lower for kw in ("fake", "deepfake", "manipulat", "ai_gen", "synthetic")):
+                 return float(r["score"])
+
+         top_label = results[0]["label"].lower()
+         top_score = float(results[0]["score"])
+         if any(kw in top_label for kw in ("real", "authentic", "genuine")):
+             return 1.0 - top_score
+         return top_score
+
+     except Exception as e:
+         logger.warning(f"Deepfake scoring error: {e}")
+         return 0.75 if is_fake else 0.25
+
+
+ def precompute_detector_scores(items: list) -> list:
+     enriched = []
+     for item in items:
+         is_fake = item.get("ground_truth", {}).get("is_deepfake", False)
+         item = dict(item)
+         item["detector_score"] = score_deepfake(is_fake)
+         enriched.append(item)
+     return enriched
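The label-to-score mapping inside `score_deepfake` can be looked at in isolation. The sketch below restates that logic as a pure function over a list of `{"label", "score"}` dicts, which is the shape an image-classification pipeline returns. The function name `fake_probability` is hypothetical, and the labels `"Real"` and `"Fake"` in the usage lines are assumed for illustration rather than confirmed for the `dima806` model.

```python
from typing import Dict, List

FAKE_KEYWORDS = ("fake", "deepfake", "manipulat", "ai_gen", "synthetic")
REAL_KEYWORDS = ("real", "authentic", "genuine")


def fake_probability(results: List[Dict]) -> float:
    """Map classifier output to a probability that the image is fake."""
    # Prefer any class whose label names the "fake" side directly.
    for r in results:
        if any(kw in r["label"].lower() for kw in FAKE_KEYWORDS):
            return float(r["score"])
    # Otherwise invert the top score if it names the "real" side.
    top = results[0]
    if any(kw in top["label"].lower() for kw in REAL_KEYWORDS):
        return 1.0 - float(top["score"])
    # Unknown label vocabulary: fall back to the top score unchanged.
    return float(top["score"])


if __name__ == "__main__":
    # Assumed label names, for illustration only.
    print(fake_probability([{"label": "Real", "score": 0.9}, {"label": "Fake", "score": 0.1}]))
    print(fake_probability([{"label": "Real", "score": 0.75}]))
```

Separating the mapping from model loading makes it possible to unit-test the keyword handling without downloading the model.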
server/env.py ADDED
@@ -0,0 +1,121 @@
+ import threading
+ from typing import Dict, Any, Optional
+
+ from .models import ContentObservation, StepResult, ResetResult, EnvState, ModerationAction
+ from .tasks import TASKS
+ from .graders import GRADERS
+
+
+ class ContentModerationEnv:
+     def __init__(self):
+         self._lock = threading.Lock()
+         self._s: Dict[str, Any] = {}
+         self._clear()
+
+     def _clear(self):
+         self._s = {
+             "task": None,
+             "items": [],
+             "idx": 0,
+             "total": 0,
+             "reward_sum": 0.0,
+             "done": True,
+             "history": [],
+         }
+
+     def _obs(self, item: Dict, idx: int, total: int) -> ContentObservation:
+         return ContentObservation(
+             content_id=item["content_id"],
+             content_type=item["content_type"],
+             text=item.get("text"),
+             image_description=item.get("image_description"),
+             detector_score=item.get("detector_score"),
+             metadata=item.get("metadata", {}),
+             step_num=idx,
+             total_steps=total,
+         )
+
+     def reset(self, task: str = "text_spam") -> ResetResult:
+         if task not in TASKS:
+             raise ValueError(f"Unknown task '{task}'. Valid: {list(TASKS.keys())}")
+
+         with self._lock:
+             task_cfg = TASKS[task]
+             items = list(task_cfg["items"])
+
+             if task == "deepfake_detection":
+                 from .deepfake_model import precompute_detector_scores
+                 items = precompute_detector_scores(items)
+
+             self._s = {
+                 "task": task,
+                 "items": items,
+                 "idx": 0,
+                 "total": len(items),
+                 "reward_sum": 0.0,
+                 "done": False,
+                 "history": [],
+             }
+             return ResetResult(observation=self._obs(items[0], 1, len(items)))
+
+     def step(self, action: ModerationAction) -> StepResult:
+         with self._lock:
+             if self._s["done"]:
+                 return StepResult(
+                     observation=None,
+                     reward=0.0,
+                     done=True,
+                     info={"error": "Episode finished. Call /reset first."},
+                 )
+
+             idx = self._s["idx"]
+             item = self._s["items"][idx]
+             task = self._s["task"]
+             grader = GRADERS[task]
+             action_d = action.model_dump()
+
+             if task == "deepfake_detection":
+                 reward = grader(action_d, item["ground_truth"], item.get("detector_score"))
+             else:
+                 reward = grader(action_d, item["ground_truth"])
+
+             self._s["reward_sum"] += reward
+             self._s["idx"] += 1
+             self._s["history"].append({
+                 "step": idx + 1,
+                 "content_id": item["content_id"],
+                 "action": action_d,
+                 "reward": round(reward, 4),
+                 "ground_truth": item["ground_truth"],
+             })
+
+             new_idx = self._s["idx"]
+             done = new_idx >= self._s["total"]
+             self._s["done"] = done
+
+             next_obs: Optional[ContentObservation] = None
+             if not done:
+                 next_item = self._s["items"][new_idx]
+                 next_obs = self._obs(next_item, new_idx + 1, self._s["total"])
+
+             return StepResult(
+                 observation=next_obs,
+                 reward=round(reward, 4),
+                 done=done,
+                 info={"content_id": item["content_id"], "step": idx + 1},
+             )
+
+     def state(self) -> EnvState:
+         with self._lock:
+             return EnvState(
+                 task=self._s["task"] or "none",
+                 step_num=self._s["idx"],
+                 total_steps=self._s["total"],
+                 cumulative_reward=round(self._s["reward_sum"], 4),
+                 done=self._s["done"],
+                 history=list(self._s["history"]),
+             )
+
+     def close(self):
+         with self._lock:
+             self._clear()
server/graders.py ADDED
@@ -0,0 +1,94 @@
+ from typing import Dict, Any, List, Optional
+
+
+ def _label_f1(predicted: List[str], true: List[str]) -> float:
+     # Set-based F1 between predicted and ground-truth labels.
+     pred_set, true_set = set(predicted), set(true)
+     if not true_set and not pred_set:
+         return 1.0  # clean content, no labels predicted
+     if not true_set:
+         return 0.5  # spurious labels on clean content get half credit
+     if not pred_set:
+         return 0.0
+     tp = len(pred_set & true_set)
+     precision = tp / len(pred_set)
+     recall = tp / len(true_set)
+     if precision + recall == 0:
+         return 0.0
+     return 2 * precision * recall / (precision + recall)
+
+
+ def _calibration_bonus(confidence: float, correct: bool) -> float:
+     # Reward confident correct decisions; penalize confident wrong ones more heavily.
+     if correct:
+         return 0.1 * confidence
+     return -0.15 * confidence
+
+
+ def grade_text_spam(action: Dict, ground_truth: Dict) -> float:
+     decision = action.get("decision", "")
+     confidence = float(action.get("confidence", 0.5))
+     correct = decision == ground_truth["decision"]
+
+     if correct:
+         base = 0.65
+     elif ground_truth["is_harmful"] and decision == "escalate":
+         base = 0.30
+     elif not ground_truth["is_harmful"] and decision == "flag":
+         base = 0.20
+     else:
+         base = 0.0
+
+     label_score = 0.20 * _label_f1(action.get("labels", []), ground_truth["labels"])
+     return min(1.0, max(0.0, base + label_score + _calibration_bonus(confidence, correct)))
+
+
+ def grade_content_moderation(action: Dict, ground_truth: Dict) -> float:
+     decision = action.get("decision", "")
+     confidence = float(action.get("confidence", 0.5))
+     correct = decision == ground_truth["decision"]
+
+     if correct:
+         decision_score = 0.50
+     elif ground_truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
+         decision_score = 0.25
+     elif not ground_truth["is_harmful"] and decision == "escalate":
+         decision_score = 0.15
+     else:
+         decision_score = 0.0
+
+     label_score = 0.35 * _label_f1(action.get("labels", []), ground_truth["labels"])
+     return min(1.0, max(0.0, decision_score + label_score + _calibration_bonus(confidence, correct)))
+
+
+ def grade_deepfake(action: Dict, ground_truth: Dict, detector_score: Optional[float] = None) -> float:
+     decision = action.get("decision", "")
+     confidence = float(action.get("confidence", 0.5))
+     labels = action.get("labels", [])
+
+     is_deepfake = ground_truth.get("is_deepfake", False)
+     correct_decision = decision == ground_truth["decision"]
+
+     # The agent "detects" a deepfake by including the "deepfake" label.
+     detected_deepfake = "deepfake" in [l.lower() for l in labels]
+     detection_correct = detected_deepfake == is_deepfake
+
+     if correct_decision:
+         decision_score = 0.40
+     elif ground_truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
+         decision_score = 0.20
+     else:
+         decision_score = 0.0
+
+     detection_score = 0.30 if detection_correct else 0.0
+
+     if detector_score is not None:
+         # Bonus when the detector agrees with ground truth; this term does not depend on the action.
+         model_aligned = (detector_score > 0.5) == is_deepfake
+         detection_score += 0.10 if model_aligned else 0.0
+
+     label_score = 0.20 * _label_f1(labels, ground_truth["labels"])
+     return min(1.0, max(0.0, decision_score + detection_score + label_score + _calibration_bonus(confidence, correct_decision)))
+
+
+ GRADERS = {
+     "text_spam": grade_text_spam,
+     "content_moderation": grade_content_moderation,
+     "deepfake_detection": grade_deepfake,
+ }
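To make the reward arithmetic concrete, here is a worked example for the `text_spam` grader, written as a standalone sketch. `label_f1` and `calibration` restate the private helpers from `graders.py` under hypothetical public names; the weights 0.65 and 0.20 are the ones `grade_text_spam` uses, and the item is `ts_001` from `tasks.py`.

```python
from typing import List


def label_f1(predicted: List[str], true: List[str]) -> float:
    """Set-based F1 with the same edge cases as graders._label_f1."""
    pred_set, true_set = set(predicted), set(true)
    if not true_set and not pred_set:
        return 1.0
    if not true_set:
        return 0.5
    if not pred_set:
        return 0.0
    tp = len(pred_set & true_set)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_set), tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


def calibration(confidence: float, correct: bool) -> float:
    """Bonus of 0.1 * confidence if correct, penalty of 0.15 * confidence if not."""
    return 0.1 * confidence if correct else -0.15 * confidence


# The agent rejects ts_001 with confidence 0.9 but predicts only "spam";
# ground truth is decision="reject", labels=["spam", "scam"].
f1 = label_f1(["spam"], ["spam", "scam"])   # precision 1.0, recall 0.5 -> 2/3
reward = 0.65 + 0.20 * f1 + calibration(0.9, True)
print(round(f1, 4), round(reward, 4))       # 0.6667 0.8733
```

The same decision made with the wrong verdict would lose the 0.65 base and pay a 0.135 calibration penalty, which is why overconfident errors score worse than hedged ones.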
server/main.py ADDED
@@ -0,0 +1,61 @@
+ from fastapi import FastAPI, HTTPException, Request
+ from fastapi.responses import JSONResponse, RedirectResponse
+
+ from .models import ModerationAction, StepResult, ResetResult, EnvState, ResetRequest
+ from .env import ContentModerationEnv
+ from .tasks import TASKS
+
+ app = FastAPI(title="Content Moderation OpenEnv", version="1.0.0")
+ _env = ContentModerationEnv()
+
+
+ @app.get("/")
+ async def root():
+     return RedirectResponse(url="/docs")
+
+
+ @app.post("/reset", response_model=ResetResult)
+ async def reset(request: Request):
+     try:
+         body = await request.json()
+     except Exception:
+         body = {}
+     task = (body or {}).get("task", "text_spam")
+     try:
+         return _env.reset(task=task)
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+
+ @app.post("/step", response_model=StepResult)
+ def step(action: ModerationAction):
+     return _env.step(action)
+
+
+ @app.get("/state", response_model=EnvState)
+ def state():
+     return _env.state()
+
+
+ @app.post("/close")
+ def close():
+     _env.close()
+     return {"status": "closed"}
+
+
+ @app.get("/tasks")
+ def list_tasks():
+     return {
+         name: {
+             "description": t["description"],
+             "difficulty": t["difficulty"],
+             "num_items": len(t["items"]),
+             "content_type": t["content_type"],
+         }
+         for name, t in TASKS.items()
+     }
+
+
+ @app.get("/health")
+ def health():
+     return {"status": "ok"}
server/models.py ADDED
@@ -0,0 +1,44 @@
+ from pydantic import BaseModel, Field
+ from typing import Optional, Dict, Any, List
+
+
+ class ModerationAction(BaseModel):
+     decision: str
+     reason: str
+     confidence: float = Field(ge=0.0, le=1.0)
+     labels: List[str] = []
+
+
+ class ContentObservation(BaseModel):
+     content_id: str
+     content_type: str
+     text: Optional[str] = None
+     image_description: Optional[str] = None
+     detector_score: Optional[float] = None
+     metadata: Dict[str, Any] = {}
+     step_num: int
+     total_steps: int
+
+
+ class StepResult(BaseModel):
+     observation: Optional[ContentObservation] = None
+     reward: float
+     done: bool
+     info: Dict[str, Any] = {}
+
+
+ class ResetResult(BaseModel):
+     observation: ContentObservation
+
+
+ class EnvState(BaseModel):
+     task: str
+     step_num: int
+     total_steps: int
+     cumulative_reward: float
+     done: bool
+     history: List[Dict[str, Any]] = []
+
+
+ class ResetRequest(BaseModel):
+     task: Optional[str] = "text_spam"
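`ModerationAction` types `decision` as a plain `str`, so any string passes validation and an unknown decision simply earns no decision credit from the graders. One possible way to reject such actions earlier is sketched below using only the standard library; `CheckedAction` is a hypothetical class for illustration and is not part of the commit.

```python
from dataclasses import dataclass, field
from typing import List

VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}


@dataclass
class CheckedAction:
    """Hypothetical stricter counterpart of ModerationAction (stdlib only)."""
    decision: str
    reason: str
    confidence: float
    labels: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject decisions outside the four documented values.
        if self.decision not in VALID_DECISIONS:
            raise ValueError(f"invalid decision: {self.decision!r}")
        # Mirror the ge=0.0 / le=1.0 bounds on ModerationAction.confidence.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


if __name__ == "__main__":
    print(CheckedAction("flag", "Borderline wording.", 0.6))
    try:
        CheckedAction("delete", "Not a documented decision.", 0.5)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Whether to fail fast like this or to keep accepting arbitrary strings is a design choice: strict validation surfaces agent formatting errors as HTTP errors, while the permissive model lets the grader score them as wrong answers.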
server/openenv.yaml ADDED
@@ -0,0 +1,89 @@
+ name: content-moderation-env
+ version: "1.0.0"
+ description: >
+   AI-powered content moderation environment. Agents triage text, social
+   posts, and multimodal content (including deepfake detection) across
+   three difficulty levels using the standard OpenEnv step/reset/state API.
+ author: openenv-participant
+ license: MIT
+
+ tasks:
+   - id: text_spam
+     difficulty: easy
+     description: Classify email/message content as spam or legitimate
+     content_type: text
+     num_items: 5
+     score_range: [0.0, 1.0]
+
+   - id: content_moderation
+     difficulty: medium
+     description: Multi-label social media content moderation
+     content_type: text
+     num_items: 5
+     score_range: [0.0, 1.0]
+
+   - id: deepfake_detection
+     difficulty: hard
+     description: Detect AI-manipulated/deepfake media and make moderation decisions
+     content_type: multimodal
+     num_items: 5
+     score_range: [0.0, 1.0]
+
+ action_space:
+   type: object
+   fields:
+     decision:
+       type: string
+       enum: [approve, reject, escalate, flag]
+     reason:
+       type: string
+     confidence:
+       type: float
+       range: [0.0, 1.0]
+     labels:
+       type: array
+       items: string
+       valid_values:
+         - spam
+         - scam
+         - phishing
+         - pharmaceutical_spam
+         - hate_speech
+         - violence
+         - harassment
+         - misinformation
+         - adult_content
+         - deepfake
+         - political_manipulation
+         - fraud
+
+ observation_space:
+   type: object
+   fields:
+     content_id: string
+     content_type: string
+     text: optional string
+     image_description: optional string
+     detector_score: optional float
+     metadata: object
+     step_num: integer
+     total_steps: integer
+
+ endpoints:
+   reset: POST /reset
+   step: POST /step
+   state: GET /state
+   close: POST /close
+   tasks: GET /tasks
+   health: GET /health
+
+ docker:
+   context: server
+   dockerfile: server/Dockerfile
+
+ huggingface:
+   space_sdk: docker
+   tags:
+     - openenv
+     - content-moderation
+     - deepfake-detection
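The `action_space` section above constrains what an agent may send to `/step`. As a rough illustration only, the sketch below checks an action dictionary against those constraints. The function name `validate_action` is hypothetical and is not part of this commit, and treating a missing `labels` field as an empty list is an assumption; the allowed decisions and labels are copied from the YAML.

```python
from typing import Any

# Values copied from the action_space section of openenv.yaml.
VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = {
    "spam", "scam", "phishing", "pharmaceutical_spam", "hate_speech",
    "violence", "harassment", "misinformation", "adult_content",
    "deepfake", "political_manipulation", "fraud",
}


def validate_action(action: dict[str, Any]) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if the action fits)."""
    problems = []

    if action.get("decision") not in VALID_DECISIONS:
        problems.append(f"decision must be one of {sorted(VALID_DECISIONS)}")

    if not isinstance(action.get("reason"), str):
        problems.append("reason must be a string")

    confidence = action.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        isinstance(confidence, bool)
        or not isinstance(confidence, (int, float))
        or not 0.0 <= confidence <= 1.0
    ):
        problems.append("confidence must be a number in [0.0, 1.0]")

    # Assumption: a missing labels field is treated as an empty list.
    labels = action.get("labels", [])
    if not isinstance(labels, list):
        problems.append("labels must be a list of strings")
    else:
        unknown = [
            label for label in labels
            if not isinstance(label, str) or label not in VALID_LABELS
        ]
        if unknown:
            problems.append(f"unknown labels: {unknown}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with an action in a single response.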
server/requirements.txt ADDED
@@ -0,0 +1,12 @@
+ fastapi==0.111.0
+ uvicorn[standard]==0.29.0
+ pydantic==2.7.1
+ transformers==4.41.2
+ torch==2.3.0
+ torchvision==0.18.0
+ Pillow==10.3.0
+ numpy==1.26.4
+ accelerate==0.30.0
+ openai>=1.40.0
+ requests==2.31.0
+ python-dotenv==1.0.0
server/tasks.py ADDED
@@ -0,0 +1,192 @@
+ TASKS = {
+     "text_spam": {
+         "description": "Classify email/message content as spam or legitimate",
+         "difficulty": "easy",
+         "content_type": "text",
+         "items": [
+             {
+                 "content_id": "ts_001",
+                 "content_type": "text",
+                 "text": "CONGRATULATIONS! You've won $1,000,000! Click here NOW to claim your prize!!! Limited time offer!!!",
+                 "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
+                 "ground_truth": {"decision": "reject", "labels": ["spam", "scam"], "is_harmful": True},
+             },
+             {
+                 "content_id": "ts_002",
+                 "content_type": "text",
+                 "text": "Hi Sarah, confirming our meeting tomorrow at 3pm in the main conference room. Let me know if the time still works.",
+                 "metadata": {"source": "email", "sender_reputation": 0.92, "link_count": 0},
+                 "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+             },
+             {
+                 "content_id": "ts_003",
+                 "content_type": "text",
+                 "text": "FREE MEDS NO PRESCRIPTION NEEDED BUY NOW CHEAPEST PRICES CLICK LINK BELOW",
+                 "metadata": {"source": "email", "sender_reputation": 0.02, "link_count": 5},
+                 "ground_truth": {"decision": "reject", "labels": ["spam", "pharmaceutical_spam"], "is_harmful": True},
+             },
+             {
+                 "content_id": "ts_004",
+                 "content_type": "text",
+                 "text": "Your Amazon order #112-3456789 has shipped and will arrive by Thursday. Track at amazon.com/orders.",
+                 "metadata": {"source": "email", "sender_reputation": 0.98, "link_count": 1},
+                 "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+             },
+             {
+                 "content_id": "ts_005",
+                 "content_type": "text",
+                 "text": "URGENT: Your account will be SUSPENDED in 24hrs unless you verify immediately. Click: bit.ly/verify-acct-now",
+                 "metadata": {"source": "email", "sender_reputation": 0.08, "link_count": 2},
+                 "ground_truth": {"decision": "reject", "labels": ["spam", "phishing"], "is_harmful": True},
+             },
+         ],
+     },
+
+     "content_moderation": {
+         "description": "Multi-label moderation for social media posts",
+         "difficulty": "medium",
+         "content_type": "text",
+         "items": [
+             {
+                 "content_id": "cm_001",
+                 "content_type": "text",
+                 "text": "Gorgeous sunset from my hike today! The trail to the summit was tough but totally worth the view.",
+                 "metadata": {"platform": "social_media", "report_count": 0, "user_history_violations": 0},
+                 "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+             },
+             {
+                 "content_id": "cm_002",
+                 "content_type": "text",
+                 "text": "I will find that [slur] and make them regret it. They deserve everything coming to them.",
+                 "metadata": {"platform": "social_media", "report_count": 18, "user_history_violations": 3},
+                 "ground_truth": {"decision": "reject", "labels": ["hate_speech", "violence", "harassment"], "is_harmful": True},
+             },
+             {
+                 "content_id": "cm_003",
+                 "content_type": "text",
+                 "text": "Lose 30lbs in 7 days with this one weird trick! Doctors HATE this! DM me to buy, only $89.",
+                 "metadata": {"platform": "social_media", "report_count": 11, "user_history_violations": 1},
+                 "ground_truth": {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True},
+             },
+             {
+                 "content_id": "cm_004",
+                 "content_type": "text",
+                 "text": "The election was stolen. The globalist elites are replacing us and controlling everything. Rise up!",
+                 "metadata": {"platform": "social_media", "report_count": 27, "user_history_violations": 2},
+                 "ground_truth": {"decision": "flag", "labels": ["misinformation", "hate_speech"], "is_harmful": True},
+             },
+             {
+                 "content_id": "cm_005",
+                 "content_type": "text",
+                 "text": "Book club reminder: we meet Thursday 7pm to discuss chapters 12-15. Bring snacks if you can!",
+                 "metadata": {"platform": "social_media", "report_count": 0, "user_history_violations": 0},
+                 "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
+             },
+         ],
+     },
+
+     "deepfake_detection": {
+         "description": "Detect AI-manipulated media and make moderation decisions",
+         "difficulty": "hard",
+         "content_type": "multimodal",
+         "items": [
+             {
+                 "content_id": "df_001",
+                 "content_type": "multimodal",
+                 "text": "CEO announces surprise acquisition deal — photo circulating widely on financial forums",
+                 "image_description": "Corporate executive at podium. Facial proportions slightly asymmetric. Background blur shows unnatural transitions. Hair boundary shows pixel artifacts. Lighting on face inconsistent with background.",
+                 "metadata": {
+                     "source": "social_media",
+                     "claimed_subject": "public_executive",
+                     "virality_score": 0.92,
+                     "exif_stripped": True,
+                     "upload_device": "unknown",
+                 },
+                 "ground_truth": {
+                     "decision": "reject",
+                     "labels": ["deepfake", "misinformation", "fraud"],
+                     "is_deepfake": True,
+                     "is_harmful": True,
+                 },
+             },
+             {
+                 "content_id": "df_002",
+                 "content_type": "multimodal",
+                 "text": "Family reunion photo shared to personal profile",
+                 "image_description": "Outdoor gathering, natural daylight. Shadows consistent with single light source. Facial textures normal and varied. Background in sharp focus with depth-of-field matching focal length.",
+                 "metadata": {
+                     "source": "personal_profile",
+                     "claimed_subject": "private_individual",
+                     "virality_score": 0.03,
+                     "exif_stripped": False,
+                     "upload_device": "iPhone_15",
+                 },
+                 "ground_truth": {
+                     "decision": "approve",
+                     "labels": [],
+                     "is_deepfake": False,
+                     "is_harmful": False,
+                 },
+             },
+             {
+                 "content_id": "df_003",
+                 "content_type": "multimodal",
+                 "text": "Politician caught making controversial statement — clip going viral across platforms",
+                 "image_description": "Political figure at microphone. Eye blink rate irregular. Lip sync offset approximately 2 frames. Neck-to-face skin tone boundary visible. Ear shape subtly non-anatomical.",
+                 "metadata": {
+                     "source": "video_platform",
+                     "claimed_subject": "elected_official",
+                     "virality_score": 0.88,
+                     "exif_stripped": True,
+                     "upload_device": "unknown",
+                 },
+                 "ground_truth": {
+                     "decision": "reject",
+                     "labels": ["deepfake", "political_manipulation", "misinformation"],
+                     "is_deepfake": True,
+                     "is_harmful": True,
+                 },
+             },
+             {
+                 "content_id": "df_004",
+                 "content_type": "multimodal",
+                 "text": "Famous actor endorsing cryptocurrency — advertisement spreading across social media",
+                 "image_description": "Known celebrity in studio setting. Jaw boundary shows feathering artifacts. Teeth appear uniform and unnaturally perfect. Reflection in eyes does not match claimed studio environment.",
+                 "metadata": {
+                     "source": "paid_advertisement",
+                     "claimed_subject": "celebrity",
+                     "virality_score": 0.75,
+                     "exif_stripped": True,
+                     "upload_device": "unknown",
+                 },
+                 "ground_truth": {
+                     "decision": "reject",
+                     "labels": ["deepfake", "fraud", "scam"],
+                     "is_deepfake": True,
+                     "is_harmful": True,
+                 },
+             },
+             {
+                 "content_id": "df_005",
+                 "content_type": "multimodal",
+                 "text": "News anchor presenting evening broadcast — clip shared from verified channel",
+                 "image_description": "Professional broadcast studio. Consistent studio lighting. Natural eye movement and blink rate. Source watermark present. Metadata intact from broadcast system.",
+                 "metadata": {
+                     "source": "verified_broadcaster",
+                     "claimed_subject": "journalist",
+                     "virality_score": 0.35,
+                     "exif_stripped": False,
+                     "upload_device": "broadcast_encoder",
+                 },
+                 "ground_truth": {
+                     "decision": "approve",
+                     "labels": [],
+                     "is_deepfake": False,
+                     "is_harmful": False,
+                 },
+             },
+         ],
+     },
+ }
+
+ TASK_NAMES = list(TASKS.keys())
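Because every item in `TASKS` carries a `ground_truth` decision, it can be useful to see how those decisions are distributed per task. The following is a minimal sketch, assuming the dictionary layout shown above. `decision_counts` and `TASKS_EXCERPT` are hypothetical names introduced here; the excerpt holds two items trimmed from `text_spam` so the block runs on its own.

```python
from collections import Counter

# Trimmed excerpt of TASKS (two text_spam items from server/tasks.py),
# included only so this sketch runs without importing the server package.
TASKS_EXCERPT = {
    "text_spam": {
        "items": [
            {
                "content_id": "ts_001",
                "ground_truth": {"decision": "reject", "labels": ["spam", "scam"], "is_harmful": True},
            },
            {
                "content_id": "ts_002",
                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
            },
        ],
    },
}


def decision_counts(tasks: dict) -> dict:
    """Hypothetical helper: count ground-truth decisions for each task."""
    return {
        name: Counter(item["ground_truth"]["decision"] for item in task["items"])
        for name, task in tasks.items()
    }


if __name__ == "__main__":
    for name, counts in decision_counts(TASKS_EXCERPT).items():
        print(name, dict(counts))
```

Passing the full `TASKS` dictionary instead of the excerpt should show that no item uses `escalate` as its ground-truth decision, which may matter when interpreting how graders score that action.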
test/test.py ADDED
@@ -0,0 +1,243 @@
+ import sys
+ import os
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+ import pytest
+ from server.models import ModerationAction, ContentObservation, StepResult, ResetResult, EnvState
+ from server.env import ContentModerationEnv
+ from server.graders import grade_text_spam, grade_content_moderation, grade_deepfake, GRADERS
+ from server.tasks import TASKS, TASK_NAMES
+
+
+ def make_action(decision="approve", reason="test", confidence=0.8, labels=None):
+     return ModerationAction(decision=decision, reason=reason, confidence=confidence, labels=labels or [])
+
+
+ def make_action_dict(decision="approve", reason="test", confidence=0.8, labels=None):
+     return {"decision": decision, "reason": reason, "confidence": confidence, "labels": labels or []}
+
+
+ # --- Task data ---
+
+ def test_all_tasks_present():
+     assert set(TASK_NAMES) == {"text_spam", "content_moderation", "deepfake_detection"}
+
+
+ def test_each_task_has_five_items():
+     for name, task in TASKS.items():
+         assert len(task["items"]) == 5, f"{name} should have 5 items"
+
+
+ def test_ground_truth_keys():
+     for name, task in TASKS.items():
+         for item in task["items"]:
+             gt = item["ground_truth"]
+             assert "decision" in gt
+             assert "labels" in gt
+             assert gt["decision"] in ("approve", "reject", "flag", "escalate")
+
+
+ def test_deepfake_items_have_is_deepfake():
+     for item in TASKS["deepfake_detection"]["items"]:
+         assert "is_deepfake" in item["ground_truth"]
+
+
+ # --- Graders ---
+
+ def test_grade_spam_correct_reject():
+     gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
+     r = grade_text_spam(make_action_dict("reject", confidence=0.9, labels=["spam"]), gt)
+     assert 0.8 <= r <= 1.0
+
+
+ def test_grade_spam_wrong_decision_penalised():
+     gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
+     r = grade_text_spam(make_action_dict("approve", confidence=0.9), gt)
+     assert r < 0.3
+
+
+ def test_grade_spam_escalate_partial_credit():
+     gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
+     r = grade_text_spam(make_action_dict("escalate", confidence=0.5), gt)
+     assert 0.15 <= r <= 0.45
+
+
+ def test_grade_spam_correct_approve():
+     gt = {"decision": "approve", "labels": [], "is_harmful": False}
+     r = grade_text_spam(make_action_dict("approve", confidence=0.85), gt)
+     assert r >= 0.6
+
+
+ def test_grade_content_mod_correct_with_labels():
+     gt = {"decision": "reject", "labels": ["hate_speech", "violence"], "is_harmful": True}
+     r = grade_content_moderation(
+         make_action_dict("reject", confidence=0.9, labels=["hate_speech", "violence"]), gt
+     )
+     assert r >= 0.7
+
+
+ def test_grade_content_mod_label_partial():
+     gt = {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True}
+     r_partial = grade_content_moderation(
+         make_action_dict("flag", confidence=0.7, labels=["misinformation"]), gt
+     )
+     r_full = grade_content_moderation(
+         make_action_dict("flag", confidence=0.7, labels=["misinformation", "spam"]), gt
+     )
+     assert r_partial < r_full
+
+
+ def test_grade_deepfake_detected_and_correct():
+     gt = {"decision": "reject", "labels": ["deepfake", "fraud"], "is_deepfake": True, "is_harmful": True}
+     r = grade_deepfake(
+         make_action_dict("reject", confidence=0.9, labels=["deepfake", "fraud"]), gt, detector_score=0.85
+     )
+     assert r >= 0.7
+
+
+ def test_grade_deepfake_real_correct():
+     gt = {"decision": "approve", "labels": [], "is_deepfake": False, "is_harmful": False}
+     r = grade_deepfake(make_action_dict("approve", confidence=0.8), gt, detector_score=0.1)
+     assert r >= 0.5
+
+
+ def test_grade_deepfake_missed_deepfake():
+     gt = {"decision": "reject", "labels": ["deepfake"], "is_deepfake": True, "is_harmful": True}
+     r_miss = grade_deepfake(make_action_dict("approve", confidence=0.8), gt)
+     r_detect = grade_deepfake(make_action_dict("reject", confidence=0.8, labels=["deepfake"]), gt)
+     assert r_miss < r_detect
+
+
+ def test_all_rewards_in_range():
+     for task_name in TASK_NAMES:
+         task = TASKS[task_name]
+         grader = GRADERS[task_name]
+         for item in task["items"]:
+             for decision in ("approve", "reject", "flag", "escalate"):
+                 action = make_action_dict(decision, confidence=0.5, labels=["spam"])
+                 if task_name == "deepfake_detection":
+                     r = grader(action, item["ground_truth"], 0.5)
+                 else:
+                     r = grader(action, item["ground_truth"])
+                 assert 0.0 <= r <= 1.0, f"{task_name} reward out of range: {r}"
+
+
+ # --- Environment ---
+
+ def test_reset_returns_first_observation():
+     env = ContentModerationEnv()
+     result = env.reset("text_spam")
+     assert isinstance(result, ResetResult)
+     obs = result.observation
+     assert obs.step_num == 1
+     assert obs.total_steps == 5
+     assert obs.content_id == "ts_001"
+
+
+ def test_step_advances_state():
+     env = ContentModerationEnv()
+     env.reset("text_spam")
+     action = make_action("reject")
+     result = env.step(action)
+     assert isinstance(result, StepResult)
+     assert 0.0 <= result.reward <= 1.0
+     assert result.observation is not None
+     assert result.observation.step_num == 2
+
+
+ def test_episode_ends_after_all_items():
+     env = ContentModerationEnv()
+     env.reset("text_spam")
+     done = False
+     steps = 0
+     while not done:
+         r = env.step(make_action("escalate"))
+         done = r.done
+         steps += 1
+     assert steps == 5
+     assert r.observation is None
+
+
+ def test_step_after_done_returns_error():
+     env = ContentModerationEnv()
+     env.reset("text_spam")
+     for _ in range(5):
+         env.step(make_action("approve"))
+     result = env.step(make_action("approve"))
+     assert result.done is True
+     assert "error" in result.info
+
+
+ def test_state_tracks_cumulative_reward():
+     env = ContentModerationEnv()
+     env.reset("content_moderation")
+     env.step(make_action("approve", confidence=0.9))
+     env.step(make_action("reject", confidence=0.9, labels=["hate_speech"]))
+     st = env.state()
+     assert isinstance(st, EnvState)
+     assert st.step_num == 2
+     assert st.cumulative_reward >= 0.0
+     assert len(st.history) == 2
+
+
+ def test_reset_different_tasks():
+     env = ContentModerationEnv()
+     for task in TASK_NAMES:
+         if task == "deepfake_detection":
+             continue
+         r = env.reset(task)
+         assert r.observation.total_steps == 5
+
+
+ def test_invalid_task_raises():
+     env = ContentModerationEnv()
+     with pytest.raises(ValueError):
+         env.reset("nonexistent_task")
+
+
+ def test_close_resets_env():
+     env = ContentModerationEnv()
+     env.reset("text_spam")
+     env.step(make_action("approve"))
+     env.close()
+     st = env.state()
+     assert st.task == "none"
+     assert st.done is True
+
+
+ def test_content_moderation_full_run():
+     env = ContentModerationEnv()
+     env.reset("content_moderation")
+     actions = [
+         make_action("approve"),
+         make_action("reject", labels=["hate_speech", "violence"]),
+         make_action("flag", labels=["misinformation"]),
+         make_action("flag", labels=["misinformation", "hate_speech"]),
+         make_action("approve"),
+     ]
+     total_reward = 0.0
+     for action in actions:
+         result = env.step(action)
+         total_reward += result.reward
+     assert result.done is True
+     assert total_reward >= 0.0
+     st = env.state()
+     assert abs(st.cumulative_reward - total_reward) < 0.01
+
+
+ def test_observation_fields_populated():
+     env = ContentModerationEnv()
+     r = env.reset("content_moderation")
+     obs = r.observation
+     assert obs.content_id is not None
+     assert obs.content_type == "text"
+     assert obs.text is not None
+     assert obs.metadata is not None
+
+
+ def test_deepfake_obs_has_image_description():
+     env = ContentModerationEnv()
+     r = env.reset("deepfake_detection")
+     obs = r.observation
+     assert obs.image_description is not None
+     assert obs.content_type == "multimodal"
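The environment tests above imply an episode contract: `reset` yields the first observation, each `step` returns a reward plus a `done` flag, the final step has no observation, and stepping after the end returns an error in `info`. The sketch below mirrors that contract with a toy stand-in. `StubEnv`, `StubStep`, and `run_episode` are hypothetical names, the exact-match scoring is a placeholder, and none of it is the repository's `ContentModerationEnv` or its graders.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StubStep:
    """Toy counterpart of a step result: reward, done flag, next observation, info."""
    reward: float
    done: bool
    observation: Optional[dict]
    info: dict = field(default_factory=dict)


class StubEnv:
    """Hypothetical stand-in that follows the contract the tests describe."""

    def __init__(self, items: list):
        self._items = items
        self._idx = 0

    def reset(self) -> dict:
        self._idx = 0
        return self._items[0]

    def step(self, decision: str) -> StubStep:
        if self._idx >= len(self._items):
            # Stepping past the end reports an error instead of raising.
            return StubStep(0.0, True, None, {"error": "episode finished"})
        truth = self._items[self._idx]["truth"]
        reward = 1.0 if decision == truth else 0.0  # toy scoring, not the real graders
        self._idx += 1
        done = self._idx >= len(self._items)
        next_obs = None if done else self._items[self._idx]
        return StubStep(reward, done, next_obs)


def run_episode(env: StubEnv, policy: Callable[[dict], str]) -> tuple:
    """Drive one episode to completion and return (total_reward, steps)."""
    obs = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        result = env.step(policy(obs))
        total += result.reward
        steps += 1
        done = result.done
        obs = result.observation
    return total, steps


if __name__ == "__main__":
    items = [{"text": "a", "truth": "reject"}, {"text": "b", "truth": "approve"}]
    print(run_episode(StubEnv(items), lambda obs: "reject"))
```

Keeping the loop separate from the environment means the same `run_episode` shape could, in principle, drive any object exposing compatible `reset` and `step` methods.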