Commit eb0a4a1 · 0 parent(s)
first commit
Files changed:
- .gitignore +7 -0
- README.md +210 -0
- inference.py +238 -0
- server/Dockerfile +27 -0
- server/__init__.py +0 -0
- server/deepfake_model.py +89 -0
- server/env.py +121 -0
- server/graders.py +94 -0
- server/main.py +61 -0
- server/models.py +44 -0
- server/openenv.yaml +89 -0
- server/requirements.txt +12 -0
- server/tasks.py +192 -0
- test/test.py +243 -0
.gitignore
ADDED
@@ -0,0 +1,7 @@
__pycache__
*.pyc
.env

.sixth
.pytest_cache
.coverage
README.md
ADDED
@@ -0,0 +1,210 @@
# Content Moderation OpenEnv

An AI content moderation environment built to the OpenEnv specification. Agents triage real-world content (spam emails, harmful social media posts, and AI-generated deepfakes) using a standard `step()` / `reset()` / `state()` API.

---

## Environment Description

Content moderation is a high-stakes, high-volume real-world task. Human moderators review millions of items daily. This environment simulates that pipeline across three difficulty levels:

| Task | Difficulty | Content | Description |
|---|---|---|---|
| `text_spam` | Easy | Email / SMS | Binary spam vs. legitimate classification |
| `content_moderation` | Medium | Social media posts | Multi-label violation detection |
| `deepfake_detection` | Hard | Multimodal | Detect AI-manipulated media + make policy decisions |

Models used:
- **Deepfake detection**: `dima806/deepfake_vs_real_image_detection` (ViT, ~300MB)
- **Agent LLM**: `meta-llama/Llama-3.1-8B-Instruct` (via transformers pipeline or HF API)

---

## Action Space

```json
{
  "decision": "approve | reject | escalate | flag",
  "reason": "string (explanation)",
  "confidence": 0.0–1.0,
  "labels": ["spam", "scam", "phishing", "hate_speech", "violence",
             "harassment", "misinformation", "adult_content",
             "deepfake", "political_manipulation", "fraud"]
}
```

- **approve**: content is safe
- **reject**: clear policy violation, remove immediately
- **flag**: borderline, surface to human review
- **escalate**: uncertain, route to specialist
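
A minimal sketch of building an action payload on the client side, assuming the field names and label list shown above. The helper name `make_action` is hypothetical and not part of this repo; the server's own validation lives in `server/models.py` and may differ.

```python
from typing import Any, Dict, List

# Allowed values copied from the action schema above.
DECISIONS = {"approve", "reject", "escalate", "flag"}
LABELS = {
    "spam", "scam", "phishing", "hate_speech", "violence",
    "harassment", "misinformation", "adult_content",
    "deepfake", "political_manipulation", "fraud",
}


def make_action(decision: str, reason: str, confidence: float, labels: List[str]) -> Dict[str, Any]:
    """Build an action dict; reject unknown decisions and drop unknown labels."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    return {
        "decision": decision,
        "reason": reason,
        "confidence": max(0.0, min(1.0, float(confidence))),  # clamp to [0, 1]
        "labels": [label for label in labels if label in LABELS],
    }


action = make_action("reject", "Obvious prize scam.", 1.4, ["scam", "not_a_label"])
print(action)
```

Clamping and filtering on the client keeps a noisy LLM response from producing a payload the server would refuse.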

---

## Observation Space

```json
{
  "content_id": "string",
  "content_type": "text | multimodal",
  "text": "string (optional)",
  "image_description": "string (optional, deepfake task)",
  "detector_score": 0.0–1.0 (optional, deepfake task),
  "metadata": { "source": "...", "report_count": 0, ... },
  "step_num": 1,
  "total_steps": 5
}
```

The `detector_score` field (deepfake task only) is the raw output from the `dima806/deepfake_vs_real_image_detection` model; higher values indicate a higher probability of a deepfake.
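
As an illustration only, an agent could turn `detector_score` into a coarse hint before prompting the LLM. The function name and the `0.8` / `0.2` thresholds below are assumptions made for this sketch, not values used anywhere in this repo.

```python
from typing import Any, Dict, Optional


def detector_hint(obs: Dict[str, Any], high: float = 0.8, low: float = 0.2) -> Optional[str]:
    """Map the optional detector_score to a coarse hint (thresholds are hypothetical)."""
    score = obs.get("detector_score")
    if score is None:
        return None  # text-only tasks carry no detector score
    if score >= high:
        return "likely_fake"
    if score <= low:
        return "likely_real"
    return "uncertain"


print(detector_hint({"content_id": "df_1", "detector_score": 0.91}))
print(detector_hint({"content_id": "spam_1", "text": "hello"}))
```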

---

## Reward Functions

Each reward is clipped to the range 0.0–1.0.

### text_spam (easy)
- Correct decision: **+0.65**
- Partial credit: **+0.30** for escalate on harmful content, **+0.20** for flag on legitimate content
- Label F1 contribution: up to **+0.20**
- Confidence calibration: **+0.10 × confidence** if the decision is correct, **−0.15 × confidence** otherwise

### content_moderation (medium)
- Correct decision: **+0.50**
- Partial credit: **+0.25** for flag/escalate/reject on harmful content, **+0.15** for escalate on legitimate content
- Label F1 contribution: up to **+0.35**
- Confidence calibration: **+0.10 × confidence** if the decision is correct, **−0.15 × confidence** otherwise

### deepfake_detection (hard)
- Correct decision: **+0.40**
- Deepfake detection accuracy: **+0.30**
- Detector model alignment bonus: **+0.10**
- Label F1 contribution: **+0.20**
- Confidence calibration: **±0.10**
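
A worked example of how the `text_spam` components combine, written as a sketch that mirrors `grade_text_spam` and its helpers in `server/graders.py` from this commit. The function names here are illustrative, and the ground-truth dict is made up for the example.

```python
from typing import Dict, List


def label_f1(predicted: List[str], true: List[str]) -> float:
    """Set-based F1 over labels, with the same edge cases as server/graders.py."""
    pred_set, true_set = set(predicted), set(true)
    if not true_set and not pred_set:
        return 1.0
    if not true_set:
        return 0.5
    if not pred_set:
        return 0.0
    tp = len(pred_set & true_set)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_set), tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


def text_spam_reward(action: Dict, truth: Dict) -> float:
    """Decision score + 0.20 * label F1 + calibration term, clipped to [0, 1]."""
    decision, confidence = action["decision"], float(action["confidence"])
    correct = decision == truth["decision"]
    if correct:
        base = 0.65
    elif truth["is_harmful"] and decision == "escalate":
        base = 0.30
    elif not truth["is_harmful"] and decision == "flag":
        base = 0.20
    else:
        base = 0.0
    calibration = 0.1 * confidence if correct else -0.15 * confidence
    total = base + 0.20 * label_f1(action["labels"], truth["labels"]) + calibration
    return min(1.0, max(0.0, total))


# Hypothetical ground truth for one phishing email.
truth = {"decision": "reject", "is_harmful": True, "labels": ["spam", "phishing"]}
good = text_spam_reward({"decision": "reject", "confidence": 0.9, "labels": ["spam"]}, truth)
bad = text_spam_reward({"decision": "approve", "confidence": 0.9, "labels": []}, truth)
print(round(good, 4), round(bad, 4))
```

In the first call the label F1 is 2/3 (one of two true labels found, no false positives), so the reward is 0.65 + 0.20 × 2/3 + 0.10 × 0.9. In the second call the only contribution is the negative calibration term, which the clip raises to 0.0.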

---

## API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/reset` | Start new episode. Body: `{"task": "text_spam"}` |
| POST | `/step` | Submit action. Body: action JSON |
| GET | `/state` | Current episode state |
| POST | `/close` | End episode and clean up |
| GET | `/tasks` | List all available tasks |
| GET | `/health` | Health check |
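
A small standard-library sketch of calling these endpoints, assuming the server listens on `http://localhost:7860` as in the run instructions. The helper names `build_request` and `call` are hypothetical; `inference.py` in this repo uses `requests` instead.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(base_url: str, method: str, path: str,
                  body: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Create a JSON request for one of the endpoints in the table above."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url.rstrip("/") + path, data=data,
                                  headers=headers, method=method)


def call(req: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_request("http://localhost:7860/", "POST", "/reset", {"task": "text_spam"})
health_req = build_request("http://localhost:7860", "GET", "/health")
print(reset_req.get_method(), reset_req.full_url)
```

With the server running, `call(reset_req)` would return the first observation and `call(health_req)` the health payload.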

---

## Setup & Usage

### Requirements
- Docker
- Python 3.11+
- `openenv-core` (`pip install openenv-core`)

### Run with Docker

```bash
cd content-moderation-env

# Build
docker build -f server/Dockerfile -t content-moderation-env .

# Run
docker run -p 7860:7860 content-moderation-env
```

### Run locally

```bash
pip install -r server/requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```

### Validate

```bash
openenv validate   # from project root
```

---

## Inference Script

```bash
# API mode (HF inference endpoint)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export SERVER_URL="http://localhost:7860"
export TASK_NAME="text_spam"

python inference.py

# Local transformers pipeline mode
export USE_LOCAL_MODEL="true"
python inference.py
```

### Output format

```
[START] task=text_spam env=content_moderation_env model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} reward=0.85 done=false error=null
[STEP] step=2 action={"decision":"approve","confidence":0.8,"labels":[]} reward=0.75 done=false error=null
...
[END] success=true steps=5 score=0.610 rewards=0.85,0.75,0.00,0.80,0.65
```
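
A sketch of parsing a `[STEP]` line back into structured data, assuming the line layout shown above. The regex and the `parse_step` helper are illustrative and not part of this repo. The score on the `[END]` line is the sum of step rewards divided by the number of steps in the task.

```python
import json
import re
from typing import Any, Dict

# One [STEP] line: step number, compact action JSON, reward, done flag, error.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step(line: str) -> Dict[str, Any]:
    """Parse a single [STEP] log line into a dict; raise ValueError if it does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,
    }


line = ('[STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} '
        'reward=0.85 done=false error=null')
parsed = parse_step(line)
rewards = [0.85, 0.75, 0.00, 0.80, 0.65]
score = sum(rewards) / len(rewards)
print(parsed["action"]["decision"], f"{score:.3f}")
```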

---

## Run Tests

```bash
pip install pytest
pytest test/test.py -v
```

---

## Baseline Scores (Llama-3.1-8B-Instruct, temperature=0.2)

| Task | Score | Notes |
|---|---|---|
| `text_spam` | ~0.72 | Strong on obvious spam, weaker on phishing |
| `content_moderation` | ~0.58 | Good decision, weaker multi-label coverage |
| `deepfake_detection` | ~0.44 | Relies heavily on image description cues |

---

## Hugging Face Spaces Deployment

Create a Space with Docker SDK, push this repo, and set:
- `HF_TOKEN` (secret)
- `API_BASE_URL` (variable)
- `MODEL_NAME` (variable)

The Space URL becomes your `PING_URL` for the validation script.

---

## Project Structure

```
content-moderation-env/
├── server/
│   ├── __init__.py
│   ├── main.py           # FastAPI app + endpoints
│   ├── env.py            # OpenEnv environment (step/reset/state/close)
│   ├── models.py         # Pydantic action/observation models
│   ├── tasks.py          # Task datasets + ground truth
│   ├── graders.py        # Reward functions per task
│   ├── deepfake_model.py # HF deepfake detection pipeline
│   ├── openenv.yaml      # OpenEnv metadata spec
│   ├── requirements.txt
│   └── Dockerfile
├── test/
│   └── test.py           # pytest suite (20+ tests)
├── inference.py          # Baseline agent script
└── README.md
```
inference.py
ADDED
@@ -0,0 +1,238 @@
"""
Content Moderation Inference Script
Env vars: API_BASE_URL, MODEL_NAME, HF_TOKEN, SERVER_URL, TASK_NAME
USE_LOCAL_MODEL=true  → local transformers pipeline (Llama-3.1-8B-Instruct)
USE_LOCAL_MODEL=false → OpenAI-compatible HF API (default)
"""

import json
import os
import textwrap
from typing import Dict, Any, List, Optional

from dotenv import load_dotenv
import requests

# Load environment variables from .env file
load_dotenv()

API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")
HF_TOKEN = os.getenv("HF_TOKEN", "")
API_KEY = HF_TOKEN or os.getenv("API_KEY", "")
SERVER_URL = os.getenv("SERVER_URL", "http://localhost:7860").rstrip("/")
TASK_NAME = os.getenv("TASK_NAME", "text_spam")
BENCHMARK = "content_moderation_env"
MAX_STEPS = 10
USE_LOCAL = os.getenv("USE_LOCAL_MODEL", "false").lower() == "true"
SUCCESS_SCORE_THRESHOLD = 0.5

VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = [
    "spam", "scam", "phishing", "pharmaceutical_spam",
    "hate_speech", "violence", "harassment", "misinformation",
    "adult_content", "deepfake", "political_manipulation", "fraud",
]

SYSTEM_PROMPT = textwrap.dedent("""
You are an expert content moderation AI. For each piece of content, analyze it carefully and respond ONLY with a valid JSON object.

Required JSON fields:
- "decision": one of "approve", "reject", "escalate", "flag"
- "reason": brief explanation (max 2 sentences)
- "confidence": float between 0.0 and 1.0
- "labels": list of applicable labels from: spam, scam, phishing, pharmaceutical_spam, hate_speech, violence, harassment, misinformation, adult_content, deepfake, political_manipulation, fraud

Decision guidelines:
- approve: content is safe and legitimate
- reject: content clearly violates policy (spam, hate speech, deepfake, etc.)
- flag: content is borderline or needs human review
- escalate: uncertain, requires specialist review

Output ONLY the JSON object, no other text.
""").strip()


def log_start(task: str, env: str, model: str):
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]):
    err = error if error else "null"
    print(f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}", flush=True)


def log_end(success: bool, steps: int, score: float, rewards: List[float]):
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_prompt(obs: Dict[str, Any]) -> str:
    parts = [f"Content ID: {obs.get('content_id', 'unknown')}"]
    parts.append(f"Type: {obs.get('content_type', 'text')}")

    if obs.get("text"):
        parts.append(f"Text: {obs['text']}")
    if obs.get("image_description"):
        parts.append(f"Image analysis: {obs['image_description']}")
    if obs.get("detector_score") is not None:
        score = obs["detector_score"]
        parts.append(f"Deepfake detector score (higher = more likely fake): {score:.3f}")

    meta = obs.get("metadata", {})
    if meta:
        meta_str = ", ".join(f"{k}={v}" for k, v in meta.items())
        parts.append(f"Metadata: {meta_str}")

    parts.append(f"\nStep {obs.get('step_num', '?')} of {obs.get('total_steps', '?')}")
    return "\n".join(parts)


def _default_action() -> Dict:
    return {"decision": "escalate", "reason": "Unable to analyze content.", "confidence": 0.3, "labels": []}


def call_local_model(prompt: str) -> Dict:
    from transformers import pipeline
    # NOTE: the pipeline is rebuilt on every call; cache it when running many steps.
    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]
    output = pipe(messages, max_new_tokens=256, temperature=0.2, do_sample=True)
    text = output[0]["generated_text"]
    if isinstance(text, list):
        text = text[-1].get("content", "")
    return parse_llm_response(text)


def call_api_model(prompt: str) -> Dict:
    from openai import OpenAI

    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY or "hf_default")
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        max_tokens=256,
    )
    text = (completion.choices[0].message.content or "").strip()
    return parse_llm_response(text)


def parse_llm_response(text: str) -> Dict:
    try:
        start = text.find("{")
        end = text.rfind("}") + 1
        if start >= 0 and end > start:
            parsed = json.loads(text[start:end])
            decision = parsed.get("decision", "escalate")
            if decision not in VALID_DECISIONS:
                decision = "escalate"
            return {
                "decision": decision,
                "reason": str(parsed.get("reason", ""))[:200],
                "confidence": float(max(0.0, min(1.0, parsed.get("confidence", 0.5)))),
                "labels": [label for label in parsed.get("labels", []) if label in VALID_LABELS],
            }
    except Exception:
        pass
    return _default_action()


def get_decision(prompt: str) -> Dict:
    try:
        if USE_LOCAL:
            return call_local_model(prompt)
        return call_api_model(prompt)
    except Exception as e:
        print(f"[DEBUG] Model error: {e}", flush=True)
        return _default_action()


def server_reset(task: str) -> Optional[Dict]:
    try:
        r = requests.post(f"{SERVER_URL}/reset", json={"task": task}, timeout=30)
        r.raise_for_status()
        return r.json()
    except Exception as e:
        print(f"[DEBUG] reset error: {e}", flush=True)
        return None


def server_step(action: Dict) -> Optional[Dict]:
    try:
        r = requests.post(f"{SERVER_URL}/step", json=action, timeout=30)
        r.raise_for_status()
        return r.json()
    except Exception as e:
        print(f"[DEBUG] step error: {e}", flush=True)
        return None


def server_close():
    try:
        requests.post(f"{SERVER_URL}/close", timeout=10)
    except Exception:
        pass


def run_episode(task: str):
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False
    obs = None

    log_start(task=task, env=BENCHMARK, model=MODEL_NAME)

    try:
        reset_result = server_reset(task)
        if reset_result is None:
            # The finally block emits the single [END] line (avoids logging it twice).
            return

        obs = reset_result.get("observation", {})
        done = False

        for step in range(1, MAX_STEPS + 1):
            if done or obs is None:
                break

            prompt = build_prompt(obs)
            action = get_decision(prompt)
            action_str = json.dumps({k: v for k, v in action.items() if k != "reason"})

            result = server_step(action)
            if result is None:
                log_step(step, action_str, 0.0, True, "server_error")
                break

            reward = float(result.get("reward", 0.0))
            done = bool(result.get("done", False))
            error = result.get("info", {}).get("error")

            rewards.append(reward)
            steps_taken = step

            log_step(step, action_str, reward, done, error)

            obs = result.get("observation")

        total_steps_in_task = obs.get("total_steps", len(rewards)) if obs else len(rewards)
        max_possible = float(total_steps_in_task)
        score = sum(rewards) / max_possible if max_possible > 0 else 0.0
        score = min(max(score, 0.0), 1.0)
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        server_close()
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    run_episode(TASK_NAME)
server/Dockerfile
ADDED
@@ -0,0 +1,27 @@
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/app/.cache/huggingface \
    TRANSFORMERS_CACHE=/app/.cache/huggingface

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    libgl1 libglib2.0-0 curl \
    && rm -rf /var/lib/apt/lists/*

COPY server/requirements.txt .
RUN pip install --upgrade pip setuptools wheel
RUN pip install --no-cache-dir --no-build-isolation -r requirements.txt

COPY . .

RUN mkdir -p /app/.cache/huggingface

# Pre-download deepfake model to avoid runtime delays
RUN python -c "from transformers import pipeline; pipeline('image-classification', model='dima806/deepfake_vs_real_image_detection', device=-1)" 2>&1 || echo "Model download optional"

EXPOSE 7860

CMD ["uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "7860"]
server/__init__.py
ADDED
File without changes
server/deepfake_model.py
ADDED
@@ -0,0 +1,89 @@
import io
import logging
from typing import Optional

import numpy as np

logger = logging.getLogger(__name__)
_pipe = None


def _load_pipeline():
    global _pipe
    if _pipe is not None:
        return _pipe
    try:
        from transformers import pipeline
        _pipe = pipeline(
            "image-classification",
            model="dima806/deepfake_vs_real_image_detection",
            device=-1,
        )
        logger.info("Deepfake detection model loaded.")
    except Exception as e:
        logger.warning(f"Could not load deepfake model: {e}. Using heuristic fallback.")
        _pipe = None
    return _pipe


def _make_synthetic_image(is_fake: bool):
    from PIL import Image

    rng = np.random.default_rng(seed=1 if is_fake else 99)
    img = Image.new("RGB", (224, 224))
    pixels = img.load()

    for i in range(224):
        for j in range(224):
            if is_fake:
                r = int(128 + 60 * np.sin(i / 9.0) * np.cos(j / 9.0))
                g = int(128 + 60 * np.cos(i / 7.0) * np.sin(j / 11.0))
                b = int(128 + 40 * np.sin((i + j) / 14.0))
            else:
                base = int(80 + 100 * (i / 224))
                noise = int(rng.normal(0, 12))
                r = max(0, min(255, base + noise + 20))
                g = max(0, min(255, base + noise))
                b = max(0, min(255, base + noise - 15))
            pixels[j, i] = (
                max(0, min(255, r)),
                max(0, min(255, g)),
                max(0, min(255, b)),
            )
    return img


def score_deepfake(is_fake: bool) -> float:
    pipe = _load_pipeline()

    if pipe is None:
        return 0.78 if is_fake else 0.22

    try:
        img = _make_synthetic_image(is_fake)
        results = pipe(img)

        for r in results:
            label_lower = r["label"].lower()
            if any(kw in label_lower for kw in ("fake", "deepfake", "manipulat", "ai_gen", "synthetic")):
                return float(r["score"])

        top_label = results[0]["label"].lower()
        top_score = float(results[0]["score"])
        if any(kw in top_label for kw in ("real", "authentic", "genuine")):
            return 1.0 - top_score
        return top_score

    except Exception as e:
        logger.warning(f"Deepfake scoring error: {e}")
        return 0.75 if is_fake else 0.25


def precompute_detector_scores(items: list) -> list:
    enriched = []
    for item in items:
        is_fake = item.get("ground_truth", {}).get("is_deepfake", False)
        item = dict(item)
        item["detector_score"] = score_deepfake(is_fake)
        enriched.append(item)
    return enriched
server/env.py
ADDED
@@ -0,0 +1,121 @@
import threading
from typing import Dict, Any, Optional

from .models import ContentObservation, StepResult, ResetResult, EnvState, ModerationAction
from .tasks import TASKS
from .graders import GRADERS


class ContentModerationEnv:
    def __init__(self):
        self._lock = threading.Lock()
        self._s: Dict[str, Any] = {}
        self._clear()

    def _clear(self):
        self._s = {
            "task": None,
            "items": [],
            "idx": 0,
            "total": 0,
            "reward_sum": 0.0,
            "done": True,
            "history": [],
        }

    def _obs(self, item: Dict, idx: int, total: int) -> ContentObservation:
        return ContentObservation(
            content_id=item["content_id"],
            content_type=item["content_type"],
            text=item.get("text"),
            image_description=item.get("image_description"),
            detector_score=item.get("detector_score"),
            metadata=item.get("metadata", {}),
            step_num=idx,
            total_steps=total,
        )

    def reset(self, task: str = "text_spam") -> ResetResult:
        if task not in TASKS:
            raise ValueError(f"Unknown task '{task}'. Valid: {list(TASKS.keys())}")

        with self._lock:
            task_cfg = TASKS[task]
            items = list(task_cfg["items"])

            if task == "deepfake_detection":
                from .deepfake_model import precompute_detector_scores
                items = precompute_detector_scores(items)

            self._s = {
                "task": task,
                "items": items,
                "idx": 0,
                "total": len(items),
                "reward_sum": 0.0,
                "done": False,
                "history": [],
            }
            return ResetResult(observation=self._obs(items[0], 1, len(items)))

    def step(self, action: ModerationAction) -> StepResult:
        with self._lock:
            if self._s["done"]:
                return StepResult(
                    observation=None,
                    reward=0.0,
                    done=True,
                    info={"error": "Episode finished. Call /reset first."},
                )

            idx = self._s["idx"]
            item = self._s["items"][idx]
            task = self._s["task"]
            grader = GRADERS[task]
            action_d = action.model_dump()

            if task == "deepfake_detection":
                reward = grader(action_d, item["ground_truth"], item.get("detector_score"))
            else:
                reward = grader(action_d, item["ground_truth"])

            self._s["reward_sum"] += reward
            self._s["idx"] += 1
            self._s["history"].append({
                "step": idx + 1,
                "content_id": item["content_id"],
                "action": action_d,
                "reward": round(reward, 4),
                "ground_truth": item["ground_truth"],
            })

            new_idx = self._s["idx"]
            done = new_idx >= self._s["total"]
            self._s["done"] = done

            next_obs: Optional[ContentObservation] = None
            if not done:
                next_item = self._s["items"][new_idx]
                next_obs = self._obs(next_item, new_idx + 1, self._s["total"])

            return StepResult(
                observation=next_obs,
                reward=round(reward, 4),
                done=done,
                info={"content_id": item["content_id"], "step": idx + 1},
            )

    def state(self) -> EnvState:
        with self._lock:
            return EnvState(
                task=self._s["task"] or "none",
                step_num=self._s["idx"],
                total_steps=self._s["total"],
                cumulative_reward=round(self._s["reward_sum"], 4),
                done=self._s["done"],
                history=list(self._s["history"]),
            )

    def close(self):
        with self._lock:
            self._clear()
server/graders.py
ADDED
|
@@ -0,0 +1,94 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
from typing import Dict, Any, List, Optional


def _label_f1(predicted: List[str], true: List[str]) -> float:
    """Set-based F1 between predicted and ground-truth labels."""
    pred_set, true_set = set(predicted), set(true)
    if not true_set and not pred_set:
        return 1.0
    if not true_set:
        # Labels were predicted for content that has none: fixed partial credit.
        return 0.5
    if not pred_set:
        return 0.0
    tp = len(pred_set & true_set)
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def _calibration_bonus(confidence: float, correct: bool) -> float:
    """Reward confidence on correct decisions; penalise it more on wrong ones."""
    if correct:
        return 0.1 * confidence
    return -0.15 * confidence


def grade_text_spam(action: Dict, ground_truth: Dict) -> float:
    """Score a spam decision: decision (0.65) + labels (0.20) + calibration, clamped to [0, 1]."""
    decision = action.get("decision", "")
    confidence = float(action.get("confidence", 0.5))
    correct = decision == ground_truth["decision"]

    if correct:
        base = 0.65
    elif ground_truth["is_harmful"] and decision == "escalate":
        base = 0.30
    elif not ground_truth["is_harmful"] and decision == "flag":
        base = 0.20
    else:
        base = 0.0

    label_score = 0.20 * _label_f1(action.get("labels", []), ground_truth["labels"])
    return min(1.0, max(0.0, base + label_score + _calibration_bonus(confidence, correct)))


def grade_content_moderation(action: Dict, ground_truth: Dict) -> float:
    """Score a moderation decision: decision (0.50) + labels (0.35) + calibration, clamped to [0, 1]."""
    decision = action.get("decision", "")
    confidence = float(action.get("confidence", 0.5))
    correct = decision == ground_truth["decision"]

    if correct:
        decision_score = 0.50
    elif ground_truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
        decision_score = 0.25
    elif not ground_truth["is_harmful"] and decision == "escalate":
        decision_score = 0.15
    else:
        decision_score = 0.0

    label_score = 0.35 * _label_f1(action.get("labels", []), ground_truth["labels"])
    return min(1.0, max(0.0, decision_score + label_score + _calibration_bonus(confidence, correct)))


def grade_deepfake(action: Dict, ground_truth: Dict, detector_score: Optional[float] = None) -> float:
    """Score a deepfake decision: decision (0.40) + detection (0.30, plus 0.10 if the detector agrees) + labels (0.20) + calibration."""
    decision = action.get("decision", "")
    confidence = float(action.get("confidence", 0.5))
    labels = action.get("labels", [])

    is_deepfake = ground_truth.get("is_deepfake", False)
    correct_decision = decision == ground_truth["decision"]

    detected_deepfake = "deepfake" in [label.lower() for label in labels]
    detection_correct = detected_deepfake == is_deepfake

    if correct_decision:
        decision_score = 0.40
    elif ground_truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
        decision_score = 0.20
    else:
        decision_score = 0.0

    detection_score = 0.30 if detection_correct else 0.0

    if detector_score is not None:
        model_aligned = (detector_score > 0.5) == is_deepfake
        detection_score += 0.10 if model_aligned else 0.0

    label_score = 0.20 * _label_f1(labels, ground_truth["labels"])
    return min(1.0, max(0.0, decision_score + detection_score + label_score + _calibration_bonus(confidence, correct_decision)))


GRADERS = {
    "text_spam": grade_text_spam,
    "content_moderation": grade_content_moderation,
    "deepfake_detection": grade_deepfake,
}
|
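To make the weighting in `grade_content_moderation` concrete, the sketch below re-implements the label F1 and calibration terms in isolation and scores one partially labelled action against a fully labelled one. The names `label_f1` and `score_content_moderation` are illustrative stand-ins rather than repository code; the weights are copied from server/graders.py.

```python
from typing import Dict, List


def label_f1(predicted: List[str], true: List[str]) -> float:
    """Set-based F1 between predicted and true labels (mirrors _label_f1)."""
    pred_set, true_set = set(predicted), set(true)
    if not true_set and not pred_set:
        return 1.0
    if not true_set:
        return 0.5
    if not pred_set:
        return 0.0
    tp = len(pred_set & true_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


def score_content_moderation(decision: str, labels: List[str],
                             confidence: float, truth: Dict) -> float:
    """Hypothetical helper: decision + label + calibration terms, clamped to [0, 1]."""
    correct = decision == truth["decision"]
    if correct:
        decision_score = 0.50
    elif truth["is_harmful"] and decision in ("escalate", "flag", "reject"):
        decision_score = 0.25
    elif not truth["is_harmful"] and decision == "escalate":
        decision_score = 0.15
    else:
        decision_score = 0.0
    label_score = 0.35 * label_f1(labels, truth["labels"])
    calibration = 0.1 * confidence if correct else -0.15 * confidence
    return min(1.0, max(0.0, decision_score + label_score + calibration))


truth = {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True}
partial = score_content_moderation("flag", ["misinformation"], 0.7, truth)
full = score_content_moderation("flag", ["misinformation", "spam"], 0.7, truth)
wrong = score_content_moderation("approve", [], 0.9, truth)
print(round(partial, 4), round(full, 4), round(wrong, 4))
```

With one of two labels found, precision is 1 and recall is 0.5, so F1 is 2/3 and the label term contributes about 0.233 instead of 0.35. A confident wrong decision with no labels sums to a negative value and is clamped to 0.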
server/main.py
ADDED
|
@@ -0,0 +1,61 @@
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse, RedirectResponse

from .models import ModerationAction, StepResult, ResetResult, EnvState, ResetRequest
from .env import ContentModerationEnv
from .tasks import TASKS

app = FastAPI(title="Content Moderation OpenEnv", version="1.0.0")
_env = ContentModerationEnv()


@app.get("/")
async def root():
    return RedirectResponse(url="/docs")


@app.post("/reset", response_model=ResetResult)
async def reset(request: Request):
    try:
        body = await request.json()
    except Exception:
        body = {}
    task = (body or {}).get("task", "text_spam")
    try:
        return _env.reset(task=task)
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))


@app.post("/step", response_model=StepResult)
def step(action: ModerationAction):
    return _env.step(action)


@app.get("/state", response_model=EnvState)
def state():
    return _env.state()


@app.post("/close")
def close():
    _env.close()
    return {"status": "closed"}


@app.get("/tasks")
def list_tasks():
    return {
        name: {
            "description": t["description"],
            "difficulty": t["difficulty"],
            "num_items": len(t["items"]),
            "content_type": t["content_type"],
        }
        for name, t in TASKS.items()
    }


@app.get("/health")
def health():
    return {"status": "ok"}
|
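The `/reset` and `/step` endpoints in server/main.py can be driven by a small client loop. The sketch below is one possible shape for such a client, using only the standard library. The names `http_json`, `run_episode` and `always_escalate`, the injectable `transport` argument, and the base URL in the usage comment are all assumptions made for illustration, not part of the repository.

```python
import json
from typing import Any, Callable, Dict, Optional
from urllib import request as urlrequest

Transport = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def http_json(method: str, url: str,
              payload: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Send a JSON request with urllib and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urlrequest.Request(url, data=data, method=method,
                             headers={"Content-Type": "application/json"})
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(base_url: str, task: str,
                policy: Callable[[Dict[str, Any]], Dict[str, Any]],
                transport: Transport = http_json) -> float:
    """Reset the environment, then step until `done`; return the summed reward."""
    obs = transport("POST", f"{base_url}/reset", {"task": task})["observation"]
    total = 0.0
    while obs is not None:
        result = transport("POST", f"{base_url}/step", policy(obs))
        total += result["reward"]
        if result["done"]:
            break
        obs = result["observation"]
    return total


def always_escalate(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Trivial placeholder policy: escalate every item with neutral confidence."""
    return {"decision": "escalate", "reason": "baseline", "confidence": 0.5, "labels": []}


# Usage, assuming a server is listening locally on port 8000 (an assumption):
# total = run_episode("http://localhost:8000", "text_spam", always_escalate)
```

Passing the transport as a parameter keeps the loop testable without a running server: a fake callable that returns canned `StepResult`-shaped dictionaries can stand in for `http_json`.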
server/models.py
ADDED
|
@@ -0,0 +1,44 @@
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any, List


class ModerationAction(BaseModel):
    decision: str
    reason: str
    confidence: float = Field(ge=0.0, le=1.0)
    labels: List[str] = []


class ContentObservation(BaseModel):
    content_id: str
    content_type: str
    text: Optional[str] = None
    image_description: Optional[str] = None
    detector_score: Optional[float] = None
    metadata: Dict[str, Any] = {}
    step_num: int
    total_steps: int


class StepResult(BaseModel):
    observation: Optional[ContentObservation] = None
    reward: float
    done: bool
    info: Dict[str, Any] = {}


class ResetResult(BaseModel):
    observation: ContentObservation


class EnvState(BaseModel):
    task: str
    step_num: int
    total_steps: int
    cumulative_reward: float
    done: bool
    history: List[Dict[str, Any]] = []


class ResetRequest(BaseModel):
    task: Optional[str] = "text_spam"
|
server/openenv.yaml
ADDED
|
@@ -0,0 +1,89 @@
name: content-moderation-env
version: "1.0.0"
description: >
  AI-powered content moderation environment. Agents triage text, social
  posts, and multimodal content (including deepfake detection) across
  three difficulty levels using the standard OpenEnv step/reset/state API.
author: openenv-participant
license: MIT

tasks:
  - id: text_spam
    difficulty: easy
    description: Classify email/message content as spam or legitimate
    content_type: text
    num_items: 5
    score_range: [0.0, 1.0]

  - id: content_moderation
    difficulty: medium
    description: Multi-label social media content moderation
    content_type: text
    num_items: 5
    score_range: [0.0, 1.0]

  - id: deepfake_detection
    difficulty: hard
    description: Detect AI-manipulated/deepfake media and make moderation decisions
    content_type: multimodal
    num_items: 5
    score_range: [0.0, 1.0]

action_space:
  type: object
  fields:
    decision:
      type: string
      enum: [approve, reject, escalate, flag]
    reason:
      type: string
    confidence:
      type: float
      range: [0.0, 1.0]
    labels:
      type: array
      items: string
      valid_values:
        - spam
        - scam
        - phishing
        - pharmaceutical_spam
        - hate_speech
        - violence
        - harassment
        - misinformation
        - adult_content
        - deepfake
        - political_manipulation
        - fraud

observation_space:
  type: object
  fields:
    content_id: string
    content_type: string
    text: optional string
    image_description: optional string
    detector_score: optional float
    metadata: object
    step_num: integer
    total_steps: integer

endpoints:
  reset: POST /reset
  step: POST /step
  state: GET /state
  close: POST /close
  tasks: GET /tasks
  health: GET /health

docker:
  context: server
  dockerfile: server/Dockerfile

huggingface:
  space_sdk: docker
  tags:
    - openenv
    - content-moderation
    - deepfake-detection
|
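The `action_space` in openenv.yaml lists an enum of decisions and a set of valid labels, while `ModerationAction` in server/models.py types `decision` as a plain `str` and `labels` as `List[str]`, so only the confidence range is enforced by the model itself. A client that wants to catch out-of-vocabulary values before calling `/step` could run a check like the one below. The function name `validate_action` and its return shape are hypothetical; the allowed values are copied from the YAML.

```python
from typing import Any, Dict, List

VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = {
    "spam", "scam", "phishing", "pharmaceutical_spam", "hate_speech",
    "violence", "harassment", "misinformation", "adult_content",
    "deepfake", "political_manipulation", "fraud",
}


def validate_action(action: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the action conforms."""
    problems: List[str] = []

    if action.get("decision") not in VALID_DECISIONS:
        problems.append(f"decision must be one of {sorted(VALID_DECISIONS)}")

    if not isinstance(action.get("reason"), str):
        problems.append("reason must be a string")

    confidence = action.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")

    labels = action.get("labels", [])
    if not isinstance(labels, list):
        problems.append("labels must be a list of strings")
    else:
        unknown = [x for x in labels if not isinstance(x, str) or x not in VALID_LABELS]
        if unknown:
            problems.append(f"unknown labels: {unknown}")

    return problems


ok = validate_action(
    {"decision": "flag", "reason": "diet scam", "confidence": 0.7, "labels": ["spam"]}
)
bad = validate_action(
    {"decision": "remove", "reason": "n/a", "confidence": 1.5, "labels": ["nsfw"]}
)
print(ok, len(bad))
```

Returning every problem at once, rather than raising on the first, is a design choice that suits logging agent output; a stricter client might raise instead.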
server/requirements.txt
ADDED
|
@@ -0,0 +1,12 @@
fastapi==0.111.0
uvicorn[standard]==0.29.0
pydantic==2.7.1
transformers==4.41.2
torch==2.3.0
torchvision==0.18.0
Pillow==10.3.0
numpy==1.26.4
accelerate==0.30.0
openai>=1.40.0
requests==2.31.0
python-dotenv==1.0.0
|
server/tasks.py
ADDED
|
@@ -0,0 +1,192 @@
TASKS = {
    "text_spam": {
        "description": "Classify email/message content as spam or legitimate",
        "difficulty": "easy",
        "content_type": "text",
        "items": [
            {
                "content_id": "ts_001",
                "content_type": "text",
                "text": "CONGRATULATIONS! You've won $1,000,000! Click here NOW to claim your prize!!! Limited time offer!!!",
                "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
                "ground_truth": {"decision": "reject", "labels": ["spam", "scam"], "is_harmful": True},
            },
            {
                "content_id": "ts_002",
                "content_type": "text",
                "text": "Hi Sarah, confirming our meeting tomorrow at 3pm in the main conference room. Let me know if the time still works.",
                "metadata": {"source": "email", "sender_reputation": 0.92, "link_count": 0},
                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
            },
            {
                "content_id": "ts_003",
                "content_type": "text",
                "text": "FREE MEDS NO PRESCRIPTION NEEDED BUY NOW CHEAPEST PRICES CLICK LINK BELOW",
                "metadata": {"source": "email", "sender_reputation": 0.02, "link_count": 5},
                "ground_truth": {"decision": "reject", "labels": ["spam", "pharmaceutical_spam"], "is_harmful": True},
            },
            {
                "content_id": "ts_004",
                "content_type": "text",
                "text": "Your Amazon order #112-3456789 has shipped and will arrive by Thursday. Track at amazon.com/orders.",
                "metadata": {"source": "email", "sender_reputation": 0.98, "link_count": 1},
                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
            },
            {
                "content_id": "ts_005",
                "content_type": "text",
                "text": "URGENT: Your account will be SUSPENDED in 24hrs unless you verify immediately. Click: bit.ly/verify-acct-now",
                "metadata": {"source": "email", "sender_reputation": 0.08, "link_count": 2},
                "ground_truth": {"decision": "reject", "labels": ["spam", "phishing"], "is_harmful": True},
            },
        ],
    },

    "content_moderation": {
        "description": "Multi-label moderation for social media posts",
        "difficulty": "medium",
        "content_type": "text",
        "items": [
            {
                "content_id": "cm_001",
                "content_type": "text",
                "text": "Gorgeous sunset from my hike today! The trail to the summit was tough but totally worth the view.",
                "metadata": {"platform": "social_media", "report_count": 0, "user_history_violations": 0},
                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
            },
            {
                "content_id": "cm_002",
                "content_type": "text",
                "text": "I will find that [slur] and make them regret it. They deserve everything coming to them.",
                "metadata": {"platform": "social_media", "report_count": 18, "user_history_violations": 3},
                "ground_truth": {"decision": "reject", "labels": ["hate_speech", "violence", "harassment"], "is_harmful": True},
            },
            {
                "content_id": "cm_003",
                "content_type": "text",
                "text": "Lose 30lbs in 7 days with this one weird trick! Doctors HATE this! DM me to buy, only $89.",
                "metadata": {"platform": "social_media", "report_count": 11, "user_history_violations": 1},
                "ground_truth": {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True},
            },
            {
                "content_id": "cm_004",
                "content_type": "text",
                "text": "The election was stolen. The globalist elites are replacing us and controlling everything. Rise up!",
                "metadata": {"platform": "social_media", "report_count": 27, "user_history_violations": 2},
                "ground_truth": {"decision": "flag", "labels": ["misinformation", "hate_speech"], "is_harmful": True},
            },
            {
                "content_id": "cm_005",
                "content_type": "text",
                "text": "Book club reminder: we meet Thursday 7pm to discuss chapters 12-15. Bring snacks if you can!",
                "metadata": {"platform": "social_media", "report_count": 0, "user_history_violations": 0},
                "ground_truth": {"decision": "approve", "labels": [], "is_harmful": False},
            },
        ],
    },

    "deepfake_detection": {
        "description": "Detect AI-manipulated media and make moderation decisions",
        "difficulty": "hard",
        "content_type": "multimodal",
        "items": [
            {
                "content_id": "df_001",
                "content_type": "multimodal",
                "text": "CEO announces surprise acquisition deal — photo circulating widely on financial forums",
                "image_description": "Corporate executive at podium. Facial proportions slightly asymmetric. Background blur shows unnatural transitions. Hair boundary shows pixel artifacts. Lighting on face inconsistent with background.",
                "metadata": {
                    "source": "social_media",
                    "claimed_subject": "public_executive",
                    "virality_score": 0.92,
                    "exif_stripped": True,
                    "upload_device": "unknown",
                },
                "ground_truth": {
                    "decision": "reject",
                    "labels": ["deepfake", "misinformation", "fraud"],
                    "is_deepfake": True,
                    "is_harmful": True,
                },
            },
            {
                "content_id": "df_002",
                "content_type": "multimodal",
                "text": "Family reunion photo shared to personal profile",
                "image_description": "Outdoor gathering, natural daylight. Shadows consistent with single light source. Facial textures normal and varied. Background in sharp focus with depth-of-field matching focal length.",
                "metadata": {
                    "source": "personal_profile",
                    "claimed_subject": "private_individual",
                    "virality_score": 0.03,
                    "exif_stripped": False,
                    "upload_device": "iPhone_15",
                },
                "ground_truth": {
                    "decision": "approve",
                    "labels": [],
                    "is_deepfake": False,
                    "is_harmful": False,
                },
            },
            {
                "content_id": "df_003",
                "content_type": "multimodal",
                "text": "Politician caught making controversial statement — clip going viral across platforms",
                "image_description": "Political figure at microphone. Eye blink rate irregular. Lip sync offset approximately 2 frames. Neck-to-face skin tone boundary visible. Ear shape subtly non-anatomical.",
                "metadata": {
                    "source": "video_platform",
                    "claimed_subject": "elected_official",
                    "virality_score": 0.88,
                    "exif_stripped": True,
                    "upload_device": "unknown",
                },
                "ground_truth": {
                    "decision": "reject",
                    "labels": ["deepfake", "political_manipulation", "misinformation"],
                    "is_deepfake": True,
                    "is_harmful": True,
                },
            },
            {
                "content_id": "df_004",
                "content_type": "multimodal",
                "text": "Famous actor endorsing cryptocurrency — advertisement spreading across social media",
                "image_description": "Known celebrity in studio setting. Jaw boundary shows feathering artifacts. Teeth appear uniform and unnaturally perfect. Reflection in eyes does not match claimed studio environment.",
                "metadata": {
                    "source": "paid_advertisement",
                    "claimed_subject": "celebrity",
                    "virality_score": 0.75,
                    "exif_stripped": True,
                    "upload_device": "unknown",
                },
                "ground_truth": {
                    "decision": "reject",
                    "labels": ["deepfake", "fraud", "scam"],
                    "is_deepfake": True,
                    "is_harmful": True,
                },
            },
            {
                "content_id": "df_005",
                "content_type": "multimodal",
                "text": "News anchor presenting evening broadcast — clip shared from verified channel",
                "image_description": "Professional broadcast studio. Consistent studio lighting. Natural eye movement and blink rate. Source watermark present. Metadata intact from broadcast system.",
                "metadata": {
                    "source": "verified_broadcaster",
                    "claimed_subject": "journalist",
                    "virality_score": 0.35,
                    "exif_stripped": False,
                    "upload_device": "broadcast_encoder",
                },
                "ground_truth": {
                    "decision": "approve",
                    "labels": [],
                    "is_deepfake": False,
                    "is_harmful": False,
                },
            },
        ],
    },
}

TASK_NAMES = list(TASKS.keys())
|
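The `text_spam` items carry a `sender_reputation` value in their metadata, and in the five items defined in server/tasks.py the harmful and benign examples sit far apart on that field. The sketch below checks how far a one-line rule gets on those items. The function `reputation_policy` and the 0.5 threshold are assumptions chosen for illustration; the metadata and expected decisions are copied from the task data.

```python
from typing import Any, Dict, List, Tuple

# (content_id, metadata subset, ground-truth decision) copied from text_spam.
SPAM_ITEMS: List[Tuple[str, Dict[str, Any], str]] = [
    ("ts_001", {"sender_reputation": 0.05, "link_count": 3}, "reject"),
    ("ts_002", {"sender_reputation": 0.92, "link_count": 0}, "approve"),
    ("ts_003", {"sender_reputation": 0.02, "link_count": 5}, "reject"),
    ("ts_004", {"sender_reputation": 0.98, "link_count": 1}, "approve"),
    ("ts_005", {"sender_reputation": 0.08, "link_count": 2}, "reject"),
]


def reputation_policy(metadata: Dict[str, Any], threshold: float = 0.5) -> str:
    """Hypothetical baseline: reject anything from a low-reputation sender."""
    return "reject" if metadata["sender_reputation"] < threshold else "approve"


hits = sum(reputation_policy(meta) == decision for _, meta, decision in SPAM_ITEMS)
accuracy = hits / len(SPAM_ITEMS)
print(accuracy)
```

A rule like this matches every decision in the easy task, which fits its `easy` difficulty label. It earns no label credit on the harmful items, though, because the grader scores predicted labels separately from the decision.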
test/test.py
ADDED
|
@@ -0,0 +1,243 @@
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

import pytest
from server.models import ModerationAction, ContentObservation, StepResult, ResetResult, EnvState
from server.env import ContentModerationEnv
from server.graders import grade_text_spam, grade_content_moderation, grade_deepfake, GRADERS
from server.tasks import TASKS, TASK_NAMES


def make_action(decision="approve", reason="test", confidence=0.8, labels=None):
    return ModerationAction(decision=decision, reason=reason, confidence=confidence, labels=labels or [])


def make_action_dict(decision="approve", reason="test", confidence=0.8, labels=None):
    return {"decision": decision, "reason": reason, "confidence": confidence, "labels": labels or []}


# --- Task data ---

def test_all_tasks_present():
    assert set(TASK_NAMES) == {"text_spam", "content_moderation", "deepfake_detection"}


def test_each_task_has_five_items():
    for name, task in TASKS.items():
        assert len(task["items"]) == 5, f"{name} should have 5 items"


def test_ground_truth_keys():
    for name, task in TASKS.items():
        for item in task["items"]:
            gt = item["ground_truth"]
            assert "decision" in gt
            assert "labels" in gt
            assert gt["decision"] in ("approve", "reject", "flag", "escalate")


def test_deepfake_items_have_is_deepfake():
    for item in TASKS["deepfake_detection"]["items"]:
        assert "is_deepfake" in item["ground_truth"]


# --- Graders ---

def test_grade_spam_correct_reject():
    gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
    r = grade_text_spam(make_action_dict("reject", confidence=0.9, labels=["spam"]), gt)
    assert 0.8 <= r <= 1.0


def test_grade_spam_wrong_decision_penalised():
    gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
    r = grade_text_spam(make_action_dict("approve", confidence=0.9), gt)
    assert r < 0.3


def test_grade_spam_escalate_partial_credit():
    gt = {"decision": "reject", "labels": ["spam"], "is_harmful": True}
    r = grade_text_spam(make_action_dict("escalate", confidence=0.5), gt)
    assert 0.15 <= r <= 0.45


def test_grade_spam_correct_approve():
    gt = {"decision": "approve", "labels": [], "is_harmful": False}
    r = grade_text_spam(make_action_dict("approve", confidence=0.85), gt)
    assert r >= 0.6


def test_grade_content_mod_correct_with_labels():
    gt = {"decision": "reject", "labels": ["hate_speech", "violence"], "is_harmful": True}
    r = grade_content_moderation(
        make_action_dict("reject", confidence=0.9, labels=["hate_speech", "violence"]), gt
    )
    assert r >= 0.7


def test_grade_content_mod_label_partial():
    gt = {"decision": "flag", "labels": ["misinformation", "spam"], "is_harmful": True}
    r_partial = grade_content_moderation(
        make_action_dict("flag", confidence=0.7, labels=["misinformation"]), gt
    )
    r_full = grade_content_moderation(
        make_action_dict("flag", confidence=0.7, labels=["misinformation", "spam"]), gt
    )
    assert r_partial < r_full


def test_grade_deepfake_detected_and_correct():
    gt = {"decision": "reject", "labels": ["deepfake", "fraud"], "is_deepfake": True, "is_harmful": True}
    r = grade_deepfake(
        make_action_dict("reject", confidence=0.9, labels=["deepfake", "fraud"]), gt, detector_score=0.85
    )
    assert r >= 0.7


def test_grade_deepfake_real_correct():
    gt = {"decision": "approve", "labels": [], "is_deepfake": False, "is_harmful": False}
    r = grade_deepfake(make_action_dict("approve", confidence=0.8), gt, detector_score=0.1)
    assert r >= 0.5


def test_grade_deepfake_missed_deepfake():
    gt = {"decision": "reject", "labels": ["deepfake"], "is_deepfake": True, "is_harmful": True}
    r_miss = grade_deepfake(make_action_dict("approve", confidence=0.8), gt)
    r_detect = grade_deepfake(make_action_dict("reject", confidence=0.8, labels=["deepfake"]), gt)
    assert r_miss < r_detect


def test_all_rewards_in_range():
    for task_name in TASK_NAMES:
        task = TASKS[task_name]
        grader = GRADERS[task_name]
        for item in task["items"]:
            for decision in ("approve", "reject", "flag", "escalate"):
                action = make_action_dict(decision, confidence=0.5, labels=["spam"])
                if task_name == "deepfake_detection":
                    r = grader(action, item["ground_truth"], 0.5)
                else:
                    r = grader(action, item["ground_truth"])
                assert 0.0 <= r <= 1.0, f"{task_name} reward out of range: {r}"


# --- Environment ---

def test_reset_returns_first_observation():
    env = ContentModerationEnv()
    result = env.reset("text_spam")
    assert isinstance(result, ResetResult)
    obs = result.observation
    assert obs.step_num == 1
    assert obs.total_steps == 5
    assert obs.content_id == "ts_001"


def test_step_advances_state():
    env = ContentModerationEnv()
    env.reset("text_spam")
    action = make_action("reject")
    result = env.step(action)
    assert isinstance(result, StepResult)
    assert 0.0 <= result.reward <= 1.0
    assert result.observation is not None
    assert result.observation.step_num == 2


def test_episode_ends_after_all_items():
    env = ContentModerationEnv()
    env.reset("text_spam")
    done = False
    steps = 0
    while not done:
        r = env.step(make_action("escalate"))
        done = r.done
        steps += 1
    assert steps == 5
    assert r.observation is None


def test_step_after_done_returns_error():
    env = ContentModerationEnv()
    env.reset("text_spam")
    for _ in range(5):
        env.step(make_action("approve"))
    result = env.step(make_action("approve"))
    assert result.done is True
    assert "error" in result.info


def test_state_tracks_cumulative_reward():
    env = ContentModerationEnv()
    env.reset("content_moderation")
    env.step(make_action("approve", confidence=0.9))
    env.step(make_action("reject", confidence=0.9, labels=["hate_speech"]))
    st = env.state()
    assert isinstance(st, EnvState)
    assert st.step_num == 2
    assert st.cumulative_reward >= 0.0
    assert len(st.history) == 2


def test_reset_different_tasks():
    env = ContentModerationEnv()
    for task in TASK_NAMES:
        if task == "deepfake_detection":
            continue
        r = env.reset(task)
        assert r.observation.total_steps == 5


def test_invalid_task_raises():
    env = ContentModerationEnv()
    with pytest.raises(ValueError):
        env.reset("nonexistent_task")


def test_close_resets_env():
    env = ContentModerationEnv()
    env.reset("text_spam")
    env.step(make_action("approve"))
    env.close()
    st = env.state()
    assert st.task == "none"
    assert st.done is True


def test_content_moderation_full_run():
    env = ContentModerationEnv()
    env.reset("content_moderation")
    actions = [
        make_action("approve"),
        make_action("reject", labels=["hate_speech", "violence"]),
        make_action("flag", labels=["misinformation"]),
        make_action("flag", labels=["misinformation", "hate_speech"]),
        make_action("approve"),
    ]
    total_reward = 0.0
    for action in actions:
        result = env.step(action)
        total_reward += result.reward
    assert result.done is True
    assert total_reward >= 0.0
    st = env.state()
    assert abs(st.cumulative_reward - total_reward) < 0.01


def test_observation_fields_populated():
    env = ContentModerationEnv()
    r = env.reset("content_moderation")
    obs = r.observation
    assert obs.content_id is not None
    assert obs.content_type == "text"
    assert obs.text is not None
    assert obs.metadata is not None


def test_deepfake_obs_has_image_description():
    env = ContentModerationEnv()
    r = env.reset("deepfake_detection")
    obs = r.observation
    assert obs.image_description is not None
    assert obs.content_type == "multimodal"
|