title: Content Moderation OpenEnv
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
Content Moderation OpenEnv
An AI content moderation environment built to the OpenEnv specification. Agents triage realistic content (spam emails, harmful social media posts, and AI-generated deepfakes) using a standard step() / reset() / state() API.
Table of Contents
- Environment Description & Motivation
- Task Descriptions
- Observation Space
- Action Space
- Reward Functions
- Baseline Scores
- Setup & Usage
- Running the Inference Script
- API Reference
- Project Structure
- Environment Variables
- Running Tests
- Troubleshooting
- Citation
- Acknowledgements
Environment Description & Motivation
Content moderation is a high-stakes, high-volume real-world task. Human moderators review millions of items daily across platforms and inboxes. This environment simulates a realistic moderation pipeline across three difficulty levels, enabling AI agents to learn decision-making strategies under resource constraints.
Key Challenges:
- Multi-label classification with imbalanced datasets
- Confidence calibration under uncertainty
- Real-world content variability (spam, deepfakes, policy violations)
- Escalation vs. immediate action tradeoffs
| Task | Difficulty | Content Type | Metrics | Description |
|---|---|---|---|---|
| text_spam | Easy | Email / SMS | Binary classification + confidence | Spam vs. legitimate email filtering with sender reputation signals |
| content_moderation | Medium | Social media posts | Multi-label (up to 11 labels) | Detect policy violations (hate speech, harassment, violence, etc.) |
| deepfake_detection | Hard | Multimodal (image + text) | Detection accuracy + labels | Identify AI-manipulated media using vision transformer + metadata |
Models Used:
- Deepfake Detection: dima806/deepfake_vs_real_image_detection (ViT, ~300MB)
- Language Model: meta-llama/Llama-3.1-8B-Instruct (8B params, via transformers or HF API)
- Infrastructure: FastAPI + Uvicorn, Docker containerization
Task Descriptions
Task 1: Text Spam Classification (Easy)
Objective: Binary classification of emails/SMS as spam or legitimate.
Dataset:
- 50 items (30 spam, 20 legitimate)
- Features: text content, sender reputation score, link count, source
- Ground truth: decision + labels
Example:
{
  "content_id": "ts_001",
  "content_type": "text",
  "text": "CONGRATULATIONS! You've won $1,000,000! Click here NOW to claim your prize!!!",
  "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
  "ground_truth": {
    "decision": "reject",
    "labels": ["spam", "scam"],
    "is_harmful": true
  }
}
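To make these fields concrete, the sketch below turns a text_spam observation into an action shaped like the one described under Action Space. The function name `heuristic_spam_action` and both thresholds (a reputation of 0.3 and two links) are hypothetical illustration values. They are not taken from the environment or its grader.

```python
from typing import Any


def heuristic_spam_action(observation: dict[str, Any]) -> dict[str, Any]:
    """Map a text_spam observation to a ModerationAction-shaped dict.

    The thresholds are hypothetical and only show how the metadata
    signals could be combined; they are not taken from the grader.
    """
    metadata = observation.get("metadata", {})
    reputation = float(metadata.get("sender_reputation", 0.5))
    link_count = int(metadata.get("link_count", 0))

    if reputation < 0.3 and link_count >= 2:
        # Both signals agree: treat as a clear violation.
        return {
            "decision": "reject",
            "reason": "Low sender reputation combined with multiple links",
            "confidence": round(1.0 - reputation, 2),
            "labels": ["spam"],
        }
    if reputation < 0.3 or link_count >= 2:
        # Only one signal fired: borderline, send for review.
        return {
            "decision": "flag",
            "reason": "One weak spam signal present",
            "confidence": 0.5,
            "labels": ["spam"],
        }
    return {
        "decision": "approve",
        "reason": "No spam signals in metadata",
        "confidence": round(reputation, 2),
        "labels": [],
    }


example = {
    "content_id": "ts_001",
    "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
}
print(heuristic_spam_action(example)["decision"])  # reject
```

A rule like this ignores the message text entirely, so it is only useful as a sanity check against which an LLM agent can be compared.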
Task 2: Content Moderation (Medium)
Objective: Multi-label classification of social media posts for policy violations.
Dataset:
- 40 items across diverse platforms
- Labels: spam, scam, phishing, hate_speech, violence, harassment, misinformation, adult_content, deepfake, political_manipulation, fraud
- Features: post text, engagement metrics, user reputation, report count
Violation Categories:
| Category | Definition | Examples |
|---|---|---|
| Hate Speech | Dehumanizing content targeting identity | Slurs, discrimination, incitement |
| Violence | Threats or glorification of violence | Physical harm, weapon promotion |
| Harassment | Coordinated or severe personal attacks | Doxxing, targeted campaigns |
| Misinformation | False claims with societal impact | Election fraud claims, health hoaxes |
Task 3: Deepfake Detection (Hard)
Objective: Detect AI-manipulated media and classify content appropriately.
Dataset:
- 30 items (multimodal: images + descriptions)
- Deepfake detection model outputs raw confidence scores (0-1)
- Features: image description, detector_score, metadata
Detector Score Interpretation:
- 0.0-0.3: Likely real/authentic
- 0.3-0.7: Uncertain, may require additional analysis
- 0.7-1.0: Likely deepfake/manipulated
Example:
{
  "content_id": "df_001",
  "content_type": "multimodal",
  "image_description": "Portrait of person in business attire, lighting appears natural",
  "detector_score": 0.82,
  "metadata": {"platform": "social_media", "report_count": 3}
}
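The score bands above can be expressed as a small helper. This is a sketch, not the environment's own logic: the function name `interpret_detector_score` and the returned strings are hypothetical, and because the documented ranges share their endpoints, treating 0.3 and 0.7 as the start of the higher band is an assumption.

```python
def interpret_detector_score(score: float) -> str:
    """Return the band a raw detector score falls into.

    Boundary handling is an assumption: 0.3 and 0.7 are treated as the
    start of the higher band, since the documented ranges overlap there.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"detector_score must be in [0, 1], got {score}")
    if score < 0.3:
        return "likely_real"
    if score < 0.7:
        return "uncertain"
    return "likely_deepfake"


print(interpret_detector_score(0.82))  # likely_deepfake
```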
Observation Space
Every step returns a ContentObservation with the following structure:
{
  "content_id": "string",
  "content_type": "text | multimodal",
  "text": "string (optional, for text tasks)",
  "image_description": "string (optional, deepfake task only)",
  "detector_score": 0.0-1.0 (optional, deepfake task only),
  "metadata": {
    "source": "email | social_media | platform",
    "sender_reputation": 0.0-1.0,
    "link_count": 0,
    "report_count": 0,
    "timestamp": "ISO8601"
  },
  "step_num": 1,
  "total_steps": 10
}
| Field | Type | Required | Task | Description |
|---|---|---|---|---|
| content_id | string | Yes | All | Unique identifier for the content item |
| content_type | string | Yes | All | Type of content: text or multimodal |
| text | string | No | text_spam, content_moderation | The actual email/post body |
| image_description | string | No | deepfake_detection | AI-generated description of the image |
| detector_score | float | No | deepfake_detection | Raw output from deepfake model (0-1) |
| metadata | object | Yes | All | Platform-specific signals (reputation, reports, etc.) |
| step_num | int | Yes | All | Current step in episode |
| total_steps | int | Yes | All | Total steps in this episode |
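On the client side, an observation can be loaded into a typed record so that optional fields default to None. The sketch below uses only the standard library. The class name `ObservationRecord` is hypothetical, and the project's own Pydantic models in server/models.py may be structured differently.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class ObservationRecord:
    # Fields present for every task.
    content_id: str
    content_type: str
    step_num: int
    total_steps: int
    # Task-specific fields, absent unless the task provides them.
    text: Optional[str] = None
    image_description: Optional[str] = None
    detector_score: Optional[float] = None
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "ObservationRecord":
        # Ignore keys the record does not declare, so unexpected
        # extra fields in a payload do not raise a TypeError.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})


raw = json.loads("""
{"content_id": "df_001", "content_type": "multimodal",
 "image_description": "Portrait of person in business attire",
 "detector_score": 0.82,
 "metadata": {"platform": "social_media", "report_count": 3},
 "step_num": 1, "total_steps": 10}
""")
obs = ObservationRecord.from_dict(raw)
print(obs.content_id, obs.detector_score, obs.text)  # df_001 0.82 None
```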
Action Space
Agents must respond with a ModerationAction object:
{
  "decision": "approve | reject | escalate | flag",
  "reason": "string (max 200 chars, explanation)",
  "confidence": 0.0-1.0,
  "labels": ["spam", "scam", "phishing", "hate_speech", "violence",
             "harassment", "misinformation", "adult_content",
             "deepfake", "political_manipulation", "fraud"]
}
| Field | Type | Range | Description |
|---|---|---|---|
| decision | enum | {approve, reject, escalate, flag} | approve: Safe content; reject: Clear violation, remove immediately; flag: Borderline, needs review; escalate: Uncertain, route to specialist |
| reason | string | 0-200 chars | Brief explanation of the decision |
| confidence | float | [0.0, 1.0] | Agent's confidence in this decision. Calibration is rewarded. |
| labels | array | 0-11 labels | Detected violation categories. Each label must come from the allowed set. |
Valid Labels:
- spam: Unsolicited bulk messaging
- scam: Financial or identity theft attempts
- phishing: Social engineering attacks
- hate_speech: Dehumanizing content targeting identity groups
- violence: Threats, glorification, or instruction of violence
- harassment: Personal attacks, doxxing, coordinated abuse
- misinformation: False claims with societal impact
- adult_content: NSFW or sexually explicit material
- deepfake: AI-manipulated media
- political_manipulation: Coordinated inauthentic behavior
- fraud: Financial scams, false claims
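Checking an action locally before submitting it helps avoid rejected requests. The sketch below encodes the constraints listed above (decision enum, reason length, confidence range, allowed labels). The function name `validate_action` and its error messages are hypothetical, and the server's own validation may report problems differently.

```python
VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = {
    "spam", "scam", "phishing", "hate_speech", "violence", "harassment",
    "misinformation", "adult_content", "deepfake", "political_manipulation",
    "fraud",
}


def validate_action(action: dict) -> list[str]:
    """Return a list of problems; an empty list means the action looks valid."""
    problems = []

    if action.get("decision") not in VALID_DECISIONS:
        problems.append(f"invalid decision: {action.get('decision')!r}")

    reason = action.get("reason", "")
    if not isinstance(reason, str) or len(reason) > 200:
        problems.append("reason must be a string of at most 200 characters")

    confidence = action.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")

    labels = action.get("labels", [])
    if not isinstance(labels, list):
        problems.append("labels must be a list")
    else:
        for label in labels:
            if not isinstance(label, str) or label not in VALID_LABELS:
                problems.append(f"invalid label: {label!r}")

    return problems


action = {
    "decision": "reject",
    "reason": "Email contains typical spam patterns and suspicious links",
    "confidence": 0.92,
    "labels": ["spam", "scam"],
}
print(validate_action(action))  # []
```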
Reward Functions
Rewards are computed per task based on decision accuracy, label coverage (F1), and confidence calibration.
text_spam (Easy)
| Component | Reward | Condition |
|---|---|---|
| Correct decision | +0.65 | decision matches ground truth |
| Escalate on harmful | +0.30 | Harmful content + escalate/flag (partial credit) |
| Label F1 contribution | +0.20 | F1 score of predicted vs. true labels |
| Confidence calibration | ±0.10 | Bonus if confident on correct, penalty if confident on wrong |
| Max per step | 1.00 | Sum of components (capped) |
content_moderation (Medium)
| Component | Reward | Condition |
|---|---|---|
| Correct decision | +0.50 | decision matches ground truth |
| Partial credit | +0.25 | Harmful content + flag/escalate (conservative approach) |
| Label F1 contribution | +0.35 | Multi-label F1 score (up to 11 labels) |
| Confidence calibration | ±0.10 | Brier score penalty for miscalibration |
| Max per step | 1.00 | Sum of components (capped) |
deepfake_detection (Hard)
| Component | Reward | Condition |
|---|---|---|
| Correct decision | +0.40 | decision matches ground truth |
| Deepfake detection | +0.30 | Accuracy vs. detector_score threshold |
| Detector alignment | +0.10 | Bonus for leveraging model signals |
| Label F1 contribution | +0.20 | Multi-label F1 (fewer labels than medium task) |
| Confidence calibration | ±0.10 | Calibration error penalty |
| Max per step | 1.00 | Sum of components (capped) |
Calibration Bonus Formula:
bonus = 0.1 × (confidence if correct else -confidence)
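As a worked illustration, the sketch below implements the calibration bonus formula, a set-based label F1, and one possible way to combine the text_spam components. How server/graders.py actually combines them is not shown here. The function names, the handling of empty label sets, the lower clamp at 0, and the omission of the escalate/flag partial credit are all assumptions.

```python
def label_f1(predicted: set[str], truth: set[str]) -> float:
    """Set-based F1 between predicted and true labels.

    Returning 1.0 when both sets are empty is an assumption; the actual
    grader may treat that case differently.
    """
    if not predicted and not truth:
        return 1.0
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def calibration_bonus(confidence: float, correct: bool) -> float:
    """bonus = 0.1 * (confidence if correct else -confidence)."""
    return 0.1 * (confidence if correct else -confidence)


def text_spam_step_reward(
    decision_correct: bool,
    confidence: float,
    predicted: set[str],
    truth: set[str],
) -> float:
    """Hypothetical combination of the text_spam components.

    Uses the weights from the text_spam table (0.65 decision, 0.20 label F1)
    and clamps the sum to [0, 1]. Partial credit for escalate/flag is omitted.
    """
    reward = 0.65 if decision_correct else 0.0
    reward += 0.20 * label_f1(predicted, truth)
    reward += calibration_bonus(confidence, decision_correct)
    return max(0.0, min(1.0, reward))


# Correct reject at confidence 0.9, but only one of the two true labels found:
# 0.65 + 0.20 * (2/3) + 0.09
print(round(text_spam_step_reward(True, 0.9, {"spam"}, {"spam", "scam"}), 3))  # 0.873
```

Under this reading, a confident wrong decision loses reward twice: it forfeits the decision component and also pays the calibration penalty.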
Baseline Scores
Scores reported for Llama-3.1-8B-Instruct with temperature=0.2 and top-p=0.95:
| Task | Score | Steps | Notes |
|---|---|---|---|
| text_spam | 0.72 | 5 | Strong on obvious spam; struggles with phishing disguised as legitimate |
| content_moderation | 0.58 | 8 | Good binary decisions; incomplete label coverage (F1 ≈ 0.52) |
| deepfake_detection | 0.44 | 10 | Relies on image descriptions; independent detector signals underutilized |
Setup & Usage
Requirements
- Python: 3.11 or higher
- Docker (optional, for containerized deployment)
- GPU (optional, recommended for deepfake models): CUDA 12.1+
- Memory: 8GB+ RAM (16GB recommended for local LLM inference)
- Disk: 10GB+ (models cached in ~/.cache/huggingface/)
Local Installation
Clone and navigate:
git clone https://github.com/Anidipta/Content-Moderation-env.git
cd Content-Moderation-env
Create virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install dependencies:
pip install -r server/requirements.txt
Start the server:
uvicorn server.main:app --host 0.0.0.0 --port 7860
Server runs at http://localhost:7860
Access API documentation:
- Swagger UI: http://localhost:7860/docs
- ReDoc: http://localhost:7860/redoc
Docker Deployment
Build the Image
# Basic build
docker build -f server/Dockerfile -t content-moderation-env .
# Build with a 4 GB memory limit (recommended; the flag may be ignored when BuildKit is the active builder)
docker build --memory=4g -f server/Dockerfile -t content-moderation-env .
# Build with progress output
docker build --progress=plain -f server/Dockerfile -t content-moderation-env .
Run the Container
# Basic run
docker run -p 7860:7860 content-moderation-env
# Run with environment variables
docker run -p 7860:7860 \
-e API_BASE_URL="https://router.huggingface.co/v1" \
-e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
-e HF_TOKEN="hf_your_token_here" \
content-moderation-env
# Run with GPU support
docker run --gpus all -p 7860:7860 content-moderation-env
# Run with volume mounts (cache models locally)
docker run -p 7860:7860 \
-v ~/.cache/huggingface:/app/.cache/huggingface \
content-moderation-env
# Run in background
docker run -d -p 7860:7860 --name moderation-env content-moderation-env
# Check logs
docker logs moderation-env
# Stop container
docker stop moderation-env
Dockerfile Details
The server/Dockerfile uses:
- Base Image: python:3.11-slim (~300MB), a minimal footprint with the Python runtime
- System Dependencies: libgl1 libglib2.0-0 curl, required for vision models and health checks
- Dependencies Installation: Multi-stage approach with pip cache optimization
- Model Preloading: Deepfake detection model downloaded during build for faster startup
- Environment Setup: HuggingFace cache directories and Python settings pre-configured
- Entry Point: FastAPI app via Uvicorn on port 7860
# Key optimizations:
- --no-cache-dir: Reduces image size by 50%
- --no-build-isolation: Prevents memory spikes during pip install
- Pre-downloaded models: Eliminates first-run delays
- Minimal dependencies: Only libraries needed for the environment
Deployment to Production
Docker Compose:
version: '3.8'
services:
  moderation-api:
    build:
      context: .
      dockerfile: server/Dockerfile
    ports:
      - "7860:7860"
    environment:
      - API_BASE_URL=https://router.huggingface.co/v1
      - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/app/.cache/huggingface
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
      interval: 30s
      timeout: 10s
      retries: 3
Run with: docker-compose up -d
HuggingFace Spaces Deployment
- Create a new Space with Docker SDK
- Add Secrets (Settings → Repository secrets):
  - HF_TOKEN: Your HuggingFace API token
- Add Variables (Settings → Repository variables):
  - API_BASE_URL: https://router.huggingface.co/v1
  - MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
- Push this repository to the Space
- Space URL becomes your PING_URL for validation scripts
Running the Inference Script
# API mode (HF inference endpoint)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export SERVER_URL="http://localhost:7860"
export TASK_NAME="text_spam"
python inference.py
# Local transformers pipeline mode
export USE_LOCAL_MODEL="true"
python inference.py
Output Format
[START] task=text_spam env=content_moderation_env model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} reward=0.85 done=false error=null
[STEP] step=2 action={"decision":"approve","confidence":0.8,"labels":[]} reward=0.75 done=false error=null
[STEP] step=3 action={"decision":"escalate","confidence":0.5,"labels":["scam"]} reward=0.30 done=false error=null
[STEP] step=4 action={"decision":"reject","confidence":0.85,"labels":["phishing"]} reward=0.70 done=false error=null
[STEP] step=5 action={"decision":"approve","confidence":0.88,"labels":[]} reward=0.75 done=true error=null
[END] success=true steps=5 score=0.720 rewards=0.85,0.75,0.30,0.70,0.75
| Field | Type | Description |
|---|---|---|
| task | string | The task being evaluated |
| step | int | Current step number in episode |
| decision | string | Agent's moderation decision |
| confidence | float | Agent's confidence (0-1) |
| labels | array | Detected violation labels |
| reward | float | Reward received for this step |
| done | boolean | Episode completion flag |
| error | string/null | Error message if applicable |
| score | float | Final episode score |
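Because the log lines follow a fixed key=value layout, they can be parsed for later analysis. The sketch below handles the [STEP] lines shown above. The regular expression and the function name `parse_step_line` are assumptions based on the sample output only, and would need adjusting if inference.py emits a different layout.

```python
import json
import re

# Matches lines such as:
# [STEP] step=1 action={...} reward=0.85 done=false error=null
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> dict:
    """Parse one [STEP] log line into typed Python values."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),  # the action is logged as JSON
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,
    }


sample = (
    '[STEP] step=1 action={"decision":"reject","confidence":0.9,'
    '"labels":["spam"]} reward=0.85 done=false error=null'
)
parsed = parse_step_line(sample)
print(parsed["step"], parsed["action"]["decision"], parsed["reward"])  # 1 reject 0.85
```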
API Reference
Server Endpoints
All endpoints are JSON-based with FastAPI's automatic validation.
1. Reset Episode
POST /reset
Start a new moderation episode.
Request Body:
{
  "task": "text_spam"
}
Response (200 OK):
{
  "observation": {
    "content_id": "ts_001",
    "content_type": "text",
    "text": "CONGRATULATIONS! You've won $1,000,000!...",
    "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
    "step_num": 1,
    "total_steps": 10
  },
  "info": {}
}
Error (400):
{
  "detail": "Unknown task 'invalid_task'. Valid: ['text_spam', 'content_moderation', 'deepfake_detection']"
}
2. Submit Action
POST /step
Submit a moderation action for the current content.
Request Body:
{
  "decision": "reject",
  "reason": "Email contains typical spam patterns and suspicious links",
  "confidence": 0.92,
  "labels": ["spam", "scam"]
}
Response (200 OK):
{
  "observation": {
    "content_id": "ts_002",
    "content_type": "text",
    "text": "Hi Sarah, confirming our meeting tomorrow...",
    "metadata": {"source": "email", "sender_reputation": 0.92, "link_count": 0},
    "step_num": 2,
    "total_steps": 10
  },
  "reward": 0.85,
  "done": false,
  "info": {}
}
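The reset and step calls combine into a simple episode loop. The sketch below uses only the standard library and assumes the request and response shapes shown above. The names `http_send` and `run_episode`, the `max_steps` guard, and the injectable `send` parameter (which lets the loop be exercised without a running server) are hypothetical and not part of inference.py.

```python
import json
import urllib.request
from typing import Callable, Optional

# A sender takes (method, url, payload) and returns the decoded JSON reply.
Sender = Callable[[str, str, Optional[dict]], dict]


def http_send(method: str, url: str, payload: Optional[dict]) -> dict:
    """Send one JSON request with the standard library and decode the reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(
    base_url: str,
    task: str,
    policy: Callable[[dict], dict],
    send: Sender = http_send,
    max_steps: int = 1000,
) -> list[float]:
    """Reset the environment, then step until the server reports done."""
    result = send("POST", f"{base_url}/reset", {"task": task})
    rewards: list[float] = []
    for _ in range(max_steps):  # guard against a server that never sets done
        action = policy(result["observation"])
        result = send("POST", f"{base_url}/step", action)
        rewards.append(result["reward"])
        if result["done"]:
            break
    return rewards
```

With the server running locally, calling `run_episode("http://localhost:7860", "text_spam", policy)` with any function that maps an observation dict to an action dict returns the per-step rewards.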
3. Get Current State
GET /state
Retrieve the current episode state without taking an action.
Response (200 OK):
{
  "observation": {...},
  "reward": 0.85,
  "done": false,
  "info": {
    "task": "text_spam",
    "items_completed": 2,
    "total_items": 10,
    "cumulative_reward": 1.60
  }
}
4. Close Episode
POST /close
Explicitly close the episode and clean up resources.
Response (200 OK):
{
  "status": "closed",
  "final_reward": 7.20,
  "steps_completed": 10
}
5. List Available Tasks
GET /tasks
Get metadata about all available tasks.
Response (200 OK):
{
  "text_spam": {
    "description": "Classify email/message content as spam or legitimate",
    "difficulty": "easy",
    "num_items": 50,
    "content_type": "text"
  },
  "content_moderation": {
    "description": "Detect policy violations in social media posts",
    "difficulty": "medium",
    "num_items": 40,
    "content_type": "text"
  },
  "deepfake_detection": {
    "description": "Identify AI-manipulated media",
    "difficulty": "hard",
    "num_items": 30,
    "content_type": "multimodal"
  }
}
6. Health Check
GET /health
Check server health and status.
Response (200 OK):
{
  "status": "ok"
}
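Scripts that start the server and then run an agent can poll this endpoint until it responds. The sketch below is a minimal example under stated assumptions: the names `probe_health` and `wait_until_healthy`, the retry count, and the delay are hypothetical, and the `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str) -> bool:
    """Return True if GET <url> answers 200 with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            body = json.loads(response.read().decode("utf-8"))
            return response.status == 200 and body.get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: not healthy yet.
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint up to `attempts` times, pausing between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For a local server, `wait_until_healthy("http://localhost:7860/health")` returns True once the endpoint reports ok, or False after the attempts are exhausted.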
7. Root Endpoint
GET /
Redirects to interactive Swagger UI documentation.
Project Structure
content-moderation-env/
│
├── README.md                 # This file
├── uv.lock                   # Dependency lock file (UV package manager)
├── inference.py              # Baseline agent script (235 lines)
│                             # Demonstrates LLM agent interaction
│                             # Supports HF API and local inference modes
│
├── server/                   # FastAPI application (core)
│   ├── __init__.py           # Package marker (empty)
│   │
│   ├── main.py               # FastAPI app & HTTP endpoints (57 lines)
│   │                         # Defines: /reset, /step, /state, /close
│   │                         # /tasks, /health, / endpoints
│   │
│   ├── env.py                # OpenEnv environment implementation (122 lines)
│   │                         # Core logic: reset(), step(), state(), close()
│   │                         # Thread-safe with locks for concurrency
│   │
│   ├── models.py             # Pydantic data models
│   │                         # Defines: ContentObservation, ModerationAction
│   │                         # StepResult, ResetResult, EnvState
│   │
│   ├── tasks.py              # Task datasets & ground truth (193 lines)
│   │                         # Contains: text_spam, content_moderation,
│   │                         # deepfake_detection task definitions & items
│   │
│   ├── graders.py            # Reward functions per task (95 lines)
│   │                         # Implements: label F1, calibration bonus,
│   │                         # decision accuracy scoring logic
│   │
│   ├── deepfake_model.py     # HF deepfake detection pipeline (90 lines)
│   │                         # Lazy-loads: dima806/deepfake_vs_real...
│   │                         # Caches model in HF_HOME for reuse
│   │
│   ├── openenv.yaml          # OpenEnv specification metadata
│   │                         # Declares task specs, observation/action space
│   │
│   ├── Dockerfile            # Docker container definition
│   │                         # Base: python:3.11-slim (~300MB)
│   │                         # Installs system deps, pip packages,
│   │                         # pre-downloads deepfake model
│   │
│   └── requirements.txt      # Python dependencies (12 packages)
│                             # Key: fastapi, uvicorn, transformers,
│                             # torch, openai, python-dotenv
│
├── test/                     # Test suite
│   └── test.py               # pytest tests (20+ test cases)
│                             # Coverage: tasks, endpoints, rewards
│
└── .env                      # Environment variables (git-ignored)
                              # Stores: HF_TOKEN, API_BASE_URL, etc.
Environment Variables
Configuration is controlled via environment variables. Create a .env file in the project root:
# ============ API Configuration ============
API_BASE_URL=https://router.huggingface.co/v1
# URL of the LLM inference endpoint
# Default: HuggingFace router (requires HF_TOKEN)
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
# Which LLM to use for agent inference
# Other options: gpt-3.5-turbo, claude-3-opus, mistral-large, etc.
HF_TOKEN=hf_your_token_here
# HuggingFace API token for authenticated requests
# Get from: https://huggingface.co/settings/tokens
# ============ Server Configuration ============
SERVER_URL=http://localhost:7860
# Where the OpenEnv API server runs
# Used by inference.py to connect to environment
# ============ Task & Inference Configuration ============
TASK_NAME=text_spam
# Which task to run: text_spam, content_moderation, deepfake_detection
USE_LOCAL_MODEL=false
# If true: Load Llama-3.1-8B locally via transformers
# If false: Use remote API (requires HF_TOKEN)
# Local mode requires 16GB+ RAM
# ============ HuggingFace Model Caching ============
HF_HOME=/app/.cache/huggingface
# Directory for cached HF models and datasets
# Mounted as volume in Docker for persistence
TRANSFORMERS_CACHE=/app/.cache/huggingface
# Legacy cache variable for the transformers library (newer releases prefer HF_HOME)
# ============ Python Configuration ============
PYTHONDONTWRITEBYTECODE=1
# Don't create __pycache__ directories
PYTHONUNBUFFERED=1
# Stream logs immediately (useful in Docker)
# ============ Logging ============
LOG_LEVEL=INFO
# Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL
Variable Precedence
- Environment variables (highest priority)
- .env file
- Hardcoded defaults in code (lowest priority)
Example override:
export HF_TOKEN="hf_custom_token" && python inference.py
# Uses custom token instead of .env value
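The precedence order can be pictured as a lookup that tries each source in turn. The sketch below is illustrative only: the project lists python-dotenv as a dependency and may load settings differently, and the names `parse_dotenv`, `resolve`, and `DEFAULTS` (with its example values) are hypothetical.

```python
import os
from typing import Mapping, Optional

# Hypothetical hardcoded defaults, the lowest-priority source.
DEFAULTS = {"SERVER_URL": "http://localhost:7860", "TASK_NAME": "text_spam"}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(
    name: str,
    dotenv: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
    defaults: Mapping[str, str] = DEFAULTS,
) -> Optional[str]:
    """Look up a setting: process environment, then .env, then defaults."""
    environ = os.environ if environ is None else environ
    for source in (environ, dotenv, defaults):
        if name in source:
            return source[name]
    return None


dotenv = parse_dotenv("HF_TOKEN=hf_from_file\n# comment\nTASK_NAME=content_moderation\n")
shell = {"HF_TOKEN": "hf_custom_token"}
print(resolve("HF_TOKEN", dotenv, environ=shell))    # hf_custom_token
print(resolve("TASK_NAME", dotenv, environ=shell))   # content_moderation
print(resolve("SERVER_URL", dotenv, environ=shell))  # http://localhost:7860
```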
Running Tests
The project includes a comprehensive test suite using pytest.
Setup
pip install pytest pytest-cov
Run All Tests
pytest test/test.py -v
Run Specific Test Class
pytest test/test.py::TestTasks -v
Run with Coverage Report
pytest test/test.py --cov=server --cov-report=html
# Writes the report to htmlcov/index.html; open that file in a browser to view coverage
Test Categories
| Test | Coverage | Status |
|---|---|---|
| Task loading | All 3 tasks initialize correctly | ✓ |
| API endpoints | /reset, /step, /state, /close, /tasks, /health | ✓ |
| Reward grading | text_spam, content_moderation, deepfake_detection | ✓ |
| Input validation | Action schema validation, label validation | ✓ |
| Edge cases | Empty labels, out-of-range confidence, etc. | ✓ |
Troubleshooting
Installation Issues
Problem: ImportError: No module named 'openai'
Solution: pip install "openai>=1.40.0"
Problem: ImportError: No module named 'torch'
Solution: pip install torch torchvision
# For GPU: pip install torch torchvision -f https://download.pytorch.org/whl/cu121/torch_stable.html
Problem: FileNotFoundError: requirements.txt
Solution: Ensure you're in the project root (the directory created by git clone): cd Content-Moderation-env/
# Then: pip install -r server/requirements.txt
Docker Issues
Problem: Segmentation fault (core dumped) during build
Solution: Allocate more memory to Docker build:
docker build --memory=8g -f server/Dockerfile -t content-moderation-env .
Problem: failed to solve: failed to compute cache key
Solution: Ensure requirements.txt is in server/ directory:
# Current: server/requirements.txt (correct)
# Wrong: ./requirements.txt
Problem: Port 7860 already in use
Solution: Use different port:
docker run -p 8000:7860 content-moderation-env
# Now access at http://localhost:8000
Runtime Issues
Problem: Connection refused: localhost:7860
Solution: Ensure server is running:
uvicorn server.main:app --host 0.0.0.0 --port 7860
In Docker, use: docker logs <container_id>
Problem: Client.__init__() got an unexpected keyword argument 'proxies'
Solution: Update OpenAI client:
pip install --upgrade openai
Problem: HuggingFace models downloading very slowly
Solution: Check internet connection and verify HF_TOKEN:
export HF_TOKEN="hf_your_token_here"
# Or download models ahead of time
python -c "from transformers import pipeline; pipeline('image-classification', model='dima806/deepfake_vs_real_image_detection')"
API Issues
Problem: Invalid request to /step without /reset
Error: "Environment not initialized. Call /reset first."
Solution: Always call POST /reset before any /step requests
Problem: Invalid label in action
Error: {"detail": "Invalid label: 'unknown_label'"}
Solution: Use only valid labels from the specification
Problem: Confidence out of range
Solution: Ensure confidence is between 0.0 and 1.0
Citation
If you use this environment in your research, please cite:
@software{content_moderation_openenv_2025,
title={Content Moderation OpenEnv: A Real-World AI Triage Environment},
author={Anidipta},
year={2025},
url={https://github.com/Anidipta/Content-Moderation-env},
note={OpenEnv Specification Compliant}
}
Acknowledgements
Built for the OpenEnv Hackathon 2025.
Special Thanks To:
- OpenEnv community for the specification and framework
- HuggingFace for model hosting and inference APIs
- Meta for the Llama-3.1-8B-Instruct model
- Contributors and testers who improved the environment
Dataset & Content Note: The email and content corpus is entirely synthetic and does not represent any real individuals, companies, organizations, or actual events. All examples are generated for demonstration and testing purposes only.
License: MIT License. See the LICENSE file for details.
Questions? Open an issue on GitHub or contact the maintainers.
Last Updated: April 8, 2026 | OpenEnv Spec Version: 1.0