metadata
title: Content Moderation OpenEnv
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false

Content Moderation OpenEnv

An AI content moderation environment built to the OpenEnv specification. Agents triage real-world content (spam emails, harmful social media posts, and AI-generated deepfakes) using a standard step() / reset() / state() API.

OpenEnv Spec | Python 3.11+ | FastAPI | Docker | License: MIT




Environment Description & Motivation

Content moderation is a high-stakes, high-volume real-world task. Human moderators review millions of items daily across platforms and inboxes. This environment simulates a realistic moderation pipeline across three difficulty levels, enabling AI agents to learn decision-making strategies under resource constraints.

Key Challenges:

  • Multi-label classification with imbalanced datasets
  • Confidence calibration under uncertainty
  • Real-world content variability (spam, deepfakes, policy violations)
  • Escalation vs. immediate action tradeoffs

| Task | Difficulty | Content Type | Metrics | Description |
|---|---|---|---|---|
| text_spam | Easy | Email / SMS | Binary classification + confidence | Spam vs. legitimate email filtering with sender reputation signals |
| content_moderation | Medium | Social media posts | Multi-label (up to 11 labels) | Detect policy violations (hate speech, harassment, violence, etc.) |
| deepfake_detection | Hard | Multimodal (image + text) | Detection accuracy + labels | Identify AI-manipulated media using vision transformer + metadata |

Models Used:

  • Deepfake Detection: dima806/deepfake_vs_real_image_detection (ViT, ~300MB)
  • Language Model: meta-llama/Llama-3.1-8B-Instruct (8B params, via transformers or HF API)
  • Infrastructure: FastAPI + Uvicorn, Docker containerization

Task Descriptions

Task 1: Text Spam Classification (Easy)

Objective: Binary classification of emails/SMS as spam or legitimate.

Dataset:

  • 50 items (30 spam, 20 legitimate)
  • Features: text content, sender reputation score, link count, source
  • Ground truth: decision + labels

Example:

{
  "content_id": "ts_001",
  "content_type": "text",
  "text": "CONGRATULATIONS! You've won $1,000,000! Click here NOW to claim your prize!!!",
  "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
  "ground_truth": {
    "decision": "reject",
    "labels": ["spam", "scam"],
    "is_harmful": true
  }
}

Task 2: Content Moderation (Medium)

Objective: Multi-label classification of social media posts for policy violations.

Dataset:

  • 40 items across diverse platforms
  • Labels: spam, scam, phishing, hate_speech, violence, harassment, misinformation, adult_content, deepfake, political_manipulation, fraud
  • Features: post text, engagement metrics, user reputation, report count

Violation Categories:

| Category | Definition | Examples |
|---|---|---|
| Hate Speech | Dehumanizing content targeting identity | Slurs, discrimination, incitement |
| Violence | Threats or glorification of violence | Physical harm, weapon promotion |
| Harassment | Coordinated or severe personal attacks | Doxxing, targeted campaigns |
| Misinformation | False claims with societal impact | Election fraud claims, health hoaxes |

Task 3: Deepfake Detection (Hard)

Objective: Detect AI-manipulated media and classify content appropriately.

Dataset:

  • 30 items (multimodal: images + descriptions)
  • Deepfake detection model outputs raw confidence scores (0-1)
  • Features: image description, detector_score, metadata

Detector Score Interpretation:

  • 0.0-0.3: Likely real/authentic
  • 0.3-0.7: Uncertain, may require additional analysis
  • 0.7-1.0: Likely deepfake/manipulated
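The bands above can be turned into a small helper. This is a minimal sketch, not the environment's own code: the function name and band names are hypothetical, and because the listed ranges share their endpoints, it assumes that 0.3 falls in the uncertain band and 0.7 in the deepfake band.

```python
def interpret_detector_score(score: float) -> str:
    """Map a raw detector score in [0, 1] to one of the three bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"detector_score must be in [0, 1], got {score}")
    if score < 0.3:
        return "likely_real"
    if score < 0.7:  # assumption: 0.3 itself counts as uncertain
        return "uncertain"
    return "likely_deepfake"  # assumption: 0.7 itself counts as deepfake


if __name__ == "__main__":
    print(interpret_detector_score(0.82))
```

An agent might route the uncertain band to escalate or flag rather than committing to approve or reject, but that policy choice is left to the agent.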

Example:

{
  "content_id": "df_001",
  "content_type": "multimodal",
  "image_description": "Portrait of person in business attire, lighting appears natural",
  "detector_score": 0.82,
  "metadata": {"platform": "social_media", "report_count": 3}
}

Observation Space

Every step returns a ContentObservation with the following structure:

{
  "content_id": "string",
  "content_type": "text | multimodal",
  "text": "string (optional, for text tasks)",
  "image_description": "string (optional, deepfake task only)",
  "detector_score": 0.0-1.0 (optional, deepfake task only),
  "metadata": {
    "source": "email | social_media | platform",
    "sender_reputation": 0.0-1.0,
    "link_count": 0,
    "report_count": 0,
    "timestamp": "ISO8601"
  },
  "step_num": 1,
  "total_steps": 10
}

| Field | Type | Required For | Description |
|---|---|---|---|
| content_id | string | All | Unique identifier for the content item |
| content_type | string | All | Type of content: text or multimodal |
| text | string | text_spam, content_moderation | The actual email/post body |
| image_description | string | deepfake_detection | AI-generated description of the image |
| detector_score | float | deepfake_detection | Raw output from deepfake model (0-1) |
| metadata | object | All | Platform-specific signals (reputation, reports, etc.) |
| step_num | int | All | Current step in episode |
| total_steps | int | All | Total steps in this episode |
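On the client side, an observation payload can be loaded into a typed container before an agent reasons over it. The sketch below uses a standard-library dataclass; the class and function names are hypothetical, and the project's own Pydantic ContentObservation model in server/models.py may differ in field defaults and validation.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class ObservationSketch:
    """Client-side mirror of the observation fields listed above."""
    content_id: str
    content_type: str
    step_num: int
    total_steps: int
    metadata: dict[str, Any] = field(default_factory=dict)
    text: Optional[str] = None               # text tasks only
    image_description: Optional[str] = None  # deepfake task only
    detector_score: Optional[float] = None   # deepfake task only


def parse_observation(payload: dict[str, Any]) -> ObservationSketch:
    """Build an ObservationSketch, ignoring keys this sketch does not know."""
    known = {f.name for f in fields(ObservationSketch)}
    obs = ObservationSketch(**{k: v for k, v in payload.items() if k in known})
    if obs.content_type not in ("text", "multimodal"):
        raise ValueError(f"unexpected content_type: {obs.content_type!r}")
    if obs.detector_score is not None and not 0.0 <= obs.detector_score <= 1.0:
        raise ValueError("detector_score must be in [0, 1]")
    return obs
```

A missing required key (for example content_id) surfaces as a TypeError from the dataclass constructor, which is usually enough for a quick client-side sanity check.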

Action Space

Agents must respond with a ModerationAction object:

{
  "decision": "approve | reject | escalate | flag",
  "reason": "string (max 200 chars, explanation)",
  "confidence": 0.0-1.0,
  "labels": ["spam", "scam", "phishing", "hate_speech", "violence",
             "harassment", "misinformation", "adult_content",
             "deepfake", "political_manipulation", "fraud"]
}

| Field | Type | Range | Description |
|---|---|---|---|
| decision | enum | {approve, reject, escalate, flag} | approve: Safe content; reject: Clear violation, remove immediately; flag: Borderline, needs review; escalate: Uncertain, route to specialist |
| reason | string | 0-200 chars | Brief explanation of the decision |
| confidence | float | [0.0, 1.0] | Agent's confidence in this decision. Calibration is rewarded. |
| labels | array | 0-11 labels | Detected violation categories. Must be valid from the allowed set. |

Valid Labels:

  • spam: Unsolicited bulk messaging
  • scam: Financial or identity theft attempts
  • phishing: Social engineering attacks
  • hate_speech: Dehumanizing content targeting identity groups
  • violence: Threats, glorification, or instruction of violence
  • harassment: Personal attacks, doxxing, coordinated abuse
  • misinformation: False claims with societal impact
  • adult_content: NSFW or sexually explicit material
  • deepfake: AI-manipulated media
  • political_manipulation: Coordinated inauthentic behavior
  • fraud: Financial scams, false claims
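An agent can check an action against the constraints above before sending it, which avoids a round trip that ends in a validation error. The following is a minimal client-side sketch; validate_action is a hypothetical helper, and the server's own validation may be stricter or word its errors differently.

```python
VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = {
    "spam", "scam", "phishing", "hate_speech", "violence", "harassment",
    "misinformation", "adult_content", "deepfake", "political_manipulation",
    "fraud",
}


def validate_action(action: dict) -> list[str]:
    """Return a list of problems; an empty list means the action looks valid."""
    problems = []
    if action.get("decision") not in VALID_DECISIONS:
        problems.append(f"invalid decision: {action.get('decision')!r}")
    reason = action.get("reason", "")
    if not isinstance(reason, str) or len(reason) > 200:
        problems.append("reason must be a string of at most 200 characters")
    conf = action.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")
    unknown = [label for label in action.get("labels", []) if label not in VALID_LABELS]
    if unknown:
        problems.append(f"unknown labels: {unknown}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to feed all issues back to an LLM agent in a single retry prompt.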

Reward Functions

Rewards are computed per task based on decision accuracy, label coverage (F1), and confidence calibration.
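For reference, label F1 over two label sets can be computed as below. This is a sketch of the standard set-based F1, not a copy of server/graders.py; in particular, scoring two empty label sets as 1.0 is an assumption about how clean content is handled.

```python
def label_f1(predicted: list[str], truth: list[str]) -> float:
    """Set-based F1 between predicted and ground-truth labels."""
    pred, true = set(predicted), set(truth)
    if not pred and not true:
        return 1.0  # assumption: nothing to find and nothing predicted is a perfect match
    true_positives = len(pred & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting ["spam"] when the ground truth is ["spam", "scam"] gives precision 1.0 and recall 0.5, so F1 is 2/3.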

text_spam (Easy)

| Component | Reward | Condition |
|---|---|---|
| Correct decision | +0.65 | decision matches ground truth |
| Escalate on harmful | +0.30 | Harmful content + escalate/flag (partial credit) |
| Label F1 contribution | +0.20 | F1 score of predicted vs. true labels |
| Confidence calibration | ±0.10 | Bonus if confident on correct, penalty if confident on wrong |
| Max per step | 1.00 | Sum of components (capped) |

content_moderation (Medium)

| Component | Reward | Condition |
|---|---|---|
| Correct decision | +0.50 | decision matches ground truth |
| Partial credit | +0.25 | Harmful content + flag/escalate (conservative approach) |
| Label F1 contribution | +0.35 | Multi-label F1 score (up to 11 labels) |
| Confidence calibration | ±0.10 | Brier score penalty for miscalibration |
| Max per step | 1.00 | Sum of components (capped) |

deepfake_detection (Hard)

| Component | Reward | Condition |
|---|---|---|
| Correct decision | +0.40 | decision matches ground truth |
| Deepfake detection | +0.30 | Accuracy vs. detector_score threshold |
| Detector alignment | +0.10 | Bonus for leveraging model signals |
| Label F1 contribution | +0.20 | Multi-label F1 (fewer labels than medium task) |
| Confidence calibration | ±0.10 | Calibration error penalty |
| Max per step | 1.00 | Sum of components (capped) |

Calibration Bonus Formula:

bonus = 0.1 × (confidence if correct else -confidence)
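The formula translates directly into code. The sketch below also shows one plausible way the text_spam components could combine into a capped per-step reward; the function names are hypothetical, and scaling the label weight by the F1 value and clamping the total to [0, 1] are assumptions rather than confirmed behavior of server/graders.py.

```python
def calibration_bonus(confidence: float, correct: bool) -> float:
    """bonus = 0.1 * (confidence if correct else -confidence)."""
    return 0.1 * (confidence if correct else -confidence)


def text_spam_step_reward(decision_correct: bool, label_f1: float, confidence: float) -> float:
    """Combine the text_spam weights; partial credit for escalate/flag is omitted."""
    total = 0.65 if decision_correct else 0.0
    total += 0.20 * label_f1  # assumption: the F1 weight scales with the F1 value
    total += calibration_bonus(confidence, decision_correct)
    return max(0.0, min(1.0, total))  # assumption: clamp to [0, 1]
```

With a correct decision, perfect label F1, and confidence 0.9, this yields 0.65 + 0.20 + 0.09 = 0.94. A wrong decision made with the same confidence loses 0.09 instead, which is why overconfident mistakes cost more than hesitant ones.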

Baseline Scores

Scores reported for Llama-3.1-8B-Instruct with temperature=0.2 and top-p=0.95:

| Task | Score | Steps | Notes |
|---|---|---|---|
| text_spam | 0.72 | 5 | Strong on obvious spam; struggles with phishing disguised as legitimate |
| content_moderation | 0.58 | 8 | Good binary decisions; incomplete label coverage (F1 ≈ 0.52) |
| deepfake_detection | 0.44 | 10 | Relies on image descriptions; independent detector signals underutilized |

Setup & Usage

Requirements

  • Python: 3.11 or higher
  • Docker (optional, for containerized deployment)
  • GPU (optional, recommended for deepfake models): CUDA 12.1+
  • Memory: 8GB+ RAM (16GB recommended for local LLM inference)
  • Disk: 10GB+ (models cached in ~/.cache/huggingface/)

Local Installation

  1. Clone and navigate:

    git clone https://github.com/Anidipta/Content-Moderation-env.git
    cd Content-Moderation-env
    
  2. Create virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install dependencies:

    pip install -r server/requirements.txt
    
  4. Start the server:

    uvicorn server.main:app --host 0.0.0.0 --port 7860
    

    Server runs at http://localhost:7860

  5. Access API documentation:

    • Swagger UI: http://localhost:7860/docs
    • ReDoc: http://localhost:7860/redoc

Docker Deployment

Build the Image

# Basic build
docker build -f server/Dockerfile -t content-moderation-env .

# Build with memory allocation (recommended)
docker build --memory=4g -f server/Dockerfile -t content-moderation-env .

# Build with progress output
docker build --progress=plain -f server/Dockerfile -t content-moderation-env .

Run the Container

# Basic run
docker run -p 7860:7860 content-moderation-env

# Run with environment variables
docker run -p 7860:7860 \
  -e API_BASE_URL="https://router.huggingface.co/v1" \
  -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
  -e HF_TOKEN="hf_your_token_here" \
  content-moderation-env

# Run with GPU support
docker run --gpus all -p 7860:7860 content-moderation-env

# Run with volume mounts (cache models locally)
docker run -p 7860:7860 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  content-moderation-env

# Run in background
docker run -d -p 7860:7860 --name moderation-env content-moderation-env

# Check logs
docker logs moderation-env

# Stop container
docker stop moderation-env

Dockerfile Details

The server/Dockerfile uses:

  • Base Image: python:3.11-slim (~300MB), a minimal footprint with the Python runtime
  • System Dependencies: libgl1 libglib2.0-0 curl, required for vision models and health checks
  • Dependencies Installation: Multi-stage approach with pip cache optimization
  • Model Preloading: Deepfake detection model downloaded during build for faster startup
  • Environment Setup: HuggingFace cache directories and Python settings pre-configured
  • Entry Point: FastAPI app via Uvicorn on port 7860

# Key optimizations:
- --no-cache-dir: Reduces image size by 50%
- --no-build-isolation: Prevents memory spikes during pip install
- Pre-downloaded models: Eliminates first-run delays
- Minimal dependencies: Only libraries needed for the environment

Deployment to Production

Docker Compose:

version: '3.8'
services:
  moderation-api:
    build:
      context: .
      dockerfile: server/Dockerfile
    ports:
      - "7860:7860"
    environment:
      - API_BASE_URL=https://router.huggingface.co/v1
      - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/app/.cache/huggingface
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Run with: docker-compose up -d

HuggingFace Spaces Deployment

  1. Create a new Space with Docker SDK
  2. Add Secrets (Settings → Repository secrets):
    • HF_TOKEN: Your HuggingFace API token
  3. Add Variables (Settings → Repository variables):
    • API_BASE_URL: https://router.huggingface.co/v1
    • MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
  4. Push this repository to the Space
  5. Space URL becomes your PING_URL for validation scripts

Running the Inference Script

# API mode (HF inference endpoint)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export SERVER_URL="http://localhost:7860"
export TASK_NAME="text_spam"

python inference.py

# Local transformers pipeline mode
export USE_LOCAL_MODEL="true"
python inference.py

Output Format

[START] task=text_spam env=content_moderation_env model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} reward=0.85 done=false error=null
[STEP] step=2 action={"decision":"approve","confidence":0.8,"labels":[]} reward=0.75 done=false error=null
[STEP] step=3 action={"decision":"escalate","confidence":0.5,"labels":["scam"]} reward=0.30 done=false error=null
[STEP] step=4 action={"decision":"reject","confidence":0.85,"labels":["phishing"]} reward=0.70 done=false error=null
[STEP] step=5 action={"decision":"approve","confidence":0.88,"labels":[]} reward=0.75 done=true error=null
[END] success=true steps=5 score=0.720 rewards=0.85,0.75,0.30,0.70,0.75

| Field | Type | Description |
|---|---|---|
| task | string | The task being evaluated |
| step | int | Current step number in episode |
| decision | string | Agent's moderation decision |
| confidence | float | Agent's confidence (0-1) |
| labels | array | Detected violation labels |
| reward | float | Reward received for this step |
| done | boolean | Episode completion flag |
| error | string/null | Error message if applicable |
| score | float | Final episode score |
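Because each [STEP] line follows a fixed key=value layout with a JSON action, the log is straightforward to parse for later analysis. This is a sketch that assumes the exact layout shown in the sample output; parse_step_line is a hypothetical helper, not part of inference.py.

```python
import json
import re

# Matches lines like:
# [STEP] step=1 action={...} reward=0.85 done=false error=null
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> dict:
    """Parse one [STEP] log line into a dictionary of typed fields."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }
```

Collecting the parsed rewards across a run makes it easy to compare agents step by step instead of relying only on the final score.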

API Reference

Server Endpoints

All endpoints are JSON-based with FastAPI's automatic validation.

1. Reset Episode

POST /reset

Start a new moderation episode.

Request Body:

{
  "task": "text_spam"
}

Response (200 OK):

{
  "observation": {
    "content_id": "ts_001",
    "content_type": "text",
    "text": "CONGRATULATIONS! You've won $1,000,000!...",
    "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
    "step_num": 1,
    "total_steps": 10
  },
  "info": {}
}

Error (400):

{
  "detail": "Unknown task 'invalid_task'. Valid: ['text_spam', 'content_moderation', 'deepfake_detection']"
}

2. Submit Action

POST /step

Submit a moderation action for the current content.

Request Body:

{
  "decision": "reject",
  "reason": "Email contains typical spam patterns and suspicious links",
  "confidence": 0.92,
  "labels": ["spam", "scam"]
}

Response (200 OK):

{
  "observation": {
    "content_id": "ts_002",
    "content_type": "text",
    "text": "Hi Sarah, confirming our meeting tomorrow...",
    "metadata": {"source": "email", "sender_reputation": 0.92, "link_count": 0},
    "step_num": 2,
    "total_steps": 10
  },
  "reward": 0.85,
  "done": false,
  "info": {}
}
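Putting /reset and /step together, an episode is a loop that submits actions until done is true. The sketch below uses only the standard library and takes the transport as a parameter so it can be exercised without a running server; http_post, run_episode, and the policy callable are hypothetical names, not functions from inference.py.

```python
import json
from typing import Callable
from urllib import request

Post = Callable[[str, dict], dict]


def http_post(base_url: str) -> Post:
    """Build a JSON POST helper bound to base_url."""
    def post(path: str, payload: dict) -> dict:
        req = request.Request(
            base_url.rstrip("/") + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            return json.load(resp)
    return post


def run_episode(post: Post, task: str, policy: Callable[[dict], dict],
                max_steps: int = 100) -> list[float]:
    """Reset the environment, then step until done; return per-step rewards."""
    obs = post("/reset", {"task": task})["observation"]
    rewards = []
    for _ in range(max_steps):  # safety bound in case done never arrives
        result = post("/step", policy(obs))
        rewards.append(result["reward"])
        if result["done"]:
            break
        obs = result["observation"]
    return rewards
```

Against a local server this could be called as run_episode(http_post("http://localhost:7860"), "text_spam", my_policy), where my_policy maps an observation to a ModerationAction dictionary.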

3. Get Current State

GET /state

Retrieve the current episode state without taking an action.

Response (200 OK):

{
  "observation": {...},
  "reward": 0.85,
  "done": false,
  "info": {
    "task": "text_spam",
    "items_completed": 2,
    "total_items": 10,
    "cumulative_reward": 1.60
  }
}

4. Close Episode

POST /close

Explicitly close the episode and clean up resources.

Response (200 OK):

{
  "status": "closed",
  "final_reward": 7.20,
  "steps_completed": 10
}

5. List Available Tasks

GET /tasks

Get metadata about all available tasks.

Response (200 OK):

{
  "text_spam": {
    "description": "Classify email/message content as spam or legitimate",
    "difficulty": "easy",
    "num_items": 50,
    "content_type": "text"
  },
  "content_moderation": {
    "description": "Detect policy violations in social media posts",
    "difficulty": "medium",
    "num_items": 40,
    "content_type": "text"
  },
  "deepfake_detection": {
    "description": "Identify AI-manipulated media",
    "difficulty": "hard",
    "num_items": 30,
    "content_type": "multimodal"
  }
}

6. Health Check

GET /health

Check server health and status.

Response (200 OK):

{
  "status": "ok"
}
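Scripts that start the server and immediately run an agent may want to wait for /health to report ok first. The following is a minimal polling sketch; wait_until_healthy and http_probe are hypothetical helpers, and the retry count and interval are illustrative defaults rather than values taken from the project.

```python
import json
import time
from typing import Callable
from urllib import request


def http_probe(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON body."""
    with request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(probe: Callable[[], dict], retries: int = 3,
                       interval: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `retries` times; True once it reports status ok."""
    for attempt in range(retries):
        try:
            if probe().get("status") == "ok":
                return True
        except (OSError, ValueError):
            pass  # connection refused, timeout, HTTP error, or invalid JSON
        if attempt < retries - 1:
            sleep(interval)
    return False
```

A typical call would be wait_until_healthy(lambda: http_probe("http://localhost:7860/health")). Passing the probe and sleep functions in keeps the retry logic testable without a network.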

7. Root Endpoint

GET /

Redirects to interactive Swagger UI documentation.


Project Structure

content-moderation-env/
│
├── README.md                          # This file
├── uv.lock                            # Dependency lock file (UV package manager)
├── inference.py                       # Baseline agent script (235 lines)
│                                      # Demonstrates LLM agent interaction
│                                      # Supports HF API and local inference modes
│
├── server/                            # FastAPI application (core)
│   ├── __init__.py                    # Package marker (empty)
│   │
│   ├── main.py                        # FastAPI app & HTTP endpoints (57 lines)
│   │                                  # Defines: /reset, /step, /state, /close
│   │                                  # /tasks, /health, / endpoints
│   │
│   ├── env.py                         # OpenEnv environment implementation (122 lines)
│   │                                  # Core logic: reset(), step(), state(), close()
│   │                                  # Thread-safe with locks for concurrency
│   │
│   ├── models.py                      # Pydantic data models
│   │                                  # Defines: ContentObservation, ModerationAction
│   │                                  # StepResult, ResetResult, EnvState
│   │
│   ├── tasks.py                       # Task datasets & ground truth (193 lines)
│   │                                  # Contains: text_spam, content_moderation,
│   │                                  # deepfake_detection task definitions & items
│   │
│   ├── graders.py                     # Reward functions per task (95 lines)
│   │                                  # Implements: label F1, calibration bonus,
│   │                                  # decision accuracy scoring logic
│   │
│   ├── deepfake_model.py              # HF deepfake detection pipeline (90 lines)
│   │                                  # Lazy-loads: dima806/deepfake_vs_real...
│   │                                  # Caches model in HF_HOME for reuse
│   │
│   ├── openenv.yaml                   # OpenEnv specification metadata
│   │                                  # Declares task specs, observation/action space
│   │
│   ├── Dockerfile                     # Docker container definition
│   │                                  # Base: python:3.11-slim (~300MB)
│   │                                  # Installs system deps, pip packages,
│   │                                  # pre-downloads deepfake model
│   │
│   └── requirements.txt               # Python dependencies (12 packages)
│                                      # Key: fastapi, uvicorn, transformers,
│                                      # torch, openai, python-dotenv
│
├── test/                              # Test suite
│   └── test.py                        # pytest tests (20+ test cases)
│                                      # Coverage: tasks, endpoints, rewards
│
└── .env                               # Environment variables (git-ignored)
                                       # Stores: HF_TOKEN, API_BASE_URL, etc.

Environment Variables

Configuration is controlled via environment variables. Create a .env file in the project root:

# ============ API Configuration ============
API_BASE_URL=https://router.huggingface.co/v1
# URL of the LLM inference endpoint
# Default: HuggingFace router (requires HF_TOKEN)

MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
# Which LLM to use for agent inference
# Other options: gpt-3.5-turbo, claude-3-opus, mistral-large, etc.

HF_TOKEN=hf_your_token_here
# HuggingFace API token for authenticated requests
# Get from: https://huggingface.co/settings/tokens

# ============ Server Configuration ============
SERVER_URL=http://localhost:7860
# Where the OpenEnv API server runs
# Used by inference.py to connect to environment

# ============ Task & Inference Configuration ============
TASK_NAME=text_spam
# Which task to run: text_spam, content_moderation, deepfake_detection

USE_LOCAL_MODEL=false
# If true: Load Llama-3.1-8B locally via transformers
# If false: Use remote API (requires HF_TOKEN)
# Local mode requires 16GB+ RAM

# ============ HuggingFace Model Caching ============
HF_HOME=/app/.cache/huggingface
# Directory for cached HF models and datasets
# Mounted as volume in Docker for persistence

TRANSFORMERS_CACHE=/app/.cache/huggingface
# Legacy cache variable for the transformers library
# Recent transformers releases deprecate it in favor of HF_HOME

# ============ Python Configuration ============
PYTHONDONTWRITEBYTECODE=1
# Don't create __pycache__ directories

PYTHONUNBUFFERED=1
# Stream logs immediately (useful in Docker)

# ============ Logging ============
LOG_LEVEL=INFO
# Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL

Variable Precedence

  1. Environment variables (highest priority)
  2. .env file
  3. Hardcoded defaults in code (lowest priority)

Example override:

export HF_TOKEN="hf_custom_token" && python inference.py
# Uses custom token instead of .env value
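The precedence order can be expressed as a lookup across three sources. This is an illustrative sketch, not the project's loader: parse_dotenv and resolve are hypothetical helpers, and the DEFAULTS values are copied from the sample .env above rather than from the actual code. The python-dotenv package listed in requirements.txt behaves consistently with this order, since load_dotenv leaves already-set environment variables untouched unless override=True is passed.

```python
import os
from typing import Mapping, Optional

# Assumption: illustrative defaults taken from the sample .env, not from the code.
DEFAULTS = {
    "SERVER_URL": "http://localhost:7860",
    "TASK_NAME": "text_spam",
    "USE_LOCAL_MODEL": "false",
}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(name: str, dotenv: Mapping[str, str],
            environ: Mapping[str, str] = os.environ,
            defaults: Mapping[str, str] = DEFAULTS) -> Optional[str]:
    """Look up name in environ, then the .env values, then defaults."""
    for source in (environ, dotenv, defaults):
        if name in source:
            return source[name]
    return None
```

Keeping the three sources as explicit arguments makes it clear where a surprising value came from when debugging configuration.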

Running Tests

The project includes a comprehensive test suite using pytest.

Setup

pip install pytest pytest-cov

Run All Tests

pytest test/test.py -v

Run Specific Test Class

pytest test/test.py::TestTasks -v

Run with Coverage Report

pytest test/test.py --cov=server --cov-report=html
# Writes the report to htmlcov/index.html; open that file in a browser to view coverage

Test Categories

| Test | Coverage | Status |
|---|---|---|
| Task loading | All 3 tasks initialize correctly | ✓ |
| API endpoints | /reset, /step, /state, /close, /tasks, /health | ✓ |
| Reward grading | text_spam, content_moderation, deepfake_detection | ✓ |
| Input validation | Action schema validation, label validation | ✓ |
| Edge cases | Empty labels, out-of-range confidence, etc. | ✓ |

Troubleshooting

Installation Issues

Problem: ImportError: No module named 'openai'

Solution: pip install "openai>=1.40.0"

Problem: ImportError: No module named 'torch'

Solution: pip install torch torchvision
# For GPU (CUDA 12.1 wheels): pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Problem: FileNotFoundError: requirements.txt

Solution: Ensure you're in the project root (the directory created by git clone): cd Content-Moderation-env/
# Then: pip install -r server/requirements.txt

Docker Issues

Problem: Segmentation fault (core dumped) during build

Solution: Allocate more memory to Docker build:
docker build --memory=8g -f server/Dockerfile -t content-moderation-env .

Problem: failed to solve: failed to compute cache key

Solution: Ensure requirements.txt is in server/ directory:
# Current: server/requirements.txt (correct)
# Wrong: ./requirements.txt

Problem: Port 7860 already in use

Solution: Use different port:
docker run -p 8000:7860 content-moderation-env
# Now access at http://localhost:8000

Runtime Issues

Problem: Connection refused: localhost:7860

Solution: Ensure server is running:
uvicorn server.main:app --host 0.0.0.0 --port 7860

In Docker, use: docker logs <container_id>

Problem: Client.__init__() got an unexpected keyword argument 'proxies'

Solution: Update OpenAI client:
pip install --upgrade openai

Problem: HuggingFace models downloading very slowly

Solution: Check internet connection and verify HF_TOKEN:
export HF_TOKEN="hf_your_token_here"
# Or download models ahead of time
python -c "from transformers import pipeline; pipeline('image-classification', model='dima806/deepfake_vs_real_image_detection')"

API Issues

Problem: Invalid request to /step without /reset

Error: "Environment not initialized. Call /reset first."
Solution: Always call POST /reset before any /step requests

Problem: Invalid label in action

Error: {"detail": "Invalid label: 'unknown_label'"}
Solution: Use only valid labels from the specification

Problem: Confidence out of range

Solution: Ensure confidence is between 0.0 and 1.0

Citation

If you use this environment in your research, please cite:

@software{content_moderation_openenv_2025,
  title={Content Moderation OpenEnv: A Real-World AI Triage Environment},
  author={Anidipta},
  year={2025},
  url={https://github.com/Anidipta/Content-Moderation-env},
  note={OpenEnv Specification Compliant}
}

Acknowledgements

🙏 Built for the OpenEnv Hackathon 2025.

Special Thanks To:

  • OpenEnv community for the specification and framework
  • HuggingFace for model hosting and inference APIs
  • Meta for the Llama-3.1-8B-Instruct model
  • Contributors and testers who improved the environment

Dataset & Content Note: The email and content corpus is entirely synthetic and does not represent any real individuals, companies, organizations, or actual events. All examples are generated for demonstration and testing purposes only.

License: MIT License (see the LICENSE file for details)

Questions? Open an issue on GitHub or contact the maintainers.


Last Updated: April 8, 2026 | OpenEnv Spec Version: 1.0
