title: Content Moderation OpenEnv
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false

Content Moderation OpenEnv

An AI content moderation environment built to the OpenEnv specification. Agents triage real-world content — spam emails, harmful social media posts, and AI-generated deepfakes — using a standard step() / reset() / state() API.

Environment Description & Motivation

Content moderation is a high-stakes, high-volume real-world task. Human moderators review millions of items daily across platforms and inboxes. This environment simulates a realistic moderation pipeline across three difficulty levels, enabling AI agents to learn decision-making strategies under resource constraints.

Key Challenges:

Multi-label classification with imbalanced datasets
Confidence calibration under uncertainty
Real-world content variability (spam, deepfakes, policy violations)
Escalation vs. immediate action tradeoffs

Task	Difficulty	Content Type	Metrics	Description
`text_spam`	Easy	Email / SMS	Binary classification + confidence	Spam vs. legitimate email filtering with sender reputation signals
`content_moderation`	Medium	Social media posts	Multi-label (up to 11 labels)	Detect policy violations (hate speech, harassment, violence, etc.)
`deepfake_detection`	Hard	Multimodal (image + text)	Detection accuracy + labels	Identify AI-manipulated media using vision transformer + metadata

Models Used:

Deepfake Detection: dima806/deepfake_vs_real_image_detection (ViT, ~300MB)
Language Model: meta-llama/Llama-3.1-8B-Instruct (8B params, via transformers or HF API)
Infrastructure: FastAPI + Uvicorn, Docker containerization

Task Descriptions

Task 1: Text Spam Classification (Easy)

Objective: Binary classification of emails/SMS as spam or legitimate.

Dataset:

50 items (30 spam, 20 legitimate)
Features: text content, sender reputation score, link count, source
Ground truth: decision + labels

Example:

{
  "content_id": "ts_001",
  "content_type": "text",
  "text": "CONGRATULATIONS! You've won $1,000,000! Click here NOW to claim your prize!!!",
  "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
  "ground_truth": {
    "decision": "reject",
    "labels": ["spam", "scam"],
    "is_harmful": true
  }
}

Task 2: Content Moderation (Medium)

Objective: Multi-label classification of social media posts for policy violations.

Dataset:

40 items across diverse platforms
Labels: spam, scam, phishing, hate_speech, violence, harassment, misinformation, adult_content, deepfake, political_manipulation, fraud
Features: post text, engagement metrics, user reputation, report count

Violation Categories:

Category	Definition	Examples
Hate Speech	Dehumanizing content targeting identity	Slurs, discrimination, incitement
Violence	Threats or glorification of violence	Physical harm, weapon promotion
Harassment	Coordinated or severe personal attacks	Doxxing, targeted campaigns
Misinformation	False claims with societal impact	Election fraud claims, health hoaxes

Task 3: Deepfake Detection (Hard)

Objective: Detect AI-manipulated media and classify content appropriately.

Dataset:

30 items (multimodal: images + descriptions)
Deepfake detection model outputs raw confidence scores (0-1)
Features: image description, detector_score, metadata

Detector Score Interpretation:

0.0-0.3: Likely real/authentic
0.3-0.7: Uncertain, may require additional analysis
0.7-1.0: Likely deepfake/manipulated
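
The bands above can be turned into a small helper. This is a minimal sketch, not the environment's own code: the function name and the returned strings are hypothetical, and because the listed ranges share their endpoints, it assumes that a score of exactly 0.3 or 0.7 falls into the higher band.

```python
def interpret_detector_score(score: float) -> str:
    """Map a raw detector score in [0, 1] to one of the three bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"detector_score must be in [0, 1], got {score}")
    if score < 0.3:
        return "likely_real"
    if score < 0.7:
        # Assumption: 0.3 itself is treated as uncertain.
        return "uncertain"
    # Assumption: 0.7 itself is treated as likely manipulated.
    return "likely_deepfake"


print(interpret_detector_score(0.82))
```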

Example:

{
  "content_id": "df_001",
  "content_type": "multimodal",
  "image_description": "Portrait of person in business attire, lighting appears natural",
  "detector_score": 0.82,
  "metadata": {"platform": "social_media", "report_count": 3}
}

Observation Space

Every step returns a ContentObservation with the following structure:

{
  "content_id": "string",
  "content_type": "text | multimodal",
  "text": "string (optional, for text tasks)",
  "image_description": "string (optional, deepfake task only)",
  "detector_score": 0.0-1.0 (optional, deepfake task only),
  "metadata": {
    "source": "email | social_media | platform",
    "sender_reputation": 0.0-1.0,
    "link_count": 0,
    "report_count": 0,
    "timestamp": "ISO8601"
  },
  "step_num": 1,
  "total_steps": 10
}

Field	Type	Tasks	Description
`content_id`	string	All	Unique identifier for the content item
`content_type`	string	All	Type of content: `text` or `multimodal`
`text`	string	text_spam, content_moderation	The actual email/post body
`image_description`	string	deepfake_detection	AI-generated description of the image
`detector_score`	float	deepfake_detection	Raw output from deepfake model (0-1)
`metadata`	object	All	Platform-specific signals (reputation, reports, etc.)
`step_num`	int	All	Current step in episode
`total_steps`	int	All	Total steps in this episode
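
On the client side, an observation can be held in a small typed container. The sketch below uses a standard-library dataclass rather than the server's Pydantic ContentObservation model; the class name ObservationSketch and the choice to ignore unknown keys are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ObservationSketch:
    """Client-side stand-in for ContentObservation (hypothetical)."""
    content_id: str
    content_type: str
    step_num: int
    total_steps: int
    metadata: dict[str, Any] = field(default_factory=dict)
    text: Optional[str] = None               # text tasks only
    image_description: Optional[str] = None  # deepfake task only
    detector_score: Optional[float] = None   # deepfake task only

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "ObservationSketch":
        # Drop keys this sketch does not model so extra fields do not raise.
        known = {k: v for k, v in payload.items() if k in cls.__dataclass_fields__}
        return cls(**known)


obs = ObservationSketch.from_dict({
    "content_id": "df_001",
    "content_type": "multimodal",
    "image_description": "Portrait of person in business attire",
    "detector_score": 0.82,
    "metadata": {"platform": "social_media", "report_count": 3},
    "step_num": 1,
    "total_steps": 10,
})
print(obs.content_id, obs.detector_score)
```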

Action Space

Agents must respond with a ModerationAction object:

{
  "decision": "approve | reject | escalate | flag",
  "reason": "string (max 200 chars, explanation)",
  "confidence": 0.0-1.0,
  "labels": ["spam", "scam", "phishing", "hate_speech", "violence",
             "harassment", "misinformation", "adult_content",
             "deepfake", "political_manipulation", "fraud"]
}

Field	Type	Range	Description
`decision`	enum	{approve, reject, escalate, flag}	approve: Safe content; reject: Clear violation, remove immediately; flag: Borderline, needs review; escalate: Uncertain, route to specialist
`reason`	string	0-200 chars	Brief explanation of the decision
`confidence`	float	[0.0, 1.0]	Agent's confidence in this decision. Calibration is rewarded.
`labels`	array	0-11 labels	Detected violation categories. Each label must come from the allowed set below.

Valid Labels:

spam — Unsolicited bulk messaging
scam — Financial or identity theft attempts
phishing — Social engineering attacks
hate_speech — Dehumanizing content targeting identity groups
violence — Threats, glorification, or instruction of violence
harassment — Personal attacks, doxxing, coordinated abuse
misinformation — False claims with societal impact
adult_content — NSFW or sexually explicit material
deepfake — AI-manipulated media
political_manipulation — Coordinated inauthentic behavior
fraud — Financial scams, false claims
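
Checking an action locally before sending it avoids a round trip that ends in a validation error. The sketch below mirrors the constraints in the table above; the function name validate_action and the wording of its messages are hypothetical, and the server's own Pydantic validation may differ in detail.

```python
VALID_DECISIONS = {"approve", "reject", "escalate", "flag"}
VALID_LABELS = {
    "spam", "scam", "phishing", "hate_speech", "violence", "harassment",
    "misinformation", "adult_content", "deepfake", "political_manipulation",
    "fraud",
}


def validate_action(action: dict) -> list[str]:
    """Return a list of problems; an empty list means the action looks valid."""
    problems = []
    if action.get("decision") not in VALID_DECISIONS:
        problems.append("decision must be one of approve/reject/escalate/flag")
    reason = action.get("reason", "")
    if not isinstance(reason, str) or len(reason) > 200:
        problems.append("reason must be a string of at most 200 characters")
    confidence = action.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")
    unknown = sorted(set(action.get("labels", [])) - VALID_LABELS)
    if unknown:
        problems.append(f"unknown labels: {unknown}")
    return problems


print(validate_action({
    "decision": "reject",
    "reason": "Typical spam patterns and suspicious links",
    "confidence": 0.92,
    "labels": ["spam", "scam"],
}))
```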

Reward Functions

Rewards are computed per task based on decision accuracy, label coverage (F1), and confidence calibration.
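
For reference, a set-based label F1 can be computed as follows. This is a sketch of the standard definition, not a copy of graders.py; in particular, treating two empty label sets as a perfect match is an assumption.

```python
def label_f1(predicted: list[str], expected: list[str]) -> float:
    """F1 between predicted and ground-truth label sets."""
    pred, true = set(predicted), set(expected)
    if not pred and not true:
        return 1.0  # assumption: nothing to find and nothing predicted
    overlap = len(pred & true)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


# One of two true labels found: precision 1.0, recall 0.5.
print(round(label_f1(["spam"], ["spam", "scam"]), 3))
```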

text_spam (Easy)

Component	Reward	Condition
Correct decision	+0.65	`decision` matches ground truth
Escalate on harmful	+0.30	Harmful content + escalate/flag (partial credit)
Label F1 contribution	+0.20	F1 score of predicted vs. true labels
Confidence calibration	±0.10	Bonus if confident on correct, penalty if confident on wrong
Max per step	1.00	Sum of components (capped)
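
One way the components in this table could combine is sketched below. It is an illustration under stated assumptions rather than the grader itself: partial credit is applied only when the decision is not an exact match, the label term is 0.20 times a precomputed F1, and the total is clamped to [0, 1] (the table only states the upper cap).

```python
def text_spam_reward(decision: str, truth_decision: str, is_harmful: bool,
                     label_f1: float, confidence: float) -> float:
    """Combine the text_spam components into a single per-step reward."""
    correct = decision == truth_decision
    reward = 0.0
    if correct:
        reward += 0.65
    elif is_harmful and decision in {"escalate", "flag"}:
        reward += 0.30  # assumption: partial credit replaces the full credit
    reward += 0.20 * label_f1
    reward += 0.10 * (confidence if correct else -confidence)
    # Assumption: clamp at 0 as well as at the documented cap of 1.
    return max(0.0, min(1.0, reward))


print(round(text_spam_reward("reject", "reject", True, 1.0, 0.9), 2))
print(round(text_spam_reward("escalate", "reject", True, 0.5, 0.5), 2))
```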

content_moderation (Medium)

Component	Reward	Condition
Correct decision	+0.50	`decision` matches ground truth
Partial credit	+0.25	Harmful content + flag/escalate (conservative approach)
Label F1 contribution	+0.35	Multi-label F1 score (up to 11 labels)
Confidence calibration	±0.10	Brier score penalty for miscalibration
Max per step	1.00	Sum of components (capped)

deepfake_detection (Hard)

Component	Reward	Condition
Correct decision	+0.40	`decision` matches ground truth
Deepfake detection	+0.30	Accuracy vs. detector_score threshold
Detector alignment	+0.10	Bonus for leveraging model signals
Label F1 contribution	+0.20	Multi-label F1 (fewer labels than medium task)
Confidence calibration	±0.10	Calibration error penalty
Max per step	1.00	Sum of components (capped)

Calibration Bonus Formula:

bonus = 0.1 × (confidence if correct else -confidence)
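
The formula translates directly into code. The second helper is an addition for illustration only: it averages the bonus over an assumed probability of being correct, which shows how the term behaves over many decisions.

```python
def calibration_bonus(confidence: float, correct: bool) -> float:
    """Calibration term exactly as written in the formula."""
    return 0.1 * (confidence if correct else -confidence)


def expected_bonus(confidence: float, p_correct: float) -> float:
    """Average bonus if the decision is right with probability p_correct."""
    return (p_correct * calibration_bonus(confidence, True)
            + (1.0 - p_correct) * calibration_bonus(confidence, False))


print(round(calibration_bonus(0.9, True), 3))
print(round(calibration_bonus(0.9, False), 3))
print(round(expected_bonus(0.8, 0.75), 3))
```

Algebraically the expectation is 0.1 × confidence × (2p − 1), so under the formula as written the average bonus is positive only when the agent is right more often than it is wrong.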

Baseline Scores

Scores reported for Llama-3.1-8B-Instruct with temperature=0.2 and top-p=0.95:

Task	Score	Steps	Notes
`text_spam`	0.72	5	Strong on obvious spam; struggles with phishing disguised as legitimate
`content_moderation`	0.58	8	Good binary decisions; incomplete label coverage (F1 ≈0.52)
`deepfake_detection`	0.44	10	Relies on image descriptions; independent detector signals underutilized

Setup & Usage

Requirements

Python: 3.11 or higher
Docker (optional, for containerized deployment)
GPU (optional, recommended for deepfake models): CUDA 12.1+
Memory: 8GB+ RAM (16GB recommended for local LLM inference)
Disk: 10GB+ (models cached in ~/.cache/huggingface/)

Local Installation

Clone and navigate:

git clone https://github.com/Anidipta/Content-Moderation-env.git
cd Content-Moderation-env

Create virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r server/requirements.txt
```

Start the server:

uvicorn server.main:app --host 0.0.0.0 --port 7860

Server runs at http://localhost:7860

Access API documentation:
- Swagger UI: http://localhost:7860/docs
- ReDoc: http://localhost:7860/redoc

Docker Deployment

Build the Image

# Basic build
docker build -f server/Dockerfile -t content-moderation-env .

# Build with memory allocation (recommended)
docker build --memory=4g -f server/Dockerfile -t content-moderation-env .

# Build with progress output
docker build --progress=plain -f server/Dockerfile -t content-moderation-env .

Run the Container

# Basic run
docker run -p 7860:7860 content-moderation-env

# Run with environment variables
docker run -p 7860:7860 \
  -e API_BASE_URL="https://router.huggingface.co/v1" \
  -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
  -e HF_TOKEN="hf_your_token_here" \
  content-moderation-env

# Run with GPU support
docker run --gpus all -p 7860:7860 content-moderation-env

# Run with volume mounts (cache models locally)
docker run -p 7860:7860 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  content-moderation-env

# Run in background
docker run -d -p 7860:7860 --name moderation-env content-moderation-env

# Check logs
docker logs moderation-env

# Stop container
docker stop moderation-env

Dockerfile Details

The server/Dockerfile uses:

Base Image: python:3.11-slim (~300MB) — minimal footprint with Python runtime
System Dependencies: libgl1 libglib2.0-0 curl — required for vision models and health checks
Dependencies Installation: Multi-stage approach with pip cache optimization
Model Preloading: Deepfake detection model downloaded during build for faster startup
Environment Setup: HuggingFace cache directories and Python settings pre-configured
Entry Point: FastAPI app via Uvicorn on port 7860

# Key optimizations:
- --no-cache-dir: Reduces image size by 50%
- --no-build-isolation: Prevents memory spikes during pip install
- Pre-downloaded models: Eliminates first-run delays
- Minimal dependencies: Only libraries needed for the environment

Deployment to Production

Docker Compose:

version: '3.8'
services:
  moderation-api:
    build:
      context: .
      dockerfile: server/Dockerfile
    ports:
      - "7860:7860"
    environment:
      - API_BASE_URL=https://router.huggingface.co/v1
      - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/app/.cache/huggingface
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Run with: docker-compose up -d

HuggingFace Spaces Deployment

1. Create a new Space with the Docker SDK.
2. Add Secrets (Settings → Repository secrets):
   - HF_TOKEN: Your HuggingFace API token
3. Add Variables (Settings → Repository variables):
   - API_BASE_URL: https://router.huggingface.co/v1
   - MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
4. Push this repository to the Space.
5. The Space URL becomes your PING_URL for validation scripts.

Running the Inference Script

# API mode (HF inference endpoint)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export SERVER_URL="http://localhost:7860"
export TASK_NAME="text_spam"

python inference.py

# Local transformers pipeline mode
export USE_LOCAL_MODEL="true"
python inference.py

Output Format

[START] task=text_spam env=content_moderation_env model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action={"decision":"reject","confidence":0.9,"labels":["spam"]} reward=0.85 done=false error=null
[STEP] step=2 action={"decision":"approve","confidence":0.8,"labels":[]} reward=0.75 done=false error=null
[STEP] step=3 action={"decision":"escalate","confidence":0.5,"labels":["scam"]} reward=0.30 done=false error=null
[STEP] step=4 action={"decision":"reject","confidence":0.85,"labels":["phishing"]} reward=0.70 done=false error=null
[STEP] step=5 action={"decision":"approve","confidence":0.88,"labels":[]} reward=0.75 done=true error=null
[END] success=true steps=5 score=0.720 rewards=0.85,0.75,0.30,0.70,0.75

Field	Type	Description
`task`	string	The task being evaluated
`step`	int	Current step number in episode
`decision`	string	Agent's moderation decision
`confidence`	float	Agent's confidence (0-1)
`labels`	array	Detected violation labels
`reward`	float	Reward received for this step
`done`	boolean	Episode completion flag
`error`	string/null	Error message if applicable
`score`	float	Final episode score
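
Because every [STEP] line follows the same key=value layout, the log is easy to post-process. The parser below is a sketch that assumes the exact layout shown in the sample output (single spaces between fields, a JSON object for action); the names STEP_RE and parse_step_line are hypothetical.

```python
import json
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\{.*\}) "
    r"reward=(-?\d+(?:\.\d+)?) done=(true|false) error=(.*)$"
)


def parse_step_line(line: str):
    """Parse one [STEP] line into a dict, or return None if it does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    step, action, reward, done, error = match.groups()
    return {
        "step": int(step),
        "action": json.loads(action),
        "reward": float(reward),
        "done": done == "true",
        "error": None if error == "null" else error,
    }


sample = ('[STEP] step=1 action={"decision":"reject","confidence":0.9,'
          '"labels":["spam"]} reward=0.85 done=false error=null')
print(parse_step_line(sample))
```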

API Reference

Server Endpoints

Endpoints exchange JSON, and FastAPI validates request bodies automatically.

1. Reset Episode

POST /reset

Start a new moderation episode.

Request Body:

{
  "task": "text_spam"
}

Response (200 OK):

{
  "observation": {
    "content_id": "ts_001",
    "content_type": "text",
    "text": "CONGRATULATIONS! You've won $1,000,000!...",
    "metadata": {"source": "email", "sender_reputation": 0.05, "link_count": 3},
    "step_num": 1,
    "total_steps": 10
  },
  "info": {}
}

Error (400):

{
  "detail": "Unknown task 'invalid_task'. Valid: ['text_spam', 'content_moderation', 'deepfake_detection']"
}

2. Submit Action

POST /step

Submit a moderation action for the current content.

Request Body:

{
  "decision": "reject",
  "reason": "Email contains typical spam patterns and suspicious links",
  "confidence": 0.92,
  "labels": ["spam", "scam"]
}

Response (200 OK):

{
  "observation": {
    "content_id": "ts_002",
    "content_type": "text",
    "text": "Hi Sarah, confirming our meeting tomorrow...",
    "metadata": {"source": "email", "sender_reputation": 0.92, "link_count": 0},
    "step_num": 2,
    "total_steps": 10
  },
  "reward": 0.85,
  "done": false,
  "info": {}
}
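
Putting /reset and /step together, a full episode is a simple loop. The sketch below uses only the request and response shapes shown above; run_episode, http_post, and the policy callback are hypothetical names, and the transport is passed in so the loop can be exercised without a running server.

```python
import json
from urllib import request


def http_post(base_url: str, path: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(post, task: str, policy, max_steps: int = 100) -> list[float]:
    """Reset the environment, then step until done; return per-step rewards."""
    result = post("/reset", {"task": task})
    rewards = []
    for _ in range(max_steps):
        action = policy(result["observation"])
        result = post("/step", action)
        rewards.append(result["reward"])
        if result["done"]:
            break
    return rewards


def always_escalate(observation: dict) -> dict:
    """Trivial placeholder policy used only to show the call shape."""
    return {"decision": "escalate", "reason": "placeholder",
            "confidence": 0.5, "labels": []}
```

With the server running locally, the loop can be started with `run_episode(lambda path, body: http_post("http://localhost:7860", path, body), "text_spam", always_escalate)`.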

3. Get Current State

GET /state

Retrieve the current episode state without taking an action.

Response (200 OK):

{
  "observation": {...},
  "reward": 0.85,
  "done": false,
  "info": {
    "task": "text_spam",
    "items_completed": 2,
    "total_items": 10,
    "cumulative_reward": 1.60
  }
}

4. Close Episode

POST /close

Explicitly close the episode and clean up resources.

Response (200 OK):

{
  "status": "closed",
  "final_reward": 7.20,
  "steps_completed": 10
}

5. List Available Tasks

GET /tasks

Get metadata about all available tasks.

Response (200 OK):

{
  "text_spam": {
    "description": "Classify email/message content as spam or legitimate",
    "difficulty": "easy",
    "num_items": 50,
    "content_type": "text"
  },
  "content_moderation": {
    "description": "Detect policy violations in social media posts",
    "difficulty": "medium",
    "num_items": 40,
    "content_type": "text"
  },
  "deepfake_detection": {
    "description": "Identify AI-manipulated media",
    "difficulty": "hard",
    "num_items": 30,
    "content_type": "multimodal"
  }
}

6. Health Check

GET /health

Check server health and status.

Response (200 OK):

{
  "status": "ok"
}
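
Scripts that start the server and immediately connect can poll this endpoint first. The following is a sketch with hypothetical helper names; the retry count and delay are arbitrary defaults, and the check is injectable so the retry logic can be tested offline.

```python
import json
import time
from urllib import request


def http_health_check(base_url: str = "http://localhost:7860") -> bool:
    """Return True if GET /health answers with {"status": "ok"}."""
    try:
        with request.urlopen(base_url.rstrip("/") + "/health", timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8")).get("status") == "ok"
    except (OSError, ValueError):
        # Connection errors (URLError is an OSError) or a non-JSON body.
        return False


def wait_until_healthy(check=http_health_check, attempts: int = 10,
                       delay: float = 1.0, sleep=time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping between failures."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```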

7. Root Endpoint

GET /

Redirects to interactive Swagger UI documentation.

Project Structure

content-moderation-env/
│
├── README.md                          # This file
├── uv.lock                            # Dependency lock file (UV package manager)
├── inference.py                       # Baseline agent script (235 lines)
│                                      # Demonstrates LLM agent interaction
│                                      # Supports HF API and local inference modes
│
├── server/                            # FastAPI application (core)
│   ├── __init__.py                    # Package marker (empty)
│   │
│   ├── main.py                        # FastAPI app & HTTP endpoints (57 lines)
│   │                                  # Defines: /reset, /step, /state, /close
│   │                                  # /tasks, /health, / endpoints
│   │
│   ├── env.py                         # OpenEnv environment implementation (122 lines)
│   │                                  # Core logic: reset(), step(), state(), close()
│   │                                  # Thread-safe with locks for concurrency
│   │
│   ├── models.py                      # Pydantic data models
│   │                                  # Defines: ContentObservation, ModerationAction
│   │                                  # StepResult, ResetResult, EnvState
│   │
│   ├── tasks.py                       # Task datasets & ground truth (193 lines)
│   │                                  # Contains: text_spam, content_moderation,
│   │                                  # deepfake_detection task definitions & items
│   │
│   ├── graders.py                     # Reward functions per task (95 lines)
│   │                                  # Implements: label F1, calibration bonus,
│   │                                  # decision accuracy scoring logic
│   │
│   ├── deepfake_model.py              # HF deepfake detection pipeline (90 lines)
│   │                                  # Lazy-loads: dima806/deepfake_vs_real...
│   │                                  # Caches model in HF_HOME for reuse
│   │
│   ├── openenv.yaml                   # OpenEnv specification metadata
│   │                                  # Declares task specs, observation/action space
│   │
│   ├── Dockerfile                     # Docker container definition
│   │                                  # Base: python:3.11-slim (~300MB)
│   │                                  # Installs system deps, pip packages,
│   │                                  # pre-downloads deepfake model
│   │
│   └── requirements.txt                # Python dependencies (12 packages)
│                                      # Key: fastapi, uvicorn, transformers,
│                                      # torch, openai, python-dotenv
│
├── test/                              # Test suite
│   └── test.py                        # pytest tests (20+ test cases)
│                                      # Coverage: tasks, endpoints, rewards
│
└── .env                               # Environment variables (git-ignored)
                                       # Stores: HF_TOKEN, API_BASE_URL, etc.

Environment Variables

Configuration is controlled via environment variables. Create a .env file in the project root:

# ============ API Configuration ============
API_BASE_URL=https://router.huggingface.co/v1
# URL of the LLM inference endpoint
# Default: HuggingFace router (requires HF_TOKEN)

MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
# Which LLM to use for agent inference
# Other options: gpt-3.5-turbo, claude-3-opus, mistral-large, etc.

HF_TOKEN=hf_your_token_here
# HuggingFace API token for authenticated requests
# Get from: https://huggingface.co/settings/tokens

# ============ Server Configuration ============
SERVER_URL=http://localhost:7860
# Where the OpenEnv API server runs
# Used by inference.py to connect to environment

# ============ Task & Inference Configuration ============
TASK_NAME=text_spam
# Which task to run: text_spam, content_moderation, deepfake_detection

USE_LOCAL_MODEL=false
# If true: Load Llama-3.1-8B locally via transformers
# If false: Use remote API (requires HF_TOKEN)
# Local mode requires 16GB+ RAM

# ============ HuggingFace Model Caching ============
HF_HOME=/app/.cache/huggingface
# Directory for cached HF models and datasets
# Mounted as volume in Docker for persistence

TRANSFORMERS_CACHE=/app/.cache/huggingface
# Legacy cache variable for the transformers library
# (deprecated upstream in favor of HF_HOME)

# ============ Python Configuration ============
PYTHONDONTWRITEBYTECODE=1
# Don't create __pycache__ directories

PYTHONUNBUFFERED=1
# Stream logs immediately (useful in Docker)

# ============ Logging ============
LOG_LEVEL=INFO
# Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL

Variable Precedence

Environment variables (highest priority)
.env file
Hardcoded defaults in code (lowest priority)

Example override:

export HF_TOKEN="hf_custom_token" && python inference.py
# Uses custom token instead of .env value
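
The precedence order above can be expressed as a lookup over three sources. This is a sketch with hypothetical helper names and a deliberately simple .env parser; it is not taken from inference.py. The python-dotenv package listed in the requirements gives the same ordering for the first two levels when load_dotenv() is called with its default override=False.

```python
import os


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(name: str, dotenv_values: dict, defaults: dict, environ=None):
    """Look up name in the environment, then .env values, then defaults."""
    environ = os.environ if environ is None else environ
    for source in (environ, dotenv_values, defaults):
        if name in source:
            return source[name]
    return None


dotenv_values = parse_dotenv("# comment\nHF_TOKEN=hf_from_file\nTASK_NAME=\"text_spam\"\n")
print(resolve("HF_TOKEN", dotenv_values, {}, environ={"HF_TOKEN": "hf_custom_token"}))
```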

Running Tests

The project includes a comprehensive test suite using pytest.

Setup

pip install pytest pytest-cov

Run All Tests

pytest test/test.py -v

Run Specific Test Class

pytest test/test.py::TestTasks -v

Run with Coverage Report

pytest test/test.py --cov=server --cov-report=html
# Writes the report to htmlcov/index.html; open that file in a browser to view coverage

Test Categories

Test	Coverage	Status
Task loading	All 3 tasks initialize correctly	✓
API endpoints	/reset, /step, /state, /close, /tasks, /health	✓
Reward grading	text_spam, content_moderation, deepfake_detection	✓
Input validation	Action schema validation, label validation	✓
Edge cases	Empty labels, out-of-range confidence, etc.	✓

Troubleshooting

Installation Issues

Problem: ImportError: No module named 'openai'

Solution: pip install "openai>=1.40.0"

Problem: ImportError: No module named 'torch'

Solution: pip install torch torchvision
# For GPU (CUDA 12.1): pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Problem: FileNotFoundError: requirements.txt

Solution: Ensure you're in the project root: cd Content-Moderation-env/
# Then: pip install -r server/requirements.txt

Docker Issues

Problem: Segmentation fault (core dumped) during build

Solution: Allocate more memory to Docker build:
docker build --memory=8g -f server/Dockerfile -t content-moderation-env .

Problem: failed to solve: failed to compute cache key

Solution: Ensure requirements.txt is in server/ directory:
# Current: server/requirements.txt (correct)
# Wrong: ./requirements.txt

Problem: Port 7860 already in use

Solution: Use different port:
docker run -p 8000:7860 content-moderation-env
# Now access at http://localhost:8000

Runtime Issues

Problem: Connection refused: localhost:7860

Solution: Ensure server is running:
uvicorn server.main:app --host 0.0.0.0 --port 7860

In Docker, use: docker logs <container_id>

Problem: Client.__init__() got an unexpected keyword argument 'proxies'

Solution: Update OpenAI client:
pip install --upgrade openai

Problem: HuggingFace models downloading very slowly

Solution: Check internet connection and verify HF_TOKEN:
export HF_TOKEN="hf_your_token_here"
# Or download models ahead of time
python -c "from transformers import pipeline; pipeline('image-classification', model='dima806/deepfake_vs_real_image_detection')"

API Issues

Problem: Invalid request to /step without /reset

Error: "Environment not initialized. Call /reset first."
Solution: Always call POST /reset before any /step requests

Problem: Invalid label in action

Error: {"detail": "Invalid label: 'unknown_label'"}
Solution: Use only valid labels from the specification

Problem: Confidence out of range

Solution: Ensure confidence is between 0.0 and 1.0

Citation

If you use this environment in your research, please cite:

@software{content_moderation_openenv_2025,
  title={Content Moderation OpenEnv: A Real-World AI Triage Environment},
  author={Anidipta},
  year={2025},
  url={https://github.com/Anidipta/Content-Moderation-env},
  note={OpenEnv Specification Compliant}
}

Acknowledgements

🙏 Built for the OpenEnv Hackathon 2025.

Special Thanks To:

OpenEnv community for the specification and framework
HuggingFace for model hosting and inference APIs
Meta for the Llama-3.1-8B-Instruct model
Contributors and testers who improved the environment

Dataset & Content Note: The email and content corpus is entirely synthetic and does not represent any real individuals, companies, organizations, or actual events. All examples are generated for demonstration and testing purposes only.

License: MIT License — See LICENSE file for details

Questions? Open an issue on GitHub or contact the maintainers.

Last Updated: April 8, 2026 | OpenEnv Spec Version: 1.0
