---
title: ContentGuardEnv
emoji: 🛡️
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
tags:
  - openenv
  - trust-and-safety
  - meta
  - llama-3
  - moderation-research
pinned: false
---

# ContentGuardEnv

I built ContentGuardEnv for the **Meta x Hugging Face Hackathon 2026** as a practical moderation environment where an AI agent has to do more than just classify text.

Instead of only asking "is this toxic?", the environment asks the model to:

1. Detect the policy category.
2. Choose a proportional enforcement action.
3. Explain the decision in an appeal-style format.

## Live Deployment

- Hugging Face Space: https://mj064-contentguardenv.hf.space
- Hugging Face repo: https://huggingface.co/spaces/mj064/ContentGuardEnv
- GitHub repo: https://github.com/mj064/meta_hack

## What This Project Does

ContentGuardEnv is an OpenEnv-style environment with three difficulty tiers:

- Easy: category detection
- Medium: enforcement action + severity
- Hard: appeal ruling + policy references

It includes:

- A FastAPI backend for reset/step/state APIs
- A WebSocket reasoning stream for live agent traces
- A browser dashboard to run episodes and inspect rewards
- A grading pipeline that returns reward + feedback for each decision
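
As a rough illustration of the "reward + feedback" contract, here is a minimal sketch of what an easy-tier (category detection) grader could look like. The names `GradeResult` and `grade_category`, the 0/1 reward scale, and the example labels are assumptions made for this sketch, not the project's actual grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeResult:
    """Hypothetical container for what a grader hands back per decision."""
    reward: float
    feedback: str


def grade_category(predicted: str, expected: str) -> GradeResult:
    """Score an easy-tier decision: exact category match, ignoring case and whitespace."""
    if predicted.strip().lower() == expected.strip().lower():
        return GradeResult(1.0, f"Correct category: {expected}.")
    # ASSUMPTION: a wrong category earns no partial credit in this sketch.
    return GradeResult(0.0, f"Expected {expected!r}, got {predicted!r}.")


if __name__ == "__main__":
    # "spam" and "harassment" are placeholder labels, not the real policy taxonomy.
    print(grade_category("spam", "harassment"))
```

The medium and hard tiers would need richer scoring (for example, partial credit for a correct action with the wrong severity), which this sketch does not attempt.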

## Why I Built It

The goal was to simulate the type of moderation decisions that are messy in real systems: ambiguous context, policy tradeoffs, and high-cost mistakes.

This project is meant to be usable both as:

- A demo app for human-in-the-loop moderation testing
- A benchmark harness for agent evaluation loops

## Stack

- Python + FastAPI
- Vanilla JS/CSS frontend
- OpenAI- and Hugging Face-compatible inference routing
- Dockerized runtime for Hugging Face Spaces

## Run Locally

1. Install dependencies.

```bash
pip install -r requirements.txt
```

2. Set environment variables (or use a local `.env`).

```bash
# Export so the server process inherits these; omit `export` in a .env file.
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=your_token_here
```

3. Start the app.

```bash
python server/app.py
```

Then open http://localhost:7860 in a browser.
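
If the server fails at startup, a missing variable from step 2 is a likely cause. The sketch below shows one way to validate the three variables up front. `InferenceSettings` and `load_settings` are hypothetical names, and this is not necessarily how `server/app.py` reads its configuration.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical settings holder for the three documented variables."""
    api_base_url: str
    model_name: str
    hf_token: str


def load_settings(env=None) -> InferenceSettings:
    """Read the settings from `env` (defaults to os.environ) and fail fast if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return InferenceSettings(
        api_base_url=env["API_BASE_URL"].rstrip("/"),  # normalize a trailing slash
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
    )
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.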

## API Overview

- POST `/reset`
- POST `/step/{episode_id}`
- GET `/state/{episode_id}`
- GET `/health`
- WS `/ws`
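
A minimal client sketch for the HTTP endpoints above, using only the standard library. The request and response shapes are assumptions: in particular, that `/reset` returns JSON with an `episode_id` field and that `/step/{episode_id}` accepts a JSON object describing the action. Check the server code for the real schemas.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local default from "Run Locally"


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Build (but do not send) a JSON request for one of the endpoints."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def send(req, timeout=10):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_one_step(action, reset_payload=None):
    """Reset, take one step, then read back the episode state."""
    # ASSUMPTION: /reset accepts a JSON body and returns an "episode_id".
    episode = send(build_request("POST", "/reset", reset_payload or {}))
    episode_id = episode["episode_id"]
    # ASSUMPTION: the step body is a JSON object describing the agent's action.
    result = send(build_request("POST", f"/step/{episode_id}", action))
    state = send(build_request("GET", f"/state/{episode_id}"))
    return result, state
```

With the server running, `run_one_step({"category": "spam"})` would exercise the full reset/step/state cycle; the `category` key is a placeholder for whatever the action schema actually requires.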

## Deploy to Hugging Face Space

This repo includes a helper script:

```bash
python sync_repo.py
```

It syncs the project folder to the Space while ignoring local-only artifacts.
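
For readers curious what such a sync involves, here is a sketch built on `huggingface_hub`'s `HfApi.upload_folder`. It is not the contents of `sync_repo.py`: the ignore patterns and the function names `build_upload_kwargs` and `sync` are assumptions, and it presumes you are already authenticated (for example via `huggingface-cli login`).

```python
from pathlib import Path

SPACE_ID = "mj064/ContentGuardEnv"  # Space repo linked under "Live Deployment"

# ASSUMPTION: example patterns only; the real script's ignore list is not shown here.
IGNORE_PATTERNS = [".git/*", ".env", "__pycache__/*", "*.pyc"]


def build_upload_kwargs(folder=".", repo_id=SPACE_ID):
    """Collect the keyword arguments passed to HfApi.upload_folder."""
    return {
        "folder_path": str(Path(folder)),
        "repo_id": repo_id,
        "repo_type": "space",  # upload to a Space rather than a model repo
        "ignore_patterns": list(IGNORE_PATTERNS),
    }


def sync(folder=".", repo_id=SPACE_ID):
    """Upload the folder to the Space in a single commit."""
    # Imported lazily so the helper above works without the dependency installed.
    from huggingface_hub import HfApi  # third-party: pip install huggingface_hub

    return HfApi().upload_folder(**build_upload_kwargs(folder, repo_id))
```

Keeping `.env` in the ignore list matters here, since it would otherwise push your token to a public Space.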

## Project Layout

- `server/app.py`: FastAPI app + WebSocket gateway
- `server/env/`: environment, tasks, graders, data generation
- `server/static/`: dashboard HTML/CSS/JS
- `inference.py`: script for benchmark/evaluation flows
- `sync_repo.py`: one-command Hugging Face Space sync

## Notes

This project is under active iteration during the hackathon, so the UI and evaluation behavior may change as edge cases are discovered.