Spaces:
Sleeping
title: EthicsGuard
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
pinned: false
EthicsGuard
EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.
The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.
Why This Environment Exists
EthicsGuard models a real workflow humans actually perform:
- review a queue of flagged AI or system-generated content
- prioritize higher-risk cases first
- choose among
approve,flag_remove,escalate, andskip - operate under a fixed step budget
This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.
Benchmark Summary
| Area | Details |
|---|---|
| Domain | AI-output moderation and generic-system moderation |
| Observation | Full remaining queue, step count, steps remaining, compact policy summary |
| Action | (item_id, action_type) |
| Actions | approve, flag_remove, escalate, skip |
| Episode End | Queue empty or 15 steps reached |
| Policy | Configurable JSON policy mapping with ordered priority tiers |
| Output Score | Normalized final score in (0.0, 1.0) |
Tasks
| Task | Queue Size | Difficulty |
|---|---|---|
easy |
8 | Clear violations, stronger signals, low hint noise |
medium |
10 | Mixed cues, moderate hint noise, more ambiguous calls |
hard |
12 | Subtle violations, high hint noise, harder routing decisions |
Locked seed registry:
easy: 40 train, 10 evalmedium: 80 train, 20 evalhard: 240 train, 60 eval
Hint behavior:
mediumuses exactly 3 null hints per queue.easyandhardcannot represent exactly 30% null hints per individual episode because queue sizes are 8 and 12.- The implementation therefore matches the documented 30% ratio exactly across the locked seed registry while keeping each episode deterministic.
Reward And Grading
Per-step reward includes:
+0.3for a correct action-0.3for an incorrect action-1.0for skipping a tier-1 item- quadratic escalation penalty above a 20% escalation ratio
+0.05diversity bonus once at least 3 action types have been used
Terminal reward adds:
2.0 * grader_score
Dense step reward is clamped to [-1.0, 1.0] before the terminal outcome bonus is applied.
Final normalized score is computed as:
- 50% final classification accuracy
- 30% partial-order tier compliance
- 20% efficiency for resolving all tier-1 items within the first 40% of the step budget
To satisfy the benchmark validator, the final published task score is clamped to the open interval (0, 1) using a small epsilon. Exact 0.0 and 1.0 are not emitted.
Baselines
Implemented baselines in ethicsguard/baselines.py:
random: uniformly samples both queue item and actiongreedy_by_hint: picks the highest visible hint and uses fixed thresholdsrule_based: infers category from the synthetic snippet and applies the policy deterministicallyalways_escalate: audit baselinealways_approve: audit baseline
Run the calibration suite:
uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"
Measured results from the current runtime:
| Difficulty | Agent | Mean | Std | Min | Max |
|---|---|---|---|---|---|
| Easy | random | 0.2629 | 0.1327 | 0.0158 | 0.6243 |
| Easy | greedy_by_hint | 0.5743 | 0.1804 | 0.1548 | 0.9375 |
| Easy | rule_based | 0.9999 | 0.0000 | 0.9999 | 0.9999 |
| Easy | always_escalate | 0.2882 | 0.1496 | 0.0000 | 0.5821 |
| Easy | always_approve | 0.3307 | 0.1808 | 0.0375 | 0.6537 |
| Medium | random | 0.2947 | 0.1172 | 0.1065 | 0.7667 |
| Medium | greedy_by_hint | 0.4538 | 0.1403 | 0.2224 | 0.9000 |
| Medium | rule_based | 0.9959 | 0.0280 | 0.8000 | 0.9999 |
| Medium | always_escalate | 0.3104 | 0.1453 | 0.0097 | 0.6625 |
| Medium | always_approve | 0.2894 | 0.1205 | 0.1097 | 0.6532 |
| Hard | random | 0.2778 | 0.1113 | 0.0187 | 0.7361 |
| Hard | greedy_by_hint | 0.3905 | 0.1245 | 0.1583 | 0.9328 |
| Hard | rule_based | 0.9886 | 0.0462 | 0.8000 | 0.9999 |
| Hard | always_escalate | 0.2665 | 0.1097 | 0.0805 | 0.6617 |
| Hard | always_approve | 0.2594 | 0.1102 | 0.0477 | 0.6888 |
Audit status:
always_escalatestays below0.35mean on all three tasksalways_approvestays below0.35mean on all three tasks
Repository Layout
| Path | Purpose |
|---|---|
ethicsguard/generator.py |
Queue generation, locked seeds, synthetic hints |
ethicsguard/env.py |
Core environment and episode loop |
ethicsguard/reward.py |
Reward shaping |
ethicsguard/grader.py |
Outcome-based grading |
ethicsguard/baselines.py |
Baselines and audit checks |
server/app.py |
FastAPI server for local/HF deployment |
inference.py |
Required benchmark inference script |
openenv.yaml |
OpenEnv metadata |
Quick Start
1. Install
uv sync --extra dev --extra openenv
2. Run Tests
uv run pytest
3. Run Generator Smoke Test
uv run python -m ethicsguard.generator
4. Run Inference
Set environment variables first:
API_BASE_URLMODEL_NAMEHF_TOKEN
Then run:
uv run python inference.py
5. Validate OpenEnv Compatibility
uv run openenv validate
API Surface
The deployed server exposes:
GET /GET /healthGET /tasksPOST /resetPOST /stepGET /statePOST /close
Local API run:
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
Local smoke test:
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
curl http://localhost:7860/tasks
curl http://localhost:7860/state
Docker And Hugging Face Spaces
Build locally:
docker build -t ethicsguard .
docker run -p 7860:7860 ethicsguard
Container behavior:
- installs the package from
pyproject.toml - runs
uvicorn server.app:app - exposes port
7860 - includes a
/healthhealthcheck
Live HF Space:
Inference Script Contract
- is located in the repo root
- uses the OpenAI client
- reads
API_BASE_URL,MODEL_NAME, andHF_TOKENfrom environment variables - emits only the required structured stdout lines:
[START][STEP][END]
The current script runs one fixed representative eval seed per task:
easy: first eval seedmedium: first eval seedhard: first eval seed
Submission Status
The following submission gates have been exercised successfully:
- HF Space deployment is live and responds to
/reset openenv validatepassesdocker buildsucceedsinference.pyruns and produces structured logs- all local tests pass
- 3 tasks are exposed and scored strictly inside
(0, 1)
Comparison
| Feature | EthicsGuard | SOC triage env | Generic classifiers |
|---|---|---|---|
| Domain | AI outputs + generic systems | Security alerts | Single-item classification |
| Policy | Configurable JSON | Usually hardcoded | N/A |
| Scoring | Outcome + ordering | Often trajectory-based | Per-item accuracy |
| Hints | Noisy and partially hidden | Varies | Usually none |
| Actions | 4-way triage | Often binary | Binary |
Known Limitations
- All content is synthetic and sanitized.
- Inter-item dependencies are out of scope in v1.
context_chain_idis reserved for future support only.- The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative.
- If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.