ethicsguard / README.md
GodreignElgin
Clamp final task scores to open interval
bbcb74d
metadata
title: EthicsGuard
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
pinned: false

EthicsGuard

EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.

The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.

Why This Environment Exists

EthicsGuard models a real workflow humans actually perform:

  • review a queue of flagged AI or system-generated content
  • prioritize higher-risk cases first
  • choose among approve, flag_remove, escalate, and skip
  • operate under a fixed step budget

This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.

Benchmark Summary

Area Details
Domain AI-output moderation and generic-system moderation
Observation Full remaining queue, step count, steps remaining, compact policy summary
Action (item_id, action_type)
Actions approve, flag_remove, escalate, skip
Episode End Queue empty or 15 steps reached
Policy Configurable JSON policy mapping with ordered priority tiers
Output Score Normalized final score in (0.0, 1.0)

Tasks

Task Queue Size Difficulty
easy 8 Clear violations, stronger signals, low hint noise
medium 10 Mixed cues, moderate hint noise, more ambiguous calls
hard 12 Subtle violations, high hint noise, harder routing decisions

Locked seed registry:

  • easy: 40 train, 10 eval
  • medium: 80 train, 20 eval
  • hard: 240 train, 60 eval

Hint behavior:

  • medium uses exactly 3 null hints per queue.
  • easy and hard cannot represent exactly 30% null hints per individual episode because queue sizes are 8 and 12.
  • The implementation therefore matches the documented 30% ratio exactly across the locked seed registry while keeping each episode deterministic.

Reward And Grading

Per-step reward includes:

  • +0.3 for a correct action
  • -0.3 for an incorrect action
  • -1.0 for skipping a tier-1 item
  • quadratic escalation penalty above a 20% escalation ratio
  • +0.05 diversity bonus once at least 3 action types have been used

Terminal reward adds:

  • 2.0 * grader_score

Dense step reward is clamped to [-1.0, 1.0] before the terminal outcome bonus is applied.

Final normalized score is computed as:

  • 50% final classification accuracy
  • 30% partial-order tier compliance
  • 20% efficiency for resolving all tier-1 items within the first 40% of the step budget

To satisfy the benchmark validator, the final published task score is clamped to the open interval (0, 1) using a small epsilon. Exact 0.0 and 1.0 are not emitted.

Baselines

Implemented baselines in ethicsguard/baselines.py:

  • random: uniformly samples both queue item and action
  • greedy_by_hint: picks the highest visible hint and uses fixed thresholds
  • rule_based: infers category from the synthetic snippet and applies the policy deterministically
  • always_escalate: audit baseline
  • always_approve: audit baseline

Run the calibration suite:

uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"

Measured results from the current runtime:

Difficulty Agent Mean Std Min Max
Easy random 0.2629 0.1327 0.0158 0.6243
Easy greedy_by_hint 0.5743 0.1804 0.1548 0.9375
Easy rule_based 0.9999 0.0000 0.9999 0.9999
Easy always_escalate 0.2882 0.1496 0.0000 0.5821
Easy always_approve 0.3307 0.1808 0.0375 0.6537
Medium random 0.2947 0.1172 0.1065 0.7667
Medium greedy_by_hint 0.4538 0.1403 0.2224 0.9000
Medium rule_based 0.9959 0.0280 0.8000 0.9999
Medium always_escalate 0.3104 0.1453 0.0097 0.6625
Medium always_approve 0.2894 0.1205 0.1097 0.6532
Hard random 0.2778 0.1113 0.0187 0.7361
Hard greedy_by_hint 0.3905 0.1245 0.1583 0.9328
Hard rule_based 0.9886 0.0462 0.8000 0.9999
Hard always_escalate 0.2665 0.1097 0.0805 0.6617
Hard always_approve 0.2594 0.1102 0.0477 0.6888

Audit status:

  • always_escalate stays below 0.35 mean on all three tasks
  • always_approve stays below 0.35 mean on all three tasks

Repository Layout

Path Purpose
ethicsguard/generator.py Queue generation, locked seeds, synthetic hints
ethicsguard/env.py Core environment and episode loop
ethicsguard/reward.py Reward shaping
ethicsguard/grader.py Outcome-based grading
ethicsguard/baselines.py Baselines and audit checks
server/app.py FastAPI server for local/HF deployment
inference.py Required benchmark inference script
openenv.yaml OpenEnv metadata

Quick Start

1. Install

uv sync --extra dev --extra openenv

2. Run Tests

uv run pytest

3. Run Generator Smoke Test

uv run python -m ethicsguard.generator

4. Run Inference

Set environment variables first:

  • API_BASE_URL
  • MODEL_NAME
  • HF_TOKEN

Then run:

uv run python inference.py

5. Validate OpenEnv Compatibility

uv run openenv validate

API Surface

The deployed server exposes:

  • GET /
  • GET /health
  • GET /tasks
  • POST /reset
  • POST /step
  • GET /state
  • POST /close

Local API run:

uv run uvicorn server.app:app --host 0.0.0.0 --port 7860

Local smoke test:

curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
curl http://localhost:7860/tasks
curl http://localhost:7860/state

Docker And Hugging Face Spaces

Build locally:

docker build -t ethicsguard .
docker run -p 7860:7860 ethicsguard

Container behavior:

  • installs the package from pyproject.toml
  • runs uvicorn server.app:app
  • exposes port 7860
  • includes a /health healthcheck

Live HF Space:

Inference Script Contract

inference.py:

  • is located in the repo root
  • uses the OpenAI client
  • reads API_BASE_URL, MODEL_NAME, and HF_TOKEN from environment variables
  • emits only the required structured stdout lines:
    • [START]
    • [STEP]
    • [END]

The current script runs one fixed representative eval seed per task:

  • easy: first eval seed
  • medium: first eval seed
  • hard: first eval seed

Submission Status

The following submission gates have been exercised successfully:

  • HF Space deployment is live and responds to /reset
  • openenv validate passes
  • docker build succeeds
  • inference.py runs and produces structured logs
  • all local tests pass
  • 3 tasks are exposed and scored strictly inside (0, 1)

Comparison

Feature EthicsGuard SOC triage env Generic classifiers
Domain AI outputs + generic systems Security alerts Single-item classification
Policy Configurable JSON Usually hardcoded N/A
Scoring Outcome + ordering Often trajectory-based Per-item accuracy
Hints Noisy and partially hidden Varies Usually none
Actions 4-way triage Often binary Binary

Known Limitations

  • All content is synthetic and sanitized.
  • Inter-item dependencies are out of scope in v1.
  • context_chain_id is reserved for future support only.
  • The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative.
  • If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.