
title: EthicsGuard
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
pinned: false

EthicsGuard

EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.

The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.

Why This Environment Exists

EthicsGuard models a workflow that human moderators perform in practice:

review a queue of flagged AI or system-generated content
prioritize higher-risk cases first
choose among approve, flag_remove, escalate, and skip
operate under a fixed step budget

This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.

Benchmark Summary

Area	Details
Domain	AI-output moderation and generic-system moderation
Observation	Full remaining queue, step count, steps remaining, compact policy summary
Action	`(item_id, action_type)`
Actions	`approve`, `flag_remove`, `escalate`, `skip`
Episode End	Queue empty or 15 steps reached
Policy	Configurable JSON policy mapping with ordered priority tiers
Output Score	Normalized final score in `(0.0, 1.0)`
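
The action shape and the episode-end rule from the table can be pictured with a short sketch. The names `Action`, `is_done`, and `MAX_STEPS`, and the choice of `str` for `item_id`, are illustrative assumptions and may not match the types defined in `ethicsguard/env.py`.

```python
from dataclasses import dataclass

# The four action types listed in the summary table.
VALID_ACTIONS = ("approve", "flag_remove", "escalate", "skip")
MAX_STEPS = 15  # step budget from the "Episode End" row


@dataclass(frozen=True)
class Action:
    """Hypothetical container for the (item_id, action_type) pair."""

    item_id: str  # assumption: the real environment may use another id type
    action_type: str

    def __post_init__(self):
        # Reject anything outside the four documented action types.
        if self.action_type not in VALID_ACTIONS:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


def is_done(queue_len: int, step_count: int) -> bool:
    """The episode ends when the queue is empty or 15 steps are reached."""
    return queue_len == 0 or step_count >= MAX_STEPS
```

Validating the action type at construction time keeps malformed agent output from reaching the reward logic, which is one reasonable design for a benchmark driven by model-generated actions.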

Tasks

Task	Queue Size	Difficulty
`easy`	8	Clear violations, stronger signals, low hint noise
`medium`	10	Mixed cues, moderate hint noise, more ambiguous calls
`hard`	12	Subtle violations, high hint noise, harder routing decisions

Locked seed registry:

easy: 40 train, 10 eval
medium: 80 train, 20 eval
hard: 240 train, 60 eval

Hint behavior:

medium uses exactly 3 null hints per queue (30% of 10 items).
easy and hard cannot represent exactly 30% null hints within a single episode, because 30% of their queue sizes (8 and 12) is 2.4 and 3.6 items, neither of which is a whole number.
The implementation therefore matches the documented 30% ratio exactly across the locked seed registry while keeping each episode deterministic.
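
The arithmetic behind the hint behavior can be checked with a small sketch. It assumes the 30% ratio is taken over the train and eval seeds of a task combined, and that each episode receives either the floor or the ceiling of 30% of its queue size. `null_hint_plan` is a hypothetical helper for illustration, not the allocation scheme used in `ethicsguard/generator.py`.

```python
from fractions import Fraction

QUEUE_SIZE = {"easy": 8, "medium": 10, "hard": 12}
# Train plus eval seeds from the locked seed registry.
SEED_COUNT = {"easy": 40 + 10, "medium": 80 + 20, "hard": 240 + 60}
NULL_RATIO = Fraction(3, 10)


def null_hint_plan(task):
    """Return (ideal nulls per episode, episodes rounding down, episodes rounding up)."""
    size, episodes = QUEUE_SIZE[task], SEED_COUNT[task]
    per_episode = NULL_RATIO * size      # e.g. 12/5 = 2.4 for easy
    total = per_episode * episodes       # nulls needed across the whole registry
    if total.denominator != 1:
        raise ValueError("30% is not reachable exactly across the registry")
    low = per_episode.numerator // per_episode.denominator  # floor
    round_up = int(total) - low * episodes  # episodes that need one extra null
    return per_episode, episodes - round_up, round_up


for task in QUEUE_SIZE:
    print(task, null_hint_plan(task))
```

Under these assumptions, easy needs 120 null hints over 50 episodes (30 episodes with 2 and 20 with 3), medium needs 3 in every episode, and hard needs 1080 over 300 episodes (120 episodes with 3 and 180 with 4).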

Reward And Grading

Per-step reward includes:

+0.3 for a correct action
-0.3 for an incorrect action
-1.0 for skipping a tier-1 item
quadratic escalation penalty above a 20% escalation ratio
+0.05 diversity bonus once at least 3 action types have been used

Terminal reward adds:

2.0 * grader_score

Dense step reward is clamped to [-1.0, 1.0] before the terminal outcome bonus is applied.
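
As a rough sketch of how these terms might fit together: the version below assumes the terms are additive, that skipping a tier-1 item replaces the correctness term, and that the quadratic penalty is `penalty_scale * (ratio - 0.20) ** 2`. The function names and `penalty_scale` are hypothetical; `ethicsguard/reward.py` remains the source of truth.

```python
def step_reward(correct, skipped_tier1, escalation_ratio, action_types_used,
                penalty_scale=1.0):
    """Dense per-step reward under the assumptions stated above."""
    if skipped_tier1:
        reward = -1.0                      # skipping a tier-1 item
    else:
        reward = 0.3 if correct else -0.3  # correct vs. incorrect action
    # Quadratic penalty on the part of the escalation ratio above 20%.
    excess = max(0.0, escalation_ratio - 0.20)
    reward -= penalty_scale * excess ** 2  # penalty_scale is an assumed knob
    # Diversity bonus once at least 3 action types have been used.
    if action_types_used >= 3:
        reward += 0.05
    # Clamp the dense reward before any terminal bonus is added.
    return max(-1.0, min(1.0, reward))


def total_reward(dense_reward, grader_score, terminal):
    """Add the 2.0 * grader_score outcome bonus on the terminal step only."""
    return dense_reward + (2.0 * grader_score if terminal else 0.0)
```

Because the clamp is applied before the terminal bonus, the last step of an episode can return a value outside [-1.0, 1.0].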

Final normalized score is computed as:

50% final classification accuracy
30% partial-order tier compliance
20% efficiency: resolving all tier-1 items within the first 40% of the step budget (the first 6 of 15 steps)

To satisfy the benchmark validator, the final published task score is clamped to the open interval (0, 1) using a small epsilon. Exact 0.0 and 1.0 are not emitted.
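
The weighting and clamping described above can be written as a few lines. This is a minimal sketch: `final_score` is a hypothetical name, and the epsilon of `1e-4` is an assumption chosen to be consistent with the 0.9999 maxima in the baseline table, since the README only specifies "a small epsilon".

```python
EPSILON = 1e-4  # assumed value; not confirmed by the source


def final_score(accuracy, tier_compliance, efficiency, eps=EPSILON):
    """Weighted sum of the three components, clamped into the open interval (0, 1)."""
    raw = 0.5 * accuracy + 0.3 * tier_compliance + 0.2 * efficiency
    # Exact 0.0 and 1.0 are never emitted.
    return min(1.0 - eps, max(eps, raw))
```

A perfect episode therefore reports `1.0 - eps` rather than `1.0`, and a fully failed one reports `eps` rather than `0.0`.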

Baselines

Implemented baselines in ethicsguard/baselines.py:

random: uniformly samples both queue item and action
greedy_by_hint: picks the highest visible hint and uses fixed thresholds
rule_based: infers category from the synthetic snippet and applies the policy deterministically
always_escalate: audit baseline
always_approve: audit baseline

Run the calibration suite:

uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"

Measured results from the current runtime:

Difficulty	Agent	Mean	Std	Min	Max
Easy	random	0.2629	0.1327	0.0158	0.6243
Easy	greedy_by_hint	0.5743	0.1804	0.1548	0.9375
Easy	rule_based	0.9999	0.0000	0.9999	0.9999
Easy	always_escalate	0.2882	0.1496	0.0000	0.5821
Easy	always_approve	0.3307	0.1808	0.0375	0.6537
Medium	random	0.2947	0.1172	0.1065	0.7667
Medium	greedy_by_hint	0.4538	0.1403	0.2224	0.9000
Medium	rule_based	0.9959	0.0280	0.8000	0.9999
Medium	always_escalate	0.3104	0.1453	0.0097	0.6625
Medium	always_approve	0.2894	0.1205	0.1097	0.6532
Hard	random	0.2778	0.1113	0.0187	0.7361
Hard	greedy_by_hint	0.3905	0.1245	0.1583	0.9328
Hard	rule_based	0.9886	0.0462	0.8000	0.9999
Hard	always_escalate	0.2665	0.1097	0.0805	0.6617
Hard	always_approve	0.2594	0.1102	0.0477	0.6888

Audit status:

always_escalate stays below 0.35 mean on all three tasks
always_approve stays below 0.35 mean on all three tasks
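
The audit condition can be re-checked directly against the means reported in the table above. The sketch below hard-codes those values; `audit_passes` is a hypothetical helper and is not the implementation of `audit_thresholds` in `ethicsguard/baselines.py`.

```python
# Mean scores copied from the measured results table.
REPORTED_MEANS = {
    "always_escalate": {"easy": 0.2882, "medium": 0.3104, "hard": 0.2665},
    "always_approve": {"easy": 0.3307, "medium": 0.2894, "hard": 0.2594},
}
AUDIT_CEILING = 0.35


def audit_passes(means, ceiling=AUDIT_CEILING):
    """True when every degenerate baseline stays below the ceiling on every task."""
    return all(
        mean < ceiling
        for per_task in means.values()
        for mean in per_task.values()
    )


print(audit_passes(REPORTED_MEANS))
```

The point of the check is that a policy which ignores the queue entirely should never look competitive; the highest of the six means is 0.3307 (always_approve on easy).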

Repository Layout

Path	Purpose
`ethicsguard/generator.py`	Queue generation, locked seeds, synthetic hints
`ethicsguard/env.py`	Core environment and episode loop
`ethicsguard/reward.py`	Reward shaping
`ethicsguard/grader.py`	Outcome-based grading
`ethicsguard/baselines.py`	Baselines and audit checks
`server/app.py`	FastAPI server for local/HF deployment
`inference.py`	Required benchmark inference script
`openenv.yaml`	OpenEnv metadata

Quick Start

1. Install

uv sync --extra dev --extra openenv

2. Run Tests

uv run pytest

3. Run Generator Smoke Test

uv run python -m ethicsguard.generator

4. Run Inference

Set environment variables first:

API_BASE_URL
MODEL_NAME
HF_TOKEN

Then run:

uv run python inference.py

5. Validate OpenEnv Compatibility

uv run openenv validate

API Surface

The deployed server exposes:

GET /
GET /health
GET /tasks
POST /reset
POST /step
GET /state
POST /close

Local API run:

uv run uvicorn server.app:app --host 0.0.0.0 --port 7860

Local smoke test:

curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
curl http://localhost:7860/tasks
curl http://localhost:7860/state
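
The same smoke test can be driven from Python using only the standard library. The `/reset` body mirrors the curl command above. The `/step` body is an assumption: its field names are guessed from the `(item_id, action_type)` action shape, and `"item-0"` is a placeholder id. The sketch also assumes the server replies with JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payload as the curl smoke test.
reset_request = build_post("/reset", {"task": "easy", "seed": 2000})

# Hypothetical /step payload: field names are not confirmed by the source.
step_request = build_post("/step", {"item_id": "item-0", "action_type": "escalate"})
```

With the server running locally, `send(reset_request)` starts an episode. Building requests separately from sending them keeps the payloads easy to inspect without a live server.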

Docker And Hugging Face Spaces

Build locally:

docker build -t ethicsguard .
docker run -p 7860:7860 ethicsguard

Container behavior:

installs the package from pyproject.toml
runs uvicorn server.app:app
exposes port 7860
includes a /health healthcheck

Live HF Space:

Inference Script Contract

inference.py:

is located in the repo root
uses the OpenAI client
reads API_BASE_URL, MODEL_NAME, and HF_TOKEN from environment variables
emits only the required structured stdout lines:
- [START]
- [STEP]
- [END]

The current script runs one fixed representative eval seed per task:

easy: first eval seed
medium: first eval seed
hard: first eval seed

Submission Status

The following submission gates have been exercised successfully:

HF Space deployment is live and responds to /reset
openenv validate passes
docker build succeeds
inference.py runs and produces structured logs
all local tests pass
3 tasks are exposed and scored strictly inside (0, 1)

Comparison

Feature	EthicsGuard	SOC triage env	Generic classifiers
Domain	AI outputs + generic systems	Security alerts	Single-item classification
Policy	Configurable JSON	Usually hardcoded	N/A
Scoring	Outcome + ordering	Often trajectory-based	Per-item accuracy
Hints	Noisy and partially hidden	Varies	Usually none
Actions	4-way triage	Often binary	Binary

Known Limitations

All content is synthetic and sanitized.
Inter-item dependencies are out of scope in v1.
context_chain_id is reserved for future support only.
The current rule-based baseline is near-perfect (mean scores of 0.9886 to 0.9999 across the three tasks) because the snippet templates remain strongly class-indicative.
If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.