---
title: EthicsGuard
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
pinned: false
---

# EthicsGuard

EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.

The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.

## Why This Environment Exists

EthicsGuard models a real workflow humans actually perform:

- review a queue of flagged AI or system-generated content
- prioritize higher-risk cases first
- choose among `approve`, `flag_remove`, `escalate`, and `skip`
- operate under a fixed step budget

This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.
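
As a rough illustration of the decision the agent makes at each step, the sketch below models one triage action as an item plus one of the four action types. The names `ActionType`, `TriageAction`, and `is_valid` are hypothetical and are not taken from the package; treating `item_id` as a string is also an assumption.

```python
from enum import Enum
from typing import NamedTuple


class ActionType(str, Enum):
    """The four triage actions named in this README."""
    APPROVE = "approve"
    FLAG_REMOVE = "flag_remove"
    ESCALATE = "escalate"
    SKIP = "skip"


class TriageAction(NamedTuple):
    """Hypothetical container: which item to process and what to do with it."""
    item_id: str  # assumption: the real environment may use a different id type
    action_type: ActionType


def is_valid(action: TriageAction, remaining_ids: set, steps_remaining: int) -> bool:
    """An action only makes sense if budget remains and the item is still queued."""
    return steps_remaining > 0 and action.item_id in remaining_ids


action = TriageAction("item-3", ActionType("escalate"))
print(is_valid(action, {"item-1", "item-3"}, steps_remaining=5))
```

Keeping item selection and action choice in a single value mirrors the point of the benchmark: the agent is judged on ordering as well as on the per-item decision.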

## Benchmark Summary

| Area | Details |
| --- | --- |
| Domain | AI-output moderation and generic-system moderation |
| Observation | Full remaining queue, step count, steps remaining, compact policy summary |
| Action | `(item_id, action_type)` |
| Actions | `approve`, `flag_remove`, `escalate`, `skip` |
| Episode End | Queue empty or 15 steps reached |
| Policy | Configurable JSON policy mapping with ordered priority tiers |
| Output Score | Normalized final score in `(0.0, 1.0)` |

## Tasks

| Task | Queue Size | Difficulty |
| --- | ---: | --- |
| `easy` | 8 | Clear violations, stronger signals, low hint noise |
| `medium` | 10 | Mixed cues, moderate hint noise, more ambiguous calls |
| `hard` | 12 | Subtle violations, high hint noise, harder routing decisions |

Locked seed registry:

- `easy`: 40 train, 10 eval
- `medium`: 80 train, 20 eval
- `hard`: 240 train, 60 eval

Hint behavior:

- `medium` uses exactly 3 null hints per queue.
- `easy` and `hard` cannot represent exactly 30% null hints per individual episode because queue sizes are 8 and 12.
- The implementation therefore matches the documented 30% ratio exactly across the locked seed registry while keeping each episode deterministic.

## Reward And Grading

Per-step reward includes:

- `+0.3` for a correct action
- `-0.3` for an incorrect action
- `-1.0` for skipping a tier-1 item
- quadratic escalation penalty above a 20% escalation ratio
- `+0.05` diversity bonus once at least 3 action types have been used

Terminal reward adds:

- `2.0 * grader_score`

Dense step reward is clamped to `[-1.0, 1.0]` before the terminal outcome bonus is applied.

Final normalized score is computed as:

- 50% final classification accuracy
- 30% partial-order tier compliance
- 20% efficiency for resolving all tier-1 items within the first 40% of the step budget

To satisfy the benchmark validator, the final published task score is clamped to the open interval `(0, 1)` using a small epsilon. Exact `0.0` and `1.0` are not emitted.
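
The weighting and clamping described in this section can be sketched as follows. This is a minimal illustration, not the code in `ethicsguard/grader.py`: the names `EpisodeOutcome`, `final_score`, and `terminal_reward` are hypothetical, each component is assumed to be already normalized to `[0, 1]`, and the epsilon value `1e-4` is an assumption, since the README only says "a small epsilon".

```python
from dataclasses import dataclass

EPSILON = 1e-4  # assumed value; the README does not state the epsilon


@dataclass(frozen=True)
class EpisodeOutcome:
    """Hypothetical summary of one finished episode, each field in [0, 1]."""
    accuracy: float         # final classification accuracy
    tier_compliance: float  # partial-order tier compliance
    efficiency: float       # tier-1 items resolved early in the step budget


def final_score(outcome: EpisodeOutcome, epsilon: float = EPSILON) -> float:
    """Combine the 50/30/20 weights, then clamp into the open interval (0, 1)."""
    raw = (
        0.5 * outcome.accuracy
        + 0.3 * outcome.tier_compliance
        + 0.2 * outcome.efficiency
    )
    return min(max(raw, epsilon), 1.0 - epsilon)


def terminal_reward(grader_score: float) -> float:
    """Terminal bonus added on top of the dense step rewards."""
    return 2.0 * grader_score


perfect = final_score(EpisodeOutcome(1.0, 1.0, 1.0))
worst = final_score(EpisodeOutcome(0.0, 0.0, 0.0))
print(perfect, worst, terminal_reward(perfect))
```

Under these weights, an episode with perfect accuracy and ordering but zero efficiency would score 0.5 + 0.3 + 0.0 = 0.8, so the efficiency term alone accounts for the last fifth of the score.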

## Baselines

Implemented baselines in [`ethicsguard/baselines.py`](ethicsguard/baselines.py):

- `random`: uniformly samples both queue item and action
- `greedy_by_hint`: picks the highest visible hint and uses fixed thresholds
- `rule_based`: infers category from the synthetic snippet and applies the policy deterministically
- `always_escalate`: audit baseline
- `always_approve`: audit baseline

Run the calibration suite:

```bash
uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"
```

Measured results from the current runtime:

| Difficulty | Agent | Mean | Std | Min | Max |
| --- | --- | ---: | ---: | ---: | ---: |
| Easy | random | 0.2629 | 0.1327 | 0.0158 | 0.6243 |
| Easy | greedy_by_hint | 0.5743 | 0.1804 | 0.1548 | 0.9375 |
| Easy | rule_based | 0.9999 | 0.0000 | 0.9999 | 0.9999 |
| Easy | always_escalate | 0.2882 | 0.1496 | 0.0000 | 0.5821 |
| Easy | always_approve | 0.3307 | 0.1808 | 0.0375 | 0.6537 |
| Medium | random | 0.2947 | 0.1172 | 0.1065 | 0.7667 |
| Medium | greedy_by_hint | 0.4538 | 0.1403 | 0.2224 | 0.9000 |
| Medium | rule_based | 0.9959 | 0.0280 | 0.8000 | 0.9999 |
| Medium | always_escalate | 0.3104 | 0.1453 | 0.0097 | 0.6625 |
| Medium | always_approve | 0.2894 | 0.1205 | 0.1097 | 0.6532 |
| Hard | random | 0.2778 | 0.1113 | 0.0187 | 0.7361 |
| Hard | greedy_by_hint | 0.3905 | 0.1245 | 0.1583 | 0.9328 |
| Hard | rule_based | 0.9886 | 0.0462 | 0.8000 | 0.9999 |
| Hard | always_escalate | 0.2665 | 0.1097 | 0.0805 | 0.6617 |
| Hard | always_approve | 0.2594 | 0.1102 | 0.0477 | 0.6888 |

Audit status:

- `always_escalate` stays below `0.35` mean on all three tasks
- `always_approve` stays below `0.35` mean on all three tasks

## Repository Layout

| Path | Purpose |
| --- | --- |
| [`ethicsguard/generator.py`](ethicsguard/generator.py) | Queue generation, locked seeds, synthetic hints |
| [`ethicsguard/env.py`](ethicsguard/env.py) | Core environment and episode loop |
| [`ethicsguard/reward.py`](ethicsguard/reward.py) | Reward shaping |
| [`ethicsguard/grader.py`](ethicsguard/grader.py) | Outcome-based grading |
| [`ethicsguard/baselines.py`](ethicsguard/baselines.py) | Baselines and audit checks |
| [`server/app.py`](server/app.py) | FastAPI server for local/HF deployment |
| [`inference.py`](inference.py) | Required benchmark inference script |
| [`openenv.yaml`](openenv.yaml) | OpenEnv metadata |

## Quick Start

### 1. Install

```bash
uv sync --extra dev --extra openenv
```

### 2. Run Tests

```bash
uv run pytest
```

### 3. Run Generator Smoke Test

```bash
uv run python -m ethicsguard.generator
```

### 4. Run Inference

Set environment variables first:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`

Then run:

```bash
uv run python inference.py
```

### 5. Validate OpenEnv Compatibility

```bash
uv run openenv validate
```

## API Surface

The deployed server exposes:

- `GET /`
- `GET /health`
- `GET /tasks`
- `POST /reset`
- `POST /step`
- `GET /state`
- `POST /close`

Local API run:

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Local smoke test:

```bash
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
curl http://localhost:7860/tasks
curl http://localhost:7860/state
```

## Docker And Hugging Face Spaces

Build locally:

```bash
docker build -t ethicsguard .
docker run -p 7860:7860 ethicsguard
```

Container behavior:

- installs the package from [`pyproject.toml`](pyproject.toml)
- runs `uvicorn server.app:app`
- exposes port `7860`
- includes a `/health` healthcheck

Live HF Space:

- [`https://godreign-ethicsguard.hf.space/health`](https://godreign-ethicsguard.hf.space/health)
- [`https://godreign-ethicsguard.hf.space/tasks`](https://godreign-ethicsguard.hf.space/tasks)

## Inference Script Contract

[`inference.py`](inference.py):

- is located in the repo root
- uses the OpenAI client
- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables
- emits only the required structured stdout lines:
  - `[START]`
  - `[STEP]`
  - `[END]`

The current script runs one fixed representative eval seed per task:

- `easy`: first eval seed
- `medium`: first eval seed
- `hard`: first eval seed

## Submission Status

The following submission gates have been exercised successfully:

- HF Space deployment is live and responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and produces structured logs
- all local tests pass
- 3 tasks are exposed and scored strictly inside `(0, 1)`

## Comparison

| Feature | EthicsGuard | SOC triage env | Generic classifiers |
| --- | --- | --- | --- |
| Domain | AI outputs + generic systems | Security alerts | Single-item classification |
| Policy | Configurable JSON | Usually hardcoded | N/A |
| Scoring | Outcome + ordering | Often trajectory-based | Per-item accuracy |
| Hints | Noisy and partially hidden | Varies | Usually none |
| Actions | 4-way triage | Often binary | Binary |

## Known Limitations

- All content is synthetic and sanitized.
- Inter-item dependencies are out of scope in v1.
- `context_chain_id` is reserved for future support only.
- The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative.
- If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.