| title: EthicsGuard | |
| emoji: 🛡️ | |
| colorFrom: blue | |
| colorTo: red | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # EthicsGuard | |
| EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take. | |
| The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency. | |
| ## Why This Environment Exists | |
| EthicsGuard models a workflow that human moderators already perform: | |
| - review a queue of flagged AI or system-generated content | |
| - prioritize higher-risk cases first | |
| - choose among `approve`, `flag_remove`, `escalate`, and `skip` | |
| - operate under a fixed step budget | |
| This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy. | |
| ## Benchmark Summary | |
| | Area | Details | | |
| | --- | --- | | |
| | Domain | AI-output moderation and generic-system moderation | | |
| | Observation | Full remaining queue, step count, steps remaining, compact policy summary | | |
| | Action | `(item_id, action_type)` | | |
| | Actions | `approve`, `flag_remove`, `escalate`, `skip` | | |
| | Episode End | Queue empty or 15 steps reached | | |
| | Policy | Configurable JSON policy mapping with ordered priority tiers | | |
| | Output Score | Normalized final score in `(0.0, 1.0)` | | |
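The action space and termination rule from the table can be sketched in a few lines of Python. This is an illustrative sketch, not the environment's own code: the class names, the `item_id` type, and the `episode_done` helper are assumptions. Only the four action names and the 15-step limit come from the table.

```python
from dataclasses import dataclass
from enum import Enum

MAX_STEPS = 15  # step budget from the summary table


class ActionType(str, Enum):
    """The four triage actions listed in the summary table."""
    APPROVE = "approve"
    FLAG_REMOVE = "flag_remove"
    ESCALATE = "escalate"
    SKIP = "skip"


@dataclass(frozen=True)
class Action:
    """An `(item_id, action_type)` pair; the str type of item_id is an assumption."""
    item_id: str
    action_type: ActionType


def episode_done(queue_len: int, step_count: int) -> bool:
    """Episode ends when the queue is empty or the step budget is spent."""
    return queue_len == 0 or step_count >= MAX_STEPS
```

An agent therefore makes two choices per step: which queued item to address and which of the four actions to apply to it.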
| ## Tasks | |
| | Task | Queue Size | Characteristics | | |
| | --- | ---: | --- | | |
| | `easy` | 8 | Clear violations, stronger signals, low hint noise | | |
| | `medium` | 10 | Mixed cues, moderate hint noise, more ambiguous calls | | |
| | `hard` | 12 | Subtle violations, high hint noise, harder routing decisions | | |
| Locked seed registry: | |
| - `easy`: 40 train, 10 eval | |
| - `medium`: 80 train, 20 eval | |
| - `hard`: 240 train, 60 eval | |
| Hint behavior: | |
| - The documented target is 30% null (hidden) hints. | |
| - `medium` uses exactly 3 null hints per queue (30% of 10). | |
| - `easy` and `hard` cannot hit exactly 30% within a single episode, because 30% of their queue sizes (8 and 12) is not a whole number (2.4 and 3.6). | |
| - The implementation therefore matches the 30% ratio exactly in aggregate across the locked seed registry, while keeping each episode deterministic. | |
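The aggregate arithmetic can be checked with a short calculation. The sketch below assumes the ratio is taken over train and eval seeds combined, and that each episode carries either the floor or the floor plus one null hints. That allocation is hypothetical; the actual generator may distribute null hints differently.

```python
# Queue sizes and locked seed counts (train + eval) as documented above.
QUEUE_SIZES = {"easy": 8, "medium": 10, "hard": 12}
SEED_COUNTS = {"easy": 40 + 10, "medium": 80 + 20, "hard": 240 + 60}
NULL_PERCENT = 30


def null_hint_allocation(task: str) -> dict:
    """Split a task's seeds into 'floor' and 'floor + 1' null-hint episodes.

    Hypothetical allocation: the real generator may spread nulls differently.
    """
    seeds = SEED_COUNTS[task]
    total_items = QUEUE_SIZES[task] * seeds
    total_nulls, remainder = divmod(total_items * NULL_PERCENT, 100)
    if remainder:
        raise ValueError("30% of the registry is not a whole number of items")
    low = total_nulls // seeds                 # nulls in a 'floor' episode
    high_episodes = total_nulls - low * seeds  # episodes needing one extra null
    return {
        "total_nulls": total_nulls,
        "nulls_low": low,
        "episodes_low": seeds - high_episodes,
        "episodes_high": high_episodes,
    }


for task in QUEUE_SIZES:
    print(task, null_hint_allocation(task))
```

Under this allocation, `easy` would need 30 episodes with 2 null hints and 20 with 3 (120 of 400 items), while `hard` would need 120 episodes with 3 and 180 with 4 (1080 of 3600 items).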
| ## Reward And Grading | |
| Per-step reward includes: | |
| - `+0.3` for a correct action | |
| - `-0.3` for an incorrect action | |
| - `-1.0` for skipping a tier-1 item | |
| - quadratic escalation penalty above a 20% escalation ratio | |
| - `+0.05` diversity bonus once at least 3 action types have been used | |
| Terminal reward adds: | |
| - `2.0 * grader_score` | |
| Dense step reward is clamped to `[-1.0, 1.0]` before the terminal outcome bonus is applied. | |
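The shaping rules above can be combined roughly as follows. This is a sketch under stated assumptions rather than the code in `ethicsguard/reward.py`: it assumes the components are additive, that the tier-1 skip penalty stacks on top of the incorrect-action penalty, that the diversity bonus applies on every step once three action types have been used, and that the quadratic penalty has the hypothetical form `penalty_coeff * (ratio - 0.20) ** 2`. The function and parameter names are illustrative.

```python
def step_reward(
    correct: bool,
    skipped_tier1: bool,
    escalation_ratio: float,
    action_types_used: int,
    penalty_coeff: float = 1.0,  # hypothetical; the real coefficient is not documented here
) -> float:
    """Dense per-step reward, clamped to [-1.0, 1.0]."""
    reward = 0.3 if correct else -0.3
    if skipped_tier1:
        reward -= 1.0  # assumed to stack with the incorrect-action penalty
    excess = max(0.0, escalation_ratio - 0.20)
    reward -= penalty_coeff * excess ** 2  # quadratic penalty above 20% escalations
    if action_types_used >= 3:
        reward += 0.05  # diversity bonus
    return max(-1.0, min(1.0, reward))


def final_step_reward(clamped_step_reward: float, grader_score: float) -> float:
    """Terminal bonus is added after clamping, so it is not itself clamped."""
    return clamped_step_reward + 2.0 * grader_score
```

Because the clamp is applied before the terminal bonus, the reward on the last step can exceed `1.0` when the grader score is high.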
| Final normalized score is computed as: | |
| - 50% final classification accuracy | |
| - 30% partial-order tier compliance | |
| - 20% efficiency for resolving all tier-1 items within the first 40% of the step budget (6 of the 15 steps) | |
| To satisfy the benchmark validator, the final published task score is clamped to the open interval `(0, 1)` using a small epsilon. Exact `0.0` and `1.0` are not emitted. | |
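The weighted combination and the open-interval clamp can be sketched as below. The epsilon value of `1e-4` is an assumption, and the function name is illustrative. The three inputs are taken as already-computed component scores in `[0, 1]`; how the grader derives each one is not shown here.

```python
EPSILON = 1e-4  # assumed value; the source only says "a small epsilon"


def final_score(
    accuracy: float,
    tier_compliance: float,
    efficiency: float,
    eps: float = EPSILON,
) -> float:
    """Weighted 50/30/20 score, clamped strictly inside (0, 1)."""
    raw = 0.5 * accuracy + 0.3 * tier_compliance + 0.2 * efficiency
    return min(1.0 - eps, max(eps, raw))
```

With this clamp a perfect episode scores `1.0 - eps` rather than `1.0`, and a fully failed episode scores `eps` rather than `0.0`.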
| ## Baselines | |
| Baselines implemented in [`ethicsguard/baselines.py`](ethicsguard/baselines.py): | |
| - `random`: uniformly samples both queue item and action | |
| - `greedy_by_hint`: picks the highest visible hint and uses fixed thresholds | |
| - `rule_based`: infers category from the synthetic snippet and applies the policy deterministically | |
| - `always_escalate`: audit baseline | |
| - `always_approve`: audit baseline | |
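To make the `greedy_by_hint` idea concrete, here is a minimal sketch of a hint-driven policy. It is not the implementation in `ethicsguard/baselines.py`: the item field names, the treatment of the hint as a numeric risk score, both threshold values, and the fallback when every hint is hidden are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple, TypedDict


class QueueItem(TypedDict):
    item_id: str
    hint: Optional[float]  # None models a hidden (null) hint


def greedy_by_hint(
    queue: List[QueueItem],
    remove_at: float = 0.8,    # hypothetical threshold
    escalate_at: float = 0.5,  # hypothetical threshold
) -> Tuple[str, str]:
    """Pick the item with the highest visible hint and map it to an action."""
    visible = [item for item in queue if item["hint"] is not None]
    if not visible:
        # Assumed fallback: no visible hints, so approve the first queued item.
        return queue[0]["item_id"], "approve"
    top = max(visible, key=lambda item: item["hint"])
    if top["hint"] >= remove_at:
        action = "flag_remove"
    elif top["hint"] >= escalate_at:
        action = "escalate"
    else:
        action = "approve"
    return top["item_id"], action
```

A policy of this shape ignores the snippet text entirely, so its score degrades as hint noise grows from `easy` to `hard`.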
| Run the calibration suite: | |
| ```bash | |
| uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())" | |
| ``` | |
| Measured results from the current runtime: | |
| | Difficulty | Agent | Mean | Std | Min | Max | | |
| | --- | --- | ---: | ---: | ---: | ---: | | |
| | Easy | random | 0.2629 | 0.1327 | 0.0158 | 0.6243 | | |
| | Easy | greedy_by_hint | 0.5743 | 0.1804 | 0.1548 | 0.9375 | | |
| | Easy | rule_based | 0.9999 | 0.0000 | 0.9999 | 0.9999 | | |
| | Easy | always_escalate | 0.2882 | 0.1496 | 0.0000 | 0.5821 | | |
| | Easy | always_approve | 0.3307 | 0.1808 | 0.0375 | 0.6537 | | |
| | Medium | random | 0.2947 | 0.1172 | 0.1065 | 0.7667 | | |
| | Medium | greedy_by_hint | 0.4538 | 0.1403 | 0.2224 | 0.9000 | | |
| | Medium | rule_based | 0.9959 | 0.0280 | 0.8000 | 0.9999 | | |
| | Medium | always_escalate | 0.3104 | 0.1453 | 0.0097 | 0.6625 | | |
| | Medium | always_approve | 0.2894 | 0.1205 | 0.1097 | 0.6532 | | |
| | Hard | random | 0.2778 | 0.1113 | 0.0187 | 0.7361 | | |
| | Hard | greedy_by_hint | 0.3905 | 0.1245 | 0.1583 | 0.9328 | | |
| | Hard | rule_based | 0.9886 | 0.0462 | 0.8000 | 0.9999 | | |
| | Hard | always_escalate | 0.2665 | 0.1097 | 0.0805 | 0.6617 | | |
| | Hard | always_approve | 0.2594 | 0.1102 | 0.0477 | 0.6888 | | |
| Audit status: | |
| - `always_escalate` stays below `0.35` mean on all three tasks | |
| - `always_approve` stays below `0.35` mean on all three tasks | |
| ## Repository Layout | |
| | Path | Purpose | | |
| | --- | --- | | |
| | [`ethicsguard/generator.py`](ethicsguard/generator.py) | Queue generation, locked seeds, synthetic hints | | |
| | [`ethicsguard/env.py`](ethicsguard/env.py) | Core environment and episode loop | | |
| | [`ethicsguard/reward.py`](ethicsguard/reward.py) | Reward shaping | | |
| | [`ethicsguard/grader.py`](ethicsguard/grader.py) | Outcome-based grading | | |
| | [`ethicsguard/baselines.py`](ethicsguard/baselines.py) | Baselines and audit checks | | |
| | [`server/app.py`](server/app.py) | FastAPI server for local/HF deployment | | |
| | [`inference.py`](inference.py) | Required benchmark inference script | | |
| | [`openenv.yaml`](openenv.yaml) | OpenEnv metadata | | |
| ## Quick Start | |
| ### 1. Install | |
| ```bash | |
| uv sync --extra dev --extra openenv | |
| ``` | |
| ### 2. Run Tests | |
| ```bash | |
| uv run pytest | |
| ``` | |
| ### 3. Run Generator Smoke Test | |
| ```bash | |
| uv run python -m ethicsguard.generator | |
| ``` | |
| ### 4. Run Inference | |
| Set environment variables first: | |
| - `API_BASE_URL` | |
| - `MODEL_NAME` | |
| - `HF_TOKEN` | |
| Then run: | |
| ```bash | |
| uv run python inference.py | |
| ``` | |
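Before running inference, it can help to confirm that all three variables are set. The helper below is a hypothetical pre-flight check, not part of `inference.py`; the function name and error message are illustrative.

```python
import os

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_inference_config(env=None) -> dict:
    """Return the three required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_inference_config()` with no arguments reads from the process environment; passing a plain dict makes the check easy to test.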
| ### 5. Validate OpenEnv Compatibility | |
| ```bash | |
| uv run openenv validate | |
| ``` | |
| ## API Surface | |
| The deployed server exposes: | |
| - `GET /` | |
| - `GET /health` | |
| - `GET /tasks` | |
| - `POST /reset` | |
| - `POST /step` | |
| - `GET /state` | |
| - `POST /close` | |
| Local API run: | |
| ```bash | |
| uv run uvicorn server.app:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| Local smoke test: | |
| ```bash | |
| curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}" | |
| curl http://localhost:7860/tasks | |
| curl http://localhost:7860/state | |
| ``` | |
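The same smoke test can be run from Python with only the standard library. This sketch mirrors the `curl` calls above and assumes the server is running locally on port 7860 and returns JSON. The helper names are illustrative, and the request body for `/step` is left out because its field names are not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local server started with uvicorn


def build_request(path: str, payload=None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload=None):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage against a running server:
# call("/reset", {"task": "easy", "seed": 2000})
# call("/tasks")
# call("/state")
```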
| ## Docker And Hugging Face Spaces | |
| Build locally: | |
| ```bash | |
| docker build -t ethicsguard . | |
| docker run -p 7860:7860 ethicsguard | |
| ``` | |
| Container behavior: | |
| - installs the package from [`pyproject.toml`](pyproject.toml) | |
| - runs `uvicorn server.app:app` | |
| - exposes port `7860` | |
| - includes a `/health` healthcheck | |
| Live HF Space: | |
| - [`https://godreign-ethicsguard.hf.space/health`](https://godreign-ethicsguard.hf.space/health) | |
| - [`https://godreign-ethicsguard.hf.space/tasks`](https://godreign-ethicsguard.hf.space/tasks) | |
| ## Inference Script Contract | |
| [`inference.py`](inference.py): | |
| - is located in the repo root | |
| - uses the OpenAI client | |
| - reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables | |
| - emits only the required structured stdout lines: `[START]`, `[STEP]`, and `[END]` | |
| The current script runs one fixed representative eval seed per task: the first eval seed for each of `easy`, `medium`, and `hard`. | |
| ## Submission Status | |
| The following submission gates have been exercised successfully: | |
| - HF Space deployment is live and responds to `/reset` | |
| - `openenv validate` passes | |
| - `docker build` succeeds | |
| - `inference.py` runs and produces structured logs | |
| - all local tests pass | |
| - 3 tasks are exposed and scored strictly inside `(0, 1)` | |
| ## Comparison | |
| | Feature | EthicsGuard | SOC triage env | Generic classifiers | | |
| | --- | --- | --- | --- | | |
| | Domain | AI outputs + generic systems | Security alerts | Single-item classification | | |
| | Policy | Configurable JSON | Usually hardcoded | N/A | | |
| | Scoring | Outcome + ordering | Often trajectory-based | Per-item accuracy | | |
| | Hints | Noisy and partially hidden | Varies | Usually none | | |
| | Actions | 4-way triage | Often binary | Binary | | |
| ## Known Limitations | |
| - All content is synthetic and sanitized. | |
| - Inter-item dependencies are out of scope in v1. | |
| - `context_chain_id` is reserved for future support only. | |
| - The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative. | |
| - If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract. | |