---
title: EthicsGuard
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
pinned: false
---
# EthicsGuard
EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.
The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.
## Why This Environment Exists
EthicsGuard models a workflow that human reviewers actually perform:
- review a queue of flagged AI or system-generated content
- prioritize higher-risk cases first
- choose among `approve`, `flag_remove`, `escalate`, and `skip`
- operate under a fixed step budget
This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.
## Benchmark Summary
| Area | Details |
| --- | --- |
| Domain | AI-output moderation and generic-system moderation |
| Observation | Full remaining queue, step count, steps remaining, compact policy summary |
| Action | `(item_id, action_type)` |
| Actions | `approve`, `flag_remove`, `escalate`, `skip` |
| Episode End | Queue empty or 15 steps reached |
| Policy | Configurable JSON policy mapping with ordered priority tiers |
| Output Score | Normalized final score in `(0.0, 1.0)` |
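As an illustration of the `(item_id, action_type)` action shape in the table above, here is a minimal sketch of how an action could be represented and validated. The `Action` class, the `ActionType` enum, and the string type of `item_id` are assumptions made for this example and may not match the types used in the repository.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(str, Enum):
    """The four triage actions listed in the summary table."""

    APPROVE = "approve"
    FLAG_REMOVE = "flag_remove"
    ESCALATE = "escalate"
    SKIP = "skip"


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one agent decision."""

    item_id: str  # assumption: items are addressed by a string identifier
    action_type: ActionType

    @classmethod
    def parse(cls, item_id: str, action_type: str) -> "Action":
        # ActionType(...) raises ValueError for anything outside the four actions.
        return cls(item_id=item_id, action_type=ActionType(action_type))
```

Rejecting unknown action strings at parse time keeps malformed agent output from being silently treated as a valid move.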
## Tasks
| Task | Queue Size | Difficulty |
| --- | ---: | --- |
| `easy` | 8 | Clear violations, stronger signals, low hint noise |
| `medium` | 10 | Mixed cues, moderate hint noise, more ambiguous calls |
| `hard` | 12 | Subtle violations, high hint noise, harder routing decisions |
Locked seed registry:
- `easy`: 40 train, 10 eval
- `medium`: 80 train, 20 eval
- `hard`: 240 train, 60 eval
Hint behavior:
- `medium` uses exactly 3 null hints per queue, which is 30% of its 10 items.
- `easy` and `hard` cannot represent exactly 30% null hints within a single episode, because 30% of their queue sizes (8 and 12) is 2.4 and 3.6 items.
- The implementation therefore matches the documented 30% ratio exactly across the locked seed registry while keeping each episode deterministic.
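The arithmetic behind the hint behavior can be checked with a short sketch. It uses only the queue sizes and seed counts listed above. Pooling train and eval seeds into one registry total is an assumption of this example, and `null_hint_budget` is a hypothetical helper, not a function from the repository.

```python
# Queue sizes and locked seed counts (train + eval) from the tables above.
QUEUE_SIZE = {"easy": 8, "medium": 10, "hard": 12}
SEED_COUNT = {"easy": 40 + 10, "medium": 80 + 20, "hard": 240 + 60}
NULL_PERCENT = 30


def null_hint_budget(task: str) -> dict:
    """Report whether 30% is a whole number per episode and across the registry."""
    size = QUEUE_SIZE[task]
    total_items = size * SEED_COUNT[task]
    return {
        # Integer arithmetic avoids floating-point rounding in the divisibility checks.
        "exact_per_episode": (size * NULL_PERCENT) % 100 == 0,
        "exact_across_registry": (total_items * NULL_PERCENT) % 100 == 0,
        "total_items": total_items,
        "total_null_hints": total_items * NULL_PERCENT // 100,
    }


for task in QUEUE_SIZE:
    print(task, null_hint_budget(task))
```

Under these assumptions, the registry-wide totals are whole numbers for all three tasks even though the per-episode counts for `easy` and `hard` are not.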
## Reward And Grading
Per-step reward includes:
- `+0.3` for a correct action
- `-0.3` for an incorrect action
- `-1.0` for skipping a tier-1 item
- quadratic escalation penalty above a 20% escalation ratio
- `+0.05` diversity bonus once at least 3 action types have been used
Terminal reward adds:
- `2.0 * grader_score`
Dense step reward is clamped to `[-1.0, 1.0]` before the terminal outcome bonus is applied.
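The reward terms above can be combined as in the following minimal sketch. Several details are assumptions, not taken from the repository: the penalty coefficient `escalation_coeff`, the definition of `escalation_ratio` as a fraction between 0 and 1, the tier-1 skip penalty being added on top of the incorrect-action penalty, and the diversity bonus being applied on every step once three action types have been used. See `ethicsguard/reward.py` for the actual shaping.

```python
def step_reward(
    correct: bool,
    skipped_tier1: bool,
    escalation_ratio: float,
    distinct_actions_used: int,
    escalation_coeff: float = 1.0,  # assumption: coefficient is not documented
) -> float:
    """Illustrative dense per-step reward, clamped to [-1.0, 1.0]."""
    reward = 0.3 if correct else -0.3
    if skipped_tier1:
        reward -= 1.0  # assumption: stacks with the incorrect-action penalty
    # Quadratic penalty only on the part of the ratio above 20%.
    excess = max(0.0, escalation_ratio - 0.20)
    reward -= escalation_coeff * excess**2
    if distinct_actions_used >= 3:
        reward += 0.05
    return max(-1.0, min(1.0, reward))


def terminal_bonus(grader_score: float) -> float:
    """Terminal outcome bonus, added after the dense reward has been clamped."""
    return 2.0 * grader_score
```

Clamping the dense term before adding the terminal bonus keeps any single step from outweighing the outcome-based grade.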
Final normalized score is computed as:
- 50% final classification accuracy
- 30% partial-order tier compliance
- 20% efficiency for resolving all tier-1 items within the first 40% of the step budget
To satisfy the benchmark validator, the final published task score is clamped to the open interval `(0, 1)` using a small epsilon. Exact `0.0` and `1.0` are not emitted.
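The weighting and clamping described above can be sketched as follows. The epsilon value of `1e-4` is an assumption, chosen only because the highest scores reported in this README read `0.9999`. The function name and the three component inputs are illustrative, and each component is assumed to already lie in `[0, 1]`.

```python
EPSILON = 1e-4  # assumption: the actual epsilon lives in the grader code


def final_score(
    accuracy: float,
    tier_compliance: float,
    efficiency: float,
    eps: float = EPSILON,
) -> float:
    """Weighted final score, clamped so exact 0.0 and 1.0 are never returned."""
    raw = 0.5 * accuracy + 0.3 * tier_compliance + 0.2 * efficiency
    return min(1.0 - eps, max(eps, raw))
```

For example, `final_score(0.8, 0.5, 1.0)` gives `0.75` up to floating-point rounding. With the 15-step budget, the first 40% used by the efficiency term corresponds to the first 6 steps.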
## Baselines
Implemented baselines in [`ethicsguard/baselines.py`](ethicsguard/baselines.py):
- `random`: uniformly samples both queue item and action
- `greedy_by_hint`: picks the highest visible hint and uses fixed thresholds
- `rule_based`: infers category from the synthetic snippet and applies the policy deterministically
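- `always_escalate`: audit baseline that chooses `escalate` for every item it processes
- `always_approve`: audit baseline that chooses `approve` for every item it processes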
Run the calibration suite:
```bash
uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"
```
Measured results from the current runtime:
| Difficulty | Agent | Mean | Std | Min | Max |
| --- | --- | ---: | ---: | ---: | ---: |
| Easy | random | 0.2629 | 0.1327 | 0.0158 | 0.6243 |
| Easy | greedy_by_hint | 0.5743 | 0.1804 | 0.1548 | 0.9375 |
| Easy | rule_based | 0.9999 | 0.0000 | 0.9999 | 0.9999 |
| Easy | always_escalate | 0.2882 | 0.1496 | 0.0000 | 0.5821 |
| Easy | always_approve | 0.3307 | 0.1808 | 0.0375 | 0.6537 |
| Medium | random | 0.2947 | 0.1172 | 0.1065 | 0.7667 |
| Medium | greedy_by_hint | 0.4538 | 0.1403 | 0.2224 | 0.9000 |
| Medium | rule_based | 0.9959 | 0.0280 | 0.8000 | 0.9999 |
| Medium | always_escalate | 0.3104 | 0.1453 | 0.0097 | 0.6625 |
| Medium | always_approve | 0.2894 | 0.1205 | 0.1097 | 0.6532 |
| Hard | random | 0.2778 | 0.1113 | 0.0187 | 0.7361 |
| Hard | greedy_by_hint | 0.3905 | 0.1245 | 0.1583 | 0.9328 |
| Hard | rule_based | 0.9886 | 0.0462 | 0.8000 | 0.9999 |
| Hard | always_escalate | 0.2665 | 0.1097 | 0.0805 | 0.6617 |
| Hard | always_approve | 0.2594 | 0.1102 | 0.0477 | 0.6888 |
Audit status:
- `always_escalate` stays below `0.35` mean on all three tasks
- `always_approve` stays below `0.35` mean on all three tasks
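The audit condition can be expressed as a small check over the mean scores. This is a sketch only: `audit_failures` is a hypothetical helper, not the repository's `audit_thresholds()`, and the means are copied from the results table above.

```python
AUDIT_THRESHOLD = 0.35

# Mean scores copied from the results table above.
AUDIT_MEANS = {
    "always_escalate": {"easy": 0.2882, "medium": 0.3104, "hard": 0.2665},
    "always_approve": {"easy": 0.3307, "medium": 0.2894, "hard": 0.2594},
}


def audit_failures(means: dict, threshold: float = AUDIT_THRESHOLD) -> list:
    """Return (agent, task, mean) triples whose mean is not below the threshold."""
    return [
        (agent, task, mean)
        for agent, by_task in means.items()
        for task, mean in by_task.items()
        if mean >= threshold
    ]


# An empty list means every degenerate baseline stays under the threshold.
print(audit_failures(AUDIT_MEANS))
```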
## Repository Layout
| Path | Purpose |
| --- | --- |
| [`ethicsguard/generator.py`](ethicsguard/generator.py) | Queue generation, locked seeds, synthetic hints |
| [`ethicsguard/env.py`](ethicsguard/env.py) | Core environment and episode loop |
| [`ethicsguard/reward.py`](ethicsguard/reward.py) | Reward shaping |
| [`ethicsguard/grader.py`](ethicsguard/grader.py) | Outcome-based grading |
| [`ethicsguard/baselines.py`](ethicsguard/baselines.py) | Baselines and audit checks |
| [`server/app.py`](server/app.py) | FastAPI server for local/HF deployment |
| [`inference.py`](inference.py) | Required benchmark inference script |
| [`openenv.yaml`](openenv.yaml) | OpenEnv metadata |
## Quick Start
### 1. Install
```bash
uv sync --extra dev --extra openenv
```
### 2. Run Tests
```bash
uv run pytest
```
### 3. Run Generator Smoke Test
```bash
uv run python -m ethicsguard.generator
```
### 4. Run Inference
Set environment variables first:
- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
Then run:
```bash
uv run python inference.py
```
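Before launching a run, it can help to fail fast when one of the three variables is missing. The sketch below is an illustrative pre-flight check, not code from `inference.py`. `InferenceConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical holder for the three settings named above."""

    api_base_url: str
    model_name: str
    hf_token: str


def load_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Read the required variables, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
    )
```

Accepting an optional mapping makes the check easy to test without touching the real process environment.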
### 5. Validate OpenEnv Compatibility
```bash
uv run openenv validate
```
## API Surface
The deployed server exposes:
- `GET /`
- `GET /health`
- `GET /tasks`
- `POST /reset`
- `POST /step`
- `GET /state`
- `POST /close`
Local API run:
```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
```
Local smoke test:
```bash
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
curl http://localhost:7860/tasks
curl http://localhost:7860/state
```
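The same reset call can be made from Python using only the standard library. This is a minimal sketch that mirrors the `curl` request above. The helper names are hypothetical, and the assumption that the server answers with a JSON body is not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # matches the local uvicorn command above


def build_reset_request(
    task: str, seed: int, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /reset request with the same JSON body as the curl example."""
    body = json.dumps({"task": task, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0):
    """Send a request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the local server running, `send(build_reset_request("easy", 2000))` issues the same reset as the first `curl` line.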
## Docker And Hugging Face Spaces
Build locally:
```bash
docker build -t ethicsguard .
docker run -p 7860:7860 ethicsguard
```
Container behavior:
- installs the package from [`pyproject.toml`](pyproject.toml)
- runs `uvicorn server.app:app`
- exposes port `7860`
- includes a `/health` healthcheck
Live HF Space:
- [`https://godreign-ethicsguard.hf.space/health`](https://godreign-ethicsguard.hf.space/health)
- [`https://godreign-ethicsguard.hf.space/tasks`](https://godreign-ethicsguard.hf.space/tasks)
## Inference Script Contract
[`inference.py`](inference.py):
- is located in the repo root
- uses the OpenAI client
- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables
- emits only the required structured stdout lines:
  - `[START]`
  - `[STEP]`
  - `[END]`
The current script runs one fixed representative eval seed per task: the first eval seed for each of `easy`, `medium`, and `hard`.
## Submission Status
The following submission gates have been exercised successfully:
- HF Space deployment is live and responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and produces structured logs
- all local tests pass
- 3 tasks are exposed and scored strictly inside `(0, 1)`
## Comparison
| Feature | EthicsGuard | SOC triage env | Generic classifiers |
| --- | --- | --- | --- |
| Domain | AI outputs + generic systems | Security alerts | Single-item classification |
| Policy | Configurable JSON | Usually hardcoded | N/A |
| Scoring | Outcome + ordering | Often trajectory-based | Per-item accuracy |
| Hints | Noisy and partially hidden | Varies | Usually none |
| Actions | 4-way triage | Often binary | Binary |
## Known Limitations
- All content is synthetic and sanitized.
- Inter-item dependencies are out of scope in v1.
- `context_chain_id` is reserved for future support only.
- The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative.
- If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.