---
title: EthicsGuard
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
pinned: false
---

# EthicsGuard

EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.

The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.

## Why This Environment Exists

EthicsGuard models a workflow that human moderation teams already perform:

- review a queue of flagged AI-generated or system-generated content
- prioritize higher-risk cases first
- choose among `approve`, `flag_remove`, `escalate`, and `skip`
- operate under a fixed step budget

This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.

## Benchmark Summary

| Area | Details |
| --- | --- |
| Domain | AI-output moderation and generic-system moderation |
| Observation | Full remaining queue, step count, steps remaining, compact policy summary |
| Action | `(item_id, action_type)` |
| Actions | `approve`, `flag_remove`, `escalate`, `skip` |
| Episode End | Queue empty or 15 steps reached |
| Policy | Configurable JSON policy mapping with ordered priority tiers |
| Output Score | Normalized final score in `(0.0, 1.0)` |
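As a rough illustration of the shapes in the table above, the observation and action could be modeled as below. This is a minimal sketch, not the package's actual types: the class names and the `snippet`, `hint`, and `policy_summary` field names are assumptions chosen for readability.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

MAX_STEPS = 15  # episode step budget from the summary table


class ActionType(str, Enum):
    APPROVE = "approve"
    FLAG_REMOVE = "flag_remove"
    ESCALATE = "escalate"
    SKIP = "skip"


@dataclass(frozen=True)
class QueueItem:
    item_id: str
    snippet: str            # synthetic flagged content (assumed field name)
    hint: Optional[float]   # None stands for a hidden ("null") hint


@dataclass(frozen=True)
class Observation:
    queue: Tuple[QueueItem, ...]   # full remaining queue
    step_count: int
    steps_remaining: int
    policy_summary: str            # compact policy summary (assumed to be text)


@dataclass(frozen=True)
class Action:
    item_id: str
    action_type: ActionType


def is_done(obs: Observation) -> bool:
    """Episode ends when the queue is empty or the step budget is spent."""
    return len(obs.queue) == 0 or obs.step_count >= MAX_STEPS
```

An agent would read an `Observation`, pick one `item_id` from `queue`, and return an `Action` pairing it with one of the four action types.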

## Tasks

| Task | Queue Size | Difficulty |
| --- | ---: | --- |
| `easy` | 8 | Clear violations, stronger signals, low hint noise |
| `medium` | 10 | Mixed cues, moderate hint noise, more ambiguous calls |
| `hard` | 12 | Subtle violations, high hint noise, harder routing decisions |

Locked seed registry:

- `easy`: 40 train, 10 eval
- `medium`: 80 train, 20 eval
- `hard`: 240 train, 60 eval

Hint behavior:

- `medium` uses exactly 3 null hints per queue.
- `easy` and `hard` cannot represent exactly 30% null hints per individual episode because 30% of their queue sizes (8 and 12) is not a whole number (2.4 and 3.6).
- The implementation therefore matches the documented 30% ratio exactly across the locked seed registry while keeping each episode deterministic.
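The arithmetic behind the hint behavior can be checked directly. The sketch below only verifies the counts; it assumes the 30% ratio is taken over the train and eval seeds combined, and it says nothing about how the generator distributes null hints across individual episodes.

```python
from fractions import Fraction

QUEUE_SIZE = {"easy": 8, "medium": 10, "hard": 12}
# Train + eval seeds from the locked seed registry.
SEED_COUNT = {"easy": 40 + 10, "medium": 80 + 20, "hard": 240 + 60}
NULL_RATIO = Fraction(3, 10)


def null_hint_budget(task: str):
    """Return (null hints per episode, null hints across the whole registry)."""
    per_episode = NULL_RATIO * QUEUE_SIZE[task]
    registry_total = NULL_RATIO * QUEUE_SIZE[task] * SEED_COUNT[task]
    return per_episode, registry_total


for task in QUEUE_SIZE:
    per_episode, registry_total = null_hint_budget(task)
    whole = per_episode.denominator == 1
    print(f"{task}: {float(per_episode)} per episode (whole: {whole}), "
          f"{int(registry_total)} across the registry")
```

Only `medium` yields a whole number per episode (3), while all three registry totals are whole (120, 300, and 1080 null hints), which is why the ratio can hold exactly at the registry level.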

## Reward And Grading

Per-step reward includes:

- `+0.3` for a correct action
- `-0.3` for an incorrect action
- `-1.0` for skipping a tier-1 item
- quadratic escalation penalty above a 20% escalation ratio
- `+0.05` diversity bonus once at least 3 action types have been used

Terminal reward adds:

- `2.0 * grader_score`

Dense step reward is clamped to `[-1.0, 1.0]` before the terminal outcome bonus is applied.
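The reward terms above could be combined roughly as follows. This is a sketch under stated assumptions rather than the contents of `ethicsguard/reward.py`: the `penalty_scale` coefficient is hypothetical, the tier-1 skip penalty is assumed to replace the `±0.3` term, and the diversity bonus is assumed to apply on every step once three action types have been used.

```python
def step_reward(
    correct: bool,
    skipped_tier1: bool,
    escalations: int,
    actions_taken: int,
    distinct_action_types: int,
    penalty_scale: float = 1.0,  # hypothetical coefficient, not documented
) -> float:
    """Dense per-step reward, clamped to [-1.0, 1.0]."""
    if skipped_tier1:
        reward = -1.0  # assumed to replace the correctness term
    else:
        reward = 0.3 if correct else -0.3

    # Quadratic penalty on the share of escalations above 20%.
    ratio = escalations / actions_taken if actions_taken else 0.0
    excess = max(0.0, ratio - 0.20)
    reward -= penalty_scale * excess ** 2

    # Diversity bonus once at least 3 action types have been used.
    if distinct_action_types >= 3:
        reward += 0.05

    return max(-1.0, min(1.0, reward))


def last_step_reward(dense_reward: float, grader_score: float) -> float:
    """Add the terminal outcome bonus after the dense reward was clamped."""
    return dense_reward + 2.0 * grader_score
```

Because the clamp happens before the terminal bonus, the reward on the final step can exceed `1.0` even though every dense step reward stays within `[-1.0, 1.0]`.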

Final normalized score is computed as:

- 50% final classification accuracy
- 30% partial-order tier compliance
- 20% efficiency for resolving all tier-1 items within the first 40% of the step budget (the first 6 of 15 steps)

To satisfy the benchmark validator, the final published task score is clamped to the open interval `(0, 1)` using a small epsilon. Exact `0.0` and `1.0` are not emitted.
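The weighting and clamping described above can be written out as a short sketch. The epsilon value and the binary form of the efficiency term are assumptions made for illustration; the actual grader may use a different epsilon or a graded efficiency score.

```python
EPSILON = 1e-4      # assumed value; the text only says "a small epsilon"
STEP_BUDGET = 15

# 40% of the step budget, in integer arithmetic: 6 steps.
TIER1_DEADLINE = STEP_BUDGET * 40 // 100


def efficiency(last_tier1_resolved_step: int) -> float:
    """Assumed binary efficiency: 1.0 if all tier-1 items were resolved in time."""
    return 1.0 if last_tier1_resolved_step <= TIER1_DEADLINE else 0.0


def final_score(accuracy: float, tier_compliance: float, efficiency_score: float) -> float:
    """Weighted sum of the three components, clamped to the open interval (0, 1)."""
    raw = 0.5 * accuracy + 0.3 * tier_compliance + 0.2 * efficiency_score
    return min(1.0 - EPSILON, max(EPSILON, raw))
```

With these assumptions, a perfect episode scores `1.0 - EPSILON` rather than `1.0`, and a fully failed episode scores `EPSILON` rather than `0.0`.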

## Baselines

Implemented baselines in [`ethicsguard/baselines.py`](ethicsguard/baselines.py):

- `random`: uniformly samples both queue item and action
- `greedy_by_hint`: picks the highest visible hint and uses fixed thresholds
- `rule_based`: infers category from the synthetic snippet and applies the policy deterministically
- `always_escalate`: audit baseline
- `always_approve`: audit baseline
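To make the `greedy_by_hint` description concrete, here is a hedged sketch of that kind of policy. The threshold values, the fallback when every hint is hidden, and the `(item_id, hint)` input shape are all hypothetical; the real baseline in `ethicsguard/baselines.py` may differ on each point.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical thresholds, chosen only for illustration.
REMOVE_AT = 0.8
ESCALATE_AT = 0.5


def greedy_by_hint_sketch(
    queue: Sequence[Tuple[str, Optional[float]]],
) -> Tuple[str, str]:
    """Pick the item with the highest visible hint, then map the hint to an action.

    `queue` holds (item_id, hint) pairs; a hint of None is hidden.
    """
    if not queue:
        raise ValueError("queue is empty")

    visible = [(item_id, hint) for item_id, hint in queue if hint is not None]
    if not visible:
        # Assumed fallback: with no visible hint, escalate the first item.
        return queue[0][0], "escalate"

    item_id, hint = max(visible, key=lambda pair: pair[1])
    if hint >= REMOVE_AT:
        return item_id, "flag_remove"
    if hint >= ESCALATE_AT:
        return item_id, "escalate"
    return item_id, "approve"
```

For example, `greedy_by_hint_sketch([("a", 0.2), ("b", None), ("c", 0.9)])` selects `"c"` and returns `flag_remove` under these thresholds.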

Run the calibration suite:

```bash
uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"
```

Measured results from the current runtime:

| Difficulty | Agent | Mean | Std | Min | Max |
| --- | --- | ---: | ---: | ---: | ---: |
| Easy | random | 0.2629 | 0.1327 | 0.0158 | 0.6243 |
| Easy | greedy_by_hint | 0.5743 | 0.1804 | 0.1548 | 0.9375 |
| Easy | rule_based | 0.9999 | 0.0000 | 0.9999 | 0.9999 |
| Easy | always_escalate | 0.2882 | 0.1496 | 0.0000 | 0.5821 |
| Easy | always_approve | 0.3307 | 0.1808 | 0.0375 | 0.6537 |
| Medium | random | 0.2947 | 0.1172 | 0.1065 | 0.7667 |
| Medium | greedy_by_hint | 0.4538 | 0.1403 | 0.2224 | 0.9000 |
| Medium | rule_based | 0.9959 | 0.0280 | 0.8000 | 0.9999 |
| Medium | always_escalate | 0.3104 | 0.1453 | 0.0097 | 0.6625 |
| Medium | always_approve | 0.2894 | 0.1205 | 0.1097 | 0.6532 |
| Hard | random | 0.2778 | 0.1113 | 0.0187 | 0.7361 |
| Hard | greedy_by_hint | 0.3905 | 0.1245 | 0.1583 | 0.9328 |
| Hard | rule_based | 0.9886 | 0.0462 | 0.8000 | 0.9999 |
| Hard | always_escalate | 0.2665 | 0.1097 | 0.0805 | 0.6617 |
| Hard | always_approve | 0.2594 | 0.1102 | 0.0477 | 0.6888 |

Audit status:

- `always_escalate` stays below `0.35` mean on all three tasks
- `always_approve` stays below `0.35` mean on all three tasks

## Repository Layout

| Path | Purpose |
| --- | --- |
| [`ethicsguard/generator.py`](ethicsguard/generator.py) | Queue generation, locked seeds, synthetic hints |
| [`ethicsguard/env.py`](ethicsguard/env.py) | Core environment and episode loop |
| [`ethicsguard/reward.py`](ethicsguard/reward.py) | Reward shaping |
| [`ethicsguard/grader.py`](ethicsguard/grader.py) | Outcome-based grading |
| [`ethicsguard/baselines.py`](ethicsguard/baselines.py) | Baselines and audit checks |
| [`server/app.py`](server/app.py) | FastAPI server for local/HF deployment |
| [`inference.py`](inference.py) | Required benchmark inference script |
| [`openenv.yaml`](openenv.yaml) | OpenEnv metadata |

## Quick Start

### 1. Install

```bash
uv sync --extra dev --extra openenv
```

### 2. Run Tests

```bash
uv run pytest
```

### 3. Run Generator Smoke Test

```bash
uv run python -m ethicsguard.generator
```

### 4. Run Inference

Set environment variables first:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`

Then run:

```bash
uv run python inference.py
```
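Before launching a run, it can help to fail fast when one of the three variables is missing. The helper below is a hypothetical pre-flight check, not part of `inference.py`; only the variable names come from this README.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_inference_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing a plain dictionary instead of `os.environ` makes the check easy to unit-test without touching the real environment.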

### 5. Validate OpenEnv Compatibility

```bash
uv run openenv validate
```

## API Surface

The deployed server exposes:

- `GET /`
- `GET /health`
- `GET /tasks`
- `POST /reset`
- `POST /step`
- `GET /state`
- `POST /close`

Local API run:

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Local smoke test:

```bash
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
curl http://localhost:7860/tasks
curl http://localhost:7860/state
```
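The same smoke test can be scripted in Python with only the standard library. This sketch mirrors the three `curl` calls above and assumes each endpoint returns a JSON body; the function names are illustrative, and nothing is sent until `smoke_test()` is called against a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"


def reset_request(task: str, seed: int, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /reset request with the same JSON body as the curl call."""
    body = json.dumps({"task": task, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request_or_url, timeout: float = 10.0):
    """Send a request (or GET a URL) and decode the response as JSON (assumed format)."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def smoke_test(base_url: str = BASE_URL) -> dict:
    """Reset the easy task with seed 2000, then read /tasks and /state."""
    return {
        "reset": fetch_json(reset_request("easy", 2000, base_url)),
        "tasks": fetch_json(f"{base_url}/tasks"),
        "state": fetch_json(f"{base_url}/state"),
    }
```

With the server running locally, `print(smoke_test())` shows the three responses in one dictionary.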

## Docker And Hugging Face Spaces

Build locally:

```bash
docker build -t ethicsguard .
docker run -p 7860:7860 ethicsguard
```

Container behavior:

- installs the package from [`pyproject.toml`](pyproject.toml)
- runs `uvicorn server.app:app`
- exposes port `7860`
- includes a `/health` healthcheck

Live HF Space:

- [`https://godreign-ethicsguard.hf.space/health`](https://godreign-ethicsguard.hf.space/health)
- [`https://godreign-ethicsguard.hf.space/tasks`](https://godreign-ethicsguard.hf.space/tasks)

## Inference Script Contract

[`inference.py`](inference.py):

- is located in the repo root
- uses the OpenAI client
- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables
- emits only the required structured stdout lines:
  - `[START]`
  - `[STEP]`
  - `[END]`

The current script runs one fixed, representative eval seed per task: the first eval seed for each of `easy`, `medium`, and `hard`.

## Submission Status

The following submission gates have been exercised successfully:

- HF Space deployment is live and responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and produces structured logs
- all local tests pass
- 3 tasks are exposed and scored strictly inside `(0, 1)`

## Comparison

| Feature | EthicsGuard | SOC triage env | Generic classifiers |
| --- | --- | --- | --- |
| Domain | AI outputs + generic systems | Security alerts | Single-item classification |
| Policy | Configurable JSON | Usually hardcoded | N/A |
| Scoring | Outcome + ordering | Often trajectory-based | Per-item accuracy |
| Hints | Noisy and partially hidden | Varies | Usually none |
| Actions | 4-way triage | Often binary | Binary |

## Known Limitations

- All content is synthetic and sanitized.
- Inter-item dependencies are out of scope in v1.
- `context_chain_id` is reserved for future support only.
- The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative.
- If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.