	---
	title: EthicsGuard
	emoji: 🛡️
	colorFrom: blue
	colorTo: red
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# EthicsGuard

	EthicsGuard is an OpenEnv-style benchmark for sequential moderation and policy-enforcement triage. Instead of classifying one item at a time, the agent sees the full remaining queue of synthetic flagged items and must decide both which item to process next and which action to take.

	The environment targets a practical gap: moderation systems are usually optimized for single-item scoring, while real operations depend on prioritization, escalation discipline, and queue-level efficiency.

	## Why This Environment Exists

	EthicsGuard models a workflow that human reviewers actually perform:

	- review a queue of flagged AI or system-generated content
	- prioritize higher-risk cases first
	- choose among `approve`, `flag_remove`, `escalate`, and `skip`
	- operate under a fixed step budget

	This makes the benchmark useful for evaluating agent planning, policy following, and safety-oriented triage behavior rather than only binary classification accuracy.

	## Benchmark Summary

	| Area | Details |
	| --- | --- |
	| Domain | AI-output moderation and generic-system moderation |
	| Observation | Full remaining queue, step count, steps remaining, compact policy summary |
	| Action | `(item_id, action_type)` |
	| Actions | `approve`, `flag_remove`, `escalate`, `skip` |
	| Episode End | Queue empty or 15 steps reached |
	| Policy | Configurable JSON policy mapping with ordered priority tiers |
	| Output Score | Normalized final score in `(0.0, 1.0)` |
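The action and episode-end rows of the summary can be sketched as a small validated type. This is a minimal illustration, not the environment's own code: the `Action` class, the `episode_done` helper, and the choice of `str` for `item_id` are assumptions made here; only the four action names and the 15-step budget come from the table.

```python
from dataclasses import dataclass

# The four action types listed in the summary table.
ACTION_TYPES = ("approve", "flag_remove", "escalate", "skip")
MAX_STEPS = 15  # step budget from the "Episode End" row


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one (item_id, action_type) decision."""

    item_id: str  # assumed to be a string; the real id type is not documented here
    action_type: str

    def __post_init__(self) -> None:
        # Reject anything outside the documented four-way action space.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


def episode_done(queue_len: int, step_count: int) -> bool:
    """An episode ends when the queue is empty or the step budget is spent."""
    return queue_len == 0 or step_count >= MAX_STEPS
```

Validating the action type at construction time keeps malformed agent output from reaching the environment loop.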

	## Tasks

	| Task | Queue Size | Difficulty |
	| --- | ---: | --- |
	| `easy` | 8 | Clear violations, stronger signals, low hint noise |
	| `medium` | 10 | Mixed cues, moderate hint noise, more ambiguous calls |
	| `hard` | 12 | Subtle violations, high hint noise, harder routing decisions |

	Locked seed registry:

	- `easy`: 40 train, 10 eval
	- `medium`: 80 train, 20 eval
	- `hard`: 240 train, 60 eval

	Hint behavior:

	- `medium` uses exactly 3 null hints per queue (30% of 10 items).
	- `easy` and `hard` cannot hit exactly 30% null hints in a single episode because 30% of their queue sizes (8 and 12) is not a whole number (2.4 and 3.6).
	- The implementation therefore varies the per-episode null-hint count so that the documented 30% ratio holds exactly across the locked seed registry, while each episode stays deterministic.
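The arithmetic behind the hint behavior can be checked with a short sketch. It assumes that the ratio is measured over train and eval seeds combined and that per-episode counts differ by at most one; neither detail is stated above, and `null_hint_plan` is a hypothetical helper, not a function from `ethicsguard/generator.py`.

```python
from fractions import Fraction

NULL_RATIO = Fraction(3, 10)
# (queue size, locked seeds = train + eval) per task, from the lists above.
TASKS = {"easy": (8, 40 + 10), "medium": (10, 80 + 20), "hard": (12, 240 + 60)}


def null_hint_plan(queue_size: int, n_seeds: int) -> dict[int, int]:
    """Return {null_hints_per_episode: episode_count} that meets the ratio exactly."""
    target = NULL_RATIO * queue_size * n_seeds
    if target.denominator != 1:
        raise ValueError("ratio cannot be met exactly for this registry size")
    total = int(target)
    low = total // n_seeds          # floor of the per-episode average
    n_high = total - low * n_seeds  # episodes that need one extra null hint
    plan = {low: n_seeds - n_high}
    if n_high:
        plan[low + 1] = n_high
    return plan


for task, (queue_size, n_seeds) in TASKS.items():
    print(task, null_hint_plan(queue_size, n_seeds))
```

Under these assumptions, `easy` needs 120 null hints over 400 items and `hard` needs 1080 over 3600, so both registries can reach 30% exactly even though no single episode can.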

	## Reward And Grading

	Per-step reward includes:

	- `+0.3` for a correct action
	- `-0.3` for an incorrect action
	- `-1.0` for skipping a tier-1 item
	- quadratic escalation penalty above a 20% escalation ratio
	- `+0.05` diversity bonus once at least 3 action types have been used

	Terminal reward adds:

	- `2.0 * grader_score`

	Dense step reward is clamped to `[-1.0, 1.0]` before the terminal outcome bonus is applied.
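The reward terms above can be combined roughly as follows. This is a sketch under stated assumptions, not the logic in `ethicsguard/reward.py`: the escalation coefficient `ESCALATION_COEF` is hypothetical, the tier-1 skip penalty is assumed to stack with the incorrect-action penalty, and the diversity bonus is assumed to apply on every step once three action types have been used.

```python
ESCALATION_COEF = 1.0  # hypothetical; the real coefficient is not documented here


def step_reward(
    correct: bool,
    skipped_tier1: bool,
    escalation_ratio: float,
    distinct_actions_used: int,
) -> float:
    """Dense per-step reward, clamped to [-1.0, 1.0]."""
    reward = 0.3 if correct else -0.3
    if skipped_tier1:
        reward -= 1.0  # assumed to stack with the incorrect-action penalty
    # Quadratic penalty applies only to the excess above a 20% escalation ratio.
    excess = max(0.0, escalation_ratio - 0.20)
    reward -= ESCALATION_COEF * excess**2
    if distinct_actions_used >= 3:
        reward += 0.05  # diversity bonus
    return max(-1.0, min(1.0, reward))


def terminal_step_reward(dense_reward: float, grader_score: float) -> float:
    """Add the terminal outcome bonus after the dense reward has been clamped."""
    return dense_reward + 2.0 * grader_score
```

Because the clamp runs before the terminal bonus, the last step of an episode can exceed `1.0` when the grader score is high.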

	Final normalized score is computed as:

	- 50% final classification accuracy
	- 30% partial-order tier compliance
	- 20% efficiency for resolving all tier-1 items within the first 40% of the step budget

	To satisfy the benchmark validator, the final published task score is clamped to the open interval `(0, 1)` using a small epsilon. Exact `0.0` and `1.0` are not emitted.
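The weighting and clamping can be sketched as below. The epsilon value is an assumption: `1e-4` is chosen only because it is consistent with the `0.9999` ceiling reported for `rule_based` in the baseline results, and `final_score` is a hypothetical name rather than the grader's real entry point. The three inputs are taken as already computed values in `[0, 1]`.

```python
EPSILON = 1e-4  # assumed value, consistent with the 0.9999 ceiling in the baseline results


def final_score(
    accuracy: float,
    tier_compliance: float,
    efficiency: float,
    eps: float = EPSILON,
) -> float:
    """Weighted 50/30/20 blend, clamped into the open interval (0, 1)."""
    raw = 0.5 * accuracy + 0.3 * tier_compliance + 0.2 * efficiency
    return min(1.0 - eps, max(eps, raw))
```

With the 15-step budget, the efficiency window (the first 40% of the budget) covers the first 6 steps.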

	## Baselines

	Implemented baselines in [`ethicsguard/baselines.py`](ethicsguard/baselines.py):

	- `random`: uniformly samples both queue item and action
	- `greedy_by_hint`: picks the highest visible hint and uses fixed thresholds
	- `rule_based`: infers category from the synthetic snippet and applies the policy deterministically
	- `always_escalate`: audit baseline that escalates every item
	- `always_approve`: audit baseline that approves every item

	Run the calibration suite:

	```bash
	uv run python -c "from ethicsguard.baselines import run_all_baselines, audit_thresholds; import pprint; pprint.pp(run_all_baselines(split='all')); pprint.pp(audit_thresholds())"
	```

	Measured results from the current runtime:

	| Difficulty | Agent | Mean | Std | Min | Max |
	| --- | --- | ---: | ---: | ---: | ---: |
	| Easy | random | 0.2629 | 0.1327 | 0.0158 | 0.6243 |
	| Easy | greedy_by_hint | 0.5743 | 0.1804 | 0.1548 | 0.9375 |
	| Easy | rule_based | 0.9999 | 0.0000 | 0.9999 | 0.9999 |
	| Easy | always_escalate | 0.2882 | 0.1496 | 0.0000 | 0.5821 |
	| Easy | always_approve | 0.3307 | 0.1808 | 0.0375 | 0.6537 |
	| Medium | random | 0.2947 | 0.1172 | 0.1065 | 0.7667 |
	| Medium | greedy_by_hint | 0.4538 | 0.1403 | 0.2224 | 0.9000 |
	| Medium | rule_based | 0.9959 | 0.0280 | 0.8000 | 0.9999 |
	| Medium | always_escalate | 0.3104 | 0.1453 | 0.0097 | 0.6625 |
	| Medium | always_approve | 0.2894 | 0.1205 | 0.1097 | 0.6532 |
	| Hard | random | 0.2778 | 0.1113 | 0.0187 | 0.7361 |
	| Hard | greedy_by_hint | 0.3905 | 0.1245 | 0.1583 | 0.9328 |
	| Hard | rule_based | 0.9886 | 0.0462 | 0.8000 | 0.9999 |
	| Hard | always_escalate | 0.2665 | 0.1097 | 0.0805 | 0.6617 |
	| Hard | always_approve | 0.2594 | 0.1102 | 0.0477 | 0.6888 |

	Audit status:

	- `always_escalate` stays below `0.35` mean on all three tasks
	- `always_approve` stays below `0.35` mean on all three tasks
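The audit condition can be re-checked against the published means with a few lines of Python. This is a hypothetical stand-in written for illustration, not the `audit_thresholds()` function from `ethicsguard/baselines.py`; the means are copied from the results table above.

```python
AUDIT_CEILING = 0.35

# Mean scores of the two audit baselines, copied from the results table.
AUDIT_MEANS = {
    ("easy", "always_escalate"): 0.2882,
    ("easy", "always_approve"): 0.3307,
    ("medium", "always_escalate"): 0.3104,
    ("medium", "always_approve"): 0.2894,
    ("hard", "always_escalate"): 0.2665,
    ("hard", "always_approve"): 0.2594,
}


def audit_failures(means: dict, ceiling: float = AUDIT_CEILING) -> list:
    """Return the (task, agent) pairs whose mean reaches or exceeds the ceiling."""
    return sorted(key for key, mean in means.items() if mean >= ceiling)


print(audit_failures(AUDIT_MEANS))
```

An empty list means no degenerate single-action policy reaches the ceiling; the closest is `always_approve` on `easy` at `0.3307`.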

	## Repository Layout

	| Path | Purpose |
	| --- | --- |
	| [`ethicsguard/generator.py`](ethicsguard/generator.py) | Queue generation, locked seeds, synthetic hints |
	| [`ethicsguard/env.py`](ethicsguard/env.py) | Core environment and episode loop |
	| [`ethicsguard/reward.py`](ethicsguard/reward.py) | Reward shaping |
	| [`ethicsguard/grader.py`](ethicsguard/grader.py) | Outcome-based grading |
	| [`ethicsguard/baselines.py`](ethicsguard/baselines.py) | Baselines and audit checks |
	| [`server/app.py`](server/app.py) | FastAPI server for local/HF deployment |
	| [`inference.py`](inference.py) | Required benchmark inference script |
	| [`openenv.yaml`](openenv.yaml) | OpenEnv metadata |

	## Quick Start

	### 1. Install

	```bash
	uv sync --extra dev --extra openenv
	```

	### 2. Run Tests

	```bash
	uv run pytest
	```

	### 3. Run Generator Smoke Test

	```bash
	uv run python -m ethicsguard.generator
	```

	### 4. Run Inference

	Set environment variables first:

	- `API_BASE_URL`
	- `MODEL_NAME`
	- `HF_TOKEN`

	Then run:

	```bash
	uv run python inference.py
	```
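A small preflight check can catch a missing variable before the inference run starts. This is an optional sketch; `missing_env_vars` is a hypothetical helper and is not part of `inference.py`.

```python
import os

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_env_vars(environ=None) -> list:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.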

	### 5. Validate OpenEnv Compatibility

	```bash
	uv run openenv validate
	```

	## API Surface

	The deployed server exposes:

	- `GET /`
	- `GET /health`
	- `GET /tasks`
	- `POST /reset`
	- `POST /step`
	- `GET /state`
	- `POST /close`

	Local API run:

	```bash
	uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
	```

	Local smoke test:

	```bash
	curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"task\":\"easy\",\"seed\":2000}"
	curl http://localhost:7860/tasks
	curl http://localhost:7860/state
	```
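The same smoke test can be driven from Python with only the standard library. This sketch mirrors the `curl` calls above and assumes that each endpoint returns a JSON body, which is not stated in this README; the helper names are hypothetical, and the `/step` payload is left out because its schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # matches the local uvicorn command above


def build_reset_request(task: str, seed: int, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /reset request with the same JSON body as the curl example."""
    payload = json.dumps({"task": task, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_json(request_or_url):
    """Send a request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(request_or_url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def smoke_test(base_url: str = BASE_URL) -> None:
    """Reset the easy task, then read back the task list and current state."""
    print(call_json(build_reset_request("easy", 2000, base_url)))
    print(call_json(f"{base_url}/tasks"))
    print(call_json(f"{base_url}/state"))
```

Call `smoke_test()` once the local server is running; nothing is sent over the network until then.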

	## Docker And Hugging Face Spaces

	Build locally:

	```bash
	docker build -t ethicsguard .
	docker run -p 7860:7860 ethicsguard
	```

	Container behavior:

	- installs the package from [`pyproject.toml`](pyproject.toml)
	- runs `uvicorn server.app:app`
	- exposes port `7860`
	- includes a `/health` healthcheck

	Live HF Space:

	- [`https://godreign-ethicsguard.hf.space/health`](https://godreign-ethicsguard.hf.space/health)
	- [`https://godreign-ethicsguard.hf.space/tasks`](https://godreign-ethicsguard.hf.space/tasks)

	## Inference Script Contract

	[`inference.py`](inference.py):

	- is located in the repo root
	- uses the OpenAI client
	- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables
	- emits only the required structured stdout lines:
	  - `[START]`
	  - `[STEP]`
	  - `[END]`
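The stdout contract can be checked mechanically on captured output. The sketch below inspects only the leading tag of each line, since the payload format after the tag is not specified in this README; `check_stdout` is a hypothetical helper.

```python
from collections import Counter

TAGS = ("[START]", "[STEP]", "[END]")


def check_stdout(lines) -> Counter:
    """Count tagged lines and reject any line that lacks a known tag."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        tag = next((t for t in TAGS if line.startswith(t)), None)
        if tag is None:
            raise ValueError(f"unexpected stdout line: {line!r}")
        counts[tag] += 1
    return counts
```

A run that prints anything other than the three tagged line types raises immediately, which makes stray debug output easy to spot.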

	The current script runs one fixed representative eval seed per task:

	- `easy`: first eval seed
	- `medium`: first eval seed
	- `hard`: first eval seed

	## Submission Status

	The following submission gates have been exercised successfully:

	- HF Space deployment is live and responds to `/reset`
	- `openenv validate` passes
	- `docker build` succeeds
	- `inference.py` runs and produces structured logs
	- all local tests pass
	- 3 tasks are exposed and scored strictly inside `(0, 1)`

	## Comparison

	| Feature | EthicsGuard | SOC triage env | Generic classifiers |
	| --- | --- | --- | --- |
	| Domain | AI outputs + generic systems | Security alerts | Single-item classification |
	| Policy | Configurable JSON | Usually hardcoded | N/A |
	| Scoring | Outcome + ordering | Often trajectory-based | Per-item accuracy |
	| Hints | Noisy and partially hidden | Varies | Usually none |
	| Actions | 4-way triage | Often binary | Binary |

	## Known Limitations

	- All content is synthetic and sanitized.
	- Inter-item dependencies are out of scope in v1.
	- `context_chain_id` is reserved for future support only.
	- The current rule-based baseline is near-perfect because the snippet templates remain strongly class-indicative.
	- If benchmark difficulty needs to increase, the preferred direction is richer synthetic language variation rather than changing the public API or grading contract.