---
title: Annotation QA Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
---
# Annotation QA Environment

An OpenEnv environment where an AI agent reviews and corrects intentionally flawed ML annotations on synthetic scenes. Built for the Meta OpenEnv × SST Hackathon.
## The Challenge

Real-world ML training data is noisy. Annotation teams make mistakes: bounding boxes drift, class labels get swapped, objects get missed. This environment simulates that review pipeline:
- Agent receives a scene description + current annotations (some are wrong)
- Agent identifies errors by comparing annotations to scene objects
- Agent corrects errors through bbox adjustments, class changes, additions, and removals
- Agent submits and receives a score based on annotation quality improvement
## Tasks (3 Difficulty Levels)

| Task | Difficulty | Errors | Max Steps |
|---|---|---|---|
| `fix_bboxes` | Easy | Bbox expansion, shifting, shrinking, spurious, missing | 15 |
| `fix_classes` | Medium | Bbox errors + class label confusion (car→truck, dog→cat) | 20 |
| `batch_audit` | Hard | Subtle bbox shifts + similar-class confusion + cross-batch issues | 30 |
## Architecture

```
annotation_qa_env/
├── models.py        – Action, Observation, State (Pydantic)
├── client.py        – EnvClient for WebSocket interaction
├── inference.py     – Baseline LLM agent (OpenAI client)
├── server/
│   ├── environment.py – Core game logic (reset, step, state)
│   ├── grader.py      – IoU-based deterministic grading
│   ├── corruption.py  – Annotation corruption strategies
│   ├── app.py         – FastAPI server
│   └── Dockerfile     – Container definition
└── data/
    └── generate_dataset.py – Synthetic scene generator
```
## Quick Start

### Install & Run Locally

```bash
cd annotation_qa_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Use the Client

```python
from annotation_qa_env import AnnotationQAEnv, AnnotationQAAction

with AnnotationQAEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task="fix_bboxes")
    print(result.observation.annotations)

    result = env.step(AnnotationQAAction(
        action_type="adjust_bbox",
        annotation_id=0,
        new_bbox=[0.1, 0.2, 0.15, 0.1],
    ))
    print(f"Reward: {result.reward}")
```

### Docker

```bash
docker build -t annotation-qa-env:latest -f server/Dockerfile .
docker run -d -p 8000:8000 annotation-qa-env:latest
```

### Deploy to HF Spaces

```bash
openenv push --repo-id username/annotation-qa-env
```
## Grading

The grading function is deterministic and returns scores in [0.0, 1.0]:

```
score = (final_quality - initial_quality) / (1.0 - initial_quality)
```

where quality is a weighted composite of:

- Mean IoU (40%) – how well do predicted bboxes overlap with gold?
- Class Accuracy (30%) – are class labels correct?
- Precision (15%) – are there spurious annotations?
- Recall (15%) – are there missing annotations?
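The composite and the normalization can be sketched in a few lines. This is an illustrative sketch using the weights above; the function names are hypothetical and `server/grader.py` is the authoritative implementation:

```python
def iou(a, b):
    """Intersection-over-union of two [x, y, w, h] boxes (normalized coords)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def quality(mean_iou, class_acc, precision, recall):
    """Weighted composite with the 40/30/15/15 weights from the README."""
    return 0.4 * mean_iou + 0.3 * class_acc + 0.15 * precision + 0.15 * recall

def score(initial_quality, final_quality):
    """Normalized improvement in [0, 1]: 1.0 means every error was fixed."""
    if initial_quality >= 1.0:
        return 1.0  # nothing left to fix
    return max(0.0, (final_quality - initial_quality) / (1.0 - initial_quality))
```

Normalizing by `1.0 - initial_quality` makes episodes comparable: fixing half the damage scores 0.5 regardless of how corrupted the scene started out.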
## Actions

| Action | Required Fields | Description |
|---|---|---|
| `adjust_bbox` | `annotation_id`, `new_bbox` | Fix a bounding box |
| `change_class` | `annotation_id`, `new_class` | Fix a class label |
| `add_annotation` | `new_bbox`, `new_class` | Add a missing annotation |
| `remove_annotation` | `annotation_id` | Remove a spurious annotation |
| `submit` | (none) | Finalize corrections |
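The table above can be mirrored as example payloads plus a required-field check. The dict shapes and example values here are assumptions for illustration; `models.py` defines the authoritative Pydantic schema:

```python
# One example payload per action type (values are made up).
ACTIONS = [
    {"action_type": "adjust_bbox", "annotation_id": 3, "new_bbox": [0.1, 0.2, 0.3, 0.25]},
    {"action_type": "change_class", "annotation_id": 1, "new_class": "truck"},
    {"action_type": "add_annotation", "new_bbox": [0.6, 0.5, 0.2, 0.2], "new_class": "dog"},
    {"action_type": "remove_annotation", "annotation_id": 4},
    {"action_type": "submit"},
]

# Required fields per action type, straight from the table above.
REQUIRED = {
    "adjust_bbox": {"annotation_id", "new_bbox"},
    "change_class": {"annotation_id", "new_class"},
    "add_annotation": {"new_bbox", "new_class"},
    "remove_annotation": {"annotation_id"},
    "submit": set(),
}

def is_valid(action):
    """Check that an action dict carries every field its type requires."""
    return REQUIRED[action["action_type"]] <= set(action) - {"action_type"}
```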
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference |
| `HF_TOKEN` | (required) | API key |
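A minimal sketch of reading this configuration with fallbacks to the table's defaults. Variable names and defaults come from the table; how `inference.py` actually wires them into the OpenAI client is an assumption:

```python
import os

# Defaults mirror the table above; HF_TOKEN has no safe default and stays None
# when unset, which should fail fast at the first authenticated request.
API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
HF_TOKEN = os.environ.get("HF_TOKEN")
```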
## Why Synthetic Scenes?

We use programmatic scene descriptions instead of real COCO images because:

- Docker size: COCO train2017 is ~18 GB, far beyond container limits
- Memory: base64 images in observations would spike past 8 GB RAM
- LLM text-only: evaluation uses text-only LLMs (no vision models)
- Determinism: same seed = same data = reproducible scores
- Zero setup: no dataset download; everything is self-contained
The annotation QA task is fundamentally about spatial + categorical reasoning, which text captures fully.
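The determinism point can be illustrated with a seeded generator sketch. The class list, bbox ranges, and function name here are hypothetical; `data/generate_dataset.py` is the authoritative generator:

```python
import random

def generate_scene(seed, n_objects=3):
    """Seeded scene sketch: the same seed always yields the same objects."""
    rng = random.Random(seed)  # local RNG, so global state never leaks in
    classes = ["car", "truck", "dog", "cat", "person"]
    return [
        {
            "class": rng.choice(classes),
            "bbox": [
                round(rng.uniform(0.0, 0.7), 3),   # x
                round(rng.uniform(0.0, 0.7), 3),   # y
                round(rng.uniform(0.05, 0.3), 3),  # width
                round(rng.uniform(0.05, 0.3), 3),  # height
            ],
        }
        for _ in range(n_objects)
    ]
```

Because the RNG is seeded per scene, regenerating a task never requires shipping a dataset, and every grader run sees byte-identical gold annotations.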
## License

BSD-3-Clause (matching OpenEnv)