--- title: Annotation QA Env emoji: πŸ” colorFrom: blue colorTo: indigo sdk: docker app_port: 8000 --- # πŸ” Annotation QA Environment An **OpenEnv** environment where an AI agent reviews and corrects intentionally-flawed ML annotations on synthetic scenes. Built for the [Meta OpenEnv Γ— SST Hackathon](https://github.com/meta-pytorch/OpenEnv). ## 🎯 The Challenge Real-world ML training data is noisy. Annotation teams make mistakes β€” bounding boxes drift, class labels get swapped, objects get missed. This environment simulates that review pipeline: 1. **Agent receives** a scene description + current annotations (some are wrong) 2. **Agent identifies** errors by comparing annotations to scene objects 3. **Agent corrects** errors through bbox adjustments, class changes, additions, and removals 4. **Agent submits** and receives a score based on annotation quality improvement ## πŸ“‹ Tasks (3 Difficulty Levels) | Task | Difficulty | Errors | Max Steps | |------|-----------|--------|-----------| | `fix_bboxes` | Easy | Bbox expansion, shifting, shrinking, spurious, missing | 15 | | `fix_classes` | Medium | Bbox errors + class label confusion (car↔truck, dog↔cat) | 20 | | `batch_audit` | Hard | Subtle bbox shifts + similar-class confusion + cross-batch issues | 30 | ## πŸ—οΈ Architecture ``` annotation_qa_env/ β”œβ”€β”€ models.py ← Action, Observation, State (Pydantic) β”œβ”€β”€ client.py ← EnvClient for WebSocket interaction β”œβ”€β”€ inference.py ← Baseline LLM agent (OpenAI client) β”œβ”€β”€ server/ β”‚ β”œβ”€β”€ environment.py ← Core game logic (reset, step, state) β”‚ β”œβ”€β”€ grader.py ← IoU-based deterministic grading β”‚ β”œβ”€β”€ corruption.py ← Annotation corruption strategies β”‚ β”œβ”€β”€ app.py ← FastAPI server β”‚ └── Dockerfile ← Container definition └── data/ └── generate_dataset.py ← Synthetic scene generator ``` ## πŸš€ Quick Start ### Install & Run Locally ```bash cd annotation_qa_env pip install -e . uvicorn server.app:app --host 0.0.0.0 --port 8000 ``` ### Use the Client ```python from annotation_qa_env import AnnotationQAEnv, AnnotationQAAction with AnnotationQAEnv(base_url="http://localhost:8000").sync() as env: result = env.reset(task="fix_bboxes") print(result.observation.annotations) result = env.step(AnnotationQAAction( action_type="adjust_bbox", annotation_id=0, new_bbox=[0.1, 0.2, 0.15, 0.1], )) print(f"Reward: {result.reward}") ``` ### Docker ```bash docker build -t annotation-qa-env:latest -f server/Dockerfile . docker run -d -p 8000:8000 annotation-qa-env:latest ``` ### Deploy to HF Spaces ```bash openenv push --repo-id username/annotation-qa-env ``` ## πŸ“Š Grading The grading function is **deterministic** and returns scores in `[0.0, 1.0]`: ``` Score = (final_quality - initial_quality) / (1.0 - initial_quality) ``` Where `quality` is a weighted composite of: - **Mean IoU** (40%) β€” How well do predicted bboxes overlap with gold? - **Class Accuracy** (30%) β€” Are class labels correct? - **Precision** (15%) β€” Are there spurious annotations? - **Recall** (15%) β€” Are there missing annotations? ## πŸ€– Actions | Action | Required Fields | Description | |--------|----------------|-------------| | `adjust_bbox` | `annotation_id`, `new_bbox` | Fix a bounding box | | `change_class` | `annotation_id`, `new_class` | Fix a class label | | `add_annotation` | `new_bbox`, `new_class` | Add a missing annotation | | `remove_annotation` | `annotation_id` | Remove a spurious annotation | | `submit` | (none) | Finalize corrections | ## πŸ“¦ Environment Variables | Variable | Default | Description | |----------|---------|-------------| | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint | | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference | | `HF_TOKEN` | β€” | API key | ## πŸ”¬ Why Synthetic Scenes? We use programmatic scene descriptions instead of real COCO images because: 1. **Docker size**: COCO train2017 is ~18GB β€” exceeds container limits 2. **Memory**: Base64 images in observations would spike past 8GB RAM 3. **LLM text-only**: Evaluation uses text-only LLMs (no vision models) 4. **Determinism**: Same seed = same data = reproducible scores 5. **Zero setup**: No dataset download β€” everything is self-contained The annotation QA task is fundamentally about **spatial + categorical reasoning**, which text captures fully. ## πŸ“œ License BSD-3-Clause (matching OpenEnv)