---
title: Annotation QA Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
---
# Annotation QA Environment
An **OpenEnv** environment where an AI agent reviews and corrects intentionally flawed ML annotations on synthetic scenes. Built for the [Meta OpenEnv × SST Hackathon](https://github.com/meta-pytorch/OpenEnv).
## The Challenge
Real-world ML training data is noisy. Annotation teams make mistakes: bounding boxes drift, class labels get swapped, objects get missed. This environment simulates that review pipeline:
1. **Agent receives** a scene description + current annotations (some are wrong)
2. **Agent identifies** errors by comparing annotations to scene objects
3. **Agent corrects** errors through bbox adjustments, class changes, additions, and removals
4. **Agent submits** and receives a score based on annotation quality improvement
## Tasks (3 Difficulty Levels)
| Task | Difficulty | Errors | Max Steps |
|------|-----------|--------|-----------|
| `fix_bboxes` | Easy | Bbox expansion, shifting, shrinking, spurious, missing | 15 |
| `fix_classes` | Medium | Bbox errors + class label confusion (car→truck, dog→cat) | 20 |
| `batch_audit` | Hard | Subtle bbox shifts + similar-class confusion + cross-batch issues | 30 |
## Architecture
```
annotation_qa_env/
├── models.py              # Action, Observation, State (Pydantic)
├── client.py              # EnvClient for WebSocket interaction
├── inference.py           # Baseline LLM agent (OpenAI client)
├── server/
│   ├── environment.py     # Core game logic (reset, step, state)
│   ├── grader.py          # IoU-based deterministic grading
│   ├── corruption.py      # Annotation corruption strategies
│   ├── app.py             # FastAPI server
│   └── Dockerfile         # Container definition
└── data/
    └── generate_dataset.py  # Synthetic scene generator
```
## Quick Start
### Install & Run Locally
```bash
cd annotation_qa_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Use the Client
```python
from annotation_qa_env import AnnotationQAEnv, AnnotationQAAction

with AnnotationQAEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task="fix_bboxes")
    print(result.observation.annotations)

    result = env.step(AnnotationQAAction(
        action_type="adjust_bbox",
        annotation_id=0,
        new_bbox=[0.1, 0.2, 0.15, 0.1],
    ))
    print(f"Reward: {result.reward}")
```
### Docker
```bash
docker build -t annotation-qa-env:latest -f server/Dockerfile .
docker run -d -p 8000:8000 annotation-qa-env:latest
```
### Deploy to HF Spaces
```bash
openenv push --repo-id username/annotation-qa-env
```
## Grading
The grading function is **deterministic** and returns scores in `[0.0, 1.0]`:
```
Score = (final_quality - initial_quality) / (1.0 - initial_quality)
```
Where `quality` is a weighted composite of:
- **Mean IoU** (40%): how well do predicted bboxes overlap with gold?
- **Class Accuracy** (30%): are class labels correct?
- **Precision** (15%): are there spurious annotations?
- **Recall** (15%): are there missing annotations?
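The formula above can be sketched in a few lines. This is a minimal illustration of the weighted composite and the normalized improvement score, not the actual `grader.py` implementation:

```python
def quality(mean_iou: float, class_acc: float, precision: float, recall: float) -> float:
    """Weighted composite quality, using the weights listed above."""
    return 0.40 * mean_iou + 0.30 * class_acc + 0.15 * precision + 0.15 * recall


def score(initial_quality: float, final_quality: float) -> float:
    """Normalized improvement, clamped to [0.0, 1.0]."""
    if initial_quality >= 1.0:
        return 1.0  # nothing left to improve
    s = (final_quality - initial_quality) / (1.0 - initial_quality)
    return max(0.0, min(1.0, s))
```

Note that the normalization rewards *relative* improvement: fixing half the remaining quality gap always yields 0.5, regardless of how corrupted the initial annotations were.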
## Actions
| Action | Required Fields | Description |
|--------|----------------|-------------|
| `adjust_bbox` | `annotation_id`, `new_bbox` | Fix a bounding box |
| `change_class` | `annotation_id`, `new_class` | Fix a class label |
| `add_annotation` | `new_bbox`, `new_class` | Add a missing annotation |
| `remove_annotation` | `annotation_id` | Remove a spurious annotation |
| `submit` | (none) | Finalize corrections |
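To illustrate the required-field schema in the table, here is a small client-side validator sketch. The field names come from the table above; the validator itself is hypothetical and not part of the environment:

```python
# Required fields per action type, mirroring the Actions table.
REQUIRED_FIELDS = {
    "adjust_bbox": {"annotation_id", "new_bbox"},
    "change_class": {"annotation_id", "new_class"},
    "add_annotation": {"new_bbox", "new_class"},
    "remove_annotation": {"annotation_id"},
    "submit": set(),
}


def validate_action(action: dict) -> bool:
    """Return True if the action dict carries the fields its type requires."""
    required = REQUIRED_FIELDS.get(action.get("action_type"))
    if required is None:
        return False  # unknown action type
    provided = {k for k in action if k != "action_type"}
    return required <= provided
```

For example, `{"action_type": "adjust_bbox", "annotation_id": 0}` fails because `new_bbox` is missing.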
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference |
| `HF_TOKEN` | (none) | API key |
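A typical configuration for the baseline agent might look like this (the token value is a placeholder, and running `inference.py` directly as a script is an assumption):

```shell
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_..."  # placeholder: substitute your own API key
python inference.py
```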
## Why Synthetic Scenes?
We use programmatic scene descriptions instead of real COCO images because:
1. **Docker size**: COCO train2017 is ~18 GB, which exceeds container limits
2. **Memory**: Base64 images in observations would spike past 8GB RAM
3. **LLM text-only**: Evaluation uses text-only LLMs (no vision models)
4. **Determinism**: Same seed = same data = reproducible scores
5. **Zero setup**: no dataset download; everything is self-contained
The annotation QA task is fundamentally about **spatial + categorical reasoning**, which text captures fully.
## License
BSD-3-Clause (matching OpenEnv)