---
title: Annotation QA Env
emoji: πŸ”
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
---
# πŸ” Annotation QA Environment

An **OpenEnv** environment where an AI agent reviews and corrects intentionally flawed ML annotations on synthetic scenes. Built for the [Meta OpenEnv Γ— SST Hackathon](https://github.com/meta-pytorch/OpenEnv).

## 🎯 The Challenge

Real-world ML training data is noisy. Annotation teams make mistakes β€” bounding boxes drift, class labels get swapped, objects get missed. This environment simulates that review pipeline:

1. **Agent receives** a scene description + current annotations (some are wrong)
2. **Agent identifies** errors by comparing annotations to scene objects
3. **Agent corrects** errors through bbox adjustments, class changes, additions, and removals
4. **Agent submits** and receives a score based on annotation quality improvement

## πŸ“‹ Tasks (3 Difficulty Levels)

| Task | Difficulty | Errors | Max Steps |
|------|-----------|--------|-----------|
| `fix_bboxes` | Easy | Bbox expansion, shifting, shrinking, spurious, missing | 15 |
| `fix_classes` | Medium | Bbox errors + class label confusion (car↔truck, dog↔cat) | 20 |
| `batch_audit` | Hard | Subtle bbox shifts + similar-class confusion + cross-batch issues | 30 |
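The bbox error types in the table can be pictured as simple transforms on a normalized `[x, y, w, h]` box. This is an illustrative sketch only; the function names and parameters are assumptions, not the actual `corruption.py` API:

```python
import random

def shift_bbox(bbox, seed=None, max_frac=0.1):
    """Translate a normalized [x, y, w, h] box by a small random offset,
    clamped so the box stays inside the unit square."""
    rng = random.Random(seed)
    x, y, w, h = bbox
    nx = min(max(x + rng.uniform(-max_frac, max_frac), 0.0), 1.0 - w)
    ny = min(max(y + rng.uniform(-max_frac, max_frac), 0.0), 1.0 - h)
    return [nx, ny, w, h]

def expand_bbox(bbox, factor=1.3):
    """Grow (or, with factor < 1, shrink) a box around its center."""
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2
    nw, nh = min(w * factor, 1.0), min(h * factor, 1.0)
    return [max(cx - nw / 2, 0.0), max(cy - nh / 2, 0.0), nw, nh]
```

Spurious and missing errors are simpler still: inject a box that matches no scene object, or drop a gold annotation from the set the agent sees.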

## πŸ—οΈ Architecture

```
annotation_qa_env/
β”œβ”€β”€ models.py              ← Action, Observation, State (Pydantic)
β”œβ”€β”€ client.py              ← EnvClient for WebSocket interaction
β”œβ”€β”€ inference.py           ← Baseline LLM agent (OpenAI client)
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ environment.py     ← Core game logic (reset, step, state)
β”‚   β”œβ”€β”€ grader.py          ← IoU-based deterministic grading
β”‚   β”œβ”€β”€ corruption.py      ← Annotation corruption strategies
β”‚   β”œβ”€β”€ app.py             ← FastAPI server
β”‚   └── Dockerfile         ← Container definition
└── data/
    └── generate_dataset.py ← Synthetic scene generator
```
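For orientation, the action model in `models.py` has roughly the shape sketched below. The real code uses Pydantic; a dataclass stand-in keeps this snippet dependency-free, and the field names are inferred from the tables later in this README, so treat them as assumptions:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnnotationQAAction:
    action_type: str                        # "adjust_bbox", "change_class", ...
    annotation_id: Optional[int] = None     # which annotation to edit/remove
    new_bbox: Optional[List[float]] = None  # normalized [x, y, w, h]
    new_class: Optional[str] = None         # replacement class label
```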

## πŸš€ Quick Start

### Install & Run Locally
```bash
cd annotation_qa_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Use the Client
```python
from annotation_qa_env import AnnotationQAEnv, AnnotationQAAction

with AnnotationQAEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task="fix_bboxes")
    print(result.observation.annotations)

    result = env.step(AnnotationQAAction(
        action_type="adjust_bbox",
        annotation_id=0,
        new_bbox=[0.1, 0.2, 0.15, 0.1],
    ))
    print(f"Reward: {result.reward}")
```

### Docker
```bash
docker build -t annotation-qa-env:latest -f server/Dockerfile .
docker run -d -p 8000:8000 annotation-qa-env:latest
```

### Deploy to HF Spaces
```bash
openenv push --repo-id username/annotation-qa-env
```

## πŸ“Š Grading

The grading function is **deterministic** and returns scores in `[0.0, 1.0]`:

```
Score = (final_quality - initial_quality) / (1.0 - initial_quality)
```

Where `quality` is a weighted composite of:
- **Mean IoU** (40%) β€” How well do predicted bboxes overlap with gold?
- **Class Accuracy** (30%) β€” Are class labels correct?
- **Precision** (15%) β€” Are there spurious annotations?
- **Recall** (15%) β€” Are there missing annotations?
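The grading math above can be sketched in a few lines. The IoU definition and the 40/30/15/15 weights follow this README; the clamp to `[0.0, 1.0]` and all function names are assumptions, not the actual `grader.py` API:

```python
def iou(a, b):
    """Intersection over union of two normalized [x, y, w, h] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def quality(mean_iou, class_acc, precision, recall):
    """Weighted composite: IoU 40%, class accuracy 30%, P/R 15% each."""
    return 0.40 * mean_iou + 0.30 * class_acc + 0.15 * precision + 0.15 * recall

def score(initial_quality, final_quality):
    """Fraction of the possible improvement the agent achieved."""
    if initial_quality >= 1.0:
        return 1.0  # nothing left to fix
    raw = (final_quality - initial_quality) / (1.0 - initial_quality)
    return max(0.0, min(1.0, raw))
```

Normalizing by `1.0 - initial_quality` means a perfect cleanup scores 1.0 regardless of how badly corrupted the episode started.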

## πŸ€– Actions

| Action | Required Fields | Description |
|--------|----------------|-------------|
| `adjust_bbox` | `annotation_id`, `new_bbox` | Fix a bounding box |
| `change_class` | `annotation_id`, `new_class` | Fix a class label |
| `add_annotation` | `new_bbox`, `new_class` | Add a missing annotation |
| `remove_annotation` | `annotation_id` | Remove a spurious annotation |
| `submit` | (none) | Finalize corrections |
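Putting the table together, one full correction episode might issue a sequence like the following. Plain dicts stand in for the real `AnnotationQAAction` model here; the field names mirror the table, everything else is illustrative:

```python
def make_action(action_type, **fields):
    """Stand-in builder for AnnotationQAAction payloads."""
    return {"action_type": action_type, **fields}

episode = [
    make_action("adjust_bbox", annotation_id=0, new_bbox=[0.1, 0.2, 0.15, 0.1]),
    make_action("change_class", annotation_id=1, new_class="truck"),
    make_action("add_annotation", new_bbox=[0.5, 0.5, 0.2, 0.2], new_class="dog"),
    make_action("remove_annotation", annotation_id=2),
    make_action("submit"),  # finalize; grading runs after this
]
```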

## πŸ“¦ Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference |
| `HF_TOKEN` | β€” | API key |

## πŸ”¬ Why Synthetic Scenes?

We use programmatic scene descriptions instead of real COCO images because:

1. **Docker size**: COCO train2017 is ~18GB β€” exceeds container limits
2. **Memory**: Base64 images in observations would spike past 8GB RAM
3. **LLM text-only**: Evaluation uses text-only LLMs (no vision models)
4. **Determinism**: Same seed = same data = reproducible scores
5. **Zero setup**: No dataset download β€” everything is self-contained

The annotation QA task is fundamentally about **spatial + categorical reasoning**, which text captures fully.
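The determinism point can be illustrated with a toy stand-in for `data/generate_dataset.py`; the real generator's classes and layout will differ, but the principle is the same, since the seed fully determines the scene:

```python
import random

def generate_scene(seed):
    """Toy seeded scene generator: same seed, same objects, every run."""
    rng = random.Random(seed)
    classes = ["car", "truck", "dog", "cat", "person"]
    return [
        {
            "class": rng.choice(classes),
            "bbox": [round(rng.random() * 0.8, 3),
                     round(rng.random() * 0.8, 3),
                     round(0.05 + rng.random() * 0.15, 3),
                     round(0.05 + rng.random() * 0.15, 3)],
        }
        for _ in range(rng.randint(3, 6))
    ]
```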

## πŸ“œ License

BSD-3-Clause (matching OpenEnv)