Spaces:
Sleeping
Sleeping
| title: Annotation QA Env | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 8000 | |
| # π Annotation QA Environment | |
| An **OpenEnv** environment where an AI agent reviews and corrects intentionally-flawed ML annotations on synthetic scenes. Built for the [Meta OpenEnv Γ SST Hackathon](https://github.com/meta-pytorch/OpenEnv). | |
| ## π― The Challenge | |
| Real-world ML training data is noisy. Annotation teams make mistakes β bounding boxes drift, class labels get swapped, objects get missed. This environment simulates that review pipeline: | |
| 1. **Agent receives** a scene description + current annotations (some are wrong) | |
| 2. **Agent identifies** errors by comparing annotations to scene objects | |
| 3. **Agent corrects** errors through bbox adjustments, class changes, additions, and removals | |
| 4. **Agent submits** and receives a score based on annotation quality improvement | |
| ## π Tasks (3 Difficulty Levels) | |
| | Task | Difficulty | Errors | Max Steps | | |
| |------|-----------|--------|-----------| | |
| | `fix_bboxes` | Easy | Bbox expansion, shifting, shrinking, spurious, missing | 15 | | |
| | `fix_classes` | Medium | Bbox errors + class label confusion (carβtruck, dogβcat) | 20 | | |
| | `batch_audit` | Hard | Subtle bbox shifts + similar-class confusion + cross-batch issues | 30 | | |
| ## ποΈ Architecture | |
| ``` | |
| annotation_qa_env/ | |
| βββ models.py β Action, Observation, State (Pydantic) | |
| βββ client.py β EnvClient for WebSocket interaction | |
| βββ inference.py β Baseline LLM agent (OpenAI client) | |
| βββ server/ | |
| β βββ environment.py β Core game logic (reset, step, state) | |
| β βββ grader.py β IoU-based deterministic grading | |
| β βββ corruption.py β Annotation corruption strategies | |
| β βββ app.py β FastAPI server | |
| β βββ Dockerfile β Container definition | |
| βββ data/ | |
| βββ generate_dataset.py β Synthetic scene generator | |
| ``` | |
| ## π Quick Start | |
| ### Install & Run Locally | |
| ```bash | |
| cd annotation_qa_env | |
| pip install -e . | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### Use the Client | |
| ```python | |
| from annotation_qa_env import AnnotationQAEnv, AnnotationQAAction | |
| with AnnotationQAEnv(base_url="http://localhost:8000").sync() as env: | |
| result = env.reset(task="fix_bboxes") | |
| print(result.observation.annotations) | |
| result = env.step(AnnotationQAAction( | |
| action_type="adjust_bbox", | |
| annotation_id=0, | |
| new_bbox=[0.1, 0.2, 0.15, 0.1], | |
| )) | |
| print(f"Reward: {result.reward}") | |
| ``` | |
| ### Docker | |
| ```bash | |
| docker build -t annotation-qa-env:latest -f server/Dockerfile . | |
| docker run -d -p 8000:8000 annotation-qa-env:latest | |
| ``` | |
| ### Deploy to HF Spaces | |
| ```bash | |
| openenv push --repo-id username/annotation-qa-env | |
| ``` | |
| ## π Grading | |
| The grading function is **deterministic** and returns scores in `[0.0, 1.0]`: | |
| ``` | |
| Score = (final_quality - initial_quality) / (1.0 - initial_quality) | |
| ``` | |
| Where `quality` is a weighted composite of: | |
| - **Mean IoU** (40%) β How well do predicted bboxes overlap with gold? | |
| - **Class Accuracy** (30%) β Are class labels correct? | |
| - **Precision** (15%) β Are there spurious annotations? | |
| - **Recall** (15%) β Are there missing annotations? | |
| ## π€ Actions | |
| | Action | Required Fields | Description | | |
| |--------|----------------|-------------| | |
| | `adjust_bbox` | `annotation_id`, `new_bbox` | Fix a bounding box | | |
| | `change_class` | `annotation_id`, `new_class` | Fix a class label | | |
| | `add_annotation` | `new_bbox`, `new_class` | Add a missing annotation | | |
| | `remove_annotation` | `annotation_id` | Remove a spurious annotation | | |
| | `submit` | (none) | Finalize corrections | | |
| ## π¦ Environment Variables | |
| | Variable | Default | Description | | |
| |----------|---------|-------------| | |
| | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint | | |
| | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference | | |
| | `HF_TOKEN` | β | API key | | |
| ## π¬ Why Synthetic Scenes? | |
| We use programmatic scene descriptions instead of real COCO images because: | |
| 1. **Docker size**: COCO train2017 is ~18GB β exceeds container limits | |
| 2. **Memory**: Base64 images in observations would spike past 8GB RAM | |
| 3. **LLM text-only**: Evaluation uses text-only LLMs (no vision models) | |
| 4. **Determinism**: Same seed = same data = reproducible scores | |
| 5. **Zero setup**: No dataset download β everything is self-contained | |
| The annotation QA task is fundamentally about **spatial + categorical reasoning**, which text captures fully. | |
| ## π License | |
| BSD-3-Clause (matching OpenEnv) | |