---
title: Annotation QA Env
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
---

πŸ” Annotation QA Environment

An OpenEnv environment where an AI agent reviews and corrects intentionally flawed ML annotations on synthetic scenes. Built for the Meta OpenEnv × SST Hackathon.

## 🎯 The Challenge

Real-world ML training data is noisy. Annotation teams make mistakes — bounding boxes drift, class labels get swapped, objects get missed. This environment simulates that review pipeline:

  1. Agent receives a scene description + current annotations (some are wrong)
  2. Agent identifies errors by comparing annotations to scene objects
  3. Agent corrects errors through bbox adjustments, class changes, additions, and removals
  4. Agent submits and receives a score based on annotation quality improvement
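Step 2 above can be sketched as a rule-based first pass: match each annotation to the scene object it overlaps most, then flag low-overlap or mislabeled entries. This is an illustrative baseline, not the environment's code; the `bbox`/`cls` dict fields and `[x, y, w, h]` box format are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def flag_errors(annotations, scene_objects, iou_thresh=0.5):
    """Return (annotation_index, issue) pairs for suspicious annotations."""
    issues = []
    for i, ann in enumerate(annotations):
        # Match the annotation to the scene object it overlaps most.
        best = max(scene_objects, key=lambda obj: iou(ann["bbox"], obj["bbox"]))
        if iou(ann["bbox"], best["bbox"]) < iou_thresh:
            issues.append((i, "bad_bbox_or_spurious"))
        elif ann["cls"] != best["cls"]:
            issues.append((i, "wrong_class"))
    return issues
```

A fuller pass would also flag scene objects that no annotation overlaps — those are candidates for `add_annotation`.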

## 📋 Tasks (3 Difficulty Levels)

| Task | Difficulty | Errors | Max Steps |
|------|------------|--------|-----------|
| `fix_bboxes` | Easy | Bbox expansion, shifting, shrinking, spurious, missing | 15 |
| `fix_classes` | Medium | Bbox errors + class label confusion (car↔truck, dog↔cat) | 20 |
| `batch_audit` | Hard | Subtle bbox shifts + similar-class confusion + cross-batch issues | 30 |

πŸ—οΈ Architecture

```
annotation_qa_env/
├── models.py              ← Action, Observation, State (Pydantic)
├── client.py              ← EnvClient for WebSocket interaction
├── inference.py           ← Baseline LLM agent (OpenAI client)
├── server/
│   ├── environment.py     ← Core game logic (reset, step, state)
│   ├── grader.py          ← IoU-based deterministic grading
│   ├── corruption.py      ← Annotation corruption strategies
│   ├── app.py             ← FastAPI server
│   └── Dockerfile         ← Container definition
└── data/
    └── generate_dataset.py ← Synthetic scene generator
```

## 🚀 Quick Start

### Install & Run Locally

```bash
cd annotation_qa_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Use the Client

```python
from annotation_qa_env import AnnotationQAEnv, AnnotationQAAction

with AnnotationQAEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task="fix_bboxes")
    print(result.observation.annotations)

    result = env.step(AnnotationQAAction(
        action_type="adjust_bbox",
        annotation_id=0,
        new_bbox=[0.1, 0.2, 0.15, 0.1],
    ))
    print(f"Reward: {result.reward}")
```

### Docker

```bash
docker build -t annotation-qa-env:latest -f server/Dockerfile .
docker run -d -p 8000:8000 annotation-qa-env:latest
```

### Deploy to HF Spaces

```bash
openenv push --repo-id username/annotation-qa-env
```

## 📊 Grading

The grading function is deterministic and returns scores in [0.0, 1.0]:

```
Score = (final_quality - initial_quality) / (1.0 - initial_quality)
```

Where quality is a weighted composite of:

  • Mean IoU (40%) β€” How well do predicted bboxes overlap with gold?
  • Class Accuracy (30%) β€” Are class labels correct?
  • Precision (15%) β€” Are there spurious annotations?
  • Recall (15%) β€” Are there missing annotations?

## 🤖 Actions

| Action | Required Fields | Description |
|--------|-----------------|-------------|
| `adjust_bbox` | `annotation_id`, `new_bbox` | Fix a bounding box |
| `change_class` | `annotation_id`, `new_class` | Fix a class label |
| `add_annotation` | `new_bbox`, `new_class` | Add a missing annotation |
| `remove_annotation` | `annotation_id` | Remove a spurious annotation |
| `submit` | (none) | Finalize corrections |
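The required-fields contract above can be expressed as plain payload validation. This validator is only illustrative — the environment itself enforces the schema through its Pydantic `AnnotationQAAction` model:

```python
# Required fields per action type, mirroring the table above.
REQUIRED = {
    "adjust_bbox": {"annotation_id", "new_bbox"},
    "change_class": {"annotation_id", "new_class"},
    "add_annotation": {"new_bbox", "new_class"},
    "remove_annotation": {"annotation_id"},
    "submit": set(),
}

def validate_action(payload):
    """Raise ValueError if the payload is malformed; return it otherwise."""
    kind = payload.get("action_type")
    if kind not in REQUIRED:
        raise ValueError(f"unknown action_type: {kind!r}")
    missing = REQUIRED[kind] - payload.keys()
    if missing:
        raise ValueError(f"{kind} missing fields: {sorted(missing)}")
    return payload
```

So `{"action_type": "remove_annotation", "annotation_id": 3}` passes, while an `adjust_bbox` without `new_bbox` is rejected.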

## 📦 Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference |
| `HF_TOKEN` | — | API key |
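A minimal sketch of how the baseline agent might resolve these settings — the variable names and defaults come from the table, but the `load_config` helper is illustrative, not part of `inference.py`:

```python
import os

def load_config(env=None):
    """Resolve inference settings from environment variables, with defaults."""
    env = os.environ if env is None else env
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": env.get("HF_TOKEN"),  # no default: must be set for the hosted router
    }
```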

## 🔬 Why Synthetic Scenes?

We use programmatic scene descriptions instead of real COCO images because:

  1. Docker size: COCO train2017 is ~18GB — exceeds container limits
  2. Memory: Base64 images in observations would spike past 8GB RAM
  3. LLM text-only: Evaluation uses text-only LLMs (no vision models)
  4. Determinism: Same seed = same data = reproducible scores
  5. Zero setup: No dataset download — everything is self-contained

The annotation QA task is fundamentally about spatial + categorical reasoning, which text captures fully.

## 📜 License

BSD-3-Clause (matching OpenEnv)