# 📝 Annotation Tool — Guide

A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.

---

## Quick Start

### For Annotators

1. Go to the Space URL and click **🤗 Sign in with HuggingFace**
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with **← Prev / Next →**
4. Open the **Data Mentions** panel to validate each mention
5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8`

### Validation Actions

| Action | What it does |
|--------|-------------|
| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
| **Click tag badge** | Change dataset type (named, descriptive, generic) |
| **Highlight text → Annotate** | Manually add a dataset mention the AI missed |
| 🗑️ **Delete** | Remove a dataset entry entirely |

> **Tip:** If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.

---

## Document Assignments

Each annotator sees only their assigned documents. A configurable percentage (default 10%) are **overlap documents** shared across all annotators for inter-annotator agreement measurement.

### Configuration File

`annotation_data/annotator_config.yaml`:

```yaml
settings:
  overlap_percent: 10         # % of docs shared between all annotators

annotators:
  - username: rafmacalaba     # HuggingFace username
    docs: [2, 3, 14, ...]     # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
```

### Auto-Generate Assignments

```bash
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run

# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py

# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
```

The script:

- Reads `annotator_config.yaml` for the annotator list and overlap %
- Shuffles all available docs (deterministic seed=42)
- Reserves `overlap_percent` docs shared by ALL annotators
- Splits the rest evenly across annotators
- Saves back to the YAML

### Adding a New Annotator

1. Add to `annotation_data/annotator_config.yaml`:

   ```yaml
   - username: new_hf_username
     docs: []
   ```

2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
3. Add the username to `ALLOWED_USERS` in the Space settings

### Manual Editing

You can manually edit the `docs` array for any annotator in the YAML file, then upload (note that `HfApi.upload_file` takes keyword-only arguments):

```bash
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"
```

---

## Per-Annotator Validation (Overlap Support)

Each dataset mention stores validations per-annotator in a `validations` array:

```json
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
```

**Key behavior:**

- Each annotator only sees **their own** validation status (no bias)
- Progress bar and "Next" prompt count only **your** verifications
- Tag edits (`dataset_tag`) are shared — they're factual, not judgment-based
- Re-validating updates your existing entry (doesn't create duplicates)

---

## Data Pipeline

### `prepare_data.py` — Prepare & Upload Documents

```bash
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
```

This script:

- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects language using `langdetect` (excludes non-English docs: Arabic, French)
- Uploads English docs to the HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields

---

## Leaderboard 🏆

Click **🏆 Leaderboard** in the top bar to see annotator rankings.

| Metric | Description |
|--------|-------------|
| ✅ Verified | Number of mentions validated |
| ✍️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | `Verified + Added` |

Leaderboard stats are cached for 2 minutes.
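The scoring above is a straightforward aggregation over the `validations` arrays shown earlier. A minimal sketch of how such rankings could be computed (the `added_by` field and the `documents` input shape are illustrative assumptions; the app's actual logic lives in `app/api/leaderboard/route.js`):

```python
from collections import defaultdict

def leaderboard(documents):
    """Aggregate per-annotator stats across documents.

    `documents` is an iterable of (doc_id, mentions) pairs, where each
    mention follows the `validations` schema shown above. The `added_by`
    field marking manually added mentions is an assumption, not a
    documented part of the schema.
    """
    stats = defaultdict(lambda: {"verified": 0, "added": 0, "docs": set()})
    for doc_id, mentions in documents:
        for mention in mentions:
            added_by = mention.get("added_by")  # assumed marker for manual annotations
            if added_by:
                stats[added_by]["added"] += 1
                stats[added_by]["docs"].add(doc_id)
            for v in mention.get("validations", []):
                if v.get("human_validated"):
                    stats[v["annotator"]]["verified"] += 1
                    stats[v["annotator"]]["docs"].add(doc_id)
    rows = [
        {"annotator": name, "verified": s["verified"], "added": s["added"],
         "docs": len(s["docs"]), "score": s["verified"] + s["added"]}
        for name, s in stats.items()
    ]
    return sorted(rows, key=lambda r: r["score"], reverse=True)
```

Since each validation carries its own `annotator` field, one pass over the data yields every annotator's row at once.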
---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |

---

## Architecture

```
hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators
```

---

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
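
---

## Appendix: Assignment Algorithm Sketch

The procedure described under "Auto-Generate Assignments" (deterministic shuffle with seed 42, a reserved overlap slice given to everyone, an even split of the rest) can be sketched as follows. This is an illustrative reimplementation under those stated assumptions, not the actual `generate_assignments.py`:

```python
import random

def assign_docs(doc_ids, annotators, overlap_percent=10, seed=42):
    """Split doc IDs into per-annotator assignments with a shared overlap slice.

    Mirrors the behavior described for generate_assignments.py: a
    deterministic shuffle, `overlap_percent` of docs shared by all
    annotators, and the remainder split evenly. Illustrative only.
    """
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)      # deterministic shuffle (seed=42)
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]
    assignments = {a: list(overlap) for a in annotators}
    for i, doc in enumerate(rest):         # round-robin gives an even split
        assignments[annotators[i % len(annotators)]].append(doc)
    return assignments
```

Seeding a private `random.Random(seed)` instance (rather than the module-level generator) keeps the shuffle reproducible across runs, which is what lets re-running the script leave existing assignments stable.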