# πŸ“ Annotation Tool β€” Guide
A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.
---
## Quick Start
### For Annotators
1. Go to the Space URL and click **🤗 Sign in with HuggingFace**
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with **← Prev / Next →**
4. Open the **Data Mentions** panel to validate each mention
5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8`
### Validation Actions
| Action | What it does |
|--------|-------------|
| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
| **Click tag badge** | Change dataset type (named, descriptive, generic) |
| **Highlight text → Annotate** | Manually add a dataset mention the AI missed |
| 🗑️ **Delete** | Remove a dataset entry entirely |
> **Tip:** If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.
---
## Document Assignments
Each annotator sees only their assigned documents. A configurable percentage of documents (default 10%) is set aside as **overlap documents**, which are shared across all annotators so that inter-annotator agreement can be measured.
### Configuration File
`annotation_data/annotator_config.yaml`:
```yaml
settings:
  overlap_percent: 10          # % of docs shared between all annotators
annotators:
  - username: rafmacalaba      # HuggingFace username
    docs: [2, 3, 14, ...]      # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
```
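
As a rough illustration of how the config relates to overlap documents, the sketch below works on a plain dictionary shaped like the YAML above once parsed. The helper names `docs_for_user` and `overlap_docs`, and the concrete doc indices, are hypothetical and are not taken from the app's code.

```python
def docs_for_user(config: dict, username: str) -> list[int]:
    """Return the doc indices assigned to one annotator (empty if unknown)."""
    for annotator in config.get("annotators", []):
        if annotator.get("username") == username:
            return list(annotator.get("docs", []))
    return []


def overlap_docs(config: dict) -> list[int]:
    """Return the doc indices that appear in every annotator's list."""
    doc_sets = [set(a.get("docs", [])) for a in config.get("annotators", [])]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))


# Hypothetical parsed config; the doc indices are made up for the example.
config = {
    "settings": {"overlap_percent": 10},
    "annotators": [
        {"username": "rafmacalaba", "docs": [2, 3, 14]},
        {"username": "rafaelmacalaba", "docs": [1, 2, 10]},
    ],
}

print(docs_for_user(config, "rafmacalaba"))  # [2, 3, 14]
print(overlap_docs(config))                  # [2]
```

Documents returned by `overlap_docs` are the ones where validations from different annotators can be compared.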
### Auto-Generate Assignments
```bash
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run
# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py
# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
```
The script:
- Reads `annotator_config.yaml` for the annotator list and overlap %
- Shuffles all available docs (deterministic seed=42)
- Reserves `overlap_percent` docs shared by ALL annotators
- Splits the rest evenly across annotators
- Saves back to the YAML
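
The steps above can be sketched roughly as follows. This is not the code of `generate_assignments.py`: the function name, the rounding of the overlap count, and the round-robin split are assumptions made for illustration.

```python
import random


def generate_assignments(doc_ids, usernames, overlap_percent=10, seed=42):
    """Assign docs to annotators: a shared overlap set plus an even split."""
    rng = random.Random(seed)          # fixed seed, so the result is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)

    # Assumption: the overlap count is rounded to the nearest whole document.
    n_overlap = round(len(shuffled) * overlap_percent / 100)
    overlap, rest = shuffled[:n_overlap], shuffled[n_overlap:]

    assignments = {}
    for i, user in enumerate(usernames):
        own = rest[i::len(usernames)]  # round-robin slice keeps shares even
        assignments[user] = sorted(overlap + own)
    return assignments


if __name__ == "__main__":
    result = generate_assignments(range(20), ["rafmacalaba", "rafaelmacalaba"])
    for user, docs in result.items():
        print(user, docs)
```

With 20 documents and two annotators, each annotator gets the 2 shared overlap documents plus 9 of the remaining 18.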
### Adding a New Annotator
1. Add to `annotation_data/annotator_config.yaml`:
   ```yaml
     - username: new_hf_username
       docs: []
   ```
2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
3. Add the username to `ALLOWED_USERS` in the Space settings
### Manual Editing
You can manually edit the `docs` array for any annotator in the YAML file, then upload:
```bash
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()
# upload_file only accepts keyword arguments
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"
```
---
## Per-Annotator Validation (Overlap Support)
Each dataset mention stores validations per-annotator in a `validations` array:
```json
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
```
**Key behavior:**
- Each annotator only sees **their own** validation status (no bias)
- Progress bar and "Next" prompt count only **your** verifications
- Tag edits (`dataset_tag`) are shared across annotators because they are factual, not judgment-based
- Re-validating updates your existing entry (doesn't create duplicates)
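
A minimal sketch of the update-in-place behavior, using the field names from the JSON example above. The functions `upsert_validation` and `own_validation` are hypothetical. The app's real logic lives in its validate route and may differ.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Add or replace this annotator's entry in mention['validations']."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry      # re-validation overwrites, no duplicate
            break
    else:
        validations.append(entry)       # first validation by this annotator
    return mention


def own_validation(mention, annotator):
    """Return only the given annotator's entry, or None if not validated yet."""
    return next(
        (v for v in mention.get("validations", []) if v.get("annotator") == annotator),
        None,
    )


mention = {"dataset_name": {"text": "DHS Survey", "confidence": 0.95}, "dataset_tag": "named"}
upsert_validation(mention, "rafmacalaba", True)
upsert_validation(mention, "rafaelmacalaba", False, "This is a study name, not a dataset")
upsert_validation(mention, "rafmacalaba", False)  # changes the earlier verdict
print(len(mention["validations"]))                # 2
```

Filtering through something like `own_validation` is one way to show each annotator only their own status.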
---
## Data Pipeline
### `prepare_data.py`: Prepare & Upload Documents
```bash
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run
# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py
# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
```
This script:
- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects language using `langdetect` and excludes non-English documents (e.g. Arabic, French)
- Uploads English docs to HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields
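
The language filtering step might look something like the sketch below. It is not the code of `prepare_data.py`. The function `tag_languages` and its inputs are hypothetical, and the detector is passed in as a callable so the example runs without extra dependencies.

```python
from typing import Callable


def tag_languages(docs: dict[str, str], detect: Callable[[str], str]):
    """Detect a language code per document and list the English ones.

    docs maps a document id to a text sample; detect returns a code such as 'en'.
    """
    languages = {}
    for doc_id, text in docs.items():
        try:
            languages[doc_id] = detect(text)
        except Exception:
            # Detectors can fail on empty or non-text input; mark as unknown.
            languages[doc_id] = "unknown"
    english = [doc_id for doc_id, lang in languages.items() if lang == "en"]
    return english, languages


def fake_detect(text: str) -> str:
    """Stand-in detector for the example; a real run would use langdetect."""
    if not text.strip():
        raise ValueError("no text to analyse")
    return "fr" if "rapport" in text else "en"


samples = {
    "doc_1": "This report uses household survey data.",
    "doc_2": "Ce rapport utilise des données d'enquête.",
    "doc_3": "",
}
english, languages = tag_languages(samples, fake_detect)
print(english)    # ['doc_1']
print(languages)  # {'doc_1': 'en', 'doc_2': 'fr', 'doc_3': 'unknown'}
```

With `langdetect` installed, `from langdetect import detect` can be passed in place of `fake_detect`.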
---
## Leaderboard 🏆
Click **🏆 Leaderboard** in the top bar to see annotator rankings.
| Metric | Description |
|--------|-------------|
| ✅ Verified | Number of mentions validated |
| ✍️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | `Verified + Added` |
Leaderboard results are cached for 2 minutes.
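
To make the scoring and caching concrete, here is a small sketch. The stat field names (`verified`, `added`, `docs`), the function names, and the in-memory cache are assumptions. Only the `Verified + Added` formula and the 2-minute lifetime come from this guide.

```python
import time

CACHE_TTL_SECONDS = 120  # 2 minutes, as stated in this guide
_cache = {"at": None, "rows": None}


def rank_annotators(stats):
    """Add a score (verified + added) to each row and sort highest first."""
    rows = [{**s, "score": s["verified"] + s["added"]} for s in stats]
    return sorted(rows, key=lambda r: r["score"], reverse=True)


def cached_leaderboard(load_stats, now=time.monotonic):
    """Recompute the ranking only when the cached copy is older than the TTL."""
    t = now()
    if _cache["at"] is None or t - _cache["at"] >= CACHE_TTL_SECONDS:
        _cache["rows"] = rank_annotators(load_stats())
        _cache["at"] = t
    return _cache["rows"]


# Hypothetical per-annotator counts.
example = [
    {"username": "rafmacalaba", "verified": 4, "added": 1, "docs": 2},
    {"username": "rafaelmacalaba", "verified": 6, "added": 2, "docs": 3},
]
for row in rank_annotators(example):
    print(row["username"], row["score"])
```

Within the cache window, repeated requests reuse the stored ranking, so new validations may take up to two minutes to show.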
---
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |
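
For scripted access, the read-only endpoints could be called along these lines. The base URL is a placeholder, the helper names are hypothetical, and the sketch assumes the endpoints return JSON, which this guide does not state.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://example-space.hf.space"  # placeholder, not the real Space URL


def documents_url(user, base=BASE_URL):
    """URL for listing the documents assigned to a user."""
    return f"{base}/api/documents?{urlencode({'user': user})}"


def document_page_url(index, page, base=BASE_URL):
    """URL for one page of one document."""
    return f"{base}/api/document?{urlencode({'index': index, 'page': page})}"


def fetch_json(url, timeout=30):
    """GET a URL and parse the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(documents_url("rafmacalaba"))
print(document_page_url(3, 2))
```

Calling `fetch_json(documents_url("rafmacalaba"))` against the real Space would also need the login cookie if the routes require authentication.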
---
## Architecture
```
hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators
```
---
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
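
A quick pre-flight check for these variables could look like the sketch below. The helper names are hypothetical, and the whitespace trimming in `ALLOWED_USERS` is an assumption about how the app parses the list.

```python
import os

REQUIRED_VARIABLES = [
    "HF_TOKEN",
    "OAUTH_CLIENT_ID",
    "OAUTH_CLIENT_SECRET",
    "ALLOWED_USERS",
    "NEXTAUTH_SECRET",
]


def missing_variables(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def parse_allowed_users(raw):
    """Split a comma-separated username list, dropping blanks and whitespace."""
    return [user.strip() for user in raw.split(",") if user.strip()]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Allowed users:", parse_allowed_users(os.environ["ALLOWED_USERS"]))
```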