
data-use-annotation / ANNOTATION_GUIDE.md

rafmacalaba

docs: add comprehensive ANNOTATION_GUIDE.md

e746bfe 18 days ago


	# 📝 Annotation Tool — Guide

	A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.

	---

	## Quick Start

	### For Annotators

	1. Go to the Space URL and click 🤗 Sign in with HuggingFace
	2. You'll see only your assigned documents in the dropdown
	3. Navigate pages with ← Prev / Next →
	4. Open the Data Mentions panel to validate each mention
	5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8`

	### Validation Actions

	| Action | What it does |
	|--------|-------------|
	| ✅ Correct | Confirms the AI extraction is a real dataset mention |
	| ❌ Incorrect | Marks the extraction as wrong / not a dataset |
	| Click tag badge | Change dataset type (named, descriptive, generic) |
	| Highlight text → Annotate | Manually add a dataset mention the AI missed |
	| 🗑️ Delete | Remove a dataset entry entirely |

	> Tip: If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.

	---

	## Document Assignments

	Each annotator sees only their assigned documents. A configurable percentage of the documents (default 10%) is reserved as overlap documents, which are assigned to every annotator so that inter-annotator agreement can be measured.
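
As a rough illustration of what the overlap documents make possible, the sketch below computes raw percent agreement between two annotators. The function name, the mention IDs, and the input shape (a dict mapping a mention ID to a true/false verdict) are assumptions made for this example and are not part of the app.

```python
def percent_agreement(verdicts_a, verdicts_b):
    """Share of mentions judged by both annotators on which the verdicts match.

    Each argument maps a (hypothetical) mention ID to True or False.
    Returns None when the two annotators have no mentions in common.
    """
    shared = verdicts_a.keys() & verdicts_b.keys()
    if not shared:
        return None
    matches = sum(1 for key in shared if verdicts_a[key] == verdicts_b[key])
    return matches / len(shared)


# Made-up verdicts on an overlap document, for demonstration only.
a = {"doc2-p1-0": True, "doc2-p1-1": False, "doc2-p3-0": True}
b = {"doc2-p1-0": True, "doc2-p1-1": True, "doc2-p3-0": True, "doc2-p4-0": False}
print(percent_agreement(a, b))
```

Raw agreement does not correct for agreement that would occur by chance, so a chance-corrected statistic may be preferable for reporting.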

	### Configuration File

	`annotation_data/annotator_config.yaml`:
	```yaml
	settings:
	  overlap_percent: 10  # % of docs shared between all annotators

	annotators:
	  - username: rafmacalaba  # HuggingFace username
	    docs: [2, 3, 14, ...]  # assigned doc indices
	  - username: rafaelmacalaba
	    docs: [1, 2, 10, ...]
	```

	### Auto-Generate Assignments

	```bash
	# Preview assignment distribution:
	uv run --with pyyaml python3 generate_assignments.py --dry-run

	# Generate and save locally:
	uv run --with pyyaml python3 generate_assignments.py

	# Generate, save, and upload to HF:
	uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
	```

	The script:
	- Reads `annotator_config.yaml` for the annotator list and overlap %
	- Shuffles all available docs (deterministic seed=42)
	- Reserves `overlap_percent`% of the docs as overlap documents shared by ALL annotators
	- Splits the rest evenly across annotators
	- Saves back to the YAML
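
The steps above can be sketched roughly as follows. This is not the code of `generate_assignments.py`; the function name `assign_docs`, the rounding of the overlap count, and the round-robin split are assumptions chosen to illustrate the described behavior.

```python
import random


def assign_docs(doc_indices, usernames, overlap_percent=10, seed=42):
    """Return {username: sorted doc indices}; overlap docs go to every annotator."""
    docs = list(doc_indices)
    random.Random(seed).shuffle(docs)  # fixed seed makes the shuffle reproducible

    # How the real script rounds the overlap count is not documented; round() is assumed.
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]

    assignments = {}
    for i, user in enumerate(usernames):
        own = rest[i::len(usernames)]  # round-robin: share sizes differ by at most one
        assignments[user] = sorted(overlap + own)
    return assignments


result = assign_docs(range(1, 21), ["rafmacalaba", "rafaelmacalaba"])
for user, docs in result.items():
    print(user, docs)
```

With 20 documents and a 10% overlap, each of the two annotators receives the 2 shared documents plus 9 of the remaining 18.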

	### Adding a New Annotator

	1. Add to `annotation_data/annotator_config.yaml`:
	```yaml
	  - username: new_hf_username
	    docs: []
	```
	2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
	3. Add the username to `ALLOWED_USERS` in the Space settings

	### Manual Editing

	You can manually edit the `docs` array for any annotator in the YAML file, then upload:
	```bash
	uv run --with huggingface_hub python3 -c "
	from huggingface_hub import HfApi
	api = HfApi()
	api.upload_file(
	    path_or_fileobj='annotation_data/annotator_config.yaml',  # local file
	    path_in_repo='annotation_data/annotator_config.yaml',     # destination path in the repo
	    repo_id='ai4data/annotation_data',
	    repo_type='dataset',
	)
	"
	```

	---

	## Per-Annotator Validation (Overlap Support)

	Each dataset mention stores validations per-annotator in a `validations` array:

	```json
	{
	  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
	  "dataset_tag": "named",
	  "validations": [
	    {
	      "annotator": "rafmacalaba",
	      "human_validated": true,
	      "human_verdict": true,
	      "human_notes": null,
	      "validated_at": "2025-02-24T11:00:00Z"
	    },
	    {
	      "annotator": "rafaelmacalaba",
	      "human_validated": true,
	      "human_verdict": false,
	      "human_notes": "This is a study name, not a dataset",
	      "validated_at": "2025-02-24T11:05:00Z"
	    }
	  ]
	}
	```

	Key behavior:
	- Each annotator sees only their own validation status, so one annotator's verdict cannot bias another's
	- Progress bar and "Next" prompt count only your verifications
	- Tag edits (`dataset_tag`) are shared — they're factual, not judgment-based
	- Re-validating updates your existing entry (doesn't create duplicates)
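
The update-in-place and per-annotator visibility rules could look roughly like this. The app itself is written in JavaScript, so this Python sketch only mirrors the described behavior; the helper names `upsert_validation` and `validation_for` are hypothetical.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Insert or replace this annotator's entry in mention["validations"]."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation replaces the old entry
            return mention
    validations.append(entry)  # first validation by this annotator
    return mention


def validation_for(mention, annotator):
    """Return only this annotator's entry, or None if they have not validated yet."""
    return next(
        (v for v in mention.get("validations", []) if v.get("annotator") == annotator),
        None,
    )


mention = {"dataset_name": {"text": "DHS Survey", "confidence": 0.95}, "dataset_tag": "named"}
upsert_validation(mention, "rafmacalaba", True)
upsert_validation(mention, "rafmacalaba", False, notes="Example note")
print(len(mention["validations"]))
```

Validating twice as the same annotator leaves a single entry holding the latest verdict, and `validation_for` never exposes another annotator's entry.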

	---

	## Data Pipeline

	### `prepare_data.py` — Prepare & Upload Documents

	```bash
	# Dry run (scan only):
	uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

	# Upload missing docs + update links:
	uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

	# Only update wbg_pdf_links.json:
	uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
	```

	This script:
	- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
	- Detects each document's language with `langdetect` and excludes non-English documents (Arabic, French)
	- Uploads the English docs to the HF dataset
	- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields
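
A minimal sketch of the scan and language-filter steps is shown below. It is not the code of `prepare_data.py`: the helper names, the recursive directory search, and the shape of the inputs are assumptions. The language detector is passed in as a function so the example runs without extra packages; in practice `langdetect.detect` could be supplied.

```python
from pathlib import Path


def find_judged_files(root):
    """Return every *_direct_judged.jsonl file under root, sorted for stable output."""
    return sorted(Path(root).rglob("*_direct_judged.jsonl"))


def keep_english(texts_by_doc, detect):
    """Split docs into (english, excluded) using a detect(text) -> language code callable."""
    english, excluded = {}, {}
    for doc_id, text in texts_by_doc.items():
        lang = detect(text)
        (english if lang == "en" else excluded)[doc_id] = lang
    return english, excluded


def toy_detect(text):
    """Stand-in for langdetect.detect, used only to keep this demo dependency-free."""
    return "fr" if "données" in text else "en"


english, excluded = keep_english(
    {"doc_1": "Household survey data", "doc_2": "Enquête sur les données"},
    toy_detect,
)
print(english, excluded)
```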

	---

	## Leaderboard 🏆

	Click 🏆 Leaderboard in the top bar to see annotator rankings.

	| Metric | Description |
	|--------|-------------|
	| ✅ Verified | Number of mentions validated |
	| ✍️ Added | Manually added dataset mentions |
	| 📄 Docs | Number of documents worked on |
	| ⭐ Score | `Verified + Added` |

	Leaderboard results are cached for 2 minutes.
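
The scoring rule and the two-minute cache can be sketched as follows. The app's real implementation lives in `leaderboard/route.js`; the function names, the input shape, and the sample numbers here are invented for illustration.

```python
import time

TTL_SECONDS = 120  # two minutes
_cache = {"at": None, "rows": None}


def leaderboard(stats):
    """Rank annotators by score = verified + added, highest first."""
    rows = [{**row, "score": row["verified"] + row["added"]} for row in stats]
    return sorted(rows, key=lambda r: r["score"], reverse=True)


def cached_leaderboard(load_stats, now=time.monotonic):
    """Recompute the ranking only when the cached copy is older than TTL_SECONDS."""
    t = now()
    if _cache["at"] is None or t - _cache["at"] > TTL_SECONDS:
        _cache["rows"] = leaderboard(load_stats())
        _cache["at"] = t
    return _cache["rows"]


# Made-up numbers, for demonstration only.
stats = [
    {"annotator": "rafaelmacalaba", "verified": 42, "added": 1, "docs": 7},
    {"annotator": "rafmacalaba", "verified": 40, "added": 5, "docs": 6},
]
for row in leaderboard(stats):
    print(row["annotator"], row["score"])
```

Because of the cache, a validation submitted just now may take up to two minutes to appear in the rankings.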

	---

	## API Endpoints

	| Endpoint | Method | Description |
	|----------|--------|-------------|
	| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
	| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
	| `/api/validate` | PUT | Submit validation for a dataset mention |
	| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
	| `/api/leaderboard` | GET | Annotator rankings |
	| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
	| `/api/auth/login` | GET | Start HF OAuth flow |
	| `/api/auth/callback` | GET | OAuth callback |
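
For scripting against the GET endpoints, a small helper along these lines may be useful. `BASE_URL` is a placeholder, the helper names are hypothetical, and the assumption that responses are JSON and reachable without a signed-in session has not been verified against the app.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://example.invalid"  # placeholder; replace with the Space URL


def api_url(path, **params):
    """Build a full endpoint URL with URL-encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path, **params):
    """GET an endpoint and decode the body, assuming the response is JSON."""
    with urlopen(api_url(path, **params)) as resp:
        return json.load(resp)


print(api_url("/api/documents", user="rafmacalaba"))
print(api_url("/api/document", index=3, page=2))
print(api_url("/api/leaderboard"))
```

Only URL construction is exercised here; `get_json` needs a real `BASE_URL` before it can be called.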

	---

	## Architecture

	```
	hf_spaces_docker/
	├── app/
	│   ├── page.js                    # Main app (client component)
	│   ├── globals.css                # All styles
	│   ├── api/
	│   │   ├── documents/route.js     # Doc listing + user filtering
	│   │   ├── document/route.js      # Single page data
	│   │   ├── validate/route.js      # Validate/delete mentions
	│   │   ├── leaderboard/route.js   # Leaderboard stats
	│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
	│   │   └── auth/                  # HF OAuth login/callback
	│   └── components/
	│       ├── AnnotationPanel.js     # Side panel with dataset cards
	│       ├── AnnotationModal.js     # Manual annotation dialog
	│       ├── DocumentSelector.js    # Document dropdown
	│       ├── Leaderboard.js         # Leaderboard modal
	│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
	│       ├── PageNavigator.js       # Prev/Next page buttons
	│       ├── PdfViewer.js           # PDF iframe with loading state
	│       └── ProgressBar.js         # PDF/Page/Verified pills
	├── annotation_data/
	│   ├── annotator_config.yaml      # Annotator assignments
	│   └── wbg_data/
	│       └── wbg_pdf_links.json     # Doc registry with URLs
	├── prepare_data.py                # Upload docs to HF
	└── generate_assignments.py        # Auto-assign docs to annotators
	```

	---

	## Environment Variables

	| Variable | Required | Description |
	|----------|----------|-------------|
	| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
	| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
	| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
	| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
	| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
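
Before deploying, a quick pre-flight check of these variables can catch configuration mistakes. The sketch below is a standalone helper, not part of the app: the function names are hypothetical, and treating all five variables as required matches the Space deployment case only.

```python
import os

REQUIRED = [
    "HF_TOKEN",
    "OAUTH_CLIENT_ID",
    "OAUTH_CLIENT_SECRET",
    "ALLOWED_USERS",
    "NEXTAUTH_SECRET",
]


def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def allowed_users(env=os.environ):
    """Parse the comma-separated ALLOWED_USERS value into a clean list of usernames."""
    raw = env.get("ALLOWED_USERS", "")
    return [user.strip() for user in raw.split(",") if user.strip()]


# Demo with an explicit dict instead of the real environment.
demo_env = {"HF_TOKEN": "placeholder", "ALLOWED_USERS": " rafmacalaba, rafaelmacalaba ,"}
print(missing_vars(demo_env))
print(allowed_users(demo_env))
```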