# 📝 Annotation Tool – Guide
A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.
---
## Quick Start
### For Annotators
1. Go to the Space URL and click **🤗 Sign in with HuggingFace**
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with **← Prev / Next →**
4. Open the **Data Mentions** panel to validate each mention
5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📄 Page 2/12 | 🏷️ Verified 4/8`
### Validation Actions
| Action | What it does |
|--------|-------------|
| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
| **Click tag badge** | Change dataset type (named, descriptive, generic) |
| **Highlight text β Annotate** | Manually add a dataset mention the AI missed |
| 🗑️ **Delete** | Remove a dataset entry entirely |
> **Tip:** If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.
---
## Document Assignments
Each annotator sees only their assigned documents. A configurable percentage (default 10%) are **overlap documents** shared across all annotators for inter-annotator agreement measurement.
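A simple way to use the overlap set is raw percent agreement between two annotators' verdicts on the same mentions (the project may prefer a chance-corrected metric such as Cohen's kappa). A minimal sketch, with an illustrative function name:

```python
def percent_agreement(verdicts_a, verdicts_b):
    """Fraction of overlap-document mentions on which two annotators
    gave the same verdict (True = real dataset mention)."""
    pairs = list(zip(verdicts_a, verdicts_b))
    return sum(a == b for a, b in pairs) / len(pairs)

# e.g. two annotators disagree on one of three shared mentions:
# percent_agreement([True, True, False], [True, False, False])
```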
### Configuration File
`annotation_data/annotator_config.yaml`:
```yaml
settings:
  overlap_percent: 10   # % of docs shared between all annotators

annotators:
  - username: rafmacalaba   # HuggingFace username
    docs: [2, 3, 14, ...]   # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
```
### Auto-Generate Assignments
```bash
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run
# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py
# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
```
The script:
- Reads `annotator_config.yaml` for the annotator list and overlap %
- Shuffles all available docs (deterministic seed=42)
- Reserves `overlap_percent` docs shared by ALL annotators
- Splits the rest evenly across annotators
- Saves back to the YAML
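The steps above amount to something like the following sketch (function and variable names are illustrative, not `generate_assignments.py`'s actual code):

```python
import random

def assign_docs(doc_ids, annotators, overlap_percent=10, seed=42):
    """Deterministically split doc_ids into a shared overlap slice
    plus an even round-robin split of the remainder."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)                     # deterministic shuffle
    n_overlap = round(len(docs) * overlap_percent / 100)  # docs every annotator sees
    overlap, rest = docs[:n_overlap], docs[n_overlap:]
    assignments = {name: sorted(overlap) for name in annotators}
    for i, doc in enumerate(rest):                        # split the rest evenly
        assignments[annotators[i % len(annotators)]].append(doc)
    return assignments

assignments = assign_docs(range(100), ["rafmacalaba", "rafaelmacalaba"])
```

Because the shuffle is seeded, re-running the script yields the same split as long as the document list and config are unchanged.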
### Adding a New Annotator
1. Add to `annotation_data/annotator_config.yaml`:
```yaml
- username: new_hf_username
  docs: []
```
2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
3. Add the username to `ALLOWED_USERS` in the Space settings
### Manual Editing
You can manually edit the `docs` array for any annotator in the YAML file, then upload:
```bash
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset')
"
```
---
## Per-Annotator Validation (Overlap Support)
Each dataset mention stores validations per-annotator in a `validations` array:
```json
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
```
**Key behavior:**
- Each annotator only sees **their own** validation status (no bias)
- Progress bar and "Next" prompt count only **your** verifications
- Tag edits (`dataset_tag`) are shared – they're factual, not judgment-based
- Re-validating updates your existing entry (doesn't create duplicates)
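The upsert rule (one entry per annotator, updated in place on re-validation) can be sketched like this; `record_validation` is an illustrative name, not the app's actual handler:

```python
from datetime import datetime, timezone

def record_validation(mention, annotator, verdict, notes=None):
    """Upsert this annotator's entry in the mention's `validations` array."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).isoformat(),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing["annotator"] == annotator:  # re-validation: replace, don't duplicate
            validations[i] = entry
            return mention
    validations.append(entry)
    return mention
```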
---
## Data Pipeline
### `prepare_data.py` – Prepare & Upload Documents
```bash
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run
# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py
# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
```
This script:
- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects each document's language with `langdetect` and excludes non-English docs (e.g. Arabic, French)
- Uploads English docs to HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields
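The language filter can be sketched as below. The detector is injectable here so the logic is testable without the third-party `langdetect` package (which the real script pulls in via `--with langdetect`):

```python
def is_english(text, detect=None):
    """Return True when the detector labels `text` as English ('en')."""
    if detect is None:
        # third-party: pip install langdetect
        from langdetect import detect
    try:
        return detect(text) == "en"
    except Exception:
        # langdetect raises on empty/featureless text; treat as not English
        return False
```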
---
## Leaderboard 🏆
Click **🏆 Leaderboard** in the top bar to see annotator rankings.
| Metric | Description |
|--------|-------------|
| ✅ Verified | Number of mentions validated |
| ✏️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | `Verified + Added` |
Cached for 2 minutes.
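The 2-minute cache boils down to a timestamp check. A minimal sketch in Python (the actual route is JavaScript, in `app/api/leaderboard/route.js`):

```python
import time

_cache = {"data": None, "at": 0.0}
TTL = 120  # leaderboard is cached for 2 minutes

def get_leaderboard(compute, now=time.monotonic):
    """Return cached rankings, recomputing only after the TTL lapses."""
    if _cache["data"] is None or now() - _cache["at"] >= TTL:
        _cache["data"] = compute()
        _cache["at"] = now()
    return _cache["data"]
```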
---
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |
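As an illustration, a validation request can be assembled with only the standard library. The payload field names and the Space URL below are assumptions – check `app/api/validate/route.js` for the real schema:

```python
import json
from urllib import request

BASE = "https://example.hf.space"  # hypothetical Space URL

def build_validate_request(doc, page, idx, verdict, annotator):
    """Construct (but don't send) a PUT /api/validate request.
    Payload fields here are illustrative, not the route's actual schema."""
    body = json.dumps({
        "doc": doc, "page": page, "idx": idx,
        "human_verdict": verdict, "annotator": annotator,
    }).encode()
    return request.Request(
        f"{BASE}/api/validate",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

req = build_validate_request(2, 1, 0, True, "rafmacalaba")
# send with: request.urlopen(req)
```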
---
## Architecture
```
hf_spaces_docker/
├── app/
│   ├── page.js                   # Main app (client component)
│   ├── globals.css               # All styles
│   ├── api/
│   │   ├── documents/route.js    # Doc listing + user filtering
│   │   ├── document/route.js     # Single page data
│   │   ├── validate/route.js     # Validate/delete mentions
│   │   ├── leaderboard/route.js  # Leaderboard stats
│   │   ├── pdf-proxy/route.js    # PDF CORS proxy
│   │   └── auth/                 # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js    # Side panel with dataset cards
│       ├── AnnotationModal.js    # Manual annotation dialog
│       ├── DocumentSelector.js   # Document dropdown
│       ├── Leaderboard.js        # Leaderboard modal
│       ├── MarkdownAnnotator.js  # Text viewer with highlighting
│       ├── PageNavigator.js      # Prev/Next page buttons
│       ├── PdfViewer.js          # PDF iframe with loading state
│       └── ProgressBar.js        # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml     # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json    # Doc registry with URLs
├── prepare_data.py               # Upload docs to HF
└── generate_assignments.py       # Auto-assign docs to annotators
```
---
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |