# 📝 Annotation Tool — Guide

A HuggingFace Spaces app for validating AI-extracted dataset mentions in World Bank documents.

---

## Quick Start

### For Annotators

1. Go to the Space URL and click **🤗 Sign in with HuggingFace**
2. You'll see only your assigned documents in the dropdown
3. Navigate pages with **← Prev / Next →**
4. Open the **Data Mentions** panel to validate each mention
5. Track your progress in the top-right: `Progress: 📄 PDF 3/55 | 📑 Page 2/12 | 🏷️ Verified 4/8`

### Validation Actions

| Action | What it does |
|--------|-------------|
| ✅ **Correct** | Confirms the AI extraction is a real dataset mention |
| ❌ **Incorrect** | Marks the extraction as wrong / not a dataset |
| **Click tag badge** | Change dataset type (named, descriptive, generic) |
| **Highlight text → Annotate** | Manually add a dataset mention the AI missed |
| 🗑️ **Delete** | Remove a dataset entry entirely |

> **Tip:** If you try to click "Next" with unverified mentions, you'll get a confirmation prompt.

---

## Document Assignments

Each annotator sees only their assigned documents. A configurable percentage of the documents (default 10%) is reserved as **overlap documents**, which are assigned to every annotator so that inter-annotator agreement can be measured.

### Configuration File

`annotation_data/annotator_config.yaml`:
```yaml
settings:
  overlap_percent: 10  # % of docs shared between all annotators

annotators:
  - username: rafmacalaba     # HuggingFace username
    docs: [2, 3, 14, ...]     # assigned doc indices
  - username: rafaelmacalaba
    docs: [1, 2, 10, ...]
```

### Auto-Generate Assignments

```bash
# Preview assignment distribution:
uv run --with pyyaml python3 generate_assignments.py --dry-run

# Generate and save locally:
uv run --with pyyaml python3 generate_assignments.py

# Generate, save, and upload to HF:
uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload
```

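The script:
- Reads `annotator_config.yaml` for the annotator list and the overlap percentage
- Shuffles all available docs deterministically (seed=42), so repeated runs give the same result
- Reserves `overlap_percent`% of the docs as overlap documents shared by ALL annotators
- Splits the remaining docs evenly across annotators
- Writes the resulting assignments back to `annotator_config.yaml`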
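
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `generate_assignments.py`: the function name `assign_documents`, the rounding of the overlap count, and the round-robin split are all assumptions.

```python
import random


def assign_documents(doc_ids, usernames, overlap_percent=10, seed=42):
    """Return {username: sorted doc ids}; overlap docs go to every annotator."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)  # fixed seed makes the shuffle repeatable

    # Assumption: the overlap count is rounded to the nearest whole document.
    n_overlap = round(len(docs) * overlap_percent / 100)
    overlap, rest = docs[:n_overlap], docs[n_overlap:]

    assignments = {user: list(overlap) for user in usernames}
    # Assumption: round-robin dealing, so shares differ by at most one doc.
    for i, doc in enumerate(rest):
        assignments[usernames[i % len(usernames)]].append(doc)
    return {user: sorted(ids) for user, ids in assignments.items()}


if __name__ == "__main__":
    for user, ids in assign_documents(range(20), ["alice", "bob"]).items():
        print(user, ids)
```

With 20 documents, two annotators, and 10% overlap, each annotator gets 2 shared documents plus 9 of their own.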

### Adding a New Annotator

1. Add to `annotation_data/annotator_config.yaml`:
   ```yaml
   - username: new_hf_username
     docs: []
   ```
2. Re-run: `uv run --with pyyaml,huggingface_hub python3 generate_assignments.py --upload`
3. Add the username to `ALLOWED_USERS` in the Space settings

### Manual Editing

You can manually edit the `docs` array for any annotator in the YAML file, then upload:
```bash
uv run --with huggingface_hub python3 -c "
from huggingface_hub import HfApi

# upload_file only accepts keyword arguments
api = HfApi()
api.upload_file(
    path_or_fileobj='annotation_data/annotator_config.yaml',
    path_in_repo='annotation_data/annotator_config.yaml',
    repo_id='ai4data/annotation_data',
    repo_type='dataset',
)
"
```

---

## Per-Annotator Validation (Overlap Support)

Each dataset mention stores validations per-annotator in a `validations` array:

```json
{
  "dataset_name": { "text": "DHS Survey", "confidence": 0.95 },
  "dataset_tag": "named",
  "validations": [
    {
      "annotator": "rafmacalaba",
      "human_validated": true,
      "human_verdict": true,
      "human_notes": null,
      "validated_at": "2025-02-24T11:00:00Z"
    },
    {
      "annotator": "rafaelmacalaba",
      "human_validated": true,
      "human_verdict": false,
      "human_notes": "This is a study name, not a dataset",
      "validated_at": "2025-02-24T11:05:00Z"
    }
  ]
}
```

**Key behavior:**
- Each annotator sees only **their own** validation status, so other annotators' verdicts cannot bias them
- Progress bar and "Next" prompt count only **your** verifications
- Tag edits (`dataset_tag`) are shared — they're factual, not judgment-based
- Re-validating updates your existing entry (doesn't create duplicates)
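
The update-in-place behavior can be sketched as follows. This is a Python illustration of the logic only; the Space's actual handler lives in `validate/route.js`, and the helper names `upsert_validation` and `own_validation` are hypothetical.

```python
from datetime import datetime, timezone


def upsert_validation(mention, annotator, verdict, notes=None):
    """Record one annotator's verdict, replacing their earlier entry if any."""
    entry = {
        "annotator": annotator,
        "human_validated": True,
        "human_verdict": verdict,
        "human_notes": notes,
        "validated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    validations = mention.setdefault("validations", [])
    for i, existing in enumerate(validations):
        if existing.get("annotator") == annotator:
            validations[i] = entry  # re-validation overwrites, never duplicates
            break
    else:
        validations.append(entry)  # first verdict from this annotator
    return mention


def own_validation(mention, annotator):
    """Return only the caller's entry; other annotators' verdicts stay hidden."""
    for validation in mention.get("validations", []):
        if validation.get("annotator") == annotator:
            return validation
    return None
```

Keying the lookup on `annotator` is what keeps overlap documents at one entry per person, regardless of how often a verdict is changed.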

---

## Data Pipeline

### `prepare_data.py` — Prepare & Upload Documents

```bash
# Dry run (scan only):
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --dry-run

# Upload missing docs + update links:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py

# Only update wbg_pdf_links.json:
uv run --with huggingface_hub,requests,langdetect python3 prepare_data.py --links-only
```

This script:
- Scans local `annotation_data/wbg_extractions/` for real `_direct_judged.jsonl` files
- Detects each document's language using `langdetect` and excludes non-English documents (e.g. Arabic, French)
- Uploads English docs to HF dataset
- Updates `wbg_pdf_links.json` with `has_revalidation` and `language` fields
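
The scan-and-filter steps might look roughly like this. It is a sketch, not the contents of `prepare_data.py`: the function names are hypothetical, the recursive search assumes extraction files may sit in subfolders, and the language detector is passed in as a callable so the example runs without `langdetect` installed.

```python
from pathlib import Path

SUFFIX = "_direct_judged.jsonl"


def find_extraction_files(root):
    """Return every *_direct_judged.jsonl file under root, sorted by path."""
    return sorted(Path(root).rglob(f"*{SUFFIX}"))


def keep_english(samples, detect):
    """Split {doc_id: text} into English and non-English docs.

    `detect` is any callable that maps text to an ISO 639-1 code such as "en".
    Returns (kept, skipped), each mapping doc_id to the detected code.
    """
    kept, skipped = {}, {}
    for doc_id, text in samples.items():
        lang = detect(text)
        (kept if lang == "en" else skipped)[doc_id] = lang
    return kept, skipped
```

In practice `detect` would be `langdetect.detect`, which returns codes like `"en"`, `"fr"`, or `"ar"`.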

---

## Leaderboard 🏆

Click **🏆 Leaderboard** in the top bar to see annotator rankings.

| Metric | Description |
|--------|-------------|
| ✅ Verified | Number of mentions validated |
| ✍️ Added | Manually added dataset mentions |
| 📄 Docs | Number of documents worked on |
| ⭐ Score | `Verified + Added` |

Leaderboard results are cached for 2 minutes, so new validations can take up to that long to appear.
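
The scoring rule and the 2-minute cache can be sketched as below. This Python version only illustrates the idea; the real logic is in `leaderboard/route.js`, and the names `rank_annotators` and `TTLCache`, along with sorting by score in descending order, are assumptions.

```python
import time

CACHE_TTL_SECONDS = 120  # the 2-minute cache window


def rank_annotators(stats):
    """stats: {username: {"verified": int, "added": int, "docs": int}}."""
    rows = [
        {"username": user, **s, "score": s["verified"] + s["added"]}
        for user, s in stats.items()
    ]
    # Assumption: highest score first.
    return sorted(rows, key=lambda row: row["score"], reverse=True)


class TTLCache:
    """Holds one computed value and recomputes it once the TTL has elapsed."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._value = None
        self._expires = None

    def get(self, compute):
        now = self.clock()
        if self._expires is None or now >= self._expires:
            self._value = compute()
            self._expires = now + self.ttl
        return self._value
```

Injecting the clock makes the expiry behavior easy to test without waiting two minutes.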

---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents?user=X` | GET | List documents (filtered by user assignment) |
| `/api/document?index=X&page=Y` | GET | Get page data for a specific document |
| `/api/validate` | PUT | Submit validation for a dataset mention |
| `/api/validate?doc=X&page=Y&idx=Z` | DELETE | Remove a dataset entry |
| `/api/leaderboard` | GET | Annotator rankings |
| `/api/pdf-proxy?url=X` | GET | Proxy PDF downloads (bypasses CORS) |
| `/api/auth/login` | GET | Start HF OAuth flow |
| `/api/auth/callback` | GET | OAuth callback |
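
As an illustration of how the query-string endpoints above are addressed, here is a small Python sketch using only the standard library. `BASE_URL` is a placeholder rather than the real Space URL, the helper names are hypothetical, and the sketch does not cover authentication or the request body of `PUT /api/validate`, neither of which is specified here.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://example.invalid"  # placeholder; substitute the Space URL


def documents_url(user):
    """URL for listing the documents assigned to `user`."""
    return f"{BASE_URL}/api/documents?{urlencode({'user': user})}"


def page_url(index, page):
    """URL for one page of one document."""
    return f"{BASE_URL}/api/document?{urlencode({'index': index, 'page': page})}"


def delete_mention_request(doc, page, idx):
    """Build (but do not send) the DELETE request for one dataset entry."""
    query = urlencode({"doc": doc, "page": page, "idx": idx})
    return Request(f"{BASE_URL}/api/validate?{query}", method="DELETE")


if __name__ == "__main__":
    print(documents_url("rafmacalaba"))
    print(page_url(3, 2))
    print(delete_mention_request(3, 2, 0).full_url)
```

`urlencode` also escapes any special characters in the parameter values.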

---

## Architecture

```
hf_spaces_docker/
├── app/
│   ├── page.js                    # Main app (client component)
│   ├── globals.css                # All styles
│   ├── api/
│   │   ├── documents/route.js     # Doc listing + user filtering
│   │   ├── document/route.js      # Single page data
│   │   ├── validate/route.js      # Validate/delete mentions
│   │   ├── leaderboard/route.js   # Leaderboard stats
│   │   ├── pdf-proxy/route.js     # PDF CORS proxy
│   │   └── auth/                  # HF OAuth login/callback
│   └── components/
│       ├── AnnotationPanel.js     # Side panel with dataset cards
│       ├── AnnotationModal.js     # Manual annotation dialog
│       ├── DocumentSelector.js    # Document dropdown
│       ├── Leaderboard.js         # Leaderboard modal
│       ├── MarkdownAnnotator.js   # Text viewer with highlighting
│       ├── PageNavigator.js       # Prev/Next page buttons
│       ├── PdfViewer.js           # PDF iframe with loading state
│       └── ProgressBar.js         # PDF/Page/Verified pills
├── annotation_data/
│   ├── annotator_config.yaml      # Annotator assignments
│   └── wbg_data/
│       └── wbg_pdf_links.json     # Doc registry with URLs
├── prepare_data.py                # Upload docs to HF
└── generate_assignments.py        # Auto-assign docs to annotators
```

---

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes | HuggingFace API token (read/write) |
| `OAUTH_CLIENT_ID` | Yes (Space) | HF OAuth app client ID |
| `OAUTH_CLIENT_SECRET` | Yes (Space) | HF OAuth app client secret |
| `ALLOWED_USERS` | Yes (Space) | Comma-separated HF usernames |
| `NEXTAUTH_SECRET` | Yes | Secret for cookie signing |
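
A quick pre-deployment check of these variables could look like the sketch below. It is an assumption-laden illustration, not part of the app: the helper names are hypothetical, and trimming whitespace around usernames in `ALLOWED_USERS` is a guess about how the list is parsed.

```python
import os

REQUIRED_VARIABLES = [
    "HF_TOKEN",
    "OAUTH_CLIENT_ID",
    "OAUTH_CLIENT_SECRET",
    "ALLOWED_USERS",
    "NEXTAUTH_SECRET",
]


def missing_variables(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def parse_allowed_users(raw):
    """Split the comma-separated ALLOWED_USERS value into usernames."""
    # Assumption: surrounding whitespace and empty items are ignored.
    return [user.strip() for user in raw.split(",") if user.strip()]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Allowed users:", parse_allowed_users(os.environ["ALLOWED_USERS"]))
```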