---
title: Reliefweb Annotation
emoji: 📊
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: Validation tool for UNHCR/ReliefWeb documents.
---

# 📊 Dataset Annotation Tool

A Gradio-based annotation interface for validating dataset mentions extracted from UNHCR and ReliefWeb PDF documents.

## Features

- **Interactive Annotation**: Review and validate dataset mentions with contextual information
- **AI Comparison**: Compare extraction model predictions with GPT-5.2 judge verdicts
- **Progress Tracking**: Real-time statistics on annotation progress and agreement rates
- **Context Highlighting**: Dataset names are highlighted within their surrounding context
- **Dark Mode**: Toggle between light and dark themes for comfortable annotation
- **Persistent Storage**:
  - 💾 **Download Button**: Manually download annotations anytime
  - ☁️ **Auto-Backup**: Automatic sync to Hugging Face Datasets (optional)

## How to Use

1. Review the **Dataset Name** and its **Context** (highlighted in yellow)
2. Check the **Metadata** for document and stratum information
3. Optionally enable "🤖 Show what the AI thinks" to see model predictions
4. Click **✓ DATASET** if it's a valid dataset mention, or **✗ NOT A DATASET** if it's not
5. Add optional notes if needed
6. Navigate using Previous/Next buttons or skip to the next unannotated sample
7. **Download your progress** using the "💾 Download Annotations" button

## Persistence Options

### Option 1: Manual Download (Always Available)
- Click the "💾 Download Annotations" button to save your progress locally
- The button updates automatically after each annotation
- Recommended: Download periodically to back up your work

### Option 2: Auto-Backup to HF Datasets (Optional)
To enable automatic cloud backup:

1. Create a private dataset on Hugging Face (e.g., `username/reliefweb-annotations`)
2. Get a write token from https://huggingface.co/settings/tokens
3. In your Space settings, add these secrets:
   - `HF_TOKEN`: Your Hugging Face write token
   - `HF_DATASET_REPO`: Your dataset repository (e.g., `username/reliefweb-annotations`)
   - `HF_RELIEFWEB_PDFS_REPO`: PDF dataset repository (e.g., `ai4data/reliefweb-pdfs`)

Once configured, each annotation is pushed to your HF Dataset automatically as soon as it is saved.
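
As a rough illustration of how the secrets above could drive the backup, the sketch below reads `HF_TOKEN` and `HF_DATASET_REPO` and uploads the annotations file with `huggingface_hub`. The function names `backup_settings` and `push_annotations`, the commit message, and the choice to store the file at the repository root are assumptions, not the app's actual code.

```python
import os
from typing import Mapping, Optional


def backup_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return upload settings, or None when auto-backup is not configured."""
    token = env.get("HF_TOKEN")
    repo_id = env.get("HF_DATASET_REPO")
    if not token or not repo_id:
        return None
    return {"token": token, "repo_id": repo_id, "repo_type": "dataset"}


def push_annotations(local_path: str, settings: dict) -> None:
    """Upload the annotations file to the configured dataset repository."""
    # Imported lazily so the tool still starts when huggingface_hub is not installed.
    from huggingface_hub import HfApi

    api = HfApi(token=settings["token"])
    api.upload_file(
        path_or_fileobj=local_path,
        # Assumption: the file is stored at the repository root under its own name.
        path_in_repo=os.path.basename(local_path),
        repo_id=settings["repo_id"],
        repo_type=settings["repo_type"],
        commit_message="Update annotations",
    )
```

Returning `None` when either secret is missing lets the caller skip the upload and fall back to manual download.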

## PDF Viewer

The tool includes an embedded PDF viewer powered by `gradio-pdf`. PDFs can be sourced from:

- **Local files**: Use `--pdf-dir /path/to/pdfs` when running locally
- **Hugging Face Datasets**: Set `HF_RELIEFWEB_PDFS_REPO` environment variable

The PDF viewer automatically navigates to the page containing the dataset mention.
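
The two PDF sources could be combined along the following lines. This is a sketch under assumptions: the helper name `resolve_pdf` is hypothetical, the local directory is checked first, and each PDF is assumed to sit in the dataset repository under the same filename.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_pdf(filename: str, pdf_dir: Optional[str] = None) -> Optional[str]:
    """Return a local path to the PDF, or None if no source provides it."""
    # 1. Local directory, as passed with --pdf-dir.
    if pdf_dir:
        candidate = Path(pdf_dir) / filename
        if candidate.is_file():
            return str(candidate)

    # 2. Hugging Face dataset named by HF_RELIEFWEB_PDFS_REPO.
    repo_id = os.environ.get("HF_RELIEFWEB_PDFS_REPO")
    if repo_id:
        # Imported lazily so local-only use does not require huggingface_hub.
        from huggingface_hub import hf_hub_download

        # Downloads into the local Hugging Face cache and returns the cached path.
        return hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            repo_type="dataset",
            token=os.environ.get("HF_TOKEN"),
        )
    return None
```

The returned path can then be handed to the PDF component together with the page number of the mention.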

## Data

This tool processes validation samples from stratified sampling of UNHCR/ReliefWeb documents, comparing:
- **Extraction Model**: Fine-tuned dataset extraction model
- **Judge (GPT-5.2)**: LLM-based validation of extractions
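
The agreement rates shown in the progress statistics can be thought of as the share of annotated samples where a system's verdict matches the human one. The sketch below assumes verdicts are booleans, with `None` for samples that are not yet annotated; the function name `agreement_rate` is hypothetical.

```python
from typing import Iterable, Optional


def agreement_rate(
    human: Iterable[Optional[bool]],
    other: Iterable[Optional[bool]],
) -> Optional[float]:
    """Share of annotated samples where `other` matches the human verdict."""
    # Skip samples where either side has no verdict yet.
    pairs = [(h, o) for h, o in zip(human, other) if h is not None and o is not None]
    if not pairs:
        return None
    return sum(h == o for h, o in pairs) / len(pairs)
```

Calling it once with the extraction model's verdicts and once with the judge's gives the two rates side by side.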

## Output

Annotations are saved to `validation_sample_filtering_retained_human_validated.jsonl` with:
- Human verdicts (dataset/non-dataset)
- Agreement metrics with extraction model and judge
- Timestamp and optional notes
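
For illustration, one output line could be assembled and appended as follows. All field names (`human_verdict`, `agrees_with_model`, `agrees_with_judge`, `annotated_at`, `notes`, and the input keys `model_is_dataset` and `judge_is_dataset`) are assumptions; the actual schema is defined by `app.py`.

```python
import json
from datetime import datetime, timezone


def build_record(sample: dict, human_is_dataset: bool, notes: str = "") -> dict:
    """Combine the original sample with the human verdict (hypothetical schema)."""
    record = dict(sample)
    record.update({
        "human_verdict": "dataset" if human_is_dataset else "non-dataset",
        # A missing model or judge verdict counts as disagreement in this sketch.
        "agrees_with_model": sample.get("model_is_dataset") == human_is_dataset,
        "agrees_with_judge": sample.get("judge_is_dataset") == human_is_dataset,
        "annotated_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    })
    return record


def append_record(path: str, record: dict) -> None:
    """Append one JSON object per line, as the JSONL format requires."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one line per annotation keeps earlier records intact if the session is interrupted.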

---

Check out the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces-config-reference) for more information.