metadata
title: Reliefweb Annotation
emoji: π
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: Validation tool for UNHCR/ReliefWeb documents.
Dataset Annotation Tool
A Gradio-based annotation interface for validating dataset mentions extracted from UNHCR and ReliefWeb PDF documents.
Features
- Interactive Annotation: Review and validate dataset mentions with contextual information
- AI Comparison: Compare extraction model predictions with GPT-5.2 judge verdicts
- Progress Tracking: Real-time statistics on annotation progress and agreement rates
- Context Highlighting: Dataset names are highlighted within their surrounding context
- Dark Mode: Toggle between light and dark themes for comfortable annotation
- Persistent Storage:
  - 💾 Download Button: Manually download annotations anytime
  - Auto-Backup: Automatic sync to Hugging Face Datasets (optional)
How to Use
- Review the Dataset Name and its Context (highlighted in yellow)
- Check the Metadata for document and stratum information
- Optionally enable "🤖 Show what the AI thinks" to see model predictions
- Click DATASET if it's a valid dataset mention, or NOT A DATASET if it's not
- Add optional notes if needed
- Navigate using Previous/Next buttons or skip to the next unannotated sample
- Download your progress using the "💾 Download Annotations" button
Persistence Options
Option 1: Manual Download (Always Available)
- Click the "💾 Download Annotations" button to save your progress locally
- The button updates automatically after each annotation
- Recommended: Download periodically to back up your work
Option 2: Auto-Backup to HF Datasets (Optional)
To enable automatic cloud backup:
- Create a private dataset on Hugging Face (e.g., username/reliefweb-annotations)
- Get a write token from https://huggingface.co/settings/tokens
- In your Space settings, add these secrets:
  - HF_TOKEN: Your Hugging Face write token
  - HF_DATASET_REPO: Your dataset repository (e.g., username/reliefweb-annotations)
  - HF_RELIEFWEB_PDFS_REPO: PDF dataset repository (e.g., ai4data/reliefweb-pdfs)
Once configured, every new annotation triggers an automatic push to your HF Dataset.
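As a rough illustration of how the secrets above could be consumed, here is a minimal sketch. It is not the tool's actual implementation: the names BackupConfig, load_backup_config and push_annotations are hypothetical, and the choice to treat auto-backup as disabled when either secret is missing is an assumption. The upload uses HfApi.upload_file from the huggingface_hub library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackupConfig:
    """Hypothetical container for the two secrets needed for auto-backup."""
    token: str
    dataset_repo: str


def load_backup_config(env: Mapping[str, str] = os.environ) -> Optional[BackupConfig]:
    """Return the backup settings, or None if auto-backup is not configured."""
    token = env.get("HF_TOKEN", "").strip()
    repo = env.get("HF_DATASET_REPO", "").strip()
    # Assumption: both secrets must be present, otherwise backup is skipped.
    if not token or not repo:
        return None
    return BackupConfig(token=token, dataset_repo=repo)


def push_annotations(local_path: str, config: BackupConfig) -> None:
    """Upload the annotations file to the dataset repo (needs network access)."""
    # Imported lazily so the config check works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=config.token)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=config.dataset_repo,
        repo_type="dataset",
        commit_message="Auto-backup annotations",
    )
```

A caller would typically run load_backup_config() once at startup and call push_annotations only when it returns a config, so the manual download path keeps working when no secrets are set.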
PDF Viewer
The tool includes an embedded PDF viewer powered by gradio-pdf. PDFs can be sourced from:
- Local files: Use --pdf-dir /path/to/pdfs when running locally
- Hugging Face Datasets: Set the HF_RELIEFWEB_PDFS_REPO environment variable
The PDF viewer automatically navigates to the page containing the dataset mention.
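The two PDF sources could be selected along these lines. This is a sketch under stated assumptions, not the app's real code: the function names are hypothetical, and giving the local --pdf-dir flag precedence over the HF_RELIEFWEB_PDFS_REPO variable is an assumption about the intended behavior.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence, Tuple


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the --pdf-dir flag mentioned above (other flags omitted)."""
    parser = argparse.ArgumentParser(description="Annotation tool (sketch)")
    parser.add_argument("--pdf-dir", default=None, help="Directory containing local PDFs")
    return parser.parse_args(argv)


def resolve_pdf_source(
    pdf_dir: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Decide where PDFs come from: a local directory, a Hub dataset, or nowhere."""
    # Assumption: an explicit local directory wins over the environment variable.
    if pdf_dir:
        return ("local", pdf_dir)
    repo = env.get("HF_RELIEFWEB_PDFS_REPO", "").strip()
    if repo:
        return ("hub", repo)
    return ("none", None)
```

For example, resolve_pdf_source(parse_args(["--pdf-dir", "/path/to/pdfs"]).pdf_dir) selects the local directory regardless of what the environment contains.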
Data
This tool processes validation samples drawn by stratified sampling from UNHCR/ReliefWeb documents and compares two sources of predictions:
- Extraction Model: Fine-tuned dataset extraction model
- Judge (GPT-5.2): LLM-based validation of extractions
Output
Annotations are saved to validation_sample_filtering_retained_human_validated.jsonl with:
- Human verdicts (dataset/non-dataset)
- Agreement metrics with extraction model and judge
- Timestamp and optional notes
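To make the output format more concrete, the sketch below shows one way a record with these fields could be built, appended to a JSONL file, and summarized as an agreement rate. The field names (sample_id, human_verdict, agrees_with_model, agrees_with_judge, notes, timestamp) are illustrative assumptions; the actual schema of the output file may differ.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Optional


def build_record(sample_id: str, human_verdict: str, model_verdict: str,
                 judge_verdict: str, notes: str = "") -> dict:
    """Build one annotation record (hypothetical field names)."""
    return {
        "sample_id": sample_id,
        "human_verdict": human_verdict,  # "dataset" or "non-dataset"
        "agrees_with_model": human_verdict == model_verdict,
        "agrees_with_judge": human_verdict == judge_verdict,
        "notes": notes,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_record(path: str, record: dict) -> None:
    """Append a record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def agreement_rate(records: Iterable[dict], key: str) -> Optional[float]:
    """Fraction of records where the given agreement flag is true."""
    records = list(records)
    if not records:
        return None  # no annotations yet, so no rate to report
    return sum(1 for r in records if r[key]) / len(records)
```

Appending one line per annotation keeps earlier records intact if a session is interrupted, which suits a tool that saves after every click.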
Check out the Hugging Face Spaces documentation for more information.