---
title: Reliefweb Annotation
emoji: 📊
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: Validation tool for UNHCR/ReliefWeb documents.
---

# 📊 Dataset Annotation Tool

A Gradio-based annotation interface for validating dataset mentions extracted from UNHCR and ReliefWeb PDF documents.

## Features

- **Interactive Annotation**: Review and validate dataset mentions with contextual information
- **AI Comparison**: Compare extraction model predictions with GPT-5.2 judge verdicts
- **Progress Tracking**: Real-time statistics on annotation progress and agreement rates
- **Context Highlighting**: Dataset names are highlighted within their surrounding context
- **Dark Mode**: Toggle between light and dark themes for comfortable annotation
- **Persistent Storage**:
  - 💾 **Download Button**: Manually download annotations anytime
  - ☁️ **Auto-Backup**: Automatic sync to Hugging Face Datasets (optional)

## How to Use

1. Review the **Dataset Name** and its **Context** (highlighted in yellow)
2. Check the **Metadata** for document and stratum information
3. Optionally enable "🤖 Show what the AI thinks" to see model predictions
4. Click **✓ DATASET** if it's a valid dataset mention, or **✗ NOT A DATASET** if it's not
5. Add optional notes if needed
6. Navigate using the Previous/Next buttons, or skip to the next unannotated sample
7. **Download your progress** using the "💾 Download Annotations" button

## Persistence Options

### Option 1: Manual Download (Always Available)

- Click the "💾 Download Annotations" button to save your progress locally
- The button updates automatically after each annotation
- Recommended: Download periodically to back up your work

### Option 2: Auto-Backup to HF Datasets (Optional)

To enable automatic cloud backup:

1. Create a private dataset on Hugging Face (e.g., `username/reliefweb-annotations`)
2. Get a write token from https://huggingface.co/settings/tokens
3. In your Space settings, add these secrets:
   - `HF_TOKEN`: Your Hugging Face write token
   - `HF_DATASET_REPO`: Your dataset repository (e.g., `username/reliefweb-annotations`)
   - `HF_RELIEFWEB_PDFS_REPO`: PDF dataset repository (e.g., `ai4data/reliefweb-pdfs`)

Once configured, each annotation is automatically pushed to your HF Dataset as soon as it is saved.

## PDF Viewer

The tool includes an embedded PDF viewer powered by `gradio-pdf`. PDFs can be sourced from:

- **Local files**: Use `--pdf-dir /path/to/pdfs` when running locally
- **Hugging Face Datasets**: Set the `HF_RELIEFWEB_PDFS_REPO` environment variable

The PDF viewer automatically navigates to the page containing the dataset mention.

## Data

This tool processes validation samples drawn by stratified sampling from UNHCR/ReliefWeb documents, comparing:

- **Extraction Model**: Fine-tuned dataset extraction model
- **Judge (GPT-5.2)**: LLM-based validation of extractions

## Output

Annotations are saved to `validation_sample_filtering_retained_human_validated.jsonl` with:

- Human verdicts (dataset/non-dataset)
- Agreement metrics with extraction model and judge
- Timestamp and optional notes

---

Check out the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces-config-reference) for more information.
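
### Example: checking the auto-backup configuration

The secrets listed under "Option 2: Auto-Backup to HF Datasets" reach the app as environment variables. The following is a minimal sketch of how they might be read and used, not the code in `app.py`. The names `BackupConfig`, `load_backup_config`, and `push_annotations` are hypothetical, and the rule that backup needs both `HF_TOKEN` and `HF_DATASET_REPO` is an assumption. The upload uses `HfApi.upload_file` from `huggingface_hub`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# File name taken from the Output section of this README.
ANNOTATIONS_FILE = "validation_sample_filtering_retained_human_validated.jsonl"


@dataclass(frozen=True)
class BackupConfig:
    """Hypothetical container for the three Space secrets."""
    token: Optional[str]
    dataset_repo: Optional[str]
    pdfs_repo: Optional[str]

    @property
    def enabled(self) -> bool:
        # Assumption: auto-backup needs both a write token and a target repo.
        return bool(self.token and self.dataset_repo)


def load_backup_config(env: Mapping[str, str] = os.environ) -> BackupConfig:
    """Read the secrets; missing or blank values become None."""
    def get(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return BackupConfig(
        token=get("HF_TOKEN"),
        dataset_repo=get("HF_DATASET_REPO"),
        pdfs_repo=get("HF_RELIEFWEB_PDFS_REPO"),
    )


def push_annotations(config: BackupConfig, local_path: str = ANNOTATIONS_FILE) -> bool:
    """Upload the annotations file if backup is configured; return whether it ran."""
    if not config.enabled:
        return False
    # Imported lazily so the sketch still runs without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi(token=config.token).upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=config.dataset_repo,
        repo_type="dataset",
    )
    return True
```

Calling `push_annotations(load_backup_config())` after each saved annotation would mirror the behavior described above. When the secrets are absent it returns `False`, leaving the manual download button as the only way to save progress.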