| title: Reliefweb Annotation | |
| emoji: π | |
| colorFrom: indigo | |
| colorTo: red | |
| sdk: gradio | |
| sdk_version: 6.2.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| short_description: Validation tool for UNHCR/ReliefWeb documents. | |
| # Dataset Annotation Tool | |
| A Gradio-based annotation interface for validating dataset mentions extracted from UNHCR and ReliefWeb PDF documents. | |
| ## Features | |
| - **Interactive Annotation**: Review and validate dataset mentions with contextual information | |
| - **AI Comparison**: Compare extraction model predictions with GPT-5.2 judge verdicts | |
| - **Progress Tracking**: Real-time statistics on annotation progress and agreement rates | |
| - **Context Highlighting**: Dataset names are highlighted within their surrounding context | |
| - **Dark Mode**: Toggle between light and dark themes for comfortable annotation | |
| - **Persistent Storage**: | |
|   - 💾 **Download Button**: Manually download annotations anytime | |
|   - **Auto-Backup**: Automatic sync to Hugging Face Datasets (optional) | |
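The context highlighting feature can be illustrated with a small helper. This is a minimal sketch, not the code in `app.py`: the function name `highlight_mention` and the choice of `<mark>` tags are assumptions made for illustration, and the matching is a plain case-insensitive substring search.

```python
import html
import re


def highlight_mention(context: str, dataset_name: str) -> str:
    """Wrap each case-insensitive occurrence of dataset_name in <mark> tags.

    Hypothetical helper: the real tool may match and style mentions differently.
    """
    if not dataset_name.strip():
        # Nothing to highlight; still escape so the text is safe to render as HTML.
        return html.escape(context)

    pattern = re.compile(re.escape(dataset_name), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(context):
        # Escape the text between matches, then wrap the match itself.
        parts.append(html.escape(context[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group(0))}</mark>")
        last = match.end()
    parts.append(html.escape(context[last:]))
    return "".join(parts)
```

Escaping each piece separately keeps characters such as `<` or `&` in the source PDF text from being interpreted as markup when the result is shown in an HTML component.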
| ## How to Use | |
| 1. Review the **Dataset Name** and its **Context** (highlighted in yellow) | |
| 2. Check the **Metadata** for document and stratum information | |
| 3. Optionally enable "Show what the AI thinks" to see model predictions | |
| 4. Click **DATASET** if it's a valid dataset mention, or **NOT A DATASET** if it's not | |
| 5. Add optional notes if needed | |
| 6. Navigate using Previous/Next buttons or skip to the next unannotated sample | |
| 7. **Download your progress** using the "💾 Download Annotations" button | |
| ## Persistence Options | |
| ### Option 1: Manual Download (Always Available) | |
| - Click the "💾 Download Annotations" button to save your progress locally | |
| - The button updates automatically after each annotation | |
| - Recommended: Download periodically to back up your work | |
| ### Option 2: Auto-Backup to HF Datasets (Optional) | |
| To enable automatic cloud backup: | |
| 1. Create a private dataset on Hugging Face (e.g., `username/reliefweb-annotations`) | |
| 2. Get a write token from https://huggingface.co/settings/tokens | |
| 3. In your Space settings, add these secrets: | |
|    - `HF_TOKEN`: Your Hugging Face write token | |
|    - `HF_DATASET_REPO`: Your dataset repository (e.g., `username/reliefweb-annotations`) | |
|    - `HF_RELIEFWEB_PDFS_REPO`: PDF dataset repository (e.g., `ai4data/reliefweb-pdfs`) | |
| Once configured, every new annotation is automatically pushed to your HF Dataset. | |
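As a rough illustration of how the secrets above could drive the backup, the sketch below reads `HF_TOKEN` and `HF_DATASET_REPO` and uploads the annotations file with `huggingface_hub`. The function names `load_backup_config` and `push_annotations` are hypothetical, and uploading the file under its own basename is an assumption; the actual logic in `app.py` may differ.

```python
import os
from typing import Mapping, Optional


def load_backup_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return backup settings, or None when either secret is missing."""
    token = env.get("HF_TOKEN", "").strip()
    repo_id = env.get("HF_DATASET_REPO", "").strip()
    if not token or not repo_id:
        # Auto-backup stays disabled; the manual download button still works.
        return None
    return {"token": token, "repo_id": repo_id}


def push_annotations(local_path: str, config: dict) -> None:
    """Upload the annotations file to the dataset repo (requires network access)."""
    # Imported lazily so the rest of the app runs without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=config["token"])
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),  # assumed layout: file at repo root
        repo_id=config["repo_id"],
        repo_type="dataset",
    )
```

A caller would check `load_backup_config()` once at startup and call `push_annotations(...)` after each saved annotation only when the result is not `None`.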
| ## PDF Viewer | |
| The tool includes an embedded PDF viewer powered by `gradio-pdf`. PDFs can be sourced from: | |
| - **Local files**: Use `--pdf-dir /path/to/pdfs` when running locally | |
| - **Hugging Face Datasets**: Set the `HF_RELIEFWEB_PDFS_REPO` environment variable | |
| The PDF viewer automatically navigates to the page containing the dataset mention. | |
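The two PDF sources could be resolved along the following lines. This is only a sketch: the function name `resolve_pdf_source`, the returned `(kind, location)` tuple, and the rule that `--pdf-dir` takes precedence over `HF_RELIEFWEB_PDFS_REPO` are all assumptions, not documented behavior of the tool.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence, Tuple


def resolve_pdf_source(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Decide where PDFs come from: a local directory, a Hub dataset, or nowhere."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--pdf-dir", default=None)
    # parse_known_args ignores any other flags the app may define.
    args, _unknown = parser.parse_known_args(argv)

    if args.pdf_dir:
        # Assumed precedence: an explicit local directory wins.
        return ("local", args.pdf_dir)

    repo = env.get("HF_RELIEFWEB_PDFS_REPO", "").strip()
    if repo:
        return ("hub", repo)

    return ("none", None)  # viewer has no PDF source configured
```

When `argv` is left as `None`, `argparse` reads the flags from the command line, so the same function works both in tests and when the app is started locally.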
| ## Data | |
| This tool processes validation samples drawn by stratified sampling from UNHCR/ReliefWeb documents and compares two sources of predictions: | |
| - **Extraction Model**: Fine-tuned dataset extraction model | |
| - **Judge (GPT-5.2)**: LLM-based validation of extractions | |
| ## Output | |
| Annotations are saved to `validation_sample_filtering_retained_human_validated.jsonl` with: | |
| - Human verdicts (dataset/non-dataset) | |
| - Agreement metrics with extraction model and judge | |
| - Timestamp and optional notes | |
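To make the output format concrete, the sketch below builds one JSONL record with the three kinds of information listed above and appends it to a file. All field names (`sample_id`, `human_verdict`, `agrees_with_model`, `agrees_with_judge`, `timestamp`, `notes`) are hypothetical; the actual schema of the output file is not specified here.

```python
import json
from datetime import datetime, timezone


def build_record(sample_id, human_is_dataset, model_is_dataset, judge_is_dataset, notes=""):
    """Assemble one annotation record (field names are illustrative assumptions)."""
    return {
        "sample_id": sample_id,
        "human_verdict": "dataset" if human_is_dataset else "non-dataset",
        # Agreement is a simple equality check against each automated verdict.
        "agrees_with_model": human_is_dataset == model_is_dataset,
        "agrees_with_judge": human_is_dataset == judge_is_dataset,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }


def append_record(path, record):
    """Append the record as one JSON object per line (JSONL)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one line per annotation means a partially completed session still leaves a valid file, and agreement rates can be computed later by averaging the two boolean fields.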
| --- | |
| Check out the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces-config-reference) for more information. | |