metadata
title: Reliefweb Annotation
emoji: 📊
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: Validation tool for UNHCR/ReliefWeb documents.

📊 Dataset Annotation Tool

A Gradio-based annotation interface for validating dataset mentions extracted from UNHCR and ReliefWeb PDF documents.

Features

  • Interactive Annotation: Review and validate dataset mentions with contextual information
  • AI Comparison: Compare extraction model predictions with GPT-5.2 judge verdicts
  • Progress Tracking: Real-time statistics on annotation progress and agreement rates
  • Context Highlighting: Dataset names are highlighted within their surrounding context
  • Dark Mode: Toggle between light and dark themes for comfortable annotation
  • Persistent Storage:
    • 💾 Download Button: Manually download annotations anytime
    • ☁️ Auto-Backup: Automatic sync to Hugging Face Datasets (optional)
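
As an illustration of the Context Highlighting feature, here is a minimal sketch of how a dataset name could be marked inside its context before rendering. The function name `highlight_mention` and the use of `<mark>` tags are assumptions made for this example, not the code in app.py.

```python
import html
import re


def highlight_mention(context: str, dataset_name: str) -> str:
    """Return HTML-escaped context with each dataset_name occurrence wrapped in <mark>.

    Hypothetical helper: matching is case-insensitive, and <mark> is used
    because browsers render it with a yellow background by default.
    """
    escaped_context = html.escape(context)
    if not dataset_name:
        return escaped_context
    # Escape the name the same way as the context so both sides match.
    pattern = re.compile(re.escape(html.escape(dataset_name)), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", escaped_context)
```

The result can be passed to any component that renders HTML. Escaping happens before the tags are added, so markup-like characters in the source document are not interpreted as HTML.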

How to Use

  1. Review the Dataset Name and its Context (highlighted in yellow)
  2. Check the Metadata for document and stratum information
  3. Optionally enable "🤖 Show what the AI thinks" to see model predictions
  4. Click ✓ DATASET if it's a valid dataset mention, or ✗ NOT A DATASET if it's not
  5. Add optional notes if needed
  6. Navigate using Previous/Next buttons or skip to the next unannotated sample
  7. Download your progress using the "💾 Download Annotations" button

Persistence Options

Option 1: Manual Download (Always Available)

  • Click the "💾 Download Annotations" button to save your progress locally
  • The button updates automatically after each annotation
  • Recommended: Download periodically to back up your work

Option 2: Auto-Backup to HF Datasets (Optional)

To enable automatic cloud backup:

  1. Create a private dataset on Hugging Face (e.g., username/reliefweb-annotations)
  2. Get a write token from https://huggingface.co/settings/tokens
  3. In your Space settings, add these secrets:
    • HF_TOKEN: Your Hugging Face write token
    • HF_DATASET_REPO: Your dataset repository (e.g., username/reliefweb-annotations)
    • HF_RELIEFWEB_PDFS_REPO: PDF dataset repository (e.g., ai4data/reliefweb-pdfs)

Once these secrets are configured, the tool automatically pushes your annotations to the HF dataset after each annotation.
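
For reference, a backup step along these lines could be written with `huggingface_hub`. This is a sketch under assumptions: the helper names `get_backup_config` and `backup_annotations`, the commit message, and the choice to upload the whole file on every call are illustrative and may differ from what app.py does.

```python
import os
from typing import Mapping, Optional

# Output file name taken from the Output section of this README.
ANNOTATIONS_FILE = "validation_sample_filtering_retained_human_validated.jsonl"


def get_backup_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return token and repo id if both secrets are set, else None (backup disabled)."""
    token = env.get("HF_TOKEN")
    repo_id = env.get("HF_DATASET_REPO")
    if not token or not repo_id:
        return None
    return {"token": token, "repo_id": repo_id}


def backup_annotations(local_path: str = ANNOTATIONS_FILE) -> bool:
    """Upload the annotations file to the dataset repo; return False if not configured."""
    config = get_backup_config()
    if config is None:
        return False
    # Imported lazily so the tool still runs without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=config["token"])
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=config["repo_id"],
        repo_type="dataset",
        commit_message="Auto-backup annotations",  # hypothetical message
    )
    return True
```

Because the backup is optional, the sketch returns `False` instead of raising when either secret is missing, which leaves manual download as the fallback.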

PDF Viewer

The tool includes an embedded PDF viewer powered by gradio-pdf. PDFs can be sourced from:

  • Local files: Use --pdf-dir /path/to/pdfs when running locally
  • Hugging Face Datasets: Set the HF_RELIEFWEB_PDFS_REPO environment variable

The PDF viewer automatically navigates to the page containing the dataset mention.
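
The choice between the two PDF sources could be resolved roughly as follows. This is a sketch: the function name `resolve_pdf_source`, the returned labels, and the precedence (a local `--pdf-dir` wins over the environment variable) are assumptions, since this README does not say which source takes priority.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence, Tuple


def resolve_pdf_source(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return ("local", dir), ("hub", repo_id), or ("none", None)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pdf-dir", type=str, default=None)
    # parse_known_args ignores any other flags the app may define.
    args, _unknown = parser.parse_known_args(argv)

    # Assumed precedence: an explicit local directory overrides the Hub repo.
    if args.pdf_dir is not None:
        return ("local", args.pdf_dir)
    repo_id = env.get("HF_RELIEFWEB_PDFS_REPO")
    if repo_id:
        return ("hub", repo_id)
    return ("none", None)
```

When the source is a Hub repository, individual PDFs could then be fetched on demand, for example with `hf_hub_download(..., repo_type="dataset")` from `huggingface_hub`.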

Data

This tool processes validation samples from stratified sampling of UNHCR/ReliefWeb documents, comparing:

  • Extraction Model: Fine-tuned dataset extraction model
  • Judge (GPT-5.2): LLM-based validation of extractions

Output

Annotations are saved to validation_sample_filtering_retained_human_validated.jsonl with:

  • Human verdicts (dataset/non-dataset)
  • Agreement metrics with extraction model and judge
  • Timestamp and optional notes
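
To make the output format concrete, here is a sketch of how one annotation might be written as a JSONL line and how an agreement rate could be derived from the saved records. All field names (`sample_id`, `human_verdict`, `agrees_with_model`, `agrees_with_judge`, `timestamp`, `notes`) are hypothetical; the actual schema is defined by app.py.

```python
import json
from datetime import datetime, timezone
from typing import Iterable


def build_record(sample_id, human_is_dataset: bool, model_is_dataset: bool,
                 judge_is_dataset: bool, notes: str = "") -> dict:
    """Build one annotation record (hypothetical field names)."""
    return {
        "sample_id": sample_id,
        "human_verdict": "dataset" if human_is_dataset else "non-dataset",
        "agrees_with_model": human_is_dataset == model_is_dataset,
        "agrees_with_judge": human_is_dataset == judge_is_dataset,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }


def append_record(path: str, record: dict) -> None:
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def agreement_rate(records: Iterable[dict], key: str) -> float:
    """Share of records where the boolean field `key` is true; 0.0 if empty."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r[key]) / len(records)
```

Appending one line per annotation keeps each save cheap and leaves earlier lines untouched if the session is interrupted.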

Check out the Hugging Face Spaces documentation for more information.