	---
	title: Reliefweb Annotation
	emoji: 📊
	colorFrom: indigo
	colorTo: red
	sdk: gradio
	sdk_version: 6.2.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: Validation tool for UNHCR/ReliefWeb documents.
	---

	# 📊 Dataset Annotation Tool

	A Gradio-based annotation interface for validating dataset mentions extracted from UNHCR and ReliefWeb PDF documents.

	## Features

	- Interactive Annotation: Review and validate dataset mentions with contextual information
	- AI Comparison: Compare extraction model predictions with GPT-5.2 judge verdicts
	- Progress Tracking: Real-time statistics on annotation progress and agreement rates
	- Context Highlighting: Dataset names are highlighted within their surrounding context
	- Dark Mode: Toggle between light and dark themes for comfortable annotation
	- Persistent Storage:
	  - 💾 Download Button: Manually download annotations anytime
	  - ☁️ Auto-Backup: Automatic sync to Hugging Face Datasets (optional)
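
The context highlighting feature can be pictured with a minimal sketch like the one below. It is not the app's actual code: the function name `highlight_mention` and the choice of `<mark>` tags are assumptions made for illustration. It escapes the context text and wraps every case-insensitive occurrence of the dataset name.

```python
import html
import re


def highlight_mention(context: str, dataset_name: str) -> str:
    """Wrap each case-insensitive occurrence of dataset_name in <mark> tags.

    Hypothetical helper: the real app may highlight differently.
    """
    if not dataset_name:
        return html.escape(context)
    pattern = re.compile(re.escape(dataset_name), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(context):
        # Escape the text between matches so raw HTML in the PDF text is inert.
        parts.append(html.escape(context[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group(0))}</mark>")
        last = match.end()
    parts.append(html.escape(context[last:]))
    return "".join(parts)
```

The returned string could be passed to an HTML component in the interface. Escaping before wrapping keeps markup-like text extracted from a PDF from being rendered.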

	## How to Use

	1. Review the Dataset Name and its Context (highlighted in yellow)
	2. Check the Metadata for document and stratum information
	3. Optionally enable "🤖 Show what the AI thinks" to see model predictions
	4. Click ✓ DATASET if it's a valid dataset mention, or ✗ NOT A DATASET if it's not
	5. Add optional notes if needed
	6. Navigate using Previous/Next buttons or skip to the next unannotated sample
	7. Download your progress using the "💾 Download Annotations" button
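
The "skip to the next unannotated sample" step in the list above could work roughly as follows. This is a sketch under assumptions: samples are taken to be dictionaries, and the `human_verdict` key is a hypothetical field name, not one confirmed by the tool.

```python
from typing import Optional, Sequence


def next_unannotated(samples: Sequence[dict], current: int) -> Optional[int]:
    """Return the index of the next sample lacking a human verdict.

    Searches forward from `current`, wrapping around to the start.
    Returns None when every sample is annotated or the list is empty.
    """
    n = len(samples)
    for offset in range(1, n + 1):
        idx = (current + offset) % n
        # "human_verdict" is an assumed field name for the annotator's decision.
        if samples[idx].get("human_verdict") is None:
            return idx
    return None
```

Wrapping around means samples skipped earlier in the session are revisited before the search reports that nothing is left.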

	## Persistence Options

	### Option 1: Manual Download (Always Available)
	- Click the "💾 Download Annotations" button to save your progress locally
	- The button updates automatically after each annotation
	- Recommended: Download periodically to back up your work

	### Option 2: Auto-Backup to HF Datasets (Optional)
	To enable automatic cloud backup:

	1. Create a private dataset on Hugging Face (e.g., `username/reliefweb-annotations`)
	2. Get a write token from https://huggingface.co/settings/tokens
	3. In your Space settings, add these secrets:
	   - `HF_TOKEN`: Your Hugging Face write token
	   - `HF_DATASET_REPO`: Your dataset repository (e.g., `username/reliefweb-annotations`)
	   - `HF_RELIEFWEB_PDFS_REPO`: PDF dataset repository (e.g., `ai4data/reliefweb-pdfs`)

	Once configured, the tool pushes the annotations file to your HF Dataset after each annotation.
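
As a rough illustration of how such a backup might be wired up, the sketch below reads the two secrets from the environment and uploads the annotations file with `HfApi.upload_file` from `huggingface_hub`. The function names `backup_config` and `push_annotations` and the commit message are assumptions. The app's real implementation may differ.

```python
import os
from typing import Optional


def backup_config() -> Optional[dict]:
    """Read backup settings from the environment; None means backup is disabled."""
    token = os.environ.get("HF_TOKEN")
    repo_id = os.environ.get("HF_DATASET_REPO")
    if not token or not repo_id:
        return None
    return {"token": token, "repo_id": repo_id}


def push_annotations(local_path: str) -> bool:
    """Upload the annotations file if backup is configured (hypothetical helper)."""
    config = backup_config()
    if config is None:
        return False  # only the manual download option is available
    # Imported lazily so the sketch runs without huggingface_hub when backup is off.
    from huggingface_hub import HfApi

    HfApi(token=config["token"]).upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=config["repo_id"],
        repo_type="dataset",
        commit_message="Update annotations",  # assumed message
    )
    return True
```

Returning `False` when a secret is missing lets the caller fall back to manual download without raising an error.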

	## PDF Viewer

	The tool includes an embedded PDF viewer powered by `gradio-pdf`. PDFs can be sourced from:

	- Local files: Use `--pdf-dir /path/to/pdfs` when running locally
	- Hugging Face Datasets: Set the `HF_RELIEFWEB_PDFS_REPO` environment variable

	The PDF viewer automatically navigates to the page containing the dataset mention.
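
The two PDF sources could be combined along these lines. This is a hedged sketch: `parse_args` and `resolve_pdf` are hypothetical names, and the local-first lookup order is an assumption. Only the `--pdf-dir` flag and the `HF_RELIEFWEB_PDFS_REPO` variable come from this README. Page navigation inside the viewer is not covered here.

```python
import argparse
import os
from pathlib import Path
from typing import Optional


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --pdf-dir option used when running locally."""
    parser = argparse.ArgumentParser(description="Annotation tool (sketch)")
    parser.add_argument("--pdf-dir", type=Path, default=None,
                        help="Directory containing local PDF files")
    return parser.parse_args(argv)


def resolve_pdf(filename: str, pdf_dir: Optional[Path]) -> Optional[str]:
    """Return a local path to the PDF, or None if no source provides it."""
    # Assumed order: prefer a local copy, then fall back to the HF dataset repo.
    if pdf_dir is not None:
        candidate = pdf_dir / filename
        if candidate.is_file():
            return str(candidate)
    repo_id = os.environ.get("HF_RELIEFWEB_PDFS_REPO")
    if not repo_id:
        return None
    # Imported lazily so local-only use does not require huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        token=os.environ.get("HF_TOKEN"),
    )
```

`hf_hub_download` caches files locally, so repeated views of the same document should not download it again.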

	## Data

	This tool processes validation samples from stratified sampling of UNHCR/ReliefWeb documents, comparing:
	- Extraction Model: Fine-tuned dataset extraction model
	- Judge (GPT-5.2): LLM-based validation of extractions

	## Output

	Annotations are saved to `validation_sample_filtering_retained_human_validated.jsonl` with:
	- Human verdicts (dataset/non-dataset)
	- Agreement metrics with extraction model and judge
	- Timestamp and optional notes
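
To make the record contents concrete, here is a minimal sketch of how one output line and the agreement rates might be produced. All field names (`human_verdict`, `model_is_dataset`, `judge_is_dataset`, `agrees_with_model`, `agrees_with_judge`, `annotated_at`) are assumptions for illustration. The actual schema of the JSONL file is not documented here.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def _agrees(prediction: Optional[bool], human: bool) -> Optional[bool]:
    """None when no prediction is available, else whether it matches the human."""
    return None if prediction is None else prediction == human


def build_record(sample: dict, human_is_dataset: bool, notes: str = "") -> dict:
    """Attach the human verdict, agreement flags, and a UTC timestamp to a sample."""
    record = dict(sample)
    record["human_verdict"] = "dataset" if human_is_dataset else "non-dataset"
    record["agrees_with_model"] = _agrees(sample.get("model_is_dataset"), human_is_dataset)
    record["agrees_with_judge"] = _agrees(sample.get("judge_is_dataset"), human_is_dataset)
    record["notes"] = notes
    record["annotated_at"] = datetime.now(timezone.utc).isoformat()
    return record


def agreement_rate(records: list, key: str) -> Optional[float]:
    """Fraction of records where the flag under `key` is True, ignoring missing flags."""
    flags = [r[key] for r in records if r.get(key) is not None]
    return sum(flags) / len(flags) if flags else None


def save_jsonl(records: list, path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

Storing the agreement flags per record keeps the progress statistics a simple aggregation over the file.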

	---

	Check out the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces-config-reference) for more information.