Persistence Implementation Summary

Overview

The ReliefWeb Annotation Gradio Space now persists annotations in two ways:

  1. Manual Download Button - Always available, no configuration needed
  2. Auto-Backup to HF Datasets - Optional cloud sync with automatic persistence

Changes Made

1. Dependencies (requirements.txt)

Added:

  • huggingface_hub>=0.20.0 - For HF authentication and API
  • datasets>=2.16.0 - For HF Datasets integration

2. Core Application (app.py)

New Imports

import os
from huggingface_hub import HfApi, login
from datasets import Dataset, load_dataset

ValidationAnnotator Class Updates

Constructor (__init__)

  • Added hf_dataset_repo and hf_token parameters
  • Auto-detects HF credentials from environment variables
  • Attempts to log in and enables HF Datasets backup if credentials are available
  • Loads annotations from both HF Datasets (cloud) and local file
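
The credential auto-detection above could look roughly like the sketch below. `HfConfig` and `resolve_hf_config` are hypothetical names, not taken from app.py; only the environment variable names come from this summary. The sketch decides whether auto-backup can be enabled and leaves out the actual `login(token=...)` call so that it runs without network access.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HfConfig:
    """Hypothetical container for the resolved HF settings."""
    repo: Optional[str]
    token: Optional[str]

    @property
    def enabled(self) -> bool:
        # Auto-backup needs both a dataset repo and a token.
        return bool(self.repo and self.token)


def resolve_hf_config(
    hf_dataset_repo: Optional[str] = None,
    hf_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> HfConfig:
    """Prefer explicit arguments, then fall back to environment variables."""
    repo = hf_dataset_repo or env.get("HF_DATASET_REPO")
    token = hf_token or env.get("HF_TOKEN")
    return HfConfig(repo=repo or None, token=token or None)
```

Passing `env` explicitly keeps the function easy to exercise in tests without touching the real environment.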

New Method: _push_to_hf_datasets()

  • Converts annotations dict to Dataset format
  • Pushes to HF Hub with private visibility
  • Called automatically after each annotation
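
As an illustration of the conversion step, the sketch below turns an annotations dict into the column-oriented mapping that `Dataset.from_dict` accepts. The shape of the dict (`{key: record}`), the `id` column name, and the function name `annotations_to_columns` are assumptions made for the example; the real structure in app.py may differ.

```python
from typing import Any, Dict, List


def annotations_to_columns(
    annotations: Dict[str, Dict[str, Any]],
    key_field: str = "id",
) -> Dict[str, List[Any]]:
    """Turn {key: record} into {column: [values]} with a stable row order."""
    # Collect every field name so records with missing fields still line up.
    fields: List[str] = []
    for record in annotations.values():
        for name in record:
            if name not in fields and name != key_field:
                fields.append(name)

    columns: Dict[str, List[Any]] = {key_field: []}
    columns.update({name: [] for name in fields})
    for key in sorted(annotations):
        columns[key_field].append(key)
        for name in fields:
            # Missing fields become None so all columns keep the same length.
            columns[name].append(annotations[key].get(name))
    return columns


# With the `datasets` library installed, the push itself would then be roughly:
#   Dataset.from_dict(columns).push_to_hub(repo_id, private=True, token=token)
```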

Updated Method: _load_annotations()

  • First tries to load from HF Datasets (cloud backup)
  • Then loads from local file (may have newer annotations)
  • Merges both sources, with local taking precedence
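
The merge rule can be sketched in a few lines, assuming both sources are dicts keyed by the same annotation identifier (`merge_annotations` is a hypothetical name):

```python
from typing import Any, Dict


def merge_annotations(cloud: Dict[str, Any], local: Dict[str, Any]) -> Dict[str, Any]:
    """Combine both sources; on a key collision the local entry wins."""
    merged = dict(cloud)   # start from the cloud backup
    merged.update(local)   # local entries overwrite cloud entries
    return merged
```

Copying `cloud` first and then updating with `local` leaves both inputs untouched and gives local entries precedence on conflicts.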

Updated Method: _save_annotation()

  • Saves to local file (as before)
  • Automatically pushes to HF Datasets if enabled
  • Gracefully handles HF push failures
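
A minimal sketch of the save-then-push flow, under the assumptions that each annotation is appended as one JSON line and that the push is passed in as a callable. `save_annotation` and its parameters are illustrative, not the signature used in app.py.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def save_annotation(
    record: Dict[str, Any],
    output_file: Path,
    push: Optional[Callable[[], None]] = None,
) -> bool:
    """Append one record to the local JSONL file, then try the optional push.

    Returns True if the push ran and succeeded, False otherwise.
    """
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    if push is None:          # auto-backup disabled
        return False
    try:
        push()
        return True
    except Exception as exc:  # a failed backup must not lose the local save
        print(f"HF push failed, annotation kept locally: {exc}")
        return False
```

Writing locally before pushing means a network failure only delays the cloud copy; the annotation itself is already on disk.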

UI Updates

New Download Button

download_btn = gr.DownloadButton(
    "💾 Download Annotations",
    value=str(annotator.output_file) if annotator.output_file.exists() else None,
    size="sm",
    variant="secondary"
)
  • Updates automatically after each annotation
  • Always available, no configuration needed

Status Indicator

  • Shows "☁️ Auto-backup enabled" with link to dataset if HF enabled
  • Shows "⚠️ Auto-backup disabled" with setup hint if not enabled
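
One possible way to build the status text is sketched below. The function name `backup_status` and the exact wording of the hint are assumptions; the dataset link uses the standard `https://huggingface.co/datasets/<repo>` URL form.

```python
from typing import Optional


def backup_status(hf_dataset_repo: Optional[str], enabled: bool) -> str:
    """Return the Markdown status line for the auto-backup indicator."""
    if enabled and hf_dataset_repo:
        url = f"https://huggingface.co/datasets/{hf_dataset_repo}"
        return f"☁️ Auto-backup enabled: [{hf_dataset_repo}]({url})"
    return "⚠️ Auto-backup disabled (set HF_TOKEN and HF_DATASET_REPO to enable)"
```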

Button Click Handlers

  • Accept and Reject buttons now chain to update download button
  • Download button gets fresh file path after each annotation
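
The refresh step could be factored into a small helper like the one below, which mirrors the `value=` expression used when the button is created. `download_value`, `accept_btn`, and `on_accept` are hypothetical names; the wiring is shown only as a comment because it needs a running Gradio app.

```python
from pathlib import Path
from typing import Optional


def download_value(output_file: Path) -> Optional[str]:
    """Path for the download button, or None while no file exists yet."""
    return str(output_file) if output_file.exists() else None


# Wiring sketch (needs Gradio, so it is left as a comment):
#   accept_btn.click(on_accept, ...).then(
#       lambda: download_value(annotator.output_file), outputs=download_btn
#   )
```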

Main Entry Point

# Get HF credentials from environment
hf_dataset_repo = os.getenv("HF_DATASET_REPO")
hf_token = os.getenv("HF_TOKEN")

# Pass to create_app
app = create_app(input_file, hf_dataset_repo, hf_token)

3. Documentation Updates

README.md

  • Added persistence features to feature list
  • New section: "Persistence Options"
  • Instructions for both manual download and auto-backup
  • Step-by-step HF Datasets setup guide

DEPLOYMENT.md

  • New section: "Setting Up Persistence"
  • Detailed 4-step guide for HF Datasets configuration
  • Benefits of auto-backup listed
  • Updated "Important Notes" section

How It Works

Manual Download (Always Available)

  1. User clicks "💾 Download Annotations" button
  2. Browser downloads the JSONL file immediately
  3. File contains all annotations made so far
  4. Works offline, no configuration needed
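
Once downloaded, the JSONL file can be read back with a few lines of Python. This is a generic sketch that assumes one JSON object per line and makes no assumption about the fields inside each record; `read_annotations` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_annotations(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSONL file into a list of records, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```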

Auto-Backup (Optional)

  1. On Startup:

    • Checks for HF_TOKEN and HF_DATASET_REPO env vars
    • If found, logs in to HF and enables auto-backup
    • Loads existing annotations from HF Dataset
  2. On Each Annotation:

    • Saves to local file (ephemeral)
    • Converts all annotations to Dataset format
    • Pushes to HF Hub (replaces entire dataset)
    • Shows success/failure in console logs
  3. On Space Restart:

    • Loads annotations from HF Dataset
    • Continues from where user left off
    • No annotations are lost, provided the most recent push succeeded

Configuration for Users

For Manual Download Only

No configuration needed! Just use the app and click download when desired.

For Auto-Backup

  1. Create HF Dataset: https://huggingface.co/new-dataset
  2. Get write token: https://huggingface.co/settings/tokens
  3. Add Space secrets:
    • HF_TOKEN: Your write token
    • HF_DATASET_REPO: username/dataset-name
  4. Restart Space
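
Before restarting the Space, it can help to sanity-check the `HF_DATASET_REPO` value. The helper below is a hypothetical convenience, not part of app.py, and only checks the `username/dataset-name` shape, not whether the repo exists or the token can write to it.

```python
def looks_like_repo_id(value: str) -> bool:
    """True if value has the shape 'namespace/name' with both parts non-empty."""
    parts = value.strip().split("/")
    return len(parts) == 2 and all(parts)
```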

Benefits

Manual Download

✅ No setup required
✅ Works offline
✅ User controls when to back up
✅ Simple and reliable

Auto-Backup

✅ Survives Space restarts
✅ Automatic version control
✅ Resume from any device
✅ Collaborative annotation
✅ Easy dataset management
✅ No manual intervention needed

Testing

Local Testing

cd /Users/rafaelmacalaba/WBG/monitoring_of_datause/revalidation/analysis/unhcr_reliefweb/reliefweb_annotation

# Without HF Datasets (download only)
uv run app.py

# With HF Datasets
export HF_TOKEN="your_token"
export HF_DATASET_REPO="username/dataset-name"
uv run app.py

On HF Spaces

  1. Upload files to Space
  2. Optionally configure secrets
  3. Space auto-deploys
  4. Check console logs for HF Datasets status

Files Modified

  1. ✅ requirements.txt - Added HF dependencies
  2. ✅ app.py - Implemented dual persistence
  3. ✅ README.md - Documented features and setup
  4. ✅ DEPLOYMENT.md - Added configuration guide

Deployment Checklist

  • Dependencies updated
  • Download button implemented
  • HF Datasets integration implemented
  • UI indicators added
  • Documentation updated
  • Local testing completed
  • Deploy to HF Spaces
  • Configure HF secrets (optional)
  • Test in production

Next Steps

  1. Deploy to HF Spaces (see DEPLOYMENT.md)
  2. Configure secrets for auto-backup (optional)
  3. Start annotating!
  4. Download periodically as backup
  5. Monitor HF Dataset for automatic backups

Status: ✅ Implementation Complete
Ready for Deployment: Yes
Breaking Changes: None (backward compatible)