
📖 Data Use Annotation Tool: Annotator Guide

Welcome! This guide explains how to use the Data Use Annotation Tool to review and annotate data/dataset mentions in documents.


1. Getting Started

Signing In

  1. Open the tool. You'll see a login screen.
  2. Click 🤗 Sign in with HuggingFace
  3. Authorize with your HuggingFace account
  4. You'll be redirected to the tool showing your assigned documents

Note: Only accounts listed in the annotator configuration will see documents. If you see no documents after logging in, contact the admin.


2. Interface Overview

The tool has two main panels:

| Panel | Purpose |
| --- | --- |
| Left: PDF Viewer | Shows the original PDF for the current page |
| Right: Markdown Annotation | Shows extracted text with highlighted data mentions |

Top Bar

  • Title: "Data Use Annotation Tool"
  • Progress Bar: Overall annotation progress across all corpora
  • User Badge: Your HuggingFace username
  • 📊 Leaderboard: See annotation stats for all annotators

Document Selector

  • Dropdown at top-left showing your assigned documents
  • Documents are labeled by corpus: [World Bank], [UNHCR], etc.
  • Format: [Corpus] Doc N (X pages)
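
For anyone scripting against exported document lists, the label format above can be sketched as a small formatter and parser. This is only an illustration of the `[Corpus] Doc N (X pages)` pattern; the function names `format_label` and `parse_label` are hypothetical and are not part of the tool.

```python
import re

# Hypothetical helpers illustrating the "[Corpus] Doc N (X pages)" label format.
# They are not taken from the tool's code base.
LABEL_PATTERN = re.compile(
    r"^\[(?P<corpus>[^\]]+)\] Doc (?P<doc>\d+) \((?P<pages>\d+) pages\)$"
)


def format_label(corpus: str, doc_number: int, page_count: int) -> str:
    """Build a selector label such as '[World Bank] Doc 3 (11 pages)'."""
    return f"[{corpus}] Doc {doc_number} ({page_count} pages)"


def parse_label(label: str) -> tuple:
    """Split a label back into (corpus, document number, page count)."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a document label: {label!r}")
    return match["corpus"], int(match["doc"]), int(match["pages"])


print(format_label("World Bank", 3, 11))
print(parse_label("[UNHCR] Doc 12 (48 pages)"))
```

The sketch assumes the word "pages" is always plural, as in the format shown above; how the tool labels a one-page document is not stated in this guide.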

3. Page Navigation

At the bottom of the screen you'll find the page navigator:

⏮  ← Prev  |  Page 3 ●  (3 / 11)  |  Next →  ⏭

| Button | Action |
| --- | --- |
| ← Prev / Next → | Move one page at a time |
| ⏮ / ⏭ | Jump to the previous/next page that has data mentions |
| ● (green dot) | Indicates the current page has AI-detected data mentions |

All pages are shown, including those without mentions. Use the jump buttons to quickly navigate to pages of interest.
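
The jump behavior can be pictured with a minimal sketch. It assumes pages are numbered from 1 and that the number of AI-detected mentions per page is known; the function names are hypothetical, and what the tool actually does when no further page has mentions is not described in this guide (the sketch simply returns `None`).

```python
from typing import Optional, Sequence


def jump_forward(current: int, mention_counts: Sequence[int]) -> Optional[int]:
    """Return the next page after `current` (1-based) that has mentions, or None."""
    for page in range(current + 1, len(mention_counts) + 1):
        if mention_counts[page - 1] > 0:
            return page
    return None


def jump_backward(current: int, mention_counts: Sequence[int]) -> Optional[int]:
    """Return the closest page before `current` (1-based) that has mentions, or None."""
    for page in range(current - 1, 0, -1):
        if mention_counts[page - 1] > 0:
            return page
    return None


# Hypothetical six-page document: pages 2, 3 and 6 have detected mentions.
counts = [0, 2, 1, 0, 0, 3]
print(jump_forward(3, counts))   # skips the empty pages 4 and 5
print(jump_backward(3, counts))  # lands on page 2
```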


4. Understanding Data Mentions

The AI model pre-detects potential dataset mentions in the text. Each mention is highlighted with a color based on its tag:

| Color | Tag | Meaning |
| --- | --- | --- |
| 🟢 Green | Named | A specific, named dataset (e.g. "2022 National Census") |
| 🟡 Amber | Descriptive | A described but not formally named dataset (e.g. "a household survey") |
| 🟣 Purple | Vague | An unclear or ambiguous data reference |
| ⚪ Gray | Non-Dataset | Flagged by the model but not actually a dataset |

A legend above the text shows the count of each type on the current page.
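
If you post-process exported annotations, the four tags and the per-page legend count might be modeled as below. This is a sketch under assumptions: the `Tag` enum and `legend_counts` function are hypothetical names, and the tool's real storage format is not documented here.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class Tag(Enum):
    """The four tags described in this guide (class name is hypothetical)."""
    NAMED = "Named"
    DESCRIPTIVE = "Descriptive"
    VAGUE = "Vague"
    NON_DATASET = "Non-Dataset"


def legend_counts(page_tags: Iterable[Tag]) -> Dict[Tag, int]:
    """Count mentions per tag on one page, including tags with zero mentions."""
    counts = Counter(page_tags)
    return {tag: counts.get(tag, 0) for tag in Tag}


# Hypothetical page with three detected mentions.
page = [Tag.NAMED, Tag.NAMED, Tag.VAGUE]
for tag, count in legend_counts(page).items():
    print(f"{tag.value}: {count}")
```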


5. Reviewing Existing Mentions (Validation)

Click the toggle button (‹) on the right edge to open the Data Mentions side panel. For each AI-detected mention you can:

Validate

  1. Click Validate on a mention
  2. Optionally add notes explaining your decision
  3. Choose one of:
    • ✅ Correct: The mention is a real dataset
    • ❌ Wrong: The mention is not a dataset (false positive)

Change Tag

  • Click the tag badge (e.g. "Named") to edit it
  • Select the correct tag from the dropdown
  • Click Save to update

Delete

  • Click 🗑 Delete to remove a false mention
  • Click again to confirm (auto-cancels after 3 seconds)

Status Indicators

  • "Needs review": Not yet validated by you
  • "✓ verified" / "✗ rejected": Your validation result
  • A checkmark appears next to validated mentions

6. Adding New Annotations

If you spot a dataset mention that the AI missed:

  1. Select the text: Click and drag to highlight the dataset name in the markdown preview
  2. Click "✏️ Annotate Selection": The annotation modal will appear
  3. Choose a Dataset Tag:
    • Named Dataset: A specific named dataset
    • Descriptive: A described but unnamed dataset
    • Vague: An ambiguous reference
  4. Click "Save Annotation": Your annotation is saved

Tip: If no text is selected when you click the button, it will shake to remind you to select text first.


7. Page Workflow

For each page, the recommended workflow is:

  1. Read the markdown text on the right while referencing the PDF on the left
  2. Review each highlighted mention: validate or reject it in the side panel
  3. Add any missed mentions using text selection
  4. Move to the next page (a warning appears if you have unvalidated mentions)

Unvalidated Mentions Warning

When moving to the next page with unvalidated mentions, you'll see:

⚠️ You have N unverified data mention(s) on this page. Do you want to proceed?

You can proceed or go back to finish validating.
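
The check behind this warning can be sketched roughly as follows. The status strings (`"needs_review"`, `"verified"`, `"rejected"`) and the function name are assumptions made for illustration; only the message text comes from this guide.

```python
from typing import Optional, Sequence


def unverified_warning(statuses: Sequence[str]) -> Optional[str]:
    """Return the warning text if any mention on the page still needs review.

    `statuses` holds one hypothetical status string per mention on the page:
    "needs_review", "verified", or "rejected".
    """
    pending = sum(1 for status in statuses if status == "needs_review")
    if pending == 0:
        return None  # nothing left to validate, so no warning is needed
    return (
        f"You have {pending} unverified data mention(s) on this page. "
        "Do you want to proceed?"
    )


print(unverified_warning(["verified", "needs_review", "needs_review"]))
print(unverified_warning(["verified", "rejected"]))
```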


8. Tips & Best Practices

  • Use the PDF for context: the markdown is extracted text and may have formatting issues
  • Jump buttons (⏮/⏭) let you skip pages without mentions quickly
  • Pages without mentions may still contain datasets the AI missed, so browse them when possible
  • Validate everything on a page before moving on for the most efficient workflow
  • Be precise when selecting text for new annotations: select just the dataset name, not surrounding context

9. FAQ

Q: Can I undo a validation? A: Click "Validate" again to re-validate with a different verdict.

Q: What if the markdown text doesn't match the PDF? A: This can happen with complex layouts (tables, figures). Annotate based on what you can read in the markdown, and treat the PDF as the source of truth when the two disagree.

Q: Why are some pages empty? A: Some pages (like cover pages or blank pages) may have no extracted text. Use the jump buttons to skip them.

Q: Who sees my annotations? A: Annotations are stored centrally. Admins and other annotators with access to the same documents may see your work.


Need Help?

Contact the project admin if you encounter issues or have questions about specific annotation decisions.