# 📖 Data Use Annotation Tool — Annotator Guide

Welcome! This guide explains how to use the **Data Use Annotation Tool** to review and annotate data/dataset mentions in documents.

---

## 1. Getting Started

### Signing In
1. Open the tool — you'll see a login screen
2. Click **🤗 Sign in with HuggingFace**
3. Authorize with your HuggingFace account
4. You'll be redirected to the tool showing your assigned documents

> **Note:** Only accounts listed in the annotator configuration will see documents. If you see no documents after logging in, contact the admin.

---

## 2. Interface Overview

The tool has two main panels:

| Panel | Purpose |
|-------|---------|
| **Left — PDF Viewer** | Shows the original PDF for the current page |
| **Right — Markdown Annotation** | Shows extracted text with highlighted data mentions |

### Top Bar
- **Title** — "Data Use Annotation Tool"
- **Progress Bar** — Overall annotation progress across all corpora
- **User Badge** — Your HuggingFace username
- **📊 Leaderboard** — See annotation stats for all annotators

### Document Selector
- Dropdown at top-left showing your assigned documents
- Documents are labeled by corpus: **[World Bank]**, **[UNHCR]**, etc.
- Format: `[Corpus] Doc N (X pages)`
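
If you keep your own notes or a spreadsheet of assigned documents, the label format above is easy to split into its parts. The following is a minimal sketch, not part of the tool: `DocLabel`, `LABEL_RE`, and `parse_label` are hypothetical names, and it assumes that `N` and `X` are always whole numbers.

```python
import re
from typing import NamedTuple, Optional


class DocLabel(NamedTuple):
    corpus: str      # text inside the square brackets, e.g. "World Bank"
    doc_number: int  # the N in "Doc N"
    pages: int       # the X in "(X pages)"


# Matches labels such as "[World Bank] Doc 3 (11 pages)".
# Accepting the singular "page" is an assumption, not documented behavior.
LABEL_RE = re.compile(
    r"^\[(?P<corpus>[^\]]+)\] Doc (?P<n>\d+) \((?P<pages>\d+) pages?\)$"
)


def parse_label(label: str) -> Optional[DocLabel]:
    """Return the parsed label, or None if it does not follow the format."""
    m = LABEL_RE.match(label.strip())
    if m is None:
        return None
    return DocLabel(m["corpus"], int(m["n"]), int(m["pages"]))
```

For example, `parse_label("[UNHCR] Doc 2 (40 pages)")` yields a `DocLabel` with corpus `"UNHCR"`, document number `2`, and `40` pages.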

---

## 3. Page Navigation

At the bottom of the screen you'll find the page navigator:

```
⏮  ← Prev  |  Page 3 ●  (3 / 11)  |  Next →  ⏭
```

| Button | Action |
|--------|--------|
| **← Prev / Next →** | Move one page at a time |
| **⏮ / ⏭** | Jump to the previous/next page that has data mentions |
| **● (green dot)** | Indicates the current page has AI-detected data mentions |

All pages are shown, including those without mentions. Use the jump buttons to quickly navigate to pages of interest.
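
To make the difference between the step buttons and the jump buttons concrete, here is a rough sketch of the jump logic. It is an illustration only: the function names are hypothetical, pages are represented as a list of booleans (True when a page has mentions), and what the tool actually does when no such page exists is not documented here, so the sketch simply returns `None`.

```python
from typing import Optional, Sequence


def next_page_with_mentions(has_mentions: Sequence[bool], current: int) -> Optional[int]:
    """Index of the first page after `current` that has mentions, else None."""
    for i in range(current + 1, len(has_mentions)):
        if has_mentions[i]:
            return i
    return None


def prev_page_with_mentions(has_mentions: Sequence[bool], current: int) -> Optional[int]:
    """Index of the closest page before `current` that has mentions, else None."""
    for i in range(current - 1, -1, -1):
        if has_mentions[i]:
            return i
    return None
```

With mentions on the second and fifth pages only, jumping forward from the second page lands on the fifth and skips the two pages in between, whereas **Next →** would stop on each of them.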

---

## 4. Understanding Data Mentions

The AI model pre-detects potential dataset mentions in the text. Each mention is highlighted with a color based on its **tag**:

| Color | Tag | Meaning |
|-------|-----|---------|
| 🟢 Green | **Named** | A specific, named dataset (e.g. "2022 National Census") |
| 🟡 Amber | **Descriptive** | A described but not formally named dataset (e.g. "a household survey") |
| 🟣 Purple | **Vague** | An unclear or ambiguous data reference |
| ⚪ Gray | **Non-Dataset** | Flagged by the model but not actually a dataset |

A **legend** above the text shows the count of each type on the current page.
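
The legend is essentially a per-tag count of the mentions on one page. A small sketch of that idea follows; the lowercase tag keys, the `TAG_COLORS` mapping, and the dictionary shape of a mention are assumptions made for illustration, not the tool's actual data model.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Tag -> highlight color, following the table above (keys are hypothetical).
TAG_COLORS = {
    "named": "green",
    "descriptive": "amber",
    "vague": "purple",
    "non-dataset": "gray",
}


def legend_counts(mentions: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count the mentions on one page per tag, including tags with zero hits."""
    counts = Counter(m["tag"] for m in mentions)
    return {tag: counts.get(tag, 0) for tag in TAG_COLORS}
```

A page with two named mentions and one vague mention would produce `{"named": 2, "descriptive": 0, "vague": 1, "non-dataset": 0}`.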

---

## 5. Reviewing Existing Mentions (Validation)

Click the **toggle button (‹)** on the right edge to open the **Data Mentions** side panel. For each AI-detected mention you can:

### Validate
1. Click **Validate** on a mention
2. Optionally add notes explaining your decision
3. Choose one of:
   - ✅ **Correct** — The mention is a real dataset
   - ❌ **Wrong** — The mention is not a dataset (false positive)

### Change Tag
- Click the **tag badge** (e.g. "Named") to edit it
- Select the correct tag from the dropdown
- Click **Save** to update

### Delete
- Click **🗑 Delete** to remove a mention that should not have been flagged
- Click the button again to confirm; if you don't confirm within 3 seconds, the deletion is cancelled automatically
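
The two-click delete can be pictured as a button that stays "armed" for a short window after the first click. The sketch below models that behavior under stated assumptions: `DeleteConfirm` is a hypothetical class, the 3-second window comes from this guide, and a late second click is treated as a fresh first click, which may differ from what the tool does.

```python
import time
from typing import Callable, Optional


class DeleteConfirm:
    """Two-click confirmation that expires after a short window."""

    WINDOW_SECONDS = 3.0  # from the guide: auto-cancels after 3 seconds

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._armed_at: Optional[float] = None

    def click(self) -> bool:
        """Return True only when this click confirms a pending delete."""
        now = self._clock()
        if self._armed_at is not None and now - self._armed_at <= self.WINDOW_SECONDS:
            self._armed_at = None  # confirmed: reset for the next delete
            return True
        self._armed_at = now       # first click, or the previous one expired
        return False
```

The clock is injectable so the timing can be exercised without actually waiting.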

### Status Indicators
- **"Needs review"** — Not yet validated by you
- **"✓ verified"** / **"✗ rejected"** — Your validation result
- A checkmark appears next to validated mentions

---

## 6. Adding New Annotations

If you spot a dataset mention that the AI missed:

1. **Select the text** — Click and drag to highlight the dataset name in the markdown preview
2. **Click "✍️ Annotate Selection"** — The annotation modal will appear
3. **Choose a Dataset Tag**:
   - **Named Dataset** — A specific named dataset
   - **Descriptive** — A described but unnamed dataset
   - **Vague** — An ambiguous reference
4. **Click "Save Annotation"** — Your annotation is saved

> **Tip:** If no text is selected when you click the button, it will shake to remind you to select text first.

---

## 7. Page Workflow

For each page, the recommended workflow is:

1. **Read** the markdown text on the right while referencing the PDF on the left
2. **Review** each highlighted mention — validate or reject in the side panel
3. **Add** any missed mentions using text selection
4. **Move** to the next page (a warning appears if you have unvalidated mentions)

### Unvalidated Mentions Warning
When moving to the next page with unvalidated mentions, you'll see:
> ⚠️ You have N unverified data mention(s) on this page. Do you want to proceed?

You can proceed or go back to finish validating.
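
The `N` in the warning is the number of mentions on the page that you have not yet marked Correct or Wrong. As a hedged illustration, assuming each mention is a dictionary whose hypothetical `verdict` field is `None` until you validate it:

```python
from typing import Iterable, Mapping, Optional


def unverified_warning(mentions: Iterable[Mapping[str, object]]) -> Optional[str]:
    """Build the warning text, or return None when every mention has a verdict."""
    n = sum(1 for m in mentions if m.get("verdict") is None)
    if n == 0:
        return None
    return (
        f"⚠️ You have {n} unverified data mention(s) on this page. "
        "Do you want to proceed?"
    )
```

If the function returns `None`, there is nothing left to validate and navigation can continue without a prompt.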

---

## 8. Tips & Best Practices

- **Use the PDF** for context — the markdown is extracted text and may have formatting issues
- **Jump buttons (⏮/⏭)** let you quickly skip pages without mentions
- **Pages without mentions** may still contain datasets the AI missed — browse them when possible
- **Validate everything** on a page before moving on, so you avoid the unvalidated-mentions warning and don't have to come back later
- **Be precise** when selecting text for new annotations — select just the dataset name, not surrounding context

---

## 9. FAQ

**Q: Can I undo a validation?**
A: Click "Validate" on the same mention again and choose a different verdict.

**Q: What if the markdown text doesn't match the PDF?**
A: This can happen with complex layouts (tables, figures). Annotate based on what you can read. The PDF is the source of truth.

**Q: Why are some pages empty?**
A: Some pages (like cover pages or blank pages) may have no extracted text. Use the jump buttons to skip them.

**Q: Who sees my annotations?**
A: Annotations are stored centrally. Admins and other annotators with access to the same documents may see your work.

---

## Need Help?

Contact the project admin if you encounter issues or have questions about specific annotation decisions.