data-use-annotation / ANNOTATOR_GUIDE.md

rafmacalaba

docs: add annotator guide for the Data Use Annotation Tool

f6b1e29 19 days ago


	# 📖 Data Use Annotation Tool — Annotator Guide

	Welcome! This guide explains how to use the Data Use Annotation Tool to review and annotate data/dataset mentions in documents.

	---

	## 1. Getting Started

	### Signing In
	1. Open the tool — you'll see a login screen
	2. Click 🤗 Sign in with HuggingFace
	3. Authorize with your Hugging Face account
	4. You'll be redirected to the tool showing your assigned documents

	> Note: Only accounts listed in the annotator configuration will see documents. If you see no documents after logging in, contact the admin.

	---

	## 2. Interface Overview

	The tool has two main panels:

	| Panel | Purpose |
	|-------|---------|
	| Left — PDF Viewer | Shows the original PDF for the current page |
	| Right — Markdown Annotation | Shows extracted text with highlighted data mentions |

	### Top Bar
	- Title — "Data Use Annotation Tool"
	- Progress Bar — Overall annotation progress across all corpora
	- User Badge — Your Hugging Face username
	- 📊 Leaderboard — See annotation stats for all annotators

	### Document Selector
	- Dropdown at top-left showing your assigned documents
	- Documents are labeled by corpus: [World Bank], [UNHCR], etc.
	- Format: `[Corpus] Doc N (X pages)`
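If you ever need to handle these labels in a script (for example, when tallying your assigned documents per corpus), the label format can be split into its parts with a regular expression. This is only a minimal sketch: `parse_doc_label` and the returned field names are hypothetical and are not part of the tool.

```python
import re

# Matches labels of the form "[Corpus] Doc N (X pages)".
LABEL_RE = re.compile(
    r"^\[(?P<corpus>[^\]]+)\]\s+Doc\s+(?P<doc>\d+)\s+\((?P<pages>\d+)\s+pages?\)$"
)


def parse_doc_label(label: str) -> dict:
    """Split a selector label into corpus name, document number, and page count."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognized document label: {label!r}")
    return {
        "corpus": match["corpus"],
        "doc": int(match["doc"]),
        "pages": int(match["pages"]),
    }


print(parse_doc_label("[World Bank] Doc 3 (11 pages)"))
# {'corpus': 'World Bank', 'doc': 3, 'pages': 11}
```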

	---

	## 3. Page Navigation

	At the bottom of the screen you'll find the page navigator:

	```
	⏮ ← Prev | Page 3 ● (3 / 11) | Next → ⏭
	```

	| Button | Action |
	|--------|--------|
	| ← Prev / Next → | Move one page at a time |
	| ⏮ / ⏭ | Jump to the previous/next page that has data mentions |
	| ● (green dot) | Indicates the current page has AI-detected data mentions |

	All pages are shown, including those without mentions. Use the jump buttons to quickly navigate to pages of interest.
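The jump behavior described above can be pictured as a search for the nearest page with mentions in one direction. The sketch below is an illustration only, not the tool's code: `jump_target` is a hypothetical name, pages are zero-based, and it assumes the jump does not wrap around at the first or last page.

```python
from typing import Optional, Sequence


def jump_target(has_mentions: Sequence[bool], current: int, direction: int) -> Optional[int]:
    """Return the index of the nearest page with mentions in the given direction.

    has_mentions[i] is True when page i has AI-detected mentions (zero-based).
    direction is +1 for the next-jump button and -1 for the previous-jump button.
    Returns None when no such page exists (assumed: no wrap-around).
    """
    page = current + direction
    while 0 <= page < len(has_mentions):
        if has_mentions[page]:
            return page
        page += direction
    return None


pages = [False, True, False, False, True, False]
print(jump_target(pages, 1, +1))  # 4
print(jump_target(pages, 1, -1))  # None
```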

	---

	## 4. Understanding Data Mentions

	The AI model pre-detects potential dataset mentions in the text. Each mention is highlighted with a color based on its tag:

	| Color | Tag | Meaning |
	|-------|-----|---------|
	| 🟢 Green | Named | A specific, named dataset (e.g. "2022 National Census") |
	| 🟡 Amber | Descriptive | A described but not formally named dataset (e.g. "a household survey") |
	| 🟣 Purple | Vague | An unclear or ambiguous data reference |
	| ⚪ Gray | Non-Dataset | Flagged by the model but not actually a dataset |

	A legend above the text shows the count of each type on the current page.
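As a rough illustration of what the legend shows, the per-tag counts for a page can be computed as below. The mention records (dictionaries with `text` and `tag` keys) and the `legend_counts` helper are assumptions made for this example; the tool's internal data format is not documented here.

```python
from collections import Counter

# The four tags from the table above, in display order.
TAGS = ("Named", "Descriptive", "Vague", "Non-Dataset")


def legend_counts(mentions):
    """Count mentions per tag, reporting 0 for tags that do not occur."""
    counts = Counter(m["tag"] for m in mentions)
    return {tag: counts.get(tag, 0) for tag in TAGS}


page_mentions = [
    {"text": "2022 National Census", "tag": "Named"},
    {"text": "a household survey", "tag": "Descriptive"},
    {"text": "a household survey", "tag": "Descriptive"},
]
print(legend_counts(page_mentions))
# {'Named': 1, 'Descriptive': 2, 'Vague': 0, 'Non-Dataset': 0}
```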

	---

	## 5. Reviewing Existing Mentions (Validation)

	Click the toggle button (‹) on the right edge to open the Data Mentions side panel. For each AI-detected mention you can:

	### Validate
	1. Click Validate on a mention
	2. Optionally add notes explaining your decision
	3. Choose one of:
	   - ✅ Correct — The mention is a real dataset
	   - ❌ Wrong — The mention is not a dataset (false positive)

	### Change Tag
	- Click the tag badge (e.g. "Named") to edit it
	- Select the correct tag from the dropdown
	- Click Save to update

	### Delete
	- Click 🗑 Delete to remove a false mention
	- Click again to confirm (auto-cancels after 3 seconds)
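The two-click delete with a 3-second auto-cancel can be thought of as a small timer-guarded state. The following is a hedged sketch of that idea, not the tool's implementation: the `DeleteConfirm` class is hypothetical, and the clock is injectable so the behavior can be shown without waiting.

```python
import time


class DeleteConfirm:
    """Two-click confirmation that expires after TIMEOUT seconds."""

    TIMEOUT = 3.0  # seconds, matching the auto-cancel window described above

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._armed_at = None  # time of the first click, or None if not armed

    def click(self) -> bool:
        """Return True when this click confirms the delete, False when it only arms it."""
        now = self._clock()
        if self._armed_at is not None and now - self._armed_at <= self.TIMEOUT:
            self._armed_at = None
            return True
        # First click, or the earlier click has expired: (re)arm the button.
        self._armed_at = now
        return False


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
button = DeleteConfirm(clock=lambda: fake_now[0])
print(button.click())  # False: first click only arms the button
fake_now[0] = 1.0
print(button.click())  # True: second click within 3 seconds confirms
```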

	### Status Indicators
	- "Needs review" — Not yet validated by you
	- "✓ verified" / "✗ rejected" — Your validation result
	- A checkmark appears next to validated mentions

	---

	## 6. Adding New Annotations

	If you spot a dataset mention that the AI missed:

	1. Select the text — Click and drag to highlight the dataset name in the markdown preview
	2. Click "✍️ Annotate Selection" — The annotation modal will appear
	3. Choose a Dataset Tag:
	   - Named Dataset — A specific named dataset
	   - Descriptive — A described but unnamed dataset
	   - Vague — An ambiguous reference
	4. Click "Save Annotation" — Your annotation is saved

	> Tip: If no text is selected when you click the button, it will shake to remind you to select text first.

	---

	## 7. Page Workflow

	For each page, the recommended workflow is:

	1. Read the markdown text on the right while referencing the PDF on the left
	2. Review each highlighted mention — validate or reject in the side panel
	3. Add any missed mentions using text selection
	4. Move to the next page (a warning appears if you have unvalidated mentions)

	### Unvalidated Mentions Warning
	When moving to the next page with unvalidated mentions, you'll see:
	> ⚠️ You have N unverified data mention(s) on this page. Do you want to proceed?

	You can proceed or go back to finish validating.
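The check behind this warning amounts to counting the mentions on the page that have no verdict yet. A minimal sketch, assuming each mention is a dictionary whose `verdict` key is `None` until it is validated (both the key name and the `unverified_warning` helper are hypothetical):

```python
from typing import Optional


def unverified_warning(mentions) -> Optional[str]:
    """Return the warning text, or None when every mention has a verdict."""
    pending = sum(1 for m in mentions if m.get("verdict") is None)
    if pending == 0:
        return None
    return (
        f"⚠️ You have {pending} unverified data mention(s) on this page. "
        "Do you want to proceed?"
    )


page = [
    {"text": "2022 National Census", "verdict": "correct"},
    {"text": "a household survey", "verdict": None},
]
print(unverified_warning(page))
```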

	---

	## 8. Tips & Best Practices

	- Use the PDF for context — the markdown is extracted text and may have formatting issues
	- Jump buttons (⏮/⏭) let you skip pages without mentions quickly
	- Pages without mentions may still contain datasets the AI missed — browse them when possible
	- Validate everything on a page before moving on for the most efficient workflow
	- Be precise when selecting text for new annotations — select just the dataset name, not surrounding context

	---

	## 9. FAQ

	Q: Can I undo a validation?
	A: Click "Validate" again to re-validate with a different verdict.

	Q: What if the markdown text doesn't match the PDF?
	A: This can happen with complex layouts (tables, figures). Annotate the text you can read in the markdown panel, and treat the PDF as the source of truth when the two disagree.

	Q: Why are some pages empty?
	A: Some pages (like cover pages or blank pages) may have no extracted text. Use the jump buttons to skip them.

	Q: Who sees my annotations?
	A: Annotations are stored centrally. Admins and other annotators with access to the same documents may see your work.

	---

	## Need Help?

	Contact the project admin if you encounter issues or have questions about specific annotation decisions.
