	---
	base_model: fastino/gliner2-large-v1
	library_name: peft
	tags:
	- base_model:adapter:fastino/gliner2-large-v1
	- lora
	- transformers
	- gliner2
	- dataset-extraction
	- data-use
	---

	# GLiNER2 Data Use Extraction Adapter (v2)

	This is a LoRA adapter for `fastino/gliner2-large-v1`, fine-tuned to extract datasets, data mentions, and their relations from academic papers, research documents, and reports, with a focus on World Bank and UNHCR documents.

	- Repository: [https://github.com/rafmacalaba/monitoring_of_datause](https://github.com/rafmacalaba/monitoring_of_datause)
	- Base Model: `fastino/gliner2-large-v1`
	- Adapter ID: `ai4data/datause-extraction-v2`

	---

	## How to Get Started with the Model

	We strongly recommend using this model through the official `ai4data` Python library, which automatically handles:
	- Markdown-aware chunking that respects the model's context limit.
	- Adjustment of character offsets when results from multiple chunks are merged.
	- Greedy overlap resolution and text normalization.
	- Pre-classification to skip pages that contain no data mentions.
	- Deduplication and acronym matching.
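
To make two of these steps concrete, here is a minimal sketch of offset adjustment followed by greedy overlap resolution. It is an illustration only, not the `ai4data` implementation: the `Span` type, the `shift_to_document` and `resolve_overlaps` helpers, and the sample chunks are all hypothetical, and it assumes that "greedy" means keeping the highest-confidence span whenever two spans overlap.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    """A predicted mention with character offsets (hypothetical helper type)."""
    start: int
    end: int
    text: str
    confidence: float


def shift_to_document(spans, chunk_start):
    """Turn chunk-relative offsets into document-relative offsets."""
    return [replace(s, start=s.start + chunk_start, end=s.end + chunk_start) for s in spans]


def resolve_overlaps(spans):
    """Greedily keep the most confident spans, skipping any that overlap a kept one."""
    kept = []
    for candidate in sorted(spans, key=lambda s: (-s.confidence, s.start)):
        if all(candidate.end <= k.start or candidate.start >= k.end for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda s: s.start)


document = "We use the Demographic and Health Survey for 2022."

# Two overlapping chunks, given as (chunk start offset, predictions inside the chunk).
chunks = [
    (0, [Span(11, 33, "Demographic and Health", 0.80)]),
    (11, [Span(0, 29, "Demographic and Health Survey", 0.95)]),
]

candidates = [s for start, spans in chunks for s in shift_to_document(spans, start)]
resolved = resolve_overlaps(candidates)

for span in resolved:
    # Document-relative offsets should index straight into the full text.
    assert document[span.start:span.end] == span.text
    print(span)
```

Sorting by confidence before the overlap check is what makes the pass greedy: a span is only compared against spans that have already been accepted, so the shorter, lower-confidence duplicate from the first chunk is dropped.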

	### 1. Installation

	Clone and install the repository:

	```bash
	git clone <repository-url>
	cd monitoring_of_datause
	uv sync
	```

	### 2. Python Usage

	To extract dataset mentions and their attributes (such as reference year, producer, and acronym):

	```python
	from ai4data import extract_from_text, extract_from_document

	text = """Our analysis uses the 2022 Demographic and Health Survey (DHS) conducted by
	the National Statistics Office. We complement this with administrative systems, but
	only the DHS is used in the empirical models."""

	# Extract from raw text
	results = extract_from_text(text)
	print(results["datasets"])

	# Extract from a PDF document
	pdf_results = extract_from_document("report.pdf", pages=[0, 1, 2])
	print(pdf_results)
	```

	---

	## Model Schema & Response Structure

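	For each data mention, the model extracts the mention itself (`mention_name`) plus a set of attributes. When querying via `ai4data`, each extracted entity in the `"datasets"` list has the following structure, where every field carries the extracted `text`, a `confidence` score, and `start`/`end` character offsets: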

	```json
	{
	  "mention_name": {
	    "text": "Demographic and Health Survey",
	    "confidence": 0.9999,
	    "start": 23,
	    "end": 52
	  },
	  "specificity_tag": {
	    "text": "named",
	    "confidence": 0.9999,
	    "start": 23,
	    "end": 52
	  },
	  "usage_context": {
	    "text": "primary",
	    "confidence": 0.9999,
	    "start": 23,
	    "end": 52
	  },
	  "typology_tag": {
	    "text": "survey",
	    "confidence": 0.9999,
	    "start": 23,
	    "end": 52
	  },
	  "acronym": {
	    "text": "DHS",
	    "confidence": 0.9996,
	    "start": 54,
	    "end": 57
	  },
	  "producer": {
	    "text": "National Statistics Office",
	    "confidence": 0.9999,
	    "start": 72,
	    "end": 98
	  },
	  "reference_year": {
	    "text": "2022",
	    "confidence": 0.9999,
	    "start": 18,
	    "end": 22
	  },
	  "is_used": {
	    "text": "True",
	    "confidence": 0.9999,
	    "start": 23,
	    "end": 52
	  },
	  "geography": {
	    "text": "",
	    "confidence": 0.9999,
	    "start": 23,
	    "end": 52
	  }
	}
	```
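
Downstream code often wants one flat record per mention rather than the nested structure. The sketch below shows one way to do that on a plain dictionary shaped like the example above. The `flatten_entity` helper and its `min_confidence` threshold are hypothetical and not part of `ai4data`, and the sketch assumes that an empty `text` means the attribute was not found.

```python
def flatten_entity(entity, min_confidence=0.5):
    """Map each field to its text, or None when it is empty or low-confidence."""
    record = {}
    for field, value in entity.items():
        text = value.get("text", "")
        keep = bool(text) and value.get("confidence", 0.0) >= min_confidence
        record[field] = text if keep else None
    return record


# A trimmed copy of the example entity, limited to four fields.
entity = {
    "mention_name": {"text": "Demographic and Health Survey", "confidence": 0.9999, "start": 23, "end": 52},
    "acronym": {"text": "DHS", "confidence": 0.9996, "start": 54, "end": 57},
    "is_used": {"text": "True", "confidence": 0.9999, "start": 23, "end": 52},
    "geography": {"text": "", "confidence": 0.9999, "start": 23, "end": 52},
}

record = flatten_entity(entity)
# Assumption: is_used arrives as the string "True" or "False", as in the example above.
record["is_used"] = record["is_used"] == "True"
print(record)
```

Mapping missing attributes to `None` instead of an empty string keeps "not found" distinct from real values when the records are loaded into a table.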

	---

	## Annotation Guidelines & What Counts as a Data Mention

	### Specificity Taxonomy
	- `named`: A specific, citable dataset (e.g., `"DHS 2020"`, `"World Development Indicators"`, `"Ghana Living Standards Survey (GLSS)"`)
	- `descriptive`: A general category of data, not a specific named dataset (e.g., `"household survey data"`, `"administrative records"`, `"panel data on firms"`)
	- `vague`: An indirect or ambiguous reference (e.g., `"available data"`, `"our dataset"`, `"the data used in this study"`)

	### Usage Context
	- `primary`: Core data driving the main analysis in the report.
	- `supporting`: Secondary data used to validate, calibrate, or provide robustness checks.
	- `background`: Mentioned in passing, in a literature review, or as historical context.
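
The two taxonomies above can be combined to pick out the datasets that matter most for a report. The following is a minimal sketch under stated assumptions: the `tag` and `select_core_datasets` helpers are hypothetical, the sample entities are invented for illustration and trimmed to the fields the code reads, and the tag values are assumed to appear in the `text` of `specificity_tag` and `usage_context` as in the response structure shown earlier.

```python
from collections import Counter


def tag(entity, field):
    """Return the text of a tag field, or an empty string when it is missing."""
    return entity.get(field, {}).get("text", "")


def select_core_datasets(entities):
    """Keep mentions tagged as both `named` and `primary`."""
    return [
        e for e in entities
        if tag(e, "specificity_tag") == "named" and tag(e, "usage_context") == "primary"
    ]


# Hypothetical entities, trimmed to the fields this sketch reads.
entities = [
    {"mention_name": {"text": "Ghana Living Standards Survey"},
     "specificity_tag": {"text": "named"}, "usage_context": {"text": "primary"}},
    {"mention_name": {"text": "administrative records"},
     "specificity_tag": {"text": "descriptive"}, "usage_context": {"text": "supporting"}},
    {"mention_name": {"text": "World Development Indicators"},
     "specificity_tag": {"text": "named"}, "usage_context": {"text": "background"}},
]

core = select_core_datasets(entities)
usage_counts = Counter(tag(e, "usage_context") for e in entities)

print([tag(e, "mention_name") for e in core])
print(usage_counts)
```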