---
library_name: gliner2
license: mit
base_model: fastino/gliner2-large-v1
tags:
  - ner
  - relation-extraction
  - data-mention-extraction
  - lora
  - gliner2
  - development-economics
  - geography
---

# datause-extraction-v1

This is the official fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions, their attributes, geographic coverage, and usage roles from economics and development research documents.

The model leverages a joint **entity + relation extraction schema** to detect mentions and link them to metadata (producers, acronyms, timeframes, and countries) without suffering from choices-based prefix collision.

---

## Rationale and Context: Forced Displacement, Refugees, and FCV

### Why This Model Was Created
Tracking how datasets are used in Fragility, Conflict, and Violence (FCV) settings and forced displacement contexts is critical. Research on refugees, internally displaced persons (IDPs), and host communities depends heavily on diverse data sources, ranging from large-scale household surveys to localized administrative registration systems.

Understanding *which* datasets are being utilized, *who* is producing them, and *how* they are integrated into policy analysis helps international organizations (such as the World Bank and UNHCR), researchers, and funding bodies:
1. **Monitor Data Investments**: Quantify the impact and academic/policy reach of dedicated data initiatives (e.g., those funded by the World Bank-UNHCR Joint Data Center on Forced Displacement).
2. **Identify Data Gaps**: Discover regions or populations where FCV analyses lack primary microdata and are forced to rely solely on background or secondary estimates.
3. **Avoid Duplication**: Map existing research projects to avoid redundant data collection efforts in challenging, insecure environments.

Because academic literature and policy briefs are unstructured, answering these questions has historically required labor-intensive manual review. This model automates that review by identifying verbatim data mentions, their producers, and their analytical roles (primary data source, supporting or validation data, or passing background citation).

### Data Sources & Domain Coverage
The training data for this model was curated from research documents, reports, and working papers published by major development institutions operating in FCV regions. Key data sources referenced in the corpus include:
* **Humanitarian Registries**: UNHCR's proGRES database, registration rolls from national border/refugee agencies, and program databases.
* **Displacement Tracking Systems**: IOM's Displacement Tracking Matrix (DTM) reports and the Internal Displacement Monitoring Centre (IDMC) registries.
* **Household Surveys in FCV Contexts**: Living Standards Measurement Study (LSMS) surveys, Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and specialized welfare monitoring surveys (e.g., SHINE, SESRE).
* **Geospatial & Spatial Databases**: Climate/weather indicators, conflict event databases (e.g., ACLED), and satellite camp imagery.

---

## Usage Option 1: Using the `ai4data` Library Wrapper (Recommended)

It is **highly recommended** to interact with this model using the official **`ai4data`** Python wrapper. The library handles markdown-aware document parsing, sliding context windows, overlap resolution, and entity-relation alignment automatically.

Install the library directly from GitHub:

```bash
pip install git+https://github.com/worldbank/ai4data.git
```

### 1. Extract from Text

```python
from ai4data.data_use import extract_from_text

text = """
We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda 
to analyze child health outcomes. We complement this with population records from the Ministry of Health.
"""

results = extract_from_text(text, include_confidence=True)

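def field_text(dataset, key):
    """Return the extracted text for a field, or "N/A" if it is missing or empty."""
    return (dataset.get(key) or {}).get("text") or "N/A"

for dataset in results["datasets"]:
    name = field_text(dataset, "mention_name")
    acronym = field_text(dataset, "acronym")
    producer = field_text(dataset, "producer")
    geography = field_text(dataset, "geography")
    usage = field_text(dataset, "usage_context")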
    
    print(f"Dataset: {name} ({acronym})")
    print(f"  Producer: {producer} | Geography: {geography} | Role: {usage}\n")
```
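
Once the mentions are extracted, a common next step is to group them by analytical role and drop low-confidence hits. The sketch below is illustrative only: `group_by_role`, the `min_confidence` cutoff, and the hand-written `sample_results` are assumptions, not part of `ai4data`. It only relies on the `mention_name` and `usage_context` fields used in the example above.

```python
from collections import defaultdict


def group_by_role(results, min_confidence=0.5):
    """Group mention names by usage role, skipping low-confidence mentions."""
    grouped = defaultdict(list)
    for dataset in results.get("datasets", []):
        name = dataset.get("mention_name") or {}
        usage = dataset.get("usage_context") or {}
        # Assumption: if no confidence is attached, the mention is kept.
        if name.get("confidence", 1.0) < min_confidence:
            continue
        grouped[usage.get("text") or "unknown"].append(name.get("text"))
    return dict(grouped)


# Hand-written sample shaped like the wrapper output (not real model output).
sample_results = {
    "datasets": [
        {"mention_name": {"text": "Demographic and Health Survey", "confidence": 0.99},
         "usage_context": {"text": "primary", "confidence": 0.98}},
        {"mention_name": {"text": "population records", "confidence": 0.41},
         "usage_context": {"text": "supporting", "confidence": 0.77}},
        {"mention_name": {"text": "Displacement Tracking Matrix", "confidence": 0.95},
         "usage_context": {"text": "background", "confidence": 0.90}},
    ]
}

grouped = group_by_role(sample_results)
for role, names in grouped.items():
    print(f"{role}: {names}")
```

The cutoff of 0.5 is arbitrary; a sensible value depends on how much noise your downstream review can tolerate.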

### 2. Extract from PDF

```python
from ai4data.data_use import extract_from_document

# Extracts from a local path or a PDF URL
pdf_url = "https://pdf.usaid.gov/pdf_docs/PA00TB5D.pdf"
results = extract_from_document(pdf_url, pages=[0, 1, 2])

for page_data in results:
    print(f"--- Page {page_data['page']} ---")
    for dataset in page_data["datasets"]:
        print(f"Found mention: {dataset['mention_name']['text']}")
```
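
For multi-page documents it can help to see on which pages each dataset is mentioned. The following is a minimal sketch, assuming the per-page results have the `page` and `datasets` keys used in the loop above; `pages_by_mention` and `sample_pages` are hypothetical names and data introduced for illustration.

```python
from collections import defaultdict


def pages_by_mention(pages):
    """Build a {mention name: sorted page numbers} index from per-page results."""
    index = defaultdict(set)
    for page_data in pages:
        for dataset in page_data.get("datasets", []):
            name = (dataset.get("mention_name") or {}).get("text")
            if name:
                index[name.strip()].add(page_data["page"])
    return {name: sorted(page_set) for name, page_set in index.items()}


# Hand-written sample shaped like the per-page output (not real model output).
sample_pages = [
    {"page": 0, "datasets": [{"mention_name": {"text": "Demographic and Health Survey"}}]},
    {"page": 1, "datasets": []},
    {"page": 2, "datasets": [
        {"mention_name": {"text": "Demographic and Health Survey"}},
        {"mention_name": {"text": "Displacement Tracking Matrix"}},
    ]},
]

mention_index = pages_by_mention(sample_pages)
for name, page_numbers in mention_index.items():
    print(f"{name}: pages {page_numbers}")
```

Names are matched verbatim here, so "DHS" and "Demographic and Health Survey" would be indexed as separate mentions unless you normalize them first.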

---

## Usage Option 2: Using the Raw `gliner2` Library (Without the Wrapper)

If you prefer to integrate the model directly without using the wrapper library, you can use the raw `gliner2` package.

### 1. Installation

```bash
pip install gliner2 huggingface_hub
```

### 2. Code Example

```python
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction-v1"

# 1. Load model and adapter
model = GLiNER2.from_pretrained(BASE_MODEL)
model.load_adapter(snapshot_download(ADAPTER_ID))
model.eval()

# 2. Define schema
ENTITY_DEFS = {
    "name": "The exact full name of the data source or dataset",
    "acronym": "The acronym or abbreviation if any",
    "producer": "The organization or entity that produced or published the data",
    "timeframe": "The year or time period of the data such as 2019 or 2019 to 2020",
    "datatype": "The type of data verbatim from text such as survey, report, census, program, system, or assessment",
    "geography": "The country, region, or geographic area the data covers",
    "specificity": "Whether this mention is named, descriptive, or vague",
    "usage": "Whether this is primary, supporting, or background data",
}

RELATION_DEFS = {
    "has_acronym": "The acronym of the dataset",
    "has_producer": "The producer of the dataset",
    "has_timeframe": "The timeframe of the dataset",
    "has_datatype": "The data type of the dataset",
    "has_geography": "The country or geographic coverage area of the dataset",
    "has_specificity": "Whether this dataset is named, descriptive, or vague",
    "has_usage": "Whether this dataset is primary, supporting, or background",
}

schema = model.create_schema()
schema.entities(ENTITY_DEFS)
schema.relations(RELATION_DEFS)

# 3. Add prompt prefix
LABEL_PREFIX = "specificity: named | descriptive | vague usage: primary | supporting | background |"
text = "We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda."
prefixed_text = f"{LABEL_PREFIX} {text}"

# 4. Extract
outputs = model.extract(prefixed_text, schema, threshold=0.3)
print(outputs)
```
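
The prefix in step 3 follows a regular pattern: each choice label, a colon, its options joined by ` | `, and a trailing ` |`. The sketch below rebuilds that exact string from a dictionary and shows how character offsets measured on the prefixed text could be shifted back to the original text. The helper names are hypothetical, and whether your `gliner2` version returns character offsets at all is an assumption, since the example above does not show them.

```python
# Choice labels and options, copied from LABEL_PREFIX in the example above.
CHOICES = {
    "specificity": ["named", "descriptive", "vague"],
    "usage": ["primary", "supporting", "background"],
}


def build_label_prefix(choices):
    """Render the choices as 'label: a | b | c label2: x | y | z |'."""
    parts = [f"{label}: {' | '.join(options)}" for label, options in choices.items()]
    return " ".join(parts) + " |"


def to_original_span(start, end, prefix):
    """Shift offsets measured on f'{prefix} {text}' back onto `text`."""
    shift = len(prefix) + 1  # +1 for the space joining prefix and text
    return start - shift, end - shift


prefix = build_label_prefix(CHOICES)
text = "We use the 2022 Demographic and Health Survey (DHS)."
prefixed_text = f"{prefix} {text}"

# Simulate a span found in the prefixed text, then map it back.
start = prefixed_text.index("DHS")
orig_start, orig_end = to_original_span(start, start + len("DHS"), prefix)
print(prefix)
print(text[orig_start:orig_end])
```

Keeping the prefix identical to the one shown above is the safe default; changing the labels or options may not match what the adapter expects.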

---

## Response Structure (Wrapper Output)

Each item in the returned `"datasets"` list from the `ai4data` library is structured as follows:

```json
{
  "mention_name": {
    "text": "Demographic and Health Survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "specificity_tag": {
    "text": "named",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "usage_context": {
    "text": "primary",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "typology_tag": {
    "text": "survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "acronym": {
    "text": "DHS",
    "confidence": 0.9996,
    "start": 43,
    "end": 46
  },
  "producer": {
    "text": "National Statistics Office",
    "confidence": 0.9992,
    "start": 60,
    "end": 86
  },
  "reference_year": {
    "text": "2022",
    "confidence": 0.9998,
    "start": 7,
    "end": 11
  },
  "geography": {
    "text": "Uganda",
    "confidence": 0.9997,
    "start": 90,
    "end": 96
  }
}
```
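
Because every attribute is a small object with a `text` key, a record can be flattened into one row for tabular analysis. This is a rough sketch using only the standard library; `flatten`, the column order in `FIELDS`, and the abbreviated `sample_record` are illustrative choices, not part of the wrapper.

```python
import csv
import io

# Column order for the flat table; field names follow the response structure above.
FIELDS = [
    "mention_name", "acronym", "producer", "reference_year",
    "geography", "typology_tag", "specificity_tag", "usage_context",
]


def flatten(record):
    """Reduce one record to a {field: text} row, using "" for absent fields."""
    return {field: (record.get(field) or {}).get("text") or "" for field in FIELDS}


# Abbreviated copy of the example record above (some fields left out on purpose).
sample_record = {
    "mention_name": {"text": "Demographic and Health Survey", "confidence": 0.9998},
    "acronym": {"text": "DHS", "confidence": 0.9996},
    "geography": {"text": "Uganda", "confidence": 0.9997},
    "usage_context": {"text": "primary", "confidence": 0.9998},
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(flatten(sample_record))
csv_text = buffer.getvalue()
print(csv_text)
```

Flattening discards confidences and offsets, so keep the original records if you need to trace a row back to its position in the source text.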

### Attribute Fields

| Field | Type | Description |
|---|---|---|
| `mention_name` | String / Span | Verbatim name of the dataset mentioned in the text |
| `specificity_tag` | Choice / Span | Precision classification: `named` / `descriptive` / `vague` |
| `usage_context` | Choice / Span | Analytical role: `primary` (core dataset) / `supporting` (context/validation) / `background` (passing reference) |
| `is_used` | Boolean / Span | Derived field: `True` if `usage_context` is `primary`/`supporting`, `False` if `background` |
| `typology_tag` | Choice / Span | Derived/mapped data type: `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
| `acronym` | String / Span | Abbreviation or acronym linked to the dataset |
| `producer` | String / Span | Organization or agency that produced or published the data |
| `reference_year` | String / Span | Year or timeframe the data represents |
| `geography` | String / Span | Geographic coverage or region the data represents |
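
The `is_used` rule in the table can be written out as a small function. This is a sketch of the stated mapping, not the wrapper's own code; `derive_is_used` is a hypothetical name, and returning `None` for missing or unexpected labels is an assumption, since the table only defines the three listed roles.

```python
USED_ROLES = {"primary", "supporting"}


def derive_is_used(usage_context):
    """Map a usage_context label to the is_used flag described in the table."""
    if usage_context in USED_ROLES:
        return True
    if usage_context == "background":
        return False
    # Assumption: unknown or missing labels are left undecided.
    return None


flags = {role: derive_is_used(role) for role in ("primary", "supporting", "background")}
print(flags)
```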