---
library_name: gliner2
license: mit
base_model: fastino/gliner2-large-v1
tags:
- ner
- relation-extraction
- data-mention-extraction
- lora
- gliner2
- development-economics
- geography
---
# datause-extraction-v1
This is the official fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions, their attributes, geographic coverage, and usage roles from economics and development research documents.
The model uses a joint **entity + relation extraction schema** to detect dataset mentions and link them to their metadata (producers, acronyms, timeframes, and countries) without suffering from choices-based prefix collision.
---
## Rationale and Context: Forced Displacement, Refugees, and FCV
### Why This Model Was Created
Tracking how datasets are used in Fragility, Conflict, and Violence (FCV) settings and forced displacement contexts is critical. Research on refugees, internally displaced persons (IDPs), and host communities depends heavily on diverse data sources, ranging from large-scale household surveys to localized administrative registration systems.
Understanding *which* datasets are being utilized, *who* is producing them, and *how* they are integrated into policy analysis helps international organizations (such as the World Bank and UNHCR), researchers, and funding bodies:
1. **Monitor Data Investments**: Quantify the impact and academic/policy reach of dedicated data initiatives (e.g., those funded by the World Bank-UNHCR Joint Data Center on Forced Displacement).
2. **Identify Data Gaps**: Discover regions or populations where FCV analyses lack primary microdata and are forced to rely solely on background or secondary estimates.
3. **Avoid Duplication**: Map existing research projects to avoid redundant data collection efforts in challenging, insecure environments.
Because academic literature and policy briefs are unstructured, tracking dataset use has historically required labor-intensive manual review. This model automates that work by identifying verbatim data mentions, their producers, and their analytical roles (primary data source, supporting data used for context or validation, or passing background citation).
### Data Sources & Domain Coverage
The training data for this model was curated from research documents, reports, and working papers published by major development institutions operating in FCV regions. Key data sources referenced in the corpus include:
* **Humanitarian Registries**: UNHCR's proGRES database, registration rolls from national border/refugee agencies, and program databases.
* **Displacement Tracking Systems**: IOM's Displacement Tracking Matrix (DTM) reports and the Internal Displacement Monitoring Centre (IDMC) registries.
* **Household Surveys in FCV Contexts**: Living Standards Measurement Study (LSMS) surveys, Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and specialized welfare monitoring surveys (e.g., SHINE, SESRE).
* **Geospatial & Spatial Databases**: Climate/weather indicators, conflict event databases (e.g., ACLED), and satellite camp imagery.
---
## Usage Option 1: Using the `ai4data` Library Wrapper (Recommended)
It is **highly recommended** to interact with this model using the official **`ai4data`** Python wrapper. The library handles markdown-aware document parsing, sliding context windows, overlap resolution, and entity-relation alignment automatically.
Install the library directly from GitHub:
```bash
pip install git+https://github.com/worldbank/ai4data.git
```
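To give a feel for what "sliding context windows" means in practice, here is a minimal sketch of splitting a long text into overlapping word windows. It is an illustration only, not the `ai4data` implementation: the function name `sliding_windows` and the `window_size` and `overlap` defaults are assumptions chosen for the example.

```python
def sliding_windows(text, window_size=200, overlap=50):
    """Split text into overlapping word windows.

    Returns a list of (start_word_index, chunk_text) pairs.
    Hypothetical helper for illustration; not part of ai4data.
    """
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    step = window_size - overlap
    windows = []
    for start in range(0, max(len(words), 1), step):
        chunk = words[start:start + window_size]
        if not chunk:
            break
        windows.append((start, " ".join(chunk)))
        # Stop once the current window already reaches the end of the text
        if start + window_size >= len(words):
            break
    return windows


for start, chunk in sliding_windows("a b c d e f g h i j", window_size=4, overlap=2):
    print(start, chunk)
```

The overlap means a mention that straddles a window boundary appears whole in at least one window; the duplicates this creates are what an overlap-resolution step then has to merge.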
### 1. Extract from Text
```python
from ai4data.data_use import extract_from_text
text = """
We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda
to analyze child health outcomes. We complement this with population records from the Ministry of Health.
"""
results = extract_from_text(text, include_confidence=True)
for dataset in results["datasets"]:
    name = dataset["mention_name"]["text"]
    acronym = dataset["acronym"]["text"] or "N/A"
    producer = dataset["producer"]["text"] or "N/A"
    geography = dataset["geography"]["text"] or "N/A"
    usage = dataset["usage_context"]["text"]
    print(f"Dataset: {name} ({acronym})")
    print(f"  Producer: {producer} | Geography: {geography} | Role: {usage}\n")
```
### 2. Extract from PDF
```python
from ai4data.data_use import extract_from_document
# Extracts from a local path or a PDF URL
pdf_url = "https://pdf.usaid.gov/pdf_docs/PA00TB5D.pdf"
results = extract_from_document(pdf_url, pages=[0, 1, 2])
for page_data in results:
    print(f"--- Page {page_data['page']} ---")
    for dataset in page_data["datasets"]:
        print(f"Found mention: {dataset['mention_name']['text']}")
```
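Once a document has been processed page by page, a common next step is to see which mentions recur across pages. The sketch below assumes the per-page structure used in the loop above (a `page` key and a `datasets` list); `mentions_by_page` is a hypothetical helper, and `sample_results` is hand-written stand-in data rather than real model output.

```python
from collections import defaultdict


def mentions_by_page(results):
    """Map each mention name to the sorted list of pages where it was found."""
    pages = defaultdict(set)
    for page_data in results:
        for dataset in page_data["datasets"]:
            pages[dataset["mention_name"]["text"]].add(page_data["page"])
    return {name: sorted(found) for name, found in pages.items()}


# Hand-written stand-in for the output of extract_from_document
sample_results = [
    {"page": 0, "datasets": [
        {"mention_name": {"text": "Demographic and Health Survey"}},
    ]},
    {"page": 1, "datasets": [
        {"mention_name": {"text": "Demographic and Health Survey"}},
        {"mention_name": {"text": "Displacement Tracking Matrix"}},
    ]},
]

print(mentions_by_page(sample_results))
```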
---
## Usage Option 2: Using the Raw `gliner2` Library (Without the Wrapper)
If you prefer to integrate the model directly, you can load the adapter with the raw `gliner2` package. In that case you are responsible for document parsing, windowing, and overlap resolution yourself.
### 1. Installation
```bash
pip install gliner2 huggingface_hub
```
### 2. Code Example
```python
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download
BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction-v1"
# 1. Load model and adapter
model = GLiNER2.from_pretrained(BASE_MODEL)
model.load_adapter(snapshot_download(ADAPTER_ID))
model.eval()
# 2. Define schema
ENTITY_DEFS = {
    "name": "The exact full name of the data source or dataset",
    "acronym": "The acronym or abbreviation if any",
    "producer": "The organization or entity that produced or published the data",
    "timeframe": "The year or time period of the data such as 2019 or 2019 to 2020",
    "datatype": "The type of data verbatim from text such as survey, report, census, program, system, or assessment",
    "geography": "The country, region, or geographic area the data covers",
    "specificity": "Whether this mention is named, descriptive, or vague",
    "usage": "Whether this is primary, supporting, or background data",
}
RELATION_DEFS = {
    "has_acronym": "The acronym of the dataset",
    "has_producer": "The producer of the dataset",
    "has_timeframe": "The timeframe of the dataset",
    "has_datatype": "The data type of the dataset",
    "has_geography": "The country or geographic coverage area of the dataset",
    "has_specificity": "Whether this dataset is named, descriptive, or vague",
    "has_usage": "Whether this dataset is primary, supporting, or background",
}
schema = model.create_schema()
schema.entities(ENTITY_DEFS)
schema.relations(RELATION_DEFS)
# 3. Prepend the label prefix that lists the allowed specificity and usage values
LABEL_PREFIX = "specificity: named | descriptive | vague usage: primary | supporting | background |"
text = "We use the 2022 Demographic and Health Survey (DHS) conducted by the National Statistics Office in Uganda."
prefixed_text = f"{LABEL_PREFIX} {text}"
# 4. Extract
outputs = model.extract(prefixed_text, schema, threshold=0.3)
print(outputs)
```
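In the raw `gliner2` example, the label prefix is prepended to the text before extraction, so any character offsets measured on `prefixed_text` no longer line up with the original `text`. The sketch below assumes your post-processing works with such offsets (the structure of `outputs` is not documented here) and shifts them back. `to_original_offsets` is a hypothetical helper, not part of `gliner2`.

```python
LABEL_PREFIX = "specificity: named | descriptive | vague usage: primary | supporting | background |"


def to_original_offsets(start, end, prefix=LABEL_PREFIX):
    """Shift offsets measured on f"{prefix} {text}" back onto text.

    Returns None when the span starts inside the prefix itself.
    Hypothetical helper for illustration only.
    """
    shift = len(prefix) + 1  # +1 for the space that joins prefix and text
    if start < shift:
        return None
    return start - shift, end - shift


text = "We use the 2022 Demographic and Health Survey (DHS)."
prefixed_text = f"{LABEL_PREFIX} {text}"

# Simulate an offset reported against the prefixed string
start = prefixed_text.index("DHS")
span = to_original_offsets(start, start + len("DHS"))
print(text[span[0]:span[1]])
```

Spans that fall inside the prefix (for example a match on the word "primary" in the label list) map to `None`, which makes them easy to discard.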
---
## Response Structure (Wrapper Output)
Each item in the returned `"datasets"` list from the `ai4data` library is structured as follows:
```json
{
  "mention_name": {
    "text": "Demographic and Health Survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "specificity_tag": {
    "text": "named",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "usage_context": {
    "text": "primary",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "typology_tag": {
    "text": "survey",
    "confidence": 0.9998,
    "start": 12,
    "end": 41
  },
  "acronym": {
    "text": "DHS",
    "confidence": 0.9996,
    "start": 43,
    "end": 46
  },
  "producer": {
    "text": "National Statistics Office",
    "confidence": 0.9992,
    "start": 60,
    "end": 86
  },
  "reference_year": {
    "text": "2022",
    "confidence": 0.9998,
    "start": 7,
    "end": 11
  },
  "geography": {
    "text": "Uganda",
    "confidence": 0.9997,
    "start": 90,
    "end": 96
  }
}
```
```
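For tabular analysis it is often convenient to collapse each record into a flat row of text values and to drop fields the model was unsure about. The following is a minimal sketch under stated assumptions: `flatten_dataset` and the `min_confidence` threshold of 0.5 are hypothetical, the records must have been produced with `include_confidence=True`, and the low `producer` confidence in the sample is invented purely to show the filter.

```python
def flatten_dataset(record, min_confidence=0.5):
    """Return {field: text}, using None for missing or low-confidence fields.

    Fields without a "confidence" key are treated as low confidence.
    Hypothetical helper for illustration only.
    """
    row = {}
    for field, span in record.items():
        confident = (
            isinstance(span, dict)
            and span.get("confidence", 0.0) >= min_confidence
        )
        row[field] = span.get("text") if confident else None
    return row


# Sample record in the wrapper's response shape (producer confidence is made up)
record = {
    "mention_name": {"text": "Demographic and Health Survey", "confidence": 0.9998, "start": 12, "end": 41},
    "acronym": {"text": "DHS", "confidence": 0.9996, "start": 43, "end": 46},
    "producer": {"text": "National Statistics Office", "confidence": 0.42, "start": 60, "end": 86},
}

print(flatten_dataset(record))
```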
### Attribute Fields
| Field | Type | Description |
|---|---|---|
| `mention_name` | String / Span | Verbatim name of the dataset mentioned in the text |
| `specificity_tag` | Choice / Span | Precision classification: `named` / `descriptive` / `vague` |
| `usage_context` | Choice / Span | Analytical role: `primary` (core dataset) / `supporting` (context/validation) / `background` (passing reference) |
| `is_used` | Boolean / Span | Derived field: `True` if `usage_context` is `primary`/`supporting`, `False` if `background` |
| `typology_tag` | Choice / Span | Derived/mapped data type: `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
| `acronym` | String / Span | Abbreviation or acronym linked to the dataset |
| `producer` | String / Span | Organizing body or agency that published/collected the data |
| `reference_year` | String / Span | Year or timeframe the data represents |
| `geography` | String / Span | Geographic coverage or region the data represents |
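
The `is_used` rule in the table can be written out as a small function. This is a sketch of the rule as stated above, not the wrapper's own code; `derive_is_used` is a hypothetical name, and raising on unknown labels is a design choice made for the example.

```python
def derive_is_used(usage_context):
    """Apply the table's rule: primary/supporting -> True, background -> False."""
    if usage_context in ("primary", "supporting"):
        return True
    if usage_context == "background":
        return False
    # Assumption: fail loudly on labels outside the three documented values
    raise ValueError(f"unexpected usage_context: {usage_context!r}")


for label in ("primary", "supporting", "background"):
    print(label, derive_is_used(label))
```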