---
language:
  - en
license: apache-2.0
tags:
  - gliner2
  - ner
  - dataset-extraction
  - lora
  - world-bank
base_model: fastino/gliner2-large-v1
library_name: gliner2
pipeline_tag: token-classification
datasets:
  - rafmacalaba/datause-v8
model-index:
  - name: datause-extraction
    results:
      - task:
          type: token-classification
          name: Dataset Mention Extraction
        metrics:
          - type: f1
            value: 84.8
            name: F1 (max_tokens=512)
          - type: precision
            value: 90.0
            name: Precision
          - type: recall
            value: 80.2
            name: Recall
---

# Dataset Use Extraction

A fine-tuned [GLiNER2](https://huggingface.co/fastino/gliner2-large-v1) adapter for extracting structured dataset mentions from research documents and policy papers.

Developed as part of the **AI for Data—Data for AI** program, a collaboration between the **World Bank** and **UNHCR**, to monitor and measure data use across development research.

## Overview

This model identifies and extracts structured information about datasets mentioned in text, including formal survey names, descriptive data references, and vague data allusions. It extracts rich metadata for each mention including the dataset name, acronym, producer, geography, data type, and usage context.

## Performance

Evaluated on a held-out test set of 199 annotated text passages:

| Metric | Score |
|---|---|
| **F1** | **84.8%** |
| Precision | 90.0% |
| Recall | 80.2% |

### Performance by mention type

| Tag | Gold mentions | Found | Recall |
|---|---|---|---|
| Named | 394 | 317 | 80.5% |
| Descriptive | 135 | 108 | 80.0% |
| Vague | 87 | 70 | 80.5% |
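
The reported numbers are internally consistent: the headline F1 is the standard harmonic mean of precision and recall (assuming that definition), and each per-tag recall follows from the found/gold counts above. A quick sanity check:

```python
# Verify the headline F1 as the harmonic mean of precision and recall
# (assumption: standard F1 definition).
precision, recall = 0.900, 0.802
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 1))  # 84.8

# Verify per-tag recall = found / gold mentions.
per_tag = {"Named": (317, 394), "Descriptive": (108, 135), "Vague": (70, 87)}
for tag, (found, total) in per_tag.items():
    print(tag, round(100 * found / total, 1))
```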

## Extracted Fields

For each dataset mention, the model extracts up to 13 structured fields:

| Field | Type | Description |
|---|---|---|
| `dataset_name` | string | Name or description of the dataset |
| `acronym` | string | Abbreviation (e.g., "DHS", "LSMS") |
| `author` | string | Individual author(s) |
| `producer` | string | Organization that created the dataset |
| `publication_year` | string | Year published |
| `reference_year` | string | Year data was collected |
| `reference_population` | string | Target population |
| `geography` | string | Geographic coverage |
| `description` | string | Content description |
| `data_type` | choice | survey, census, database, administrative, indicator, geospatial, microdata, report, other |
| `dataset_tag` | choice | named, descriptive, vague |
| `usage_context` | choice | primary, supporting, background |
| `is_used` | choice | True, False |
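
For illustration, a single extracted mention with all 13 fields might look like the following. The values here are invented for demonstration, not actual model output; only the field names and choice sets come from the table above:

```python
# Hypothetical example of one extracted mention -- values are invented
# for illustration; real output comes from the model.
mention = {
    "dataset_name": "Demographic and Health Survey",
    "acronym": "DHS",
    "author": "",
    "producer": "DHS Program",
    "publication_year": "2021",
    "reference_year": "2020",
    "reference_population": "women aged 15-49",
    "geography": "Ghana",
    "description": "nationally representative household survey",
    "data_type": "survey",       # one of the data_type choices
    "dataset_tag": "named",      # named / descriptive / vague
    "usage_context": "primary",  # primary / supporting / background
    "is_used": "True",           # "True" / "False"
}
assert len(mention) == 13
```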

## Usage

### With `ai4data` library (recommended)

```bash
pip install git+https://github.com/rafmacalaba/monitoring_of_datause.git
```

```python
from ai4data import extract_from_text, extract_from_document

# Extract from text
text = """We use the Demographic and Health Survey (DHS) from 2020 as our
primary data source to analyze outcomes in Ghana. For robustness checks,
we also reference the Ghana Living Standards Survey (GLSS) from 2012."""

results = extract_from_text(text)
for ds in results["datasets"]:
    print(f"  {ds['dataset_name']} [{ds['dataset_tag']}]")

# Extract from PDF (URL or local file)
url = "https://documents1.worldbank.org/curated/en/.../report.pdf"
results = extract_from_document(url)
```

### With GLiNER2 directly

```python
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

# Load base model + adapter
model = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
adapter_path = snapshot_download("ai4data/datause-extraction")
model.load_adapter(adapter_path)

# Define extraction schema
schema = (
    model.create_schema()
    .structure("dataset_mention")
        .field("dataset_name", dtype="str")
        .field("acronym", dtype="str")
        .field("producer", dtype="str")
        .field("geography", dtype="str")
        .field("description", dtype="str")
        .field("data_type", dtype="str",
               choices=["survey", "census", "database", "administrative",
                        "indicator", "geospatial", "microdata", "report", "other"])
        .field("dataset_tag", dtype="str",
               choices=["named", "descriptive", "vague"])
        .field("usage_context", dtype="str",
               choices=["primary", "supporting", "background"])
        .field("is_used", dtype="str", choices=["True", "False"])
)

results = model.extract(text, schema)
for mention in results["dataset_mention"]:
    print(mention)
```

## Training Details

- **Base model**: [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa-v3-large encoder)
- **Method**: LoRA (r=16, alpha=32)
- **Training data**: ~3,400 synthetic examples (v8 dataset) generated with GPT-4o and Gemini 2.5 Flash
- **Max context**: 512 tokens (aligned with DeBERTa-v3 position embeddings)
- **Data format**: Context-aware passages with markdown formatting, footnotes, and structured annotations
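
The LoRA hyperparameters above could be expressed with a Hugging Face `peft`-style config. This is only a sketch under the assumption of a standard PEFT setup; the actual training code for this adapter is not shown here, so treat the config (and any target modules you would add) as assumptions:

```python
# Hypothetical peft-style config matching the reported hyperparameters;
# the real training script may differ.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,           # LoRA rank, as reported
    lora_alpha=32,  # LoRA scaling factor, as reported
)
```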

## Limitations

- Optimized for English-language research documents and policy papers
- Best suited for World Bank-style development research documents
- May not generalize well to non-research text (news articles, social media, etc.)
- Requires the `fastino/gliner2-large-v1` base model

## Citation

If you use this model, please cite:

```bibtex
@misc{ai4data-datause-extraction,
  title={Dataset Use Extraction Model},
  author={AI for Data—Data for AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ai4data/datause-extraction}
}
```

## Links

- **Library**: [ai4data](https://github.com/rafmacalaba/monitoring_of_datause)
- **Base model**: [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1)
- **Program**: [AI for Data—Data for AI](https://www.worldbank.org/en/programs/ai4data) (World Bank & UNHCR)