---
base_model: fastino/gliner2-large-v1
library_name: peft
tags:
- base_model:adapter:fastino/gliner2-large-v1
- lora
- transformers
- gliner2
- dataset-extraction
- data-use
---
# GLiNER2 Data Use Extraction Adapter (v2)
This is a fine-tuned LoRA adapter for `fastino/gliner2-large-v1` trained to extract datasets, data mentions, and their relations from academic papers, research, and reports (with a focus on World Bank/UNHCR documents).
- **Repository:** [https://github.com/rafmacalaba/monitoring_of_datause](https://github.com/rafmacalaba/monitoring_of_datause)
- **Base Model:** `fastino/gliner2-large-v1`
- **Adapter ID:** `ai4data/datause-extraction-v2`
---
## How to Get Started with the Model
It is **highly recommended** to use this model through the official **`ai4data`** Python library wrapper. The library automatically handles:
- **Markdown-aware chunking** (respecting model context limits).
- **Character offset adjustment** across chunks and pages, so spans map back to the original document.
- **Greedy overlap resolution** and text normalization.
- **Pre-filtering classifiers** that skip non-data pages.
- **Deduplication** and acronym matching.
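The offset adjustment and greedy overlap resolution listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `ai4data` implementation: `Span`, `chunk_text`, `to_document_offsets`, `resolve_overlaps`, and the literal-match `find_term` stand-in for the model are hypothetical names, and fixed-size character windows are used in place of Markdown-aware chunking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    start: int         # character offset, inclusive
    end: int           # character offset, exclusive
    confidence: float


def chunk_text(text: str, size: int, overlap: int) -> list[tuple[int, str]]:
    """Split text into overlapping windows, returning (chunk_start, chunk) pairs."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks = []
    for start in range(0, len(text), size - overlap):
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
    return chunks


def to_document_offsets(chunk_start: int, spans: list[Span]) -> list[Span]:
    """Shift chunk-local offsets so they index into the full document."""
    return [
        Span(s.text, s.start + chunk_start, s.end + chunk_start, s.confidence)
        for s in spans
    ]


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    """Greedy resolution: visit spans by descending confidence (longer first on ties)
    and keep a span only if it does not overlap one that was already kept."""
    kept: list[Span] = []
    ranked = sorted(spans, key=lambda s: (-s.confidence, -(s.end - s.start), s.start))
    for cand in ranked:
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


def find_term(chunk: str, term: str, confidence: float) -> list[Span]:
    """Hypothetical stand-in for the model: locate a literal term in one chunk."""
    spans, pos = [], chunk.find(term)
    while pos != -1:
        spans.append(Span(term, pos, pos + len(term), confidence))
        pos = chunk.find(term, pos + 1)
    return spans


document = "We use the Demographic and Health Survey (DHS). Only the DHS is used."
merged: list[Span] = []
for chunk_start, chunk in chunk_text(document, size=40, overlap=15):
    merged.extend(to_document_offsets(chunk_start, find_term(chunk, "DHS", 0.9)))

resolved = resolve_overlaps(merged)
for span in resolved:
    # Every surviving span slices the original document exactly.
    print(span.start, span.end, document[span.start:span.end])
```

Because the windows overlap, the same mention can be detected twice; once both detections are expressed in document coordinates, the overlap step collapses them into one.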
### 1. Installation
Clone and install the repository:
```bash
git clone https://github.com/rafmacalaba/monitoring_of_datause
cd monitoring_of_datause
uv sync
```
### 2. Python Usage
To extract dataset mentions and their attributes (like timeframe, producer, and acronyms):
```python
from ai4data import extract_from_text, extract_from_document
text = """Our analysis uses the 2022 Demographic and Health Survey (DHS) conducted by
the National Statistics Office. We complement this with administrative systems, but
only the DHS is used in the empirical models."""
# Extract from raw text
results = extract_from_text(text)
print(results["datasets"])
# Extract from a PDF document
pdf_results = extract_from_document("report.pdf", pages=[0, 1, 2])
print(pdf_results)
```
---
## Model Schema & Response Structure
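For each data mention, the model extracts the mention name plus up to eight attributes. When querying via `ai4data`, each extracted entity in the `"datasets"` list has the following structure (nine fields, each carrying `text`, `confidence`, and character offsets `start` and `end`):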
```json
{
  "mention_name": {
    "text": "Demographic and Health Survey",
    "confidence": 0.9999,
    "start": 23,
    "end": 52
  },
  "specificity_tag": {
    "text": "named",
    "confidence": 0.9999,
    "start": 23,
    "end": 52
  },
  "usage_context": {
    "text": "primary",
    "confidence": 0.9999,
    "start": 23,
    "end": 52
  },
  "typology_tag": {
    "text": "survey",
    "confidence": 0.9999,
    "start": 23,
    "end": 52
  },
  "acronym": {
    "text": "DHS",
    "confidence": 0.9996,
    "start": 54,
    "end": 57
  },
  "producer": {
    "text": "National Statistics Office",
    "confidence": 0.9999,
    "start": 72,
    "end": 98
  },
  "reference_year": {
    "text": "2022",
    "confidence": 0.9999,
    "start": 18,
    "end": 22
  },
  "is_used": {
    "text": "True",
    "confidence": 0.9999,
    "start": 23,
    "end": 52
  },
  "geography": {
    "text": "",
    "confidence": 0.9999,
    "start": 23,
    "end": 52
  }
}
```
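A response in this shape is often easier to work with once flattened into one row per mention. The sketch below assumes only the field layout shown above; `flatten_mention` and the `min_confidence` threshold are hypothetical helpers, not part of `ai4data`, and the sample is a trimmed copy of the example entity.

```python
def flatten_mention(mention: dict, min_confidence: float = 0.5) -> dict:
    """Collapse one mention into {field: text}, skipping empty or low-confidence fields."""
    flat = {}
    for field, value in mention.items():
        text = value.get("text", "")
        if text and value.get("confidence", 0.0) >= min_confidence:
            flat[field] = text
    return flat


# Trimmed copy of the example entity above.
sample = {
    "mention_name": {"text": "Demographic and Health Survey", "confidence": 0.9999, "start": 23, "end": 52},
    "acronym": {"text": "DHS", "confidence": 0.9996, "start": 54, "end": 57},
    "producer": {"text": "National Statistics Office", "confidence": 0.9999, "start": 72, "end": 98},
    "reference_year": {"text": "2022", "confidence": 0.9999, "start": 18, "end": 22},
    "is_used": {"text": "True", "confidence": 0.9999, "start": 23, "end": 52},
    "geography": {"text": "", "confidence": 0.9999, "start": 23, "end": 52},
}

row = flatten_mention(sample)
# In the example, `is_used` arrives as the string "True"; convert it to a real boolean.
row["is_used"] = row.get("is_used") == "True"
print(row)
```

Fields whose `text` is empty, such as `geography` in the example, are dropped rather than stored as empty strings, so a missing key downstream means "not extracted".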
---
## Annotation Guidelines & What Counts as a Data Mention
### Specificity Taxonomy
- **`named`**: A specific, citable dataset (e.g., `"DHS 2020"`, `"World Development Indicators"`, `"Ghana Living Standards Survey (GLSS)"`)
- **`descriptive`**: A general category of data, not a specific named dataset (e.g., `"household survey data"`, `"administrative records"`, `"panel data on firms"`)
- **`vague`**: An indirect or ambiguous reference (e.g., `"available data"`, `"our dataset"`, `"the data used in this study"`)
### Usage Context
- **`primary`**: Core data driving the main analysis in the report.
- **`supporting`**: Secondary data used to validate, calibrate, or provide robustness checks.
- **`background`**: Mentioned in passing, in a literature review, or as historical context.
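The two taxonomies above combine naturally as filters, for example to keep only named datasets that drive the main analysis. The following is a minimal sketch: `label_of` and `select` are hypothetical helpers, and the three sample mentions pair example phrases from the taxonomy with labels chosen purely for illustration, not with real model output.

```python
from collections import Counter

SPECIFICITY_TAGS = ("named", "descriptive", "vague")
USAGE_CONTEXTS = ("primary", "supporting", "background")


def label_of(mention: dict, field: str) -> str:
    """Read the text of one field, tolerating fields that are missing."""
    return mention.get(field, {}).get("text", "")


def select(mentions: list[dict], specificity: str, usage: str) -> list[str]:
    """Return mention names whose specificity and usage labels both match."""
    if specificity not in SPECIFICITY_TAGS or usage not in USAGE_CONTEXTS:
        raise ValueError("unknown specificity or usage label")
    return [
        label_of(m, "mention_name")
        for m in mentions
        if label_of(m, "specificity_tag") == specificity
        and label_of(m, "usage_context") == usage
    ]


# Illustrative mentions only; the label assignments are assumptions.
mentions = [
    {"mention_name": {"text": "World Development Indicators"},
     "specificity_tag": {"text": "named"}, "usage_context": {"text": "primary"}},
    {"mention_name": {"text": "household survey data"},
     "specificity_tag": {"text": "descriptive"}, "usage_context": {"text": "supporting"}},
    {"mention_name": {"text": "available data"},
     "specificity_tag": {"text": "vague"}, "usage_context": {"text": "background"}},
]

named_primary = select(mentions, specificity="named", usage="primary")
by_specificity = Counter(label_of(m, "specificity_tag") for m in mentions)
print(named_primary, dict(by_specificity))
```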