| base_model: fastino/gliner2-large-v1 | |
| library_name: peft | |
| tags: | |
| - base_model:adapter:fastino/gliner2-large-v1 | |
| - lora | |
| - transformers | |
| - gliner2 | |
| - dataset-extraction | |
| - data-use | |
| # GLiNER2 Data Use Extraction Adapter (v2) | |
| This is a fine-tuned LoRA adapter for `fastino/gliner2-large-v1` trained to extract datasets, data mentions, and their relations from academic papers, research, and reports (with a focus on World Bank/UNHCR documents). | |
| - **Repository:** [https://github.com/rafmacalaba/monitoring_of_datause](https://github.com/rafmacalaba/monitoring_of_datause) | |
| - **Base Model:** `fastino/gliner2-large-v1` | |
| - **Adapter ID:** `ai4data/datause-extraction-v2` | |
| --- | |
| ## How to Get Started with the Model | |
| It is **highly recommended** to use this model through the official **`ai4data`** Python library wrapper. The library automatically handles: | |
| - **Markdown-aware chunking** (respecting model context limits). | |
| - **Character offset adjustment** so that spans found in each chunk map back to positions in the full document. | |
| - **Greedy overlap resolution** and text normalization. | |
| - **Pre-filtering classifiers** that skip pages without data mentions. | |
| - **Deduplication** and acronym matching. | |
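The offset adjustment and overlap resolution steps listed above can be illustrated with a minimal sketch. This is not the `ai4data` implementation: `chunk_text`, `to_document_offsets`, and `resolve_overlaps` are hypothetical helper names, the chunking is a naive fixed-size window rather than Markdown-aware, and "greedy" is assumed here to mean "keep the highest-confidence span first".

```python
from typing import Any, Dict, List, Tuple

Span = Dict[str, Any]  # expected keys: "text", "confidence", "start", "end"


def chunk_text(text: str, size: int, overlap: int) -> List[Tuple[int, str]]:
    """Split text into fixed-size windows; return (document_offset, chunk) pairs."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    stop = max(len(text) - overlap, 1)
    return [(i, text[i:i + size]) for i in range(0, stop, step)]


def to_document_offsets(span: Span, chunk_offset: int) -> Span:
    """Shift chunk-local character offsets to document-level offsets."""
    return {
        **span,
        "start": span["start"] + chunk_offset,
        "end": span["end"] + chunk_offset,
    }


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Keep spans in descending confidence order, dropping any that overlap a kept span."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: s["confidence"], reverse=True):
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


# Demo: a plain string search stands in for the model (assumption for illustration),
# and the confidence values are made up.
document = "0123456789 We use the DHS survey for analysis."
candidates: List[Span] = []
chunks = chunk_text(document, size=30, overlap=10)
for confidence, (offset, chunk) in zip((0.90, 0.95), chunks):
    local = chunk.find("DHS")
    if local != -1:
        span = {"text": "DHS", "confidence": confidence, "start": local, "end": local + 3}
        candidates.append(to_document_offsets(span, offset))

resolved = resolve_overlaps(candidates)
print(resolved)
```

Because the two windows overlap, the same mention is found twice with different chunk-local offsets. After shifting, both candidates point at the same document span, and the greedy pass keeps only the more confident one.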
| ### 1. Installation | |
| Clone the repository (URL listed under **Repository** above) and install its dependencies: | |
| ```bash | |
| git clone <repository-url> | |
| cd monitoring_of_datause | |
| uv sync | |
| ``` | |
| ### 2. Python Usage | |
| To extract dataset mentions and their attributes (like timeframe, producer, and acronyms): | |
| ```python | |
| from ai4data import extract_from_text, extract_from_document | |
| text = """Our analysis uses the 2022 Demographic and Health Survey (DHS) conducted by | |
| the National Statistics Office. We complement this with administrative systems, but | |
| only the DHS is used in the empirical models.""" | |
| # Extract from raw text | |
| results = extract_from_text(text) | |
| print(results["datasets"]) | |
| # Extract from a PDF document | |
| pdf_results = extract_from_document("report.pdf", pages=[0, 1, 2]) | |
| print(pdf_results) | |
| ``` | |
| --- | |
| ## Model Schema & Response Structure | |
| The model extracts a mention name plus a set of attributes for each data mention. When querying via `ai4data`, each extracted entity in the `"datasets"` list has the following structure (this example shows the mention name and eight attribute fields): | |
| ```json | |
| { | |
| "mention_name": { | |
| "text": "Demographic and Health Survey", | |
| "confidence": 0.9999, | |
| "start": 23, | |
| "end": 52 | |
| }, | |
| "specificity_tag": { | |
| "text": "named", | |
| "confidence": 0.9999, | |
| "start": 23, | |
| "end": 52 | |
| }, | |
| "usage_context": { | |
| "text": "primary", | |
| "confidence": 0.9999, | |
| "start": 23, | |
| "end": 52 | |
| }, | |
| "typology_tag": { | |
| "text": "survey", | |
| "confidence": 0.9999, | |
| "start": 23, | |
| "end": 52 | |
| }, | |
| "acronym": { | |
| "text": "DHS", | |
| "confidence": 0.9996, | |
| "start": 54, | |
| "end": 57 | |
| }, | |
| "producer": { | |
| "text": "National Statistics Office", | |
| "confidence": 0.9999, | |
| "start": 72, | |
| "end": 98 | |
| }, | |
| "reference_year": { | |
| "text": "2022", | |
| "confidence": 0.9999, | |
| "start": 18, | |
| "end": 22 | |
| }, | |
| "is_used": { | |
| "text": "True", | |
| "confidence": 0.9999, | |
| "start": 23, | |
| "end": 52 | |
| }, | |
| "geography": { | |
| "text": "", | |
| "confidence": 0.9999, | |
| "start": 23, | |
| "end": 52 | |
| } | |
| } | |
| ``` | |
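For downstream use, it can help to collapse an entry of this shape into a flat record. The sketch below assumes only the field layout shown above (`text`, `confidence`, `start`, `end`); `flatten_entry` and `span_matches` are hypothetical helpers, not part of `ai4data`, and the sample entry and the 0.5 threshold are made up. It also assumes offsets are half-open, Python-style slices, which is consistent with `end - start` equalling the length of `text` in the example.

```python
from typing import Any, Dict, Optional

Field = Dict[str, Any]  # expected keys: "text", "confidence", "start", "end"


def flatten_entry(
    entry: Dict[str, Field], min_confidence: float = 0.5
) -> Dict[str, Optional[str]]:
    """Collapse one entry to {field_name: text}, using None for empty or low-confidence fields."""
    flat: Dict[str, Optional[str]] = {}
    for name, field in entry.items():
        text = field.get("text") or None  # treat "" as missing
        if text is not None and field.get("confidence", 0.0) < min_confidence:
            text = None
        flat[name] = text
    return flat


def span_matches(source_text: str, field: Field) -> bool:
    """Check that the offsets, read as a half-open slice, reproduce the field text."""
    return source_text[field["start"]:field["end"]] == field["text"]


# Made-up sample entry in the same shape as the structure above.
source = "We use the DHS."
entry = {
    "mention_name": {"text": "DHS", "confidence": 0.99, "start": 11, "end": 14},
    "specificity_tag": {"text": "named", "confidence": 0.98, "start": 11, "end": 14},
    "producer": {"text": "Unknown Office", "confidence": 0.20, "start": 0, "end": 2},
    "geography": {"text": "", "confidence": 0.99, "start": 11, "end": 14},
}

flat = flatten_entry(entry)
print(flat)
print(span_matches(source, entry["mention_name"]))
```

In the example structure, tag fields such as `specificity_tag` carry the offsets of the mention they describe, so their `text` is a label rather than a substring of the source. A check like `span_matches` is therefore only meaningful for extractive fields such as `mention_name`, `acronym`, or `producer`.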
| --- | |
| ## Annotation Guidelines & What Counts as a Data Mention | |
| ### Specificity Taxonomy | |
| - **`named`**: A specific, citable dataset (e.g., `"DHS 2020"`, `"World Development Indicators"`, `"Ghana Living Standards Survey (GLSS)"`) | |
| - **`descriptive`**: A general category of data, not a specific named dataset (e.g., `"household survey data"`, `"administrative records"`, `"panel data on firms"`) | |
| - **`vague`**: An indirect or ambiguous reference (e.g., `"available data"`, `"our dataset"`, `"the data used in this study"`) | |
| ### Usage Context | |
| - **`primary`**: Core data driving the main analysis in the report. | |
| - **`supporting`**: Secondary data used to validate, calibrate, or provide robustness checks. | |
| - **`background`**: Mentioned in passing, in a literature review, or as historical context. |
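These labels make it straightforward to narrow results to the mentions that matter for a given analysis. The following is a minimal sketch, assuming each entry exposes its labels under `specificity_tag` and `usage_context` as in the response structure; `label` and `select_mentions` are hypothetical helpers, and the usage contexts assigned to the sample mentions are made up for illustration.

```python
from typing import Any, Dict, Iterable, List

Entry = Dict[str, Dict[str, Any]]


def label(entry: Entry, field: str) -> str:
    """Return the text label stored under a field, or "" if the field is absent."""
    return entry.get(field, {}).get("text", "")


def select_mentions(
    datasets: Iterable[Entry],
    specificity: Iterable[str] = ("named",),
    usage: Iterable[str] = ("primary", "supporting"),
) -> List[Entry]:
    """Keep entries whose specificity and usage labels are both in the allowed sets."""
    allowed_specificity = set(specificity)
    allowed_usage = set(usage)
    return [
        entry
        for entry in datasets
        if label(entry, "specificity_tag") in allowed_specificity
        and label(entry, "usage_context") in allowed_usage
    ]


def make_entry(name: str, specificity: str, usage: str) -> Entry:
    """Build a reduced sample entry (confidence and offsets omitted for brevity)."""
    return {
        "mention_name": {"text": name},
        "specificity_tag": {"text": specificity},
        "usage_context": {"text": usage},
    }


sample = [
    make_entry("World Development Indicators", "named", "primary"),
    make_entry("household survey data", "descriptive", "supporting"),
    make_entry("available data", "vague", "background"),
    make_entry("DHS 2020", "named", "background"),
]

named_in_use = [label(e, "mention_name") for e in select_mentions(sample)]
broader = [
    label(e, "mention_name")
    for e in select_mentions(sample, specificity=("named", "descriptive"))
]
print(named_in_use)
print(broader)
```

Defaulting to `named` mentions with `primary` or `supporting` usage is one possible design choice: it favors citable datasets that actually feed the analysis and drops passing references.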