@@ -5,202 +5,129 @@ tags:
 - base_model:adapter:fastino/gliner2-large-v1
 - lora
 - transformers
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 ## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]
-### Framework versions
-- PEFT 0.19.1

 - base_model:adapter:fastino/gliner2-large-v1
 - lora
 - transformers
+- gliner2
+- dataset-extraction
+- data-use
 ---
+# GLiNER2 Data Use Extraction Adapter (v2)
+This is a LoRA adapter for `fastino/gliner2-large-v1`, fine-tuned to extract datasets, data mentions, and their relations from academic papers, research documents, and reports, with a focus on World Bank and UNHCR documents.
+- **Repository:** [https://github.com/rafmacalaba/monitoring_of_datause](https://github.com/rafmacalaba/monitoring_of_datause)
+- **Base Model:** `fastino/gliner2-large-v1`
+- **Adapter ID:** `ai4data/datause-extraction-v2`
+---
+## Model Schema & Label Prefix
+The model is trained on a **7-field all-string schema** to optimize the GLiNER2 count head and prevent collapse.
+### Label Prefix
+The following fixed label prefix **must** be prepended to every input text:
+```
+specificity: named | descriptive | vague usage: primary | supporting | background |
+```
+### Schema Structure
+```python
+SCHEMA = {
+    "data_mention": [
+        "name::str::The exact full name of the data source or dataset",
+        "acronym::str::The acronym or abbreviation if any",
+        "specificity::str::Whether this mention is named, descriptive, or vague",
+        "usage::str::Whether this is primary, supporting, or background data",
+        "datatype::str::The type of data verbatim from text such as survey, report, census, program, system, or assessment",
+        "producer::str::The organization or entity that produced or published the data",
+        "timeframe::str::The year or time period of the data such as 2019 or 2019 to 2020",
+    ]
+}
+```
+---
 ## How to Get Started with the Model
+### 1. Using the `ai4data` Python Library (Recommended)
+If you are using the repository's native wrapper library, simply import and run:
+```python
+from ai4data import extract_from_text, extract_from_document
+text = """Our analysis uses the 2022 Demographic and Health Survey (DHS) conducted by
+the National Statistics Office. We complement this with administrative systems, but
+only the DHS is used in the empirical models."""
+# Extract from raw text
+results = extract_from_text(text)
+print(results["datasets"])
+# Extract from a PDF document
+pdf_results = extract_from_document("report.pdf", pages=[0, 1, 2])
+print(pdf_results)
+```
+### 2. Direct Usage via the Standard `gliner2` Library
+If you want to use the raw `gliner2` model and adapter:
+```python
+from gliner2 import GLiNER2
+from huggingface_hub import snapshot_download
+# Load the base model and LoRA adapter
+model = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
+adapter_path = snapshot_download("ai4data/datause-extraction-v2")
+model.load_adapter(adapter_path)
+model.eval()
+# Configure the entity and relation schemas
+schema = model.create_schema()
+schema.entities({
+    "name": "The exact full name of the data source or dataset",
+    "acronym": "The acronym or abbreviation if any",
+    "producer": "The organization or entity that produced the data",
+    "timeframe": "The year or time period such as 2019 or 2019 to 2020",
+    "datatype": "The type of data verbatim from text",
+    "specificity": "Whether this mention is named, descriptive, or vague",
+    "usage": "Whether this is primary, supporting, or background data",
+})
+schema.relations({
+    "has_acronym": "The acronym of the dataset",
+    "has_producer": "The producer of the dataset",
+    "has_timeframe": "The timeframe of the dataset",
+    "has_datatype": "The data type of the dataset",
+    "has_specificity": "Whether this dataset is named, descriptive, or vague",
+    "has_usage": "Whether this dataset is primary, supporting, or background",
+})
+# Format the input text with the required label prefix
+text = "We use the Ghana Living Standards Survey (GLSS) 2020 conducted by Ghana Statistical Service."
+prefix = "specificity: named | descriptive | vague usage: primary | supporting | background |"
+prefixed_text = f"{prefix} {text}"
+# Extract
+result = model.extract(
+    prefixed_text,
+    schema,
+    threshold=0.3,
+    include_confidence=True,
+    include_spans=True,
+)
+print(result)
+```
+## Annotation Guidelines & What Counts as a Data Mention
+### Specificity Taxonomy
+- **`named`**: A specific, citable dataset (e.g., `"DHS 2020"`, `"World Development Indicators"`, `"Ghana Living Standards Survey (GLSS)"`).
+- **`descriptive`**: A general category of data, not a specific named dataset (e.g., `"household survey data"`, `"administrative records"`, `"panel data on firms"`).
+- **`vague`**: An indirect or ambiguous reference (e.g., `"available data"`, `"our dataset"`, `"the data used in this study"`).
+### Usage Context
+- **`primary`**: Core data driving the main analysis in the report.
+- **`supporting`**: Secondary data used to validate, calibrate, or provide robustness checks.
+- **`background`**: Mentioned in passing, in a literature review, or as historical context.
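
The label prefix and the two taxonomies described in the new card can be sketched as a pair of small helpers. This is only an illustration and is not part of the diff: `add_label_prefix` and `validate_labels` are hypothetical names that do not exist in `gliner2` or the `ai4data` library, and the flat dictionary shape used for a mention is an assumption rather than the model's documented output format.

```python
# Hypothetical helpers for illustration; not part of gliner2 or ai4data.

# Fixed label prefix, copied verbatim from the model card.
LABEL_PREFIX = (
    "specificity: named | descriptive | vague "
    "usage: primary | supporting | background |"
)

# Allowed values from the card's Specificity Taxonomy and Usage Context lists.
SPECIFICITY_LABELS = {"named", "descriptive", "vague"}
USAGE_LABELS = {"primary", "supporting", "background"}


def add_label_prefix(text: str) -> str:
    """Prepend the fixed label prefix unless the text already starts with it."""
    if text.startswith(LABEL_PREFIX):
        return text
    return f"{LABEL_PREFIX} {text}"


def validate_labels(mention: dict) -> list[str]:
    """Return a list of problems with a mention's specificity/usage values.

    Assumes (hypothetically) that a mention is a flat dict of strings.
    """
    problems = []
    if mention.get("specificity") not in SPECIFICITY_LABELS:
        problems.append(f"unexpected specificity: {mention.get('specificity')!r}")
    if mention.get("usage") not in USAGE_LABELS:
        problems.append(f"unexpected usage: {mention.get('usage')!r}")
    return problems


prefixed = add_label_prefix("We use the DHS 2020.")
issues = validate_labels(
    {"name": "DHS 2020", "specificity": "named", "usage": "main"}
)
```

Making the prefix helper idempotent avoids accidentally doubling the prefix when text passes through more than one preprocessing step, and a check like `validate_labels` may help flag outputs that fall outside the documented label sets before they reach downstream analysis.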