---
language: en
license: apache-2.0
tags:
- gliner
- named-entity-recognition
- token-classification
- data-mention-extraction
- survey-detection
- geocoded-data
- DHS
- census
- ml-intern
library_name: gliner
base_model: urchade/gliner_large-v2.1
---

# GLiNER Large — Data Mention Extraction

Fine-tuned from [`urchade/gliner_large-v2.1`](https://huggingface.co/urchade/gliner_large-v2.1) to extract **data source mentions** from social science and global health research papers.

> **Status:** Training job pending. See `train_gliner.py` for the complete self-contained training script.

---

## Entity Types

| Label | Description | Examples |
|---|---|---|
| `SURVEY` | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer |
| `DATASET` | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset |
| `DATABASE` | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3 |
| `GEOCODED_DATA` | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data |
| `VAGUE_MENTION` | Vague/informal references | "a survey from Ghana", "a nationally representative household survey" |

---

## Usage (after training)

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."

entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)

for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.3f})")
# Expected output format (scores are illustrative; the model is not trained yet):
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.870)
# [SURVEY] '2018 DHS' (score=0.920)
# [DATABASE] 'WorldPop gridded population database' (score=0.830)
```

---

## Training Data

- **121 manually annotated sentences** / **175 entity spans**
- **Domain:** Social science, global health, demographic surveys, GIS/spatial data
- **No public dataset available** — custom-built for this task
- All data is embedded inline in `train_gliner.py` for portability

Entity distribution:
```
SURVEY        : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE      : 38 spans (21.7%)
DATASET       : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
```
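
The percentages above follow directly from the span counts. A minimal sketch that recomputes them (the counts are copied from the distribution above; the variable names are illustrative and not taken from `train_gliner.py`):

```python
# Span counts per label, copied from the distribution above.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}

total_spans = sum(span_counts.values())

# Share of each label as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total_spans, 1) for label, n in span_counts.items()}

for label, n in span_counts.items():
    print(f"{label:<14}: {n} spans ({shares[label]}%)")
print(f"total: {total_spans} spans")
```

Rerunning a check like this after adding examples is a quick way to see whether the minority labels are catching up.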

---

## Training Configuration

Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arXiv:2311.08526):

| Parameter | Value |
|---|---|
| Base model | `urchade/gliner_large-v2.1` (DeBERTa-v3-large) |
| Epochs | 25 |
| Batch size | 8 |
| Encoder LR | 1e-5 |
| Head LR (others_lr) | 5e-5 |
| LR scheduler | cosine |
| Warmup | 8% of total steps |
| Loss | sum reduction, focal_loss_alpha=-1 |
| Negatives ratio | 1.0 |
| Masking | global |
| FP16 | True (GPU) |
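
As a rough sketch, the table could be written down as a set of keyword arguments. The GLiNER-specific names (`others_lr`, `loss_reduction`, `focal_loss_alpha`, `negatives`, `masking`) are the ones this card lists as validated for `gliner==0.2.26`; the remaining names are assumed to follow the standard Hugging Face `TrainingArguments` conventions. `train_gliner.py` remains the reference for the values actually used.

```python
# Hypothetical mapping of the configuration table onto keyword arguments.
# This only builds and prints a dict; it does not import gliner or start training.
training_kwargs = {
    "num_train_epochs": 25,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,          # encoder LR
    "others_lr": 5e-5,              # head LR
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.08,           # 8% of total steps
    "loss_reduction": "sum",
    "focal_loss_alpha": -1,
    "negatives": 1.0,               # negatives ratio
    "masking": "global",
    "fp16": True,                   # only meaningful on a CUDA GPU
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

Keeping the hyperparameters in one dict makes it easy to log them alongside the run or to override single values for a smoke test.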

---

## How to Run Training

### Option A: HF Jobs (recommended)

```python
# Assumes a recent huggingface_hub release that ships the Jobs API (run_uv_job).
from huggingface_hub import run_uv_job

job = run_uv_job(
    "https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    },
)
print(job.id)  # keep the job ID for the monitoring step
```

Or via the ML Intern agent tool:
```
hf_jobs(
  operation="run",
  script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
  dependencies=["gliner==0.2.26", "torch", "trackio"],
  hardware_flavor="t4-small",
  timeout="3h",
  trackio_space_id="rafmacalaba/ml-intern-gliner01",
  trackio_project="gliner-data-mentions"
)
```

### Option B: Local GPU

```bash
pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test
```

---

## Agent Execution Plan

An autonomous agent completing this task should follow these steps **in order**:

### Step 1 — Verify prerequisites ✅ (already done)
- [x] `gliner==0.2.26` API validated: `Trainer`, `TrainingArguments`, `model._create_data_collator()`
- [x] `TrainingArguments` has `others_lr`, `loss_reduction`, `negatives`, `masking`, `focal_loss_*` params
- [x] `GLiNER.push_to_hub()` confirmed available
- [x] `report_to="trackio"` natively supported in `TrainingArguments`
- [x] Training data: 121 examples, 175 spans, 0 validation errors
- [x] Hub repo created: `rafmacalaba/gliner-large-data-mentions`

### Step 2 — Submit training job
```
hf_jobs(
  operation="run",
  script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
  dependencies=["gliner==0.2.26", "torch", "trackio"],
  hardware_flavor="t4-small",
  timeout="3h",
  trackio_space_id="rafmacalaba/ml-intern-gliner01",
  trackio_project="gliner-data-mentions"
)
```

Expected: the job starts, downloads `urchade/gliner_large-v2.1` (~900MB), and trains for 25 epochs. With ~103 training examples at batch size 8, that is 13 steps per epoch and 325 steps in total, or roughly 20-40 min on a T4.
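
The step counts quoted above can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: the final partial batch of each epoch is kept rather than dropped, and the warmup length is the warmup ratio times the total step count.

```python
import math

num_train_examples = 103   # approximate figure quoted above
batch_size = 8
num_epochs = 25
warmup_ratio = 0.08        # 8% of total steps, from the configuration table

# The last batch of each epoch is smaller than batch_size, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = round(total_steps * warmup_ratio)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
```

If the training set grows during iteration, updating `num_train_examples` gives the new totals to expect in the logs.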

### Step 3 — Monitor
```
hf_jobs(operation="logs", job_id="<job_id_from_step2>")
```
Look for `=== Starting training ===`, followed by a `{'loss': ..., 'epoch': ...}` log line every 5 steps.
If the job runs out of memory (OOM), reduce `batch_size` to 4 and increase `gradient_accumulation_steps` to 2 so that the effective batch size stays at 8.
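
The OOM fallback trades per-step memory for extra accumulation steps. A small sketch of the arithmetic (the helper name is illustrative and a single GPU is assumed):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


baseline = effective_batch_size(8, 1)   # configuration from the table
fallback = effective_batch_size(4, 2)   # OOM fallback suggested above

print(f"baseline: {baseline}, fallback: {fallback}")
```

Because both settings update the weights on the same number of examples, the learning rates should not need retuning when switching to the fallback.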

### Step 4 — Verify Hub push
```
hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")
```
Expect: `pytorch_model.bin` or `model.safetensors`, `config.json`, `tokenizer_config.json`.
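
Outside the agent tooling, the same check could be scripted. The sketch below separates the pure check from the network call: `missing_artifacts` and `check_repo` are hypothetical helper names, while `list_repo_files` is the `huggingface_hub` function for listing the files in a repository. The expected file names are the ones listed above.

```python
def missing_artifacts(repo_files):
    """Return the expected artifacts that are absent from a list of repo file names."""
    files = set(repo_files)
    missing = []
    # Weights may be stored in either format, so one of the two is enough.
    if not files & {"pytorch_model.bin", "model.safetensors"}:
        missing.append("pytorch_model.bin or model.safetensors")
    for name in ("config.json", "tokenizer_config.json"):
        if name not in files:
            missing.append(name)
    return missing


def check_repo(repo_id):
    """Fetch the file list from the Hub (needs network access) and check it."""
    from huggingface_hub import list_repo_files  # imported lazily: optional dependency
    return missing_artifacts(list_repo_files(repo_id))


# Offline example with a hypothetical file listing.
example_listing = ["README.md", "train_gliner.py", "model.safetensors", "config.json"]
print(missing_artifacts(example_listing))
```

An empty list from `check_repo("rafmacalaba/gliner-large-data-mentions")` would indicate that all expected files were pushed.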

### Step 5 — Evaluate (post-training)
Run inference on test sentences and confirm entities are extracted with score > 0.4:
- "The Demographic and Health Survey collected data in 47 countries." → `[SURVEY] Demographic and Health Survey`
- "Geocoded DHS cluster coordinates were overlaid with flood maps." → `[GEOCODED_DATA]`, `[SURVEY]`
- "A survey from Ghana collected child nutrition data." → `[VAGUE_MENTION] A survey from Ghana`
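
The checks above can be automated with a small helper. This is a minimal sketch that assumes predictions arrive as the list of dicts shown in the usage example (`label`, `text`, `score`); the function names and the sample predictions are hypothetical and are not model output.

```python
def labels_found(entities, threshold=0.4):
    """Labels of all predicted entities whose score exceeds the threshold."""
    return {e["label"] for e in entities if e["score"] > threshold}


def passes(entities, expected_labels, threshold=0.4):
    """True if every expected label is among the confident predictions."""
    return set(expected_labels) <= labels_found(entities, threshold)


# Hand-written stand-in for model output (not real predictions).
fake_entities = [
    {"label": "GEOCODED_DATA", "text": "Geocoded DHS cluster coordinates", "score": 0.61},
    {"label": "SURVEY", "text": "DHS", "score": 0.35},
]

# SURVEY is below the 0.4 threshold, so the two-label check fails.
print(passes(fake_entities, ["GEOCODED_DATA", "SURVEY"]))
print(passes(fake_entities, ["GEOCODED_DATA"]))
```

In practice `fake_entities` would be replaced by the result of `model.predict_entities(...)` for each test sentence.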

### Step 6 — Iterate (if results are poor)
If entity scores are low (< 0.5) or entities are missed:
1. Add more training examples (especially `VAGUE_MENTION`, currently underrepresented at 18 spans)
2. Increase epochs to 40 (small dataset benefits from more epochs)
3. Consider lowering encoder LR to 5e-6 if loss is oscillating

---

## Key References

- GLiNER paper: [arXiv:2311.08526](https://arxiv.org/abs/2311.08526) (NAACL 2024)
- Dataset mention extraction: [arxiv:2502.10263](https://arxiv.org/abs/2502.10263) (World Bank, 2025)
- GSAP-NER (scholarly entity extraction): [arxiv:2311.09860](https://arxiv.org/abs/2311.09860)
- Coleridge Initiative "Show US the Data" (Kaggle competition): a public dataset of dataset mentions in research papers

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern