---
language: en
license: apache-2.0
tags:
- gliner
- named-entity-recognition
- token-classification
- data-mention-extraction
- survey-detection
- geocoded-data
- DHS
- census
- ml-intern
library_name: gliner
base_model: urchade/gliner_large-v2.1
---
# GLiNER Large: Data Mention Extraction
Fine-tuned from [`urchade/gliner_large-v2.1`](https://huggingface.co/urchade/gliner_large-v2.1) for extracting **data source mentions** in social science and global health research papers.
> **Status:** Training job pending. See `train_gliner.py` for the complete self-contained training script.
---
## Entity Types
| Label | Description | Examples |
|---|---|---|
| `SURVEY` | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer |
| `DATASET` | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset |
| `DATABASE` | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3 |
| `GEOCODED_DATA` | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data |
| `VAGUE_MENTION` | Vague/informal references | "a survey from Ghana", "a nationally representative household survey" |
---
## Usage (after training)
```python
from gliner import GLiNER
model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")
text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."
entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)
for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.2f})")
# Illustrative output (actual spans and scores depend on the trained model):
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.87)
# [SURVEY] '2018 DHS' (score=0.92)
# [DATABASE] 'WorldPop gridded population database' (score=0.83)
```
---
## Training Data
- **121 manually annotated sentences** / **175 entity spans**
- **Domain:** Social science, global health, demographic surveys, GIS/spatial data
- **No public dataset available:** custom-built for this task
- All data is embedded inline in `train_gliner.py` for portability
Entity distribution:
```
SURVEY : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE : 38 spans (21.7%)
DATASET : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
```
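The percentages above follow directly from the span counts. A minimal sketch that recomputes them, useful for re-checking the balance after adding examples (the variable and function names are illustrative and are not taken from `train_gliner.py`):

```python
# Span counts per label, copied from the distribution above.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}


def label_shares(counts):
    """Return each label's share of all spans, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_spans = sum(span_counts.values())
shares = label_shares(span_counts)

for label, n in span_counts.items():
    print(f"{label:<14}: {n:>3} spans ({shares[label]:.1f}%)")
print(f"total: {total_spans} spans")
```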
---
## Training Configuration
Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arXiv:2311.08526):
| Parameter | Value |
|---|---|
| Base model | `urchade/gliner_large-v2.1` (DeBERTa-v3-large) |
| Epochs | 25 |
| Batch size | 8 |
| Encoder LR | 1e-5 |
| Head LR (others_lr) | 5e-5 |
| LR scheduler | cosine |
| Warmup | 8% of total steps |
| Loss | sum reduction, focal_loss_alpha=-1 |
| Negatives ratio | 1.0 |
| Masking | global |
| FP16 | True (GPU) |
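
As a rough illustration, the table can be written as a plain dictionary of keyword arguments. The key names below follow the usual Hugging Face / GLiNER `TrainingArguments` conventions, but the mapping is an assumption and may not match what `train_gliner.py` actually passes:

```python
# Hypothetical mapping of the configuration table onto keyword names.
# This dict is an illustration, not the contents of train_gliner.py.
training_config = {
    "num_train_epochs": 25,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,          # encoder LR (assumed to map to learning_rate)
    "others_lr": 5e-5,              # head LR, applied to non-encoder parameters
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.08,           # 8% of total steps
    "loss_reduction": "sum",
    "focal_loss_alpha": -1,
    "negatives": 1.0,               # negatives ratio
    "masking": "global",
    "fp16": True,                   # only meaningful on a GPU
}

for name, value in training_config.items():
    print(f"{name} = {value!r}")
```

Keeping the values in one place like this makes it easier to change a single setting (for example the epoch count) without touching the rest of the script.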
---
## How to Run Training
### Option A: HF Jobs (recommended)
```python
from hf_jobs import run
run(
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    },
)
```
Or via the ML Intern agent tool:
```
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions"
)
```
### Option B: Local GPU
```bash
pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test
```
---
## Agent Execution Plan
For an autonomous agent completing this task, follow these steps **in order**:
### Step 1: Verify prerequisites ✅ (already done)
- [x] `gliner==0.2.26` API validated: `Trainer`, `TrainingArguments`, `model._create_data_collator()`
- [x] `TrainingArguments` has `others_lr`, `loss_reduction`, `negatives`, `masking`, `focal_loss_*` params
- [x] `GLiNER.push_to_hub()` confirmed available
- [x] `report_to="trackio"` natively supported in `TrainingArguments`
- [x] Training data: 121 examples, 175 spans, 0 validation errors
- [x] Hub repo created: `rafmacalaba/gliner-large-data-mentions`
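
The "0 validation errors" check can be approximated with a small validator. This is a sketch that assumes the examples use GLiNER's token-level format (`tokenized_text` plus `ner` spans given as inclusive `[start, end, label]` token indices); the function name and the sample records are hypothetical, and the actual checks in `train_gliner.py` may differ:

```python
LABELS = {"SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"}


def validate_example(example):
    """Return a list of human-readable problems found in one training example."""
    errors = []
    tokens = example.get("tokenized_text", [])
    if not tokens:
        errors.append("empty tokenized_text")
    for span in example.get("ner", []):
        if len(span) != 3:
            errors.append(f"malformed span: {span!r}")
            continue
        start, end, label = span
        if label not in LABELS:
            errors.append(f"unknown label: {label!r}")
        # Token indices are assumed to be inclusive on both ends.
        if not (0 <= start <= end < len(tokens)):
            errors.append(f"span out of range: {span!r}")
    return errors


# Hypothetical records, for illustration only.
good = {
    "tokenized_text": ["We", "used", "the", "2018", "DHS", "."],
    "ner": [[3, 4, "SURVEY"]],
}
bad = {
    "tokenized_text": ["A", "survey"],
    "ner": [[0, 5, "SURVEY"], [0, 1, "PERSON"]],
}

print(validate_example(good))
print(validate_example(bad))
```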
### Step 2: Submit training job
```
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions"
)
```
Expected: the job starts, downloads `urchade/gliner_large-v2.1` (~900MB), and trains for 25 epochs (~103 training examples at batch size 8 gives 13 steps/epoch, 325 steps in total), taking roughly 20-40 min on a T4.
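
The step counts quoted above can be reproduced with a few lines of arithmetic. This sketch assumes no gradient accumulation, that the last partial batch of each epoch is kept, and that warmup is 8% of total steps as listed in the configuration table; the rounding of the warmup step count is an assumption:

```python
import math

train_examples = 103   # approximate training-set size quoted above
batch_size = 8
epochs = 25
warmup_ratio = 0.08    # "8% of total steps"

# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)

print(f"steps/epoch  = {steps_per_epoch}")
print(f"total steps  = {total_steps}")
print(f"warmup steps = {warmup_steps}")
```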
### Step 3: Monitor
```
hf_jobs(operation="logs", job_id="<job_id_from_step2>")
```
Look for `=== Starting training ===`, then a `{'loss': ..., 'epoch': ...}` line every 5 steps.
If the job runs out of memory (OOM), reduce `batch_size` to 4 and increase `gradient_accumulation_steps` to 2, which keeps the effective batch size at 8.
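
To check whether the loss is actually decreasing, the dict-style log lines can be pulled out of the raw log text. A minimal sketch, assuming each record sits on its own line exactly as shown above (real logs may carry progress-bar prefixes that need stripping first); the sample log and its numbers are made up for illustration:

```python
import ast


def parse_loss_lines(log_text):
    """Extract (epoch, loss) pairs from Trainer-style dict lines in a log."""
    points = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # not a plain Python literal, skip it
        if isinstance(record, dict) and "loss" in record and "epoch" in record:
            points.append((record["epoch"], record["loss"]))
    return points


# Hypothetical log excerpt, for illustration only.
sample_log = """
=== Starting training ===
{'loss': 41.2, 'epoch': 0.38}
some unrelated line
{'loss': 35.7, 'epoch': 0.77}
{'train_runtime': 900.0, 'epoch': 25.0}
"""

points = parse_loss_lines(sample_log)
print(points)
```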
### Step 4: Verify Hub push
```
hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")
```
Expect: model weights (`pytorch_model.bin` or `model.safetensors`), a config file (`config.json`; GLiNER releases typically name it `gliner_config.json`), and `tokenizer_config.json`.
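
Given the list of file names returned for the repo, the check can be automated. This sketch works on a plain list of strings (for example, the result of `huggingface_hub.list_repo_files`); the helper name is hypothetical, and accepting `gliner_config.json` as an alternative config name is an assumption:

```python
def missing_artifacts(repo_files):
    """Return the expected artifact groups that have no matching file."""
    expected = {
        "weights": {"pytorch_model.bin", "model.safetensors"},
        # Assumption: GLiNER may save its config as gliner_config.json.
        "config": {"config.json", "gliner_config.json"},
        "tokenizer": {"tokenizer_config.json"},
    }
    present = set(repo_files)
    return [name for name, options in expected.items() if not (options & present)]


# Hypothetical listings, for illustration only.
after_push = ["README.md", "pytorch_model.bin", "gliner_config.json", "tokenizer_config.json"]
before_push = ["README.md", "train_gliner.py"]

print(missing_artifacts(after_push))
print(missing_artifacts(before_push))
```

An empty list means every expected artifact group is present.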
### Step 5: Evaluate (post-training)
Run inference on test sentences and confirm entities are extracted with score > 0.4:
- "The Demographic and Health Survey collected data in 47 countries." → `[SURVEY] Demographic and Health Survey`
- "Geocoded DHS cluster coordinates were overlaid with flood maps." → `[GEOCODED_DATA]`, `[SURVEY]`
- "A survey from Ghana collected child nutrition data." → `[VAGUE_MENTION] A survey from Ghana`
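
The pass criterion above can be expressed as a small check over the list of dicts returned by `predict_entities`. A sketch with hypothetical helper names and mock predictions (the scores are invented, since the model is not trained yet):

```python
THRESHOLD = 0.4


def labels_found(entities, threshold=THRESHOLD):
    """Labels of predicted entities whose score exceeds the threshold."""
    return {e["label"] for e in entities if e["score"] > threshold}


def check_case(entities, expected_labels, threshold=THRESHOLD):
    """True if every expected label appears among the confident predictions."""
    return set(expected_labels) <= labels_found(entities, threshold)


# Mock predictions for the second test sentence, for illustration only.
mock_entities = [
    {"text": "Geocoded DHS cluster coordinates", "label": "GEOCODED_DATA", "score": 0.81},
    {"text": "DHS", "label": "SURVEY", "score": 0.35},
]

print(check_case(mock_entities, ["GEOCODED_DATA"]))
print(check_case(mock_entities, ["GEOCODED_DATA", "SURVEY"]))
```

In this mock case the `SURVEY` prediction falls below the threshold, so the second check fails, which is the kind of result that would trigger Step 6.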
### Step 6: Iterate (if results are poor)
If entity scores are low (< 0.5) or entities are missed:
1. Add more training examples, especially for `VAGUE_MENTION`, which is currently underrepresented at 18 spans
2. Increase epochs to 40 (small datasets often benefit from more epochs)
3. Consider lowering encoder LR to 5e-6 if loss is oscillating
---
## Key References
- GLiNER paper: [arxiv:2311.08526](https://arxiv.org/abs/2311.08526) (NAACL 2024)
- Dataset mention extraction: [arxiv:2502.10263](https://arxiv.org/abs/2502.10263) (World Bank, 2025)
- GSAP-NER (scholarly entity extraction): [arxiv:2311.09860](https://arxiv.org/abs/2311.09860)
- Coleridge "Show US the Data" Kaggle: best public dataset for dataset mentions in social science papers
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern