language: en
license: apache-2.0
tags:
- gliner
- named-entity-recognition
- token-classification
- data-mention-extraction
- survey-detection
- geocoded-data
- DHS
- census
- ml-intern
library_name: gliner
base_model: urchade/gliner_large-v2.1
GLiNER Large: Data Mention Extraction
Fine-tuned from urchade/gliner_large-v2.1 for extracting data source mentions in social science and global health research papers.
Status: Training job pending. See `train_gliner.py` for the complete self-contained training script.
Entity Types
| Label | Description | Examples |
|---|---|---|
| `SURVEY` | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer |
| `DATASET` | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset |
| `DATABASE` | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3 |
| `GEOCODED_DATA` | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data |
| `VAGUE_MENTION` | Vague/informal references | "a survey from Ghana", "a nationally representative household survey" |
Usage (after training)
from gliner import GLiNER

# Load the fine-tuned checkpoint from the Hub (available once training has finished).
model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."

entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)

for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.3f})")

# Illustrative output (actual scores will differ):
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.870)
# [SURVEY] '2018 DHS' (score=0.920)
# [DATABASE] 'WorldPop gridded population database' (score=0.830)
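Downstream code often needs the predictions grouped by entity type rather than as a flat list. The following is a minimal sketch, assuming each prediction is a dict with `text`, `label`, and `score` keys as in the usage example; `group_by_label` and the sample predictions are hypothetical and not part of the model repository.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.4):
    """Group entity dicts by label, keeping those scoring at least min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hypothetical predictions, shaped like the illustrative output above.
sample = [
    {"text": "GPS-tagged household locations", "label": "GEOCODED_DATA", "score": 0.87},
    {"text": "2018 DHS", "label": "SURVEY", "score": 0.92},
    {"text": "some survey", "label": "VAGUE_MENTION", "score": 0.21},
]

print(group_by_label(sample))
```

The low-scoring `VAGUE_MENTION` entry is dropped because it falls below `min_score`.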
Training Data
- 121 manually annotated sentences / 175 entity spans
- Domain: Social science, global health, demographic surveys, GIS/spatial data
- No public dataset available; custom-built for this task
- All data is embedded inline in `train_gliner.py` for portability
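A small sanity check for hand-annotated examples might look like the sketch below. It assumes the common GLiNER training layout of a `tokenized_text` list plus `ner` spans given as `[start_token, end_token, label]` with an inclusive end index; the actual layout used inside `train_gliner.py` is not shown here, so treat both the format and the `validate_example` helper as assumptions.

```python
# Label set taken from the Entity Types table.
LABELS = {"SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"}


def validate_example(example):
    """Return a list of problems found in one training example (empty if valid)."""
    errors = []
    n_tokens = len(example["tokenized_text"])
    for start, end, label in example["ner"]:
        if label not in LABELS:
            errors.append(f"unknown label: {label}")
        # Spans are assumed to use inclusive token indices.
        if not (0 <= start <= end < n_tokens):
            errors.append(f"span out of range: ({start}, {end})")
    return errors


# Hypothetical example, not taken from the real training set.
example = {
    "tokenized_text": ["We", "used", "the", "2018", "DHS", "."],
    "ner": [[3, 4, "SURVEY"]],
}

print(validate_example(example))
```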
Entity distribution:
SURVEY : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE : 38 spans (21.7%)
DATASET : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
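The percentages above follow directly from the span counts; a short check, using only the numbers stated in this section:

```python
# Span counts as listed in the entity distribution.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}

total = sum(span_counts.values())
shares = {label: round(100 * n / total, 1) for label, n in span_counts.items()}

for label, pct in shares.items():
    print(f"{label:<14}: {span_counts[label]:>3} spans ({pct}%)")
```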
Training Configuration
Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arxiv:2311.08526):
| Parameter | Value |
|---|---|
| Base model | urchade/gliner_large-v2.1 (DeBERTa-v3-large) |
| Epochs | 25 |
| Batch size | 8 |
| Encoder LR | 1e-5 |
| Head LR (others_lr) | 5e-5 |
| LR scheduler | cosine |
| Warmup | 8% of total steps |
| Loss | sum reduction, focal_loss_alpha=-1 |
| Negatives ratio | 1.0 |
| Masking | global |
| FP16 | True (GPU) |
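To make the scheduler rows concrete, here is a sketch of a cosine schedule with linear warmup, applied to the encoder learning rate. It assumes the 325 total steps quoted in the execution plan and uses the standard warmup-then-cosine formula; it is an illustration, not the scheduler code the training script actually calls, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.08):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 325  # assumed from the execution plan
for step in (0, 13, 26, 175, 325):
    print(step, lr_at_step(step, TOTAL_STEPS, base_lr=1e-5))
```

With these settings the warmup lasts 26 steps (8% of 325), after which the rate peaks at 1e-5 and decays toward zero.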
How to Run Training
Option A: HF Jobs (recommended)
from hf_jobs import run
run(
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    },
)
Or via the ML Intern agent tool:
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions",
)
Option B: Local GPU
pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test
Agent Execution Plan
For an autonomous agent completing this task, follow these steps in order:
Step 1: Verify prerequisites (already done)
- `gliner==0.2.26` API validated: `Trainer`, `TrainingArguments`, `model._create_data_collator()`
- `TrainingArguments` has `others_lr`, `loss_reduction`, `negatives`, `masking`, `focal_loss_*` params
- `GLiNER.push_to_hub()` confirmed available
- `report_to="trackio"` natively supported in `TrainingArguments`
- Training data: 121 examples, 175 spans, 0 validation errors
- Hub repo created: `rafmacalaba/gliner-large-data-mentions`
Step 2: Submit training job
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions",
)
Expected: the job starts, downloads urchade/gliner_large-v2.1 (900 MB), and trains for 25 epochs. With 103 training examples and a batch size of 8, that is 13 steps per epoch, or 325 steps in total, taking roughly 20-40 minutes on a T4.
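The step count can be reproduced with a few lines of arithmetic. This sketch assumes no gradient accumulation and that the final partial batch of each epoch is kept rather than dropped:

```python
import math

train_examples = 103  # examples used for training, as stated above
batch_size = 8
epochs = 25

# The last batch of each epoch is smaller (103 = 12 * 8 + 7), so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```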
Step 3: Monitor
hf_jobs(operation="logs", job_id="<job_id_from_step2>")
Look for: === Starting training ===, then {'loss': ..., 'epoch': ...} every 5 steps.
If the job runs out of memory (OOM), reduce batch_size to 4 and increase gradient_accumulation_steps to 2, which keeps the effective batch size at 8.
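If the logs are fetched as plain text, the loss records can be pulled out for a quick trend check. The sketch below assumes each record is printed on its own line as a Python dict literal, as in the pattern described above; `parse_loss_lines` and the sample log values are hypothetical.

```python
import ast


def parse_loss_lines(log_text):
    """Extract dict records that contain a 'loss' key from raw log text."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)  # safe parsing of literals only
        except (ValueError, SyntaxError):
            continue
        if isinstance(record, dict) and "loss" in record:
            records.append(record)
    return records


# Made-up log excerpt for illustration only.
sample_log = """=== Starting training ===
{'loss': 41.2, 'epoch': 0.38}
{'loss': 30.5, 'epoch': 0.77}
some unrelated line
"""

losses = [r["loss"] for r in parse_loss_lines(sample_log)]
print(losses)
```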
Step 4: Verify Hub push
hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")
Expect: pytorch_model.bin or model.safetensors, config.json, tokenizer_config.json.
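Given a list of file names from the repository, the check above can be automated. The helper below is a hypothetical sketch that uses the expected file names exactly as listed in this step; outside the agent tooling, the list could come from something like `huggingface_hub.list_repo_files`.

```python
def check_hub_files(files):
    """Return the expected artifacts that are missing from `files`."""
    names = set(files)
    missing = []
    # Either weight format is acceptable.
    if not ({"pytorch_model.bin", "model.safetensors"} & names):
        missing.append("pytorch_model.bin or model.safetensors")
    for required in ("config.json", "tokenizer_config.json"):
        if required not in names:
            missing.append(required)
    return missing


# Hypothetical listing of an incomplete push.
print(check_hub_files(["README.md", "config.json"]))
```

An empty return value means every expected artifact is present.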
Step 5: Evaluate (post-training)
Run inference on test sentences and confirm entities are extracted with score > 0.4:
- "The Demographic and Health Survey collected data in 47 countries." → [SURVEY] Demographic and Health Survey
- "Geocoded DHS cluster coordinates were overlaid with flood maps." → [GEOCODED_DATA], [SURVEY]
- "A survey from Ghana collected child nutrition data." → [VAGUE_MENTION] A survey from Ghana
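The check can be phrased as a small function that compares predicted labels against the expected ones for each test sentence. This is a sketch over prediction dicts with `label` and `score` keys; `missing_labels` and the sample predictions are hypothetical, not real model output.

```python
def missing_labels(entities, expected_labels, min_score=0.4):
    """Return expected labels with no prediction scoring above min_score."""
    found = {e["label"] for e in entities if e["score"] > min_score}
    return sorted(set(expected_labels) - found)


# Made-up predictions for the second test sentence.
predictions = [
    {"text": "Geocoded DHS cluster coordinates", "label": "GEOCODED_DATA", "score": 0.81},
    {"text": "DHS", "label": "SURVEY", "score": 0.35},
]

print(missing_labels(predictions, ["GEOCODED_DATA", "SURVEY"]))
```

Here `SURVEY` is reported as missing because its only prediction scores below the 0.4 cut-off.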
Step 6: Iterate (if results are poor)
If entity scores are low (< 0.5) or entities are missed:
- Add more training examples (especially VAGUE_MENTION, currently underrepresented at 18 spans)
- Increase epochs to 40 (small dataset benefits from more epochs)
- Consider lowering encoder LR to 5e-6 if loss is oscillating
Key References
- GLiNER paper: arxiv:2311.08526 (NAACL 2024)
- Dataset mention extraction: arxiv:2502.10263 (World Bank, 2025)
- GSAP-NER (scholarly entity extraction): arxiv:2311.09860
- Coleridge "Show US the Data" Kaggle: best public dataset for dataset mentions in social science papers
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern