GLiNER Large: Data Mention Extraction

Fine-tuned from urchade/gliner_large-v2.1 for extracting data source mentions in social science and global health research papers.

Status: Training job pending. See train_gliner.py for the complete self-contained training script.


Entity Types

Label | Description | Examples
SURVEY | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer
DATASET | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset
DATABASE | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3
GEOCODED_DATA | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data
VAGUE_MENTION | Vague/informal references | "a survey from Ghana", "a nationally representative household survey"

Usage (after training)

from gliner import GLiNER

model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."

entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)

for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.3f})")
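# Illustrative output only: the model is not trained yet, so actual spans and scores will differ.
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.870)
# [SURVEY] '2018 DHS' (score=0.920)
# [DATABASE] 'WorldPop gridded population database' (score=0.830)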

Training Data

  • 121 manually annotated sentences / 175 entity spans
  • Domain: Social science, global health, demographic surveys, GIS/spatial data
  • No public dataset available; custom-built for this task
  • All data is embedded inline in train_gliner.py for portability

Entity distribution:

SURVEY        : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE      : 38 spans (21.7%)
DATASET       : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
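
The percentages above follow directly from the span counts. A minimal sketch that recomputes them, useful for re-checking the balance after adding examples (the helper name `label_shares` is hypothetical and not part of train_gliner.py):

```python
# Span counts copied from the distribution listed above.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}


def label_shares(counts):
    """Return each label's share of all spans as a percentage (one decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_spans = sum(span_counts.values())
shares = label_shares(span_counts)

for label, n in span_counts.items():
    print(f"{label:<14}: {n} spans ({shares[label]}%)")
```

The counts sum to the 175 spans reported under Training Data.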

Training Configuration

Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arXiv:2311.08526):

Parameter | Value
Base model | urchade/gliner_large-v2.1 (DeBERTa-v3-large)
Epochs | 25
Batch size | 8
Encoder LR | 1e-5
Head LR (others_lr) | 5e-5
LR scheduler | cosine
Warmup | 8% of total steps
Loss | sum reduction, focal_loss_alpha=-1
Negatives ratio | 1.0
Masking | global
FP16 | True (GPU)
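
As a rough sketch, the table could map onto keyword arguments like the ones below. The key names are an assumption: they combine the standard Hugging Face TrainingArguments names with the GLiNER-specific parameters mentioned in this card (others_lr, loss_reduction, negatives, masking, focal_loss_alpha). The actual values live in train_gliner.py, which may name or group them differently.

```python
# Assumed mapping from the configuration table to keyword arguments.
# This dict is illustrative; it does not import or call the gliner library.
training_kwargs = {
    "num_train_epochs": 25,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,          # encoder LR
    "others_lr": 5e-5,              # LR for the non-encoder (head) parameters
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.08,           # 8% of total steps
    "loss_reduction": "sum",
    "focal_loss_alpha": -1,
    "negatives": 1.0,               # negatives ratio
    "masking": "global",
    "fp16": True,                   # the table enables this on GPU only
}

for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```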

How to Run Training

Option A: HF Jobs (recommended)

from hf_jobs import run

run(
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    }
)

Or via the ML Intern agent tool:

hf_jobs(
  operation="run",
  script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
  dependencies=["gliner==0.2.26", "torch", "trackio"],
  hardware_flavor="t4-small",
  timeout="3h",
  trackio_space_id="rafmacalaba/ml-intern-gliner01",
  trackio_project="gliner-data-mentions"
)

Option B: Local GPU

pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test

Agent Execution Plan

For an autonomous agent completing this task, follow these steps in order:

Step 1: Verify prerequisites ✅ (already done)

  • gliner==0.2.26 API validated: Trainer, TrainingArguments, model._create_data_collator()
  • TrainingArguments has others_lr, loss_reduction, negatives, masking, focal_loss_* params
  • GLiNER.push_to_hub() confirmed available
  • report_to="trackio" natively supported in TrainingArguments
  • Training data: 121 examples, 175 spans, 0 validation errors
  • Hub repo created: rafmacalaba/gliner-large-data-mentions

Step 2: Submit training job

hf_jobs(
  operation="run",
  script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
  dependencies=["gliner==0.2.26", "torch", "trackio"],
  hardware_flavor="t4-small",
  timeout="3h",
  trackio_space_id="rafmacalaba/ml-intern-gliner01",
  trackio_project="gliner-data-mentions"
)

Expected: the job starts, downloads urchade/gliner_large-v2.1 (about 900 MB), and trains for 25 epochs. With 103 examples at batch size 8, that is 13 steps per epoch, or 325 steps in total, which should take roughly 20-40 minutes on a T4.
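
The step counts can be re-derived with a few lines of arithmetic. This sketch assumes no gradient accumulation, that the final partial batch of each epoch is kept, and that warmup steps are rounded down; the script's scheduler may round differently.

```python
import math

num_examples = 103      # examples seen per epoch, as stated above
batch_size = 8
epochs = 25
warmup_percent = 8      # "8% of total steps" from the training configuration

# The last batch of each epoch is smaller (103 = 12 * 8 + 7), so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Integer arithmetic avoids floating-point surprises in the percentage.
warmup_steps = total_steps * warmup_percent // 100

print(f"steps/epoch={steps_per_epoch}, total={total_steps}, warmup={warmup_steps}")
```

Under these assumptions the run has 13 steps per epoch, 325 steps in total, and 26 warmup steps.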

Step 3: Monitor

hf_jobs(operation="logs", job_id="<job_id_from_step2>")

Look for === Starting training ===, followed by a {'loss': ..., 'epoch': ...} line every 5 steps. If the job runs out of memory (OOM), reduce batch_size to 4 and increase gradient_accumulation_steps to 2, which keeps the effective batch size at 8.
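
If the logs are saved as text, the metric lines can be pulled out programmatically. The sketch below assumes each metric line is a Python dict literal on its own line, as in the pattern above. The helper name `parse_metrics` and the sample log values are made up for illustration and are not output from a real run.

```python
import ast


def parse_metrics(log_text):
    """Collect dict-shaped log lines that contain a 'loss' entry."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)  # safe parsing of literals only
        except (ValueError, SyntaxError):
            continue  # skip lines that merely look like dicts
        if isinstance(record, dict) and "loss" in record:
            records.append(record)
    return records


# Hypothetical log excerpt; the numbers are placeholders.
sample_log = """\
=== Starting training ===
{'loss': 41.2, 'epoch': 0.38}
some unrelated line
{'loss': 35.7, 'epoch': 0.77}
{'train_runtime': 1200.0, 'epoch': 25.0}
"""

records = parse_metrics(sample_log)
losses = [r["loss"] for r in records]
print(losses)
```

A loss list that stops decreasing or swings up and down is the signal to revisit the learning rate.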

Step 4: Verify Hub push

hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")

Expect: pytorch_model.bin or model.safetensors, config.json, tokenizer_config.json.

Step 5: Evaluate (post-training)

Run inference on test sentences and confirm entities are extracted with score > 0.4:

  • "The Demographic and Health Survey collected data in 47 countries." → [SURVEY] Demographic and Health Survey
  • "Geocoded DHS cluster coordinates were overlaid with flood maps." → [GEOCODED_DATA], [SURVEY]
  • "A survey from Ghana collected child nutrition data." → [VAGUE_MENTION] A survey from Ghana
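
The checks above can be automated with a small helper that works on the list of dicts returned by predict_entities (keys label, text, score, as in the Usage example). This is a sketch: `missing_labels` is a hypothetical name, and the mocked predictions stand in for real model output, which does not exist until training finishes.

```python
def missing_labels(entities, expected_labels, threshold=0.4):
    """Return the expected labels that have no prediction scoring above threshold."""
    found = {e["label"] for e in entities if e["score"] > threshold}
    return sorted(set(expected_labels) - found)


# Mocked predictions for the second test sentence; scores are placeholders.
mock_entities = [
    {"label": "GEOCODED_DATA", "text": "Geocoded DHS cluster coordinates", "score": 0.81},
    {"label": "SURVEY", "text": "DHS", "score": 0.35},  # below the 0.4 cutoff
]

all_expected = missing_labels(mock_entities, ["GEOCODED_DATA", "SURVEY"])
geo_only = missing_labels(mock_entities, ["GEOCODED_DATA"])

print("missing:", all_expected)
print("missing:", geo_only)
```

An empty list means every expected label was found above the threshold. With a trained model, `mock_entities` would be replaced by the result of `model.predict_entities(...)` for each sentence.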

Step 6: Iterate (if results are poor)

If entity scores are low (< 0.5) or entities are missed:

  1. Add more training examples, especially for VAGUE_MENTION, which is currently underrepresented at 18 spans
  2. Increase epochs to 40 (a dataset this small may benefit from more epochs)
  3. Consider lowering encoder LR to 5e-6 if loss is oscillating

Key References

  • GLiNER paper: arXiv:2311.08526 (NAACL 2024)
  • Dataset mention extraction: arxiv:2502.10263 (World Bank, 2025)
  • GSAP-NER (scholarly entity extraction): arxiv:2311.09860
  • Coleridge Initiative "Show US the Data" (Kaggle competition): the closest public dataset for dataset mentions in scientific papers

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
