language: en
license: apache-2.0
tags:
  - gliner
  - named-entity-recognition
  - token-classification
  - data-mention-extraction
  - survey-detection
  - geocoded-data
  - DHS
  - census
  - ml-intern
library_name: gliner
base_model: urchade/gliner_large-v2.1

GLiNER Large — Data Mention Extraction

Fine-tuned from urchade/gliner_large-v2.1 for extracting data source mentions in social science and global health research papers.

Status: Training job pending. See train_gliner.py for the complete self-contained training script.

Entity Types

Label	Description	Examples
`SURVEY`	Named survey programs	Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer
`DATASET`	Specific named datasets	Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset
`DATABASE`	Named databases/repositories	World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3
`GEOCODED_DATA`	Geocoded/spatial data mentions	GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data
`VAGUE_MENTION`	Vague/informal references	"a survey from Ghana", "a nationally representative household survey"

Usage (after training)

from gliner import GLiNER

model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."

entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)

for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.3f})")
# Illustrative output only (the model is not trained yet; real scores will differ):
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.870)
# [SURVEY] '2018 DHS' (score=0.920)
# [DATABASE] 'WorldPop gridded population database' (score=0.830)

Training Data

121 manually annotated sentences / 175 entity spans
Domain: Social science, global health, demographic surveys, GIS/spatial data
No suitable public dataset was available; the data was custom-built for this task
All data is embedded inline in train_gliner.py for portability

Entity distribution:

SURVEY        : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE      : 38 spans (21.7%)
DATASET       : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
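
The percentages follow directly from the span counts. As a minimal sketch (counts copied from the list above; the variable names are illustrative and not taken from train_gliner.py), they can be recomputed like this:

```python
# Span counts copied from the entity distribution above.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}

total_spans = sum(span_counts.values())
# Share of each label as a percentage of all annotated spans.
shares = {label: 100 * n / total_spans for label, n in span_counts.items()}

for label, n in span_counts.items():
    print(f"{label:<14}: {n} spans ({shares[label]:.1f}%)")
```

The total comes to 175 spans, which matches the figure quoted under Training Data.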

Training Configuration

Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arxiv:2311.08526):

Parameter	Value
Base model	`urchade/gliner_large-v2.1` (DeBERTa-v3-large)
Epochs	25
Batch size	8
Encoder LR	1e-5
Head LR (others_lr)	5e-5
LR scheduler	cosine
Warmup	8% of total steps
Loss	sum reduction, focal_loss_alpha=-1
Negatives ratio	1.0
Masking	global
FP16	True (GPU)
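
For orientation, the table can be written down as a plain dictionary of keyword arguments. This is a sketch only: the GLiNER-specific names (others_lr, loss_reduction, negatives, masking, focal_loss_alpha) are the ones listed in this card, the remaining keys follow the usual Hugging Face TrainingArguments naming, and whether train_gliner.py uses exactly these keys is an assumption.

```python
# Hypothetical mapping of the configuration table to keyword arguments.
# Nothing here imports gliner; it only records the values from the table.
training_config = {
    "num_train_epochs": 25,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,          # encoder LR
    "others_lr": 5e-5,              # head LR
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.08,           # 8% of total steps
    "loss_reduction": "sum",
    "focal_loss_alpha": -1,         # value from the table
    "negatives": 1.0,
    "masking": "global",
    "fp16": True,                   # only meaningful on a GPU
}

# The head is trained with a larger learning rate than the encoder.
lr_ratio = training_config["others_lr"] / training_config["learning_rate"]
print(f"Head LR is {lr_ratio:.0f}x the encoder LR")
```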

How to Run Training

Option A: HF Jobs (recommended)

from hf_jobs import run

run(
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    }
)

Or via the ML Intern agent tool:

hf_jobs(
  operation="run",
  script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
  dependencies=["gliner==0.2.26", "torch", "trackio"],
  hardware_flavor="t4-small",
  timeout="3h",
  trackio_space_id="rafmacalaba/ml-intern-gliner01",
  trackio_project="gliner-data-mentions"
)

Option B: Local GPU

pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test

Agent Execution Plan

An autonomous agent completing this task should follow these steps in order:

Step 1 — Verify prerequisites ✅ (already done)

gliner==0.2.26 API validated: Trainer, TrainingArguments, model._create_data_collator()
TrainingArguments has others_lr, loss_reduction, negatives, masking, focal_loss_* params
GLiNER.push_to_hub() confirmed available
report_to="trackio" natively supported in TrainingArguments
Training data: 121 examples, 175 spans, 0 validation errors
Hub repo created: rafmacalaba/gliner-large-data-mentions

Step 2 — Submit training job

hf_jobs(
  operation="run",
  script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
  dependencies=["gliner==0.2.26", "torch", "trackio"],
  hardware_flavor="t4-small",
  timeout="3h",
  trackio_space_id="rafmacalaba/ml-intern-gliner01",
  trackio_project="gliner-data-mentions"
)

Expected: the job starts, downloads urchade/gliner_large-v2.1 (~900MB), and trains for 25 epochs (~103 training examples at batch size 8 gives 13 steps/epoch, 325 total steps), taking ~20-40 min on a T4.
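
The step counts above can be checked with a few lines of arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the 8% warmup is rounded to the nearest step; the variable names are illustrative.

```python
import math

train_examples = 103   # approximate training split quoted above
batch_size = 8
epochs = 25
warmup_ratio = 0.08    # "8% of total steps" from the configuration table

# The final, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 13 325 26
```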

Step 3 — Monitor

hf_jobs(operation="logs", job_id="<job_id_from_step2>")

Look for === Starting training ===, followed by a {'loss': ..., 'epoch': ...} line every 5 steps. If the job runs out of memory (OOM), reduce batch_size to 4 and increase gradient_accumulation_steps to 2, which keeps the effective batch size at 8.
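
If the logs are fetched as plain text, the loss lines can be pulled out for a quick trend check. This is a minimal sketch that assumes each metric line is a standalone Python-style dict as shown above; parse_loss_lines is a hypothetical helper, and the sample log is made up for illustration.

```python
import ast

def parse_loss_lines(log_text):
    """Return the dict-style metric lines that contain a 'loss' key."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # not a plain Python literal, so skip it
        if isinstance(record, dict) and "loss" in record:
            records.append(record)
    return records

# Made-up log excerpt for illustration only; real values will differ.
sample_log = """=== Starting training ===
{'loss': 12.5, 'epoch': 0.38}
some other message
{'loss': 9.75, 'epoch': 0.77}
"""

records = parse_loss_lines(sample_log)
print([r["loss"] for r in records])  # [12.5, 9.75]
```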

Step 4 — Verify Hub push

hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")

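Expect: the model weights (pytorch_model.bin or model.safetensors), a config file (GLiNER checkpoints typically name it gliner_config.json rather than config.json), and tokenizer files such as tokenizer_config.json.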

Step 5 — Evaluate (post-training)

Run inference on test sentences and confirm entities are extracted with score > 0.4:

"The Demographic and Health Survey collected data in 47 countries." → [SURVEY] Demographic and Health Survey
"Geocoded DHS cluster coordinates were overlaid with flood maps." → [GEOCODED_DATA], [SURVEY]
"A survey from Ghana collected child nutrition data." → [VAGUE_MENTION] A survey from Ghana
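
The check above can be automated with a small helper. The sketch below works on the list of dicts returned by model.predict_entities (keys label, text, score, as used in the usage example); missing_expectations is a hypothetical helper, and the entities shown are hand-written stand-ins with made-up scores, not real model output.

```python
def missing_expectations(entities, expected, min_score=0.4):
    """Return the (label, text) pairs from `expected` that were not predicted
    with a score above `min_score`. A text of None matches any span."""
    missing = []
    for label, text in expected:
        found = any(
            e["label"] == label
            and e["score"] > min_score
            and (text is None or e["text"] == text)
            for e in entities
        )
        if not found:
            missing.append((label, text))
    return missing

# Hand-written stand-in for predict_entities output; scores are made up.
fake_entities = [
    {"label": "SURVEY", "text": "Demographic and Health Survey", "score": 0.91},
    {"label": "VAGUE_MENTION", "text": "A survey from Ghana", "score": 0.35},
]
expected = [
    ("SURVEY", "Demographic and Health Survey"),
    ("VAGUE_MENTION", "A survey from Ghana"),
]

# The second entity is below the 0.4 threshold, so it is reported as missing.
print(missing_expectations(fake_entities, expected))
```

An empty list means every expected entity was found above the threshold.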

Step 6 — Iterate (if results are poor)

If entity scores are low (< 0.5) or entities are missed:

Add more training examples, especially for VAGUE_MENTION, which is underrepresented at 18 spans (10.3%)
Increase epochs to 40 (small datasets often benefit from more epochs)
Consider lowering encoder LR to 5e-6 if loss is oscillating

Key References

GLiNER paper: arxiv:2311.08526 (NAACL 2024)
Dataset mention extraction: arxiv:2502.10263 (World Bank, 2025)
GSAP-NER (scholarly entity extraction): arxiv:2311.09860
Coleridge "Show US the Data" Kaggle: best public dataset for dataset mentions in social science papers

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern