---
language: en
license: apache-2.0
tags:
- gliner
- named-entity-recognition
- token-classification
- data-mention-extraction
- survey-detection
- geocoded-data
- DHS
- census
- ml-intern
library_name: gliner
base_model: urchade/gliner_large-v2.1
---

# GLiNER Large — Data Mention Extraction

Fine-tuned from [`urchade/gliner_large-v2.1`](https://huggingface.co/urchade/gliner_large-v2.1) for extracting **data source mentions** in social science and global health research papers.

> **Status:** Training job pending. See `train_gliner.py` for the complete self-contained training script.

---

## Entity Types

| Label | Description | Examples |
|---|---|---|
| `SURVEY` | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer |
| `DATASET` | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset |
| `DATABASE` | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3 |
| `GEOCODED_DATA` | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data |
| `VAGUE_MENTION` | Vague/informal references | "a survey from Ghana", "a nationally representative household survey" |

---

## Usage (after training)

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."
entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)

for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.2f})")

# Example output (illustrative):
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.87)
# [SURVEY] '2018 DHS' (score=0.92)
# [DATABASE] 'WorldPop gridded population database' (score=0.83)
```

---

## Training Data

- **121 manually annotated sentences** / **175 entity spans**
- **Domain:** Social science, global health, demographic surveys, GIS/spatial data
- **No public dataset available** — custom-built for this task
- All data is embedded inline in `train_gliner.py` for portability

Entity distribution:

```
SURVEY        : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE      : 38 spans (21.7%)
DATASET       : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
```

---

## Training Configuration

Based on the original GLiNER paper (AAAI 2024, arxiv:2311.08526) fine-tuning recipe:

| Parameter | Value |
|---|---|
| Base model | `urchade/gliner_large-v2.1` (DeBERTa-v3-large) |
| Epochs | 25 |
| Batch size | 8 |
| Encoder LR | 1e-5 |
| Head LR (others_lr) | 5e-5 |
| LR scheduler | cosine |
| Warmup | 8% of total steps |
| Loss | sum reduction, focal_loss_alpha=-1 |
| Negatives ratio | 1.0 |
| Masking | global |
| FP16 | True (GPU) |

---

## How to Run Training

### Option A: HF Jobs (recommended)

```python
from hf_jobs import run

run(
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    }
)
```

Or via the ML Intern agent tool:

```
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions"
)
```

### Option B: Local GPU

```bash
pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test
```

---

## Agent Execution Plan

For an autonomous agent completing this task, follow these steps **in order**:

### Step 1 — Verify prerequisites ✅ (already done)

- [x] `gliner==0.2.26` API validated: `Trainer`, `TrainingArguments`, `model._create_data_collator()`
- [x] `TrainingArguments` has `others_lr`, `loss_reduction`, `negatives`, `masking`, `focal_loss_*` params
- [x] `GLiNER.push_to_hub()` confirmed available
- [x] `report_to="trackio"` natively supported in `TrainingArguments`
- [x] Training data: 121 examples, 175 spans, 0 validation errors
- [x] Hub repo created: `rafmacalaba/gliner-large-data-mentions`

### Step 2 — Submit training job

```
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions"
)
```

Expected: job starts, downloads `urchade/gliner_large-v2.1` (~900MB), trains 25 epochs (~103 examples, 13 steps/epoch = 325 total steps), ~20-40 min on T4.

### Step 3 — Monitor

```
hf_jobs(operation="logs", job_id="")
```

Look for: `=== Starting training ===`, then `{'loss': ..., 'epoch': ...}` every 5 steps. Watch for OOM (reduce batch_size to 4, increase gradient_accumulation_steps to 2).

### Step 4 — Verify Hub push

```
hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")
```

Expect: `pytorch_model.bin` or `model.safetensors`, `config.json`, `tokenizer_config.json`.
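
Outside the agent tool, the Step 4 check could be scripted along the following lines. This is a minimal sketch, not part of `train_gliner.py`: the helper names `missing_artifacts` and `check_hub_push` are hypothetical, the expected file names are copied from Step 4, and the listing is assumed to come from `huggingface_hub.list_repo_files`, which needs network access and the `huggingface_hub` package.

```python
from typing import Iterable, List

# File names taken from Step 4; the weights may be saved in either format.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")
REQUIRED_FILES = ("config.json", "tokenizer_config.json")


def missing_artifacts(repo_files: Iterable[str]) -> List[str]:
    """Return the expected artifacts that are absent from a repo file listing."""
    present = set(repo_files)
    missing = [name for name in REQUIRED_FILES if name not in present]
    # Only one of the two weight formats has to be present.
    if not any(name in present for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


def check_hub_push(repo_id: str) -> List[str]:
    """Hypothetical helper: list the repo on the Hub and report missing files."""
    # Imported lazily so the pure check above works without the package.
    from huggingface_hub import list_repo_files

    return missing_artifacts(list_repo_files(repo_id))
```

Calling `check_hub_push("rafmacalaba/gliner-large-data-mentions")` after the job finishes should return an empty list if the push succeeded; any names it returns are the files still missing from the repo.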

### Step 5 — Evaluate (post-training)

Run inference on test sentences and confirm entities are extracted with score > 0.4:

- "The Demographic and Health Survey collected data in 47 countries." → `[SURVEY] Demographic and Health Survey`
- "Geocoded DHS cluster coordinates were overlaid with flood maps." → `[GEOCODED_DATA]`, `[SURVEY]`
- "A survey from Ghana collected child nutrition data." → `[VAGUE_MENTION] A survey from Ghana`

### Step 6 — Iterate (if results are poor)

If entity scores are low (< 0.5) or entities are missed:

1. Add more training examples (especially `VAGUE_MENTION`, currently underrepresented at 18 spans)
2. Increase epochs to 40 (small dataset benefits from more epochs)
3. Consider lowering encoder LR to 5e-6 if loss is oscillating

---

## Key References

- GLiNER paper: [arxiv:2311.08526](https://arxiv.org/abs/2311.08526) (AAAI 2024)
- Dataset mention extraction: [arxiv:2502.10263](https://arxiv.org/abs/2502.10263) (World Bank, 2025)
- GSAP-NER (scholarly entity extraction): [arxiv:2311.09860](https://arxiv.org/abs/2311.09860)
- Coleridge "Show US the Data" Kaggle: best public dataset for dataset mentions in social science papers

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern