---
language: en
license: apache-2.0
tags:
- gliner
- named-entity-recognition
- token-classification
- data-mention-extraction
- survey-detection
- geocoded-data
- DHS
- census
- ml-intern
library_name: gliner
base_model: urchade/gliner_large-v2.1
---

# GLiNER Large: Data Mention Extraction

Fine-tuned from [`urchade/gliner_large-v2.1`](https://huggingface.co/urchade/gliner_large-v2.1) for extracting **data source mentions** in social science and global health research papers.

> **Status:** Training job pending. See `train_gliner.py` for the complete self-contained training script.

---

## Entity Types

| Label | Description | Examples |
|---|---|---|
| `SURVEY` | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer |
| `DATASET` | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset |
| `DATABASE` | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3 |
| `GEOCODED_DATA` | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data |
| `VAGUE_MENTION` | Vague/informal references | "a survey from Ghana", "a nationally representative household survey" |

---

## Usage (after training)

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."

entities = model.predict_entities(
    text,
    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
    threshold=0.4,
)

for e in entities:
    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.2f})")

# Illustrative output (actual scores depend on the trained checkpoint):
# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.87)
# [SURVEY] '2018 DHS' (score=0.92)
# [DATABASE] 'WorldPop gridded population database' (score=0.83)
```
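
The list returned by `predict_entities` can be post-processed with plain Python. The following is a minimal sketch, assuming each entity is a dict with the `label`, `text`, and `score` keys used above. The helper `group_by_label` is hypothetical (it is not part of GLiNER), and the sample entities are hand-written stand-ins rather than real predictions. It buckets mentions by label and drops anything below a score cutoff.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.4):
    """Bucket entity dicts by label, keeping only spans at or above min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hand-written sample in the same shape as the model output (not real predictions).
sample = [
    {"label": "SURVEY", "text": "2018 DHS", "score": 0.92},
    {"label": "DATABASE", "text": "WorldPop gridded population database", "score": 0.83},
    {"label": "VAGUE_MENTION", "text": "a survey", "score": 0.21},
]

for label, mentions in group_by_label(sample).items():
    print(label, mentions)
```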

---

## Training Data

- **121 manually annotated sentences** / **175 entity spans**
- **Domain:** Social science, global health, demographic surveys, GIS/spatial data
- **No public dataset available**, so the data was custom-built for this task
- All data is embedded inline in `train_gliner.py` for portability

Entity distribution:

```
SURVEY        : 55 spans (31.4%)
GEOCODED_DATA : 40 spans (22.9%)
DATABASE      : 38 spans (21.7%)
DATASET       : 24 spans (13.7%)
VAGUE_MENTION : 18 spans (10.3%)
```
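
The percentages above follow directly from the span counts. As a quick sanity check, this sketch recomputes each label's share of the 175 spans and the ratio between the most and least frequent label. The variable names are illustrative only.

```python
# Span counts per label, copied from the distribution listed above.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}

total = sum(span_counts.values())
shares = {label: round(100 * n / total, 1) for label, n in span_counts.items()}

# Ratio of the largest class to the smallest, a rough measure of imbalance.
imbalance = max(span_counts.values()) / min(span_counts.values())

for label, pct in shares.items():
    print(f"{label:<14}: {span_counts[label]} spans ({pct}%)")
print(f"total={total}, max/min ratio={imbalance:.2f}")
```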

---

## Training Configuration

Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arXiv:2311.08526):

| Parameter | Value |
|---|---|
| Base model | `urchade/gliner_large-v2.1` (DeBERTa-v3-large) |
| Epochs | 25 |
| Batch size | 8 |
| Encoder LR | 1e-5 |
| Head LR (`others_lr`) | 5e-5 |
| LR scheduler | cosine |
| Warmup | 8% of total steps |
| Loss | sum reduction, `focal_loss_alpha=-1` |
| Negatives ratio | 1.0 |
| Masking | global |
| FP16 | True (GPU) |
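
To make the scheduler rows concrete, here is a minimal sketch of the usual linear-warmup plus cosine-decay shape, applied to the two learning rates in the table. It assumes decay to zero and uses the 325 total steps quoted in the execution plan of this card. The function `lr_at_step` is hypothetical, and whether the training script's scheduler matches this shape exactly is an assumption.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.08):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 325  # figure quoted in the execution plan of this card
for name, base_lr in [("encoder", 1e-5), ("head", 5e-5)]:
    peak = lr_at_step(26, TOTAL_STEPS, base_lr)  # first step after warmup
    later = lr_at_step(175, TOTAL_STEPS, base_lr)  # partway through the decay
    print(f"{name}: peak={peak:.1e}, step 175={later:.2e}")
```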

---

## How to Run Training

### Option A: HF Jobs (recommended)

```python
from hf_jobs import run

run(
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    env={
        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
        "TRACKIO_PROJECT": "gliner-data-mentions",
    },
)
```

Or via the ML Intern agent tool:

```
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions"
)
```

### Option B: Local GPU

```bash
pip install gliner==0.2.26 torch trackio
python train_gliner.py
# or smoke test:
python train_gliner.py --smoke_test
```

---

## Agent Execution Plan

For an autonomous agent completing this task, follow these steps **in order**:

### Step 1: Verify prerequisites (already done)

- [x] `gliner==0.2.26` API validated: `Trainer`, `TrainingArguments`, `model._create_data_collator()`
- [x] `TrainingArguments` has `others_lr`, `loss_reduction`, `negatives`, `masking`, `focal_loss_*` params
- [x] `GLiNER.push_to_hub()` confirmed available
- [x] `report_to="trackio"` natively supported in `TrainingArguments`
- [x] Training data: 121 examples, 175 spans, 0 validation errors
- [x] Hub repo created: `rafmacalaba/gliner-large-data-mentions`
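
The "0 validation errors" item implies a structural check on the annotations. The actual checks in `train_gliner.py` are not shown in this card, so the following is only a sketch of what such a check might look like. It assumes examples use the `tokenized_text` / `ner` layout commonly used for GLiNER training data, with inclusive token-index spans of the form `[start, end, label]`. The helper `validate_example` and both sample records are hypothetical.

```python
LABELS = {"SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"}


def validate_example(example):
    """Return a list of problems found in one training example (empty if clean)."""
    tokens = example["tokenized_text"]
    errors = []
    for start, end, label in example["ner"]:
        if label not in LABELS:
            errors.append(f"unknown label: {label}")
        # Spans are assumed to be inclusive token indices.
        if not (0 <= start <= end < len(tokens)):
            errors.append(f"span out of range: ({start}, {end})")
    return errors


# Hypothetical records: one clean, one with a misspelled label and a bad span.
good = {
    "tokenized_text": ["We", "used", "the", "2018", "DHS", "."],
    "ner": [[3, 4, "SURVEY"]],
}
bad = {
    "tokenized_text": ["We", "used", "FAOSTAT"],
    "ner": [[2, 3, "DATABSE"]],
}

print(validate_example(good))
print(validate_example(bad))
```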

### Step 2: Submit training job

```
hf_jobs(
    operation="run",
    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
    dependencies=["gliner==0.2.26", "torch", "trackio"],
    hardware_flavor="t4-small",
    timeout="3h",
    trackio_space_id="rafmacalaba/ml-intern-gliner01",
    trackio_project="gliner-data-mentions"
)
```

Expected: the job starts, downloads `urchade/gliner_large-v2.1` (~900 MB), and trains for 25 epochs (~103 examples at batch size 8, so 13 steps/epoch and 325 total steps), taking roughly 20-40 min on a T4.
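
The step counts quoted above can be reproduced with a few lines of arithmetic. This sketch assumes a single GPU, no gradient accumulation, and that the final partial batch of each epoch still counts as a step. The figure of ~103 training examples is taken as given; the card does not say how it relates to the 121 annotated sentences.

```python
import math

train_examples = 103  # approximate figure quoted above
batch_size = 8
epochs = 25

# The last, partially filled batch of each epoch still triggers an optimizer step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # prints: 13 325
```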

### Step 3: Monitor

```
hf_jobs(operation="logs", job_id="<job_id_from_step2>")
```

Look for: `=== Starting training ===`, then `{'loss': ..., 'epoch': ...}` every 5 steps.

If the job runs out of memory (OOM), reduce `batch_size` to 4 and increase `gradient_accumulation_steps` to 2, which keeps the effective batch size at 8.
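
When scanning the logs, the dict-style lines can be pulled out programmatically. The following is a minimal sketch, assuming the log contains Python-literal dicts with `loss` and `epoch` keys as described above. The helper `parse_loss_lines` is hypothetical, and the numbers in `sample_log` are made up for illustration.

```python
import ast


def parse_loss_lines(log_text):
    """Extract (epoch, loss) pairs from dict-style lines in a log dump."""
    points = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)  # safe parsing of Python literals
        except (ValueError, SyntaxError):
            continue
        if isinstance(record, dict) and "loss" in record and "epoch" in record:
            points.append((record["epoch"], record["loss"]))
    return points


# Made-up log excerpt in the format described above.
sample_log = """=== Starting training ===
{'loss': 41.2, 'epoch': 0.38}
some unrelated line
{'loss': 30.5, 'epoch': 0.77}
"""

print(parse_loss_lines(sample_log))
```

A steadily decreasing loss column suggests the run is healthy, while large swings between consecutive points are the oscillation signal mentioned in Step 6.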

### Step 4: Verify Hub push

```
hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")
```

Expect: `pytorch_model.bin` or `model.safetensors`, `config.json`, `tokenizer_config.json`.
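
The file check can be automated once a listing is available, for example from the agent tool above or from `huggingface_hub.list_repo_files`. This sketch works on a plain list of file names and uses the expected files exactly as listed in this step. The helper `missing_artifacts` is hypothetical.

```python
def missing_artifacts(repo_files):
    """Return the expected artifacts that are absent from a repo file listing."""
    names = set(repo_files)
    missing = []
    # Either weight format is acceptable.
    if not names & {"pytorch_model.bin", "model.safetensors"}:
        missing.append("pytorch_model.bin or model.safetensors")
    for required in ("config.json", "tokenizer_config.json"):
        if required not in names:
            missing.append(required)
    return missing


# Example listings (hypothetical): one complete, one from before the push.
after_push = ["README.md", "model.safetensors", "config.json", "tokenizer_config.json"]
before_push = ["README.md", "train_gliner.py"]

print(missing_artifacts(after_push))
print(missing_artifacts(before_push))
```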

### Step 5: Evaluate (post-training)

Run inference on test sentences and confirm entities are extracted with score > 0.4:

- "The Demographic and Health Survey collected data in 47 countries." → `[SURVEY] Demographic and Health Survey`
- "Geocoded DHS cluster coordinates were overlaid with flood maps." → `[GEOCODED_DATA]`, `[SURVEY]`
- "A survey from Ghana collected child nutrition data." → `[VAGUE_MENTION] A survey from Ghana`
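
These checks can be expressed as a small comparison between predicted entities and expected `(label, text)` pairs. The sketch below covers the two test sentences whose expected span text is given above. It assumes entities are dicts with `label`, `text`, and `score` keys. The helper `check_predictions` is hypothetical, and the demo predictions are hand-written stand-ins, since no trained checkpoint exists yet.

```python
EXPECTED = {
    "The Demographic and Health Survey collected data in 47 countries.": [
        ("SURVEY", "Demographic and Health Survey"),
    ],
    "A survey from Ghana collected child nutrition data.": [
        ("VAGUE_MENTION", "A survey from Ghana"),
    ],
}


def check_predictions(entities, expected, min_score=0.4):
    """Return the expected (label, text) pairs not predicted above min_score."""
    found = {(e["label"], e["text"]) for e in entities if e["score"] > min_score}
    return [pair for pair in expected if pair not in found]


# Hand-written stand-in for model.predict_entities(...) output.
demo_entities = [
    {"label": "SURVEY", "text": "Demographic and Health Survey", "score": 0.9},
]
sentence = "The Demographic and Health Survey collected data in 47 countries."
print(check_predictions(demo_entities, EXPECTED[sentence]))
```

With a trained model, `demo_entities` would be replaced by the result of `model.predict_entities(sentence, labels=[...], threshold=0.4)`; an empty return value means every expected mention was found.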

### Step 6: Iterate (if results are poor)

If entity scores are low (< 0.5) or entities are missed:

1. Add more training examples (especially `VAGUE_MENTION`, currently underrepresented at 18 spans)
2. Increase epochs to 40 (a dataset this small may benefit from more epochs)
3. Consider lowering the encoder LR to 5e-6 if the loss is oscillating

---

## Key References

- GLiNER paper: [arXiv:2311.08526](https://arxiv.org/abs/2311.08526) (NAACL 2024)
- Dataset mention extraction: [arXiv:2502.10263](https://arxiv.org/abs/2502.10263) (World Bank, 2025)
- GSAP-NER (scholarly entity extraction): [arXiv:2311.09860](https://arxiv.org/abs/2311.09860)
- Coleridge "Show US the Data" Kaggle: best public dataset for dataset mentions in social science papers

| <!-- ml-intern-provenance --> | |
| ## Generated by ML Intern | |
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. | |
| - Try ML Intern: https://smolagents-ml-intern.hf.space | |
| - Source code: https://github.com/huggingface/ml-intern | |