	---
	language: en
	license: apache-2.0
	tags:
	- gliner
	- named-entity-recognition
	- token-classification
	- data-mention-extraction
	- survey-detection
	- geocoded-data
	- DHS
	- census
	- ml-intern
	library_name: gliner
	base_model: urchade/gliner_large-v2.1
	---

	# GLiNER Large — Data Mention Extraction

	Fine-tuned from [`urchade/gliner_large-v2.1`](https://huggingface.co/urchade/gliner_large-v2.1) for extracting data source mentions in social science and global health research papers.

	> Status: Training job pending. See `train_gliner.py` for the complete self-contained training script.

	---

	## Entity Types

	| Label | Description | Examples |
	|---|---|---|
	| `SURVEY` | Named survey programs | Demographic and Health Survey, DHS, MICS, LSMS, Afrobarometer |
	| `DATASET` | Specific named datasets | Census microdata, 2010 Population and Housing Census, LSMS-ISA panel dataset |
	| `DATABASE` | Named databases/repositories | World Development Indicators, IHME GBD, IPUMS-DHS, FAOSTAT, GRID3 |
	| `GEOCODED_DATA` | Geocoded/spatial data mentions | GPS coordinates, geocoded DHS cluster coordinates, geo-referenced household data |
	| `VAGUE_MENTION` | Vague/informal references | "a survey from Ghana", "a nationally representative household survey" |

	---

	## Usage (after training)

	```python
	from gliner import GLiNER

	model = GLiNER.from_pretrained("rafmacalaba/gliner-large-data-mentions")

	text = "We used GPS-tagged household locations from the 2018 DHS and linked them to the WorldPop gridded population database."

	entities = model.predict_entities(
	    text,
	    labels=["SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"],
	    threshold=0.4,  # drop predictions scoring below 0.4
	)

	for e in entities:
	    print(f"[{e['label']}] '{e['text']}' (score={e['score']:.3f})")
	# Illustrative output (training is still pending, so actual scores will differ):
	# [GEOCODED_DATA] 'GPS-tagged household locations' (score=0.870)
	# [SURVEY] '2018 DHS' (score=0.920)
	# [DATABASE] 'WorldPop gridded population database' (score=0.830)
	```

	---

	## Training Data

	- 121 manually annotated sentences / 175 entity spans
	- Domain: Social science, global health, demographic surveys, GIS/spatial data
	- No public dataset available — custom-built for this task
	- All data is embedded inline in `train_gliner.py` for portability

	Entity distribution:
	```
	SURVEY : 55 spans (31.4%)
	GEOCODED_DATA : 40 spans (22.9%)
	DATABASE : 38 spans (21.7%)
	DATASET : 24 spans (13.7%)
	VAGUE_MENTION : 18 spans (10.3%)
	```
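
The percentages above follow directly from the span counts. A minimal sketch that recomputes them, using only the numbers listed in this card (the variable and function names are illustrative and do not come from `train_gliner.py`):

```python
# Span counts per label, copied from the distribution listed above.
span_counts = {
    "SURVEY": 55,
    "GEOCODED_DATA": 40,
    "DATABASE": 38,
    "DATASET": 24,
    "VAGUE_MENTION": 18,
}

total_spans = sum(span_counts.values())


def label_shares(counts):
    """Return each label's share of all spans as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


shares = label_shares(span_counts)
for label, pct in shares.items():
    print(f"{label:<14}: {span_counts[label]:>3} spans ({pct}%)")

# Ratio between the most and least frequent labels, a rough imbalance indicator.
imbalance_ratio = max(span_counts.values()) / min(span_counts.values())
```

With these counts, `SURVEY` has roughly three times as many spans as `VAGUE_MENTION`, which is worth keeping in mind when reading per-label results.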

	---

	## Training Configuration

	Based on the fine-tuning recipe from the original GLiNER paper (NAACL 2024, arXiv:2311.08526):

	| Parameter | Value |
	|---|---|
	| Base model | `urchade/gliner_large-v2.1` (DeBERTa-v3-large) |
	| Epochs | 25 |
	| Batch size | 8 |
	| Encoder LR | 1e-5 |
	| Head LR (`others_lr`) | 5e-5 |
	| LR scheduler | cosine |
	| Warmup | 8% of total steps |
	| Loss | sum reduction, `focal_loss_alpha=-1` |
	| Negatives ratio | 1.0 |
	| Masking | global |
	| FP16 | True (GPU) |
	---

	## How to Run Training

	### Option A: HF Jobs (recommended)

	```python
	from hf_jobs import run

	run(
	    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
	    dependencies=["gliner==0.2.26", "torch", "trackio"],
	    hardware_flavor="t4-small",
	    timeout="3h",
	    env={
	        "TRACKIO_SPACE_ID": "rafmacalaba/ml-intern-gliner01",
	        "TRACKIO_PROJECT": "gliner-data-mentions",
	    },
	)
	```

	Or via the ML Intern agent tool:
	```
	hf_jobs(
	    operation="run",
	    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
	    dependencies=["gliner==0.2.26", "torch", "trackio"],
	    hardware_flavor="t4-small",
	    timeout="3h",
	    trackio_space_id="rafmacalaba/ml-intern-gliner01",
	    trackio_project="gliner-data-mentions"
	)
	```

	### Option B: Local GPU

	```bash
	pip install gliner==0.2.26 torch trackio
	python train_gliner.py
	# or smoke test:
	python train_gliner.py --smoke_test
	```

	---

	## Agent Execution Plan

	An autonomous agent completing this task should follow these steps in order:

	### Step 1 — Verify prerequisites ✅ (already done)
	- [x] `gliner==0.2.26` API validated: `Trainer`, `TrainingArguments`, `model._create_data_collator()`
	- [x] `TrainingArguments` has `others_lr`, `loss_reduction`, `negatives`, `masking`, `focal_loss_*` params
	- [x] `GLiNER.push_to_hub()` confirmed available
	- [x] `report_to="trackio"` natively supported in `TrainingArguments`
	- [x] Training data: 121 examples, 175 spans, 0 validation errors
	- [x] Hub repo created: `rafmacalaba/gliner-large-data-mentions`
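
The "0 validation errors" item refers to checks on the inline training data. The exact checks in `train_gliner.py` are not reproduced in this card; the sketch below shows what such a check might look like, assuming the commonly used GLiNER training format of `tokenized_text` plus `ner` spans given as `[start_token, end_token, label]` with inclusive indices. `validate_example` and the sample record are hypothetical.

```python
ALLOWED_LABELS = {"SURVEY", "DATASET", "DATABASE", "GEOCODED_DATA", "VAGUE_MENTION"}


def validate_example(example):
    """Return a list of human-readable problems found in one training example."""
    problems = []
    tokens = example.get("tokenized_text", [])
    if not tokens:
        problems.append("empty tokenized_text")
    for span in example.get("ner", []):
        if len(span) != 3:
            problems.append(f"malformed span: {span!r}")
            continue
        start, end, label = span
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label!r}")
        # Token indices are assumed to be inclusive on both ends.
        if not (0 <= start <= end < len(tokens)):
            problems.append(f"span out of range: {span!r}")
    return problems


# Hypothetical example in the assumed format.
good_example = {
    "tokenized_text": ["We", "used", "the", "2018", "DHS", "."],
    "ner": [[3, 4, "SURVEY"]],
}
# Deliberately broken example: one out-of-range span and one unknown label.
bad_example = {
    "tokenized_text": ["a"],
    "ner": [[0, 3, "SURVEY"], [0, 0, "FOO"]],
}

print(validate_example(good_example))
print(validate_example(bad_example))
```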

	### Step 2 — Submit training job
	```
	hf_jobs(
	    operation="run",
	    script="https://huggingface.co/rafmacalaba/gliner-large-data-mentions/raw/main/train_gliner.py",
	    dependencies=["gliner==0.2.26", "torch", "trackio"],
	    hardware_flavor="t4-small",
	    timeout="3h",
	    trackio_space_id="rafmacalaba/ml-intern-gliner01",
	    trackio_project="gliner-data-mentions"
	)
	```

	Expected: the job starts, downloads `urchade/gliner_large-v2.1` (~900 MB), and trains for 25 epochs. With ~103 training examples at batch size 8, that is 13 steps per epoch and 325 steps in total, which should take roughly 20-40 min on a T4.
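
The step counts follow from simple arithmetic. A short sketch, assuming a single GPU, no gradient accumulation, and that the final partial batch of each epoch is kept:

```python
import math

train_examples = 103  # approximate training-set size quoted above
batch_size = 8
epochs = 25
warmup_ratio = 0.08   # "8% of total steps" from the training configuration

# The last batch of an epoch may be smaller than batch_size, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
# → 13 325 26
```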

	### Step 3 — Monitor
	```
	hf_jobs(operation="logs", job_id="<job_id_from_step2>")
	```
	Look for: `=== Starting training ===`, then `{'loss': ..., 'epoch': ...}` every 5 steps.
	If the job runs out of memory (OOM), reduce `batch_size` to 4 and increase `gradient_accumulation_steps` to 2, which keeps the effective batch size at 8.
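
If the raw logs are available as text, the loss lines can be pulled out for a quick trend check. This is a minimal sketch that assumes the metric lines are printed as Python-literal dicts in the `{'loss': ..., 'epoch': ...}` form described above; `parse_loss_lines` is a hypothetical helper and the sample log values are made up.

```python
import ast


def parse_loss_lines(log_text):
    """Extract (epoch, loss) pairs from lines that look like metric dicts."""
    points = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError, TypeError):
            continue  # not a valid Python literal, skip it
        if isinstance(record, dict) and "loss" in record and "epoch" in record:
            points.append((float(record["epoch"]), float(record["loss"])))
    return points


# Hypothetical log excerpt; the numbers are invented for illustration only.
sample_logs = """=== Starting training ===
{'loss': 41.2, 'epoch': 0.38}
some unrelated line
{'loss': 17.9, 'epoch': 0.77}
"""

points = parse_loss_lines(sample_logs)
print(points)
```

A loss that trends downward across the extracted points suggests training is progressing; a flat or oscillating curve is a cue to revisit the learning rates.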

	### Step 4 — Verify Hub push
	```
	hf_repo_files(operation="list", repo_id="rafmacalaba/gliner-large-data-mentions")
	```
	Expect: `pytorch_model.bin` or `model.safetensors`, `config.json`, `tokenizer_config.json`.
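
The same check can be scripted against any file listing. The sketch below works on a plain list of filenames, which could come from the agent tool above or from `huggingface_hub.list_repo_files`; `check_hub_files` is a hypothetical helper, and the expected filenames are the ones named in this step.

```python
def check_hub_files(filenames):
    """Return the expected artifacts that are missing from a repo file listing."""
    present = set(filenames)
    missing = []
    # Either weight format is acceptable.
    if not ({"pytorch_model.bin", "model.safetensors"} & present):
        missing.append("pytorch_model.bin or model.safetensors")
    for required in ("config.json", "tokenizer_config.json"):
        if required not in present:
            missing.append(required)
    return missing


# Hypothetical listing of a repo after a successful push.
listing = ["README.md", "model.safetensors", "config.json", "tokenizer_config.json"]
print(check_hub_files(listing))
```

An empty result means every expected artifact is present; anything else lists what still needs to be pushed.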

	### Step 5 — Evaluate (post-training)
	Run inference on test sentences and confirm entities are extracted with score > 0.4:
	- "The Demographic and Health Survey collected data in 47 countries." → `[SURVEY] Demographic and Health Survey`
	- "Geocoded DHS cluster coordinates were overlaid with flood maps." → `[GEOCODED_DATA]`, `[SURVEY]`
	- "A survey from Ghana collected child nutrition data." → `[VAGUE_MENTION] A survey from Ghana`
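
These spot checks can be automated once predictions are available. A minimal sketch, assuming each prediction is a dict with `label`, `text`, and `score` keys as in the usage example; `missing_labels` is a hypothetical helper and the predictions below are invented for illustration.

```python
def missing_labels(entities, expected_labels, threshold=0.4):
    """Return expected labels with no predicted entity scoring above `threshold`."""
    found = {e["label"] for e in entities if e["score"] > threshold}
    return sorted(set(expected_labels) - found)


# Hypothetical predictions for the second test sentence.
predictions = [
    {"label": "GEOCODED_DATA", "text": "Geocoded DHS cluster coordinates", "score": 0.81},
    {"label": "SURVEY", "text": "DHS", "score": 0.35},
]

print(missing_labels(predictions, ["GEOCODED_DATA", "SURVEY"]))
# → ['SURVEY']
```

Here the `SURVEY` prediction exists but falls below the 0.4 threshold, so it is reported as missing.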

	### Step 6 — Iterate (if results are poor)
	If entity scores are low (< 0.5) or entities are missed:
	1. Add more training examples (especially `VAGUE_MENTION`, currently the least represented label at 18 spans)
	2. Increase epochs to 40 (small dataset benefits from more epochs)
	3. Consider lowering encoder LR to 5e-6 if loss is oscillating

	---

	## Key References

	- GLiNER paper: [arxiv:2311.08526](https://arxiv.org/abs/2311.08526) (NAACL 2024)
	- Dataset mention extraction: [arxiv:2502.10263](https://arxiv.org/abs/2502.10263) (World Bank, 2025)
	- GSAP-NER (scholarly entity extraction): [arxiv:2311.09860](https://arxiv.org/abs/2311.09860)
	- Coleridge Initiative "Show US the Data" (Kaggle competition): a public dataset of dataset mentions in scientific papers

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern