Update README with full training package documentation

ad4257a verified 25 days ago

3.16 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
	tags:
	- code
	- gis
	- geospatial
	- geopandas
	- shapely
	- rasterio
	- osmnx
	- folium
	- peft
	- lora
	- trl
	- sft
	language:
	- en
	pipeline_tag: text-generation
	library_name: peft
	---

	# GIS-Coder — A Code Model for Geographic Information Systems

	A LoRA-adapted code model specialized for GIS and geospatial Python programming. Includes a ready-to-run training package for scaling up to 7B on your own GPU cluster.

	## 📦 This Repo Contains

	\| File \| Description \|
	\|------\|-------------\|
	\| `adapter_model.safetensors` \| Trained LoRA adapter (0.5B base, proof of concept) \|
	\| `train_7b.py` \| Production 7B QLoRA training script with CLI args \|
	\| `evaluate.py` \| Evaluation suite (12 GIS benchmarks with scoring) \|
	\| `requirements.txt` \| All dependencies \|
	\| `TRAINING_README.md` \| Detailed training guide — hardware, hyperparameters, ablations \|

	## 🚀 Train the 7B Model on Your GPUs

	```bash
	# 1. Clone this repo
	git clone https://huggingface.co/RhodWeo/GIS-Coder-7B
	cd GIS-Coder-7B

	# 2. Install deps
	pip install -r requirements.txt

	# 3. Login
	huggingface-cli login

	# 4. Train! (A100 80GB recommended)
	python train_7b.py

	# For A10G/RTX 4090 (24GB):
	python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048

	# For H100:
	python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192

	# 5. Evaluate
	python evaluate.py --adapter_id ./gis-coder-7b-output/final --compare_base
	```

	See [TRAINING_README.md](TRAINING_README.md) for the full guide with hardware-specific settings, ablation ideas, and expected results.

	## 🗺️ GIS Libraries Covered (13)

	\| Priority \| Libraries \| Coverage \|
	\|----------\|-----------\|----------\|
	\| Tier 1 (0% baseline) \| OSMnx, MovingPandas, Rasterio, GDAL, PyProj \| Heavy — these are where models fail \|
	\| Tier 2 \| GeoPandas, Shapely, H3 \| Core GIS operations \|
	\| Tier 3 \| Folium, xarray, PyQGIS, Fiona, PySAL \| Real-world workflows \|

	## 📊 Proof-of-Concept Results (0.5B)

	Trained on CPU with the smaller base model to validate the approach:

	\| Metric \| Start → End \|
	\|--------\|------------\|
	\| Loss \| 1.52 → 0.88 (−42%) \|
	\| Token Accuracy \| 69.3% → 79.3% (+10pp) \|
	\| Eval Quality \| 85% (code + library + CoT + function) \|

	## 🔬 Training Recipe

	Based on published research:

	\| Principle \| Source \| Applied \|
	\|-----------\|--------\|---------\|
	\| QLoRA SFT beats 72B models \| [CFD paper](https://arxiv.org/abs/2504.09602) \| r=32, all-linear, lr=2e-4 \|
	\| Qwen2.5-Coder best backbone \| [MapCoder-Lite](https://arxiv.org/abs/2509.17489) \| Base model selection \|
	\| Models score 0% on GIS \| [GIS Benchmark](https://arxiv.org/abs/2410.04617) \| Heavy OSMnx/MovingPandas coverage \|
	\| CoT boosts +20.9% pass@1 \| CFD paper ablation \| All examples include CoT \|
	\| Target all linear layers \| [LoRA Without Regret](https://arxiv.org/abs/2410.13732) \| `target_modules="all-linear"` \|

	## 📚 Dataset

	[RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions) — 70 expert-curated examples with Chain-of-Thought annotations.

	## License

	Apache 2.0