---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
tags:
- code
- gis
- geospatial
- geopandas
- shapely
- rasterio
- osmnx
- folium
- peft
- lora
- trl
- sft
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# GIS-Coder — A Code Model for Geographic Information Systems

A LoRA-adapted code model specialized for GIS and geospatial Python programming. Includes a **ready-to-run training package** for scaling up to 7B on your own GPU cluster.

## 📦 This Repo Contains

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | Trained LoRA adapter (0.5B base, proof of concept) |
| `train_7b.py` | **Production 7B QLoRA training script** with CLI args |
| `evaluate.py` | Evaluation suite (12 GIS benchmarks with scoring) |
| `requirements.txt` | All dependencies |
| `TRAINING_README.md` | **Detailed training guide** — hardware, hyperparameters, ablations |

## 🚀 Train the 7B Model on Your GPUs

```bash
# 1. Clone this repo
git clone https://huggingface.co/RhodWeo/GIS-Coder-7B
cd GIS-Coder-7B

# 2. Install deps
pip install -r requirements.txt

# 3. Login
huggingface-cli login

# 4. Train! (A100 80GB recommended)
python train_7b.py

# For A10G/RTX 4090 (24GB):
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048

# For H100:
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192

# 5. Evaluate
python evaluate.py --adapter_id ./gis-coder-7b-output/final --compare_base
```

See **[TRAINING_README.md](TRAINING_README.md)** for the full guide with hardware-specific settings, ablation ideas, and expected results.

## 🗺️ GIS Libraries Covered (13)

| Priority | Libraries | Coverage |
|----------|-----------|----------|
| **Tier 1** (0% baseline) | OSMnx, MovingPandas, Rasterio, GDAL, PyProj | Heavy — these are where models fail |
| **Tier 2** | GeoPandas, Shapely, H3 | Core GIS operations |
| **Tier 3** | Folium, xarray, PyQGIS, Fiona, PySAL | Real-world workflows |

## 📊 Proof-of-Concept Results (0.5B)

Trained on CPU with the smaller base model to validate the approach:

| Metric | Start → End |
|--------|------------|
| **Loss** | 1.52 → 0.88 (−42%) |
| **Token Accuracy** | 69.3% → **79.3%** (+10pp) |
| **Eval Quality** | **85%** (code + library + CoT + function) |

## 🔬 Training Recipe

Based on published research:

| Principle | Source | Applied |
|-----------|--------|---------|
| QLoRA SFT beats 72B models | [CFD paper](https://arxiv.org/abs/2504.09602) | r=32, all-linear, lr=2e-4 |
| Qwen2.5-Coder best backbone | [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Base model selection |
| Models score 0% on GIS | [GIS Benchmark](https://arxiv.org/abs/2410.04617) | Heavy OSMnx/MovingPandas coverage |
| CoT boosts +20.9% pass@1 | CFD paper ablation | All examples include CoT |
| Target all linear layers | [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | `target_modules="all-linear"` |

## 📚 Dataset

**[RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions)** — 70 expert-curated examples with Chain-of-Thought annotations.

## License

Apache 2.0