File size: 3,157 Bytes

b0627d1
50b1aaa
b0627d1
 
50b1aaa
 
 
 
 
 
 
 
 
 
b0627d1
 
50b1aaa
 
 
 
b0627d1
 
ad4257a
 
 
 
 
 
 
 
 
 
 
 
 
b0627d1
ad4257a
b0627d1
ad4257a
 
 
 
50b1aaa
ad4257a
 
50b1aaa
ad4257a
 
50b1aaa
ad4257a
 
50b1aaa
ad4257a
 
 
 
 
 
 
 
b0627d1
 
ad4257a
50b1aaa
ad4257a
50b1aaa
ad4257a
 
 
 
 
b0627d1
ad4257a
b0627d1
ad4257a
b0627d1
ad4257a
 
 
 
 
b0627d1
ad4257a
b0627d1
ad4257a
b0627d1
ad4257a
 
 
 
 
 
 
b0627d1
ad4257a
b0627d1
ad4257a
b0627d1
50b1aaa
b0627d1
ad4257a

---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
tags:
- code
- gis
- geospatial
- geopandas
- shapely
- rasterio
- osmnx
- folium
- peft
- lora
- trl
- sft
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# GIS-Coder — A Code Model for Geographic Information Systems

A LoRA-adapted code model specialized for GIS and geospatial Python programming. Includes a **ready-to-run training package** for scaling up to 7B on your own GPU cluster.

## 📦 This Repo Contains

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | Trained LoRA adapter (0.5B base, proof of concept) |
| `train_7b.py` | **Production 7B QLoRA training script** with CLI args |
| `evaluate.py` | Evaluation suite (12 GIS benchmarks with scoring) |
| `requirements.txt` | All dependencies |
| `TRAINING_README.md` | **Detailed training guide** — hardware, hyperparameters, ablations |

## 🚀 Train the 7B Model on Your GPUs

```bash
# 1. Clone this repo
git clone https://huggingface.co/RhodWeo/GIS-Coder-7B
cd GIS-Coder-7B

# 2. Install deps
pip install -r requirements.txt

# 3. Login
huggingface-cli login

# 4. Train! (A100 80GB recommended)
python train_7b.py

# For A10G/RTX 4090 (24GB):
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048

# For H100:
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192

# 5. Evaluate
python evaluate.py --adapter_id ./gis-coder-7b-output/final --compare_base
```

See **[TRAINING_README.md](TRAINING_README.md)** for the full guide with hardware-specific settings, ablation ideas, and expected results.

## 🗺️ GIS Libraries Covered (13)

| Priority | Libraries | Coverage |
|----------|-----------|----------|
| **Tier 1** (0% baseline) | OSMnx, MovingPandas, Rasterio, GDAL, PyProj | Heavy — these are where models fail |
| **Tier 2** | GeoPandas, Shapely, H3 | Core GIS operations |
| **Tier 3** | Folium, xarray, PyQGIS, Fiona, PySAL | Real-world workflows |

## 📊 Proof-of-Concept Results (0.5B)

Trained on CPU with the smaller base model to validate the approach:

| Metric | Start → End |
|--------|------------|
| **Loss** | 1.52 → 0.88 (−42%) |
| **Token Accuracy** | 69.3% → **79.3%** (+10pp) |
| **Eval Quality** | **85%** (code + library + CoT + function) |

## 🔬 Training Recipe

Based on published research:

| Principle | Source | Applied |
|-----------|--------|---------|
| QLoRA SFT beats 72B models | [CFD paper](https://arxiv.org/abs/2504.09602) | r=32, all-linear, lr=2e-4 |
| Qwen2.5-Coder best backbone | [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Base model selection |
| Models score 0% on GIS | [GIS Benchmark](https://arxiv.org/abs/2410.04617) | Heavy OSMnx/MovingPandas coverage |
| CoT boosts +20.9% pass@1 | CFD paper ablation | All examples include CoT |
| Target all linear layers | [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | `target_modules="all-linear"` |

## 📚 Dataset

**[RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions)** — 70 expert-curated examples with Chain-of-Thought annotations.

## License

Apache 2.0