GIS-Coder-7B / README.md
RhodWeo's picture
Update README with full training package documentation
ad4257a verified
---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
tags:
- code
- gis
- geospatial
- geopandas
- shapely
- rasterio
- osmnx
- folium
- peft
- lora
- trl
- sft
language:
- en
pipeline_tag: text-generation
library_name: peft
---
# GIS-Coder β€” A Code Model for Geographic Information Systems
A LoRA-adapted code model specialized for GIS and geospatial Python programming. Includes a **ready-to-run training package** for scaling up to 7B on your own GPU cluster.
## πŸ“¦ This Repo Contains
| File | Description |
|------|-------------|
| `adapter_model.safetensors` | Trained LoRA adapter (0.5B base, proof of concept) |
| `train_7b.py` | **Production 7B QLoRA training script** with CLI args |
| `evaluate.py` | Evaluation suite (12 GIS benchmarks with scoring) |
| `requirements.txt` | All dependencies |
| `TRAINING_README.md` | **Detailed training guide** β€” hardware, hyperparameters, ablations |
## πŸš€ Train the 7B Model on Your GPUs
```bash
# 1. Clone this repo
git clone https://huggingface.co/RhodWeo/GIS-Coder-7B
cd GIS-Coder-7B
# 2. Install deps
pip install -r requirements.txt
# 3. Login
huggingface-cli login
# 4. Train! (A100 80GB recommended)
python train_7b.py
# For A10G/RTX 4090 (24GB):
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048
# For H100:
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192
# 5. Evaluate
python evaluate.py --adapter_id ./gis-coder-7b-output/final --compare_base
```
See **[TRAINING_README.md](TRAINING_README.md)** for the full guide with hardware-specific settings, ablation ideas, and expected results.
## πŸ—ΊοΈ GIS Libraries Covered (13)
| Priority | Libraries | Coverage |
|----------|-----------|----------|
| **Tier 1** (0% baseline) | OSMnx, MovingPandas, Rasterio, GDAL, PyProj | Heavy β€” these are where models fail |
| **Tier 2** | GeoPandas, Shapely, H3 | Core GIS operations |
| **Tier 3** | Folium, xarray, PyQGIS, Fiona, PySAL | Real-world workflows |
## πŸ“Š Proof-of-Concept Results (0.5B)
Trained on CPU with the smaller base model to validate the approach:
| Metric | Start β†’ End |
|--------|------------|
| **Loss** | 1.52 β†’ 0.88 (βˆ’42%) |
| **Token Accuracy** | 69.3% β†’ **79.3%** (+10pp) |
| **Eval Quality** | **85%** (code + library + CoT + function) |
## πŸ”¬ Training Recipe
Based on published research:
| Principle | Source | Applied |
|-----------|--------|---------|
| QLoRA SFT beats 72B models | [CFD paper](https://arxiv.org/abs/2504.09602) | r=32, all-linear, lr=2e-4 |
| Qwen2.5-Coder best backbone | [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Base model selection |
| Models score 0% on GIS | [GIS Benchmark](https://arxiv.org/abs/2410.04617) | Heavy OSMnx/MovingPandas coverage |
| CoT boosts +20.9% pass@1 | CFD paper ablation | All examples include CoT |
| Target all linear layers | [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | `target_modules="all-linear"` |
## πŸ“š Dataset
**[RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions)** β€” 70 expert-curated examples with Chain-of-Thought annotations.
## License
Apache 2.0