--- license: apache-2.0 base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct tags: - code - gis - geospatial - geopandas - shapely - rasterio - osmnx - folium - peft - lora - trl - sft language: - en pipeline_tag: text-generation library_name: peft --- # GIS-Coder β€” A Code Model for Geographic Information Systems A LoRA-adapted code model specialized for GIS and geospatial Python programming. Includes a **ready-to-run training package** for scaling up to 7B on your own GPU cluster. ## πŸ“¦ This Repo Contains | File | Description | |------|-------------| | `adapter_model.safetensors` | Trained LoRA adapter (0.5B base, proof of concept) | | `train_7b.py` | **Production 7B QLoRA training script** with CLI args | | `evaluate.py` | Evaluation suite (12 GIS benchmarks with scoring) | | `requirements.txt` | All dependencies | | `TRAINING_README.md` | **Detailed training guide** β€” hardware, hyperparameters, ablations | ## πŸš€ Train the 7B Model on Your GPUs ```bash # 1. Clone this repo git clone https://huggingface.co/RhodWeo/GIS-Coder-7B cd GIS-Coder-7B # 2. Install deps pip install -r requirements.txt # 3. Login huggingface-cli login # 4. Train! (A100 80GB recommended) python train_7b.py # For A10G/RTX 4090 (24GB): python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048 # For H100: python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192 # 5. Evaluate python evaluate.py --adapter_id ./gis-coder-7b-output/final --compare_base ``` See **[TRAINING_README.md](TRAINING_README.md)** for the full guide with hardware-specific settings, ablation ideas, and expected results. ## πŸ—ΊοΈ GIS Libraries Covered (13) | Priority | Libraries | Coverage | |----------|-----------|----------| | **Tier 1** (0% baseline) | OSMnx, MovingPandas, Rasterio, GDAL, PyProj | Heavy β€” these are where models fail | | **Tier 2** | GeoPandas, Shapely, H3 | Core GIS operations | | **Tier 3** | Folium, xarray, PyQGIS, Fiona, PySAL | Real-world workflows | ## πŸ“Š Proof-of-Concept Results (0.5B) Trained on CPU with the smaller base model to validate the approach: | Metric | Start β†’ End | |--------|------------| | **Loss** | 1.52 β†’ 0.88 (βˆ’42%) | | **Token Accuracy** | 69.3% β†’ **79.3%** (+10pp) | | **Eval Quality** | **85%** (code + library + CoT + function) | ## πŸ”¬ Training Recipe Based on published research: | Principle | Source | Applied | |-----------|--------|---------| | QLoRA SFT beats 72B models | [CFD paper](https://arxiv.org/abs/2504.09602) | r=32, all-linear, lr=2e-4 | | Qwen2.5-Coder best backbone | [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Base model selection | | Models score 0% on GIS | [GIS Benchmark](https://arxiv.org/abs/2410.04617) | Heavy OSMnx/MovingPandas coverage | | CoT boosts +20.9% pass@1 | CFD paper ablation | All examples include CoT | | Target all linear layers | [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | `target_modules="all-linear"` | ## πŸ“š Dataset **[RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions)** β€” 70 expert-curated examples with Chain-of-Thought annotations. ## License Apache 2.0