RhodWeo committed on
Commit ad4257a · verified · 1 Parent(s): b7efbc7

Update README with full training package documentation

Files changed (1)
  1. README.md +58 -100
README.md CHANGED
@@ -20,124 +20,82 @@ pipeline_tag: text-generation
  library_name: peft
  ---

- # GIS-Coder: A Code Model for Geographic Information Systems
-
- **GIS-Coder** is a LoRA fine-tuned code model specialized in GIS (Geographic Information Systems) and geospatial analysis. Built on top of [Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct), it has been fine-tuned on expert-curated GIS code instruction data covering 13+ geospatial Python libraries.
-
- ## 🗺️ What Can GIS-Coder Do?
-
- - **OSMnx**: Road network analysis, routing, isochrones, POI extraction from OpenStreetMap
- - **GeoPandas**: Spatial joins, buffering, dissolving, geocoding, file I/O
- - **Rasterio**: Raster reprojection, NDVI calculation, zonal statistics, hillshade
- - **Shapely**: Geometry creation, boolean operations, validation, simplification
- - **GDAL/OGR**: Raster merging, clipping, format conversion
- - **PyProj**: CRS conversions, UTM zone detection, coordinate transformations
- - **Folium**: Interactive web maps, heatmaps, choropleths, marker clusters
- - **MovingPandas**: GPS trajectory analysis, stop detection, generalization
- - **H3**: Hexagonal spatial indexing, density aggregation
- - **Fiona**: Low-level vector I/O, filtering, format conversion
- - **xarray/rioxarray**: Climate data analysis, NetCDF processing, raster export
- - **PyQGIS**: QGIS scripting, processing algorithms, map layouts
- - **PySAL**: Spatial autocorrelation (Moran's I), LISA clusters
-
- ## 📊 Training Details
-
- | Metric | Value |
- |--------|-------|
- | **Base Model** | Qwen/Qwen2.5-Coder-0.5B-Instruct (494M params) |
- | **Method** | LoRA SFT (r=8, α=16, target: q/k/v/o/gate/up/down_proj) |
- | **Dataset** | 30 expert-curated GIS code instruction pairs with CoT |
- | **Training Loss** | 1.52 → 0.88 (−42% over 3 epochs) |
- | **Token Accuracy** | 69.3% → 79.3% (+10pp) |
- | **Evaluation Score** | 85% (code + GIS lib + CoT + function quality) |
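As a rough sanity check on the adapter size implied by the Method row, the sketch below counts LoRA parameters for rank 8 across the seven listed projections. The layer dimensions (hidden size 896, MLP size 4864, 24 layers, 2 key/value heads of width 64) are assumptions taken from the publicly listed Qwen2.5-0.5B configuration, not from this README.

```python
# Rough count of trainable LoRA parameters for the configuration in the table.
# ASSUMPTION: the dimensions below come from the public Qwen2.5-0.5B config,
# not from this README; adjust them if the base model differs.
HIDDEN = 896
INTERMEDIATE = 4864
NUM_LAYERS = 24
KV_DIM = 2 * 64  # 2 key/value heads, each of width 64
RANK = 8         # r=8 from the Method row

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapter has roughly 4.4M trainable parameters, which is under 1% of the 494M-parameter base model.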
-
- ### Training Recipe
-
- Based on research from:
- - **CFD fine-tuning paper** (arxiv:2504.09602): LoRA SFT recipe that outperformed 72B models with a 7B model
- - **MapCoder-Lite** (arxiv:2509.17489): Qwen2.5-Coder as best backbone for domain code tasks
- - **GIS benchmark** (arxiv:2410.04617): Identified critical gaps in OSMnx, MovingPandas, Rasterio coverage
-
- ### Hyperparameters
-
- ```yaml
- learning_rate: 2e-4
- lr_scheduler: cosine
- warmup_ratio: 0.1
- epochs: 3
- batch_size: 1 (gradient_accumulation_steps: 4)
- max_length: 1024
- lora_r: 8
- lora_alpha: 16
- lora_dropout: 0.05
- ```

- ## 🚀 Usage

- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from peft import PeftModel

- # Load base model + adapter
- base_model = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen2.5-Coder-0.5B-Instruct",
-     trust_remote_code=True,
- )
- model = PeftModel.from_pretrained(base_model, "RhodWeo/GIS-Coder-7B")
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")

- # Create prompt
- messages = [
-     {"role": "system", "content": "You are GIS-Coder, an expert Python programmer specializing in GIS and geospatial analysis."},
-     {"role": "user", "content": "Write a function to calculate NDVI from a satellite image using rasterio."}
- ]

- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(text, return_tensors="pt")

- outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.9)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- ## 📈 Training Curves

- | Epoch | Loss | Token Accuracy |
- |-------|------|----------------|
- | 0.13 | 1.520 | 69.3% |
- | 1.00 | 1.238 | 72.2% |
- | 2.00 | 1.007 | 75.5% |
- | 3.00 | 0.880 | **79.3%** |
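The headline figures quoted elsewhere in the README (a 42% loss reduction and a 10-point accuracy gain) can be re-derived from the first and last rows of this table. The snippet below is only an arithmetic check on the reported numbers; the variable and function names are illustrative.

```python
# Training-curve rows copied from the table: (epoch, loss, token accuracy in %).
CURVE = [
    (0.13, 1.520, 69.3),
    (1.00, 1.238, 72.2),
    (2.00, 1.007, 75.5),
    (3.00, 0.880, 79.3),
]


def relative_change(start: float, end: float) -> float:
    """Signed change from start to end, as a fraction of start."""
    return (end - start) / start


first, last = CURVE[0], CURVE[-1]
loss_change_pct = 100 * relative_change(first[1], last[1])
accuracy_gain_pp = last[2] - first[2]  # percentage points, not percent

print(f"loss change: {loss_change_pct:.1f}%")
print(f"token accuracy gain: {accuracy_gain_pp:.1f} pp")
```

Note that the loss figure is a relative change, while the accuracy figure is an absolute difference in percentage points.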

- ## 🔬 Evaluation Results

- Tested on 5 GIS code generation prompts covering OSMnx, Rasterio, GeoPandas, CRS handling, and multi-library workflows:

- | Metric | Score |
- |--------|-------|
- | Code blocks generated | 100% |
- | Correct GIS library usage | 100% |
- | Chain-of-thought reasoning | 60% |
- | Function definitions | 80% |
- | **Overall Quality** | **85%** |
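The README does not say how the overall score is aggregated. The reported 85% is consistent with an unweighted mean of the four sub-scores, and the sketch below checks that reading. Treating the aggregate as a plain mean is an assumption, as are the dictionary keys.

```python
# Sub-scores copied from the evaluation table, in percent.
SCORES = {
    "code_blocks": 100,
    "gis_library_usage": 100,
    "chain_of_thought": 60,
    "function_definitions": 80,
}
NUM_PROMPTS = 5  # "Tested on 5 GIS code generation prompts"

# ASSUMPTION: "Overall Quality" is the unweighted mean of the four sub-scores.
overall = sum(SCORES.values()) / len(SCORES)

# With only 5 prompts, each percentage corresponds to a whole number of prompts.
prompts_passing = {name: round(pct * NUM_PROMPTS / 100) for name, pct in SCORES.items()}

print(f"overall quality: {overall:.0f}%")
print(prompts_passing)
```

With five prompts each sub-score moves in steps of 20 points, so the 60% and 80% rows correspond to 3 and 4 prompts respectively.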

- ## 📚 Dataset

- Training dataset: [RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions)

- 30 expert-curated instruction pairs covering:
- - Tier 1 (models score 0%): OSMnx, MovingPandas, Rasterio, GDAL, PyProj
- - Tier 2 (partial coverage): GeoPandas, Shapely, H3
- - Tier 3 (workflow): Folium, xarray, PyQGIS, Fiona, PySAL

- ## ⚡ Scaling Up

- This model was trained on CPU with a 0.5B parameter base. For production use, we recommend:

- 1. **Scale the base model**: Use `Qwen/Qwen2.5-Coder-7B-Instruct` with QLoRA on A100 GPU
- 2. **Scale the dataset**: Generate 20K+ examples using OSS-Instruct (Magicoder) pattern with GIS code as seeds
- 3. **Add execution-based filtering**: Test all generated code and keep only passing examples
- 4. **Include CoT annotations**: +20.9% pass@1 improvement per CFD paper ablation

  ## License

- Apache 2.0 (same as base model)

  library_name: peft
  ---

+ # GIS-Coder - A Code Model for Geographic Information Systems
+
+ A LoRA-adapted code model specialized for GIS and geospatial Python programming. Includes a **ready-to-run training package** for scaling up to 7B on your own GPU cluster.
+
+ ## 📦 This Repo Contains
+
+ | File | Description |
+ |------|-------------|
+ | `adapter_model.safetensors` | Trained LoRA adapter (0.5B base, proof of concept) |
+ | `train_7b.py` | **Production 7B QLoRA training script** with CLI args |
+ | `evaluate.py` | Evaluation suite (12 GIS benchmarks with scoring) |
+ | `requirements.txt` | All dependencies |
+ | `TRAINING_README.md` | **Detailed training guide** - hardware, hyperparameters, ablations |

+ ## 🚀 Train the 7B Model on Your GPUs

+ ```bash
+ # 1. Clone this repo
+ git clone https://huggingface.co/RhodWeo/GIS-Coder-7B
+ cd GIS-Coder-7B

+ # 2. Install deps
+ pip install -r requirements.txt

+ # 3. Login
+ huggingface-cli login

+ # 4. Train! (A100 80GB recommended)
+ python train_7b.py

+ # For A10G/RTX 4090 (24GB):
+ python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048
+
+ # For H100:
+ python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192
+
+ # 5. Evaluate
+ python evaluate.py --adapter_id ./gis-coder-7b-output/final --compare_base
  ```
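To compare the two hardware presets, the sketch below computes their effective batch size per optimizer step. It assumes `--batch_size` is the per-device micro-batch size and `--grad_accum` is the number of gradient accumulation steps, which is their usual meaning; `train_7b.py` itself is not shown in this diff. The default A100 settings are not stated, so they are left out.

```python
# Flag values copied from the two hardware-specific commands above.
PRESETS = {
    "A10G / RTX 4090 (24GB)": {"batch_size": 1, "grad_accum": 16, "max_length": 2048},
    "H100": {"batch_size": 4, "grad_accum": 4, "max_length": 8192},
}


def effective_batch(preset: dict, num_gpus: int = 1) -> int:
    """Sequences contributing to one optimizer step.

    ASSUMPTION: batch_size is per device and grad_accum counts micro-batches.
    """
    return preset["batch_size"] * preset["grad_accum"] * num_gpus


for name, preset in PRESETS.items():
    sequences = effective_batch(preset)
    # Upper bound only: real batches are shorter than max_length on average.
    max_tokens = sequences * preset["max_length"]
    print(f"{name}: {sequences} sequences/step, at most {max_tokens:,} tokens/step")
```

Under that reading both presets keep 16 sequences per optimizer step on a single GPU. The 24GB preset gets there with a smaller micro-batch and more accumulation, and it caps sequences at a quarter of the H100 length.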

+ See **[TRAINING_README.md](TRAINING_README.md)** for the full guide with hardware-specific settings, ablation ideas, and expected results.

+ ## 🗺️ GIS Libraries Covered (13)

+ | Priority | Libraries | Coverage |
+ |----------|-----------|----------|
+ | **Tier 1** (0% baseline) | OSMnx, MovingPandas, Rasterio, GDAL, PyProj | Heavy - these are where models fail |
+ | **Tier 2** | GeoPandas, Shapely, H3 | Core GIS operations |
+ | **Tier 3** | Folium, xarray, PyQGIS, Fiona, PySAL | Real-world workflows |

+ ## 📊 Proof-of-Concept Results (0.5B)

+ Trained on CPU with the smaller base model to validate the approach:

+ | Metric | Start → End |
+ |--------|------------|
+ | **Loss** | 1.52 → 0.88 (−42%) |
+ | **Token Accuracy** | 69.3% → **79.3%** (+10pp) |
+ | **Eval Quality** | **85%** (code + library + CoT + function) |

+ ## 🔬 Training Recipe

+ Based on published research:

+ | Principle | Source | Applied |
+ |-----------|--------|---------|
+ | QLoRA SFT beats 72B models | [CFD paper](https://arxiv.org/abs/2504.09602) | r=32, all-linear, lr=2e-4 |
+ | Qwen2.5-Coder best backbone | [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Base model selection |
+ | Models score 0% on GIS | [GIS Benchmark](https://arxiv.org/abs/2410.04617) | Heavy OSMnx/MovingPandas coverage |
+ | CoT boosts +20.9% pass@1 | CFD paper ablation | All examples include CoT |
+ | Target all linear layers | [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | `target_modules="all-linear"` |
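For orientation, the "Applied" column can be collected into keyword arguments of the kind a QLoRA script typically passes to `peft` and `transformers`. This is a minimal sketch, not the contents of `train_7b.py`: the key names follow `peft.LoraConfig` and `transformers.BitsAndBytesConfig`, the 4-bit NF4 settings are an assumption implied by the term QLoRA, and values the table does not give (such as `lora_alpha` for the 7B run) are left out rather than guessed.

```python
# Hyperparameters stated in the "Applied" column of the recipe table.
RECIPE = {"lora_r": 32, "target_modules": "all-linear", "learning_rate": 2e-4}

# ASSUMPTION: keyword names follow peft.LoraConfig; the real train_7b.py may
# wire them differently. lora_alpha and lora_dropout are not given for the 7B
# recipe, so they are omitted here.
lora_kwargs = {
    "r": RECIPE["lora_r"],
    "target_modules": RECIPE["target_modules"],
    "task_type": "CAUSAL_LM",
}

# ASSUMPTION: QLoRA usually means a 4-bit NF4-quantized base model; keyword
# names follow transformers.BitsAndBytesConfig.
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}

# The learning rate belongs to the trainer/optimizer, not the adapter config.
trainer_kwargs = {"learning_rate": RECIPE["learning_rate"]}

for label, kwargs in [("LoRA", lora_kwargs), ("quantization", quant_kwargs), ("trainer", trainer_kwargs)]:
    print(f"{label}: {kwargs}")
```

With `peft` and `transformers` installed, these dictionaries could be unpacked as `LoraConfig(**lora_kwargs)` and `BitsAndBytesConfig(**quant_kwargs)`.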

+ ## 📚 Dataset

+ **[RhodWeo/gis-code-instructions](https://huggingface.co/datasets/RhodWeo/gis-code-instructions)** - 70 expert-curated examples with Chain-of-Thought annotations.

  ## License

+ Apache 2.0