---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-8B
thumbnail: https://huggingface.co/pengfali/GeohazardGPT/logo/GeohazardGPT_logo.png
tags:
- geohazard
- geology
- geoscience
- geotechnical-engineering
- landslide
- qwen3
- lora
- rag
datasets:
- vicgalle/alpaca-gpt4
---

<p align="center">
  <img src="./logo/GeohazardGPT_logo.png" alt="GeohazardGPT" width="420"/>
</p>

# GeohazardGPT

**GeohazardGPT** is the first large language model purpose-built for geohazard analysis and engineering practice. Built on a Qwen3-8B backbone with LoRA-based parameter-efficient fine-tuning, it is trained on a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen3-8B |
| **Fine-tuning method** | LoRA (rank 128, α 256) |
| **Trainable parameters** | 349M |
| **Training data** | ~100K instruction–response pairs |
| **Domain corpus** | 883M tokens / 1.82M documents |
| **Hazard categories** | 12 major / 49 subcategories |
| **Context length** | 32K tokens (extendable to 128K via YaRN) |
| **Language** | English |
| **License** | Apache 2.0 |
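
As a rough sanity check on the table above, the trainable-parameter figure can be reproduced from the LoRA rank. The sketch below assumes the adapters are attached to all seven linear projections in each decoder layer and uses the published Qwen3-8B dimensions (hidden size 4096, intermediate size 12288, 36 layers, 8 key/value heads of dimension 128). Neither the target modules nor these dimensions are stated in this card, so treat both as assumptions.

```python
# Assumed Qwen3-8B dimensions (taken from the base model, not from this card).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128

# LoRA hyperparameters from the Model Details table.
RANK = 128
ALPHA = 256

# Assumed LoRA targets: (in_features, out_features) of each linear projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M), scaling = {scaling}")
# prints: 349,175,808 trainable parameters (~349M), scaling = 2.0
```

Under these assumptions the count comes to about 349M, which agrees with the figure in the table.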

---

## Intended Use

GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:

- **Factual QA** — precise recall of geohazard definitions, geomaterial properties, and code requirements
- **Open-ended explanation** — interpretation of hazard mechanisms, failure processes, and impact analysis
- **Engineering recommendation** — selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
- **Report summarization** — structured extraction of key findings from investigation reports, case studies, and technical specifications

It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses. **Model outputs should complement, not replace, professional field investigation and expert judgment.**

---

## Training Data

The instruction-tuning dataset was constructed using **GeoInstruct**, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:

- **49,776** domain-specific instruction–response pairs generated from a filtered geohazard corpus
- **51,699** general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
- **101,475** (~100K) total training pairs

The geohazard corpus draws from four sources:

| Source | Documents | Tokens |
|---|---|---|
| Open-access full-text papers | 1,613,089 | 788.9M |
| Licensed scientific books | 118,217 | 54.5M |
| Closed-access abstracts | 87,668 | 28.9M |
| Filtered C4 web corpus | 3,443 | 10.8M |
| **Total** | **1,822,417** | **883.1M** |

---

## RAG Integration

For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:

1. **Offline indexing** — technical specifications are chunked into sections/clauses and encoded with `Qwen3-Embedding-4B` into a `ChromaDB` vector database
2. **Dense retrieval** — top-30 candidate clauses are retrieved via approximate nearest-neighbor search
3. **Cross-encoder re-ranking** — candidates are re-ranked using `Qwen3-Reranker-4B`; top-15 clauses are retained as final evidence
4. **Grounded generation** — retrieved clauses are injected into the prompt alongside the query

The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.
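
The four stages above can be sketched in a few lines of dependency-free Python. This is an illustration of the control flow only, not the project's implementation: a bag-of-words cosine similarity stands in for `Qwen3-Embedding` plus `ChromaDB`, a term-overlap score stands in for `Qwen3-Reranker-4B`, and the three clauses are invented placeholders rather than text from any real standard. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=30):
    """Stage 2: keep the k clauses most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [clause for clause, _ in ranked[:k]]


def toy_pair_score(query, clause):
    """Stand-in for a cross-encoder: fraction of query terms found in the clause."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(clause.lower().split())) / max(len(q_terms), 1)


def rerank(query, candidates, score_fn, k=15):
    """Stage 3: re-score each (query, clause) pair and keep the top k."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:k]


def build_prompt(query, evidence):
    """Stage 4: inject the retained clauses into the prompt with the query."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence, 1))
    return f"Reference clauses:\n{numbered}\n\nQuestion: {query}"


# Stage 1: offline indexing over invented placeholder clauses.
clauses = [
    "Slope drainage measures should intercept surface water and relieve seepage pressure.",
    "Pile foundations should be checked for bearing capacity and settlement.",
    "Seismic design should consider site class and liquefaction potential.",
]
index = [(c, embed(c)) for c in clauses]

query = "drainage measures for slope seepage"
candidates = retrieve(query, index, k=30)
evidence = rerank(query, candidates, toy_pair_score, k=15)
prompt = build_prompt(query, evidence)
print(prompt)
```

Retrieving a wider candidate set (30) before narrowing to 15 lets the cheaper first stage favor recall while the more expensive pairwise scorer decides the final ordering.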

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"

# The system message names the task type (here, an engineering recommendation).
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template and append the assistant turn marker.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so that only the newly generated answer is decoded.
prompt_length = inputs.input_ids.shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)
```
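
The Model Details table lists a 32K context that is extendable to 128K via YaRN. The sketch below only derives the scaling factor and builds a `rope_scaling` entry in the form that the upstream Qwen3 documentation describes adding to `config.json`. The helper name is hypothetical, and this card does not state whether the fine-tuned checkpoint has been evaluated at the extended length, so treat the fragment as a starting point.

```python
NATIVE_CONTEXT = 32_768    # 32K tokens
TARGET_CONTEXT = 131_072   # 128K tokens


def yarn_rope_scaling(native: int, target: int) -> dict:
    """Build a YaRN rope_scaling entry; the factor is the context-length ratio."""
    return {
        "rope_type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


rope_scaling = yarn_rope_scaling(NATIVE_CONTEXT, TARGET_CONTEXT)
print(rope_scaling)
# prints: {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 32768}
```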

---

## Hardware Requirements

| Configuration | GPU Memory | Latency |
|---|---|---|
| GeohazardGPT (standalone) | ~10 GB | ~3.9 s/query |
| GeohazardGPT + RAG | ~26 GB | ~5.8 s/query |

Measured on an NVIDIA A100 (80 GB) with the model deployed in 4-bit precision. The RAG configuration includes the additional memory used by `Qwen3-Embedding-4B` and `Qwen3-Reranker-4B`.

---


## Citation

If you use GeohazardGPT in your research, please cite:

```bibtex
@article{ge2025geohazardgpt,
  title={GeohazardGPT: Towards Large Language Models for Geohazards},
  author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
  journal={Under review},
  year={2025}
}
```

---

## License

This model is released under the Apache 2.0 License.

---