---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-8B
thumbnail: https://huggingface.co/pengfali/GeohazardGPT/logo/GeohazardGPT_logo.png
tags:
- geohazard
- geology
- geoscience
- geotechnical-engineering
- landslide
- qwen3
- lora
- rag
datasets:
- vicgalle/alpaca-gpt4
---
# GeohazardGPT
**GeohazardGPT** is the first large language model purpose-built for geohazard analysis and engineering practice. Built on a Qwen3-8B backbone with LoRA-based parameter-efficient fine-tuning, it is trained on a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.
---
## Model Details
| Property | Value |
|---|---|
| **Base model** | Qwen3-8B |
| **Fine-tuning method** | LoRA (rank 128, α 256) |
| **Trainable parameters** | 349M |
| **Training data** | ~100K instruction–response pairs |
| **Domain corpus** | 883M tokens / 1.82M documents |
| **Hazard categories** | 12 major / 49 subcategories |
| **Context length** | 32K tokens (extendable to 128K via YaRN) |
| **Language** | English |
| **License** | Apache 2.0 |
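
The trainable-parameter figure can be sanity-checked with a short calculation. The sketch below is a back-of-envelope estimate, not the authors' training configuration. It assumes that LoRA adapters are attached to all seven linear projections in every decoder layer (the card does not list the target modules) and uses the publicly documented Qwen3-8B dimensions. Under those assumptions, rank 128 yields roughly 349M parameters, matching the table. With α = 256 and rank 128, the LoRA scaling factor α/r is 2.

```python
# Back-of-envelope LoRA parameter count.
# ASSUMPTIONS (not stated in the model card): adapters on all seven linear
# projections of every decoder layer, and the public Qwen3-8B dimensions below.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 12288   # intermediate_size of the MLP
N_LAYERS = 36          # num_hidden_layers
N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
RANK = 128             # LoRA rank reported in the Model Details table

Q_OUT = N_HEADS * HEAD_DIM       # 4096
KV_OUT = N_KV_HEADS * HEAD_DIM   # 1024 (grouped-query attention)

# (in_features, out_features) of each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, projections, n_layers):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted weight."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * n_layers


total = lora_params(RANK, PROJECTIONS, N_LAYERS)
print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M)")
```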
---
## Intended Use
GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:
- **Factual QA** — precise recall of geohazard definitions, geomaterial properties, and code requirements
- **Open-ended explanation** — interpretation of hazard mechanisms, failure processes, and impact analysis
- **Engineering recommendation** — selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
- **Report summarization** — structured extraction of key findings from investigation reports, case studies, and technical specifications
It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses. **Model outputs should complement, not replace, professional field investigation and expert judgment.**
---
## Training Data
The instruction-tuning dataset was constructed using **GeoInstruct**, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:
- **49,776** domain-specific instruction–response pairs generated from a filtered geohazard corpus
- **51,699** general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
- **101,475** total training pairs (~100K)
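
As a quick check of the mix described above, the two counts can be summed and the domain share computed. This only restates the numbers from the list; the variable names are illustrative.

```python
# Counts taken from the list above; variable names are illustrative.
domain_pairs = 49_776    # GeoInstruct domain-specific pairs
general_pairs = 51_699   # Alpaca-GPT4 general samples

total_pairs = domain_pairs + general_pairs
domain_share = domain_pairs / total_pairs

print(f"total pairs:  {total_pairs:,}")
print(f"domain share: {domain_share:.1%}")
print(f"general share: {1 - domain_share:.1%}")
```

The split is close to 1:1 between domain-specific and general instruction data.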
The geohazard corpus draws from four sources:
| Source | Documents | Tokens |
|---|---|---|
| Open-access full-text papers | 1,613,089 | 788.9M |
| Licensed scientific books | 118,217 | 54.5M |
| Closed-access abstracts | 87,668 | 28.9M |
| Filtered C4 web corpus | 3,443 | 10.8M |
| **Total** | **1,822,417** | **883.1M** |
---
## RAG Integration
For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:
1. **Offline indexing** — technical specifications are chunked into sections/clauses and encoded with `Qwen3-Embedding-4B` into a `ChromaDB` vector database
2. **Dense retrieval** — top-30 candidate clauses are retrieved via approximate nearest-neighbor search
3. **Cross-encoder re-ranking** — candidates are re-ranked using `Qwen3-Reranker-4B`; top-15 clauses are retained as final evidence
4. **Grounded generation** — retrieved clauses are injected into the prompt alongside the query
The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.
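
The four steps above can be sketched in plain Python. This is a minimal illustration of the retrieve-and-rerank control flow only, not the released pipeline. The bag-of-words `embed` function stands in for the dense embedding model, an in-memory list stands in for `ChromaDB`, and the term-overlap `rerank_score` stands in for the cross-encoder reranker. The clause texts and all function names are made-up placeholders. The demo uses small cut-offs because the toy corpus has only four clauses; the defaults mirror the top-30 and top-15 values described above.

```python
import math
import re
from collections import Counter

TOP_K_RETRIEVE = 30  # dense-retrieval candidates (value from the card)
TOP_K_RERANK = 15    # clauses kept after re-ranking (value from the card)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # STAND-IN for a dense encoder: a bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, clause):
    # STAND-IN for a cross-encoder: fraction of query terms found in the clause.
    q_terms = set(tokenize(query))
    return len(q_terms & set(tokenize(clause))) / len(q_terms) if q_terms else 0.0


def build_index(clauses):
    # Step 1: offline indexing (an in-memory list instead of a vector database).
    return [(clause, embed(clause)) for clause in clauses]


def retrieve_and_rerank(query, index, k_retrieve=TOP_K_RETRIEVE, k_rerank=TOP_K_RERANK):
    q_vec = embed(query)
    # Step 2: dense retrieval (exact search here; a vector DB would use ANN).
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    candidates = [clause for clause, _ in ranked[:k_retrieve]]
    # Step 3: re-rank the candidates against the query and keep the best ones.
    candidates.sort(key=lambda clause: rerank_score(query, clause), reverse=True)
    return candidates[:k_rerank]


def build_prompt(query, evidence):
    # Step 4: inject the retrieved clauses into the prompt next to the query.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence, start=1))
    return f"Reference clauses:\n{context}\n\nQuestion: {query}"


# Made-up placeholder clauses, not text from any real standard.
clauses = [
    "Illustrative clause 1: provide surface drainage above the landslide crest.",
    "Illustrative clause 2: seal tension cracks to limit water infiltration.",
    "Illustrative clause 3: pile foundations in seismic zones require ductile detailing.",
    "Illustrative clause 4: monitor seepage and groundwater levels in the landslide body.",
]
query = "How should seepage and tension cracks at a landslide crest be treated?"

index = build_index(clauses)
evidence = retrieve_and_rerank(query, index, k_retrieve=3, k_rerank=2)
print(build_prompt(query, evidence))
```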
---
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model; dtype and device placement are picked automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs.input_ids.shape[1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)
```
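
The Model Details table notes that the 32K context can be extended to 128K via YaRN. The sketch below builds the `rope_scaling` entry using the keys that upstream Qwen3 checkpoints document for this purpose; it is an assumption that the fine-tuned checkpoint accepts the same setting, and long-context quality is not reported in this card. The dictionary would typically be placed under `rope_scaling` in the checkpoint's `config.json`.

```python
# Sketch of a YaRN rope-scaling entry, following the convention documented for
# upstream Qwen3 checkpoints. ASSUMPTION: this fine-tune accepts the same setting.
NATIVE_CONTEXT = 32_768     # 32K native window
TARGET_CONTEXT = 131_072    # 128K extended window

rope_scaling = {
    "rope_type": "yarn",
    "factor": TARGET_CONTEXT / NATIVE_CONTEXT,  # scaling factor of 4.0
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
print(rope_scaling)
```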
---
## Hardware Requirements
| Configuration | GPU Memory | Latency |
|---|---|---|
| GeohazardGPT (standalone) | ~10 GB | ~3.9 s/query |
| GeohazardGPT + RAG | ~26 GB | ~5.8 s/query |
Tested on an NVIDIA A100 (80 GB) with the model deployed in 4-bit precision. The RAG configuration requires additional memory for `Qwen3-Embedding-4B` and `Qwen3-Reranker-4B`.
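
A rough weight-only estimate helps put these figures in context. The sketch below uses nominal parameter counts (8B and 4B) and assumes, without confirmation from the card, that the embedding and reranker models run in 16-bit precision. Measured usage also includes the KV cache, activations, and runtime overhead, so the estimate is a lower bound rather than a reproduction of the table.

```python
def weight_gb(n_params, bits):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


# Nominal parameter counts; precise counts differ slightly.
llm_4bit = weight_gb(8e9, 4)
# ASSUMPTION: embedder and reranker (4B each) are held in 16-bit precision.
aux_16bit = 2 * weight_gb(4e9, 16)
reported_rag_overhead = 26 - 10  # GB, difference between the two table rows

print(f"LLM weights at 4-bit:            {llm_4bit:.1f} GB")
print(f"Embedder + reranker at 16-bit:   {aux_16bit:.1f} GB")
print(f"Reported RAG overhead (table):   {reported_rag_overhead} GB")
```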
---
## Citation
If you use GeohazardGPT in your research, please cite:
```bibtex
@article{ge2025geohazardgpt,
title={GeohazardGPT: Towards Large Language Models for Geohazards},
author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
journal={Under review},
year={2025}
}
```
---
## License
This model is released under the Apache 2.0 License.
---