---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-8B
thumbnail: https://huggingface.co/pengfali/GeohazardGPT/logo/GeohazardGPT_logo.png
tags:
- geohazard
- geology
- geoscience
- geotechnical-engineering
- landslide
- qwen3
- lora
- rag
datasets:
- vicgalle/alpaca-gpt4
---

<p align="center">
  <img src="./logo/GeohazardGPT_logo.png" alt="GeohazardGPT" width="420"/>
</p>

# GeohazardGPT

**GeohazardGPT** is the first large language model purpose-built for geohazard analysis and engineering practice. It adapts a Qwen3-8B backbone with LoRA-based parameter-efficient fine-tuning on roughly 100K instruction–response pairs, about half of them generated from a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen3-8B |
| **Fine-tuning method** | LoRA (rank 128, α 256) |
| **Trainable parameters** | 349M |
| **Training data** | 101,475 instruction–response pairs (~100K) |
| **Domain corpus** | 883M tokens / 1.82M documents |
| **Hazard categories** | 12 major / 49 subcategories |
| **Context length** | 32K tokens (extendable to 128K via YaRN) |
| **Language** | English |
| **License** | Apache 2.0 |
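
The trainable-parameter figure in the table can be sanity-checked with a short calculation. The sketch below assumes LoRA adapters on all seven linear projections of every decoder layer (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) and uses the published Qwen3-8B dimensions (36 layers, hidden size 4096, 32 query heads and 8 key/value heads of dimension 128, MLP width 12288). The card does not state which modules were adapted, so the target-module list is an assumption; under it, the count lands at roughly 349M.

```python
# Rough LoRA parameter count for Qwen3-8B at rank 128.
# Assumption (not stated in this card): adapters are attached to all seven
# linear projections of every decoder layer.

RANK = 128
NUM_LAYERS = 36
HIDDEN = 4096
Q_OUT = 32 * 128       # 32 query heads x head_dim 128
KV_OUT = 8 * 128       # 8 key/value heads x head_dim 128
INTERMEDIATE = 12288   # MLP width

# (in_features, out_features) of each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M)")
# prints "349,175,808 trainable parameters (~349M)"
```

With α = 256 and rank 128, the standard LoRA scaling factor α / r applied to the adapter output is 2.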

---

## Intended Use

GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:

- **Factual QA** — precise recall of geohazard definitions, geomaterial properties, and code requirements
- **Open-ended explanation** — interpretation of hazard mechanisms, failure processes, and impact analysis
- **Engineering recommendation** — selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
- **Report summarization** — structured extraction of key findings from investigation reports, case studies, and technical specifications

It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses. **Model outputs should complement, not replace, professional field investigation and expert judgment.**

---

## Training Data

The instruction-tuning dataset was constructed using **GeoInstruct**, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:

- **49,776** domain-specific instruction–response pairs generated from a filtered geohazard corpus
- **51,699** general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
- **101,475** total training pairs (~100K)

The geohazard corpus draws from four sources:

| Source | Documents | Tokens |
|---|---|---|
| Open-access full-text papers | 1,613,089 | 788.9M |
| Licensed scientific books | 118,217 | 54.5M |
| Closed-access abstracts | 87,668 | 28.9M |
| Filtered C4 web corpus | 3,443 | 10.8M |
| **Total** | **1,822,417** | **883.1M** |

---

## RAG Integration

For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:

1. **Offline indexing** — technical specifications are chunked into sections/clauses and encoded with `Qwen3-Embedding-4B` into a `ChromaDB` vector database
2. **Dense retrieval** — top-30 candidate clauses are retrieved via approximate nearest-neighbor search
3. **Cross-encoder re-ranking** — candidates are re-ranked using `Qwen3-Reranker-4B`; top-15 clauses are retained as final evidence
4. **Grounded generation** — retrieved clauses are injected into the prompt alongside the query

The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.
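
The control flow of the four steps above can be sketched in plain Python. This is an illustration only: `embed` and `rerank_score` are toy stand-ins for the embedding model and `Qwen3-Reranker-4B`, an in-memory list with exact search replaces `ChromaDB` and approximate nearest-neighbor search, and the clauses and prompt layout in `build_prompt` are hypothetical. Only the two cut-offs (top-30, then top-15) come from this card.

```python
import math
import re
from collections import Counter

TOP_K_RETRIEVE = 30  # dense-retrieval candidates (from the card)
TOP_K_RERANK = 15    # clauses kept after re-ranking (from the card)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for the embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, clause):
    """Toy stand-in for the cross-encoder: share of query terms in the clause."""
    q_terms = set(tokenize(query))
    return len(q_terms & set(tokenize(clause))) / len(q_terms) if q_terms else 0.0


def retrieve_and_rerank(query, clauses, k_retrieve=TOP_K_RETRIEVE, k_rerank=TOP_K_RERANK):
    # Step 1 (offline in a real deployment): embed and store every clause.
    index = [(clause, embed(clause)) for clause in clauses]
    # Step 2: dense retrieval of the k_retrieve most similar clauses.
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    candidates = [clause for clause, _ in ranked[:k_retrieve]]
    # Step 3: re-rank the candidates with the pairwise scorer and keep k_rerank.
    candidates.sort(key=lambda clause: rerank_score(query, clause), reverse=True)
    return candidates[:k_rerank]


def build_prompt(query, evidence):
    # Step 4: inject the retained clauses into the prompt (hypothetical layout).
    numbered = "\n".join(f"[{i}] {clause}" for i, clause in enumerate(evidence, start=1))
    return f"Reference clauses:\n{numbered}\n\nQuestion: {query}"


# Hypothetical clauses, written for this example only.
clauses = [
    "Clause A: Slopes with tension cracks shall be monitored for displacement.",
    "Clause B: Pile foundations shall be designed for seismic loads.",
    "Clause C: Drainage shall be provided where seepage is observed on a slope.",
]
query = "What monitoring is required for a slope with tension cracks?"
evidence = retrieve_and_rerank(query, clauses, k_retrieve=2, k_rerank=1)
prompt = build_prompt(query, evidence)
print(prompt)
```

The string returned by `build_prompt` would take the place of the user message in the chat template; swapping the two toy scorers for real model calls leaves the surrounding logic unchanged.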

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model. device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"

# The system message sets the expert role and names the task type.
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt}
]

# Render the conversation as a single string using the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
```

---

## Hardware Requirements

| Configuration | GPU Memory | Latency |
|---|---|---|
| GeohazardGPT (standalone) | ~10 GB | ~3.9 s/query |
| GeohazardGPT + RAG | ~26 GB | ~5.8 s/query |

Measured on an NVIDIA A100 (80 GB) under 4-bit deployment. The RAG figure includes the additional memory used by `Qwen3-Embedding-4B` and `Qwen3-Reranker-4B`.
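
The card does not say which 4-bit scheme was used for these measurements. One common option is bitsandbytes NF4 quantization through `transformers`; the sketch below assumes that scheme, and the helper name `load_geohazardgpt_4bit` is made up for this example. It needs the `bitsandbytes` and `accelerate` packages and a CUDA GPU, and memory and latency under this setup may differ from the table.

```python
def load_geohazardgpt_4bit(model_id="pengfali/GeohazardGPT"):
    """Load the model with 4-bit NF4 weights (assumed scheme, not confirmed by the card)."""
    # Imported here so the heavy dependencies are only needed when the function is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    return tokenizer, model
```

The returned `tokenizer` and `model` can be used in place of the ones created in the Usage section.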

---


## Citation

If you use GeohazardGPT in your research, please cite:

```bibtex
@article{ge2025geohazardgpt,
  title={GeohazardGPT: Towards Large Language Models for Geohazards},
  author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
  journal={Under review},
  year={2025}
}
```

---

## License

This model is released under the Apache 2.0 License.

---