---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-8B
thumbnail: https://huggingface.co/pengfali/GeohazardGPT/logo/GeohazardGPT_logo.png
tags:
- geohazard
- geology
- geoscience
- geotechnical-engineering
- landslide
- qwen3
- lora
- rag
datasets:
- vicgalle/alpaca-gpt4
---


# GeohazardGPT

**GeohazardGPT** is the first large language model purpose-built for geohazard analysis and engineering practice. Built on a Qwen3-8B backbone with LoRA-based parameter-efficient fine-tuning, it is trained on a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen3-8B |
| **Fine-tuning method** | LoRA (rank 128, α 256) |
| **Trainable parameters** | 349M |
| **Training data** | ~100K instruction–response pairs |
| **Domain corpus** | 883M tokens / 1.82M documents |
| **Hazard categories** | 12 major / 49 subcategories |
| **Context length** | 32K tokens (extendable to 128K via YaRN) |
| **Language** | English |
| **License** | Apache 2.0 |

---

## Intended Use

GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:

- **Factual QA**: precise recall of geohazard definitions, geomaterial properties, and code requirements
- **Open-ended explanation**: interpretation of hazard mechanisms, failure processes, and impact analysis
- **Engineering recommendation**: selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
- **Report summarization**: structured extraction of key findings from investigation reports, case studies, and technical specifications

It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses.


**Model outputs should complement, not replace, professional field investigation and expert judgment.**

---

## Training Data

The instruction-tuning dataset was constructed using **GeoInstruct**, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:

- **49,776** domain-specific instruction–response pairs generated from a filtered geohazard corpus
- **51,699** general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
- **~100K** total training pairs

The geohazard corpus draws from four sources:

| Source | Documents | Tokens |
|---|---|---|
| Open-access full-text papers | 1,613,089 | 788.9M |
| Licensed scientific books | 118,217 | 54.5M |
| Closed-access abstracts | 87,668 | 28.9M |
| Filtered C4 web corpus | 3,443 | 10.8M |
| **Total** | **1,822,417** | **883.1M** |

---

## RAG Integration

For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:

1. **Offline indexing**: technical specifications are chunked into sections/clauses and encoded with `Qwen3-Embedding` into a `ChromaDB` vector database
2. **Dense retrieval**: top-30 candidate clauses are retrieved via approximate nearest-neighbor search
3. **Cross-encoder re-ranking**: candidates are re-ranked using `Qwen3-Reranker-4B`; top-15 clauses are retained as final evidence
4. **Grounded generation**: retrieved clauses are injected into the prompt alongside the query

The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.

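
The control flow of the four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the released pipeline: `embed`, `cosine`, and `rerank_score` are toy stand-ins for `Qwen3-Embedding` with `ChromaDB` and for `Qwen3-Reranker-4B`, the clause texts are made-up placeholders rather than quotations from any standard, and the prompt layout in `build_prompt` is hypothetical. Only the retrieve-then-rerank structure and the default cut-offs (30 candidates, 15 retained) come from the description above.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (stand-in for a dense embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, clause):
    """Toy relevance score: share of query terms found in the clause
    (stand-in for a cross-encoder reranker)."""
    q_terms = set(query.lower().split())
    c_terms = set(clause.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve_and_rerank(query, clauses, index, k_retrieve=30, k_final=15):
    """Step 2: keep the k_retrieve nearest clauses; step 3: rerank, keep k_final."""
    q_vec = embed(query)
    candidates = sorted(
        range(len(clauses)),
        key=lambda i: cosine(q_vec, index[i]),
        reverse=True,
    )[:k_retrieve]
    reranked = sorted(
        candidates,
        key=lambda i: rerank_score(query, clauses[i]),
        reverse=True,
    )
    return [clauses[i] for i in reranked[:k_final]]


def build_prompt(query, evidence):
    """Step 4: place the retained clauses in the prompt next to the query.
    The layout here is hypothetical."""
    context = "\n".join(f"[{n}] {c}" for n, c in enumerate(evidence, start=1))
    return f"Reference clauses:\n{context}\n\nQuestion: {query}"


# Placeholder clauses for illustration only (not taken from any real standard).
clauses = [
    "slope stability analysis shall report the factor of safety",
    "pile foundations shall be checked for bearing capacity",
    "seismic design shall consider site classification",
    "drainage measures shall be provided where seepage is observed on a slope",
]

# Step 1: offline indexing (here just a list of toy vectors).
index = [embed(c) for c in clauses]

query = "factor of safety for slope stability"
evidence = retrieve_and_rerank(query, clauses, index, k_retrieve=3, k_final=2)
prompt = build_prompt(query, evidence)
print(prompt)
```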

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
```

---

## Hardware Requirements

| Configuration | GPU Memory | Latency |
|---|---|---|
| GeohazardGPT (standalone) | ~10 GB | ~3.9 s/query |
| GeohazardGPT + RAG | ~26 GB | ~5.8 s/query |

Tested on an NVIDIA A100 (80 GB) under 4-bit deployment. The RAG configuration includes additional memory for `Qwen3-Embedding-4B` and `Qwen3-Reranker-4B`.

---

## Citation

If you use GeohazardGPT in your research, please cite:

```bibtex
@article{ge2025geohazardgpt,
  title={GeohazardGPT: Towards Large Language Models for Geohazards},
  author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
  journal={Under review},
  year={2025}
}
```

---

## License

This model is released under the Apache 2.0 License.

---