**CoastalGPT: A Data-Model-Knowledge Integrated Large Language Model for Coastal Science**

CoastalGPT is a large language model for the coastal zone domain. It explores unified ways to organize and reason over data, models, and knowledge in coastal science research. Centered on the domain's data resources, modeling methods, and scientific knowledge, the project builds a unified semantic representation and a Chain-of-Thought (CoT) reasoning framework that supports relational organization, process reasoning, and coordinated interaction among these three components. The core goal is for the model to understand how data resources relate to modeling methods, to connect scientific questions with research workflows, and to support model selection, method interpretation, process reasoning, and automated research organization in complex scientific tasks. To this end, the project constructs CoT instruction datasets for fine-tuning from coastal domain knowledge, scientific literature, data-resource metadata, and model-resource metadata, and uses them to adapt general-purpose large language models to the coastal domain, improving their scientific reasoning and automated research support in coastal science.

This project was conducted under the guidance of Professor Zhaoyuan Yu, Professor Linwang Yuan, and Professor Wen Luo. Jian Wang, Pei Du, Zhenxia Liu, Zengjie Wang, Zhuo Zhao, Boda Chen, Lei Nie, Jianbo Zhou, Yuhao Xiong, and others contributed to data construction, model fine-tuning, and system implementation.

Data Sources

The training data of CoastalGPT are organized around three types of objects in the coastal zone domain: data, models, and knowledge. The data sources comprise five parts:

Approximately 100,000 coastal-domain basic-knowledge CoT instruction samples constructed from open knowledge sources such as Wikipedia;
More than 400,000 metadata records of coastal scientific literature matching the topic term "coast*", collected up to 2025;
Collected and organized metadata of coastal data resources;
Collected and organized metadata of coastal model resources;
Scientific-literature-knowledge CoT data for research-oriented tasks, constructed from open-access coastal papers published from 2020 to 2025.

Training Data Construction

The training data of CoastalGPT take the form of CoT reasoning data. The project uses Qwen3.5-27B as the data generation model to produce instruction data with explicit reasoning processes from coastal basic knowledge, scientific literature knowledge, data-resource metadata, and model-resource metadata. All current fine-tuning data are constructed this way. Training on reasoning processes is intended to teach CoastalGPT not only to generate final answers, but also the analytical pathways, knowledge organization patterns, and scientific explanation logic behind coastal zone problems.
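
As a rough illustration of what one CoT instruction record could look like, the sketch below serializes a sample with separate reasoning and answer fields to a single JSON line. The field names (`instruction`, `reasoning`, `answer`, `source_type`), the category labels, and the example text are all assumptions made for illustration; the project's actual schema is not described here.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical category labels, mirroring the four resource types named above.
SOURCE_TYPES = {"basic_knowledge", "literature", "data_metadata", "model_metadata"}


@dataclass
class CoTSample:
    instruction: str   # the question or task posed to the model
    reasoning: str     # the intermediate reasoning process
    answer: str        # the final answer
    source_type: str   # which resource type the sample was built from

    def to_jsonl(self) -> str:
        """Serialize the sample as one JSON line, rejecting incomplete records."""
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source_type: {self.source_type}")
        if not self.reasoning.strip():
            raise ValueError("a CoT sample needs a non-empty reasoning field")
        return json.dumps(asdict(self), ensure_ascii=False)


# Illustrative content only, not taken from the CoastalGPT dataset.
sample = CoTSample(
    instruction="Why can salt marshes reduce wave energy at the shoreline?",
    reasoning=(
        "Marsh vegetation adds drag to the flow. "
        "Drag dissipates wave energy as waves travel across the marsh."
    ),
    answer="Vegetation-induced drag dissipates wave energy before it reaches the shore.",
    source_type="basic_knowledge",
)
line = sample.to_jsonl()
```

Keeping the reasoning in its own field makes it easy to check that every record actually carries a reasoning process before it enters the fine-tuning set.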

Model Training

CoastalGPT currently uses Qwen3.5-9B as the base model and adapts it to the coastal domain with LoRA, a parameter-efficient fine-tuning method.

Item	Setting
Data generation model	Qwen3.5-27B
Fine-tuning base model	Qwen3.5-9B
Training data format	CoT instruction data
Fine-tuning method	LoRA
Training framework	Unsloth
Inference deployment	vLLM
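
To make "parameter-efficient" concrete: LoRA freezes a weight matrix of shape d_out x d_in and trains only a low-rank pair, B (d_out x r) and A (r x d_in), so each adapted matrix contributes r * (d_in + d_out) trainable parameters. The short sketch below works through that arithmetic. The dimensions and rank are illustrative assumptions, not the configuration used for CoastalGPT.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix (d_out x d_in)."""
    return d_in * d_out


# Illustrative values only (assumed, not taken from the project).
d_in, d_out, rank = 4096, 4096, 16

lora = lora_trainable_params(d_in, d_out, rank)  # 16 * 8192 = 131072
full = full_params(d_in, d_out)                  # 4096 * 4096 = 16777216
fraction = lora / full                           # 1/128 = 0.0078125
```

Under these assumed sizes, the adapter trains under 1% of the parameters of the matrix it modifies.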

Core Objectives

CoastalGPT aims to achieve the following objectives:

Objective	Description
Domain knowledge enhancement	Improve the model's understanding of specialized coastal concepts, typical objects, and process mechanisms
Literature knowledge injection	Transform research topics, modeling methods, indicator systems, and experimental knowledge from papers into training data
Data-resource understanding	Improve the model's ability to extract, describe, and interpret metadata of coastal data resources
Model-resource understanding	Improve the model's ability to explain functions, identify inputs and outputs, and analyze applicable scenarios of coastal model resources
Scientific question answering	Support professional question answering and multi-step explanation for coastal research questions
Data-model-knowledge integration	Build the foundation for domain intelligent services that combine data-driven, model-driven, and knowledge-driven approaches

Technical Route

CoastalGPT currently follows the technical route below:

Domain Resource Collection: Collect coastal basic knowledge, scientific literature, data-resource metadata, and model-resource metadata;
Training Sample Generation: Use Qwen3.5-27B to generate CoT instruction data for tasks such as concept explanation, process mechanisms, literature understanding, data-resource description, model-resource explanation, and multi-step scientific question answering;
Domain Model Fine-Tuning: Perform domain adaptation based on Qwen3.5-9B using the LoRA parameter-efficient fine-tuning method;
Model Inference Deployment: Deploy CoastalGPT using inference frameworks such as vLLM to support subsequent domain knowledge service applications;
Evaluation System Construction: Evaluate model capabilities across coastal basic knowledge, scientific literature understanding, data-resource metadata extraction, model-resource explanation, and multi-step scientific question answering.

Application Directions

CoastalGPT can support tasks such as data-resource organization, modeling-method understanding, and research workflow assistance in the coastal zone domain, including:

Metadata extraction and structured description of coastal data resources;
Analysis of coastal model resources and explanation of applicable scenarios;
Explanation of modeling methods, indicator systems, and data-processing workflows;
Analysis of scientific research questions and organization of research directions;
Construction of coastal intelligent service systems oriented toward data-model-knowledge collaboration;
Auto Research workflow support for coastal science research.

Current Version

The project is under active development. The current open-source release is the initial version of CoastalGPT, built from 100,000+ coastal-domain question-answering samples and fine-tuned from the Qwen3.5-9B base model.

Performance Evaluation

To evaluate CoastalGPT's coastal-domain knowledge understanding and multiple-choice reasoning, a multiple-choice benchmark was constructed across five dimensions: Spatial Distribution, Coastal Processes, Impacts & Risks, Monitoring & Modeling, and Management & Adaptation. CoastalGPT was compared with OceanGPT-basic-30B-A3B-Thinking and Qwen3.5-9B. The evaluation metric is Strict Exact-Match Accuracy: a response is counted as correct only when the set of options selected by the model exactly matches the ground-truth answer, so a partially correct selection receives no credit.
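
The metric can be sketched in a few lines. This is a minimal illustration, not the project's evaluation script: it assumes the model's final answer has already been reduced to a short string of option letters such as "AC" or "A, C", and that options are labelled A through H. Both are assumptions.

```python
import re


def parse_options(answer: str) -> frozenset:
    """Collect the option letters (A-H, assumed labelling) in a short answer string."""
    return frozenset(re.findall(r"[A-H]", answer))


def strict_exact_match(predicted: str, gold: str) -> bool:
    """True only if the predicted option set equals the gold option set."""
    return parse_options(predicted) == parse_options(gold)


def accuracy(pairs) -> float:
    """Share of (predicted, gold) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to score")
    return sum(strict_exact_match(p, g) for p, g in pairs) / len(pairs)


# Made-up predictions for illustration.
pairs = [
    ("A,C", "AC"),   # same set, different formatting: correct
    ("A", "AC"),     # missing an option: wrong
    ("B", "B"),      # correct
    ("ABD", "AB"),   # extra option: wrong
]
score = accuracy(pairs)  # 2 of 4 correct = 0.5
```

Comparing sets rather than strings keeps option order and separators from affecting the score, while still giving no credit for subsets or supersets of the correct answer.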

The benchmark questions are provided in the repository at data/Multiple-choice question.jsonl; the current evaluation set consists entirely of multiple-choice questions.
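
A JSONL file holds one JSON object per line, so the benchmark can be read with the standard library alone. The loader below is a sketch under that assumption. The `dimension` key used for grouping is hypothetical, since the record schema is not documented here.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line and return them as a list."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: line {line_no} is not valid JSON") from exc
    return records


def count_by(records, key="dimension"):
    """Count records per value of `key` (a hypothetical field name)."""
    return Counter(record.get(key, "unknown") for record in records)
```

A call such as `load_jsonl("data/Multiple-choice question.jsonl")` followed by `count_by(...)` would then show how many questions fall in each group, provided the records carry a field of that name.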

The overall results show that CoastalGPT-9B-3 achieves an overall accuracy of 73.4%, clearly outperforming OceanGPT at 58.6% and Qwen3.5-9B at 57.9%, a lead of 14.8 and 15.5 percentage points respectively. This indicates that fine-tuning on coastal-domain data improves the model's specialized concept recognition, multi-factor judgment, and domain knowledge organization.

Across the five dimensions, CoastalGPT obtains the highest accuracy in every category. It performs particularly well on Spatial Distribution and Impacts & Risks, achieving 82.0% accuracy in both categories. This suggests that the model can effectively identify coastal spatial patterns, regional differences, hazard exposure, ecological impacts, and risk factors. In the Coastal Processes dimension, CoastalGPT reaches 80.0%, showing a clear advantage over OceanGPT at 70.0% and Qwen3.5 at 58.0%. This reflects improved understanding of tides, hydrodynamics, sediment transport, ecogeomorphic feedbacks, and biogeochemical processes.

In the Monitoring & Modeling dimension, CoastalGPT achieves 62.8%, higher than OceanGPT at 52.6% and Qwen3.5 at 51.3%, but this is its lowest score across the five dimensions. Tasks involving remote sensing monitoring, modeling methods, dataset construction, and predictive evaluation therefore remain relatively challenging for the current model. In the Management & Adaptation dimension, CoastalGPT reaches 66.0%, only slightly higher than OceanGPT at 64.0% but substantially higher than Qwen3.5 at 50.0%. The model thus has an advantage on applied questions related to coastal governance, ecological restoration, adaptation strategies, and resource management, with clear room for further improvement.
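
The margins behind these comparisons can be recomputed directly from the reported accuracies. The sketch below uses only the figures quoted in this section. Spatial Distribution and Impacts & Risks are left out because baseline scores for them are not given in the text.

```python
# Accuracies in percent, as reported in this section.
reported = {
    "Overall": {"CoastalGPT": 73.4, "OceanGPT": 58.6, "Qwen3.5": 57.9},
    "Coastal Processes": {"CoastalGPT": 80.0, "OceanGPT": 70.0, "Qwen3.5": 58.0},
    "Monitoring & Modeling": {"CoastalGPT": 62.8, "OceanGPT": 52.6, "Qwen3.5": 51.3},
    "Management & Adaptation": {"CoastalGPT": 66.0, "OceanGPT": 64.0, "Qwen3.5": 50.0},
}


def margins(scores, model="CoastalGPT"):
    """Percentage-point lead of `model` over each other entry, to one decimal."""
    return {
        other: round(scores[model] - value, 1)
        for other, value in scores.items()
        if other != model
    }


gaps = {dimension: margins(scores) for dimension, scores in reported.items()}

for dimension, gap in gaps.items():
    summary = ", ".join(f"+{points} vs {other}" for other, points in gap.items())
    print(f"{dimension}: {summary}")
```

On these figures, the lead over OceanGPT is narrowest in Management & Adaptation (2.0 points), while the lead over Qwen3.5 is narrowest in Monitoring & Modeling (11.5 points).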

Overall, CoastalGPT shows a consistent domain-enhancement effect on the specialized coastal benchmark, with its highest scores in spatial pattern recognition, coastal process understanding, and impact-risk assessment. The improvement over the general-purpose base model Qwen3.5-9B suggests that coastal-domain CoT instruction data can effectively strengthen the model's mastery of specialized knowledge and complex multiple-choice tasks. CoastalGPT also achieves higher accuracy than the larger OceanGPT model, which points to the importance of high-quality domain-specific data adaptation for coastal zone tasks.