Model • Data Sources • Model Training • Performance Evaluation • Applications • Acknowledgements • 中文
CoastalGPT
A Data-Model-Knowledge Integrated Large Language Model for the Coastal Zone Domain
CoastalGPT is a large language model designed for the coastal zone domain. It aims to explore unified organization and reasoning methods for data, models, and knowledge in coastal science research. Centered on data resources, modeling methods, and scientific knowledge in the coastal zone domain, the project builds a unified semantic representation and CoT (Chain-of-Thought) reasoning framework to support relational organization, process reasoning, and coordinated interaction among these three components. The core goal of CoastalGPT is to enable the model to understand the relationships between data resources and modeling methods, connect scientific questions with research workflows, and support model selection, method interpretation, process reasoning, and automated research organization in complex scientific tasks. To achieve this goal, the project constructs CoT instruction datasets for large language model fine-tuning based on coastal domain knowledge, scientific literature, data-resource metadata, and model-resource metadata, and adapts general-purpose large language models to the coastal domain to improve their scientific reasoning and automated research support capabilities in coastal science.
This project was conducted under the guidance of Professor Zhaoyuan Yu, Professor Linwang Yuan, and Professor Wen Luo. Jian Wang, Pei Du, Zhenxia Liu, Zengjie Wang, Zhuo Zhao, Boda Chen, Lei Nie, Jianbo Zhou, Yuhao Xiong, and others contributed to data construction, model fine-tuning, and system implementation.
Data Sources
The training data of CoastalGPT are organized around three types of objects in the coastal zone domain: data, models, and knowledge. The data sources mainly include five parts: approximately 100,000 coastal-domain basic-knowledge CoT instruction samples constructed from open knowledge sources such as Wikipedia; more than 400,000 coastal scientific literature metadata records with the topic coast* collected up to 2025; collected and organized coastal data-resource metadata; collected and organized coastal model-resource metadata; and scientific literature knowledge CoT data for research-oriented tasks constructed from open-access coastal papers published from 2020 to 2025.
Training Data Construction
The training data of CoastalGPT mainly take the form of CoT reasoning data. The project uses Qwen3.5-27B as the data generation model to generate instruction data with reasoning processes around coastal basic knowledge, scientific literature knowledge, data-resource metadata, and model-resource metadata. The current fine-tuning data are all constructed as CoT instruction data containing reasoning processes. By introducing reasoning processes, CoastalGPT learns not only final-answer generation, but also analytical pathways, knowledge organization patterns, and scientific explanation logic for coastal zone problems.
Model Training
CoastalGPT currently uses Qwen3.5-9B as the base model and adopts a parameter-efficient fine-tuning method for domain adaptation.
Item Setting Data generation model Qwen3.5-27B Fine-tuning base model Qwen3.5-9B Training data format CoT instruction data Fine-tuning method LoRA Training framework Unsloth Inference deployment vLLM
Core Objectives
CoastalGPT aims to achieve the following objectives:
Objective Description Domain knowledge enhancement Improve the model's understanding of specialized coastal concepts, typical objects, and process mechanisms Literature knowledge injection Transform research topics, modeling methods, indicator systems, and experimental knowledge from papers into training data Data-resource understanding Improve the model's ability to extract, describe, and interpret metadata of coastal data resources Model-resource understanding Improve the model's ability to explain functions, identify inputs and outputs, and analyze applicable scenarios of coastal model resources Scientific question answering Support professional question answering and multi-step explanation for coastal research questions Data-model-knowledge integration Build the foundation for domain intelligent services that combine data-driven, model-driven, and knowledge-driven approaches
Technical Route
CoastalGPT currently follows the technical route below:
Domain Resource Collection
Collect coastal basic knowledge, scientific literature, data-resource metadata, and model-resource metadata.Training Sample Generation
Use Qwen3.5-27B to generate CoT instruction data for tasks such as concept explanation, process mechanisms, literature understanding, data-resource description, model-resource explanation, and multi-step scientific question answering.Domain Model Fine-Tuning
Perform domain adaptation based on Qwen3.5-9B using the LoRA parameter-efficient fine-tuning method.Model Inference Deployment
Deploy CoastalGPT using inference frameworks such as vLLM to support subsequent domain knowledge service applications.Evaluation System Construction
Evaluate model capabilities from aspects including coastal basic knowledge, scientific literature understanding, data-resource metadata extraction, model-resource explanation, and multi-step scientific question answering.
Application Directions
CoastalGPT can support tasks such as data-resource organization, modeling-method understanding, and research workflow assistance in the coastal zone domain, including:
Metadata extraction and structured description of coastal data resources;
Analysis of coastal model resources and explanation of applicable scenarios;
Explanation of modeling methods, indicator systems, and data-processing workflows;
Analysis of scientific research questions and organization of research directions;
Construction of coastal intelligent service systems oriented toward data-model-knowledge collaboration;
Auto Research workflow support for coastal science research.
Current Version
The project is currently under continuous development. The current open-source version is the initial version of CoastalGPT, built from 100,000+ coastal-domain question-answering samples and fine-tuned from the Qwen3.5-9B base model.
Performance Evaluation
To evaluate CoastalGPT's performance in coastal-domain knowledge understanding and multiple-choice reasoning, a multiple-choice benchmark was constructed across five dimensions: Spatial Distribution, Coastal Processes, Impacts & Risks, Monitoring & Modeling, and Management & Adaptation. CoastalGPT was compared with OceanGPT-basic-30B-A3B-Thinking and Qwen3.5-9B. The evaluation metric is Strict Exact-Match Accuracy, meaning that a response is counted as correct only when all options selected by the model exactly match the ground-truth answer.
The benchmark questions are provided in the repository files as data/Multiple-choice question.jsonl; the current evaluation set consists entirely of multiple-choice questions.
The overall results show that CoastalGPT-9B-3 achieves an overall accuracy of 73.4%, clearly outperforming OceanGPT at 58.6% and Qwen3.5 at 57.9%. This indicates that, after fine-tuning with coastal-domain data, CoastalGPT demonstrates stronger stability in specialized concept recognition, multi-factor judgment, and domain knowledge organization.
Across the five dimensions, CoastalGPT obtains the highest accuracy in every category. It performs particularly well on Spatial Distribution and Impacts & Risks, achieving 82.0% accuracy in both categories. This suggests that the model can effectively identify coastal spatial patterns, regional differences, hazard exposure, ecological impacts, and risk factors. In the Coastal Processes dimension, CoastalGPT reaches 80.0%, showing a clear advantage over OceanGPT at 70.0% and Qwen3.5 at 58.0%. This reflects improved understanding of tides, hydrodynamics, sediment transport, ecogeomorphic feedbacks, and biogeochemical processes.
In the Monitoring & Modeling dimension, CoastalGPT achieves 62.8%, still higher than OceanGPT at 52.6% and Qwen3.5 at 51.3%, although the margin is smaller than in other dimensions. This indicates that tasks involving remote sensing monitoring, modeling methods, dataset construction, and predictive evaluation remain relatively challenging for the current model. In the Management & Adaptation dimension, CoastalGPT reaches 66.0%, slightly higher than OceanGPT at 64.0% and substantially higher than Qwen3.5 at 50.0%. This shows that the model has advantages in applied questions related to coastal governance, ecological restoration, adaptation strategies, and resource management, while there is still room for further improvement.
Overall, CoastalGPT demonstrates a stable domain-enhancement effect on the specialized coastal benchmark, with especially clear advantages in spatial pattern recognition, coastal process understanding, and impact-risk assessment. Compared with the general base model Qwen3.5, the improvement of CoastalGPT shows that coastal-domain CoT instruction data can effectively strengthen the model's mastery of specialized knowledge and complex multiple-choice tasks. Compared with the larger OceanGPT model, CoastalGPT still achieves higher accuracy, indicating the importance of high-quality domain-specific data adaptation for coastal zone tasks.
Acknowledgements
We thank the following open-source models and tools for supporting this project:
- Qwen
- Unsloth
- vLLM
- HuggingFace
- Downloads last month
- 27
