SustcZhangYX committed · Commit 6dac5ce · Parent(s): f3b68b2

upload EnvGPT-14B

Files changed:
- .gitattributes (+2 −0)
- LOGO.PNG (+0 −0)
- Modelfile (+16 −0)
- README.md (+126 −0)
- added_tokens.json (+3 −0)
- chat_template.jinja (+54 −0)
- config.json (+3 −0)
- generation_config.json (+3 −0)
- merges.txt (+0 −0)
- model-00001-of-00006.safetensors (+3 −0)
- model-00002-of-00006.safetensors (+3 −0)
- model-00003-of-00006.safetensors (+3 −0)
- model-00004-of-00006.safetensors (+3 −0)
- model-00005-of-00006.safetensors (+3 −0)
- model-00006-of-00006.safetensors (+3 −0)
- model.safetensors.index.json (+3 −0)
- special_tokens_map.json (+3 −0)
- tokenizer.json (+3 −0)
- tokenizer_config.json (+3 −0)
- vocab.json (+3 −0)
.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
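The two added patterns route the safetensors shards and JSON files in this repo through Git LFS. As a rough sketch of which filenames the new patterns catch, using Python's `fnmatch` (gitattributes glob semantics differ in corner cases, e.g. `*` and directory separators, but for these flat patterns the behavior coincides; the file list below is illustrative):

```python
from fnmatch import fnmatch

patterns = ["*.safetensors", "*.json"]  # the two patterns added in this commit

files = [
    "model-00001-of-00006.safetensors",
    "tokenizer.json",
    "merges.txt",
    "README.md",
]

# A file goes through LFS if any tracked pattern matches its name.
lfs_tracked = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(lfs_tracked)  # ['model-00001-of-00006.safetensors', 'tokenizer.json']
```

Note that `merges.txt` is not matched, which is consistent with it being committed as a regular (non-LFS) file here.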
LOGO.PNG ADDED
Modelfile ADDED

@@ -0,0 +1,16 @@
# ollama modelfile auto-generated by llamafactory

FROM .

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
<|im_start|>assistant
{{ else if eq .Role "assistant" }}{{ .Content }}<|im_end|>
{{ end }}{{ end }}"""

SYSTEM """You are Qwen, created by Alibaba Cloud. You are a helpful assistant."""

PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096
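A sketch of how this Modelfile might be used, assuming a local Ollama installation (the model name `envgpt-14b` is illustrative, not defined by this repo):

```shell
# From the cloned repository directory containing the Modelfile:
ollama create envgpt-14b -f Modelfile
ollama run envgpt-14b "What is the definition of environmental science?"
```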
README.md CHANGED

@@ -1,3 +1,129 @@
---
license: mit
datasets:
- SustcZhangYX/ChatEnv
- SustcZhangYX/ChatEnv-zh
language:
- en
- zh
tags:
- Environmental Science
---

<div align="center">
<img src="LOGO.PNG" width="450px">
<h1 align="center"><font face="Arial">EnvGPT-14B</font></h1>
</div>

**EnvGPT-14B** is a domain-specific large language model tailored for environmental science tasks, fine-tuned on both English and Chinese datasets.

Environmental science presents unique challenges for LLMs because of its interdisciplinary nature. EnvGPT-14B was developed to address these challenges by leveraging environmental science-specific instruction datasets and benchmarks.

*The model was fine-tuned on the environmental science-specific instruction datasets [ChatEnv](https://huggingface.co/datasets/SustcZhangYX/ChatEnv) and [ChatEnv-zh](https://huggingface.co/datasets/SustcZhangYX/ChatEnv-zh) through supervised fine-tuning (SFT). The combined dataset contains over **200 million tokens**, covering diverse environmental science topics in both English and Chinese. This bilingual training enables EnvGPT-14B to achieve strong performance in Chinese as well as English tasks.*

## 🚀 Getting Started

### Download the model

Download the model: [EnvGPT-14B](https://huggingface.co/SustcZhangYX/EnvGPT-14B)

```shell
git lfs install
git clone https://huggingface.co/SustcZhangYX/EnvGPT-14B
```
### Model Usage

Here is a Python snippet that loads the tokenizer and model and generates text with EnvGPT.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# 1. Set your local EnvGPT model path here
model_path = "YOUR_LOCAL_MODEL_PATH"

# 2. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Build chat messages
messages = [
    {"role": "system", "content": "You are an expert assistant in environmental science, EnvGPT. You are a helpful assistant."},
    {"role": "user", "content": "What is the definition of environmental science?"},
]

# 4. Format the prompt using the chat template
# add_generation_prompt=True appends the assistant start tokens (<|im_start|>assistant)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# 5. Initialize the text-generation pipeline with the already-loaded model
text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # Only return the newly generated text
)

# 6. Generate the response
outputs = text_gen(
    text,
    max_new_tokens=4096,  # Allow up to 4096 new tokens
    do_sample=True,       # Enable sampling instead of greedy decoding
    top_p=0.6,            # Nucleus sampling parameter
    temperature=0.8,      # Sampling temperature
)

# 7. Print the assistant's reply (the prompt is excluded because return_full_text=False)
print(outputs[0]["generated_text"])
```

This code loads the tokenizer and model from a local path, formats an environmental science-specific prompt with the chat template, and generates a response using sampling techniques such as top-p and temperature.
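For intuition about what `top_p` and `temperature` do, here is a self-contained sketch of temperature scaling followed by nucleus (top-p) filtering over a toy next-token distribution. This is pure Python with made-up logits, purely for illustration; the real sampling happens inside `transformers`:

```python
import math
import random

def sample_nucleus(logits, top_p=0.6, temperature=0.8, rng=None):
    """Temperature-scale logits, keep the smallest set of tokens whose
    cumulative probability reaches top_p, renormalize, and sample one token."""
    rng = rng or random.Random(0)
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: sort by probability, keep tokens until mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept set and draw a sample from it.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i, kept
    return kept[-1], kept

token_id, nucleus = sample_nucleus([2.0, 1.0, 0.5, -1.0], top_p=0.6, temperature=0.8)
print(token_id, nucleus)  # with these logits the nucleus collapses to the single top token: 0 [0]
```

With `top_p=0.6` and this skewed toy distribution, only the most likely token survives filtering; raising `top_p` (or `temperature`) widens the nucleus and increases output variety.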
## 🌏 Acknowledgement

EnvGPT-14B is fine-tuned from the open-source [Qwen2.5](https://huggingface.co/Qwen). We sincerely thank the Qwen team for developing and releasing such a powerful open-source foundation model, which makes domain-specific adaptations like EnvGPT possible.

## ❗Disclaimer

This project is intended solely for academic research and exploration. Like all large language models, this model has limitations, including potential inaccuracies or hallucinations in generated outputs.

## Limitations

- The model may produce hallucinated or inaccurate outputs, a limitation inherent to large language models.
- The model's identity has not been specifically optimized, so it may generate content resembling the outputs of its Qwen base model or similar architectures.
- Generated outputs can vary between attempts because generation is sensitive to prompt phrasing and token context.

## 🚩Citation

If you find our work helpful, please consider citing our paper "[Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges](https://doi.org/10.1016/j.ese.2025.100608)":

```bibtex
@article{ZHANG2025100608,
  title   = {Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges},
  journal = {Environmental Science and Ecotechnology},
  pages   = {100608},
  year    = {2025},
  issn    = {2666-4984},
  doi     = {10.1016/j.ese.2025.100608},
  url     = {https://www.sciencedirect.com/science/article/pii/S2666498425000869},
  author  = {Yuanxin Zhang and Sijie Lin and Yaxin Xiong and Nan Li and Lijin Zhong and Longzhen Ding and Qing Hu}
}
```
added_tokens.json ADDED (Git LFS pointer)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:58b54bbe36fc752f79a24a271ef66a0a0830054b4dfad94bde757d851968060b
size 605
chat_template.jinja ADDED

@@ -0,0 +1,54 @@
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are EnvGPT, an expert assistant in environmental science. Provide professional, accurate, and concise answers for environmental topics (e.g., climate, ecosystems, water, soil, energy, policy). Be helpful and evidence-aware.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are EnvGPT, an expert assistant in environmental science. Provide professional, accurate, and concise answers for environmental topics (e.g., climate, ecosystems, water, soil, energy, policy). Be helpful and evidence-aware.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
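For the common no-tools path, this template reduces to plain ChatML: inject a default EnvGPT system prompt if none is given, then wrap each turn in `<|im_start|>`/`<|im_end|>`. A minimal pure-Python sketch of that rendering (illustrative only; in practice call `tokenizer.apply_chat_template`, which executes the actual Jinja template):

```python
DEFAULT_SYSTEM = (
    "You are EnvGPT, an expert assistant in environmental science. "
    "Provide professional, accurate, and concise answers for environmental topics "
    "(e.g., climate, ecosystems, water, soil, energy, policy). Be helpful and evidence-aware."
)

def render_chatml(messages, add_generation_prompt=True):
    """Mirror the no-tools branch of chat_template.jinja: prepend the default
    system prompt if absent, then wrap each turn in ChatML markers."""
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        out.append("<|im_start|>assistant\n")
    return "".join(out)

prompt = render_chatml([{"role": "user", "content": "Define environmental science."}])
print(prompt)
```

The tool-calling branches (`<tool_call>`/`<tool_response>` wrapping) are omitted here for brevity.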
merges.txt ADDED (the diff for this file is too large to render).

The remaining files are ADDED as Git LFS pointers, each of the form `version https://git-lfs.github.com/spec/v1` plus an `oid` and `size`:

| File | oid (sha256) | size (bytes) |
| --- | --- | --- |
| config.json | f6da3ffeff86a5042871ce5fde20d2cb6146bb24b75c9205a0126d3edfebaef2 | 1797 |
| generation_config.json | a2b18f765bc5c1485718243048c4c4ce69c4b0cd0c5e6f4c952bbe2dfc1e2403 | 243 |
| model-00001-of-00006.safetensors | d33d83e15833019e83363a833f9babca3fa6c7a02d6149d11a14c406c447a219 | 4986211280 |
| model-00002-of-00006.safetensors | 6b812377f1df9dbfedb272932e2133742fcc1e3e5798a3a2801b12d4d6b0c94c | 4954847344 |
| model-00003-of-00006.safetensors | f857cec00d7756fe66d5cde79c700a9714bae642cf20b03b4802cfd31b49c883 | 4954847392 |
| model-00004-of-00006.safetensors | 621df7a5a6d111536f98b6ea494e42912cf82369f7d7f042461b54fcf098fa9e | 4954847392 |
| model-00005-of-00006.safetensors | b6128f712b0282741df96a7d0f94c5d0ceeefc5c3beeee07a315f42648f0f624 | 4954847392 |
| model-00006-of-00006.safetensors | 98f850c00dcc8eb232ede15973282e94f6e56dc93e86ba89dfaf083bd5649653 | 4734533160 |
| model.safetensors.index.json | fde4e9e466a8a9357dca0a4d3b5e3656d528fa718a87520c6a5a9491f453f87d | 47509 |
| special_tokens_map.json | 76862e765266b85aa9459767e33cbaf13970f327a0e88d1c65846c2ddd3a1ecd | 613 |
| tokenizer.json | 9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa | 11421896 |
| tokenizer_config.json | d10d4ad57348e7bf9b899b6d9b3b9cf8209776b47803e754541ca275e03f2dd3 | 4712 |
| vocab.json | ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910 | 2776833 |