SustcZhangYX committed on
Commit 6dac5ce · 1 Parent(s): f3b68b2

upload EnvGPT-14B

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
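The two added patterns route every `.safetensors` and `.json` file through Git LFS, which is why the small JSON files later in this commit show up as LFS pointers rather than as JSON text. A minimal Python sketch of that matching follows. It assumes `fnmatch`-style globbing on the base name as a stand-in for Git's own attribute matching, and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# Patterns added to .gitattributes in this commit.
ADDED_LFS_PATTERNS = ["*.safetensors", "*.json"]


def is_lfs_tracked(filename: str, patterns=ADDED_LFS_PATTERNS) -> bool:
    """Approximate check: does the file's base name match an added LFS pattern?"""
    basename = filename.rsplit("/", 1)[-1]
    return any(fnmatch(basename, pattern) for pattern in patterns)


# A few of the files touched by this commit.
files = ["config.json", "model-00001-of-00006.safetensors", "Modelfile", "README.md"]
tracked = [name for name in files if is_lfs_tracked(name)]
print(tracked)
```

Real `.gitattributes` matching has extra rules (directory patterns, precedence of later lines), so this is only a rough approximation for flat file names like the ones here.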
LOGO.PNG ADDED
Modelfile ADDED
@@ -0,0 +1,16 @@
+ # ollama modelfile auto-generated by llamafactory
+
+ FROM .
+
+ TEMPLATE """{{ if .System }}<|im_start|>system
+ {{ .System }}<|im_end|>
+ {{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|im_start|>user
+ {{ .Content }}<|im_end|>
+ <|im_start|>assistant
+ {{ else if eq .Role "assistant" }}{{ .Content }}<|im_end|>
+ {{ end }}{{ end }}"""
+
+ SYSTEM """You are Qwen, created by Alibaba Cloud. You are a helpful assistant."""
+
+ PARAMETER stop "<|im_end|>"
+ PARAMETER num_ctx 4096
README.md CHANGED
@@ -1,3 +1,129 @@
  ---
  license: mit
+ datasets:
+ - SustcZhangYX/ChatEnv
+ - SustcZhangYX/ChatEnv-zh
+ language:
+ - en
+ - zh
+ tags:
+ - Environmental Science
  ---
+
+ <div align="center">
+ <img src="LOGO.PNG" width="450px">
+ <h1 align="center"><font face="Arial">EnvGPT-14B</font></h1>
+ </div>
+
+
+ **EnvGPT-14B** is a domain-specific large language model tailored for environmental science tasks, fine-tuned on both English and Chinese datasets.
+
+ Environmental science presents unique challenges for LLMs due to its interdisciplinary nature. EnvGPT-14B was developed to address these challenges by leveraging environmental science-specific instruction datasets and benchmarks.
+
+ *The model was fine-tuned on the environmental science-specific instruction datasets, [ChatEnv](https://huggingface.co/datasets/SustcZhangYX/ChatEnv) and [ChatEnv-zh](https://huggingface.co/datasets/SustcZhangYX/ChatEnv-zh), through Supervised Fine-Tuning (SFT). The combined dataset includes over **200 million tokens**, covering diverse topics in environmental science in both English and Chinese. This bilingual training enables EnvGPT-14B to achieve strong performance in Chinese as well as English tasks.*
+
+
+ ## 🚀 Getting Started
+
+ ### Download the model
+
+ Download the model: [EnvGPT-14B](https://huggingface.co/SustcZhangYX/EnvGPT-14B)
+
+ ```shell
+ git lfs install
+ git clone https://huggingface.co/SustcZhangYX/EnvGPT-14B
+ ```
+
+ ### Model Usage
+
+ Here is a Python code snippet that demonstrates how to load the tokenizer and model and generate text using EnvGPT.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
+
+ # 1. Set your local EnvGPT model path here
+ model_path = "YOUR_LOCAL_MODEL_PATH"
+
+ # 2. Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # 3. Build chat messages
+ messages = [
+     {"role": "system", "content": "You are an expert assistant in environmental science, EnvGPT. You are a helpful assistant."},
+     {"role": "user", "content": "What is the definition of environmental science?"},
+ ]
+
+ # 4. Format the prompt using the chat template
+ # add_generation_prompt=True appends the assistant start marker
+ # (<|im_start|>assistant in this model's chat_template.jinja)
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ # 5. Initialize the text-generation pipeline
+ # The model above is already loaded in bfloat16 and placed by device_map="auto",
+ # so those arguments are not repeated here.
+ text_gen = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     return_full_text=False,  # Only return the newly generated text
+ )
+
+ # 6. Generate the response
+ outputs = text_gen(
+     text,
+     max_new_tokens=4096,  # Up to 4096 new tokens
+     do_sample=True,       # Enable sampling instead of greedy decoding
+     top_p=0.6,            # Nucleus sampling parameter
+     temperature=0.8,      # Sampling temperature
+ )
+
+ # 7. Print the assistant’s reply (without the original prompt)
+ print(outputs[0]["generated_text"])
+ ```
+
+ This code demonstrates how to load the tokenizer and model from your local path, define environmental science-specific prompts, and generate responses using sampling techniques like top-p and temperature.
+
+ ## 🌏 Acknowledgement
+
+ EnvGPT-14B is fine-tuned from the open-source [Qwen2.5](https://huggingface.co/Qwen). We sincerely thank the Qwen team for their efforts in developing and releasing such a powerful open-source foundation model, which makes domain-specific adaptations like EnvGPT possible.
+
+ ## ❗Disclaimer
+
+ This project is intended solely for academic research and exploration. Please note that, like all large language models, this model may exhibit limitations, including potential inaccuracies or hallucinations in generated outputs.
+
+ ## Limitations
+
+ - The model may produce hallucinated outputs or inaccuracies, which are inherent to large language models.
+ - The model's identity has not been specifically optimized and may generate content that resembles outputs from other LLaMA-based models or similar architectures.
+ - Generated outputs can vary between attempts due to sensitivity to prompt phrasing and token context.
+
+ ## 🚩Citation
+
+ If you find our work helpful, please consider citing our research: "[Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges](https://doi.org/10.1016/j.ese.2025.100608)":
+
+ ```bibtex
+ @article{ZHANG2025100608,
+ title = {Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges},
+ journal = {Environmental Science and Ecotechnology},
+ pages = {100608},
+ year = {2025},
+ issn = {2666-4984},
+ doi = {https://doi.org/10.1016/j.ese.2025.100608},
+ url = {https://www.sciencedirect.com/science/article/pii/S2666498425000869},
+ author = {Yuanxin Zhang and Sijie Lin and Yaxin Xiong and Nan Li and Lijin Zhong and Longzhen Ding and Qing Hu}
+ }
+ ```
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:58b54bbe36fc752f79a24a271ef66a0a0830054b4dfad94bde757d851968060b
+ size 605
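Because `*.json` is now tracked by LFS, this hunk shows a Git LFS pointer (three `key value` lines) instead of the JSON itself. As a rough illustration, a pointer like the one above could be parsed as follows. The function name `parse_lfs_pointer` and the returned field names are hypothetical, and the sketch assumes the pointer has exactly the three keys shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer content copied from the added_tokens.json hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:58b54bbe36fc752f79a24a271ef66a0a0830054b4dfad94bde757d851968060b
size 605
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

The same helper applies to every other pointer hunk in this commit, since they all share the three-line layout.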
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- messages[0]['content'] }}
+     {%- else %}
+         {{- 'You are EnvGPT, an expert assistant in environmental science. Provide professional, accurate, and concise answers for environmental topics (e.g., climate, ecosystems, water, soil, energy, policy). Be helpful and evidence-aware.' }}
+     {%- endif %}
+     {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+     {%- else %}
+         {{- '<|im_start|>system\nYou are EnvGPT, an expert assistant in environmental science. Provide professional, accurate, and concise answers for environmental topics (e.g., climate, ecosystems, water, soil, energy, policy). Be helpful and evidence-aware.<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- for message in messages %}
+     {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+         {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role }}
+         {%- if message.content %}
+             {{- '\n' + message.content }}
+         {%- endif %}
+         {%- for tool_call in message.tool_calls %}
+             {%- if tool_call.function is defined %}
+                 {%- set tool_call = tool_call.function %}
+             {%- endif %}
+             {{- '\n<tool_call>\n{"name": "' }}
+             {{- tool_call.name }}
+             {{- '", "arguments": ' }}
+             {{- tool_call.arguments | tojson }}
+             {{- '}\n</tool_call>' }}
+         {%- endfor %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {{- message.content }}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
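To make the template's behaviour easier to follow, here is a plain-Python sketch of its no-tools path only: a leading system message is used if present, otherwise the default EnvGPT system prompt is injected, and the assistant start marker is appended when a generation prompt is requested. The function name `render_chatml` is hypothetical, and tool definitions, tool calls, and `tool` role messages are deliberately left out, so this is an approximation rather than a replacement for `tokenizer.apply_chat_template`.

```python
# Default system prompt copied from chat_template.jinja above.
DEFAULT_SYSTEM = (
    "You are EnvGPT, an expert assistant in environmental science. "
    "Provide professional, accurate, and concise answers for environmental topics "
    "(e.g., climate, ecosystems, water, soil, energy, policy). "
    "Be helpful and evidence-aware."
)


def render_chatml(messages, add_generation_prompt=True):
    """Approximate the template's no-tools path for plain text messages."""
    has_system = bool(messages) and messages[0]["role"] == "system"
    system_text = messages[0]["content"] if has_system else DEFAULT_SYSTEM
    parts = [f"<|im_start|>system\n{system_text}<|im_end|>\n"]

    for index, message in enumerate(messages):
        # The leading system message was already emitted above.
        if index == 0 and message["role"] == "system":
            continue
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )

    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "What is eutrophication?"}])
print(prompt)
```

Note that the fallback identity here is EnvGPT, whereas the `SYSTEM` line in the Modelfile earlier in this commit still names Qwen, so the two entry points may present different default personas.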
config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6da3ffeff86a5042871ce5fde20d2cb6146bb24b75c9205a0126d3edfebaef2
+ size 1797
generation_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2b18f765bc5c1485718243048c4c4ce69c4b0cd0c5e6f4c952bbe2dfc1e2403
+ size 243
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d33d83e15833019e83363a833f9babca3fa6c7a02d6149d11a14c406c447a219
+ size 4986211280
model-00002-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b812377f1df9dbfedb272932e2133742fcc1e3e5798a3a2801b12d4d6b0c94c
+ size 4954847344
model-00003-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f857cec00d7756fe66d5cde79c700a9714bae642cf20b03b4802cfd31b49c883
+ size 4954847392
model-00004-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:621df7a5a6d111536f98b6ea494e42912cf82369f7d7f042461b54fcf098fa9e
+ size 4954847392
model-00005-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b6128f712b0282741df96a7d0f94c5d0ceeefc5c3beeee07a315f42648f0f624
+ size 4954847392
model-00006-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98f850c00dcc8eb232ede15973282e94f6e56dc93e86ba89dfaf083bd5649653
+ size 4734533160
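The `size` fields of the six shard pointers give the total checkpoint size directly. The arithmetic below sums them and derives a rough parameter count. The 2-bytes-per-parameter figure is an assumption (bfloat16 storage, matching the `torch.bfloat16` used in the README snippet); the stored dtype lives in `config.json`, which this diff only shows as an LFS pointer, and safetensors header overhead is ignored.

```python
# Shard sizes in bytes, copied from the six LFS pointers above.
shard_sizes = [
    4986211280,  # model-00001-of-00006.safetensors
    4954847344,  # model-00002-of-00006.safetensors
    4954847392,  # model-00003-of-00006.safetensors
    4954847392,  # model-00004-of-00006.safetensors
    4954847392,  # model-00005-of-00006.safetensors
    4734533160,  # model-00006-of-00006.safetensors
]

total_bytes = sum(shard_sizes)

# Assumption: weights are stored as bfloat16, i.e. 2 bytes per parameter.
BYTES_PER_PARAM = 2
approx_params_billion = total_bytes / BYTES_PER_PARAM / 1e9

print(f"total download: {total_bytes / 1e9:.2f} GB")
print(f"approx. parameters: {approx_params_billion:.1f}B")
```

Under that assumption the estimate comes out near 14.8 billion parameters, which is consistent with the "14B" in the model name, and the weights alone need roughly 29.5 GB of disk space.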
model.safetensors.index.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fde4e9e466a8a9357dca0a4d3b5e3656d528fa718a87520c6a5a9491f453f87d
+ size 47509
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76862e765266b85aa9459767e33cbaf13970f327a0e88d1c65846c2ddd3a1ecd
+ size 613
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
+ size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d10d4ad57348e7bf9b899b6d9b3b9cf8209776b47803e754541ca275e03f2dd3
+ size 4712
vocab.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
+ size 2776833