---
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen3-8B
datasets:
  - HuggingFaceTB/smollm-corpus
tags:
  - text-generation
  - transformers
  - safetensors
  - qwen
  - climate
  - planetary-boundaries
  - domain-adaptation
pipeline_tag: text-generation
---

# ClimateGPT-3-8B

ClimateGPT-3-8B is an open language model domain-adapted for climate science and the **Planetary Boundaries** framework.

## Model details

- **Base model**: `Qwen/Qwen3-8B`
- **Model type**: Causal LM
- **Language(s)**: English
- **Context length**: 8192 tokens (SFT configuration)
- **License**: Apache-2.0
- **Release artifact**: Fully merged weights (standalone model; no adapter required)

## Intended use

- Climate and sustainability Q&A
- Planetary Boundaries–focused education and analysis
- Drafting and summarization of climate-related content

## Limitations

- The model may produce incorrect or outdated information.
- Training data is largely English web content; this can introduce geographic/cultural and topical biases.
- The model is not a substitute for professional scientific, medical, legal, or policy advice.

## Training

ClimateGPT-3-8B was built in two stages: continued pretraining followed by supervised fine-tuning.

### Continued pretraining (CPT)

Starting from `Qwen/Qwen3-8B`, we performed continued pretraining on climate-focused corpora primarily derived from FineWeb-Edu (SmolLM-Corpus) using climate- and Planetary Boundaries–oriented filtering.

The data selection emphasizes climate science and Planetary Boundaries terminology and includes filtering to reduce off-topic matches from ambiguous terms.

### Supervised fine-tuning (SFT)

We performed supervised fine-tuning using a mixture of:

- Climate instruction-following data
- Multi-turn conversations
- Safety/refusal examples
- Tool-use data
- Synthetic climate / Planetary Boundaries Q&A

## Training data

### Public data

- **FineWeb-Edu (via `HuggingFaceTB/smollm-corpus`)**
  - Used for climate- and Planetary Boundaries–filtered continued pretraining.
  - **Dataset license**: ODC-By
  - Dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus

### Non-public / generated data

In addition to public data, the training mix includes internal and/or generated instruction data. These datasets are not redistributed with this model.

## Evaluation

We evaluate climate-domain performance using a Planetary Boundaries evaluation suite compatible with EleutherAI’s `lm-evaluation-harness`.

A representative comparison (from this project’s Planetary Boundaries evaluation artifacts) between a ClimateGPT 8B checkpoint and the base Qwen3-8B:

| Task | Metric | ClimateGPT | Qwen3-8B |
|---|---:|---:|---:|
| `planetary_boundaries_mcq_large` | acc | 0.4422 | 0.3533 |
| `planetary_boundaries_mcq_large` | acc_norm | 0.4278 | 0.3900 |
| `planetary_boundaries_mcq_hard` | acc | 0.3467 | 0.2711 |
| `planetary_boundaries_mcq_hard` | acc_norm | 0.3800 | 0.3400 |
| `planetary_boundaries_qa_large` | exact_match | 0.9000 | 0.8467 |
| `planetary_boundaries_qa_strict_core_nolist` | exact_match | 0.6556 | 0.4889 |

## How to use

### Transformers

This repository contains a standalone model. You can load it directly with Transformers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Erasmus-AI/climategpt-3-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain the Planetary Boundaries framework in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
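
Qwen3-derived checkpoints are usually prompted through the tokenizer's chat template rather than raw text. A minimal sketch of the same call using `apply_chat_template`, assuming this checkpoint retains the Qwen3 chat template (the example question is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Erasmus-AI/climategpt-3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "Which planetary boundaries are currently transgressed?"}
]

# Render the conversation with the model's chat template before tokenizing.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```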

### vLLM

Because the release is a fully merged checkpoint with no adapter, it should load in vLLM like any standard Qwen3-architecture model.

## License

- **Model weights**: Apache-2.0
- **Base model**: `Qwen/Qwen3-8B` (Apache-2.0)

## Attribution

If you use this model, please cite/attribute the upstream resources where appropriate:

- Base model: https://huggingface.co/Qwen/Qwen3-8B
- Training data (public portion): https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus (ODC-By)

## Citation

If you use this model in academic work, please cite:

```bibtex
@misc{climategpt3,
  title        = {ClimateGPT-3-8B},
  howpublished = {\url{https://huggingface.co/Erasmus-AI/climategpt-3-8b}},
  year         = {2026}
}
```

## Contact

If you have questions, issues, or evaluation results to share, please open a discussion/issue in the repository that accompanies this release.