---
license: mit
library_name: pytorch
tags:
- chemistry
- biology
- drug-discovery
- molecular-language-modeling
- autoregressive
- smiles
- deepsmiles
- safe
- fragseq
- scaling-laws
datasets:
- SZU-ADDG/MLM-Scaling-datasets
---

# MLM-Scaling-Model

![Overview](./mlm_scaling_overview.png)

## Model Description

**MLM-Scaling-Model** is the companion model zoo for the paper **"Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation"**. It provides GPT-style autoregressive molecular language models trained in a compute-controlled scaling setup that spans multiple molecular string representations, model sizes, and token budgets.

This repository is mainly intended for:

- scaling-law studies for molecular language models
- controlled comparison of molecular representations
- initialization for downstream molecular property prediction
- autoregressive molecular string modeling research


## Model Sources

- **Paper:** [arXiv:2601.22757](https://arxiv.org/abs/2601.22757)
- **Code:** [SZU-ADDG/MLM-Scaling](https://github.com/SZU-ADDG/MLM-Scaling)
- **Dataset repository:** [SZU-ADDG/MLM-Scaling-datasets](https://huggingface.co/datasets/SZU-ADDG/MLM-Scaling-datasets)

## Repository Contents

Checkpoints are grouped by representation and model size. Each of the five representations (**DeepSMILES**, **FragSeq**, **FragLink**, **SAFE**, and **SMILES**) is released at the same eight sizes:

- 1M
- 4M
- 16M
- 43M
- 85M
- 152M
- 278M
- 650M

## Training Details

### Architecture

All released models are **decoder-only GPT-style Transformers** trained with an **autoregressive next-token objective** on molecular strings.
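As a minimal, generic sketch (not the repository's actual training code), the next-token objective pairs each position with the token that follows it:

```python
def next_token_pairs(token_ids):
    """Build (inputs, targets) for autoregressive training.

    Each position is trained to predict the token that follows it,
    so the targets are simply the inputs shifted left by one.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens")
    return token_ids[:-1], token_ids[1:]

# Toy example: suppose a SMILES string "CCO" tokenizes to [12, 12, 7].
inputs, targets = next_token_pairs([12, 12, 7])
print(inputs, targets)  # [12, 12] [12, 7]
```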

### Molecular Representations

The paper studies five string representations:

- **SMILES**
- **DeepSMILES**
- **SAFE**
- **FragSeq**
- **FragLink**

### Scaling Grid

The main compute-controlled training grid uses:

- **Model sizes:** 1M, 4M, 16M, 43M, 85M, 152M, 278M, 650M parameters
- **Dataset token budgets:** 100M, 300M, 1B, 3B tokens
- **Training style:** single-epoch, from-scratch runs for the main scaling analysis

The paper also includes repeated-pass runs on fixed corpora for auxiliary duration analysis, but the central scaling results are based on the single-epoch grid.
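For orientation, the grid above can be enumerated programmatically. The FLOPs column below uses the common C ≈ 6·N·D approximation from the scaling-law literature, which may differ from the paper's exact compute accounting:

```python
# Enumerate the compute-controlled training grid described above.
# C ~= 6 * N * D is a standard FLOPs heuristic (N parameters, D tokens),
# used here only for illustration.

MODEL_SIZES = {
    "1M": 1e6, "4M": 4e6, "16M": 16e6, "43M": 43e6,
    "85M": 85e6, "152M": 152e6, "278M": 278e6, "650M": 650e6,
}
TOKEN_BUDGETS = {"100M": 1e8, "300M": 3e8, "1B": 1e9, "3B": 3e9}

grid = [
    (size, budget, 6 * n_params * n_tokens)
    for size, n_params in MODEL_SIZES.items()
    for budget, n_tokens in TOKEN_BUDGETS.items()
]

print(len(grid))    # 32 runs per representation
smallest = grid[0]  # 1M params on 100M tokens -> ~6e14 FLOPs
```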

### Training Data

The pretraining corpus is built from large-scale unlabeled molecules collected from **ZINC** and **UniChem**, then serialized into the five molecular string representations listed above.

## Intended Use

### Primary Uses

These checkpoints are suitable for:

1. studying pretraining loss scaling under matched compute
2. comparing molecular representations under fixed token budgets
3. initializing downstream adaptation on molecular property prediction tasks
4. controlled research on autoregressive molecular language modeling

### Out-of-Scope Uses

These checkpoints are **not** intended to be used as:

- a clinical decision system
- a stand-alone drug design pipeline for real-world deployment
- a universal best model across all chemistry tasks
- a substitute for task-specific validation, synthesis checks, docking, or wet-lab confirmation

## Performance Highlights

The paper reports that scaling trends are visible in both **pretraining loss** and **downstream transfer**, and that the best molecular representation is **task-dependent** rather than universal.

### Downstream Tasks

Downstream transfer is evaluated on nine MoleculeNet benchmarks:

- **Classification:** BACE, HIV, BBBP, SIDER, Tox21, ClinTox
- **Regression:** ESOL, FreeSolv, Lipophilicity

### Representative Best Results Among Released Representations

| Task | Metric | Best Released Representation | Score |
|---|---:|---|---:|
| BACE | ROC-AUC ↑ | FragLink | 89.7 |
| HIV | ROC-AUC ↑ | SAFE* | 83.3 |
| BBBP | ROC-AUC ↑ | DeepSMILES | 97.8 |
| SIDER | ROC-AUC ↑ | FragSeq | 68.8 |
| Tox21 | ROC-AUC ↑ | FragSeq | 83.7 |
| ClinTox | ROC-AUC ↑ | SMILES / DeepSMILES | 99.8 |
| ESOL | RMSE ↓ | DeepSMILES | 0.362 |
| FreeSolv | RMSE ↓ | FragLink | 1.095 |
| Lipophilicity | RMSE ↓ | FragLink | 0.593 |

\* The paper notes that SAFE reaches the highest HIV score, but also that SAFE covers only about 83% of the original HIV test set in that comparison; see the paper for full context.

### Task-Level Takeaways

- **FragLink** is especially strong on **BACE** and the **biophysics regression tasks**.
- **SMILES** and **DeepSMILES** are strong on **HIV**, **BBBP**, and **ClinTox**.
- **FragSeq** is particularly competitive on **SIDER** and **Tox21**.
- There is **no single best representation for every downstream task**.

## Important Caveats

The paper makes two points that are worth keeping on the card:

1. Common **de novo generation metrics** such as validity, uniqueness, novelty, and diversity can saturate early and are sensitive to sampling settings.
2. **Goal-directed optimization scores** can be strongly affected by the search objective and search procedure, so they should not be treated as the main basis for scaling claims.

Because of this, the central conclusions in the paper are grounded mainly in:

- compute-controlled **validation loss**
- downstream transfer on **property prediction tasks**

## How to Get Started

These checkpoints are intended to be used together with the official training / inference codebase.

### 1. Clone the official code

```bash
git clone https://github.com/SZU-ADDG/MLM-Scaling.git
cd MLM-Scaling
```

### 2. Download this repository

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="SZU-ADDG/MLM-Scaling-Model",
    repo_type="model",
)

print(local_dir)
```

### 3. Choose a subfolder

Examples:

- `SMILES 152M`
- `DeepSMILES 85M`
- `FragSeq 43M`
- `FragLink 152M`
- `SAFE 278M`

Then load the selected checkpoint with the official codebase and the matching configuration.

## Limitations

- The checkpoints are research releases, not task-aligned production models.
- Representation choice matters: a stronger result on one task does not imply stronger results on all tasks.
- Compute-optimal conclusions in the paper are drawn within the studied compute range.
- The released checkpoints should be paired with the correct tokenizer / representation and configuration.

## Citation

If you use this model repository in your research, please cite:

```bibtex
@article{xu2026mlmscaling,
  title={Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation},
  author={Xu, Dong and Pan, Qihua and Yuan, Sisi and Li, Jianqiang and Zhu, Zexuan and Ji, Junkai},
  journal={arXiv preprint arXiv:2601.22757},
  year={2026}
}
```