---
license: mit
library_name: pytorch
tags:
  - chemistry
  - biology
  - drug-discovery
  - molecular-language-modeling
  - autoregressive
  - smiles
  - deepsmiles
  - safe
  - fragseq
  - scaling-laws
datasets:
  - SZU-ADDG/MLM-Scaling-datasets
---

MLM-Scaling-Model

Overview

Model Description

MLM-Scaling-Model is the companion model zoo for the paper "Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation". It releases GPT-style autoregressive molecular language models trained under a compute-controlled scaling setup across multiple molecular string representations, model sizes, and token budgets.

This repository is mainly intended for:

  • scaling-law studies for molecular language models
  • controlled comparison of molecular representations
  • initialization for downstream molecular property prediction
  • autoregressive molecular string modeling research

Repository Contents

The repository groups checkpoints into subfolders by representation and then by model size.

DeepSMILES

  • 1M
  • 4M
  • 16M
  • 43M
  • 85M
  • 152M
  • 278M
  • 650M

FragSeq

  • 1M
  • 4M
  • 16M
  • 43M
  • 85M
  • 152M
  • 278M
  • 650M

FragLink

  • 1M
  • 4M
  • 16M
  • 43M
  • 85M
  • 152M
  • 278M
  • 650M

SAFE

  • 1M
  • 4M
  • 16M
  • 43M
  • 85M
  • 152M
  • 278M
  • 650M

SMILES

  • 1M
  • 4M
  • 16M
  • 43M
  • 85M
  • 152M
  • 278M
  • 650M

Training Details

Architecture

All released models are decoder-only GPT-style Transformers trained with an autoregressive next-token objective on molecular strings.
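The next-token objective can be written down compactly. The sketch below is a minimal, standard-library illustration of the per-position cross-entropy loss that autoregressive training minimizes; it is not the released training code, which is built on PyTorch.

```python
import math

def next_token_loss(logits, targets):
    """Mean cross-entropy of an autoregressive next-token objective.

    logits:  one score vector per input position (toy vocabulary)
    targets: the target token ids (the sequence shifted left by one)
    """
    total = 0.0
    for scores, tgt in zip(logits, targets):
        m = max(scores)  # subtract the max to stabilize the softmax
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[tgt]  # -log p(target | prefix)
    return total / len(targets)

# Toy example: vocabulary of 3 token types, two positions.
loss = next_token_loss([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0, 1])
```

In real training the logits come from the Transformer decoder and the loss is averaged over all positions of all sequences in a batch; the arithmetic per position is exactly this.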

Molecular Representations

The paper studies five string representations:

  • SMILES
  • DeepSMILES
  • SAFE
  • FragSeq
  • FragLink

Scaling Grid

The main compute-controlled training grid uses:

  • Model sizes: 1M, 4M, 16M, 43M, 85M, 152M, 278M, 650M parameters
  • Dataset token budgets: 100M, 300M, 1B, 3B tokens
  • Training style: single-epoch, from-scratch runs for the main scaling analysis

The paper also includes repeated-pass runs on fixed corpora for auxiliary duration analysis, but the central scaling results are based on the single-epoch grid.
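To make the grid concrete, the snippet below enumerates the 8 × 4 size/budget combinations and estimates training compute per run with the common C ≈ 6·N·D heuristic (6 FLOPs per parameter per token). The heuristic is a standard approximation from the scaling-law literature, not necessarily this paper's exact compute accounting.

```python
# The main scaling grid: 8 model sizes x 4 single-epoch token budgets.
MODEL_SIZES = {"1M": 1e6, "4M": 4e6, "16M": 16e6, "43M": 43e6,
               "85M": 85e6, "152M": 152e6, "278M": 278e6, "650M": 650e6}
TOKEN_BUDGETS = {"100M": 1e8, "300M": 3e8, "1B": 1e9, "3B": 3e9}

# One entry per run: (size label, budget label, approx. training FLOPs).
grid = [(size, budget, 6 * n_params * n_tokens)
        for size, n_params in MODEL_SIZES.items()
        for budget, n_tokens in TOKEN_BUDGETS.items()]

print(len(grid))  # 32 runs in the main grid
# The 650M-parameter / 3B-token run dominates the compute budget.
print(max(grid, key=lambda run: run[2]))
```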

Training Data

The pretraining corpus is built from large-scale unlabeled molecules collected from ZINC and UniChem, then serialized into the five molecular string representations listed above.
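Serializing molecules into strings also fixes the tokenization. As an illustration only, the sketch below shows a regex-based SMILES tokenizer of the kind commonly used for molecular language models; the released models' actual vocabularies may differ per representation.

```python
import re

# Multi-character tokens (bracket atoms, two-letter elements, @@) must be
# matched before single characters. This pattern is a simplified,
# illustrative subset of SMILES, not the repository's tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|"
    r"[=#\-\+\(\)/\\%\.]|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```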

Intended Use

Primary Uses

These checkpoints are suitable for:

  1. studying pretraining loss scaling under matched compute
  2. comparing molecular representations under fixed token budgets
  3. initializing downstream adaptation on molecular property prediction tasks
  4. controlled research on autoregressive molecular language modeling

Out-of-Scope Uses

These checkpoints are not intended to be used as:

  • a clinical decision system
  • a stand-alone drug design pipeline for real-world deployment
  • a universal best model across all chemistry tasks
  • a substitute for task-specific validation, synthesis checks, docking, or wet-lab confirmation

Performance Highlights

The paper reports that scaling trends are visible in both pretraining loss and downstream transfer, and that the best molecular representation is task-dependent rather than universal.

Downstream Tasks

Downstream transfer is evaluated on nine MoleculeNet benchmarks:

  • Classification: BACE, HIV, BBBP, SIDER, Tox21, ClinTox
  • Regression: ESOL, FreeSolv, Lipophilicity
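The two metrics used below are standard. For reference, here are minimal standard-library definitions: ROC-AUC via the Mann-Whitney formulation (the probability that a random positive is scored above a random negative), and RMSE for the regression tasks. The paper's evaluations use the usual library implementations; this is just to pin down what the numbers mean.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked
    correctly (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, y_pred):
    """Root-mean-square error for the regression benchmarks."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
            / len(y_true)) ** 0.5
```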

Representative Best Results Among Released Representations

| Task | Metric | Best Released Representation | Score |
|------|--------|------------------------------|-------|
| BACE | ROC-AUC ↑ | FragLink | 89.7 |
| HIV | ROC-AUC ↑ | SAFE* | 83.3 |
| BBBP | ROC-AUC ↑ | DeepSMILES | 97.8 |
| SIDER | ROC-AUC ↑ | FragSeq | 68.8 |
| Tox21 | ROC-AUC ↑ | FragSeq | 83.7 |
| ClinTox | ROC-AUC ↑ | SMILES / DeepSMILES | 99.8 |
| ESOL | RMSE ↓ | DeepSMILES | 0.362 |
| FreeSolv | RMSE ↓ | FragLink | 1.095 |
| Lipophilicity | RMSE ↓ | FragLink | 0.593 |

* The paper notes that SAFE reaches the highest HIV score, but also points out that SAFE only covers about 83% of the original HIV test set in that comparison. For full context, please check the paper.

Task-Level Takeaways

  • FragLink is especially strong on BACE and the biophysics regression tasks.
  • SMILES and DeepSMILES are strong on HIV, BBBP, and ClinTox.
  • FragSeq is particularly competitive on SIDER and Tox21.
  • There is no single best representation for every downstream task.

Important Caveats

The paper makes two points that are worth keeping on the card:

  1. Common de novo generation metrics such as validity, uniqueness, novelty, and diversity can saturate early and are sensitive to sampling settings.
  2. Goal-directed optimization scores can be strongly affected by the search objective and search procedure, so they should not be treated as the main basis for scaling claims.

Because of this, the central conclusions in the paper are grounded mainly in:

  • compute-controlled validation loss
  • downstream transfer on property prediction tasks
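Compute-controlled loss curves are typically summarized by fitting a power law L(N) = a · N^(−b) in log-log space. The sketch below shows that standard least-squares fit on synthetic loss values (illustrative only; no numbers from the paper are reproduced here).

```python
import math

def fit_power_law(sizes, losses):
    """Fit L(N) = a * N**(-b) by linear least squares in log-log space.
    Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope  # exponent b is the negated log-log slope

# Synthetic losses lying exactly on a power law, for illustration.
sizes = [1e6, 4e6, 16e6, 43e6]
losses = [2.0 * n ** -0.05 for n in sizes]
a, b = fit_power_law(sizes, losses)
```

Real loss curves only follow a power law approximately and within the studied compute range, which is why the paper restricts its compute-optimal conclusions to that range.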

How to Get Started

These checkpoints are intended to be used together with the official training / inference codebase.

1. Clone the official code

```shell
git clone https://github.com/SZU-ADDG/MLM-Scaling.git
cd MLM-Scaling
```

2. Download this repository

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="SZU-ADDG/MLM-Scaling-Model",
    repo_type="model",
)

print(local_dir)
```

3. Choose a subfolder

Examples:

  • SMILES 152M
  • DeepSMILES 85M
  • FragSeq 43M
  • FragLink 152M
  • SAFE 278M

Then load the selected checkpoint with the official codebase and the matching configuration.
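Based on the layout listed under "Repository Contents", the subfolder path can be composed from the representation and size labels. The helper below assumes that `<representation>/<size>` layout inside the downloaded snapshot; the file names within each subfolder are not documented here, so check the snapshot and the official codebase.

```python
from pathlib import Path

def checkpoint_dir(local_dir: str, representation: str, size: str) -> Path:
    """Path to one checkpoint subfolder, assuming the
    <representation>/<size> layout shown in "Repository Contents"."""
    return Path(local_dir) / representation / size

# e.g. the SMILES 152M checkpoint inside a downloaded snapshot:
print(checkpoint_dir("/tmp/MLM-Scaling-Model", "SMILES", "152M"))
```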

Limitations

  • The checkpoints are research releases, not task-aligned production models.
  • Representation choice matters a lot; a stronger result on one task does not imply stronger results on all tasks.
  • Compute-optimal conclusions in the paper are drawn within the studied compute range.
  • The released checkpoints should be paired with the correct tokenizer / representation and configuration.

Citation

If you use this model repository in your research, please cite:

@article{xu2026mlmscaling,
  title={Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation},
  author={Xu, Dong and Pan, Qihua and Yuan, Sisi and Li, Jianqiang and Zhu, Zexuan and Ji, Junkai},
  journal={arXiv preprint arXiv:2601.22757},
  year={2026}
}