---
license: mit
library_name: pytorch
tags:
- chemistry
- biology
- drug-discovery
- molecular-language-modeling
- autoregressive
- smiles
- deepsmiles
- safe
- fragseq
- scaling-laws
datasets:
- SZU-ADDG/MLM-Scaling-datasets
---
# MLM-Scaling-Model
## Model Description

**MLM-Scaling-Model** is the companion model zoo for the paper **"Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation"**. It releases GPT-style autoregressive molecular language models trained under a compute-controlled scaling setup across multiple molecular string representations, model sizes, and token budgets.
This repository is mainly intended for:

- scaling-law studies for molecular language models
- controlled comparison of molecular representations
- initialization for downstream molecular property prediction
- autoregressive molecular string modeling research
## Model Sources

- **Paper:** [arXiv:2601.22757](https://arxiv.org/abs/2601.22757)
- **Code:** [SZU-ADDG/MLM-Scaling](https://github.com/SZU-ADDG/MLM-Scaling)
- **Dataset repository:** [SZU-ADDG/MLM-Scaling-datasets](https://huggingface.co/datasets/SZU-ADDG/MLM-Scaling-datasets)
## Repository Contents

The repository groups checkpoints by representation and model size. Each of the five representation directories (**DeepSMILES**, **FragSeq**, **FragLink**, **SAFE**, and **SMILES**) contains checkpoints at eight parameter counts:

- 1M
- 4M
- 16M
- 43M
- 85M
- 152M
- 278M
- 650M
## Training Details

### Architecture

All released models are **decoder-only GPT-style Transformers** trained with an **autoregressive next-token objective** on molecular strings.
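As an illustration of this objective, the sketch below builds a generic decoder-only model in PyTorch and computes the next-token cross-entropy loss on a toy batch. It is a minimal sketch for orientation only; the class name, layer counts, and dimensions are illustrative assumptions, not the released architecture or its hyperparameters.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Generic decoder-only sketch: token/position embeddings,
    causally masked self-attention blocks, and a language-model head."""
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        seq_len = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        # Additive causal mask: -inf above the diagonal blocks attention
        # to future positions.
        causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)
        return self.head(self.blocks(x, mask=causal))

# Next-token objective: the output at position t predicts token t+1.
ids = torch.randint(0, 40, (2, 16))        # toy batch of token ids
logits = TinyCausalLM(vocab_size=40)(ids)  # shape (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 40), ids[:, 1:].reshape(-1)
)
```

The released checkpoints should still be loaded through the official codebase; this snippet only shows the shape of the training objective.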

### Molecular Representations

The paper studies five string representations:

- **SMILES**
- **DeepSMILES**
- **SAFE**
- **FragSeq**
- **FragLink**
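Whatever the representation, strings must be tokenized before language modeling. The regex below is a common generic scheme for SMILES-like strings (a sketch under assumptions, not the paper's actual tokenizer or vocabulary) that keeps multi-character tokens such as `Cl`, `Br`, bracket atoms, and two-digit ring bonds intact:

```python
import re

# Generic SMILES tokenizer sketch: multi-character tokens are listed
# before single characters so they match first.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|@|%\d{2}|\d|[A-Za-z]|[=#$:/\\().+\-])"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every input character must belong to some token.
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The round-trip assertion is a cheap guard against silently dropping characters the pattern does not cover.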

### Scaling Grid

The main compute-controlled training grid uses:

- **Model sizes:** 1M, 4M, 16M, 43M, 85M, 152M, 278M, 650M parameters
- **Dataset token budgets:** 100M, 300M, 1B, 3B tokens
- **Training style:** single-epoch, from-scratch runs for the main scaling analysis

The paper also includes repeated-pass runs on fixed corpora for auxiliary training-duration analysis, but the central scaling results are based on the single-epoch grid.
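For rough orientation, the training compute of each grid point can be estimated with the common heuristic C ≈ 6·N·D FLOPs (N parameters, D tokens). The 6ND constant is a standard rule of thumb, not a number reported by the paper:

```python
# Grid values from this card; approx_flops uses the generic C ~ 6*N*D
# heuristic, not a figure from the paper.
model_sizes = {"1M": 1e6, "4M": 4e6, "16M": 16e6, "43M": 43e6,
               "85M": 85e6, "152M": 152e6, "278M": 278e6, "650M": 650e6}
token_budgets = {"100M": 100e6, "300M": 300e6, "1B": 1e9, "3B": 3e9}

def approx_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

grid = {(m, d): approx_flops(n, t)
        for m, n in model_sizes.items()
        for d, t in token_budgets.items()}

print(f"{grid[('650M', '3B')]:.2e}")  # largest grid point
```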

### Training Data

The pretraining corpus is built from large-scale unlabeled molecules collected from **ZINC** and **UniChem**, then serialized into the five molecular string representations listed above.
## Intended Use

### Primary Uses

These checkpoints are suitable for:

1. studying pretraining-loss scaling under matched compute
2. comparing molecular representations under fixed token budgets
3. initializing downstream adaptation on molecular property prediction tasks
4. controlled research on autoregressive molecular language modeling

### Out-of-Scope Uses

These checkpoints are **not** intended to be used as:

- a clinical decision system
- a stand-alone drug design pipeline for real-world deployment
- a universal best model across all chemistry tasks
- a substitute for task-specific validation, synthesis checks, docking, or wet-lab confirmation
## Performance Highlights

The paper reports that scaling trends are visible in both **pretraining loss** and **downstream transfer**, and that the best molecular representation is **task-dependent** rather than universal.

### Downstream Tasks

Downstream transfer is evaluated on nine MoleculeNet benchmarks:

- **Classification:** BACE, HIV, BBBP, SIDER, Tox21, ClinTox
- **Regression:** ESOL, FreeSolv, Lipophilicity
### Representative Best Results Among Released Representations

| Task | Metric | Best Released Representation | Score |
|---|---:|---|---:|
| BACE | ROC-AUC ↑ | FragLink | 89.7 |
| HIV | ROC-AUC ↑ | SAFE* | 83.3 |
| BBBP | ROC-AUC ↑ | DeepSMILES | 97.8 |
| SIDER | ROC-AUC ↑ | FragSeq | 68.8 |
| Tox21 | ROC-AUC ↑ | FragSeq | 83.7 |
| ClinTox | ROC-AUC ↑ | SMILES / DeepSMILES | 99.8 |
| ESOL | RMSE ↓ | DeepSMILES | 0.362 |
| FreeSolv | RMSE ↓ | FragLink | 1.095 |
| Lipophilicity | RMSE ↓ | FragLink | 0.593 |

\* The paper notes that SAFE reaches the highest HIV score, but also that SAFE covers only about 83% of the original HIV test set in that comparison. See the paper for full context.
### Task-Level Takeaways

- **FragLink** is especially strong on **BACE** and the **biophysics regression tasks**.
- **SMILES** and **DeepSMILES** are strong on **HIV**, **BBBP**, and **ClinTox**.
- **FragSeq** is particularly competitive on **SIDER** and **Tox21**.
- There is **no single best representation for every downstream task**.
## Important Caveats

The paper makes two points worth keeping on this card:

1. Common **de novo generation metrics** such as validity, uniqueness, novelty, and diversity can saturate early and are sensitive to sampling settings.
2. **Goal-directed optimization scores** can be strongly affected by the search objective and search procedure, so they should not be treated as the main basis for scaling claims.

Because of this, the central conclusions in the paper are grounded mainly in:

- compute-controlled **validation loss**
- downstream transfer on **property prediction tasks**
## How to Get Started

These checkpoints are intended to be used together with the official training and inference codebase.

### 1. Clone the official code

```bash
git clone https://github.com/SZU-ADDG/MLM-Scaling.git
cd MLM-Scaling
```
### 2. Download this repository

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="SZU-ADDG/MLM-Scaling-Model",
    repo_type="model",
)

print(local_dir)
```
### 3. Choose a subfolder

Examples:

- `SMILES 152M`
- `DeepSMILES 85M`
- `FragSeq 43M`
- `FragLink 152M`
- `SAFE 278M`

Then load the selected checkpoint with the official codebase and the matching configuration.
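After the snapshot is downloaded, the available subfolders can be discovered programmatically. `list_checkpoints` below is a hypothetical helper for browsing a local snapshot, not part of the official codebase:

```python
from pathlib import Path

def list_checkpoints(local_dir):
    """Return the representation/size subfolder names in a downloaded snapshot."""
    return sorted(p.name for p in Path(local_dir).iterdir() if p.is_dir())

# Example (assumes `local_dir` from the snapshot_download step above):
# for name in list_checkpoints(local_dir):
#     print(name)
```

If only one representation/size pair is needed, `snapshot_download` also accepts an `allow_patterns` argument to restrict the download to a single subfolder.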

## Limitations

- The checkpoints are research releases, not task-aligned production models.
- Representation choice matters: a stronger result on one task does not imply stronger results on all tasks.
- Compute-optimal conclusions in the paper are drawn within the studied compute range.
- Each released checkpoint must be paired with the matching tokenizer, representation, and configuration.
## Citation

If you use this model repository in your research, please cite:

```bibtex
@article{xu2026mlmscaling,
  title={Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation},
  author={Xu, Dong and Pan, Qihua and Yuan, Sisi and Li, Jianqiang and Zhu, Zexuan and Ji, Junkai},
  journal={arXiv preprint arXiv:2601.22757},
  year={2026}
}
```