---
license: agpl-3.0
pipeline_tag: text-generation
tags:
- chemistry
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-14B
---
# ChemDFM-v2.0-14B

ChemDFM-v2.0 is the latest non-thinking model in the ChemDFM series, a pioneering open-source dialogue foundation model for chemistry and molecular science.

To improve its chemical capabilities, we upgraded both the domain pre-training stage and the instruction tuning stage. **In the domain pre-training stage**, we introduce web-scale molecules and reactions into the corpus along with their functional-group information and properties. This lets ChemDFM acquire chemical knowledge at a finer level of granularity. **In the instruction tuning stage**, we significantly improve the diversity of our instruction tuning dataset by introducing more tasks and increasing the variability in the phrasing and expression of the instruction texts.

## News

* <font color="#935000">**2025-10-26**</font>: The parameters of [ChemDFM-R-14B](https://huggingface.co/OpenDFM/ChemDFM-R-14B) are open-sourced!
* <font color="#935000">**2025-10-26**</font>: [ChemDFM-v2.0-14B](https://huggingface.co/OpenDFM/ChemDFM-v2.0-14B) is released! We applied the improved domain pre-training and instruction tuning procedure to Qwen2.5-14B to obtain a more capable general-purpose LLM for chemistry. More details can be found [here](https://huggingface.co/OpenDFM/ChemDFM-v2.0-14B).
* <font color="#935000">**2025-07-29**</font>: The ChemDFM-R-14B paper is released on arXiv: [ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge](https://arxiv.org/abs/2507.21990).

* 2024-11-09: [ChemDFM-v1.5-8B](https://huggingface.co/OpenDFM/ChemDFM-v1.5-8B) is released! We applied our domain pre-training and instruction tuning procedure to a stronger base model, LLaMA-3-8B.
* 2024-03-12: The parameters of [ChemDFM-v1.0-13B](https://huggingface.co/OpenDFM/ChemDFM-v1.0-13B) are open-sourced!
* 2024-01-26: The ChemDFM-13B paper is released on arXiv: [ChemDFM: Dialogue Foundation Model for Chemistry](https://arxiv.org/abs/2401.14818).

### Local Inference

The following example loads ChemDFM-v2.0 and runs it locally:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the tokenizer and the model weights in half precision on the GPU.
model_name_or_id = "OpenDFM/ChemDFM-v2.0-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16).to("cuda")

# The question comes first, followed by the SMILES string on its own line.
instruction = "Can you please give detailed descriptions of the molecule below?\nCl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
message = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": instruction
    }
]

# Render the chat template to text, then tokenize it.
input_text = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id
)
outputs = model.generate(**inputs, generation_config=generation_config)

# Decode only the newly generated tokens, skipping the prompt tokens.
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(f"{generated_text=}")
```

### SMILES Preprocessing

When your input contains SMILES notation, we recommend canonicalizing the SMILES with the `rdkit` package before passing it to the model. Here is an example:
```python
from rdkit import Chem
def canonicalize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
```
or directly:
```python
from rdkit import Chem
def canonicalize_smiles(smiles):
    return Chem.CanonSmiles(smiles, useChiral=True)
```
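
The canonicalization step can be wired into the prompt construction used for local inference. The block below is a minimal sketch, not part of ChemDFM or `rdkit`: `build_messages` is a hypothetical helper, and it assumes the prompt layout from the inference example (question first, SMILES on the next line). The canonicalizer is passed in as an argument, so either `canonicalize_smiles` variant that returns `None` on failure can be used.

```python
from typing import Callable, Dict, List, Optional


def build_messages(
    question: str,
    smiles: str,
    canonicalize: Callable[[str], Optional[str]],
    system_prompt: str = "You are a helpful assistant.",
) -> List[Dict[str, str]]:
    """Build a chat message list with the SMILES canonicalized first.

    Hypothetical helper for illustration only. `canonicalize` is assumed to
    return a canonical SMILES string, or None if the input cannot be parsed.
    """
    canonical = canonicalize(smiles)
    if canonical is None:
        # Fail early instead of sending an unparsable SMILES to the model.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return [
        {"role": "system", "content": system_prompt},
        # Assumed layout: the question, then the SMILES on its own line.
        {"role": "user", "content": f"{question}\n{canonical}"},
    ]
```

For example, `build_messages("Can you please give detailed descriptions of the molecule below?", smiles, canonicalize_smiles)` would produce a list that can be passed to `tokenizer.apply_chat_template`. Raising on an unparsable SMILES is one possible design choice; falling back to the raw string is another.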

## Citation
```bibtex
@article{zhao2025developing,
         title={Developing ChemDFM as a large language foundation model for chemistry},
         author={Zhao, Zihan and Ma, Da and Chen, Lu and Sun, Liangtai and Li, Zihao and Xia, Yi and Chen, Bo and Xu, Hongshen and Zhu, Zichen and Zhu, Su and others},
         journal={Cell Reports Physical Science},
         volume={6},
         number={4},
         year={2025},
         publisher={Elsevier}
}

@misc{zhao2025chemdfmr,
      title={ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge},
      author={Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu},
      year={2025},
      eprint={2507.21990},
      archivePrefix={arXiv},
      primaryClass={cs.CE},
      url={https://arxiv.org/abs/2507.21990}, 
}
```

## Disclaimer
The current version of ChemDFM may generate incorrect or misleading information. Please use it with caution and verify its outputs with domain experts before making any decisions based on them.

## Contact

If you have any questions or further requests, please contact [Zihan Zhao](mailto:zhao_mengxin@sjtu.edu.cn), [Bo Chen](mailto:chenb@szlab.ac.cn), and [Lu Chen](mailto:chenlusz@sjtu.edu.cn).