initial commit

Co-authored-by: SeulLee05 <SeulLee05@users.noreply.huggingface.co>

- .gitattributes +35 -0
- README.md +150 -0
- nv-reasyn-ar-166m-v2.ckpt +3 -0
.gitattributes ADDED

@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
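Each line above routes a filename pattern to Git LFS. As a rough illustration of how such patterns apply, the sketch below checks filenames against a small subset of them with Python's `fnmatch`; note that Git's own wildmatch rules (e.g. for `saved_model/**/*`) are not identical to `fnmatch`, so this is an approximation.

```python
from fnmatch import fnmatch

# Subset of the .gitattributes patterns above, for illustration only.
LFS_PATTERNS = ["*.ckpt", "*.pt", "*.safetensors", "*.bin", "*tfevents*"]

def tracked_by_lfs(filename: str) -> bool:
    """Return True if the filename matches any of the LFS patterns."""
    return any(fnmatch(filename, pat) for pat in LFS_PATTERNS)

print(tracked_by_lfs("nv-reasyn-ar-166m-v2.ckpt"))  # True: matches *.ckpt
print(tracked_by_lfs("README.md"))                  # False: stored as plain text
```

This is why the checkpoint file in this commit appears in the diff as a small LFS pointer rather than 2 GB of binary data.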
README.md ADDED

@@ -0,0 +1,150 @@
+---
+license: other
+license_name: nvidia-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
+library_name: clara
+---
+
+## Model Overview
+
+The code for using the ReaSyn model checkpoint is available in the [official GitHub repository](https://github.com/NVIDIA-Digital-Bio/ReaSyn).
+
+### Description
+
+ReaSyn is a model that predicts the synthesis pathway, i.e., the reaction steps from reactants to the final product(s), for a target molecule. When the target molecule cannot be synthesized directly using known reaction steps, ReaSyn generates pathways for the most structurally similar synthesizable analog of the target. The model uses an encoder-decoder Transformer architecture in which a full synthetic pathway is represented as a text sequence. ReaSyn v2 improves on the reconstruction and projection capabilities of ReaSyn v1 through a more advanced search (combining top-down and bottom-up tree traversal) and an Edit Flow model that refines generated pathways via deletion, substitution, and insertion operations. This approach allows the model to achieve state-of-the-art performance in tasks such as synthesis planning and incorporating synthesizability into goal-directed molecular property optimization.
+
+This model is ready for commercial use.
+
+### License/Terms of Use
+
+GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). ReaSyn source code is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+
+Deployment Geography: Global
+
+Use Case: <br>
+ReaSyn v2 predicts the synthetic pathway, i.e., the reaction steps from reactants to the final product(s), for a target molecule. It can be used in the pharmaceutical and chemical industries and in academic research to identify how to synthesize a molecule, to help chemists plan a first-time synthesis, to optimize an existing synthesis pathway, or to filter candidate molecules by ease of synthesis. <br>
+
+Release Date: <br>
+GitHub 1/8/2026 via https://github.com/NVIDIA-Digital-Bio/ReaSyn <br>
+NGC 1/8/2026 via https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/reasyn?version=2.0 <br>
+Hugging Face 1/8/2026 via:
+- https://huggingface.co/nvidia/NV-ReaSyn-AR-166M-v2
+- https://huggingface.co/nvidia/NV-ReaSyn-EB-174M-v2 <br>
+
+### References
+Research paper: "Exploring Synthesizable Chemical Space with Iterative Pathway Refinements" https://arxiv.org/abs/2509.16084
+
+### Model Architecture
+
+Architecture Type: Encoder-decoder
+Network Architecture: Encoder-decoder Transformer
+ReaSyn v2 uses an encoder-decoder Transformer that takes a molecular SMILES as input and outputs its synthetic pathway autoregressively. The encoder has 6 layers and the decoder has 10 layers; both have a hidden size of 768, 16 attention heads, and a feed-forward dimension of 4096.
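The hyperparameters stated above can be collected into a small configuration sketch. The field names here are illustrative only and are not the repository's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReaSynARConfig:
    """Hyperparameters as stated in this model card (field names assumed)."""
    d_model: int = 768          # hidden size
    num_heads: int = 16         # attention heads
    ffn_dim: int = 4096         # feed-forward dimension
    encoder_layers: int = 6
    decoder_layers: int = 10
    max_input_tokens: int = 256   # SMILES input limit
    max_output_tokens: int = 512  # synthetic pathway output limit

cfg = ReaSynARConfig()
print(cfg.d_model // cfg.num_heads)  # per-head dimension: 48
```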
+ReaSyn v2 additionally includes an Edit Flow model, which shares the same encoder-decoder Transformer backbone but adds three prediction heads. The Edit Flow model takes a molecular SMILES and the synthetic pathway generated by the autoregressive model as input, and outputs the probabilities of the edit operations (insertion, deletion, and substitution) that yield a more refined synthetic pathway.
+
+The autoregressive model has 166M parameters and the Edit Flow model has 174M parameters.
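As a purely illustrative sketch of the three edit operations applied to a pathway token sequence (the real model predicts probabilities over these operations; the token names and the `apply_edit` helper below are assumptions, not part of ReaSyn's API):

```python
def apply_edit(tokens, op, pos, token=None):
    """Apply one of the three Edit Flow operations to a token sequence."""
    out = list(tokens)
    if op == "delete":
        del out[pos]
    elif op == "substitute":
        out[pos] = token
    elif op == "insert":
        out.insert(pos, token)
    else:
        raise ValueError(f"unknown op: {op}")
    return out

# Hypothetical pathway tokens: two building blocks and a reaction token.
path = ["B1", "B2", "RX_12"]
print(apply_edit(path, "substitute", 2, "RX_7"))  # ['B1', 'B2', 'RX_7']
print(apply_edit(path, "insert", 1, "B3"))        # ['B1', 'B3', 'B2', 'RX_12']
print(apply_edit(path, "delete", 0))              # ['B2', 'RX_12']
```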
+
+### Autoregressive model
+
+#### Input
+
+Input Types: Text<br>
+Input Formats: SMILES string<br>
+Input Parameters: One-Dimensional (1D)<br>
+Other Properties Related to Input: Maximum input length is 256 tokens.
+
+#### Output
+
+Output Types: Text<br>
+Output Formats: Molecular synthetic pathway<br>
+Output Parameters: One-Dimensional (1D)<br>
+Other Properties Related to Output: Maximum output length is 512 tokens.
+
+Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
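To illustrate the 256-token input limit, the sketch below counts SMILES tokens with a simple regex-based tokenizer of the kind commonly used in reaction-modeling work. ReaSyn's actual tokenizer may differ, so treat the pattern and the `fits_input_limit` helper as assumptions.

```python
import re

# Simplified SMILES tokenization pattern (an assumption, not ReaSyn's own):
# bracket atoms, two-letter halogens, ring-bond labels, organic-subset atoms,
# digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFIbcnops]|\d|[=#+\-()/\\.@~*$:]"
)

def fits_input_limit(smiles: str, limit: int = 256) -> bool:
    """Check whether a SMILES string fits in the model's input window."""
    return len(SMILES_TOKEN.findall(smiles)) <= limit

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(len(SMILES_TOKEN.findall(aspirin)))  # 21 tokens
print(fits_input_limit(aspirin))           # True
```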
+
+### Edit Flow model
+
+#### Input
+
+Input Types: Text<br>
+Input Formats: SMILES string, molecular synthetic pathway<br>
+Input Parameters: One-Dimensional (1D)<br>
+Other Properties Related to Input: Maximum input length of the SMILES string is 256 tokens. Maximum input length of the molecular synthetic pathway is 512 tokens.
+
+#### Output
+
+Output Types: Text<br>
+Output Formats: Molecular synthetic pathway<br>
+Output Parameters: One-Dimensional (1D)<br>
+Other Properties Related to Output: Maximum output length is 512 tokens.
+
+Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+
+### Software Integration
+
+Runtime Engine: Torch<br>
+Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere<br>
+Preferred Operating System: Linux, Windows
+
+The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
+
+### Model Versions
+
+ReaSyn v2
+
+## Training and Evaluation Datasets
+
+### Training Datasets
+
+SynFormer Reaction Templates<br>
+Link: https://github.com/wenhao-gao/synformer/blob/main/data/rxn_templates/comprehensive.txt<br>
+Data Modality: Text<br>
+Text Training Data Size: 1 Billion to 10 Trillion Tokens<br>
+Data Collection Method by dataset: Human<br>
+Labeling Method by dataset: Automated<br>
+Properties: 115 molecular reaction templates in the SMARTS format
+
+Building Blocks in Enamine US Stock retrieved in October 2023<br>
+Link: https://enamine.net/building-blocks/building-blocks-catalog<br>
+Data Modality: Text<br>
+Text Training Data Size: 1 Billion to 10 Trillion Tokens<br>
+Data Collection Method by dataset: Human<br>
+Labeling Method by dataset: N/A<br>
+Properties: Purchasable building block molecules from the Enamine US Stock catalog
+
+### Evaluation Datasets
+
+Enamine REAL Test Set<br>
+Link: https://github.com/wenhao-gao/synformer/blob/main/data/enamine_smiles_1k.txt<br>
+https://enamine.net/compound-collections/real-compounds/real-database<br>
+Data Collection Method by dataset: Human<br>
+Labeling Method by dataset: N/A<br>
+Properties: 1k test molecules randomly selected from Enamine REAL to evaluate synthesizable molecule reconstruction.<br>
+
+ChEMBL Test Set<br>
+Link: https://github.com/wenhao-gao/synformer/blob/main/data/chembl_filtered_1k.txt<br>
+https://www.ebi.ac.uk/chembl<br>
+Data Collection Method by dataset: Human<br>
+Labeling Method by dataset: N/A<br>
+Properties: 1k test molecules randomly selected from ChEMBL to evaluate synthesizable molecule reconstruction.<br>
+
+ZINC250k Test Set<br>
+Link: https://www.kaggle.com/datasets/basu369victor/zinc250k<br>
+Data Collection Method by dataset: Synthetic<br>
+Labeling Method by dataset: N/A<br>
+Properties: 1k test molecules randomly selected from ZINC250k to evaluate synthesizable molecule reconstruction.<br>
+
+### Inference
+
+Engine: Torch<br>
+Test Hardware: NVIDIA A100 (Ampere)
+
+### Ethical Considerations
+
+NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+Users are responsible for ensuring that the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.
+
+For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
+
+Please report security vulnerabilities or NVIDIA AI concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
nv-reasyn-ar-166m-v2.ckpt ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1a5055c0b8eee2ca40e70f0a12047d02803173c6602413204c5a357ca8c2b65e
+size 2001291902
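The three added lines above are a standard Git LFS pointer: the actual 2 GB checkpoint lives in LFS storage, addressed by its SHA-256 digest. A minimal sketch of parsing such a pointer (the `parse_lfs_pointer` helper is hypothetical, not part of any tool):

```python
# The pointer file content from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1a5055c0b8eee2ca40e70f0a12047d02803173c6602413204c5a357ca8c2b65e
size 2001291902
"""

def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, oid, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size_bytes": int(fields["size"]),
    }

info = parse_lfs_pointer(POINTER)
print(info["size_bytes"])  # 2001291902 (~2.0 GB checkpoint)
```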