AIS-RoChem3D Paper Checkpoints
This repository provides the AIS-RoChem3D model checkpoints used in the paper experiments, together with the model definition, vocabulary files, and minimal examples for embedding extraction and downstream prediction.
The full pretraining datasets and generated H5 caches are not included because of their size.
Checkpoints
| file | meaning |
|---|---|
| checkpoints/step76104.pt | paper checkpoint at global step 76104 |
| checkpoints/final.pt | final paper checkpoint after resumed training |
Python Interface
The model can be imported from the aisrochem3d package (src/aisrochem3d/). Its interface includes:
- AISRoChem3DConfig
- AISRoChem3DPretrainModel
- paper_config()
- load_checkpoint()
- load_model_from_checkpoint()
- featurize_smiles()
- embed_batch()
Main Model Configuration
- pair_dim = 512
- max_token_length = 384
- ProbMix distance branch with MAT-style distance kernel, tau = 2.0
- static mix raw lambda (1.0, 0.5, 0.5), normalized to content/edge/dist = 0.5/0.25/0.25
- edge embedding dimension 128
- edge probability branch scale initialized at 1.0
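The normalized mix weights follow from the raw lambda values by simple arithmetic. The sketch below assumes plain sum normalization (each raw value divided by the total), which reproduces the 0.5/0.25/0.25 split stated above; the helper name normalize_mix is hypothetical and is not part of the aisrochem3d package, and the model code may compute the weights differently.

```python
def normalize_mix(raw):
    """Scale raw mixing weights so that they sum to 1 (assumed sum normalization)."""
    total = sum(raw)
    if total <= 0:
        raise ValueError("raw weights must have a positive sum")
    return tuple(w / total for w in raw)


# Raw lambda from the configuration, in content/edge/dist order.
raw_lambda = (1.0, 0.5, 0.5)
content_w, edge_w, dist_w = normalize_mix(raw_lambda)
print(content_w, edge_w, dist_w)  # 0.5 0.25 0.25
```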
Minimal Checks
pip install -r requirements.txt
python examples/smoke_forward.py
python examples/load_checkpoint_keys.py --checkpoint checkpoints/step76104.pt
python examples/run_embedding_demo.py --smiles "CC(=O)Oc1ccccc1C(=O)O" --checkpoint checkpoints/step76104.pt
python examples/run_downstream_demo.py --smiles "CC(=O)Oc1ccccc1C(=O)O" --checkpoint checkpoints/step76104.pt
Downstream Evaluation Note
The downstream benchmark results reported in the paper were obtained by fine-tuning task-specific heads from the released checkpoints. This repository focuses on the pretrained AIS-RoChem3D checkpoints, model definition, vocabularies, and minimal embedding/property-head examples. It does not include the full downstream hyperparameter search records or task-specific fine-tuned heads.
Vocabulary Files
- vocab/ais_vocab_qcmerged.txt
- vocab/bond_triplet_vocab_qcmerged.tsv
model_config.json records the vocabulary sizes and edge unknown-bucket ids.
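As a quick sanity check, a vocabulary file can be read into a token-to-id mapping and its length compared with the size recorded in model_config.json. The sketch below is illustrative only: it assumes the text vocabulary stores one token per line with ids given by zero-based line order, which this README does not state, and load_token_vocab is a hypothetical helper rather than a function of the aisrochem3d package.

```python
from pathlib import Path


def load_token_vocab(path):
    """Map each non-empty line of a text file to its zero-based position.

    Assumption: one token per line, ids assigned in file order.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    tokens = [line.strip() for line in lines if line.strip()]
    return {token: index for index, token in enumerate(tokens)}
```

For example, len(load_token_vocab("vocab/ais_vocab_qcmerged.txt")) could be compared against the vocabulary size stored in model_config.json; a mismatch would suggest the file layout differs from the assumption made here.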
SMILES-to-Embedding Example
The examples include a minimal SMILES-to-embedding path:
- canonicalize to no-H SMILES;
- AIS tokenize with token-to-atom alignment;
- generate a heavy-atom RDKit conformer;
- map bond triplets into the model edge-id space;
- run the encoder and pool atom embeddings.
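The final pooling step above can be illustrated without the model. The sketch below assumes masked mean pooling over atom embeddings, with padding positions excluded by a mask; the README does not say which pooling the examples use, and masked_mean_pool is a hypothetical name, so treat this as one plausible reading rather than the repository's implementation.

```python
def masked_mean_pool(atom_embeddings, atom_mask):
    """Average the embedding rows whose mask entry is truthy."""
    if len(atom_embeddings) != len(atom_mask):
        raise ValueError("embeddings and mask must have the same length")
    kept = [row for row, keep in zip(atom_embeddings, atom_mask) if keep]
    if not kept:
        raise ValueError("mask selects no atoms")
    dim = len(kept[0])
    return [sum(row[i] for row in kept) / len(kept) for i in range(dim)]


# Two real atoms followed by one padding position.
atoms = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(masked_mean_pool(atoms, mask))  # [2.0, 3.0]
```

Excluding padded positions keeps the molecule embedding independent of how much padding a batch happens to contain.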
examples/run_downstream_demo.py shows the property-head interface from embeddings.
Without --head, it uses a deterministic demo head only; replace it with a fine-tuned
head for real supervised prediction.