H-Net SMILES Checkpoints
Collection
Dynamic tokenization models for chemical SMILES (ICML 2026). 8 models on PI1M & MOSES datasets.
H-Net model for dynamic SMILES tokenization
MOSES molecular dataset, 340M bytes (~5 epochs), 10x concatenation, 1-stage architecture
| Property | Value |
|---|---|
| Architecture | H-Net (Hierarchical Network) |
| Parameters | ~350M |
| Dataset | MOSES |
| Training Bytes | 340M |
| Training Epochs | 5 |
| Concatenation | 10x SMILES per example |
| Architecture Variant | 1-stage |
Architecture spec (1-stage): `['m4', ['T22'], 'm4']`
- `checkpoints/checkpoint_bytes_best.pt` - Best checkpoint (lowest validation loss)
- `checkpoints/checkpoint_epoch_*.pt` - Epoch checkpoints
- `metadata.json` - Training configuration and history
- `test_smiles.txt` - Test SMILES used during training
- `visualizations/` - Training evolution GIFs and prediction files

```python
import torch

# Load the best checkpoint on CPU
checkpoint_path = "checkpoints/checkpoint_bytes_best.pt"
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# The checkpoint contains:
# - 'model_state_dict': Model weights
# - 'optimizer_state_dict': Optimizer state
# - 'epoch': Training epoch
# - 'metrics': Training metrics
# - 'cumulative_training_bytes': Total bytes processed

# Load into your H-Net model
# model.load_state_dict(checkpoint['model_state_dict'])
```
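The training configuration and history live in `metadata.json`, which can be read with the standard library. A minimal sketch follows; the field names in the example dictionary are illustrative assumptions, since the exact schema is not documented in this card:

```python
import json
import os
import tempfile

# Illustrative metadata.json contents (field names are assumptions,
# not the repository's documented schema).
example = {
    "dataset": "MOSES",
    "concatenation": 10,
    "epochs": 5,
    "cumulative_training_bytes": 340_000_000,
}

# Write and re-read the file, as you would with the real metadata.json.
path = os.path.join(tempfile.mkdtemp(), "metadata.json")
with open(path, "w") as f:
    json.dump(example, f)

with open(path) as f:
    meta = json.load(f)

print(f"trained on {meta['cumulative_training_bytes'] / 1e6:.0f}M bytes "
      f"of {meta['dataset']} for {meta['epochs']} epochs")
```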
| Metric | Value |
|---|---|
| Bits-per-byte (BPB) | 0.69 |
| Mean token length | 2.0 |
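The two metrics are related: with a mean token length of 2.0 bytes, the byte-level figure converts to per-token and nats-based equivalents. The worked conversion below is arithmetic on the reported numbers, not output from the training code:

```python
import math

bpb = 0.69            # bits-per-byte reported above
mean_token_len = 2.0  # average bytes per dynamic token

# Byte-level cross-entropy in nats (what training loops typically log):
nats_per_byte = bpb * math.log(2)

# Equivalent per-token loss in bits, since each token covers ~2 bytes:
bits_per_token = bpb * mean_token_len

print(f"{nats_per_byte:.3f} nats/byte, {bits_per_token:.2f} bits/token")
```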
H-Net embeddings outperform RDKit descriptors on downstream classification tasks.
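A minimal sketch of that downstream-classification setup, using synthetic stand-in vectors in place of real H-Net embeddings (the embedding-extraction API is not shown in this card) and a plain NumPy logistic regression on the frozen vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "molecules" with 32-dim embedding vectors.
# In practice these would come from the pretrained H-Net encoder.
X = rng.normal(size=(200, 32))
w_true = rng.normal(size=32)
y = (X @ w_true > 0).astype(float)  # synthetic binary property labels

# Logistic regression by gradient descent on the frozen embeddings.
w = np.zeros(32)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

# Training accuracy with the final weights.
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(np.mean((p > 0.5) == y))
print(f"train accuracy: {acc:.2f}")
```

The same probe-on-frozen-features recipe applies whether the features are H-Net embeddings or RDKit descriptors, which is what makes the comparison fair.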
@inproceedings{hnet_smiles_2026,
title={Learning Chemical Grammar: Dynamic Tokenization for SMILES with Hierarchical Networks},
author={Anonymous},
booktitle={International Conference on Machine Learning (ICML)},
year={2026}
}
All models from the paper are available:
Polymer (PI1M) Models:
Molecular (MOSES) Models:
MIT License