---
license: mit
tags:
- chemistry
- smiles
- tokenization
- dynamic-tokenization
- h-net
- hierarchical-networks
- molecular-representation
- polymer
- mamba
- transformer
datasets:
- MOSES
language:
- en
pipeline_tag: feature-extraction
---
# MOSES-2stg

H-Net model for dynamic SMILES tokenization, trained on the MOSES molecular dataset: 340M bytes (~5 epochs), 10x SMILES concatenation per example, 2-stage hierarchical architecture.
## Model Details
| Property | Value |
|---|---|
| Architecture | H-Net (Hierarchical Network) |
| Parameters | ~350M |
| Dataset | MOSES |
| Training Bytes | 340M |
| Training Epochs | 5 |
| Concatenation | 10x SMILES per example |
| Architecture Variant | 2-stage |
## Architecture Layout

2-stage: `['m4', ['T1m4', ['T22'], 'm4T1'], 'm4']`

- Encoder: 4 Mamba blocks for byte-level encoding
- Core: 2-level hierarchy: Stage 0 (1 Transformer + 4 Mamba blocks per side) + Stage 1 (22 Transformer blocks)
- Decoder: 4 Mamba blocks for final decoding
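The layout string can be decoded mechanically: each `m<N>`/`T<N>` token denotes N Mamba or Transformer blocks, and nesting depth corresponds to hierarchy stages. The small parser below is illustrative only — the spec format is inferred from the string above, and this is not the repository's official loader:

```python
import re

def count_blocks(spec):
    """Count Mamba ('m') and Transformer ('T') blocks in a nested
    layout spec. Illustrative only -- inferred, not the official parser."""
    counts = {"mamba": 0, "transformer": 0}

    def walk(node):
        if isinstance(node, list):
            for child in node:
                walk(child)
        else:  # e.g. 'T1m4' -> one Transformer block, four Mamba blocks
            for kind, n in re.findall(r"([mT])(\d+)", node):
                counts["mamba" if kind == "m" else "transformer"] += int(n)

    walk(spec)
    return counts

layout = ["m4", ["T1m4", ["T22"], "m4T1"], "m4"]
print(count_blocks(layout))  # {'mamba': 16, 'transformer': 24}
```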
## Files

- `checkpoints/checkpoint_bytes_best.pt` - Best checkpoint (lowest validation loss)
- `checkpoints/checkpoint_epoch_*.pt` - Epoch checkpoints
- `metadata.json` - Training configuration and history
- `test_smiles.txt` - Test SMILES used during training
- `visualizations/` - Training evolution GIFs and prediction files
## Usage

```python
import torch

# Load checkpoint
checkpoint_path = "checkpoints/checkpoint_bytes_best.pt"
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# The checkpoint contains:
# - 'model_state_dict': Model weights
# - 'optimizer_state_dict': Optimizer state
# - 'epoch': Training epoch
# - 'metrics': Training metrics
# - 'cumulative_training_bytes': Total bytes processed

# Load the weights into your H-Net model
# model.load_state_dict(checkpoint['model_state_dict'])
```
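As a self-contained smoke test of the checkpoint format (no real weights required), the sketch below round-trips a dictionary with the keys documented above; the values are placeholders, not actual training results:

```python
import os
import tempfile

import torch

# Dummy checkpoint mirroring the documented keys (values are placeholders)
dummy = {
    "model_state_dict": {},
    "optimizer_state_dict": {},
    "epoch": 5,
    "metrics": {"val_loss": 0.0},
    "cumulative_training_bytes": 340_000_000,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint_bytes_best.pt")
    torch.save(dummy, path)
    loaded = torch.load(path, map_location="cpu")

print(sorted(loaded))  # the five documented keys
```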
## Performance

### Tokenization Metrics (from paper)
| Metric | Value |
|---|---|
| Bits-per-byte (BPB) | 0.83 |
| Mean token length | 2.0 |
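Bits-per-byte is the byte-level cross-entropy expressed in base 2; assuming the training loss is reported in nats per byte, the conversion is a single change of logarithm base:

```python
import math

def bits_per_byte(nats_per_byte: float) -> float:
    """Convert average cross-entropy in nats/byte to bits/byte (BPB)."""
    return nats_per_byte / math.log(2)

# A loss of ~0.575 nats/byte corresponds to the reported 0.83 BPB
print(round(bits_per_byte(0.575), 2))  # 0.83
```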
### Property Prediction (embeddings)
H-Net embeddings outperform RDKit descriptors on classification tasks:
- BBBP: 0.950 AUC (vs 0.927 for RDKit)
- HIV: 0.788 AUC (vs 0.760 for RDKit)
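A common way to evaluate frozen embeddings on such tasks is a linear probe scored with ROC AUC; the sketch below uses synthetic arrays as stand-ins for H-Net embeddings and labels (this is not necessarily the paper's exact evaluation protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for H-Net embeddings and binary labels (e.g. BBBP)
X = rng.normal(size=(500, 64))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(auc > 0.5)  # the probe easily separates this synthetic signal
```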
## Citation

```bibtex
@inproceedings{hnet_smiles_2026,
  title={Learning Chemical Grammar: Dynamic Tokenization for SMILES with Hierarchical Networks},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}
```
## Related Models

All models from the paper are available:

**Polymer (PI1M) Models:**
- PI1M-68M - 1 epoch, with concatenation
- PI1M-340M - 5 epochs, with concatenation
- PI1M-1B - 22 epochs, with concatenation (best compression)
- PI1M-nocat - 5 epochs, no concatenation
- PI1M-2stg - 5 epochs, 2-stage architecture
**Molecular (MOSES) Models:**
- MOSES-340M - 5 epochs, with concatenation
- MOSES-nocat - 5 epochs, no concatenation
- MOSES-2stg - 5 epochs, 2-stage architecture (this model)
## License
MIT License