PI1M-68M

H-Net model for dynamic SMILES tokenization

PI1M polymer dataset, 68M bytes (~1 epoch), 10x concatenation, 1-stage architecture

Model Details

| Property | Value |
|---|---|
| Architecture | H-Net (Hierarchical Network) |
| Parameters | ~350M |
| Dataset | PI1M |
| Training Bytes | 68M |
| Training Epochs | 1 |
| Concatenation | 10x SMILES per example |
| Architecture Variant | 1-stage |

Architecture Layout

1-stage: ['m4', ['T22'], 'm4']

  • Encoder: 4 Mamba blocks for byte-level encoding
  • Core: 22 Transformer blocks with boundary prediction
  • Decoder: 4 Mamba blocks for final decoding
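The compact layout string above can be read mechanically. A minimal sketch of a parser for it, assuming the convention that `mN` means N Mamba blocks and `TN` means N Transformer blocks (the helper name is illustrative, not from the released code):

```python
def expand_layout(spec):
    """Expand an H-Net layout spec like ['m4', ['T22'], 'm4'] into a
    flat list of block types. A nested list marks the inner (core) stage."""
    blocks = []
    for item in spec:
        if isinstance(item, list):
            # Recurse into the inner stage.
            blocks.extend(expand_layout(item))
        else:
            kind, count = item[0], int(item[1:])
            name = "mamba" if kind == "m" else "transformer"
            blocks.extend([name] * count)
    return blocks

layout = expand_layout(["m4", ["T22"], "m4"])
# 4 Mamba (encoder) + 22 Transformer (core) + 4 Mamba (decoder) = 30 blocks
```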

Files

  • checkpoints/checkpoint_bytes_best.pt - Best checkpoint (lowest validation loss)
  • checkpoints/checkpoint_epoch_*.pt - Epoch checkpoints
  • metadata.json - Training configuration and history
  • test_smiles.txt - Test SMILES used during training
  • visualizations/ - Training evolution GIFs and prediction files

Usage

import torch
from pathlib import Path

# Load checkpoint
checkpoint_path = "checkpoints/checkpoint_bytes_best.pt"
# Note: on torch >= 2.6 the default weights_only=True may reject the full
# checkpoint dict; pass weights_only=False for a trusted checkpoint.
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# The checkpoint contains:
# - 'model_state_dict': Model weights
# - 'optimizer_state_dict': Optimizer state
# - 'epoch': Training epoch
# - 'metrics': Training metrics
# - 'cumulative_training_bytes': Total bytes processed

# Load into your H-Net model
# model.load_state_dict(checkpoint['model_state_dict'])
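Because H-Net tokenizes dynamically, the model's input is simply the raw UTF-8 byte stream of a SMILES string, not a fixed vocabulary. A minimal sketch of preparing byte IDs for inference (the function name, `max_len`, and `pad_id` are illustrative assumptions, not part of the released code):

```python
def smiles_to_bytes(smiles, max_len=128, pad_id=0):
    """Encode a SMILES string as a fixed-length list of byte IDs (0-255),
    truncating to max_len and right-padding with pad_id."""
    ids = list(smiles.encode("utf-8"))[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

batch = [smiles_to_bytes(s) for s in ["CCO", "c1ccccc1"]]
```

The padded integer lists can then be stacked into a tensor and fed to the loaded model; the boundary-prediction core decides where the learned "tokens" fall.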

Performance

Tokenization Metrics (from paper)

| Metric | Value |
|---|---|
| Bits-per-byte (BPB) | 0.83 |
| Mean token length | 2.6 bytes |
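These two metrics combine into a back-of-envelope view of the learned tokenization: with 0.83 bits per byte and a mean learned-token length of 2.6 bytes, each dynamic token carries roughly 0.83 × 2.6 ≈ 2.16 bits, and the token sequence is about 2.6x shorter than the raw byte sequence:

```python
bpb = 0.83             # bits per byte, from the table above
mean_token_len = 2.6   # average bytes per learned token

bits_per_token = bpb * mean_token_len   # ~2.16 bits per dynamic token
shortening = mean_token_len             # sequence-length reduction vs raw bytes
```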

Property Prediction (embeddings)

H-Net embeddings outperform RDKit descriptors on classification tasks:

  • BBBP: 0.950 AUC (vs 0.927 for RDKit)
  • HIV: 0.788 AUC (vs 0.760 for RDKit)

Citation

@inproceedings{hnet_smiles_2026,
  title={Learning Chemical Grammar: Dynamic Tokenization for SMILES with Hierarchical Networks},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}

Related Models

All models from the paper are available:

Polymer (PI1M) Models:

  • PI1M-68M - 1 epoch, with concatenation
  • PI1M-340M - 5 epochs, with concatenation
  • PI1M-1B - 22 epochs, with concatenation (best compression)
  • PI1M-nocat - 5 epochs, no concatenation
  • PI1M-2stg - 5 epochs, 2-stage architecture

Molecular (MOSES) Models:

License

MIT License
