PI1M-68M

H-Net model for dynamic SMILES tokenization

PI1M polymer dataset, 68M bytes (~1 epoch), 10x concatenation, 1-stage architecture

Model Details

| Property | Value |
|---|---|
| Architecture | H-Net (Hierarchical Network) |
| Parameters | ~350M |
| Dataset | PI1M |
| Training Bytes | 68M |
| Training Epochs | 1 |
| Concatenation | 10x SMILES per example |
| Architecture Variant | 1-stage |

Architecture Layout

1-stage: ['m4', ['T22'], 'm4']

  • Encoder: 4 Mamba blocks for byte-level encoding
  • Core: 22 Transformer blocks with boundary prediction
  • Decoder: 4 Mamba blocks for final decoding
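The compact layout string above can be read mechanically. A minimal sketch of a parser for it, assuming the convention that `mN` means N Mamba blocks and `TN` means N Transformer blocks (the helper name is illustrative, not from the released code):

```python
def expand_layout(spec):
    """Expand an H-Net layout spec like ['m4', ['T22'], 'm4'] into a
    flat list of block types. A nested list marks the inner (core) stage."""
    blocks = []
    for item in spec:
        if isinstance(item, list):
            # Recurse into the inner stage.
            blocks.extend(expand_layout(item))
        else:
            kind, count = item[0], int(item[1:])
            name = "mamba" if kind == "m" else "transformer"
            blocks.extend([name] * count)
    return blocks

layout = expand_layout(["m4", ["T22"], "m4"])
# 4 Mamba (encoder) + 22 Transformer (core) + 4 Mamba (decoder) = 30 blocks
```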

Files

  • checkpoints/checkpoint_bytes_best.pt - Best checkpoint (lowest validation loss)
  • checkpoints/checkpoint_epoch_*.pt - Epoch checkpoints
  • metadata.json - Training configuration and history
  • test_smiles.txt - Test SMILES used during training
  • visualizations/ - Training evolution GIFs and prediction files

Usage

import torch
from pathlib import Path

# Load checkpoint
checkpoint_path = "checkpoints/checkpoint_bytes_best.pt"
# Note: on torch >= 2.6 the default weights_only=True may reject the full
# checkpoint dict; pass weights_only=False for a trusted checkpoint.
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# The checkpoint contains:
# - 'model_state_dict': Model weights
# - 'optimizer_state_dict': Optimizer state
# - 'epoch': Training epoch
# - 'metrics': Training metrics
# - 'cumulative_training_bytes': Total bytes processed

# Load into your H-Net model
# model.load_state_dict(checkpoint['model_state_dict'])
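Because H-Net tokenizes dynamically, the model's input is simply the raw UTF-8 byte stream of a SMILES string, not a fixed vocabulary. A minimal sketch of preparing byte IDs for inference (the function name, `max_len`, and `pad_id` are illustrative assumptions, not part of the released code):

```python
def smiles_to_bytes(smiles, max_len=128, pad_id=0):
    """Encode a SMILES string as a fixed-length list of byte IDs (0-255),
    truncating to max_len and right-padding with pad_id."""
    ids = list(smiles.encode("utf-8"))[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

batch = [smiles_to_bytes(s) for s in ["CCO", "c1ccccc1"]]
```

The padded integer lists can then be stacked into a tensor and fed to the loaded model; the boundary-prediction core decides where the learned "tokens" fall.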

Performance

Tokenization Metrics (from paper)

| Metric | Value |
|---|---|
| Bits-per-byte (BPB) | 0.83 |
| Mean token length | 2.6 bytes |
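These two metrics combine into a back-of-envelope view of the learned tokenization: with 0.83 bits per byte and a mean learned-token length of 2.6 bytes, each dynamic token carries roughly 0.83 × 2.6 ≈ 2.16 bits, and the token sequence is about 2.6x shorter than the raw byte sequence:

```python
bpb = 0.83             # bits per byte, from the table above
mean_token_len = 2.6   # average bytes per learned token

bits_per_token = bpb * mean_token_len   # ~2.16 bits per dynamic token
shortening = mean_token_len             # sequence-length reduction vs raw bytes
```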

Property Prediction (embeddings)

H-Net embeddings outperform RDKit descriptors on classification tasks:

  • BBBP: 0.950 AUC (vs 0.927 for RDKit)
  • HIV: 0.788 AUC (vs 0.760 for RDKit)

Citation

@inproceedings{hnet_smiles_2026,
  title={Learning Chemical Grammar: Dynamic Tokenization for SMILES with Hierarchical Networks},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}

Related Models

All models from the paper are available:

Polymer (PI1M) Models:

  • PI1M-68M - 1 epoch, with concatenation
  • PI1M-340M - 5 epochs, with concatenation
  • PI1M-1B - 22 epochs, with concatenation (best compression)
  • PI1M-nocat - 5 epochs, no concatenation
  • PI1M-2stg - 5 epochs, 2-stage architecture

Molecular (MOSES) Models:

License

MIT License
