χhem-GPT-2
χhem-GPT-2 is an improved version of the model reported in "Generative Chemical Language Models for Energetic Materials Discovery" (DOI: TBD), built on a decoder-only transformer architecture. It is an autoregressive model that takes tokenized (see the associated tokenizer) chemical structures in the Group SELFIES syntax (DOI: 10.1039/D3DD00012E) together with target properties (N = 8), and outputs the next-token (shifted) sequence. Target properties were not used during pretraining; they are supported for fine-tuning and should be normalized. The model was developed at Los Alamos National Laboratory (LANL).
General improvements over the prior architecture and training include a SwiGLU feedforward activation, Rotary Position Embeddings (RoPE), and a learning-rate warmup followed by linear annealing.
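For context, a generic SwiGLU feedforward block looks like the sketch below. This is an illustrative PyTorch implementation, not the model's exact layer, and the dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Generic SwiGLU feedforward block: W_out(SiLU(x W_gate) * (x W_value))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_value = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))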
Copyright Triad National Security, LLC
Last Updated: 2026-04-16
Developed by
Andrew Salij
Contributed by
Wilton J.M. Kort-Kamp
Ivana Gonzales
Cristina Garcia Cardona
R. Seaton Ullberg
Megan Davis
Model Changelog
- 2026-03-31 initial public version
Model short description
- Generative Chemical Language Model for Small Molecules
Model description
χhem-GPT-2 is a Generative Chemical Language Model. It should be loaded and run in accordance with the AI4HE Python library (LANL, contact asalij@lanl.gov and kortkamp@lanl.gov).
It consists of approximately 40 million parameters with a transformer embedding dimension of 512 and 12 transformer layers (see hparams.json).
TODO: χhem-GPT-2 may be integrated into the transformers library and released on Hugging Face, which would simplify its use.
Model Type
XhemGPT
Inputs and outputs
- Input: token ID tensor; encode with the associated GroupSelfiesTokenizer.encode(). Context window: 256.
- Input (fine-tuning): properties tensor of the chosen properties. Property window: 8 (pad unused values with 0).
- Output: token ID tensor; decode with the associated GroupSelfiesTokenizer.decode().
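For orientation, a sketch of the expected tensor shapes is shown below; the batch size, dtypes, and exact conventions are assumptions, so consult the AI4HE documentation for specifics.
import torch

batch_size, context_window, n_properties = 4, 256, 8

token_ids = torch.zeros(batch_size, context_window, dtype=torch.long)  # from GroupSelfiesTokenizer.encode()
properties = torch.zeros(batch_size, n_properties)                     # normalized target properties; pad unused slots with 0

# Autoregressive training targets are the input token IDs shifted by one position.
targets = token_ids[:, 1:]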
Compute Infrastructure
This model was trained on the Chicoma cluster (Institutional Computing, LANL) via SLURM.
Hardware
This model was trained on 8 NVIDIA A100 GPUs on the Chicoma cluster at LANL.
Software
This model was trained with Transform Your World (TYW, formerly AI4HE), a custom machine learning library for training chemical language models (LANL, contact asalij@lanl.gov and kortkamp@lanl.gov). It is recommended to run the model through TYW, which may be sourced from GitHub.
It is recommended to install Pixi, navigate to the root directory of TYW, and run
pixi install
pixi install --environment gpu
to install the necessary packages.
Papers and Scientific Outputs
Model License
The model is released under the MIT License (released under O# 5008).
© 2026. Triad National Security, LLC. All rights reserved.
This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos
National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S.
Department of Energy/National Nuclear Security Administration. All rights in the program are
reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear
Security Administration. The Government is granted for itself and others acting on its behalf a
nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare
derivative works, distribute copies to the public, perform publicly and display publicly, and to permit
others to do so.
Contact Info and Model Card Authors
Wilton J.M. Kort-Kamp (kortkamp@lanl.gov)
Intended Use
This model is intended to be used for molecular generation and discovery. It is a foundation model that should be adaptable to particular regions of chemical space.
Primary Intended Users
This model is likely to be of interest to AI researchers, chemists, and materials scientists.
Mission Relevance
This model aims to help discover Materials for the Future.
Out-of-Scope Use Cases
This model is unlikely to generalize to very large (>100 heavy atom) structures.
This model should not be used to generate SMILES strings or other non-supported chemical languages.
This model cannot use natural language.
How to use
Install Instructions
See Software.
Training configuration
After installing AI4HE, see ai4he.training.TrainingConfig for the available training options and ai4he.cli for command-line control of training specifications.
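A minimal way to inspect these options from a Python session (assuming only the module paths named above):
import ai4he.cli
import ai4he.training

help(ai4he.training.TrainingConfig)  # available training options and their defaults
help(ai4he.cli)                      # CLI control of training specifications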
Inference configuration
Upon installation of AI4HE, view ai4he.generation, which provides functions for creating de novo structures from the model.
On a distributed system, the function ai4he.generation.get_generated_samples is a recommended interface.
Unstable: the XGPTLitAPI API enables generation for a client-server architecture.
Code snippets of how to use the model
import ai4he

# load the model checkpoint and its tokenizer
xhem_gpt = ai4he.io_model.load_from_checkpoint("<MODEL_PATH>")
tokenizer = ai4he.tokenizer.GroupSelfiesTokenizer.load("<TOKENIZER_PATH>")

# unconditional generation prompt(s)
sample_strings = ai4he.generation.UNCONDITIONAL_START
sample_generator = ai4he.generation.SampleGenerator(True, tokenizer, context_window=256,
                                                    sample_class="group_selfies")

# note: the prompt strings are passed positionally here; adjust to the get_generated_samples signature as needed
molecule_samples = ai4he.generation.get_generated_samples(xhem_gpt, sample_generator, sample_strings,
                                                          n_samples=1028, fabric=None, run_distributed=False,
                                                          generation_kwargs={"batch_size": 16, "top_k": None})
# molecule_samples is a list[str] of Group SELFIES strings
Limitations
Risks
This model may be used to create molecules that carry potential risks.
Limitations
This model can only produce outputs in a single chemical language (i.e., Group SELFIES) and under the supported tokenization. Its embedding will need to be adapted to support additional encodings.
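As a purely illustrative sketch (not a supported AI4HE workflow), adapting a learned token embedding to a larger vocabulary typically means copying the existing rows and randomly initializing the new ones:
import torch.nn as nn

old_vocab_size, new_vocab_size, d_model = 1000, 1200, 512   # placeholder sizes
old_embedding = nn.Embedding(old_vocab_size, d_model)       # stands in for the trained token embedding
new_embedding = nn.Embedding(new_vocab_size, d_model)
new_embedding.weight.data[:old_vocab_size] = old_embedding.weight.data  # reuse learned rows; new tokens stay randomly initialized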
Training details
Training data
This model was pretrained on a subset of the SAFE (DOI: 10.1039/D4DD00019F) dataset of 8 million molecules (total, before splitting). These molecules were converted to SMILES, from which common molecular subgraphs were extracted and then converted to Group SELFIES. Molecules that failed to convert successfully were discarded. The upstream dataset has since been updated to v2; this model was trained on v1 (datamol-io/safe-gpt/tree/b83175cd7394).
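A rough sketch of the SMILES-to-Group SELFIES conversion step is shown below, assuming the open-source group_selfies package; the exact grammar and subgraph extraction used for this model may differ.
from rdkit import Chem
from group_selfies import GroupGrammar

grammar = GroupGrammar.essential_set()            # assumed default grammar; the model's grammar was built from extracted subgraphs
mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")    # example molecule (acetaminophen)
if mol is not None:                               # molecules that fail to parse/convert are discarded
    group_selfies_string = grammar.full_encoder(mol)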
Training Procedure
This model was trained with a batch size of 128 for one epoch on this ordered dataset under an 80/10/10% train/test/validation split. The learning-rate schedule used a linear warmup (20% of training time) from 10% of the maximum learning rate, followed by a linear cooldown to 0, with a peak learning rate of 0.0005 on the AdamW optimizer (DOI: 10.48550/arXiv.1711.05101) (kwargs = {"betas": (0.9, 0.999), "weight_decay": 0.1, "amsgrad": True}).
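A minimal sketch of the described optimizer and learning-rate schedule in plain PyTorch (not the TYW implementation; the model and step counts are placeholders):
import torch

model = torch.nn.Linear(512, 512)          # stand-in for the χhem-GPT-2 module
peak_lr = 5e-4
total_steps = 10_000                       # placeholder; depends on dataset size and batch size
warmup_steps = int(0.2 * total_steps)      # 20% of training time

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.999), weight_decay=0.1, amsgrad=True)

def lr_lambda(step):
    if step < warmup_steps:
        return 0.1 + 0.9 * step / warmup_steps                                    # linear warmup from 10% of peak
    return max(0.0, 1.0 - (step - warmup_steps) / (total_steps - warmup_steps))   # linear cooldown to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)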
Reproducibility Information
- Random seed used: 3047 (a minimal seeding sketch is shown after this list)
- Machine/environment info: 8 A100 GPUs across 2 nodes, PyTorch Lightning 2.5.1.post0 used for Distributed Data Parallel training.
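For reference, a minimal seeding sketch with PyTorch Lightning (actual seeding is handled inside TYW/AI4HE):
from lightning.pytorch import seed_everything

seed_everything(3047, workers=True)   # seeds Python, NumPy, and torch RNGs, including dataloader workers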
Pre-training information
CLI entry:
x_gpt_group_tv2.py --version 12 --nnodes 2 --modelindex 0
For a detailed suite of hyperparameters, see hparams.json.
Evaluation details
This model was tested on unconditional generation of molecules and on metrics pertaining to those molecules.
Evaluation data
Full data may be provided upon reasonable request and through the proper channels.
Evaluation Procedure
This model was evaluated by generating many molecules under various inference prompts and parameters (e.g., temperature) and benchmarking the generated molecules.
The recommended procedure is to benchmark as follows:
experiment = ai4he.experiment(preloaded_model_path = "<MODEL_PATH>",tokenizer_path = "<TOKENIZER_PATH>",context_window = 256,sample_class = "group_selfies")
benchmarkers = ai4he.benchmarking.HIGH_DETAILED_GROUP_BENCHMARK
benchmark_array = experiment.benchmark_model(benchmarkers,fname = "<PATH WITHOUT EXTENSION>",
n_samples = <NUMBER OF MOLECULES TO GENERATE>)
This will save CSVs with the relevant molecules and benchmarks. Note that an Experiment wrapping the model must be defined first, and this experiment should be initialized with Lightning Fabric arguments linked to any desired loggers.
Uncertainty Quantification
Metrics such as validity and uniqueness may be grouped or pooled across multiple generation runs to obtain variance estimates.
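A minimal sketch of such pooling across repeated generation runs (hypothetical values, not TYW output):
import numpy as np

validity_per_run = np.array([0.93, 0.95, 0.91, 0.94])   # hypothetical validity fractions from independent runs
mean_validity = validity_per_run.mean()
std_validity = validity_per_run.std(ddof=1)              # sample standard deviation across runs
print(f"validity = {mean_validity:.3f} +/- {std_validity:.3f}")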
Evaluation results
We are not providing evaluation results for this specific model at this time (2025-11-20).
Please consult the related papers mentioned in the introduction for details on related models.