χhem-GPT-2

χhem-GPT-2 is an improved version of the model reported in "Generative Chemical Language Models for Energetic Materials Discovery" (DOI: TBD), with a decoder-only transformer architecture. It is an autoregressive model that takes in tokenized (see the associated tokenizer) chemical structures in the Group SELFIES (DOI: 10.1039/D3DD00012E) syntax together with target properties (N = 8), and outputs shifted tokens. Target properties were not used during pretraining; they are supported for fine-tuning and should be normalized. It was developed at Los Alamos National Laboratory (LANL).
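As a minimal sketch of the expected property preprocessing (per-property z-score normalization and the tensor shapes below are assumptions for illustration; the model card does not prescribe a specific scheme):

import torch

# Hypothetical sketch: per-property z-score normalization over a fine-tuning
# dataset of shape (n_molecules, 8). The data below are stand-in values.
properties = torch.randn(1000, 8)                  # replace with real property data
mean = properties.mean(dim=0, keepdim=True)
std = properties.std(dim=0, keepdim=True)
normalized_properties = (properties - mean) / std  # pass these during fine-tuning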

General improvements over the prior architecture and training include the use of a SwiGLU feedforward activation, Rotary Position Embeddings (RoPE), and a learning-rate warmup followed by linear annealing.

Copyright Triad National Security, LLC

Last Updated: 2026-04-16

Developed by

Andrew Salij

Contributed by

Wilton J.M. Kort-Kamp

Ivana Gonzales

Cristina Garcia Cardona

R. Seaton Ullberg

Megan Davis

Model Changelog

  • 2026-03-31 initial public version

Model short description

  • Generative Chemical Language Model for Small Molecules

Model description

χhem-GPT-2 is a Generative Chemical Language Model. It should be loaded and run in accordance with the AI4HE Python library (LANL, contact asalij@lanl.gov and kortkamp@lanl.gov).

It consists of approximately 40 million parameters with a transformer embedding dimension of 512 and 12 transformer layers (see hparams.json).
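As a rough consistency check, a back-of-the-envelope count (the SwiGLU hidden width and vocabulary size below are assumptions for illustration; consult hparams.json for the actual values) lands near the quoted figure:

# Back-of-the-envelope parameter estimate; d_ff and vocab_size are assumed.
d_model, n_layers = 512, 12
d_ff = 1408                               # assumed SwiGLU width near (8/3) * d_model
vocab_size = 3000                         # assumed Group SELFIES vocabulary size

attention = 4 * d_model * d_model         # Q, K, V, and output projections
swiglu_ffn = 3 * d_model * d_ff           # gate, up, and down projections
embedding = vocab_size * d_model          # token embedding (RoPE adds no parameters)

total = n_layers * (attention + swiglu_ffn) + embedding
print(f"~{total / 1e6:.1f}M parameters")  # roughly 40M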

TODO: χhem-GPT-2 may be integrated into the Hugging Face transformers library and released on the Hugging Face Hub, which would make it easier to use.

Model Type

XhemGPT

Inputs and outputs

  • Input: token IDs tensor. Use the associated tokenizer via GroupSelfiesTokenizer.encode(). Context window: 256.
  • Input (fine-tuning): properties tensor of the chosen properties. Property window: 8 (pad unused values with 0).
  • Output: token IDs tensor. Decode with the associated tokenizer via GroupSelfiesTokenizer.decode().
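A minimal sketch of assembling a single input (right-padding with a dedicated pad token ID is an assumption here; GroupSelfiesTokenizer.encode()/decode() are the calls named above):

import torch
import ai4he

CONTEXT_WINDOW = 256
PROPERTY_WINDOW = 8
PAD_ID = 0  # assumed pad token ID; check the released tokenizer

tokenizer = ai4he.tokenizer.GroupSelfiesTokenizer.load("<TOKENIZER_PATH>")
token_ids = tokenizer.encode("<GROUP_SELFIES_STRING>")[:CONTEXT_WINDOW]
token_ids = token_ids + [PAD_ID] * (CONTEXT_WINDOW - len(token_ids))
input_ids = torch.tensor([token_ids])                   # shape (1, 256)

targets = [1.2, -0.4, 0.7]                              # normalized properties (fine-tuning only)
targets = targets + [0.0] * (PROPERTY_WINDOW - len(targets))
property_tensor = torch.tensor([targets])               # shape (1, 8)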

Compute Infrastructure

This model was trained on the Chicoma cluster (Institutional Computing, LANL) via SLURM.

Hardware

This model was trained on 8 NVIDIA A100 GPUs on the Chicoma cluster at LANL.

Software

This model was trained with Transform Your World (TYW, formerly AI4HE), a custom machine learning library for training chemical language models (LANL; contact asalij@lanl.gov and kortkamp@lanl.gov). It is recommended to run the model through TYW, which may be sourced from GitHub.

It is recommended to install Pixi, navigate to the root directory of TYW, and run

pixi install 
pixi install --environment gpu

to install the necessary packages.

Papers and Scientific Outputs

Model License

The model is released under the MIT License. It has been approved for release under O# 5008.

© 2026. Triad National Security, LLC. All rights reserved.

This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

Contact Info and Model Card Authors

Wilton J.M. Kort-Kamp (kortkamp@lanl.gov)

Intended Use

This model is intended to be used for molecular generation and discovery. It is a foundation model that should be adaptable to particular regions of chemical space.

Primary Intended Users

This model is likely to be of interest to AI researchers, chemists, and materials scientists.

Mission Relevance

This model aims to help discover Materials for the Future.

Out-of-Scope Use Cases

This model is unlikely to generalize to very large (>100 heavy atom) structures.

This model should not be used to generate SMILES strings or other non-supported chemical languages.

This model cannot use natural language.

How to use

Install Instructions

See Software.

Training configuration

Upon installation of AI4HE, view ai4he.training.TrainingConfig for control details and ai4he.cli for CLI control of training specifications.

Inference configuration

Upon installation of AI4HE, view ai4he.generation, which provides functions for creating de novo structures from the model.

On a distributed system, the function ai4he.generation.get_generated_samples is a recommended interface.

Unstable: The XGPTLitAPI interface enables generation through a client-server architecture.

Code snippets of how to use the model

import ai4he

# load the model checkpoint and its tokenizer
xhem_gpt = ai4he.io_model.load_from_checkpoint("<MODEL_PATH>")
tokenizer = ai4he.tokenizer.GroupSelfiesTokenizer.load("<TOKENIZER_PATH>")

# unconditional generation: start every sample from the default start string(s)
sample_strings = ai4he.generation.UNCONDITIONAL_START
sample_generator = ai4he.generation.SampleGenerator(True, tokenizer, context_window = 256,
    sample_class = "group_selfies")
molecule_samples = ai4he.generation.get_generated_samples(xhem_gpt, sample_generator, sample_strings,
    n_samples = 1028, fabric = None, run_distributed = False,
    generation_kwargs = {"batch_size": 16, "top_k": None})

# molecule_samples will be a list[str] of Group SELFIES

Limitations

Risks

This model may be used to create molecules that carry potential risks.

Limitations

This model can only produce outputs in a single chemical language (i.e., Group SELFIES) and under the supported tokenization. Its embedding will need to be adapted to support additional encodings.
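If an additional encoding were required, the token embedding (and tied output head, if any) would need to be resized and the new rows trained. A generic PyTorch sketch of the idea, not the AI4HE/TYW API:

import torch
import torch.nn as nn

def resize_token_embedding(embedding: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Copy existing rows into a larger embedding; new rows start randomly initialized."""
    old_vocab_size, dim = embedding.weight.shape
    resized = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        resized.weight[:old_vocab_size] = embedding.weight
    return resized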

Training details

Training data

This model was pretrained on a subset of the SAFE dataset (DOI: 10.1039/D4DD00019F) of 8 million molecules (total, before splitting). These molecules were converted to SMILES, common molecular subgraphs were extracted, and the structures were then converted to Group SELFIES; molecules that failed to convert were discarded. The SAFE dataset has since been updated to v2; this model was trained on v1 (datamol-io/safe-gpt/tree/b83175cd7394).
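A hedged sketch of the per-molecule conversion step, assuming RDKit and the group-selfies package (the GroupGrammar calls and the grammar itself are assumptions; the actual extracted-subgraph grammar is part of the training pipeline):

from rdkit import Chem
from group_selfies import GroupGrammar  # API assumed from the group-selfies package

grammar = GroupGrammar.essential_set()  # stand-in for the dataset-derived grammar

def smiles_to_group_selfies(smiles: str):
    """Convert one SMILES string; return None (and discard the molecule) on failure."""
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return grammar.full_encoder(mol)
    except Exception:
        return None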

Training Procedure

This model was trained with a batch size of 128 for one epoch on this ordered dataset under an 80/10/10% train/test/validation split. The learning-rate schedule was a linear warmup (20% of training time) from 10% of the maximum learning rate, followed by a linear cooldown to 0, with a peak learning rate of 0.0005. The optimizer was AdamW (DOI: 10.48550/arXiv.1711.05101) with kwargs = {"betas": (0.9, 0.999), "weight_decay": 0.1, "amsgrad": True}.
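A minimal PyTorch sketch of that optimizer and schedule (the total step count and the placeholder model are illustrative only):

import torch

model = torch.nn.Linear(8, 8)   # placeholder; substitute the χhem-GPT-2 module
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=0.1, amsgrad=True
)

total_steps = 50_000            # illustrative; set to one epoch over the training split
warmup_steps = int(0.2 * total_steps)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        # linear warmup from 10% to 100% of the peak learning rate
        return 0.1 + 0.9 * step / warmup_steps
    # linear cooldown from 100% of the peak learning rate to 0
    return max(0.0, 1.0 - (step - warmup_steps) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)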

Reproducibility Information

  • Random seed used: 3047
  • Machine/environment info: 8 NVIDIA A100 GPUs across 2 nodes; PyTorch Lightning 2.5.1.post0 was used for Distributed Data Parallel (DDP) training (see the sketch below).
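A minimal PyTorch Lightning setup consistent with that environment (the precision setting is an assumption; the actual entry point is the CLI under Pre-training information):

import lightning.pytorch as pl

pl.seed_everything(3047)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,               # 4 GPUs per node
    num_nodes=2,             # 8 A100s in total
    strategy="ddp",
    precision="bf16-mixed",  # assumption; not stated in this card
)
# trainer.fit(model, datamodule=datamodule)  # placeholders for the actual module and data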

Pre-training information

CLI entry:

x_gpt_group_tv2.py --version 12 --nnodes 2 --modelindex 0

For a detailed suite of hyperparameters, see hparams.json.

Evaluation details

This model was tested on unconditioned generation of molecules and metrics pertaining to those molecules.

Evaluation data

Full data may be provided after reasonable request and according to the proper channels.

Evaluation Procedure

This model was evaluated by generating many molecules under various inference prompts and parameters (e.g., temperature) and benchmarking the generated molecules.

The recommended procedure is to benchmark as follows:

experiment = ai4he.experiment(preloaded_model_path = "<MODEL_PATH>",tokenizer_path = "<TOKENIZER_PATH>",context_window = 256,sample_class = "group_selfies")
benchmarkers = ai4he.benchmarking.HIGH_DETAILED_GROUP_BENCHMARK
benchmark_array = experiment.benchmark_model(benchmarkers,fname = "<PATH WITHOUT EXTENSION>",
                            n_samples = <NUMBER OF MOLECULES TO GENERATE>)

This will save CSVs with the relevant molecules and benchmarks. Note that an Experiment containing the model must be defined first, and that this experiment should be initialized with Lightning Fabric arguments linked to any desired loggers.

Uncertainty Quantification

Uncertainty may be grouped or pooled across multiple generations to obtain variances for validity, uniqueness, and other metrics.
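For example, pooled statistics can be computed directly from per-run benchmark values (the numbers below are illustrative; the benchmark CSVs produced above are the intended source):

import numpy as np

# Validity fraction from several independent generation runs (illustrative values).
validity_per_run = np.array([0.94, 0.91, 0.95, 0.93, 0.92])
print(f"validity = {validity_per_run.mean():.3f} ± {validity_per_run.std(ddof=1):.3f}")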

Evaluation results

We are not providing evaluation data for this specific model at this time (2025-11-20).

Please consult the related papers mentioned in the introduction for details on related models.

