Hyformer
Short description
Hyformer is a joint transformer-based model that unifies a generative decoder with a predictive encoder. Depending on the task, Hyformer uses either a causal or a bidirectional mask, outputting token probabilities or predicted property values. The model was developed by Adam Izdebski et al. and more information can be found on the GitHub repository and in the accompanying paper. This repository is a fork of their HuggingFace repository.
Model versions
- Hyformer_molecules_50M: Trained on 19M molecules from ZINC, ChEMBL, and other purchasable molecular datasets (Zhou et al., 2023)
- Hyformer_molecules_8M: Trained on GuacaMol dataset (Brown et al., 2019)
- Hyformer_peptides_34M: Trained on 3.5M general-purpose and antimicrobial peptides
- Hyformer_peptides_34M_MIC: Hyformer_peptides_34M jointly fine-tuned on minimal inhibitory concentration (MIC) values against E. coli bacteria
Long description
Hyformer is a transformer-based joint model that blends generative and predictive functionalities using an alternating attention mechanism and a joint pre-training scheme. The project shows that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction, and representation learning. It demonstrates the benefits of joint learning in a drug design use case: discovering novel antimicrobial peptides.
Metadata
Input
- Description: SMILES representations of chemicals
- Input format:
  - Shape: [n, 1], where n is the number of chemical compounds, each on a new line
  - Data format: [str]
- Example input file: input/sequences.smiles
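The expected input is a plain-text file with one SMILES string per line. A minimal sketch of creating such a file; the molecules below (aspirin, caffeine, ibuprofen) are illustrative examples, not taken from the repository:

```python
from pathlib import Path

# Hypothetical contents of input/sequences.smiles: one SMILES string per line.
Path("input").mkdir(exist_ok=True)
Path("input/sequences.smiles").write_text(
    "CC(=O)Oc1ccccc1C(=O)O\n"       # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C\n"  # caffeine
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O\n"  # ibuprofen
)

# Read the file back the same way the examples below do.
sequences = Path("input/sequences.smiles").read_text().splitlines()
print(len(sequences))  # 3
```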
Model
- Modality: String representations of chemical compounds in SMILES format
- Scale: Per chemical compound
- Description: The model generates chemical compounds, extracts features or makes predictions about property values.
Output
Prediction
- Description: Predicts property values either through classification or regression. Outputs one value per chemical compound.
- Output format: tensor
  - Shape: [n, 1], where n is the number of chemical compounds
  - Data format: (float)
Feature extraction
- Description: Each chemical compound is represented by a 512-dimensional vector.
- Output format: tensor
  - Shape: [n, 512], where n is the number of chemical compounds
  - Data format: (float)
Generation
- Description: A chemical compound in SMILES format
- Output format: tensor
  - Shape: [n, 128], where n is the number of chemical compounds and 128 is the maximal length they can have
  - Data format: (float)
Installation
Install the conda environment with all dependencies:
# Create the conda environment called virtual-human-chc-hyformer
conda env create -f environment.yaml
# Activate the environment
conda activate virtual-human-chc-hyformer
Example
Prediction example
from pathlib import Path
import torch
from huggingface_hub import hf_hub_download
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.models.auto import AutoModel
from hyformer.utils import set_seed
SEED = 1337
set_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")
def download(repo_id, filename):
return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)
sequences = Path("input/sequences.smiles").read_text().splitlines()
tokenizer = AutoTokenizer.from_config(
TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)
model = AutoModel.from_config(
ModelConfig.from_config_file(download(repo, "downstream_config.json")),
downstream_task="classification",
num_tasks=1,
)
model.load_pretrained(download(repo, "ckpt.pt"))
model = model.to_predictor(tokenizer, batch_size=128, device=device)
predictions = model.predict(sequences)
print(predictions) # Output: [[0.65190911], [0.58420199], [0.5933677]]
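For a classification head, the returned scores can be turned into class labels by thresholding. A minimal sketch, assuming `predictions` is a nested list of per-compound probabilities; the values below are hard-coded to the example output shown above:

```python
# Hypothetical post-processing: threshold classification scores at 0.5.
predictions = [[0.65190911], [0.58420199], [0.5933677]]

# One label per compound: 1 if the score is at least 0.5, else 0.
labels = [int(p[0] >= 0.5) for p in predictions]
print(labels)  # [1, 1, 1]
```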
Feature extraction example
from pathlib import Path
import torch
from huggingface_hub import hf_hub_download
from hyformer.models.auto import AutoModel
from hyformer.models.base import Encoder
from hyformer.utils import set_seed
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
from hyformer.utils.tokenizers.base import BaseTokenizer
SEED = 1337
set_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")
def download(repo_id, filename):
return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)
sequences = Path("input/sequences.smiles").read_text().splitlines()
tokenizer = AutoTokenizer.from_config(
TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)
model = AutoModel.from_config(
ModelConfig.from_config_file(download(repo, "model_config.json"))
)
model.load_pretrained(download(repo, "ckpt.pt"))
model.to(device)
model.eval()
featurizer = model.to_encoder(tokenizer, 128, device) # batch_size=128
embeddings = featurizer.encode(sequences)
print(embeddings)
# Output:
# [[ 0.12989292 -0.04472789 1.27521825 ... -0.31017503 -2.61905527
# -0.26748869]
# [ 0.04795801 -0.71846646 3.47797537 ... 2.37488675 -0.28063831
# 1.84492266]
# [-0.00499679 0.72711295 0.48343059 ... -1.17737067 0.93289232
# 0.32299849]]
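A common downstream use of the 512-dimensional embeddings is comparing compounds by cosine similarity. A minimal sketch; random vectors stand in for real Hyformer embeddings here, and NumPy is an assumed extra dependency:

```python
import numpy as np

# Stand-in for the [n, 512] embeddings produced by featurizer.encode(...).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))

# Normalize each row to unit length, then take pairwise dot products.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T  # [n, n] pairwise cosine similarities
print(similarity.shape)  # (3, 3)
```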
Generation example
from pathlib import Path
import torch
from huggingface_hub import hf_hub_download
from hyformer.models.auto import AutoModel
from hyformer.utils import set_seed
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
SEED = 1337
set_seed(SEED)
NUM_SAMPLES = 100 # Number of samples to generate
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")
def download(repo_id, filename):
return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)
tokenizer = AutoTokenizer.from_config(
TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)
model = AutoModel.from_config(
ModelConfig.from_config_file(download(repo, "model_config.json")),
)
model.load_pretrained(download(repo, "ckpt.pt"))
generator = model.to_generator(tokenizer, 256, 0.9, 25, device) # batch_size=256, temperature=0.9, top_k=25
sequences = generator.generate(NUM_SAMPLES)
print(sequences)
# Output:
#
# CCCOc1cccc(-c2nn(-c3ccccc3)cc2/C=C(/C#N)C2=[N+]c3ccccc3[N-]2)c1 O=C(c1ccccc1)c1cc([N+](=O)O)c(Sc2c([N+](=O)O)cc([N+](=O)O)cc2[N+](=O)O)cc1[N+](=O)O
# Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1 O=C(c1ccco1)N(Cc1ccccc1Cl)C[C@@H]1CC(c2ccc(Cl)o2)=NO1
# O=C(c1cccc(/N=C(\O)CCc2ccccc2)c1)[N+]1CCCCC1
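Sampled sequences may contain repeats, so a deduplication pass is often useful before further filtering. A minimal sketch using two of the example outputs above plus an artificial duplicate:

```python
# Hypothetical post-processing: deduplicate generated SMILES, preserving order.
sequences = [
    "Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1",
    "O=C(c1cccc(/N=C(\\O)CCc2ccccc2)c1)[N+]1CCCCC1",
    "Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1",  # duplicate
]

# dict.fromkeys keeps the first occurrence of each sequence, in order.
unique = list(dict.fromkeys(sequences))
print(len(unique))  # 2
```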
References
- Adam Izdebski et al. "Synergistic Benefits of Joint Molecule Generation and Property Prediction" (arXiv)
- Hugging Face repository: https://huggingface.co/SzczurekLab/hyformer_molecules_50M
- Hugging Face repository (fork): https://huggingface.co/virtual-human-chc/hyformer_molecules_50M
- GitHub repository: https://github.com/szczurek-lab/hyformer/tree/main?tab=readme-ov-file
- Brown, Nathan, et al. "GuacaMol: benchmarking models for de novo molecular design." Journal of chemical information and modeling, 2019.
- Zhou, Gengmo, et al. "Uni-mol: A universal 3d molecular representation learning framework." ICLR, 2023.
Copyright
Code derived from https://github.com/szczurek-lab/hyformer/tree/main and https://huggingface.co/SzczurekLab/hyformer_molecules_50M is licensed under the BSD 3-Clause License, © 2023 szczurek-lab. Additional code © 2025 Maksim Pavlov, licensed under the MIT License.