Hyformer
Short description
Hyformer is a joint transformer-based model that unifies a generative decoder with a predictive encoder. Depending on the task, Hyformer uses either a causal or a bidirectional mask, outputting token probabilities or predicted property values. The model was developed by Adam Izdebski et al. and more information can be found on the GitHub repository and in the accompanying paper. This repository is a fork of their HuggingFace repository.
Model versions
- Hyformer_molecules_50M: Trained on 19M molecules from ZINC, ChEMBL, and other purchasable molecular datasets (Zhou et al., 2023)
- Hyformer_molecules_8M: Trained on GuacaMol dataset (Brown et al., 2019)
- Hyformer_peptides_34M: Trained on 3.5M general-purpose and antimicrobial peptides
- Hyformer_peptides_34M_MIC: Hyformer_peptides_34M jointly fine-tuned on minimal inhibitory concentration (MIC) values against E. coli bacteria
Long description
Hyformer is a transformer-based joint model that blends generative and predictive functionalities using an alternating attention mechanism and a joint pre-training scheme. The project shows that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction, and representation learning. It demonstrates the benefits of joint learning in a drug design use case: discovering novel antimicrobial peptides.
Metadata
Input
- Description: SMILES representations of chemicals
- Input format:
  - Shape: [n, 1], where n is the number of chemical compounds, each on a new line
  - Data format: [str]
- Example input file: input/sequences.smiles
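The expected input is a plain-text file with one SMILES string per line. A minimal sketch of creating such a file; the molecules below (aspirin, caffeine, ibuprofen) are illustrative examples, not taken from the repository:

```python
from pathlib import Path

# Hypothetical contents of input/sequences.smiles: one SMILES string per line.
Path("input").mkdir(exist_ok=True)
Path("input/sequences.smiles").write_text(
    "CC(=O)Oc1ccccc1C(=O)O\n"       # aspirin
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C\n"  # caffeine
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O\n"  # ibuprofen
)

# Read the file back the same way the examples below do.
sequences = Path("input/sequences.smiles").read_text().splitlines()
print(len(sequences))  # 3
```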
Model
- Modality: String representations of chemical compounds in SMILES format
- Scale: Per chemical compound
- Description: The model generates chemical compounds, extracts features or makes predictions about property values.
Output
Prediction
- Description: Predicts property values either through classification or regression. Outputs one value per chemical compound.
- Output format: tensor
  - Shape: [n, 1], where n is the number of chemical compounds
  - Data format: (float)
Feature extraction
- Description: Each chemical compound is represented by a 512-dimensional vector.
- Output format: tensor
  - Shape: [n, 512], where n is the number of chemical compounds
  - Data format: (float)
Generation
- Description: A chemical compound in SMILES format
- Output format: tensor
  - Shape: [n, 128], where n is the number of chemical compounds and 128 is the maximal length they can have
  - Data format: (float)
Installation
Install the conda environment with all dependencies:
# Create the conda environment called virtual-human-chc-hyformer
conda env create -f environment.yaml
# Activate the environment
conda activate virtual-human-chc-hyformer
Example
Prediction example
from pathlib import Path
import torch
from huggingface_hub import hf_hub_download
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.models.auto import AutoModel
from hyformer.utils import set_seed
SEED = 1337
set_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")
def download(repo_id, filename):
return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)
sequences = Path("input/sequences.smiles").read_text().splitlines()
tokenizer = AutoTokenizer.from_config(
TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)
model = AutoModel.from_config(
ModelConfig.from_config_file(download(repo, "downstream_config.json")),
downstream_task="classification",
num_tasks=1,
)
model.load_pretrained(download(repo, "ckpt.pt"))
model = model.to_predictor(tokenizer, batch_size=128, device=device)
predictions = model.predict(sequences)
print(predictions) # Output: [[0.65190911], [0.58420199], [0.5933677]]
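For a classification head, the returned scores can be turned into class labels by thresholding. A minimal sketch, assuming `predictions` is a nested list of per-compound probabilities; the values below are hard-coded to the example output shown above:

```python
# Hypothetical post-processing: threshold classification scores at 0.5.
predictions = [[0.65190911], [0.58420199], [0.5933677]]

# One label per compound: 1 if the score is at least 0.5, else 0.
labels = [int(p[0] >= 0.5) for p in predictions]
print(labels)  # [1, 1, 1]
```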
Feature extraction example
from pathlib import Path
import torch
from huggingface_hub import hf_hub_download
from hyformer.models.auto import AutoModel
from hyformer.models.base import Encoder
from hyformer.utils import set_seed
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
from hyformer.utils.tokenizers.base import BaseTokenizer
SEED = 1337
set_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")
def download(repo_id, filename):
return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)
sequences = Path("input/sequences.smiles").read_text().splitlines()
tokenizer = AutoTokenizer.from_config(
TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)
model = AutoModel.from_config(
ModelConfig.from_config_file(download(repo, "model_config.json"))
)
model.load_pretrained(download(repo, "ckpt.pt"))
model.to(device)
model.eval()
featurizer = model.to_encoder(tokenizer, 128, device) # batch_size=128
embeddings = featurizer.encode(sequences)
print(embeddings)
# Output:
# [[ 0.12989292 -0.04472789 1.27521825 ... -0.31017503 -2.61905527
# -0.26748869]
# [ 0.04795801 -0.71846646 3.47797537 ... 2.37488675 -0.28063831
# 1.84492266]
# [-0.00499679 0.72711295 0.48343059 ... -1.17737067 0.93289232
# 0.32299849]]
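A common downstream use of the 512-dimensional embeddings is comparing compounds by cosine similarity. A minimal sketch; random vectors stand in for real Hyformer embeddings here, and NumPy is an assumed extra dependency:

```python
import numpy as np

# Stand-in for the [n, 512] embeddings produced by featurizer.encode(...).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))

# Normalize each row to unit length, then take pairwise dot products.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T  # [n, n] pairwise cosine similarities
print(similarity.shape)  # (3, 3)
```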
Generation example
from pathlib import Path
import torch
from huggingface_hub import hf_hub_download
from hyformer.models.auto import AutoModel
from hyformer.utils import set_seed
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
SEED = 1337
set_seed(SEED)
NUM_SAMPLES = 100 # Number of samples to generate
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")
def download(repo_id, filename):
return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)
tokenizer = AutoTokenizer.from_config(
TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)
model = AutoModel.from_config(
ModelConfig.from_config_file(download(repo, "model_config.json")),
)
model.load_pretrained(download(repo, "ckpt.pt"))
generator = model.to_generator(tokenizer, 256, 0.9, 25, device) # batch_size=256, temperature=0.9, top_k=25
sequences = generator.generate(NUM_SAMPLES)
print(sequences)
# Output:
#
# CCCOc1cccc(-c2nn(-c3ccccc3)cc2/C=C(/C#N)C2=[N+]c3ccccc3[N-]2)c1 O=C(c1ccccc1)c1cc([N+](=O)O)c(Sc2c([N+](=O)O)cc([N+](=O)O)cc2[N+](=O)O)cc1[N+](=O)O
# Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1 O=C(c1ccco1)N(Cc1ccccc1Cl)C[C@@H]1CC(c2ccc(Cl)o2)=NO1
# O=C(c1cccc(/N=C(\O)CCc2ccccc2)c1)[N+]1CCCCC1
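Sampled sequences may contain repeats, so a deduplication pass is often useful before further filtering. A minimal sketch using two of the example outputs above plus an artificial duplicate:

```python
# Hypothetical post-processing: deduplicate generated SMILES, preserving order.
sequences = [
    "Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1",
    "O=C(c1cccc(/N=C(\\O)CCc2ccccc2)c1)[N+]1CCCCC1",
    "Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1",  # duplicate
]

# dict.fromkeys keeps the first occurrence of each sequence, in order.
unique = list(dict.fromkeys(sequences))
print(len(unique))  # 2
```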
References
- Adam Izdebski et al. "Synergistic Benefits of Joint Molecule Generation and Property Prediction" (arXiv)
- Hugging Face repository: https://huggingface.co/SzczurekLab/hyformer_molecules_50M
- Hugging Face repository (fork): https://huggingface.co/virtual-human-chc/hyformer_molecules_50M
- GitHub repository: https://github.com/szczurek-lab/hyformer/tree/main?tab=readme-ov-file
- Brown, Nathan, et al. "GuacaMol: benchmarking models for de novo molecular design." Journal of chemical information and modeling, 2019.
- Zhou, Gengmo, et al. "Uni-mol: A universal 3d molecular representation learning framework." ICLR, 2023.
Copyright
Code derived from https://github.com/szczurek-lab/hyformer/tree/main and https://huggingface.co/SzczurekLab/hyformer_molecules_50M is licensed under the BSD 3-Clause License, © 2023 szczurek-lab. Additional code © 2025 Maksim Pavlov, licensed under the MIT License.