Hyformer

Short description

Hyformer is a joint transformer-based model that unifies a generative decoder with a predictive encoder. Depending on the task, Hyformer applies either a causal or a bidirectional attention mask and outputs token probabilities or predicted property values. The model was developed by Adam Izdebski et al.; more information can be found in the GitHub repository and the accompanying paper. This repository is a fork of their Hugging Face repository.
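
The two attention modes can be illustrated with a minimal sketch. The following is a hypothetical illustration of causal vs. bidirectional masks in PyTorch, not Hyformer's actual implementation:

import torch

seq_len = 5

# Causal mask (generation): position i attends only to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional mask (prediction): every position attends to all positions
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask)
print(bidirectional_mask)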

Model versions

Long description

Hyformer is a transformer-based joint model that blends generative and predictive functionalities using an alternating attention mechanism and a joint pre-training scheme. The project shows that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction, and representation learning. The authors demonstrate the benefits of joint learning in a drug design use case: discovering novel antimicrobial peptides.

Metadata

Input

  • Description: SMILES representations of chemicals
  • Input format:
    • Shape: [n, 1], where n is the number of chemical compounds, one per line
    • Data format: [str]
  • Example input file: input/sequences.smiles (an illustrative layout is shown below)
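
For illustration, the input file contains one SMILES string per line; a hypothetical example (not the contents of the actual example file):

CCO
CC(=O)Oc1ccccc1C(=O)O
c1ccccc1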

Model

  • Modality: String representations of chemical compounds in SMILES format
  • Scale: Per chemical compound
  • Description: The model generates chemical compounds, extracts features, or predicts property values.

Output

Prediction

  • Description: Predicts property values either through classification or regression. Outputs one value per chemical compound.
  • Output format: tensor
    • Shape: [n, 1], where n is the number of chemical compounds
    • Data format: (float)

Feature extraction

  • Description: Each chemical compound is represented by a 512-dimensional vector.
  • Output format: tensor
    • Shape: [n, 512], where n is the number of chemical compounds
    • Data format: (float)

Generation

  • Description: Generated chemical compounds in SMILES format
  • Output format: tensor
    • Shape: [n, 128], where n is the number of chemical compounds and 128 is the maximum sequence length
    • Data format: (float)

Installation

Install the conda environment with all dependencies:

# Create the conda environment called virtual-human-chc-hyformer
conda env create -f environment.yaml

# Activate the environment
conda activate virtual-human-chc-hyformer
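
To verify the setup, a quick import check can be run inside the activated environment (a sanity-check sketch; the package name hyformer is assumed from the imports used in the examples below):

# Run inside the activated conda environment
import torch
import hyformer  # package name assumed from the example imports below

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())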

Example

Prediction example

from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.models.auto import AutoModel
from hyformer.utils import set_seed

SEED = 1337
set_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")

def download(repo_id, filename):
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)

sequences = Path("input/sequences.smiles").read_text().splitlines()

tokenizer = AutoTokenizer.from_config(
    TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)

model = AutoModel.from_config(
    ModelConfig.from_config_file(download(repo, "downstream_config.json")),
    downstream_task="classification",
    num_tasks=1,
)

model.load_pretrained(download(repo, "ckpt.pt"))
model = model.to_predictor(tokenizer, batch_size=128, device=device)

predictions = model.predict(sequences)
print(predictions) # Output: [[0.65190911], [0.58420199], [0.5933677]]
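
The printed values suggest the classifier returns probabilities; for hard class labels they can be thresholded. A minimal post-processing sketch, assuming predictions behaves like a NumPy array of shape [n, 1]:

import numpy as np

# Threshold probabilities at 0.5 to obtain binary class labels
labels = (np.asarray(predictions) > 0.5).astype(int)
print(labels)  # [[1], [1], [1]] for the probabilities above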

Feature extraction example

from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

from hyformer.models.auto import AutoModel
from hyformer.models.base import Encoder
from hyformer.utils import set_seed
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig
from hyformer.utils.tokenizers.base import BaseTokenizer

SEED = 1337
set_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")

def download(repo_id, filename):
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)

sequences = Path("input/sequences.smiles").read_text().splitlines()

tokenizer = AutoTokenizer.from_config(
    TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)

model = AutoModel.from_config(
    ModelConfig.from_config_file(download(repo, "model_config.json"))
)

model.load_pretrained(download(repo, "ckpt.pt"))
model.to(device)
model.eval()

featurizer = model.to_encoder(tokenizer, 128, device) # batch_size=128
embeddings = featurizer.encode(sequences)
print(embeddings) 

# Output:
# [[ 0.12989292 -0.04472789  1.27521825 ... -0.31017503 -2.61905527
#   -0.26748869]
#  [ 0.04795801 -0.71846646  3.47797537 ...  2.37488675 -0.28063831
#    1.84492266]
#  [-0.00499679  0.72711295  0.48343059 ... -1.17737067  0.93289232
#    0.32299849]]
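
The embeddings can be used directly as molecular features, for example in similarity search. A minimal sketch computing pairwise cosine similarities, assuming embeddings is a NumPy array of shape [n, 512]:

import numpy as np

emb = np.asarray(embeddings)

# Normalize each 512-dimensional vector to unit length
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Pairwise cosine similarity between all compounds
similarity = unit @ unit.T
print(similarity.shape)  # (n, n)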

Generation example

from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

from hyformer.models.auto import AutoModel
from hyformer.utils import set_seed
from hyformer.utils.tokenizers.auto import AutoTokenizer
from hyformer.configs.tokenizer import TokenizerConfig
from hyformer.configs.model import ModelConfig 

SEED = 1337
set_seed(SEED)
NUM_SAMPLES = 100 # Number of samples to generate
device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "virtual-human-chc/hyformer_molecules_50M"
local = Path("hyformer_molecules_50M")

def download(repo_id, filename):
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local)

tokenizer = AutoTokenizer.from_config(
    TokenizerConfig.from_config_file(download(repo, "tokenizer_config.json"))
)

model = AutoModel.from_config(
    ModelConfig.from_config_file(download(repo, "model_config.json")),
)

model.load_pretrained(download(repo, "ckpt.pt"))
generator = model.to_generator(tokenizer, 256, 0.9, 25, device) # batch_size=256, temperature=0.9, top_k=25
sequences = generator.generate(NUM_SAMPLES)
print(sequences)

# Output:
#
# CCCOc1cccc(-c2nn(-c3ccccc3)cc2/C=C(/C#N)C2=[N+]c3ccccc3[N-]2)c1 O=C(c1ccccc1)c1cc([N+](=O)O)c(Sc2c([N+](=O)O)cc([N+](=O)O)cc2[N+](=O)O)cc1[N+](=O)O 
# Nc1ncc(CN2CCC3(CC2)C[C@H](c2ccccc2)CN(C2CC2)C3)cn1 O=C(c1ccco1)N(Cc1ccccc1Cl)C[C@@H]1CC(c2ccc(Cl)o2)=NO1
# O=C(c1cccc(/N=C(\O)CCc2ccccc2)c1)[N+]1CCCCC1
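
Generated SMILES strings are not guaranteed to be chemically valid, so a validity filter is a common post-processing step. A sketch using RDKit (an extra dependency, not part of the installation above), assuming generate returns a list of SMILES strings as the printed output suggests:

from rdkit import Chem

# Keep only SMILES that RDKit can parse into a molecule
valid = [s for s in sequences if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(sequences)} generated SMILES are valid")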

References

  1. Izdebski, Adam, et al. "Synergistic Benefits of Joint Molecule Generation and Property Prediction." arXiv.
  2. Hugging Face repository: https://huggingface.co/SzczurekLab/hyformer_molecules_50M
  3. Hugging Face repository (fork): https://huggingface.co/virtual-human-chc/hyformer_molecules_50M
  4. GitHub repository: https://github.com/szczurek-lab/hyformer/tree/main?tab=readme-ov-file
  5. Brown, Nathan, et al. "GuacaMol: Benchmarking Models for De Novo Molecular Design." Journal of Chemical Information and Modeling, 2019.
  6. Zhou, Gengmo, et al. "Uni-Mol: A Universal 3D Molecular Representation Learning Framework." ICLR, 2023.

Copyright

Code derived from https://github.com/szczurek-lab/hyformer/tree/main and https://huggingface.co/SzczurekLab/hyformer_molecules_50M is licensed under the BSD 3-Clause License, © 2023 szczurek-lab. Additional code © 2025 Maksim Pavlov, licensed under the MIT License.
