---
license: mit
pipeline_tag: any-to-any
library_name: transformers
---
# InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design
Paper • Project • Quickstart • Citation
## Model Description
InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning. It can integrate multimodal biomolecules as input, enabling researchers to articulate design goals in natural language and receive biomolecular outputs that meet precise biological needs.
For detailed information, please refer to our paper and code repository.
## Released Variants
| Model Name | Stage | Multimodal | Description |
|---|---|---|---|
| InstructBioMol-base | Pretraining | ❎ | Continual pretrained model on molecular sequences, protein sequences, and scientific literature. |
| InstructBioMol-instruct-stage1 | Instruction tuning (stage 1) | ✅ | Stage 1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| InstructBioMol-instruct (this model) | Instruction tuning (stages 1 and 2) | ✅ | Fully instruction-tuned model (stages 1 & 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
## Training Details
**Base Architecture:** InstructBioMol-instruct-stage1

**Training Data:**

1. Molecule - Natural Language Alignment:
   - 52K entries from ChEBI
2. Protein - Natural Language Alignment:
   - 2 million entries from UniProt (Swiss-Prot)
3. Molecule - Protein Alignment:
   - 1 million entries from BindingDB and Rhea

**Training Objective:** Instruction tuning
## Quickstart
You can easily load and use the model with the transformers library.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("hicai-zju/InstructBioMol-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "hicai-zju/InstructBioMol-instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example usage for text generation with a protein sequence
input_text = "What is the function of the protein with sequence: <PROT>MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR<PROT>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
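If you query the model with many protein sequences, a small helper can keep prompts consistent. The sketch below simply composes prompts in the `<PROT>`-delimited format used in the quickstart example above; the delimiter token and question phrasing are taken from that example and should be treated as assumptions, not a documented prompt API.

```python
# Minimal sketch of a prompt helper, assuming the <PROT>...<PROT>
# delimiter format shown in the quickstart example above.

PROT_TOKEN = "<PROT>"  # special marker copied from the quickstart; assumed, not documented


def make_protein_prompt(question: str, sequence: str) -> str:
    """Attach a natural-language question to an amino-acid sequence
    wrapped in <PROT> markers."""
    return f"{question} {PROT_TOKEN}{sequence}{PROT_TOKEN}"


prompt = make_protein_prompt(
    "What is the function of the protein with sequence:",
    "MVLSPADKTNVKAAW",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(...)` exactly as `input_text` is in the quickstart.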
## Citation
```bibtex
@article{DBLP:journals/corr/abs-2410-07919,
  author  = {Xiang Zhuang and
             Keyan Ding and
             Tianwen Lyu and
             Yinuo Jiang and
             Xiaotong Li and
             Zhuoyi Xiang and
             Zeyuan Wang and
             Ming Qin and
             Kehua Feng and
             Jike Wang and
             Qiang Zhang and
             Huajun Chen},
  title   = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
             Human Instructions},
  journal = {CoRR},
  volume  = {abs/2410.07919},
  year    = {2024}
}
```