---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 3B

DPLM-2 is a multimodal diffusion protein language model for jointly modeling,
understanding, and generating protein sequences and structures. It extends the
discrete diffusion protein language model family from sequence-only protein
language modeling to sequence-structure modeling, enabling protein
sequence-structure co-generation and conditional generation tasks such as
folding, inverse folding, and motif scaffolding.

This repository contains the 3B-parameter DPLM-2 checkpoint. For the official
implementation, installation instructions, generation scripts, training
configuration, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Model type:** Multimodal discrete diffusion protein language model
- **Checkpoint:** `airkingbd/dplm2_3b`
- **Architecture:** ESM-style transformer for DPLM-2 (`EsmForDPLM2`)
- **Scale:** 3B parameters, 36 transformer layers, hidden size 2560, 40
  attention heads
- **Vocabulary:** 8,229 tokens, covering amino-acid tokens, structure tokens,
  and special tokens
- **Base initialization:** DPLM-2 training is initialized from the pretrained
  DPLM sequence model `airkingbd/dplm_3b`
- **Structure tokenizer:** Uses the DPLM structure tokenizer
  (`airkingbd/struct_tokenizer`) for structure-token based modeling and PDB
  reconstruction
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)

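As a rough sanity check on the scale figures listed above, the following sketch estimates the parameter count from the layer count, hidden size, and vocabulary size. It assumes a standard transformer block with a 4x feed-forward expansion and ignores biases, layer norms, and any output head; these are assumptions made for illustration, not values read from the checkpoint configuration.

```python
# Back-of-the-envelope parameter estimate from the figures in Model Details.
# Assumption (not stated in the model card): each block has four
# hidden x hidden attention projections and a feed-forward network with a
# 4x expansion; biases, layer norms, and output heads are ignored.

NUM_LAYERS = 36
HIDDEN_SIZE = 2560
NUM_HEADS = 40
VOCAB_SIZE = 8229
FFN_MULTIPLIER = 4  # assumed, not taken from the checkpoint config


def estimate_parameters() -> dict:
    """Return a rough breakdown of parameter counts under the assumptions above."""
    attention = 4 * HIDDEN_SIZE * HIDDEN_SIZE  # Q, K, V, and output projections
    feed_forward = 2 * HIDDEN_SIZE * (FFN_MULTIPLIER * HIDDEN_SIZE)  # up + down
    per_layer = attention + feed_forward
    embeddings = VOCAB_SIZE * HIDDEN_SIZE
    return {
        "head_dim": HIDDEN_SIZE // NUM_HEADS,
        "per_layer": per_layer,
        "embeddings": embeddings,
        "total": NUM_LAYERS * per_layer + embeddings,
    }


estimate = estimate_parameters()
print(f"head dimension: {estimate['head_dim']}")
print(f"estimated total parameters: {estimate['total'] / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 2.85 billion parameters, consistent with the "3B" label, and each attention head has dimension 64.
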
## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the pretrained DPLM-2 checkpoint:

```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_3b").cuda()
dplm2 = dplm2.eval()
```

### Sequence-Structure Co-Generation

The official repository provides `generate_dplm2.py` for co-generation. The
default DPLM-2 sampling strategy is `annealing@2.0:0.1`, which starts with high
sampling temperature for diversity and anneals to a lower temperature for
designability.

```bash
model_name=dplm2_3b
sampling_strategy=annealing@2.0:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
```

Generated sequences and structures are saved under
`generation-results/dplm2_3b/co_generation`. The official repository also
includes evaluation utilities for TM-score, RMSD, diversity, and related
structure metrics.

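To make the strategy string concrete, here is a minimal sketch of how `annealing@2.0:0.1` could be parsed and turned into a per-iteration temperature. The function names and the linear interpolation are assumptions for illustration; the actual schedule shape used by `generate_dplm2.py` is not specified here and may differ.

```python
from typing import Tuple


def parse_annealing(strategy: str) -> Tuple[float, float]:
    """Split a string such as 'annealing@2.0:0.1' into (start, end) temperatures."""
    name, _, params = strategy.partition("@")
    if name != "annealing" or ":" not in params:
        raise ValueError(f"not an annealing strategy: {strategy!r}")
    start, end = (float(value) for value in params.split(":"))
    return start, end


def temperature_at(step: int, max_iter: int, start: float, end: float) -> float:
    """Linearly interpolate from start to end over max_iter steps (assumed shape)."""
    fraction = step / max(max_iter - 1, 1)
    return start + (end - start) * fraction


start_temp, end_temp = parse_annealing("annealing@2.0:0.1")
for step in (0, 250, 499):
    print(step, round(temperature_at(step, 500, start_temp, end_temp), 3))
```

Early iterations sample at a temperature near 2.0, which favors diversity, while the final iterations sample near 0.1, which favors high-confidence tokens.
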
### Forward Folding

DPLM-2 can generate structures conditioned on input amino-acid sequences. The
official scripts use deterministic argmax decoding for 100 diffusion iterations:

```bash
model_name=dplm2_3b
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

For custom sequences, provide a FASTA file via `--input_fasta_path`.

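A custom input file can be prepared with a few lines of Python. The sketch below writes a FASTA file using only the standard library; the record name, the example sequence, and the output file name are placeholders, and the restriction to the 20 standard amino acids is an assumption rather than a documented requirement of the DPLM-2 tokenizer.

```python
from pathlib import Path
from typing import Dict

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def write_fasta(records: Dict[str, str], path: str) -> Path:
    """Write {name: sequence} pairs as FASTA, one header and one sequence line each."""
    lines = []
    for name, sequence in records.items():
        sequence = sequence.strip().upper()
        unknown = set(sequence) - STANDARD_AMINO_ACIDS
        if not sequence or unknown:
            raise ValueError(f"invalid sequence for {name!r}: {sorted(unknown)}")
        lines.append(f">{name}")
        lines.append(sequence)
    out_path = Path(path)
    out_path.write_text("\n".join(lines) + "\n")
    return out_path


if __name__ == "__main__":
    # Placeholder record; replace with your own protein name and sequence.
    write_fasta({"my_protein": "MKTAYIAKQRQISFVKSHFSRQ"}, "my_sequences.fasta")
```

The resulting file can then be passed to the folding command as `--input_fasta_path my_sequences.fasta`.
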
### Inverse Folding

DPLM-2 can predict amino-acid sequences conditioned on tokenized protein
structures:

```bash
model_name=dplm2_3b
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

To use a custom structure, first tokenize PDB files with the structure tokenizer:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/your/input/structure \
    --output_dir /path/to/your/input/structure/tokenized_protein
```

Then pass the generated `struct.fasta` to `generate_dplm2.py`.

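The folding and inverse-folding commands both combine `--unmasking_strategy deterministic` with `--sampling_strategy argmax`. One common way to realize this combination in masked discrete diffusion is confidence-ranked unmasking: at every iteration the most likely token is chosen for each masked position, and only the most confident predictions are committed. The sketch below illustrates that idea with a toy stand-in for the model; `MASK`, `toy_predict`, and the per-step reveal count are hypothetical and are not taken from the DPLM code.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical placeholder symbol

Prediction = Tuple[str, float]  # (most likely token, its confidence)


def toy_predict(tokens: Sequence[str]) -> List[Prediction]:
    """Stand-in for a model: always proposes 'A', more confidently near position 0."""
    return [("A", 1.0 / (1 + i)) for i in range(len(tokens))]


def deterministic_argmax_decode(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str]], List[Prediction]],
    max_iter: int,
) -> List[str]:
    """Iteratively fill masked positions, committing the most confident ones first."""
    tokens = list(tokens)
    for step in range(max_iter):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        predictions = predict(tokens)
        # Reveal enough positions so that nothing stays masked after the last step.
        num_to_reveal = math.ceil(len(masked) / (max_iter - step))
        by_confidence = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in by_confidence[:num_to_reveal]:
            tokens[i] = predictions[i][0]  # argmax token, no random sampling
    return tokens


partially_masked = ["M", MASK, MASK, "K", MASK, MASK, MASK]
decoded = deterministic_argmax_decode(partially_masked, toy_predict, max_iter=3)
print(decoded)
```

Because no random sampling is involved, repeated runs on the same input produce the same output, and positions that were given as conditioning (here `M` and `K`) are never overwritten.
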
### Motif Scaffolding

DPLM-2 supports multimodal motif scaffolding by conditioning on both the
sequence and structure tokens of the motif and co-generating the scaffold
sequence and structure:

```bash
model_name=dplm2_3b
output_dir=./generation-results/${model_name}/motif_scaffold

python run/scaffold_generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --num_seqs 100 \
    --saveto ${output_dir}
```

See the official repository for required motif data preparation and evaluation
steps.

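Conceptually, the scaffolding input keeps the motif tokens fixed in both the sequence and structure tracks and leaves every scaffold position masked for the model to fill in. The sketch below builds such a pair of templates for a single contiguous motif; the mask symbol, the structure-token strings, and the helper name are hypothetical and do not reflect the actual tokenization used by `scaffold_generate_dplm2.py`.

```python
from typing import List, Sequence, Tuple

MASK = "<mask>"  # hypothetical placeholder symbol


def build_scaffold_template(
    motif_seq: Sequence[str],
    motif_struct: Sequence[str],
    left_len: int,
    right_len: int,
) -> Tuple[List[str], List[str]]:
    """Return (sequence track, structure track) with the motif fixed and the rest masked."""
    if len(motif_seq) != len(motif_struct):
        raise ValueError("motif sequence and structure tokens must align one-to-one")
    if left_len < 0 or right_len < 0:
        raise ValueError("scaffold lengths must be non-negative")
    left = [MASK] * left_len
    right = [MASK] * right_len
    seq_track = left + list(motif_seq) + right
    struct_track = left + list(motif_struct) + right
    return seq_track, struct_track


# Illustrative motif: three residues with made-up structure-token names.
seq_track, struct_track = build_scaffold_template(
    motif_seq=list("HEH"),
    motif_struct=["s12", "s7", "s301"],
    left_len=2,
    right_len=3,
)
print(seq_track)
print(struct_track)
```

Both tracks have the same length, and the masked positions on either side of the motif are the ones that co-generation would fill with scaffold residues and structure tokens.
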
## Training Data and Training Procedure

DPLM-2 is trained on experimental structures from PDB and AF2-predicted
structures from SwissProt. The authors provide the preprocessed training dataset
on Hugging Face as
[airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).

The official DPLM repository describes the following training setup for
`dplm2_3b`:

- Initialize from the pretrained DPLM checkpoint `airkingbd/dplm_3b`
- Use a warm-up training strategy to mitigate the scarcity of structure data
- Use LoRA to limit large parameter shifts during multimodal training
- Use `airkingbd/struct_tokenizer` for structure tokenization

The experiment configuration is available in the official repository at
`configs/experiment/dplm2/dplm2_3b.yaml`.

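The role of LoRA in the setup above can be illustrated with a small, dependency-free sketch. In LoRA, a frozen weight matrix W is augmented with a low-rank update (alpha / r) * B @ A, where B starts at zero, so the adapted model initially matches the pretrained one and only the small A and B matrices are trained. The rank and scaling values below are hypothetical, chosen for illustration, and are not taken from `dplm2_3b.yaml`.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_update(b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Low-rank weight update (alpha / rank) * B @ A that is added to the frozen W."""
    scale = alpha / rank
    return [[scale * value for value in row] for row in matmul(b, a)]


def lora_parameter_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


random.seed(0)
d_model, rank, alpha = 8, 2, 4.0  # toy sizes, hypothetical
lora_a = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_model)]  # zero init: no shift at step 0
initial_update = lora_update(lora_b, lora_a, alpha, rank)

full_params = 2560 * 2560  # one hidden x hidden projection at this model's width
adapter_params = lora_parameter_count(2560, 2560, rank=16)  # rank 16 is assumed
print(f"adapter/full parameter ratio: {adapter_params / full_params:.4f}")
```

With a hypothetical rank of 16, the adapter for one 2560 x 2560 projection has 81,920 trainable parameters, about 1.25% of the 6,553,600 in the full matrix, which is why the pretrained sequence weights stay close to their initialization.
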
## Evaluation Summary

The DPLM repository reports DPLM-2 results on multiple protein generation and
understanding tasks, including sequence-structure co-generation, forward
folding, inverse folding, motif scaffolding, and representation learning. For
full tables, baselines, metrics, and evaluation details, refer to the
[DPLM-2 paper](https://arxiv.org/abs/2410.13782), the
[DPLM-2.1 paper](https://arxiv.org/abs/2504.11454), and the official
[bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt,
EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and
OpenFold-related structure modeling utilities. See the official repository for
the complete acknowledgements and implementation details.