---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 3B

DPLM-2 is a multimodal diffusion protein language model for jointly modeling, understanding, and generating protein sequences and structures. It extends the discrete diffusion protein language model family from sequence-only protein language modeling to sequence-structure modeling, enabling protein sequence-structure co-generation and conditional generation tasks such as folding, inverse folding, and motif scaffolding.

This repository contains the 3B-parameter DPLM-2 checkpoint. For the official implementation, installation instructions, generation scripts, training configuration, and evaluation utilities, see the [bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Model type:** Multimodal discrete diffusion protein language model
- **Checkpoint:** `airkingbd/dplm2_3b`
- **Architecture:** ESM-style transformer for DPLM-2 (`EsmForDPLM2`)
- **Scale:** 3B parameters, 36 transformer layers, hidden size 2560, 40 attention heads
- **Vocabulary:** 8,229 tokens, covering amino-acid tokens, structure tokens, and special tokens
- **Base initialization:** DPLM-2 training is initialized from the pretrained DPLM sequence model `airkingbd/dplm_3b`
- **Structure tokenizer:** Uses the DPLM structure tokenizer (`airkingbd/struct_tokenizer`) for structure-token based modeling and PDB reconstruction
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)

## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm

bash scripts/install.sh
```

Load the pretrained DPLM-2 checkpoint:

```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_3b").cuda()
dplm2 = dplm2.eval()
```

### Sequence-Structure Co-Generation

The official repository provides `generate_dplm2.py` for co-generation. The default DPLM-2 sampling strategy is `annealing@2.0:0.1`, which starts with high sampling temperature for diversity and anneals to a lower temperature for designability.

```bash
model_name=dplm2_3b
sampling_strategy=annealing@2.0:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
```

Generated sequences and structures are saved under `generation-results/dplm2_3b/co_generation`. The official repository also includes evaluation utilities for TM-score, RMSD, diversity, and related structure metrics.

### Forward Folding

DPLM-2 can generate structures conditioned on input amino-acid sequences. The official scripts use deterministic argmax decoding for 100 diffusion iterations:

```bash
model_name=dplm2_3b
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

For custom sequences, provide a FASTA file via `--input_fasta_path`.

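
For the custom-sequence case in forward folding, the input file has to be prepared by hand. The following is a minimal sketch of one way to do that, assuming `generate_dplm2.py` accepts a standard FASTA file with one `>` header line followed by the sequence on a single line. The helper `format_fasta`, the record name `my_protein_1`, the example sequence, and the output file name `custom_aatype.fasta` are all hypothetical and are not part of the official repository. Restricting input to the 20 standard amino-acid letters is a conservative assumption, not a documented requirement.

```python
from pathlib import Path

# Assumption: only the 20 standard amino-acid letters are accepted here.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def format_fasta(records):
    """Render a {name: sequence} mapping as FASTA text, one record per pair of lines."""
    lines = []
    for name, seq in records.items():
        seq = seq.strip().upper()
        unknown = sorted(set(seq) - STANDARD_AA)
        if not seq or unknown:
            raise ValueError(f"{name}: empty sequence or unsupported residues {unknown}")
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example record; replace with your own sequences.
    records = {"my_protein_1": "MKTAYIAKQRQISFVKSHFSRQ"}
    Path("custom_aatype.fasta").write_text(format_fasta(records))
```

The resulting file path would then be passed as `--input_fasta_path custom_aatype.fasta` in the folding command shown earlier.
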
### Inverse Folding

DPLM-2 can predict amino-acid sequences conditioned on tokenized protein structures:

```bash
model_name=dplm2_3b
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

To use a custom structure, first tokenize PDB files with the structure tokenizer:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/your/input/structure \
    --output_dir /path/to/your/input/structure/tokenized_protein
```

Then pass the generated `struct.fasta` to `generate_dplm2.py` via `--input_fasta_path`.

### Motif Scaffolding

DPLM-2 supports multimodal motif scaffolding by conditioning on both the sequence and structure tokens of the motif and co-generating the scaffold sequence and structure:

```bash
model_name=dplm2_3b
output_dir=./generation-results/${model_name}/motif_scaffold

python run/scaffold_generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --num_seqs 100 \
    --saveto ${output_dir}
```

See the official repository for required motif data preparation and evaluation steps.

## Training Data and Training Procedure

DPLM-2 is trained on experimental structures from PDB and AF2-predicted structures from SwissProt. The authors provide the preprocessed training dataset on Hugging Face as [airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).

The official DPLM repository describes the following training setup for `dplm2_3b`:

- Initialize from the pretrained DPLM checkpoint `airkingbd/dplm_3b`
- Use a warm-up training strategy to mitigate the scarcity of structure data
- Use LoRA to limit large parameter shifts during multimodal training
- Use `airkingbd/struct_tokenizer` for structure tokenization

The experiment configuration is available in the official repository at `configs/experiment/dplm2/dplm2_3b.yaml`.

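
To illustrate why LoRA limits parameter shifts in the training setup above, here is a toy sketch of the generic LoRA update, `W_eff = W + (alpha / r) * B @ A`, in plain Python. It is not the authors' implementation: the matrix sizes, rank, and `alpha` are hypothetical toy values, and the actual settings live in the repository's experiment configuration. The sketch shows two properties of the standard formulation: with `B` initialized to zero the effective weight starts exactly at the pretrained weight, and the adapter has fewer trainable entries than the full matrix.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank (rows of A)."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


random.seed(0)
d_out, d_in, rank = 8, 8, 2  # hypothetical toy sizes, not the model's dimensions

# Frozen "pretrained" weight, plus the two small trainable adapter matrices.
w = [[random.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[random.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: no shift at step 0

w_eff = lora_effective_weight(w, lora_a, lora_b, alpha=4.0)

trainable_entries = rank * (d_in + d_out)  # entries in A and B
full_entries = d_in * d_out                # entries in W
```

Because only `A` and `B` are updated while `W` stays frozen, any change to the effective weight is confined to a rank-`r` correction, which is the sense in which the pretrained sequence knowledge is protected from large shifts.
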
## Evaluation Summary

The DPLM repository reports DPLM-2 results on multiple protein generation and understanding tasks, including sequence-structure co-generation, forward folding, inverse folding, motif scaffolding, and representation learning. For full tables, baselines, metrics, and evaluation details, refer to the [DPLM-2 paper](https://arxiv.org/abs/2410.13782), the [DPLM-2.1 paper](https://arxiv.org/abs/2504.11454), and the official [bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on prior work and resources including ByProt, EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure modeling utilities. See the official repository for the complete acknowledgements and implementation details.