---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---
# DPLM-2 3B
DPLM-2 is a multimodal diffusion protein language model for jointly modeling,
understanding, and generating protein sequences and structures. It extends the
discrete diffusion protein language model family from sequence-only protein
language modeling to sequence-structure modeling, enabling protein
sequence-structure co-generation and conditional generation tasks such as
folding, inverse folding, and motif scaffolding.
This repository contains the 3B-parameter DPLM-2 checkpoint. For the official
implementation, installation instructions, generation scripts, training
configuration, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.
## Model Details
- **Model type:** Multimodal discrete diffusion protein language model
- **Checkpoint:** `airkingbd/dplm2_3b`
- **Architecture:** ESM-style transformer for DPLM-2 (`EsmForDPLM2`)
- **Scale:** 3B parameters, 36 transformer layers, hidden size 2560, 40
attention heads
- **Vocabulary:** 8,229 tokens, covering amino-acid tokens, structure tokens,
and special tokens
- **Base initialization:** DPLM-2 training is initialized from the pretrained
DPLM sequence model `airkingbd/dplm_3b`
- **Structure tokenizer:** Uses the DPLM structure tokenizer
(`airkingbd/struct_tokenizer`) for structure-token based modeling and PDB
reconstruction
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)
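As a rough sanity check on the figures above, the following sketch estimates the parameter count from the listed layer count, hidden size, and vocabulary size. It assumes a standard transformer block with a feed-forward width of 4x the hidden size and ignores biases, layer norms, and output heads; none of these assumptions are stated in this model card, so treat the result as an approximation only.

```python
# Back-of-the-envelope parameter count for the listed architecture.
# Assumptions (not stated in the model card): feed-forward width = 4 * hidden;
# biases, layer norms, and output heads are ignored.
num_layers = 36
hidden = 2560
num_heads = 40
vocab_size = 8229

head_dim = hidden // num_heads            # per-head dimension
attn_params = 4 * hidden * hidden         # Q, K, V, and output projections
ffn_params = 2 * hidden * (4 * hidden)    # up- and down-projections
per_layer = attn_params + ffn_params
embedding = vocab_size * hidden           # token embedding table
total = num_layers * per_layer + embedding

print(f"head dim: {head_dim}")
print(f"approx. parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands slightly under 3B, which is consistent with the stated scale.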
## Quick Start
Install the official DPLM codebase and dependencies:
```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```
Load the pretrained DPLM-2 checkpoint:
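```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

# Fetch the checkpoint from the Hugging Face Hub and move it to the GPU.
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_3b").cuda()
dplm2 = dplm2.eval()  # switch to evaluation mode for inference
```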
### Sequence-Structure Co-Generation
The official repository provides `generate_dplm2.py` for co-generation. The
default DPLM-2 sampling strategy is `annealing@2.0:0.1`, which starts at a high
sampling temperature (2.0) for diversity and anneals to a low temperature (0.1)
for designability.
```bash
model_name=dplm2_3b
sampling_strategy=annealing@2.0:0.1
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task co_generation \
--sampling_strategy ${sampling_strategy} \
--num_seqs 50 \
--max_iter 500 \
--seq_lens 100 200 300 400 500 \
--saveto ${output_dir}
```
Generated sequences and structures are saved under
`generation-results/dplm2_3b/co_generation`. The official repository also
includes evaluation utilities for TM-score, RMSD, diversity, and related
structure metrics.
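To make the `annealing@2.0:0.1` strategy string more concrete, here is a minimal sketch of how such a schedule could be computed. The helper names `parse_annealing` and `temperature_at` are hypothetical, and the linear interpolation is an assumption; the official implementation may use a different schedule shape.

```python
def parse_annealing(strategy: str) -> tuple[float, float]:
    """Split a string such as 'annealing@2.0:0.1' into (start, end) temperatures."""
    name, _, temps = strategy.partition("@")
    if name != "annealing" or ":" not in temps:
        raise ValueError(f"not an annealing strategy: {strategy!r}")
    start, end = temps.split(":")
    return float(start), float(end)


def temperature_at(step: int, max_iter: int, start: float, end: float) -> float:
    """Linearly interpolate from `start` (step 0) to `end` (last step).

    The linear shape is an assumption made for illustration.
    """
    if max_iter <= 1:
        return end
    frac = step / (max_iter - 1)
    return start + frac * (end - start)


start, end = parse_annealing("annealing@2.0:0.1")
schedule = [temperature_at(t, 500, start, end) for t in range(500)]
print(f"first: {schedule[0]:.2f}, last: {schedule[-1]:.2f}")
```

Early iterations sample at a high temperature, so token choices are more varied; later iterations sample closer to the mode of the model's distribution.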
### Forward Folding
DPLM-2 can generate structures conditioned on input amino-acid sequences. The
official scripts use deterministic argmax decoding for 100 diffusion iterations:
```bash
model_name=dplm2_3b
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task folding \
--input_fasta_path data-bin/cameo2022/aatype.fasta \
--max_iter 100 \
--unmasking_strategy deterministic \
--sampling_strategy argmax \
--saveto ${output_dir}
```
For custom sequences, provide a FASTA file via `--input_fasta_path`.
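A custom FASTA file can be prepared with a few lines of Python. The sketch below is an illustration only: the helper `write_fasta`, the record name, the example sequence, and the output file name are all hypothetical, and it assumes the script accepts plain single-line FASTA records restricted to the 20 standard amino acids.

```python
from pathlib import Path

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def write_fasta(records: dict[str, str], path: str) -> None:
    """Write {name: sequence} pairs as FASTA, one sequence per line."""
    lines = []
    for name, seq in records.items():
        seq = seq.strip().upper()
        unknown = set(seq) - VALID_RESIDUES
        if unknown:
            raise ValueError(f"{name}: unexpected residues {sorted(unknown)}")
        lines.append(f">{name}")
        lines.append(seq)
    Path(path).write_text("\n".join(lines) + "\n")


# Hypothetical example record; replace with your own sequences.
write_fasta({"my_protein": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}, "my_sequences.fasta")
```

The resulting file path can then be passed to `--input_fasta_path`.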
### Inverse Folding
DPLM-2 can predict amino-acid sequences conditioned on tokenized protein
structures:
```bash
model_name=dplm2_3b
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task inverse_folding \
--input_fasta_path data-bin/cameo2022/struct.fasta \
--max_iter 100 \
--unmasking_strategy deterministic \
--sampling_strategy argmax \
--saveto ${output_dir}
```
To use a custom structure, first tokenize PDB files with the structure tokenizer:
```bash
python src/byprot/utils/protein/tokenize_pdb.py \
--input_pdb_folder /path/to/your/input/structure \
--output_dir /path/to/your/input/structure/tokenized_protein
```
Then pass the generated `struct.fasta` to `generate_dplm2.py`.
### Motif Scaffolding
DPLM-2 supports multimodal motif scaffolding by conditioning on both the
sequence and structure tokens of the motif and co-generating the scaffold
sequence and structure:
```bash
model_name=dplm2_3b
output_dir=./generation-results/${model_name}/motif_scaffold
python run/scaffold_generate_dplm2.py \
--model_name airkingbd/${model_name} \
--num_seqs 100 \
--saveto ${output_dir}
```
See the official repository for required motif data preparation and evaluation
steps.
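Conceptually, conditioning on a motif means fixing the motif tokens at their positions and leaving every scaffold position masked for the model to fill in. The following sketch illustrates that idea for one token track. The function `build_scaffold_template` and the `<mask>` placeholder are hypothetical and are not taken from the official scripts, which may construct their inputs differently.

```python
MASK = "<mask>"  # placeholder; real mask tokens are defined by the DPLM-2 tokenizer


def build_scaffold_template(motif_tokens, motif_start, total_length):
    """Return a token list with the motif fixed in place and all other
    positions set to MASK, plus a boolean list marking the fixed positions."""
    if motif_start < 0 or motif_start + len(motif_tokens) > total_length:
        raise ValueError("motif does not fit inside the requested length")
    template = [MASK] * total_length
    fixed = [False] * total_length
    for offset, token in enumerate(motif_tokens):
        template[motif_start + offset] = token
        fixed[motif_start + offset] = True
    return template, fixed


# Toy example: a 4-residue motif placed inside a 10-residue design.
seq_template, fixed = build_scaffold_template(list("HEAG"), motif_start=3, total_length=10)
print(seq_template)
```

In the multimodal setting, the same kind of template would be built for both the amino-acid track and the structure-token track, so that the motif is fixed in both modalities while the scaffold is generated.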
## Training Data and Training Procedure
DPLM-2 is trained on experimental structures from PDB and AF2-predicted
structures from SwissProt. The authors provide the preprocessed training dataset
on Hugging Face as
[airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).
The official DPLM repository describes the following training setup for
`dplm2_3b`:
- Initialize from the pretrained DPLM checkpoint `airkingbd/dplm_3b`
- Use a warm-up training strategy to mitigate the scarcity of structure data
- Use LoRA to limit large parameter shifts during multimodal training
- Use `airkingbd/struct_tokenizer` for structure tokenization
The experiment configuration is available in the official repository at
`configs/experiment/dplm2/dplm2_3b.yaml`.
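To illustrate why LoRA keeps the update to the pretrained weights small, the arithmetic below compares the size of a low-rank adapter with the dense matrix it adapts. The rank of 16 is a hypothetical value chosen for illustration; the rank actually used for `dplm2_3b` is defined in the experiment configuration.

```python
hidden = 2560   # hidden size listed in Model Details
rank = 16       # hypothetical LoRA rank; check the experiment config for the real value

full_params = hidden * hidden            # one dense hidden x hidden projection
lora_params = rank * (hidden + hidden)   # low-rank factors A (rank x hidden) and B (hidden x rank)

print(f"dense matrix:  {full_params:,} parameters")
print(f"LoRA adapter:  {lora_params:,} parameters")
print(f"trainable fraction per adapted matrix: {lora_params / full_params:.4f}")
```

With these illustrative numbers, only about 1% of the parameters of each adapted matrix are trained, while the pretrained weights themselves stay frozen.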
## Evaluation Summary
The DPLM repository reports DPLM-2 results on multiple protein generation and
understanding tasks, including sequence-structure co-generation, forward
folding, inverse folding, motif scaffolding, and representation learning. For
full tables, baselines, metrics, and evaluation details, refer to the
[DPLM-2 paper](https://arxiv.org/abs/2410.13782), the
[DPLM-2.1 paper](https://arxiv.org/abs/2504.11454), and the official
[bytedance/dplm](https://github.com/bytedance/dplm) repository.
## Citation
If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:
```bibtex
@inproceedings{wang2024dplm,
title={Diffusion Language Models Are Versatile Protein Learners},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2024}
}
@inproceedings{wang2025dplm2,
title={DPLM-2: A Multimodal Diffusion Protein Language Model},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{hsieh2025dplm2_1,
title={Elucidating the Design Space of Multimodal Protein Language Models},
author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2025}
}
```
## Acknowledgements
DPLM builds on prior work and resources, including ByProt, EvoDiff, SaProt,
ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure
modeling utilities. See the official repository for the complete
acknowledgements and implementation details.