---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- bitwise-modeling
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---
# DPLM-2 Bit 650M
DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for
joint protein sequence and structure modeling. It is the bitwise structure-token
variant of DPLM-2, introduced in
[DPLM-2.1](https://arxiv.org/abs/2504.11454), and is designed to improve
structure modeling over index-based discrete structure-token prediction.
For the official implementation, installation instructions, generation scripts,
training configuration, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.
## Model Details
- **Model type:** Multimodal discrete diffusion protein language model with
bitwise structure-token prediction
- **Checkpoint:** `airkingbd/dplm2_bit_650m`
- **Architecture:** ESM-style transformer for DPLM-2 Bit (`EsmForDPLM2Bit`)
- **Scale:** 650M parameters, 33 transformer layers, hidden size 1280, 20
attention heads
- **Amino-acid vocabulary size:** 33
- **Structure codebook:** 8,192 structure codes represented by 13-bit latent
structure features
- **Base initialization:** DPLM-2 Bit training is initialized from the pretrained
DPLM sequence model `airkingbd/dplm_650m`
- **Structure tokenizer:** Uses `airkingbd/struct_tokenizer`
- **License:** Apache-2.0
- **Papers:** [DPLM-2](https://arxiv.org/abs/2410.13782) and
[DPLM-2.1](https://arxiv.org/abs/2504.11454)
## Bitwise Modeling
The original DPLM-2 models protein structures with discrete structure token
indices produced by a structure tokenizer. In the DPLM-2.1 analysis, the authors
identify index-based structure token prediction as a bottleneck: small changes
in the underlying quantized bits can produce a very different token index, making
the index classification target hard for the language model to learn.
DPLM-2 Bit uses the bit-level representation of the LFQ (lookup-free quantization) structure tokenizer directly.
Instead of predicting one 8,192-way structure-token index per residue, it predicts
each of the 13 bits of the quantized structure feature as a binary target. This
turns structure prediction into 13 binary classifications per residue, provides
finer-grained supervision, and reduces the difficulty of learning structural
patterns from tokenized 3D structures.
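The relationship between an 8,192-way index and its 13 binary targets can be illustrated with a minimal sketch. The helper names and the most-significant-bit-first ordering are assumptions made for illustration; the released structure tokenizer may map bits to indices differently.

```python
NUM_BITS = 13                    # bits per quantized structure feature
CODEBOOK_SIZE = 2 ** NUM_BITS    # 8,192 structure codes


def index_to_bits(index: int) -> list[int]:
    """Expand a structure-token index into 13 bits (assumed MSB first)."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError(f"index must be in [0, {CODEBOOK_SIZE})")
    return [(index >> shift) & 1 for shift in range(NUM_BITS - 1, -1, -1)]


def bits_to_index(bits: list[int]) -> int:
    """Collapse 13 bits (assumed MSB first) back into a single index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


# One index per residue becomes 13 binary targets per residue.
token_indices = [0, 5, 8191]
bit_targets = [index_to_bits(i) for i in token_indices]

# Flipping a single bit can move the index far away.
original = index_to_bits(5)
flipped = [1 - original[0]] + original[1:]
print(bits_to_index(original), "->", bits_to_index(flipped))  # 5 -> 4101
```

Here two codes that differ in one of 13 bits are treated as entirely different classes by an index classifier, whereas a bitwise objective still counts 12 of the 13 targets as correct.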
## Quick Start
Install the official DPLM codebase and dependencies:
```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```
Load the pretrained DPLM-2 Bit checkpoint:
```python
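from byprot.models.dplm2 import DPLM2Bit

# Load the pretrained checkpoint and move it to the GPU.
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
# Switch to evaluation mode for inference.
dplm2_bit = dplm2_bit.eval()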
```
### Sequence-Structure Co-Generation
Use `generate_dplm2.py` with `--bit_model`. The official repository uses
`annealing@1.1:0.1` for the released DPLM-2 Bit co-generation example:
```bash
model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task co_generation \
--bit_model \
--sampling_strategy ${sampling_strategy} \
--num_seqs 50 \
--max_iter 500 \
--seq_lens 100 200 300 400 500 \
--saveto ${output_dir}
```
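This card does not define the `annealing@1.1:0.1` string. One plausible reading, assumed here and not confirmed by the source, is a sampling temperature that decays from 1.1 to 0.1 over the `--max_iter` denoising steps. The sketch below parses the string and applies a linear schedule; both function names and the linear interpolation are hypothetical, and the official repository remains the reference for the actual behavior.

```python
def parse_annealing(strategy: str) -> tuple[float, float]:
    """Split a string like 'annealing@1.1:0.1' into (start, end) temperatures."""
    name, _, temps = strategy.partition("@")
    if name != "annealing" or ":" not in temps:
        raise ValueError(f"unexpected strategy string: {strategy!r}")
    start, end = temps.split(":")
    return float(start), float(end)


def temperature_at(step: int, max_iter: int, start: float, end: float) -> float:
    """Assumed linear decay from `start` at step 0 to `end` at the last step."""
    if max_iter <= 1:
        return end
    fraction = step / (max_iter - 1)
    return start + (end - start) * fraction


start, end = parse_annealing("annealing@1.1:0.1")
for step in (0, 250, 499):
    print(step, round(temperature_at(step, 500, start, end), 3))
```

Under this reading, early iterations sample more freely and later iterations become close to greedy.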
### Forward Folding
DPLM-2 Bit can generate structures conditioned on amino-acid sequences:
```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task folding \
--bit_model \
--input_fasta_path data-bin/cameo2022/aatype.fasta \
--max_iter 100 \
--unmasking_strategy deterministic \
--sampling_strategy argmax \
--saveto ${output_dir}
```
### Inverse Folding
DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein
structures:
```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}
python generate_dplm2.py \
--model_name airkingbd/${model_name} \
--task inverse_folding \
--bit_model \
--input_fasta_path data-bin/cameo2022/struct.fasta \
--max_iter 100 \
--unmasking_strategy deterministic \
--sampling_strategy argmax \
--saveto ${output_dir}
```
For custom structures, first tokenize PDB files with the released structure
tokenizer:
```bash
python src/byprot/utils/protein/tokenize_pdb.py \
--input_pdb_folder /path/to/input/pdbs \
--output_dir /path/to/output/tokenized_protein
```
Then pass the generated structure-token FASTA file to `generate_dplm2.py`.
## Training Data and Training Procedure
DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The
authors provide the preprocessed training dataset on Hugging Face as
[airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).
The official DPLM repository provides the DPLM-2 Bit experiment configuration at
`configs/experiment/dplm2/dplm2_bit_650m.yaml`. The configuration initializes
from `airkingbd/dplm_650m`, uses `airkingbd/dplm2_650m` as the tokenizer
vocabulary source, and uses `airkingbd/struct_tokenizer` for structure
tokenization.
## Experimental Results
The tables below summarize selected results reported in the DPLM-2.1 paper.
Lower RMSD is better; higher TM-score, amino-acid recovery (AAR), accuracy, and
diversity are better.
### Forward Folding
| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---:|---:|---:|---:|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |
### Structure-Token Prediction Accuracy
| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---:|---:|---:|---:|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
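The gap between the two accuracy columns follows from how the metrics count errors. The sketch below assumes that bit accuracy is the fraction of matching bits over all residues and that index accuracy requires all 13 bits of a residue to match; the paper's exact evaluation protocol may differ, and the function names are illustrative.

```python
def bit_accuracy(pred_bits: list[list[int]], true_bits: list[list[int]]) -> float:
    """Fraction of individual bits predicted correctly across all residues."""
    total = 0
    correct = 0
    for pred_residue, true_residue in zip(pred_bits, true_bits):
        for pred, true in zip(pred_residue, true_residue):
            total += 1
            correct += int(pred == true)
    return correct / total


def index_accuracy(pred_bits: list[list[int]], true_bits: list[list[int]]) -> float:
    """Fraction of residues whose full 13-bit code matches (assumed definition)."""
    matches = [pred == true for pred, true in zip(pred_bits, true_bits)]
    return sum(matches) / len(matches)


# Toy example with two residues: the second prediction has one wrong bit.
true_bits = [[0] * 13, [1] * 13]
pred_bits = [[0] * 13, [1] * 12 + [0]]

print(bit_accuracy(pred_bits, true_bits))    # 25 of 26 bits correct
print(index_accuracy(pred_bits, true_bits))  # 1 of 2 residues fully correct
```

A single wrong bit costs the whole residue under index accuracy, which is why index accuracy can stay low even when bit accuracy is high.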
### Inverse Folding
| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---:|---:|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |
### Representation Learning
| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---:|---:|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |
### Unconditional Generation Diversity
| Model | Diversity |
|---|---:|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |
For full experimental settings, additional variants such as FM, ResDiff, Geo,
REPA, and SFT, and complete ablations, see the
[DPLM-2.1 paper](https://arxiv.org/abs/2504.11454).
## Citation
If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:
```bibtex
@inproceedings{wang2024dplm,
title={Diffusion Language Models Are Versatile Protein Learners},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2024}
}
@inproceedings{wang2025dplm2,
title={DPLM-2: A Multimodal Diffusion Protein Language Model},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{hsieh2025dplm2_1,
title={Elucidating the Design Space of Multimodal Protein Language Models},
author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2025}
}
```
## Acknowledgements
DPLM builds on prior work and resources, including ByProt, EvoDiff, SaProt, ESM,
LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure
modeling utilities. See the official repository for the complete
acknowledgements and implementation details.