---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- bitwise-modeling
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 Bit 650M

DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for
joint protein sequence and structure modeling. It is the bitwise structure-token
variant of DPLM-2, introduced in
[DPLM-2.1](https://arxiv.org/abs/2504.11454) to improve structure modeling
over index-based discrete structure-token prediction.

For the official implementation, installation instructions, generation scripts,
training configuration, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Model type:** Multimodal discrete diffusion protein language model with
  bitwise structure-token prediction
- **Checkpoint:** `airkingbd/dplm2_bit_650m`
- **Architecture:** ESM-style transformer for DPLM-2 Bit (`EsmForDPLM2Bit`)
- **Scale:** 650M parameters, 33 transformer layers, hidden size 1280, 20
  attention heads
- **Amino-acid vocabulary size:** 33
- **Structure codebook:** 8,192 structure codes (2^13), each represented by a
  13-bit latent structure feature
- **Base initialization:** DPLM-2 Bit training is initialized from the pretrained
  DPLM sequence model `airkingbd/dplm_650m`
- **Structure tokenizer:** Uses `airkingbd/struct_tokenizer`
- **License:** Apache-2.0
- **Papers:** [DPLM-2](https://arxiv.org/abs/2410.13782) and
  [DPLM-2.1](https://arxiv.org/abs/2504.11454)

## Bitwise Modeling

The original DPLM-2 models protein structures with discrete structure token
indices produced by a structure tokenizer. In the DPLM-2.1 analysis, the authors
identify index-based structure token prediction as a bottleneck: small changes
in the underlying quantized bits can produce a very different token index, making
the index classification target hard for the language model to learn.

DPLM-2 Bit uses the LFQ structure tokenizer's bit-level representation directly.
Instead of predicting one 8,192-way structure-token index per residue, it predicts
each of the 13 bits of the quantized structure feature as a binary target. This
turns structure prediction into 13 binary classifications per residue, provides
finer-grained supervision, and reduces the difficulty of learning structural
patterns from tokenized 3D structures.
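
To make the index-versus-bits distinction concrete, here is a minimal sketch in
plain Python. It assumes each structure code is the integer whose 13-bit binary
expansion, most significant bit first, is the quantized feature. The actual bit
ordering used by the LFQ tokenizer may differ, and the helper names are
hypothetical rather than part of the DPLM codebase.

```python
NUM_BITS = 13  # 2 ** 13 == 8192 structure codes


def index_to_bits(index, num_bits=NUM_BITS):
    """Return the bits of `index`, most significant bit first (assumed order)."""
    if not 0 <= index < 2 ** num_bits:
        raise ValueError(f"index must be in [0, {2 ** num_bits})")
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_index(bits):
    """Inverse of index_to_bits: fold a bit list back into an integer."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def hamming_distance(a, b, num_bits=NUM_BITS):
    """Number of bit positions in which two indices differ."""
    bits_a = index_to_bits(a, num_bits)
    bits_b = index_to_bits(b, num_bits)
    return sum(x != y for x, y in zip(bits_a, bits_b))


# Flipping a single bit can move the index a long way.
bits = index_to_bits(5)
flipped = bits.copy()
flipped[0] ^= 1  # flip the most significant bit
print(bits_to_index(flipped))        # 4101 (= 5 + 4096)

# Conversely, adjacent indices can disagree on every bit.
print(hamming_distance(4095, 4096))  # 13
```

An index label alone says little about how close two codes are in bit space,
whereas 13 per-bit binary targets give partial credit for every bit that is
predicted correctly.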

## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the pretrained DPLM-2 Bit checkpoint:

```python
from byprot.models.dplm2 import DPLM2Bit

dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
dplm2_bit = dplm2_bit.eval()
```

### Sequence-Structure Co-Generation

Use `generate_dplm2.py` with the `--bit_model` flag. The official repository's
DPLM-2 Bit co-generation example sets `--sampling_strategy` to
`annealing@1.1:0.1`:

```bash
model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --bit_model \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
```

### Forward Folding

DPLM-2 Bit can generate structures conditioned on amino-acid sequences:

```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

### Inverse Folding

DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein
structures:

```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

For custom structures, first tokenize PDB files with the released structure
tokenizer:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```

Then pass the generated structure-token FASTA file to `generate_dplm2.py` via
`--input_fasta_path`.

## Training Data and Training Procedure

DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The
authors provide the preprocessed training dataset on Hugging Face as
[airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).

The official DPLM repository provides the DPLM-2 Bit experiment configuration at
`configs/experiment/dplm2/dplm2_bit_650m.yaml`. The configuration initializes
from `airkingbd/dplm_650m`, uses `airkingbd/dplm2_650m` as the tokenizer
vocabulary source, and uses `airkingbd/struct_tokenizer` for structure
tokenization.

## Experimental Results

The tables below summarize selected results reported in the DPLM-2.1 paper.
Lower is better for RMSD; higher is better for TM-score, AAR, accuracy, and
diversity.

### Forward Folding

| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---:|---:|---:|---:|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |
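
The gap between the two rows is easier to compare across metrics as a relative
change. The snippet below is a small sketch that computes it from the values in
the table above; `relative_change` is a hypothetical helper, not part of the
official evaluation utilities.

```python
def relative_change(baseline, new):
    """Signed fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# (DPLM-2 650M, DPLM-2 Bit 650M) values copied from the forward-folding table.
forward_folding = {
    "CAMEO 2022 RMSD": (7.7025, 6.4028),
    "CAMEO 2022 TM-score": (0.7936, 0.8380),
    "PDB Date RMSD": (5.3071, 3.2213),
    "PDB Date TM-score": (0.8306, 0.9043),
}

for metric, (dplm2, dplm2_bit) in forward_folding.items():
    print(f"{metric}: {relative_change(dplm2, dplm2_bit):+.1%}")
```

A negative change in RMSD and a positive change in TM-score both indicate that
DPLM-2 Bit improves on DPLM-2 for that metric.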

### Structure-Token Prediction Accuracy

| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---:|---:|---:|---:|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
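
Index accuracy is much lower than bit accuracy in every row. The sketch below
shows one way the two metrics can relate, assuming that a residue counts as
index-correct only when all of its bits match the target. This is an assumption
about the metric definitions, not the paper's evaluation code, and the toy
example uses 4 bits per residue instead of 13 to stay readable.

```python
def bit_accuracy(pred, target):
    """Fraction of individual bits that match the target."""
    total = sum(len(target_bits) for target_bits in target)
    correct = sum(
        p == t
        for pred_bits, target_bits in zip(pred, target)
        for p, t in zip(pred_bits, target_bits)
    )
    return correct / total


def index_accuracy(pred, target):
    """Fraction of residues whose bits all match (same token index)."""
    matches = sum(
        list(pred_bits) == list(target_bits)
        for pred_bits, target_bits in zip(pred, target)
    )
    return matches / len(target)


# Toy data: 4 residues, 4 bits each (hypothetical values).
target = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1]]
pred = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1], [0, 1, 0, 1]]

print(bit_accuracy(pred, target))    # 0.8125
print(index_accuracy(pred, target))  # 0.25
```

A single wrong bit is enough to make the whole index wrong, so a model can get
most bits right while still scoring low on index accuracy.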

### Inverse Folding

| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---:|---:|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |

### Representation Learning

| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---:|---:|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |

### Unconditional Generation Diversity

| Model | Diversity |
|---|---:|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |

For full experimental settings, additional variants such as FM, ResDiff, Geo,
REPA, and SFT, and complete ablations, see the
[DPLM-2.1 paper](https://arxiv.org/abs/2504.11454).

## Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt,
EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and
OpenFold-related structure modeling utilities. See the official repository for
the complete acknowledgements and implementation details.