---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- bitwise-modeling
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 Bit 650M

DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for joint protein sequence and structure modeling. It is a bitwise structure-token modeling variant of DPLM-2, introduced in [DPLM-2.1](https://arxiv.org/abs/2504.11454), for improving structure modeling over index-based discrete structure token prediction.

For the official implementation, installation instructions, generation scripts, training configuration, and evaluation utilities, see the [bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Model type:** Multimodal discrete diffusion protein language model with bitwise structure-token prediction
- **Checkpoint:** `airkingbd/dplm2_bit_650m`
- **Architecture:** ESM-style transformer for DPLM-2 Bit (`EsmForDPLM2Bit`)
- **Scale:** 650M parameters, 33 transformer layers, hidden size 1280, 20 attention heads
- **Amino-acid vocabulary size:** 33
- **Structure codebook:** 8,192 structure codes represented by 13-bit latent structure features
- **Base initialization:** DPLM-2 Bit training is initialized from the pretrained DPLM sequence model `airkingbd/dplm_650m`
- **Structure tokenizer:** Uses `airkingbd/struct_tokenizer`
- **License:** Apache-2.0
- **Papers:** [DPLM-2](https://arxiv.org/abs/2410.13782) and [DPLM-2.1](https://arxiv.org/abs/2504.11454)

## Bitwise Modeling

The original DPLM-2 models protein structures with discrete structure token indices produced by a structure tokenizer.
In the DPLM-2.1 analysis, the authors identify index-based structure token prediction as a bottleneck: small changes in the underlying quantized bits can produce a very different token index, making the index classification target hard for the language model to learn.

DPLM-2 Bit uses the LFQ structure tokenizer's bit-level representation directly. Instead of predicting one 8,192-way structure-token index per residue, it predicts each of the 13 bits of the quantized structure feature as a binary target. This turns structure prediction into 13 binary classifications per residue, provides finer-grained supervision, and reduces the difficulty of learning structural patterns from tokenized 3D structures.

## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the pretrained DPLM-2 Bit checkpoint:

```python
from byprot.models.dplm2 import DPLM2Bit

dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
dplm2_bit = dplm2_bit.eval()
```

### Sequence-Structure Co-Generation

Use `generate_dplm2.py` with `--bit_model`.
The official repository uses `annealing@1.1:0.1` for the released DPLM-2 Bit co-generation example:

```bash
model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --bit_model \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
```

### Forward Folding

DPLM-2 Bit can generate structures conditioned on amino-acid sequences:

```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

### Inverse Folding

DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein structures:

```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

For custom structures, first tokenize PDB files with the released structure tokenizer:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```

Then pass the generated structure-token FASTA file to `generate_dplm2.py`.

## Training Data and Training Procedure

DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The authors provide the preprocessed training dataset on Hugging Face as [airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).
The official DPLM repository provides the DPLM-2 Bit experiment configuration at `configs/experiment/dplm2/dplm2_bit_650m.yaml`. The configuration initializes from `airkingbd/dplm_650m`, uses `airkingbd/dplm2_650m` as the tokenizer vocabulary source, and uses `airkingbd/struct_tokenizer` for structure tokenization.

## Experimental Results

The tables below summarize selected results reported in the DPLM-2.1 paper. Lower RMSD is better and higher TM-score, AAR, accuracy, and diversity are better.

### Forward Folding

| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---:|---:|---:|---:|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |

### Structure-Token Prediction Accuracy

| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---:|---:|---:|---:|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |

### Inverse Folding

| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---:|---:|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |

### Representation Learning

| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---:|---:|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |

### Unconditional Generation Diversity

| Model | Diversity |
|---|---:|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |

For full experimental settings, additional variants such as FM, ResDiff, Geo, REPA, and SFT, and complete ablations, see the [DPLM-2.1 paper](https://arxiv.org/abs/2504.11454).
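As a rough illustration of how the Index Acc. and Bit Acc. columns above can diverge, the following sketch unpacks 13-bit structure codes into bits and scores a toy prediction both ways. It is not taken from the DPLM codebase: the helper names (`index_to_bits`, `bits_to_index`, `index_accuracy`, `bit_accuracy`), the most-significant-bit-first ordering, and the toy codes are all assumptions made for illustration, and the released LFQ tokenizer may order or encode bits differently.

```python
NUM_BITS = 13                  # bits per structure code, as stated in Model Details
CODEBOOK_SIZE = 2 ** NUM_BITS  # 8,192 structure codes


def index_to_bits(index: int) -> list:
    """Unpack a code index into NUM_BITS binary values (MSB first; assumed ordering)."""
    return [(index >> shift) & 1 for shift in range(NUM_BITS - 1, -1, -1)]


def bits_to_index(bits: list) -> int:
    """Inverse of index_to_bits: pack binary values back into a code index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def index_accuracy(pred: list, target: list) -> float:
    """Fraction of residues whose full code index matches exactly."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def bit_accuracy(pred: list, target: list) -> float:
    """Fraction of individual bits that match, over all residues."""
    matches = sum(
        pb == tb
        for p, t in zip(pred, target)
        for pb, tb in zip(index_to_bits(p), index_to_bits(t))
    )
    return matches / (len(target) * NUM_BITS)


# Hypothetical per-residue codes for a 4-residue toy protein.
target_codes = [5, 4101, 77, 8191]
# Flip only the most significant bit of the first two codes.
pred_codes = [5 ^ (1 << 12), 4101 ^ (1 << 12), 77, 8191]

print(f"index accuracy: {index_accuracy(pred_codes, target_codes):.4f}")  # 0.5000
print(f"bit accuracy:   {bit_accuracy(pred_codes, target_codes):.4f}")    # 0.9615
```

In this toy case two of four residues differ from the target by a single bit, so index accuracy drops to 0.5 while bit accuracy stays at 50/52. A one-bit miss moves the index from 5 to 4101, which an 8,192-way classifier scores as entirely wrong, whereas 13 binary targets still credit the 12 correct bits.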
## Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt, EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure modeling utilities. See the official repository for the complete acknowledgements and implementation details.