---
license: apache-2.0
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- protein-structure
- diffusion
- esm
- pytorch
- bitwise-modeling
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 Bit 650M

DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for
joint protein sequence and structure modeling. It is the bitwise structure-token
variant of DPLM-2, introduced in
[DPLM-2.1](https://arxiv.org/abs/2504.11454) to improve structure modeling
over index-based discrete structure-token prediction.

For the official implementation, installation instructions, generation scripts,
training configuration, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Model type:** Multimodal discrete diffusion protein language model with
  bitwise structure-token prediction
- **Checkpoint:** `airkingbd/dplm2_bit_650m`
- **Architecture:** ESM-style transformer for DPLM-2 Bit (`EsmForDPLM2Bit`)
- **Scale:** 650M parameters, 33 transformer layers, hidden size 1280, 20
  attention heads
- **Amino-acid vocabulary size:** 33
- **Structure codebook:** 8,192 structure codes represented by 13-bit latent
  structure features
- **Base initialization:** DPLM-2 Bit training is initialized from the pretrained
  DPLM sequence model `airkingbd/dplm_650m`
- **Structure tokenizer:** Uses `airkingbd/struct_tokenizer`
- **License:** Apache-2.0
- **Papers:** [DPLM-2](https://arxiv.org/abs/2410.13782) and
  [DPLM-2.1](https://arxiv.org/abs/2504.11454)

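
The figures in the list above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is a back-of-the-envelope estimate, not a count taken from the checkpoint: it assumes a feed-forward width of 4x the hidden size (an assumption, as is common for ESM-style transformers) and ignores embeddings, biases, layer norms, and output heads.

```python
# Rough consistency check of the numbers listed under "Model Details".
# Assumption: feed-forward width is 4 * hidden size; embeddings, biases,
# layer norms, and prediction heads are ignored.
num_layers = 33
hidden_size = 1280
num_heads = 20
num_bits = 13

head_dim = hidden_size // num_heads      # per-head dimension
codebook_size = 2 ** num_bits            # codes addressable with 13 bits

attention_weights = 4 * hidden_size * hidden_size            # Q, K, V, output projections
feed_forward_weights = 2 * hidden_size * (4 * hidden_size)   # up and down projections
approx_params = num_layers * (attention_weights + feed_forward_weights)

print(f"head dimension:       {head_dim}")
print(f"structure codes:      {codebook_size}")
print(f"approx. block params: {approx_params / 1e6:.1f}M")
```

Under these assumptions the transformer blocks alone account for roughly 649M weights, in line with the stated 650M scale, and 13 bits address exactly the 8,192 structure codes.
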
## Bitwise Modeling

The original DPLM-2 models protein structures as discrete structure-token
indices produced by a structure tokenizer. The DPLM-2.1 analysis identifies
index-based structure-token prediction as a bottleneck: a small change in the
underlying quantized bits can produce a very different token index, which makes
the index classification target hard for the language model to learn.

DPLM-2 Bit uses the bit-level representation of the LFQ structure tokenizer
directly. Instead of predicting one 8,192-way structure-token index per residue,
it predicts each of the 13 bits of the quantized structure feature as a binary
target. This turns structure prediction into 13 binary classifications per
residue, provides finer-grained supervision, and makes structural patterns
easier to learn from tokenized 3D structures.

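
To make the contrast concrete, here is a minimal NumPy sketch of the index-to-bits relationship and of a per-bit binary cross-entropy target. It is an illustration under stated assumptions, not the repository's implementation: the bit order (least significant bit first) and the helper names `index_to_bits`, `bits_to_index`, and `bitwise_bce` are hypothetical.

```python
import numpy as np

NUM_BITS = 13  # 2**13 == 8192 structure codes


def index_to_bits(index: np.ndarray) -> np.ndarray:
    """Unpack integer codes into NUM_BITS binary targets (assumed LSB first)."""
    shifts = np.arange(NUM_BITS)
    return (index[..., None] >> shifts) & 1


def bits_to_index(bits: np.ndarray) -> np.ndarray:
    """Pack NUM_BITS binary values back into an integer code."""
    weights = 1 << np.arange(NUM_BITS)
    return (bits * weights).sum(axis=-1)


def bitwise_bce(logits: np.ndarray, bits: np.ndarray) -> float:
    """Mean binary cross-entropy over all residues and bits (stable form)."""
    loss = np.maximum(logits, 0) - logits * bits + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())


# Flipping a single bit moves the code far away in index space.
bits = index_to_bits(np.array([5]))[0]
flipped = bits.copy()
flipped[12] ^= 1
print(bits_to_index(bits), bits_to_index(flipped))  # 5 4101
```

With an index target, codes 5 and 4101 are unrelated classes even though they share 12 of 13 bits. With bitwise targets, a prediction that gets 12 bits right is penalized on only one of the 13 binary losses.
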
## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the pretrained DPLM-2 Bit checkpoint:

```python
from byprot.models.dplm2 import DPLM2Bit

# Load the released checkpoint and move it to the GPU.
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
# Put the model in evaluation mode for inference.
dplm2_bit = dplm2_bit.eval()
```

### Sequence-Structure Co-Generation

Use `generate_dplm2.py` with `--bit_model`. The official repository uses
`annealing@1.1:0.1` for the released DPLM-2 Bit co-generation example:

```bash
model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task co_generation \
    --bit_model \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
```

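
The `annealing@1.1:0.1` string is not explained in this card. One plausible reading, offered as an assumption rather than a description of the repository's code, is a sampling temperature that decays from 1.1 to 0.1 over the denoising iterations. In the sketch below, the helper names `parse_annealing` and `temperature_at` and the linear interpolation are hypothetical; consult `generate_dplm2.py` for the actual behavior.

```python
from typing import Tuple


def parse_annealing(strategy: str) -> Tuple[float, float]:
    """Split 'annealing@START:END' into (start, end) temperatures (assumed format)."""
    name, _, temps = strategy.partition("@")
    if name != "annealing" or ":" not in temps:
        raise ValueError(f"not an annealing strategy: {strategy!r}")
    start, end = (float(t) for t in temps.split(":"))
    return start, end


def temperature_at(step: int, max_iter: int, start: float, end: float) -> float:
    """Linearly interpolate from start (step 0) to end (last step). Assumed schedule."""
    fraction = step / max(max_iter - 1, 1)
    return start + (end - start) * fraction


start, end = parse_annealing("annealing@1.1:0.1")
for step in (0, 250, 499):
    print(step, round(temperature_at(step, 500, start, end), 3))
```

Under this reading, early iterations sample at a higher temperature, which favors diversity, and later iterations sample closer to greedy decoding.
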
### Forward Folding

DPLM-2 Bit can generate structures conditioned on amino-acid sequences:

```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/aatype.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

### Inverse Folding

DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein
structures:

```bash
model_name=dplm2_bit_650m
output_dir=generation-results/${model_name}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task inverse_folding \
    --bit_model \
    --input_fasta_path data-bin/cameo2022/struct.fasta \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}
```

For custom structures, first tokenize PDB files with the released structure
tokenizer:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```

Then pass the generated structure-token FASTA file to `generate_dplm2.py`.

## Training Data and Training Procedure

DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The
authors provide the preprocessed training dataset on Hugging Face as
[airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).

The official DPLM repository provides the DPLM-2 Bit experiment configuration at
`configs/experiment/dplm2/dplm2_bit_650m.yaml`. The configuration initializes
from `airkingbd/dplm_650m`, uses `airkingbd/dplm2_650m` as the tokenizer
vocabulary source, and uses `airkingbd/struct_tokenizer` for structure
tokenization.

## Experimental Results

The tables below summarize selected results reported in the DPLM-2.1 paper.
Lower RMSD is better; higher TM-score, amino-acid recovery (AAR), accuracy, and
diversity are better.

### Forward Folding

| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
|---|---:|---:|---:|---:|
| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |

### Structure-Token Prediction Accuracy

| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
|---|---|---:|---:|---:|---:|
| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |

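
The table reports both index accuracy and bit accuracy. The sketch below shows one way the two metrics could be computed from predicted and reference structure-token indices. It assumes each index is the plain binary encoding of its 13 bits, which may not match the tokenizer's actual bit layout, and the function names are hypothetical rather than taken from the evaluation code.

```python
import numpy as np

NUM_BITS = 13  # 8,192 structure codes


def to_bits(index: np.ndarray) -> np.ndarray:
    """Unpack integer codes into NUM_BITS binary values (assumed encoding)."""
    return (index[..., None] >> np.arange(NUM_BITS)) & 1


def index_accuracy(pred: np.ndarray, true: np.ndarray) -> float:
    """Fraction of residues whose full token index is exactly right."""
    return float(np.mean(pred == true))


def bit_accuracy(pred: np.ndarray, true: np.ndarray) -> float:
    """Fraction of individual bits that are right, over all residues."""
    return float(np.mean(to_bits(pred) == to_bits(true)))


# Toy example: two of four indices are wrong, each by a single bit.
true = np.array([0, 3, 8191, 100])
pred = np.array([0, 1, 8191, 101])
print(index_accuracy(pred, true))  # 0.5
print(bit_accuracy(pred, true))    # 50 of 52 bits correct
```

The toy example illustrates why the two columns differ so much in scale: an index counts as correct only when all 13 bits agree, so a prediction can be nearly right bit by bit and still score zero on index accuracy.
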
### Inverse Folding

| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
|---|---:|---:|
| DPLM-2 650M | 0.4962 | 0.8816 |
| DPLM-2 3B | 0.5236 | 0.8900 |
| DPLM-2 Bit 650M | 0.5586 | 0.8907 |

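
AAR is the fraction of positions at which the predicted residue matches the native one. A minimal sketch follows, assuming the two sequences are already the same length and aligned position by position. The function name `amino_acid_recovery` and the example sequences are illustrative and not taken from the evaluation code or data.

```python
def amino_acid_recovery(predicted: str, native: str) -> float:
    """Fraction of positions where the predicted residue equals the native one."""
    if len(predicted) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


# Made-up 8-residue example with two mismatches (positions 4 and 8).
print(amino_acid_recovery("MKTAYIAK", "MKTVYIAR"))  # 0.75
```
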
### Representation Learning

| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
|---|---:|---:|
| SaProt | 86.41 | 85.57 |
| DPLM-2 650M | 84.44 | 82.98 |
| DPLM-2 Bit 650M | 88.89 | 83.39 |

### Unconditional Generation Diversity

| Model | Diversity |
|---|---:|
| DPLM-2 650M | 0.700 |
| DPLM-2 Bit 650M | 0.825 |

For full experimental settings, additional variants such as FM, ResDiff, Geo,
REPA, and SFT, and complete ablations, see the
[DPLM-2.1 paper](https://arxiv.org/abs/2504.11454).

## Citation

If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt,
EvoDiff, SaProt, ESM, LM-Design, EigenFold, MultiFlow, FrameFlow, and
OpenFold-related structure modeling utilities. See the official repository for
the complete acknowledgements and implementation details.