	---
	license: apache-2.0
	library_name: transformers
	tags:
	- biology
	- protein-language-model
	- protein-generation
	- protein-structure
	- diffusion
	- esm
	- pytorch
	- bitwise-modeling
	- arxiv:2410.13782
	- arxiv:2504.11454
	datasets:
	- airkingbd/pdb_swissprot
	---

	# DPLM-2 Bit 650M

	DPLM-2 Bit is a 650M-parameter multimodal diffusion protein language model for
	joint protein sequence and structure modeling. It is the bitwise structure-token
	variant of DPLM-2 introduced in
	[DPLM-2.1](https://arxiv.org/abs/2504.11454). It predicts the bits of each
	quantized structure feature rather than a single discrete structure-token index,
	which improves structure modeling over index-based prediction.

	For the official implementation, installation instructions, generation scripts,
	training configuration, and evaluation utilities, see the
	[bytedance/dplm](https://github.com/bytedance/dplm) repository.

	## Model Details

	- Model type: Multimodal discrete diffusion protein language model with
	bitwise structure-token prediction
	- Checkpoint: `airkingbd/dplm2_bit_650m`
	- Architecture: ESM-style transformer for DPLM-2 Bit (`EsmForDPLM2Bit`)
	- Scale: 650M parameters, 33 transformer layers, hidden size 1280, 20
	attention heads
	- Amino-acid vocabulary size: 33
	- Structure codebook: 8,192 structure codes (2^13), each represented by a
	13-bit latent structure feature
	- Base initialization: DPLM-2 Bit training is initialized from the pretrained
	DPLM sequence model `airkingbd/dplm_650m`
	- Structure tokenizer: Uses `airkingbd/struct_tokenizer`
	- License: Apache-2.0
	- Papers: [DPLM-2](https://arxiv.org/abs/2410.13782) and
	[DPLM-2.1](https://arxiv.org/abs/2504.11454)

	## Bitwise Modeling

	The original DPLM-2 models protein structures with discrete structure token
	indices produced by a structure tokenizer. In the DPLM-2.1 analysis, the authors
	identify index-based structure token prediction as a bottleneck: small changes
	in the underlying quantized bits can produce a very different token index, making
	the index classification target hard for the language model to learn.

	DPLM-2 Bit uses the LFQ structure tokenizer's bit-level representation directly.
	Instead of predicting one 8,192-way structure-token index per residue, it predicts
	each of the 13 bits of the quantized structure feature as a binary target. This
	turns structure prediction into 13 binary classifications per residue, provides
	finer-grained supervision, and reduces the difficulty of learning structural
	patterns from tokenized 3D structures.
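The relationship between a token index and its 13 bits can be illustrated with a small NumPy sketch. This is not the tokenizer's code: the helper names and the most-significant-bit-first ordering are assumptions made for illustration, and the released LFQ tokenizer may order or encode bits differently.

```python
import numpy as np

NUM_BITS = 13                   # bits per quantized structure feature
CODEBOOK_SIZE = 2 ** NUM_BITS   # 8,192 structure codes


def index_to_bits(index: np.ndarray) -> np.ndarray:
    """Unpack integer codes of shape (L,) into 0/1 bits of shape (L, 13)."""
    # Assumption: most significant bit first.
    shifts = np.arange(NUM_BITS - 1, -1, -1)
    return (index[..., None] >> shifts) & 1


def bits_to_index(bits: np.ndarray) -> np.ndarray:
    """Pack 0/1 bits of shape (L, 13) back into integer codes of shape (L,)."""
    weights = 1 << np.arange(NUM_BITS - 1, -1, -1)
    return (bits * weights).sum(axis=-1)


codes = np.array([0, 4095, 8191])
bits = index_to_bits(codes)
assert np.array_equal(bits_to_index(bits), codes)  # round trip is lossless

# Flip one bit of code 4095: 12 of the 13 binary targets are unchanged,
# but as a class index the result is a different, distant code.
flipped = bits.copy()
flipped[1, 0] ^= 1
hamming_distance = int((flipped[1] != bits[1]).sum())
index_distance = int(bits_to_index(flipped)[1] - codes[1])
print(hamming_distance, index_distance)
```

Under this encoding, a one-bit difference counts as a complete miss for an 8,192-way classifier, whereas 13 binary targets still credit the 12 bits that match.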

	## Quick Start

	Install the official DPLM codebase and dependencies:

	```bash
	git clone --recursive https://github.com/bytedance/dplm.git
	cd dplm

	conda create -n dplm python=3.9 pip
	conda activate dplm
	bash scripts/install.sh
	```

	Load the pretrained DPLM-2 Bit checkpoint:

	```python
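	from byprot.models.dplm2 import DPLM2Bit

	# Download the checkpoint from the Hugging Face Hub and move it to the GPU.
	# Remove `.cuda()` to keep the model on CPU.
	dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
	dplm2_bit = dplm2_bit.eval()  # inference mode: disables dropout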
	```

	### Sequence-Structure Co-Generation

	Run `generate_dplm2.py` with the `--bit_model` flag. The official repository's
	DPLM-2 Bit co-generation example uses the `annealing@1.1:0.1` sampling strategy:

	```bash
	model_name=dplm2_bit_650m
	sampling_strategy=annealing@1.1:0.1
	output_dir=generation-results/${model_name}

	python generate_dplm2.py \
	    --model_name "airkingbd/${model_name}" \
	    --task co_generation \
	    --bit_model \
	    --sampling_strategy "${sampling_strategy}" \
	    --num_seqs 50 \
	    --max_iter 500 \
	    --seq_lens 100 200 300 400 500 \
	    --saveto "${output_dir}"
	```
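This card does not explain the `annealing@1.1:0.1` string. One plausible reading is a sampling temperature that decreases from 1.1 to 0.1 over the `--max_iter` denoising steps. That reading is an assumption to check against the official repository. The sketch below shows it with a linear schedule. The helper names `parse_annealing` and `temperature_at` are hypothetical, and the linear form is a guess.

```python
def parse_annealing(strategy: str) -> tuple[float, float]:
    """Split a string like 'annealing@1.1:0.1' into (start, end) temperatures."""
    name, _, temps = strategy.partition("@")
    if name != "annealing" or ":" not in temps:
        raise ValueError(f"not an annealing strategy: {strategy!r}")
    start, end = temps.split(":")
    return float(start), float(end)


def temperature_at(step: int, max_iter: int, start: float, end: float) -> float:
    """Linearly interpolate from `start` at step 0 to `end` at step max_iter."""
    remaining = 1.0 - step / max_iter  # fraction of the schedule still to run
    return end + (start - end) * remaining


start, end = parse_annealing("annealing@1.1:0.1")
for step in (0, 250, 500):
    print(step, round(temperature_at(step, 500, start, end), 3))
```

A schedule of this kind samples more randomly in early iterations and more greedily as generation converges.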

	### Forward Folding

	DPLM-2 Bit can generate structures conditioned on amino-acid sequences:

	```bash
	model_name=dplm2_bit_650m
	output_dir=generation-results/${model_name}

	python generate_dplm2.py \
	    --model_name "airkingbd/${model_name}" \
	    --task folding \
	    --bit_model \
	    --input_fasta_path data-bin/cameo2022/aatype.fasta \
	    --max_iter 100 \
	    --unmasking_strategy deterministic \
	    --sampling_strategy argmax \
	    --saveto "${output_dir}"
	```

	### Inverse Folding

	DPLM-2 Bit can predict amino-acid sequences conditioned on tokenized protein
	structures:

	```bash
	model_name=dplm2_bit_650m
	output_dir=generation-results/${model_name}

	python generate_dplm2.py \
	    --model_name "airkingbd/${model_name}" \
	    --task inverse_folding \
	    --bit_model \
	    --input_fasta_path data-bin/cameo2022/struct.fasta \
	    --max_iter 100 \
	    --unmasking_strategy deterministic \
	    --sampling_strategy argmax \
	    --saveto "${output_dir}"
	```

	For custom structures, first tokenize PDB files with the released structure
	tokenizer:

	```bash
	python src/byprot/utils/protein/tokenize_pdb.py \
	    --input_pdb_folder /path/to/input/pdbs \
	    --output_dir /path/to/output/tokenized_protein
	```

	Then pass the generated structure-token FASTA file to `generate_dplm2.py` through
	`--input_fasta_path`, as in the inverse folding command above.

	## Training Data and Training Procedure

	DPLM-2 Bit uses the same PDB and SwissProt-derived structure data as DPLM-2. The
	authors provide the preprocessed training dataset on Hugging Face as
	[airkingbd/pdb_swissprot](https://huggingface.co/datasets/airkingbd/pdb_swissprot).

	The official DPLM repository provides the DPLM-2 Bit experiment configuration at
	`configs/experiment/dplm2/dplm2_bit_650m.yaml`. The configuration initializes
	from `airkingbd/dplm_650m`, uses `airkingbd/dplm2_650m` as the tokenizer
	vocabulary source, and uses `airkingbd/struct_tokenizer` for structure
	tokenization.

	## Experimental Results

	The tables below summarize selected results reported in the DPLM-2.1 paper.
	Lower RMSD is better; higher TM-score, amino-acid recovery (AAR), accuracy, and
	diversity are better.

	### Forward Folding

	| Model | CAMEO 2022 RMSD | CAMEO 2022 TM-score | PDB Date RMSD | PDB Date TM-score |
	|---|---:|---:|---:|---:|
	| DPLM-2 650M | 7.7025 | 0.7936 | 5.3071 | 0.8306 |
	| DPLM-2 Bit 650M | 6.4028 | 0.8380 | 3.2213 | 0.9043 |

	### Structure-Token Prediction Accuracy

	| Model | Test Set | Index Acc. | Bit Acc. | RMSD | TM-score |
	|---|---|---:|---:|---:|---:|
	| DPLM-2 650M | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
	| DPLM-2 650M | PDB Date | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
	| DPLM-2 Bit 650M | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
	| DPLM-2 Bit 650M | PDB Date | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
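Index accuracy is far lower than bit accuracy in this table, which a toy calculation can make concrete. The sketch assumes that bit accuracy is the fraction of individual bits predicted correctly and that index accuracy is the fraction of residues whose 13 bits are all correct. These definitions and the function names are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def bit_accuracy(pred_bits: np.ndarray, true_bits: np.ndarray) -> float:
    """Fraction of individual bits that match, over all residues and bits."""
    return float((pred_bits == true_bits).mean())


def index_accuracy(pred_bits: np.ndarray, true_bits: np.ndarray) -> float:
    """Fraction of residues whose bits all match, i.e. the same token index."""
    return float((pred_bits == true_bits).all(axis=-1).mean())


# Toy data: 4 residues with 13 bits each (values are made up for illustration).
true_bits = np.zeros((4, 13), dtype=int)
pred_bits = true_bits.copy()
pred_bits[0, 0] = 1    # residue 0: one wrong bit
pred_bits[1, :3] = 1   # residue 1: three wrong bits

print(bit_accuracy(pred_bits, true_bits))    # 48 of 52 bits correct
print(index_accuracy(pred_bits, true_bits))  # 2 of 4 residues fully correct
```

Because a single wrong bit changes the index, index accuracy can stay low even when most bits are right.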

	### Inverse Folding

	| Model | CAMEO 2022 AAR | CAMEO 2022 TM-score |
	|---|---:|---:|
	| DPLM-2 650M | 0.4962 | 0.8816 |
	| DPLM-2 3B | 0.5236 | 0.8900 |
	| DPLM-2 Bit 650M | 0.5586 | 0.8907 |
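AAR (amino-acid recovery) is commonly computed as the fraction of positions where the predicted residue matches the native one. The sketch below shows that per-protein calculation. The function name and example sequences are made up for illustration, and this card does not state how the paper aggregates scores across proteins.

```python
def amino_acid_recovery(predicted: str, native: str) -> float:
    """Fraction of aligned positions where the predicted residue equals the native one."""
    if len(predicted) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


# Hypothetical 8-residue example with two mismatches (positions 3 and 6).
print(amino_acid_recovery("MKTAYIAK", "MKSAYLAK"))
```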

	### Representation Learning

	| Model | Human PPI Accuracy (%) | DeepLoc Subcellular Accuracy (%) |
	|---|---:|---:|
	| SaProt | 86.41 | 85.57 |
	| DPLM-2 650M | 84.44 | 82.98 |
	| DPLM-2 Bit 650M | 88.89 | 83.39 |

	### Unconditional Generation Diversity

	| Model | Diversity |
	|---|---:|
	| DPLM-2 650M | 0.700 |
	| DPLM-2 Bit 650M | 0.825 |

	For full experimental settings, additional variants such as FM, ResDiff, Geo,
	REPA, and SFT, and complete ablations, see the
	[DPLM-2.1 paper](https://arxiv.org/abs/2504.11454).

	## Citation

	If you use this checkpoint, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:

	```bibtex
	@inproceedings{wang2024dplm,
	  title={Diffusion Language Models Are Versatile Protein Learners},
	  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
	  booktitle={International Conference on Machine Learning},
	  year={2024}
	}

	@inproceedings{wang2025dplm2,
	  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
	  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
	  booktitle={International Conference on Learning Representations},
	  year={2025}
	}

	@inproceedings{hsieh2025dplm2_1,
	  title={Elucidating the Design Space of Multimodal Protein Language Models},
	  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
	  booktitle={International Conference on Machine Learning},
	  year={2025}
	}
	```

	## Acknowledgements

	DPLM builds on prior work and resources, including ByProt, EvoDiff, SaProt, ESM,
	LM-Design, EigenFold, MultiFlow, FrameFlow, and OpenFold-related structure
	modeling utilities. See the official repository for the complete
	acknowledgements and implementation details.