	---
	license: apache-2.0
	library_name: pytorch
	tags:
	- biology
	- protein
	- protein-structure
	- protein-structure-tokenizer
	- structure-tokenizer
	- dplm-2
	- pytorch
	- arxiv:2410.13782
	- arxiv:2504.11454
	datasets:
	- airkingbd/pdb_swissprot
	---

	# DPLM-2 Structure Tokenizer

	This repository contains the structure tokenizer used by DPLM-2, a multimodal
	diffusion protein language model for joint protein sequence and structure
	modeling. The tokenizer converts protein backbone/atom coordinates into
	discrete structure tokens and can decode structure tokens back into protein
	structures. DPLM-2 uses these tokens to support sequence-structure
	co-generation, forward folding, inverse folding, and motif scaffolding.

	For the official implementation, installation instructions, DPLM-2 generation
	scripts, and evaluation utilities, see the
	[bytedance/dplm](https://github.com/bytedance/dplm) repository.

	## Model Details

	- Checkpoint: `airkingbd/struct_tokenizer`
	- Files: `config.yaml`, `dplm2_struct_tokenizer.ckpt`
	- Model class: `byprot.models.structok.structok_lfq.VQModel`
	- Tokenizer type: LFQ-based discrete protein structure tokenizer
	- Codebook size: 8,192 structure tokens (`2^13`)
	- Codebook embedding dimension: 13
	- Encoder: GVP-based structure encoder
	- Decoder: ESMFold-style structure decoder with decoder input dimension 128
	- License: Apache-2.0
	- Paper: [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)
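The codebook size and the embedding dimension listed above are linked. In lookup-free quantization (LFQ), each latent dimension is binarized by sign, so a 13-dimensional latent yields `2^13 = 8,192` possible codes and needs no learned embedding table. The snippet below is a minimal sketch of that index arithmetic, not the code in `structok_lfq`; the function names and the bit ordering are assumptions made for illustration.

```python
from typing import List

CODE_DIM = 13  # codebook embedding dimension from the model details above


def lfq_quantize(latent: List[float]) -> List[int]:
    """Binarize each latent dimension by sign: positive -> +1, otherwise -1."""
    return [1 if x > 0 else -1 for x in latent]


def lfq_index(code: List[int]) -> int:
    """Pack a {-1, +1} code into an integer token id.

    Bit i is set when code[i] == +1. This bit ordering is an assumption.
    """
    return sum(1 << i for i, c in enumerate(code) if c > 0)


def lfq_code(index: int, dim: int = CODE_DIM) -> List[int]:
    """Inverse of lfq_index: unpack an integer token id into a {-1, +1} code."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


num_codes = 2 ** CODE_DIM  # 8192 distinct structure tokens

# Toy latent: only dimensions 0 and 3 are positive, so bits 0 and 3 are set.
latent = [0.3, -1.2, 0.0, 2.5] + [-0.1] * (CODE_DIM - 4)
token = lfq_index(lfq_quantize(latent))
print(num_codes, token)
```

Because every sign pattern maps to exactly one integer, the token id and the binary code can be converted back and forth without a lookup table.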

	## Quick Start

	Install the official DPLM codebase and dependencies:

	```bash
	git clone --recursive https://github.com/bytedance/dplm.git
	cd dplm

	conda create -n dplm python=3.9 pip
	conda activate dplm
	bash scripts/install.sh
	```

	Load the released structure tokenizer:

	```python
	from byprot.models.utils import get_struct_tokenizer

	struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
	struct_tokenizer = struct_tokenizer.cuda().eval()
	```

	The helper downloads this repository from Hugging Face, reads `config.yaml`,
	constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.

	## Tokenize PDB Structures

	The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for
	converting PDB files into structure-token FASTA files:

	```bash
	python src/byprot/utils/protein/tokenize_pdb.py \
	    --input_pdb_folder /path/to/input/pdbs \
	    --output_dir /path/to/output/tokenized_protein
	```

	The script processes `*.pdb` files in the input folder and writes:

	- `struct_seq.fasta`: tokenized structure sequences
	- `aa_seq.fasta`: amino-acid sequences extracted from the same structures

	The structure sequences can be used as DPLM-2 structure-conditioning inputs.
	For example, pass the generated structure-token FASTA file to
	`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
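To inspect the two output files side by side, a small FASTA reader is enough. The sketch below pairs records from `struct_seq.fasta` and `aa_seq.fasta` by header. It assumes both files use the same record names for the same structure, and it returns the structure tokens as the raw string found in the file, since the token encoding is not described here. `read_fasta` and `pair_records` are hypothetical helpers, not part of the DPLM codebase.

```python
from pathlib import Path
from typing import Dict, Tuple


def read_fasta(path: str) -> Dict[str, str]:
    """Parse a FASTA file into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def pair_records(struct_fasta: str, aa_fasta: str) -> Dict[str, Tuple[str, str]]:
    """Return {header: (structure_tokens, amino_acids)} for headers in both files."""
    struct_records = read_fasta(struct_fasta)
    aa_records = read_fasta(aa_fasta)
    return {
        name: (struct_records[name], aa_records[name])
        for name in struct_records
        if name in aa_records
    }
```

Calling `pair_records("/path/to/output/tokenized_protein/struct_seq.fasta", "/path/to/output/tokenized_protein/aa_seq.fasta")` would return one entry per structure present in both files.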


	## Use with DPLM-2

	DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
	For example:

	```python
	from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

	dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
	struct_tokenizer = dplm2.struct_tokenizer
	```

	The DPLM-2 configs point to this repository with:

	```yaml
	struct_tokenizer:
	  exp_path: airkingbd/struct_tokenizer
	```


	## Citation

	If you use this tokenizer, please cite the DPLM and DPLM-2 papers and the
	follow-up study on the design space of multimodal protein language models:

	```bibtex
	@inproceedings{wang2024dplm,
	  title={Diffusion Language Models Are Versatile Protein Learners},
	  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
	  booktitle={International Conference on Machine Learning},
	  year={2024}
	}

	@inproceedings{wang2025dplm2,
	  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
	  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
	  booktitle={International Conference on Learning Representations},
	  year={2025}
	}

	@inproceedings{hsieh2025dplm2_1,
	  title={Elucidating the Design Space of Multimodal Protein Language Models},
	  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
	  booktitle={International Conference on Machine Learning},
	  year={2025}
	}
	```

	## Acknowledgements

	DPLM builds on prior work and resources, including ByProt, ESM,
	OpenFold-related structure modeling utilities, EigenFold, and MultiFlow. See
	the official repository for complete acknowledgements and implementation
	details.