---
license: apache-2.0
library_name: pytorch
tags:
- biology
- protein
- protein-structure
- protein-structure-tokenizer
- structure-tokenizer
- dplm-2
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---
# DPLM-2 Structure Tokenizer
This repository contains the structure tokenizer used by DPLM-2, a multimodal
diffusion protein language model for joint protein sequence and structure
modeling. The tokenizer converts protein backbone/atom coordinates into
discrete structure tokens and can decode structure tokens back into protein
structures. DPLM-2 uses these tokens to support sequence-structure
co-generation, forward folding, inverse folding, and motif scaffolding.
For the official implementation, installation instructions, DPLM-2 generation
scripts, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.
## Model Details
- **Checkpoint:** `airkingbd/struct_tokenizer`
- **Files:** `config.yaml`, `dplm2_struct_tokenizer.ckpt`
- **Model class:** `byprot.models.structok.structok_lfq.VQModel`
- **Tokenizer type:** LFQ-based discrete protein structure tokenizer
- **Codebook size:** 8,192 structure tokens (`2^13`)
- **Codebook embedding dimension:** 13
- **Encoder:** GVP-based structure encoder
- **Decoder:** ESMFold-style structure decoder with decoder input dimension 128
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)
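The codebook size and embedding dimension above are linked: in lookup-free quantization (LFQ), each latent dimension is binarized by its sign, so 13 dimensions yield `2^13 = 8,192` possible codes without a learned lookup table. The following is a minimal sketch of that mapping, not the released `VQModel` code. The function names `lfq_token_index` and `lfq_code` are hypothetical, and the bit ordering (dimension `i` contributes bit `i`) is an assumption that may differ from the actual implementation.

```python
from typing import List, Sequence

CODEBOOK_DIM = 13                  # codebook embedding dimension listed above
CODEBOOK_SIZE = 2 ** CODEBOOK_DIM  # 8,192 structure tokens


def lfq_token_index(latent: Sequence[float]) -> int:
    """Binarize each latent dimension by sign and pack the bits into an index.

    Assumption: dimension i maps to bit i (least-significant first).
    """
    if len(latent) != CODEBOOK_DIM:
        raise ValueError(f"expected {CODEBOOK_DIM} values, got {len(latent)}")
    index = 0
    for i, value in enumerate(latent):
        if value > 0:  # positive -> bit 1, non-positive -> bit 0
            index |= 1 << i
    return index


def lfq_code(index: int) -> List[float]:
    """Recover the +1/-1 code vector for a token index (inverse of the above)."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError(f"index must be in [0, {CODEBOOK_SIZE}), got {index}")
    return [1.0 if (index >> i) & 1 else -1.0 for i in range(CODEBOOK_DIM)]
```

Under this bit ordering, every index in `[0, 8192)` round-trips through `lfq_code` and `lfq_token_index`, which is why no explicit codebook table is needed.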
## Quick Start
Install the official DPLM codebase and dependencies:
```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm
conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```
Load the released structure tokenizer:
```python
from byprot.models.utils import get_struct_tokenizer
struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()
```
The helper downloads this repository from Hugging Face, reads `config.yaml`,
constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.
## Tokenize PDB Structures
The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for
converting PDB files into structure-token FASTA files:
```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```
The script processes `*.pdb` files in the input folder and writes:
- `struct_seq.fasta`: tokenized structure sequences
- `aa_seq.fasta`: amino-acid sequences extracted from the same structures
The structure sequences can be used as DPLM-2 structure-conditioning inputs.
For example, pass the generated structure-token FASTA file to
`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
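The two FASTA files written by the tokenization script can be read back and paired by record ID for inspection. The sketch below uses only the standard library. The helpers `read_fasta` and `pair_records` are hypothetical (they are not part of the DPLM codebase), and it assumes both files use the same record IDs. Each record body is kept as an opaque string, so no assumption is made about how structure tokens are serialized.

```python
from typing import Dict, Tuple


def read_fasta(path: str) -> Dict[str, str]:
    """Parse a FASTA file into {record_id: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                header = line[1:].split()
                name = header[0] if header else ""  # ID is the first header word
                chunks = []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def pair_records(struct_fasta: str, aa_fasta: str) -> Dict[str, Tuple[str, str]]:
    """Return {record_id: (structure_tokens, amino_acids)} for shared IDs."""
    struct_records = read_fasta(struct_fasta)
    aa_records = read_fasta(aa_fasta)
    return {
        name: (tokens, aa_records[name])
        for name, tokens in struct_records.items()
        if name in aa_records
    }
```

For example, `pair_records("/path/to/output/tokenized_protein/struct_seq.fasta", "/path/to/output/tokenized_protein/aa_seq.fasta")` returns one entry per structure that appears in both files.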
## Use with DPLM-2
DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
For example:
```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
```
The DPLM-2 configs point to this repository with:
```yaml
struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer
```
## Citation
If you use this tokenizer, please cite the DPLM papers listed below:
```bibtex
@inproceedings{wang2024dplm,
title={Diffusion Language Models Are Versatile Protein Learners},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2024}
}
@inproceedings{wang2025dplm2,
title={DPLM-2: A Multimodal Diffusion Protein Language Model},
author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{hsieh2025dplm2_1,
title={Elucidating the Design Space of Multimodal Protein Language Models},
author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
booktitle={International Conference on Machine Learning},
year={2025}
}
```
## Acknowledgements
DPLM builds on prior work and resources, including ByProt, ESM,
OpenFold-related structure modeling utilities, EigenFold, and MultiFlow.
See the official repository for complete acknowledgements and implementation
details.