---
license: apache-2.0
library_name: pytorch
tags:
- biology
- protein
- protein-structure
- protein-structure-tokenizer
- structure-tokenizer
- dplm-2
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 Structure Tokenizer

This repository contains the structure tokenizer used by DPLM-2, a multimodal
diffusion protein language model for joint protein sequence and structure
modeling. The tokenizer converts protein backbone/atom coordinates into
discrete structure tokens and can decode structure tokens back into protein
structures. DPLM-2 uses these tokens to support sequence-structure
co-generation, forward folding, inverse folding, and motif scaffolding.

For the official implementation, installation instructions, DPLM-2 generation
scripts, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Checkpoint:** `airkingbd/struct_tokenizer`
- **Files:** `config.yaml`, `dplm2_struct_tokenizer.ckpt`
- **Model class:** `byprot.models.structok.structok_lfq.VQModel`
- **Tokenizer type:** LFQ-based discrete protein structure tokenizer
- **Codebook size:** 8,192 structure tokens (`2^13`)
- **Codebook embedding dimension:** 13
- **Encoder:** GVP-based structure encoder
- **Decoder:** ESMFold-style structure decoder with input dimension 128
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)
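
The codebook size and embedding dimension listed above are related. In lookup-free quantization (LFQ), each latent dimension is binarized by its sign, so 13 dimensions yield `2^13 = 8,192` distinct codes. The sketch below illustrates that general idea only; it is not the code in `VQModel`, and the function names and bit ordering are illustrative assumptions.

```python
from typing import List, Sequence

CODE_DIM = 13  # codebook embedding dimension listed above


def lfq_index(latent: Sequence[float]) -> int:
    """Map a latent vector to a token index from the sign of each dimension.

    Hypothetical helper: the bit ordering is an assumption for illustration.
    """
    if len(latent) != CODE_DIM:
        raise ValueError(f"expected {CODE_DIM} values, got {len(latent)}")
    index = 0
    for bit, value in enumerate(latent):
        if value > 0:  # a positive entry sets the corresponding bit
            index |= 1 << bit
    return index


def lfq_code(index: int) -> List[float]:
    """Recover the {-1, +1} code vector for a token index (inverse of lfq_index)."""
    if not 0 <= index < 2 ** CODE_DIM:
        raise ValueError(f"index {index} is outside the codebook")
    return [1.0 if (index >> bit) & 1 else -1.0 for bit in range(CODE_DIM)]


print(2 ** CODE_DIM)                 # 8192 possible codes
print(lfq_index([-0.3] * CODE_DIM))  # 0, every bit cleared
print(lfq_index([0.7] * CODE_DIM))   # 8191, every bit set
```

Because the codes are implied by the sign pattern, no explicit embedding table has to be searched at quantization time, which is what "lookup-free" refers to.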

## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the released structure tokenizer:

```python
from byprot.models.utils import get_struct_tokenizer

struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()
```

The helper downloads this repository from Hugging Face, reads `config.yaml`,
constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.
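
If loading fails, it can help to confirm that both files listed under Model Details are present locally. The following is a minimal sketch for such a manual check, not a description of how `get_struct_tokenizer` works internally. `missing_checkpoint_files` and `download_tokenizer_repo` are hypothetical helper names; the download step assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function.

```python
from pathlib import Path
from typing import List

# File names taken from the "Files" entry under Model Details.
EXPECTED_FILES = ("config.yaml", "dplm2_struct_tokenizer.ckpt")


def missing_checkpoint_files(local_dir: str) -> List[str]:
    """Return the expected file names that are absent from local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def download_tokenizer_repo(repo_id: str = "airkingbd/struct_tokenizer") -> str:
    """Download the repository snapshot and return its local path.

    huggingface_hub is imported lazily so that the check above can be used
    without it (assumption: the package is available when this is called).
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling `missing_checkpoint_files(download_tokenizer_repo())` should return an empty list when the snapshot is complete.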

## Tokenize PDB Structures

The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for
converting PDB files into structure-token FASTA files:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```

The script processes `*.pdb` files in the input folder and writes:

- `struct_seq.fasta`: tokenized structure sequences
- `aa_seq.fasta`: amino-acid sequences extracted from the same structures

The structure sequences can be used as DPLM-2 structure-conditioning inputs.
For example, pass the generated structure-token FASTA file to
`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
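
To inspect the two output files side by side, a small reader along the following lines may be useful. It is a sketch under two assumptions that this card does not confirm: both files use the standard FASTA layout (`>` header lines followed by record lines), and matching records share the same header. Record bodies are kept as raw strings because the exact encoding of structure tokens is not described here. `read_fasta` and `pair_records` are hypothetical helper names.

```python
from typing import Dict, Iterable, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a {header: body} mapping."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # join bodies wrapped over several lines
    return records


def pair_records(
    struct_records: Dict[str, str], aa_records: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Pair structure and amino-acid records that share a header."""
    return {
        name: (struct_records[name], aa_records[name])
        for name in struct_records
        if name in aa_records
    }


# In-memory stand-ins for struct_seq.fasta and aa_seq.fasta (placeholder bodies).
struct_demo = [">protein_a", "STRUCT_PLACEHOLDER"]
aa_demo = [">protein_a", "AA_PLACEHOLDER", ">protein_b", "OTHER"]
pairs = pair_records(read_fasta(struct_demo), read_fasta(aa_demo))
print(sorted(pairs))  # ['protein_a']
```

For real files, pass an open file handle, for example `read_fasta(open("struct_seq.fasta"))`, since a file object iterates over its lines.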


## Use with DPLM-2

DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
For example:

```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
```

The DPLM-2 configs point to this repository with:

```yaml
struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer
```


## Citation

If you use this tokenizer, please cite the DPLM and DPLM-2 papers and the
related design-space study:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on prior work and resources, including ByProt, ESM,
OpenFold-related structure modeling utilities, EigenFold, and MultiFlow.
See the official repository for complete acknowledgements and implementation
details.