airkingbd committed on
Commit
dd9b86f
·
1 Parent(s): 3b5312c

Add model card

Files changed (1)
  1. README.md +142 -0
README.md ADDED
@@ -0,0 +1,142 @@
+ ---
+ license: apache-2.0
+ library_name: pytorch
+ tags:
+ - biology
+ - protein
+ - protein-structure
+ - protein-structure-tokenizer
+ - structure-tokenizer
+ - dplm-2
+ - pytorch
+ - arxiv:2410.13782
+ - arxiv:2504.11454
+ datasets:
+ - airkingbd/pdb_swissprot
+ ---
+
+ # DPLM-2 Structure Tokenizer
+
+ This repository contains the structure tokenizer used by DPLM-2, a multimodal
+ diffusion protein language model for joint protein sequence and structure
+ modeling. The tokenizer converts protein backbone/atom coordinates into
+ discrete structure tokens and can decode structure tokens back into protein
+ structures. DPLM-2 uses these tokens to support sequence-structure
+ co-generation, forward folding, inverse folding, and motif scaffolding.
+
+ For the official implementation, installation instructions, DPLM-2 generation
+ scripts, and evaluation utilities, see the
+ [bytedance/dplm](https://github.com/bytedance/dplm) repository.
+
+ ## Model Details
+
+ - **Checkpoint:** `airkingbd/struct_tokenizer`
+ - **Files:** `config.yaml`, `dplm2_struct_tokenizer.ckpt`
+ - **Model class:** `byprot.models.structok.structok_lfq.VQModel`
+ - **Tokenizer type:** LFQ-based discrete protein structure tokenizer
+ - **Codebook size:** 8,192 structure tokens (`2^13`)
+ - **Codebook embedding dimension:** 13
+ - **Encoder:** GVP-based structure encoder
+ - **Decoder:** ESMFold-style structure decoder with decoder input dimension 128
+ - **License:** Apache-2.0
+ - **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)
+
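To make the codebook arithmetic in the list above concrete, here is a minimal sketch of lookup-free quantization (LFQ) in plain Python. It is not the repository's `VQModel` code. It assumes the standard LFQ formulation, in which each of the 13 latent dimensions is binarized by its sign, so a token index is the 13 sign bits read as an integer and there are `2^13 = 8,192` possible tokens. The function names and the bit ordering (dimension `i` contributes `2**i`) are illustrative assumptions; the released checkpoint may order bits differently.

```python
from typing import List, Sequence

CODE_DIM = 13                  # codebook embedding dimension from the model card
CODEBOOK_SIZE = 2 ** CODE_DIM  # 8,192 structure tokens


def lfq_quantize(latent: Sequence[float]) -> List[int]:
    """Binarize each latent dimension by sign: positive -> +1, otherwise -1."""
    if len(latent) != CODE_DIM:
        raise ValueError(f"expected {CODE_DIM} dimensions, got {len(latent)}")
    return [1 if x > 0 else -1 for x in latent]


def lfq_index(latent: Sequence[float]) -> int:
    """Map a latent vector to a token index in [0, CODEBOOK_SIZE).

    Assumed bit ordering (hypothetical): dimension i contributes 2**i
    when its quantized value is +1.
    """
    codes = lfq_quantize(latent)
    return sum(1 << i for i, c in enumerate(codes) if c > 0)


def lfq_decode(index: int) -> List[int]:
    """Recover the +/-1 code vector from a token index (inverse of lfq_index)."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError(f"index must be in [0, {CODEBOOK_SIZE})")
    return [1 if (index >> i) & 1 else -1 for i in range(CODE_DIM)]
```

Because the code vector is fully determined by the index, no embedding table lookup is needed to quantize, which is what "lookup-free" refers to.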
+ ## Quick Start
+
+ Install the official DPLM codebase and dependencies:
+
+ ```bash
+ git clone --recursive https://github.com/bytedance/dplm.git
+ cd dplm
+
+ conda create -n dplm python=3.9 pip
+ conda activate dplm
+ bash scripts/install.sh
+ ```
+
+ Load the released structure tokenizer:
+
+ ```python
+ from byprot.models.utils import get_struct_tokenizer
+
+ struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
+ struct_tokenizer = struct_tokenizer.cuda().eval()
+ ```
+
+ The helper downloads this repository from Hugging Face, reads `config.yaml`,
+ constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.
+
+ ## Tokenize PDB Structures
+
+ The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for
+ converting PDB files into structure-token FASTA files:
+
+ ```bash
+ python src/byprot/utils/protein/tokenize_pdb.py \
+     --input_pdb_folder /path/to/input/pdbs \
+     --output_dir /path/to/output/tokenized_protein
+ ```
+
+ The script processes `*.pdb` files in the input folder and writes:
+
+ - `struct_seq.fasta`: tokenized structure sequences
+ - `aa_seq.fasta`: amino-acid sequences extracted from the same structures
+
+ The structure sequences can be used as DPLM-2 structure-conditioning inputs.
+ For example, pass the generated structure-token FASTA file to
+ `generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
+
+
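As a rough illustration of working with the two output files described above, the sketch below reads FASTA records and pairs each structure-token record with its amino-acid record. It assumes, without confirmation from the script itself, that both files use the same record identifiers in their header lines. The helper names `read_fasta` and `pair_records` are hypothetical and not part of the DPLM codebase, and the record bodies are returned as uninterpreted strings because the exact token encoding is not specified here.

```python
import io
from typing import Dict, Iterable, Tuple


def read_fasta(handle: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: body}, joining wrapped body lines."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""  # ID is the first header word
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def pair_records(
    struct: Dict[str, str], aa: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Keep only IDs present in both files (assumed to share identifiers)."""
    return {key: (struct[key], aa[key]) for key in struct if key in aa}
```

With real outputs, open `struct_seq.fasta` and `aa_seq.fasta` from the output directory and pass the file handles to `read_fasta`; any record missing from either file is dropped by `pair_records` rather than raising an error.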
+ ## Use with DPLM-2
+
+ DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
+ For example:
+
+ ```python
+ from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
+
+ dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
+ struct_tokenizer = dplm2.struct_tokenizer
+ ```
+
+ The DPLM-2 configs point to this repository with:
+
+ ```yaml
+ struct_tokenizer:
+   exp_path: airkingbd/struct_tokenizer
+ ```
+
+
+ ## Citation
+
+ If you use this tokenizer, please cite the DPLM and DPLM-2 papers:
+
+ ```bibtex
+ @inproceedings{wang2024dplm,
+   title={Diffusion Language Models Are Versatile Protein Learners},
+   author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
+   booktitle={International Conference on Machine Learning},
+   year={2024}
+ }
+
+ @inproceedings{wang2025dplm2,
+   title={DPLM-2: A Multimodal Diffusion Protein Language Model},
+   author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
+   booktitle={International Conference on Learning Representations},
+   year={2025}
+ }
+
+ @inproceedings{hsieh2025dplm2_1,
+   title={Elucidating the Design Space of Multimodal Protein Language Models},
+   author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
+   booktitle={International Conference on Machine Learning},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgements
+
+ DPLM builds on and acknowledges prior work and resources including ByProt,
+ ESM, OpenFold-related structure modeling utilities, EigenFold, and MultiFlow.
+ See the official repository for complete acknowledgements and implementation
+ details.