+---
+license: apache-2.0
+library_name: pytorch
+tags:
+- biology
+- protein
+- protein-structure
+- protein-structure-tokenizer
+- structure-tokenizer
+- dplm-2
+- pytorch
+- arxiv:2410.13782
+- arxiv:2504.11454
+datasets:
+- airkingbd/pdb_swissprot
+---
+# DPLM-2 Structure Tokenizer
+This repository contains the structure tokenizer used by DPLM-2, a multimodal
+diffusion protein language model for joint protein sequence and structure
+modeling. The tokenizer converts protein backbone/atom coordinates into
+discrete structure tokens and can decode structure tokens back into protein
+structures. DPLM-2 uses these tokens to support sequence-structure
+co-generation, forward folding, inverse folding, and motif scaffolding.
+For the official implementation, installation instructions, DPLM-2 generation
+scripts, and evaluation utilities, see the
+[bytedance/dplm](https://github.com/bytedance/dplm) repository.
+## Model Details
+- **Checkpoint:** `airkingbd/struct_tokenizer`
+- **Files:** `config.yaml`, `dplm2_struct_tokenizer.ckpt`
+- **Model class:** `byprot.models.structok.structok_lfq.VQModel`
+- **Tokenizer type:** LFQ-based discrete protein structure tokenizer
+- **Codebook size:** 8,192 structure tokens (`2^13`)
+- **Codebook embedding dimension:** 13
+- **Encoder:** GVP-based structure encoder
+- **Decoder:** ESMFold-style structure decoder with decoder input dimension 128
+- **License:** Apache-2.0
+- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)
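The codebook size and embedding dimension listed above are linked: with lookup-free quantization (LFQ), each of the 13 latent dimensions is reduced to a binary value, so there are `2^13 = 8,192` possible codes and no learned lookup table. The sketch below illustrates that arithmetic only. The function names, the sign convention, and the bit ordering are assumptions made for illustration, and the released `VQModel` may index its codes differently.

```python
# Illustrative sketch of LFQ index arithmetic; not the released implementation.
CODEBOOK_DIM = 13  # codebook embedding dimension from the model card


def lfq_index(latent):
    """Map a real-valued latent vector to a token id from the sign of each dimension."""
    if len(latent) != CODEBOOK_DIM:
        raise ValueError(f"expected {CODEBOOK_DIM} dimensions, got {len(latent)}")
    index = 0
    for bit_position, value in enumerate(latent):
        # Assumed convention: a positive value sets the bit, anything else clears it.
        if value > 0:
            index |= 1 << bit_position
    return index


def lfq_code(index):
    """Inverse mapping: expand a token id into its vector of -1.0 / +1.0 entries."""
    return [1.0 if (index >> i) & 1 else -1.0 for i in range(CODEBOOK_DIM)]


codebook_size = 2 ** CODEBOOK_DIM
print(codebook_size)  # 8192

# Only dimensions 0 and 2 are positive, so bits 0 and 2 are set.
example_latent = [0.7, -1.2, 0.1] + [-0.5] * 10
print(lfq_index(example_latent))  # 5
```

Because every code is determined by its sign pattern, the round trip `lfq_index(lfq_code(i)) == i` holds for every token id in this sketch.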
+## Quick Start
+Install the official DPLM codebase and dependencies:
+```bash
+git clone --recursive https://github.com/bytedance/dplm.git
+cd dplm
+conda create -n dplm python=3.9 pip
+conda activate dplm
+bash scripts/install.sh
+```
+Load the released structure tokenizer:
+```python
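+from byprot.models.utils import get_struct_tokenizer
+# Fetch the released checkpoint and build the tokenizer.
+struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
+# Move the model to the GPU and switch to inference mode.
+struct_tokenizer = struct_tokenizer.cuda().eval()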
+```
+The helper downloads this repository from Hugging Face, reads `config.yaml`,
+constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.
+## Tokenize PDB Structures
+The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for
+converting PDB files into structure-token FASTA files:
+```bash
+python src/byprot/utils/protein/tokenize_pdb.py \
+    --input_pdb_folder /path/to/input/pdbs \
+    --output_dir /path/to/output/tokenized_protein
+```
+The script processes `*.pdb` files in the input folder and writes:
+- `struct_seq.fasta`: tokenized structure sequences
+- `aa_seq.fasta`: amino-acid sequences extracted from the same structures
+
+The structure sequences can be used as DPLM-2 structure-conditioning inputs.
+For example, pass the generated structure-token FASTA file to
+`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.
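To inspect the two output files together, a small reader along the following lines may help. It is a minimal sketch using only the standard library. It assumes that both files use the same record names and that structure tokens are separated by commas. Neither assumption is stated in this README, so check a generated file first. `read_fasta` and `pair_outputs` are hypothetical helpers, not part of the official codebase.

```python
from pathlib import Path


def read_fasta(path):
    """Return {record_name: sequence}; sequences split across lines are joined."""
    records = {}
    name = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def pair_outputs(output_dir, token_sep=","):
    """Pair each structure-token record with the amino-acid record of the same name."""
    output_dir = Path(output_dir)
    struct_records = read_fasta(output_dir / "struct_seq.fasta")
    aa_records = read_fasta(output_dir / "aa_seq.fasta")
    paired = {}
    for name, struct_seq in struct_records.items():
        if name not in aa_records:
            continue  # skip records without a matching amino-acid sequence
        # Assumed format: tokens separated by `token_sep`; empty pieces are dropped.
        tokens = [tok for tok in struct_seq.split(token_sep) if tok]
        paired[name] = {"aa_seq": aa_records[name], "struct_tokens": tokens}
    return paired
```

Calling `pair_outputs("/path/to/output/tokenized_protein")` returns one entry per structure that appears in both files, which makes it easy to compare the number of structure tokens against the sequence length.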
+## Use with DPLM-2
+DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
+For example:
+```python
+from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
+dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
+struct_tokenizer = dplm2.struct_tokenizer
+```
+The DPLM-2 configs point to this repository with:
+```yaml
+struct_tokenizer:
+  exp_path: airkingbd/struct_tokenizer
+```
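As an illustration of how such a config entry might be consumed, the sketch below reads `struct_tokenizer.exp_path` from an already-parsed config and decides whether it names a local directory or a Hugging Face repository id. The `resolve_exp_path` helper and the local-directory fallback are assumptions made for this example. The official loader may resolve the path differently.

```python
from pathlib import Path


def resolve_exp_path(config):
    """Classify struct_tokenizer.exp_path as a local directory or a Hub repo id."""
    exp_path = config["struct_tokenizer"]["exp_path"]
    # Assumption: an existing directory is used as-is; anything else is
    # treated as a repository id to download from the Hugging Face Hub.
    if Path(exp_path).is_dir():
        return ("local", exp_path)
    return ("hub", exp_path)


# The YAML fragment above, expressed as the nested dict a YAML parser would return.
config = {"struct_tokenizer": {"exp_path": "airkingbd/struct_tokenizer"}}
kind, location = resolve_exp_path(config)
print(kind, location)
```

Under this assumed behavior, pointing `exp_path` at a downloaded copy of the repository would avoid a network fetch at load time.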
+## Citation
+If you use this tokenizer, please cite the DPLM, DPLM-2, and DPLM-2.1 papers:
+```bibtex
+@inproceedings{wang2024dplm,
+  title={Diffusion Language Models Are Versatile Protein Learners},
+  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
+  booktitle={International Conference on Machine Learning},
+  year={2024}
+}
+@inproceedings{wang2025dplm2,
+  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
+  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
+  booktitle={International Conference on Learning Representations},
+  year={2025}
+}
+@inproceedings{hsieh2025dplm2_1,
+  title={Elucidating the Design Space of Multimodal Protein Language Models},
+  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
+  booktitle={International Conference on Machine Learning},
+  year={2025}
+}
+```
+## Acknowledgements
+DPLM builds on prior work and resources, including ByProt, ESM,
+OpenFold-related structure modeling utilities, EigenFold, and MultiFlow.
+See the official repository for complete acknowledgements and implementation
+details.