---
license: apache-2.0
library_name: pytorch
tags:
- biology
- protein
- protein-structure
- protein-structure-tokenizer
- structure-tokenizer
- dplm-2
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 Structure Tokenizer

This repository contains the structure tokenizer used by DPLM-2, a multimodal diffusion protein language model for joint protein sequence and structure modeling.

The tokenizer converts protein backbone/atom coordinates into discrete structure tokens and can decode structure tokens back into protein structures. DPLM-2 uses these tokens to support sequence-structure co-generation, forward folding, inverse folding, and motif scaffolding.

For the official implementation, installation instructions, DPLM-2 generation scripts, and evaluation utilities, see the [bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Checkpoint:** `airkingbd/struct_tokenizer`
- **Files:** `config.yaml`, `dplm2_struct_tokenizer.ckpt`
- **Model class:** `byprot.models.structok.structok_lfq.VQModel`
- **Tokenizer type:** LFQ-based discrete protein structure tokenizer
- **Codebook size:** 8,192 structure tokens (`2^13`)
- **Codebook embedding dimension:** 13
- **Encoder:** GVP-based structure encoder
- **Decoder:** ESMFold-style structure decoder with decoder input dimension 128
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)

## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the released structure tokenizer:

```python
from byprot.models.utils import get_struct_tokenizer

struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()
```

The helper downloads this repository from Hugging Face, reads `config.yaml`, constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.

## Tokenize PDB Structures

The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for converting PDB files into structure-token FASTA files:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein
```

The script processes `*.pdb` files in the input folder and writes:

- `struct_seq.fasta`: tokenized structure sequences
- `aa_seq.fasta`: amino-acid sequences extracted from the same structures

The structure sequences can be used as DPLM-2 structure-conditioning inputs. For example, pass the generated structure-token FASTA file to `generate_dplm2.py --task inverse_folding --input_fasta_path ...`.

## Use with DPLM-2

DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property. For example:

```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
```

The DPLM-2 configs point to this repository with:

```yaml
struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer
```

## Citation

If you use this tokenizer, please cite the DPLM and DPLM-2 papers:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on and acknowledges prior work and resources including ByProt, ESM, OpenFold-related structure modeling utilities, EigenFold, and MultiFlow. See the official repository for complete acknowledgements and implementation details.
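
## Appendix: LFQ Index Arithmetic

The codebook size (8,192 = `2^13`) and the codebook embedding dimension (13) listed under Model Details fit the usual lookup-free quantization (LFQ) scheme, in which each latent dimension is binarized by its sign and the resulting 13 bits form the token index. The sketch below illustrates that arithmetic only. It is not the code in `byprot.models.structok.structok_lfq`; the function names `lfq_index` and `lfq_code` are hypothetical, and the bit order (dimension 0 as the least significant bit) is an assumption that may differ from the released tokenizer.

```python
from typing import List, Sequence

CODEBOOK_DIM = 13                   # codebook embedding dimension from Model Details
CODEBOOK_SIZE = 2 ** CODEBOOK_DIM   # 8192 structure tokens


def lfq_index(latent: Sequence[float]) -> int:
    """Binarize a latent vector by sign and pack the bits into a token index.

    Assumption: dimension 0 is the least significant bit.
    """
    if len(latent) != CODEBOOK_DIM:
        raise ValueError(f"expected {CODEBOOK_DIM} dimensions, got {len(latent)}")
    index = 0
    for bit, value in enumerate(latent):
        if value > 0:               # positive -> bit set, otherwise bit clear
            index |= 1 << bit
    return index


def lfq_code(index: int) -> List[int]:
    """Recover the {-1, +1} code vector that a token index stands for."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError(f"index must be in [0, {CODEBOOK_SIZE})")
    return [1 if (index >> bit) & 1 else -1 for bit in range(CODEBOOK_DIM)]


if __name__ == "__main__":
    latent = [0.7, -0.2, 1.3] + [-0.5] * 10   # made-up latent, for illustration
    token = lfq_index(latent)
    print(token, lfq_code(token))
```

Because every index maps to a fixed sign pattern, no learned embedding table has to be searched at quantization time, which is why a 13-dimensional latent yields exactly `2^13` tokens.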