---
license: apache-2.0
library_name: pytorch
tags:
- biology
- protein
- protein-structure
- protein-structure-tokenizer
- structure-tokenizer
- dplm-2
- pytorch
- arxiv:2410.13782
- arxiv:2504.11454
datasets:
- airkingbd/pdb_swissprot
---

# DPLM-2 Structure Tokenizer

This repository contains the structure tokenizer used by DPLM-2, a multimodal
diffusion protein language model for joint protein sequence and structure
modeling. The tokenizer converts protein backbone/atom coordinates into
discrete structure tokens and can decode structure tokens back into protein
structures. DPLM-2 uses these tokens to support sequence-structure
co-generation, forward folding, inverse folding, and motif scaffolding.

For the official implementation, installation instructions, DPLM-2 generation
scripts, and evaluation utilities, see the
[bytedance/dplm](https://github.com/bytedance/dplm) repository.

## Model Details

- **Checkpoint:** `airkingbd/struct_tokenizer`
- **Files:** `config.yaml`, `dplm2_struct_tokenizer.ckpt`
- **Model class:** `byprot.models.structok.structok_lfq.VQModel`
- **Tokenizer type:** LFQ-based discrete protein structure tokenizer
- **Codebook size:** 8,192 structure tokens (`2^13`)
- **Codebook embedding dimension:** 13
- **Encoder:** GVP-based structure encoder
- **Decoder:** ESMFold-style structure decoder with decoder input dimension 128
- **License:** Apache-2.0
- **Paper:** [DPLM-2: A Multimodal Diffusion Protein Language Model](https://arxiv.org/abs/2410.13782)

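The codebook size and embedding dimension above are linked: in lookup-free quantization (LFQ), each latent dimension is binarized by sign, so 13 dimensions give `2^13 = 8,192` possible codes. The following is a minimal, framework-free sketch of that generic scheme. The function names and the bit ordering are illustrative assumptions and may not match the internal layout used by `VQModel`.

```python
from typing import List, Sequence

CODE_DIM = 13                   # codebook embedding dimension from this card
CODEBOOK_SIZE = 2 ** CODE_DIM   # 8,192 structure tokens


def lfq_quantize(latent: Sequence[float]) -> List[int]:
    """Binarize each latent dimension by sign: positive -> +1, otherwise -1."""
    return [1 if x > 0 else -1 for x in latent]


def lfq_index(latent: Sequence[float]) -> int:
    """Read the sign pattern as an integer token id.

    Assumption: dimension i contributes 2**i when it is positive.
    The real implementation may order the bits differently.
    """
    if len(latent) != CODE_DIM:
        raise ValueError(f"expected {CODE_DIM} dimensions, got {len(latent)}")
    return sum(1 << i for i, x in enumerate(latent) if x > 0)


def lfq_code(index: int) -> List[int]:
    """Inverse map: token id -> +/-1 code vector of length CODE_DIM."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError(f"index must be in [0, {CODEBOOK_SIZE})")
    return [1 if (index >> i) & 1 else -1 for i in range(CODE_DIM)]
```

Because the code vector is recovered directly from the bits of the index, no learned codebook lookup is needed, which is where LFQ gets its name.
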
## Quick Start

Install the official DPLM codebase and dependencies:

```bash
git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh
```

Load the released structure tokenizer:

```python
from byprot.models.utils import get_struct_tokenizer

struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
struct_tokenizer = struct_tokenizer.cuda().eval()
```

The helper downloads this repository from Hugging Face, reads `config.yaml`,
constructs `VQModel`, and loads `dplm2_struct_tokenizer.ckpt`.

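If loading fails, it can help to confirm that both files named above are actually present in the downloaded copy. The sketch below is a small standard-library check; `missing_tokenizer_files` is a hypothetical helper written for this card, not part of `byprot`, and the directory you pass is whatever local copy of this repository you have.

```python
from pathlib import Path
from typing import List

# File names listed under "Model Details" in this card.
EXPECTED_FILES = ("config.yaml", "dplm2_struct_tokenizer.ckpt")


def missing_tokenizer_files(local_dir: str) -> List[str]:
    """Return the expected files that are absent from a local repo copy."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example (the path is a placeholder):
# missing = missing_tokenizer_files("/path/to/struct_tokenizer")
# if missing:
#     print("Missing files:", ", ".join(missing))
```
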
## Tokenize PDB Structures

The official repository provides `src/byprot/utils/protein/tokenize_pdb.py` for
converting PDB files into structure-token FASTA files:

```bash
python src/byprot/utils/protein/tokenize_pdb.py \
  --input_pdb_folder /path/to/input/pdbs \
  --output_dir /path/to/output/tokenized_protein
```

The script processes `*.pdb` files in the input folder and writes:

- `struct_seq.fasta`: tokenized structure sequences
- `aa_seq.fasta`: amino-acid sequences extracted from the same structures

The structure sequences can be used as DPLM-2 structure-conditioning inputs.
For example, pass the generated structure-token FASTA file to
`generate_dplm2.py --task inverse_folding --input_fasta_path ...`.

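To inspect the two output files side by side, a minimal standard-library sketch like the one below may be enough. It assumes that both FASTA files use the same record id for the same structure, and it treats each structure-token sequence as an opaque string rather than guessing its encoding. `parse_fasta` and `pair_records` are hypothetical helpers, not part of the DPLM codebase.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record_id: sequence}.

    The record id is the first whitespace-delimited word of the header.
    """
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            # Sequences may wrap over several lines; join them.
            records[current] += line
    return records


def pair_records(
    struct_lines: Iterable[str], aa_lines: Iterable[str]
) -> Dict[str, Tuple[str, str]]:
    """Map each id found in both files to (structure_tokens, aa_sequence)."""
    structs = parse_fasta(struct_lines)
    aas = parse_fasta(aa_lines)
    return {rid: (structs[rid], aas[rid]) for rid in structs if rid in aas}


# Example (paths are placeholders):
# out = "/path/to/output/tokenized_protein"
# with open(f"{out}/struct_seq.fasta") as s, open(f"{out}/aa_seq.fasta") as a:
#     pairs = pair_records(s, a)
```

Records that appear in only one of the two files are skipped, which makes mismatches easy to spot by comparing the number of pairs with the number of input structures.
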
## Use with DPLM-2

DPLM-2 checkpoints load this tokenizer through their `struct_tokenizer` property.
For example:

```python
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
struct_tokenizer = dplm2.struct_tokenizer
```

The DPLM-2 configs point to this repository with:

```yaml
struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer
```

## Citation

If you use this tokenizer, please cite the DPLM and DPLM-2 papers, along with
the follow-up study on the design space of multimodal protein language models:

```bibtex
@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}
```

## Acknowledgements

DPLM builds on prior work and resources, including ByProt, ESM,
OpenFold-related structure modeling utilities, EigenFold, and MultiFlow.
See the official repository for complete acknowledgements and implementation
details.