metadata
license: apache-2.0
library_name: pytorch
tags:
  - biology
  - protein
  - protein-structure
  - protein-structure-tokenizer
  - structure-tokenizer
  - dplm-2
  - pytorch
  - arxiv:2410.13782
  - arxiv:2504.11454
datasets:
  - airkingbd/pdb_swissprot

DPLM-2 Structure Tokenizer

This repository contains the structure tokenizer used by DPLM-2, a multimodal diffusion protein language model for joint protein sequence and structure modeling. The tokenizer converts protein backbone/atom coordinates into discrete structure tokens and can decode structure tokens back into protein structures. DPLM-2 uses these tokens to support sequence-structure co-generation, forward folding, inverse folding, and motif scaffolding.

For the official implementation, installation instructions, DPLM-2 generation scripts, and evaluation utilities, see the bytedance/dplm repository.

Model Details

  • Checkpoint: airkingbd/struct_tokenizer
  • Files: config.yaml, dplm2_struct_tokenizer.ckpt
  • Model class: byprot.models.structok.structok_lfq.VQModel
  • Tokenizer type: LFQ-based discrete protein structure tokenizer
  • Codebook size: 8,192 structure tokens (2^13)
  • Codebook embedding dimension: 13
  • Encoder: GVP-based structure encoder
  • Decoder: ESMFold-style structure decoder with decoder input dimension 128
  • License: Apache-2.0
  • Paper: DPLM-2: A Multimodal Diffusion Protein Language Model
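
The codebook size follows from the embedding dimension: in lookup-free quantization (LFQ), each latent dimension is reduced to a binary value, so 13 dimensions index 2^13 = 8,192 codes without a learned embedding table. The sketch below illustrates that mapping in plain Python. The function names, the threshold at zero, and the bit order are assumptions for illustration; the actual VQModel implementation may use a different convention.

```python
from typing import List, Sequence

CODEBOOK_DIM = 13                   # codebook embedding dimension from the model card
CODEBOOK_SIZE = 2 ** CODEBOOK_DIM   # 8192 structure tokens


def lfq_index(latent: Sequence[float]) -> int:
    """Map a 13-dim latent to a token index by thresholding each dimension at zero."""
    if len(latent) != CODEBOOK_DIM:
        raise ValueError(f"expected {CODEBOOK_DIM} values, got {len(latent)}")
    index = 0
    for bit, value in enumerate(latent):
        if value > 0:  # positive value sets the bit (assumed convention)
            index |= 1 << bit
    return index


def lfq_code(index: int) -> List[float]:
    """Recover the +1/-1 code vector for a token index (inverse of lfq_index)."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError(f"index must be in [0, {CODEBOOK_SIZE})")
    return [1.0 if (index >> bit) & 1 else -1.0 for bit in range(CODEBOOK_DIM)]
```

Because every sign pattern is a valid code, the two functions round-trip for all 8,192 indices, which is why no separate codebook lookup is needed.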

Quick Start

Install the official DPLM codebase and dependencies:

git clone --recursive https://github.com/bytedance/dplm.git
cd dplm

conda create -n dplm python=3.9 pip
conda activate dplm
bash scripts/install.sh

Load the released structure tokenizer:

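from byprot.models.utils import get_struct_tokenizer

# Fetch the checkpoint from Hugging Face and build the tokenizer.
struct_tokenizer = get_struct_tokenizer("airkingbd/struct_tokenizer")
# Move to the GPU (requires CUDA) and switch to inference mode.
struct_tokenizer = struct_tokenizer.cuda().eval()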

The helper downloads this repository from Hugging Face, reads config.yaml, constructs VQModel, and loads dplm2_struct_tokenizer.ckpt.
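
When working from a local copy of this repository, for example on a machine without network access, it can help to confirm that both files listed under Model Details are present before building the model. The sketch below is a plain-Python check. The function name missing_checkpoint_files is hypothetical and not part of byprot, and whether get_struct_tokenizer accepts a local directory is not confirmed here.

```python
from pathlib import Path
from typing import List

# File names taken from the Model Details list.
REQUIRED_FILES = ("config.yaml", "dplm2_struct_tokenizer.ckpt")


def missing_checkpoint_files(repo_dir: str) -> List[str]:
    """Return the required file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means both files were found in the given directory.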

Tokenize PDB Structures

The official repository provides src/byprot/utils/protein/tokenize_pdb.py for converting PDB files into structure-token FASTA files:

python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder /path/to/input/pdbs \
    --output_dir /path/to/output/tokenized_protein

The script processes *.pdb files in the input folder and writes:

  • struct_seq.fasta: tokenized structure sequences
  • aa_seq.fasta: amino-acid sequences extracted from the same structures

The structure sequences can serve as structure-conditioning inputs to DPLM-2. For example, for inverse folding, pass the generated struct_seq.fasta as the --input_fasta_path argument of generate_dplm2.py --task inverse_folding.
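
Downstream code often needs each structure-token record next to its amino-acid sequence. The sketch below reads both output files with the standard library and pairs records by FASTA header. It assumes the two files use matching headers for the same structure, which this README does not state, and it treats each record body as an opaque string because the token encoding inside struct_seq.fasta is not specified here. The helper names read_fasta and pair_outputs are hypothetical and not part of byprot.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def read_fasta(path: Path) -> Dict[str, str]:
    """Parse a FASTA file into {header: record body}, joining wrapped lines."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def pair_outputs(output_dir: str) -> Dict[str, Tuple[str, str]]:
    """Return {header: (structure_tokens, amino_acids)} for headers in both files."""
    out = Path(output_dir)
    struct = read_fasta(out / "struct_seq.fasta")
    aa = read_fasta(out / "aa_seq.fasta")
    # Assumption: the same header identifies the same structure in both files.
    return {name: (struct[name], aa[name]) for name in struct if name in aa}
```

Calling pair_outputs("/path/to/output/tokenized_protein") would return a dictionary keyed by record header; headers that appear in only one file are skipped.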

Use with DPLM-2

DPLM-2 checkpoints load this tokenizer through their struct_tokenizer property. For example:

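from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2

# Load the DPLM-2 checkpoint onto the GPU (requires CUDA) in inference mode.
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda().eval()
# The structure tokenizer is exposed as a property of the loaded model.
struct_tokenizer = dplm2.struct_tokenizer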

The DPLM-2 configs point to this repository with:

struct_tokenizer:
  exp_path: airkingbd/struct_tokenizer

Citation

If you use this tokenizer, please cite the following papers:

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

DPLM builds on prior work and resources, including ByProt, ESM, OpenFold-related structure modeling utilities, EigenFold, and MultiFlow. See the official repository for complete acknowledgements and implementation details.