---
license: mit
library_name: pytorch
tags:
- protein
- structure
- encoder
- pretrained
---

# triprorep-3B

Structure-aware protein encoder, 3B parameters. ELECTRA-style corrective MLM pre-training on 83.6M ATLAS + PDB structures. The encoder reads three per-residue token streams (seq / bb / fa) and outputs a per-residue embedding of dimension `2560` (fp16).

Architecture: `embed_dim=2560`, `encoder_depth=33`, `encoder_heads=40`.

Part of the [k-fold-structure release](https://github.com//k-fold-structure-release).

## Files

- `3B.ckpt`: full Lightning checkpoint.
- `3B_encoder.pt`: encoder-only state dict, loads `strict=True` for inference.
- `config.yaml`: model + data config. The path fields are placeholders, point them at your local data.
- `backbone_tokenizer.pt`, `fullatom_tokenizer.pt`: structure tokenizers (PDB → token IDs). See Acknowledgements.

## Usage

```bash
pip install torch huggingface_hub omegaconf numpy lmdb biotite
git clone https://github.com//k-fold-structure-release.git
cd k-fold-structure-release
```

```python
import sys; sys.path.insert(0, "code/triprorep")
from inference import load_encoder, embed_pdb

encoder = load_encoder("3B", hf_repo="k-fold-structure/triprorep-3B")
features = embed_pdb(encoder, "your_protein.pdb", hf_repo="k-fold-structure/triprorep-3B")
print(features.shape)  # (L, 2560) fp16
```

`embed_pdb` downloads the bundled tokenizers from this repo on first call, then runs PDB → (seq, bb, fa) tokens → encoder. If you already have token IDs (e.g. from `k-fold-structure/repsp-triprorep-tokens`), call `encode(encoder, seq, bb, fa)` directly. For CPU, pass `device="cpu"` to `load_encoder`.

## Acknowledgements

`backbone_tokenizer.pt` (aminoaseed VQ-VAE) is from [StructTokenBench](https://github.com/KatarinaYuan/StructTokenBench).

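## Example: pooling per-residue embeddings

The Usage section states that `embed_pdb` yields a `(L, 2560)` fp16 array of per-residue features. When a single protein-level vector is needed, one common option is to average over residues. The sketch below is illustrative only: `mean_pool` is a hypothetical helper that is not part of this release, the input is assumed to be a NumPy array (if `embed_pdb` returns a torch tensor, convert it first), and the data is a random stand-in rather than real encoder output.

```python
import numpy as np

EMBED_DIM = 2560  # per-residue embedding width stated in this card


def mean_pool(features: np.ndarray) -> np.ndarray:
    """Average per-residue embeddings (L, D) into one (D,) float32 vector.

    Hypothetical helper for illustration; not part of the released code.
    """
    if features.ndim != 2 or features.shape[1] != EMBED_DIM:
        raise ValueError(f"expected shape (L, {EMBED_DIM}), got {features.shape}")
    # Upcast before reducing: accumulating many fp16 values loses precision.
    return features.astype(np.float32).mean(axis=0)


# Stand-in for the output of embed_pdb: random values, NOT real embeddings.
rng = np.random.default_rng(0)
fake_features = rng.standard_normal((120, EMBED_DIM)).astype(np.float16)

protein_vector = mean_pool(fake_features)
print(protein_vector.shape, protein_vector.dtype)  # (2560,) float32
```

Upcasting to float32 before the reduction is a design choice in this sketch, since half precision carries only about three significant decimal digits. Whether mean pooling suits a given downstream task is not something this card specifies.
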
## Citation

```bibtex
@misc{kfoldstructure,
  title  = {K-Fold Structure: Structure-Aware Protein Encoders and a Per-Residue Representation Benchmark},
  author = {},
  year   = {2026},
  url    = {https://huggingface.co/k-fold-structure}
}
```

## License

MIT