lhallee committed on May 28

Commit fb8a87c (verified) · 1 Parent(s): 5e243b2

Upload folder using huggingface_hub

Files changed (39)

LICENSE +9 -0
README.md +91 -168
__init__.py +12 -0
configuration_esmc.py +89 -0
configuration_esmc_sae.py +77 -0
configuration_esmfold2.py +298 -0
esmfold2_affine3d.py +561 -0
esmfold2_aligner.py +102 -0
esmfold2_atom_indexer.py +16 -0
esmfold2_conformers.py +292 -0
esmfold2_constants.py +563 -0
esmfold2_constants_esm3.py +138 -0
esmfold2_input_builder.py +255 -0
esmfold2_metrics.py +374 -0
esmfold2_misc.py +505 -0
esmfold2_mmcif_parsing.py +470 -0
esmfold2_molecular_complex.py +1226 -0
esmfold2_msa.py +507 -0
esmfold2_msa_filter_sequences.py +83 -0
esmfold2_normalize_coordinates.py +80 -0
esmfold2_output.py +225 -0
esmfold2_paired_msa.py +246 -0
esmfold2_parsing.py +113 -0
esmfold2_predicted_aligned_error.py +105 -0
esmfold2_prepare_input.py +1464 -0
esmfold2_processor.py +356 -0
esmfold2_protein_chain.py +1376 -0
esmfold2_protein_complex.py +1241 -0
esmfold2_protein_structure.py +307 -0
esmfold2_residue_constants.py +1224 -0
esmfold2_sequential_dataclass.py +158 -0
esmfold2_system.py +46 -0
esmfold2_types.py +34 -0
esmfold2_utils_types.py +34 -0
modeling_esmc.py +1667 -0
modeling_esmc_sae.py +363 -0
modeling_esmfold2.py +1288 -0
modeling_esmfold2_common.py +0 -0
protein_utils.py +488 -0

LICENSE ADDED

	@@ -0,0 +1,9 @@

+**License (MIT)**
+Copyright 2026 Chan Zuckerberg Biohub, Inc.
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md CHANGED

@@ -1,199 +1,122 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+tags:
+  - biology
+  - protein-structure
+  - esmfold2
+  - multimodal-protein-model
 ---
+# FastPLMs ESMFold2
+FastPLMs ESMFold2 is a self-contained Hugging Face `AutoModel` wrapper for Biohub's ESMFold2 and ESMFold2-Fast structure predictors. It vendors the released Biohub ESMFold2 model code, ESMC backbone code, input builder, MSA helpers, and structure export utilities needed for remote-code loading.
+## Load With AutoModel
+```python
+import torch
+from transformers import AutoModel
+model = AutoModel.from_pretrained(
+    "Synthyra/ESMFold2-Fast",
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+    device_map="cuda",
+).eval()
+```
+Use `Synthyra/ESMFold2` for the full model and `Synthyra/ESMFold2-Fast` for the faster release variant.
+## Fold One Protein
+```python
+sequence = "MKTLLILAVVAAALA"
+result = model.fold_protein(
+    sequence,
+    num_loops=3,
+    num_sampling_steps=50,
+    num_diffusion_samples=1,
+    seed=0,
+)
+print(float(result.plddt.mean()))
+print(float(result.ptm))
+```
+## Save mmCIF or PDB
+```python
+model.save_as_cif(result, "prediction.cif")
+model.save_as_pdb(result, "prediction.pdb")
+cif_text = model.result_to_cif(result)
+pdb_text = model.result_to_pdb(result)
+```
+`result_to_cif` preserves the full `MolecularComplex`. `result_to_pdb` converts through Biohub's protein-only `ProteinComplex` representation, so use mmCIF for complexes with ligands or nucleic acids.
+## Fold Complexes
+```python
+types = model.input_types
+complex_input = types.StructurePredictionInput(
+    sequences=[
+        types.ProteinInput(id="A", sequence="MKTLLILAVVAAALA"),
+        types.DNAInput(id="B", sequence="GATAGC"),
+        types.LigandInput(id="L", ccd=["SAH"]),
+    ]
+)
+result = model.fold(
+    complex_input,
+    num_loops=3,
+    num_sampling_steps=50,
+    num_diffusion_samples=1,
+    seed=0,
+)
+model.save_as_cif(result, "complex_prediction.cif")
+```
+## Use MSAs
+```python
+types = model.input_types
+msa = types.MSA.from_a3m("query.a3m", max_sequences=128)
+input_with_msa = types.StructurePredictionInput(
+    sequences=[
+        types.ProteinInput(id="A", sequence=msa.query, msa=msa),
+    ]
+)
+result = model.fold(input_with_msa, num_sampling_steps=50, seed=0)
+```
+## Raw Tensor Inference
+```python
+features, chain_infos = model.prepare_structure_input(complex_input, seed=0)
+with torch.inference_mode():
+    output = model(
+        **features,
+        num_loops=3,
+        num_sampling_steps=50,
+        num_diffusion_samples=1,
+    )
+decoded = model.input_builder.decode(output, features, chain_infos)
+```
+Set `load_esmc=False` when loading if you want to provide precomputed `lm_hidden_states` manually or run folding-trunk tests without loading the 6B ESMC backbone:
+```python
+model = AutoModel.from_pretrained(
+    "Synthyra/ESMFold2-Fast",
+    trust_remote_code=True,
+    load_esmc=False,
+).cuda().eval()
+```
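
For readers unfamiliar with the A3M files that the "Use MSAs" snippet hands to `types.MSA.from_a3m`, here is a minimal stdlib-only sketch of the format. It is not the vendored parser: `read_a3m` and its return shape are hypothetical names introduced for illustration, and it assumes the usual A3M convention that lowercase letters mark insertions relative to the query while `-` marks gaps.

```python
import io


def read_a3m(handle, max_sequences=None):
    """Parse A3M text into (header, aligned_sequence) pairs.

    Hypothetical helper for illustration; not the vendored `MSA.from_a3m`.
    """
    entries = []
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))

    # Lowercase letters are insertions relative to the query; dropping them
    # leaves every row with the same number of aligned columns.
    aligned = [
        (name, "".join(c for c in seq if not c.islower()))
        for name, seq in entries
    ]
    return aligned if max_sequences is None else aligned[:max_sequences]


example = io.StringIO(
    ">query\nMKTLLILAVVAAALA\n"
    ">hit_1\nMKT-LILAVVaaAAALA\n"
    ">hit_2\nMRTLLIL-VVAAALA\n"
)
rows = read_a3m(example, max_sequences=128)
query = rows[0][1]
```

After insertion columns are dropped, every row has the same length as the query, which is the property an MSA-aware model input generally relies on.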

__init__.py ADDED

	@@ -0,0 +1,12 @@

+import importlib
+import sys
+from .configuration_esmfold2 import ESMFold2Config
+from .modeling_esmfold2 import ESMFold2Model
+def ensure_vendored_esm() -> None:
+    sys.modules["esm"] = importlib.import_module(f"{__name__}.esm")
+__all__ = ["ESMFold2Config", "ESMFold2Model", "ensure_vendored_esm"]
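
`ensure_vendored_esm` works by registering a module under a second name in `sys.modules`, so that a later `import esm` resolves to the vendored copy. The sketch below shows the same mechanism with stand-in modules. The names `demo_vendor_esm`, `demo_esm_alias`, and `ensure_alias` are hypothetical and chosen so the demo cannot shadow a real installed package.

```python
import importlib
import sys
import types

# Build a stand-in for a vendored module. The names below are hypothetical
# and exist only for this demonstration.
vendored = types.ModuleType("demo_vendor_esm")
vendored.VERSION = "vendored"
sys.modules["demo_vendor_esm"] = vendored


def ensure_alias(alias: str, target: str) -> None:
    """Make `import <alias>` resolve to the already-importable `target`."""
    sys.modules[alias] = importlib.import_module(target)


ensure_alias("demo_esm_alias", "demo_vendor_esm")

import demo_esm_alias  # resolved from sys.modules, no file lookup needed

same_object = demo_esm_alias is vendored
```

The alias is process-wide: once registered, it also takes precedence over any separately installed package of the same name. Note too that the file list for this commit shows no `esm` subpackage, so the real function presumably depends on one already being present in the repository; that is an observation from the listing, not something verified here.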

configuration_esmc.py ADDED

	@@ -0,0 +1,89 @@

+# Copyright 2026 Biohub. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""ESMC model configuration."""
+from transformers.configuration_utils import PretrainedConfig
+class ESMCConfig(PretrainedConfig):
+    """
+    This is the configuration class to store the configuration of a [`ESMCModel`]. It is used to
+    instantiate an ESMC model according to the specified arguments, defining the model architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
+    outputs. Read the documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 64):
+            Vocabulary size of the ESMC model. Defines the number of different amino acid tokens that
+            can be represented by the ``input_ids`` passed to [`ESMCModel`].
+        d_model (`int`, *optional*, defaults to 2560):
+            Dimensionality of the encoder layers and the pooler layer.
+        n_heads (`int`, *optional*, defaults to 40):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        n_layers (`int`, *optional*, defaults to 80):
+            Number of hidden layers in the Transformer encoder.
+        pad_token_id (`int`, *optional*, defaults to 1):
+            Index of the padding token in the vocabulary (``"<pad>"``).
+        mask_token_id (`int`, *optional*, defaults to 32):
+            Index of the mask token in the vocabulary (``"<mask>"``), used for masked language modelling.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated normal initialiser for weight matrix initialisation.
+        classifier_dropout (`float`, *optional*, defaults to 0.1):
+            Dropout ratio for the classification head.
+    Examples:
+    ```python
+    >>> from transformers import ESMCConfig, ESMCModel
+    >>> # Initializing an ESMC EvolutionaryScale/esmc-600m-2024-12 style configuration
+    >>> configuration = ESMCConfig()
+    >>> # Initializing a model (with random weights) from the EvolutionaryScale/esmc-600m-2024-12 style configuration
+    >>> model = ESMCModel(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```
+    """
+    model_type = "esmc"
+    def __init__(
+        self,
+        vocab_size: int = 64,
+        d_model: int = 2560,
+        n_heads: int = 40,
+        n_layers: int = 80,
+        pad_token_id: int = 1,
+        mask_token_id: int = 32,
+        initializer_range: float = 0.02,
+        classifier_dropout: float = 0.1,
+        **kwargs,
+    ):
+        super().__init__(
+            pad_token_id=pad_token_id, mask_token_id=mask_token_id, **kwargs
+        )
+        self.vocab_size = vocab_size
+        self.d_model = d_model
+        self.n_heads = n_heads
+        self.n_layers = n_layers
+        self.initializer_range = initializer_range
+        self.classifier_dropout = classifier_dropout
+        self.tie_word_embeddings = False
+__all__ = ["ESMCConfig"]
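
As a quick sanity check on the `ESMCConfig` defaults, the sketch below derives the per-head width and a rough parameter count. `ESMCShape` is a hypothetical stand-in, not the Hugging Face class, and the `12 * n_layers * d_model**2` figure is a textbook transformer estimate that assumes a 4x feed-forward expansion; the actual ESMC block layout may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ESMCShape:
    """Stand-in holding the ESMCConfig defaults shown above (not the HF class)."""

    vocab_size: int = 64
    d_model: int = 2560
    n_heads: int = 40
    n_layers: int = 80

    @property
    def head_dim(self) -> int:
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

    def approx_block_params(self) -> int:
        # Textbook estimate: 4*d^2 for attention projections plus 8*d^2 for a
        # 4x-expansion feed-forward layer. ASSUMPTION: the real ESMC block
        # layout may differ, so treat this as an order-of-magnitude check only.
        return 12 * self.n_layers * self.d_model**2


shape = ESMCShape()
print(shape.head_dim)  # prints 64
print(f"{shape.approx_block_params() / 1e9:.2f}B")  # prints 6.29B
```

Under that estimate the defaults land near 6.3B parameters, which appears closer to the "6B ESMC backbone" mentioned in the README than to the 600M checkpoint named in the docstring example.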

configuration_esmc_sae.py ADDED

	@@ -0,0 +1,77 @@

+# Copyright 2026 Biohub. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""ESMC sparse autoencoder (SAE) configuration."""
+from dataclasses import dataclass
+from transformers.configuration_utils import PretrainedConfig
+@dataclass
+class ESMCSAEParams:
+    """Parameters for one backbone layer's SAE inside :class:`ESMCSAEModel`.
+    The SAE itself is an internal ``nn.Module``; this dataclass just bundles
+    the handful of fields needed to instantiate one.
+    """
+    d_model: int = 2560
+    codebook_dim: int = 65536
+    k: int = 64
+    layer: int = 0
+class ESMCSAEConfig(PretrainedConfig):
+    """
+    Configuration class for [`ESMCSAEModel`] — a container that holds one
+    SAE per backbone layer for a fixed ``(model, codebook_dim, k)`` group.
+    All SAEs in a container share ``d_model``, ``codebook_dim``, and ``k``;
+    they differ only in the backbone layer they were trained on.
+    ``available_layers`` lists the backbone-layer indices the repo ships;
+    each entry ``i`` is stored on disk as ``layer_{i}.safetensors`` (the
+    filename index *is* the backbone layer, so a single-layer repo for
+    layer 23 stores ``layer_23.safetensors``).
+    Args:
+        d_model (`int`, *optional*, defaults to 2560):
+            Dimensionality of the ESMC hidden states fed into the SAEs.
+        codebook_dim (`int`, *optional*, defaults to 65536):
+            Number of sparse features in each SAE's codebook.
+        k (`int`, *optional*, defaults to 64):
+            Top-k sparsity per SAE.
+        available_layers (`list[int]`, *optional*, defaults to ``[0]``):
+            Which backbone-layer indices the repo ships.
+    """
+    model_type = "esmc_sae"
+    def __init__(
+        self,
+        d_model: int = 2560,
+        codebook_dim: int = 65536,
+        k: int = 64,
+        available_layers: list[int] | None = None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.d_model = d_model
+        self.codebook_dim = codebook_dim
+        self.k = k
+        self.available_layers = (
+            list(available_layers) if available_layers is not None else [0]
+        )
+__all__ = ["ESMCSAEConfig", "ESMCSAEParams"]
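
The `ESMCSAEConfig` docstring describes two conventions: weight files are named after the backbone layer they belong to, and each SAE keeps only the top `k` features. The sketch below illustrates both with plain Python. `SAEGroup` and `top_k_sparsify` are hypothetical names, and the top-k rule shown is a generic one that may differ from the vendored SAE in details such as tie-breaking.

```python
from dataclasses import dataclass, field


@dataclass
class SAEGroup:
    """Stand-in for the fields documented on ESMCSAEConfig (not the HF class)."""

    d_model: int = 2560
    codebook_dim: int = 65536
    k: int = 64
    available_layers: list[int] = field(default_factory=lambda: [0])

    def weight_files(self) -> dict[int, str]:
        # The filename index is the backbone layer itself, not a position
        # in `available_layers`.
        return {i: f"layer_{i}.safetensors" for i in self.available_layers}


def top_k_sparsify(activations: list[float], k: int) -> list[float]:
    """Keep the k largest activations and zero the rest.

    ASSUMPTION: a generic top-k rule for illustration; the vendored SAE may
    differ in details such as tie-breaking or pre-activation handling.
    """
    if k <= 0:
        return [0.0] * len(activations)
    order = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


group = SAEGroup(available_layers=[23, 40])
files = group.weight_files()
sparse = top_k_sparsify([0.1, 2.0, -0.5, 1.5, 0.3], k=2)
```

With the defaults above, each token would activate 64 of 65,536 codebook features, roughly 0.1% of the codebook.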

configuration_esmfold2.py ADDED

	@@ -0,0 +1,298 @@

+# Copyright 2026 Biohub. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""ESMFold2 model configuration."""
+from __future__ import annotations
+from dataclasses import asdict, dataclass, field
+from transformers.configuration_utils import PretrainedConfig
+# ---------------------------------------------------------------------------
+# Nested dataclass configs
+# ---------------------------------------------------------------------------
+_DEFAULT_ESMC_HF_REPO = "biohub/ESMC-6B"
+@dataclass
+class MSAEncoderConfig:
+    """Config for the optional MSA encoder module (Large MSA models only)."""
+    enabled: bool = False
+    d_msa: int = 128
+    d_hidden: int = 32
+    n_layers: int = 4
+    n_heads_msa: int = 8
+    msa_head_width: int = 32
+@dataclass
+class ParcaeConfig:
+    """Release-only config for the parcae diffusion-loop scheduler."""
+    enabled: bool = True
+    poisson_mean: float = 3.0
+    min_steps: int = 1
+    max_steps: int | None = 6
+    coda_n_layers: int = 2
+@dataclass
+class LMEncoderConfig:
+    """Release-only config for the LM-side pair encoder."""
+    enabled: bool = True
+    n_layers: int = 4
+    lm_dropout: float = 0.25
+    per_loop_lm_dropout: bool = True
+@dataclass
+class AtomAttentionConfig:
+    """Config for SWA atom encoder/decoder with 3D RoPE."""
+    d_atom: int = 128
+    d_token: int = 768
+    n_blocks: int = 3
+    n_heads: int = 4
+    swa_window_size: int = 128
+    expansion_ratio: int = 2
+    # 3D RoPE config
+    spatial_rope_base_frequency: float = 20.0
+    n_spatial_rope_pairs_per_axis: int = 2
+    n_uid_rope_pairs: int = 10
+    uid_rope_base_frequency: float = 10000.0
+@dataclass
+class FoldingTrunkConfig:
+    n_layers: int = 24
+    n_heads: int = 8
+    dropout: float = 0.0
+@dataclass
+class InputsEmbedderConfig:
+    d_inputs: int = 451
+    atom_encoder: AtomAttentionConfig = field(default_factory=AtomAttentionConfig)
+    def __post_init__(self):
+        if isinstance(self.atom_encoder, dict):
+            self.atom_encoder = AtomAttentionConfig(**self.atom_encoder)
+@dataclass
+class DiffusionModuleConfig:
+    """Config for the DiffusionModule."""
+    sigma_data: float = 16.0
+    c_atom: int = 128
+    c_token: int = 768
+    c_z: int = 256
+    c_s_inputs: int = 451
+    fourier_dim: int = 256
+    relpos_r_max: int = 32
+    relpos_s_max: int = 2
+    atom_num_blocks: int = 3
+    atom_num_heads: int = 4
+    token_num_blocks: int = 12
+    token_num_heads: int = 16
+    transition_multiplier: int = 2
+@dataclass
+class DiffusionStructureHeadConfig:
+    """Config for the diffusion-based structure prediction head."""
+    diffusion_module: DiffusionModuleConfig = field(
+        default_factory=DiffusionModuleConfig
+    )
+    distogram_bins: int = 128
+    # Training noise: sigma ~ sigma_data * exp(mu + sigma * N(0,1))
+    train_noise_log_mean: float = -1.2
+    train_noise_log_std: float = 1.5
+    # Sampling defaults (ODE)
+    gamma_0: float = 0.605
+    gamma_min: float = 1.107
+    noise_scale: float = 0.0
+    step_scale: float = 1.0
+    # Inference schedule defaults
+    inference_s_max: float = 160.0
+    inference_s_min: float = 4e-4
+    inference_p: float = 8.0
+    inference_num_steps: int = 68
+    def __post_init__(self):
+        if isinstance(self.diffusion_module, dict):
+            self.diffusion_module = DiffusionModuleConfig(**self.diffusion_module)
+@dataclass
+class ConfidenceHeadConfig:
+    enabled: bool = True
+    num_plddt_bins: int = 50
+    num_pde_bins: int = 64
+    num_pae_bins: int = 64
+    min_dist: float = 2.0
+    max_dist: float = 52.0
+    distogram_bins: int = 128
+    folding_trunk: FoldingTrunkConfig = field(
+        default_factory=lambda: FoldingTrunkConfig(n_layers=4)
+    )
+    def __post_init__(self):
+        if isinstance(self.folding_trunk, dict):
+            self.folding_trunk = FoldingTrunkConfig(**self.folding_trunk)
+# ---------------------------------------------------------------------------
+# Top-level config
+# ---------------------------------------------------------------------------
+class ESMFold2Config(PretrainedConfig):
+    """
+    Configuration for the ESMFold2 structure prediction model.
+    Uses SWA atom encoders with 3D RoPE, a diffusion transformer,
+    a folding trunk, and an ESMC 6B PLM backbone.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control
+    the model outputs. Read the documentation from [`PretrainedConfig`] for more
+    information.
+    Args:
+        d_single (`int`, defaults to 384):
+            Dimensionality of single (per-residue) representations.
+        d_pair (`int`, defaults to 256):
+            Dimensionality of pair (residue-residue) representations.
+        n_relative_residx_bins (`int`, defaults to 32):
+            Number of bins for relative residue index encoding.
+        n_relative_chain_bins (`int`, defaults to 2):
+            Number of bins for relative chain encoding.
+        num_loops (`int`, defaults to 10):
+            Number of trunk loops for iterative refinement.
+        num_diffusion_samples (`int`, defaults to 8):
+            Number of parallel structure predictions to generate.
+        lm_dropout (`float`, defaults to 0.0):
+            Dropout probability on LM pair embeddings. When > 0, dropout is
+            applied with ``training=True`` (including at inference) to match
+            the experimental training recipe used by binder design.
+        force_lm_dropout_during_inference (`bool`, defaults to False):
+            When True and ``lm_dropout`` > 0, apply ``lm_dropout`` even after
+            ``model.eval()``. Binder-design loads set this to True.
+        disable_msa_features (`bool`, defaults to False):
+            When True, zero out MSA-derived ``profile`` and ``deletion_mean``
+            before the inputs embedder (experimental medium/large checkpoints).
+        inputs (`InputsEmbedderConfig`):
+            Configuration for the inputs embedder module.
+        folding_trunk (`FoldingTrunkConfig`):
+            Configuration for the folding trunk.
+        structure_head (`DiffusionStructureHeadConfig`):
+            Configuration for the diffusion-based structure prediction head.
+        confidence_head (`ConfidenceHeadConfig`):
+            Configuration for the confidence prediction head.
+    Examples:
+    ```python
+    >>> from transformers import ESMFold2Config, ESMFold2ExperimentalModel
+    >>> # Initializing an ESMFold2 configuration
+    >>> configuration = ESMFold2Config(type="experimental")
+    >>> # Initializing a model (with random weights) from the configuration
+    >>> model = ESMFold2ExperimentalModel(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```
+    """
+    model_type = "esmfold2"
+    has_no_defaults_at_init = True
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.type: str = kwargs.get("type", "release")
+        if self.type not in ("release", "experimental"):
+            raise ValueError(
+                f"ESMFold2Config.type must be 'release' or 'experimental', "
+                f"got {self.type!r}"
+            )
+        # Top-level scalar fields
+        self.d_single: int = kwargs.get("d_single", 384)
+        self.d_pair: int = kwargs.get("d_pair", 256)
+        self.n_relative_residx_bins: int = kwargs.get("n_relative_residx_bins", 32)
+        self.n_relative_chain_bins: int = kwargs.get("n_relative_chain_bins", 2)
+        self.num_loops: int = kwargs.get("num_loops", 10)
+        self.num_diffusion_samples: int = kwargs.get("num_diffusion_samples", 8)
+        # If True, ``profile`` / ``deletion_mean`` are zeroed before the inputs
+        # embedder.
+        self.disable_msa_features: bool = kwargs.get("disable_msa_features", False)
+        self.lm_dropout: float = kwargs.get("lm_dropout", 0.0)
+        self.force_lm_dropout_during_inference: bool = kwargs.get(
+            "force_lm_dropout_during_inference", False
+        )
+        self.lm_d_model: int = kwargs.get("lm_d_model", 2560)
+        self.lm_num_layers: int = kwargs.get("lm_num_layers", 80)
+        # Falls back to ``_DEFAULT_ESMC_HF_REPO`` when unset; shipped HF exports should still name their ESMC backbone.
+        self.esmc_id: str = kwargs.get("esmc_id", _DEFAULT_ESMC_HF_REPO)
+        def _init_nested(cls, val):
+            if isinstance(val, cls):
+                return val
+            if isinstance(val, dict):
+                return cls(**val)
+            return cls()
+        self.inputs = _init_nested(InputsEmbedderConfig, kwargs.get("inputs"))
+        self.folding_trunk = _init_nested(
+            FoldingTrunkConfig, kwargs.get("folding_trunk")
+        )
+        self.structure_head = _init_nested(
+            DiffusionStructureHeadConfig, kwargs.get("structure_head")
+        )
+        self.confidence_head = _init_nested(
+            ConfidenceHeadConfig, kwargs.get("confidence_head")
+        )
+        self.msa_encoder = _init_nested(MSAEncoderConfig, kwargs.get("msa_encoder"))
+        # Release-only modules — ignored when ``type == "experimental"``.
+        self.parcae = _init_nested(ParcaeConfig, kwargs.get("parcae"))
+        self.lm_encoder = _init_nested(LMEncoderConfig, kwargs.get("lm_encoder"))
+        # If True, MSA encoder output replaces the pair stream; if False, it is added.
+        self.msa_encoder_overwrite: bool = bool(
+            kwargs.get("msa_encoder_overwrite", True)
+        )
+    def to_dict(self):
+        output = super().to_dict()
+        output["inputs"] = asdict(self.inputs)
+        output["folding_trunk"] = asdict(self.folding_trunk)
+        output["structure_head"] = asdict(self.structure_head)
+        output["confidence_head"] = asdict(self.confidence_head)
+        output["msa_encoder"] = asdict(self.msa_encoder)
+        output["parcae"] = asdict(self.parcae)
+        output["lm_encoder"] = asdict(self.lm_encoder)
+        return output
+__all__ = ["ESMFold2Config", "MSAEncoderConfig", "ParcaeConfig", "LMEncoderConfig"]
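
The nested sub-configs survive a save/load cycle because `to_dict` flattens them with `asdict`, while `_init_nested` and each `__post_init__` rebuild dataclasses from plain dicts. The sketch below reproduces that round trip using trimmed copies of two dataclasses from this file. `init_nested` mirrors the private `_init_nested` helper, and the trimmed field lists are a simplification for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class FoldingTrunkConfig:
    n_layers: int = 24
    n_heads: int = 8
    dropout: float = 0.0


@dataclass
class ConfidenceHeadConfig:
    enabled: bool = True
    num_plddt_bins: int = 50
    folding_trunk: FoldingTrunkConfig = field(
        default_factory=lambda: FoldingTrunkConfig(n_layers=4)
    )

    def __post_init__(self):
        # asdict() flattens nested dataclasses to dicts, so rebuild them here.
        if isinstance(self.folding_trunk, dict):
            self.folding_trunk = FoldingTrunkConfig(**self.folding_trunk)


def init_nested(cls, val):
    """Accept an instance, a dict of fields, or None (meaning defaults)."""
    if isinstance(val, cls):
        return val
    if isinstance(val, dict):
        return cls(**val)
    return cls()


original = init_nested(ConfidenceHeadConfig, {"folding_trunk": {"n_layers": 6}})
as_json_ready = asdict(original)  # the shape to_dict() stores per sub-config
restored = init_nested(ConfidenceHeadConfig, as_json_ready)
```

Any value that is neither an instance nor a dict falls through to `cls()`, so an unexpected type yields defaults silently instead of raising an error.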

esmfold2_affine3d.py ADDED

	@@ -0,0 +1,561 @@

+from __future__ import annotations
+import typing as T
+from abc import ABC
+from dataclasses import dataclass
+import torch
+from torch.nn import functional as F
+from typing_extensions import Self
+from .esmfold2_misc import fp32_autocast_context
+class Rotation(ABC):
+    @classmethod
+    def identity(cls, shape: tuple[int, ...], **tensor_kwargs) -> Self: ...
+    @classmethod
+    def random(cls, shape: tuple[int, ...], **tensor_kwargs) -> Self: ...
+    def __getitem__(self, idx: T.Any) -> Self: ...
+    @property
+    def tensor(self) -> torch.Tensor:
+        # We claim that this should be a zero-cost abstraction that returns the raw tensor backing this
+        # object. The raw tensor should always have exactly 1 more dim than self.shape, which should be
+        # implemented using reshaping
+        ...
+    @property
+    def shape(self) -> torch.Size:
+        # The "shape" of the rotation, as if it was a torch.tensor object
+        # This means that 1x4 quaternions are treated as size (1,) for example
+        ...
+    def as_matrix(self) -> RotationMatrix: ...
+    def as_quat(self, normalize: bool = False) -> RotationQuat: ...
+    def compose(self, other: Self) -> Self:
+        # To be safe, we force users to explicitly convert between rotation types.
+        ...
+    def convert_compose(self, other: Self) -> Self:
+        # This function will automatically convert between types of rotations
+        ...
+    def apply(self, p: torch.Tensor) -> torch.Tensor:
+        # rotates points by this rotation object
+        ...
+    def invert(self) -> Self: ...
+    @property
+    def dtype(self) -> torch.dtype:
+        return self.tensor.dtype
+    @property
+    def device(self) -> torch.device:
+        return self.tensor.device
+    @property
+    def requires_grad(self) -> bool:
+        return self.tensor.requires_grad
+    @classmethod
+    def _from_tensor(cls, t: torch.Tensor) -> Self:
+        # This function exists to simplify the below functions, esp type signatures
+        # Its implementation is different from Affine3D.from_tensor and does not
+        # autodetect rotation types.
+        return cls(t)  # type: ignore
+    def to(self, **kwargs) -> Self:
+        return self._from_tensor(self.tensor.to(**kwargs))
+    def detach(self, *args, **kwargs) -> Self:
+        return self._from_tensor(self.tensor.detach(**kwargs))
+    def tensor_apply(self, func) -> Self:
+        # Applies a function to the underlying tensor
+        return self._from_tensor(
+            torch.stack([func(x) for x in self.tensor.unbind(dim=-1)], dim=-1)
+        )
+class RotationMatrix(Rotation):
+    def __init__(self, rots: torch.Tensor):
+        if rots.shape[-1] == 9:
+            rots = rots.unflatten(-1, (3, 3))
+        assert rots.shape[-1] == 3
+        assert rots.shape[-2] == 3
+        # Force full precision
+        rots = rots.to(torch.float32)
+        self._rots = rots
+    @classmethod
+    def identity(cls, shape, **tensor_kwargs):
+        rots = torch.eye(3, **tensor_kwargs)
+        rots = rots.view(*[1 for _ in range(len(shape))], 3, 3)
+        rots = rots.expand(*shape, -1, -1)
+        return cls(rots)
+    @classmethod
+    def random(cls, shape, **tensor_kwargs):
+        return RotationQuat.random(shape, **tensor_kwargs).as_matrix()
+    def __getitem__(self, idx: T.Any) -> RotationMatrix:
+        indices = (idx,) if isinstance(idx, int) or idx is None else tuple(idx)
+        return RotationMatrix(self._rots[indices + (slice(None), slice(None))])
+    @property
+    def shape(self) -> torch.Size:
+        return self._rots.shape[:-2]
+    def as_matrix(self) -> RotationMatrix:
+        return self
+    def as_quat(self, normalize: bool = False) -> RotationQuat:
+        m00, m01, m02, m10, m11, m12, m20, m21, m22 = torch.unbind(
+            self._rots.flatten(-2), dim=-1
+        )
+        q_abs = _sqrt_subgradient(
+            torch.stack(
+                [
+                    1.0 + m00 + m11 + m22,
+                    1.0 + m00 - m11 - m22,
+                    1.0 - m00 + m11 - m22,
+                    1.0 - m00 - m11 + m22,
+                ],
+                dim=-1,
+            )
+        )
+        # we produce the desired quaternion multiplied by each of r, i, j, k
+        quat_by_rijk = torch.stack(
+            [
+                x
+                for lst in [
+                    [q_abs[..., 0] ** 2, m21 - m12, m02 - m20, m10 - m01],
+                    [m21 - m12, q_abs[..., 1] ** 2, m10 + m01, m02 + m20],
+                    [m02 - m20, m10 + m01, q_abs[..., 2] ** 2, m12 + m21],
+                    [m10 - m01, m20 + m02, m21 + m12, q_abs[..., 3] ** 2],
+                ]
+                for x in lst
+            ],
+            dim=-1,
+        ).unflatten(-1, (4, 4))
+        # We floor here at 0.1 but the exact level is not important; if q_abs is small,
+        # the candidate won't be picked.
+        flr = torch.tensor(0.1).to(dtype=q_abs.dtype, device=q_abs.device)
+        quat_candidates = quat_by_rijk / (2.0 * q_abs[..., None].max(flr))
+        # Absent numerical error, every quat_candidates[i] is the same quaternion
+        # (up to sign); we pick the best-conditioned one (largest denominator).
+        # We manually implement one_hot so torch.compile works
+        one_hot = torch.zeros_like(q_abs, dtype=torch.bool)
+        one_hot.scatter_(-1, q_abs.argmax(dim=-1, keepdim=True), True)
+        quat = quat_candidates[one_hot, :].reshape(q_abs.shape)
+        return RotationQuat(quat)
+    def compose(self, other: RotationMatrix) -> RotationMatrix:
+        with fp32_autocast_context(self._rots.device.type):
+            return RotationMatrix(self._rots @ other._rots)
+    def convert_compose(self, other: Rotation):
+        return self.compose(other.as_matrix())
+    def apply(self, p: torch.Tensor) -> torch.Tensor:
+        with fp32_autocast_context(self.device.type):
+            if self._rots.shape[-3] == 1:
+                # This is a slight speedup over einsum for batched rotations
+                return p @ self._rots.transpose(-1, -2).squeeze(-3)
+            else:
+                # einsum way faster than bmm!
+                return torch.einsum("...ij,...j", self._rots, p)
+    def invert(self) -> RotationMatrix:
+        return RotationMatrix(self._rots.transpose(-1, -2))
+    @property
+    def tensor(self) -> torch.Tensor:
+        return self._rots.flatten(-2)
+    def to_3x3(self) -> torch.Tensor:
+        return self._rots
+    @staticmethod
+    def from_graham_schmidt(
+        x_axis: torch.Tensor, xy_plane: torch.Tensor, eps: float = 1e-12
+    ) -> RotationMatrix:
+        # A low eps here is necessary for good stability!
+        return RotationMatrix(_graham_schmidt(x_axis, xy_plane, eps))
+class RotationQuat(Rotation):
+    def __init__(self, quats: torch.Tensor, normalized=False):
+        assert quats.shape[-1] == 4
+        self._normalized = normalized
+        # Force float32 as well
+        if normalized:
+            self._quats = F.normalize(quats.to(torch.float32), dim=-1)
+            self._quats = self._quats.where(self._quats[..., :1] >= 0, -self._quats)
+        else:
+            self._quats = quats.to(torch.float32)
+    @classmethod
+    def identity(cls, shape, **tensor_kwargs):
+        q = torch.ones((*shape, 4), **tensor_kwargs)
+        mult = torch.tensor([1, 0, 0, 0], device=q.device)
+        return RotationQuat(q * mult)
+    @classmethod
+    def random(cls, shape, **tensor_kwargs):
+        quat = torch.randn((*shape, 4), **tensor_kwargs)
+        return RotationQuat(quat, normalized=True)
+    def __getitem__(self, idx: T.Any) -> RotationQuat:
+        indices = (idx,) if isinstance(idx, int) or idx is None else tuple(idx)
+        return RotationQuat(self._quats[indices + (slice(None),)])
+    @property
+    def shape(self) -> torch.Size:
+        return self._quats.shape[:-1]
+    def compose(self, other: RotationQuat) -> RotationQuat:
+        with fp32_autocast_context(self._quats.device.type):
+            return RotationQuat(_quat_mult(self._quats, other._quats))
+    def convert_compose(self, other: Rotation):
+        return self.compose(other.as_quat())
+    def as_matrix(self) -> RotationMatrix:
+        q = self.normalized().tensor
+        r, i, j, k = torch.unbind(q, -1)
+        two_s = 2.0 / torch.linalg.norm(q, dim=-1)
+        o = torch.stack(
+            (
+                1 - two_s * (j * j + k * k),
+                two_s * (i * j - k * r),
+                two_s * (i * k + j * r),
+                two_s * (i * j + k * r),
+                1 - two_s * (i * i + k * k),
+                two_s * (j * k - i * r),
+                two_s * (i * k - j * r),
+                two_s * (j * k + i * r),
+                1 - two_s * (i * i + j * j),
+            ),
+            -1,
+        )
+        return RotationMatrix(o.reshape(q.shape[:-1] + (3, 3)))
+    def as_quat(self, normalize: bool = False) -> RotationQuat:
+        return self
+    def apply(self, p: torch.Tensor) -> torch.Tensor:
+        return _quat_rotation(self.normalized()._quats, p)
+    def invert(self) -> RotationQuat:
+        return RotationQuat(_quat_invert(self._quats))
+    @property
+    def tensor(self) -> torch.Tensor:
+        return self._quats
+    def normalized(self) -> RotationQuat:
+        return self if self._normalized else RotationQuat(self._quats, normalized=True)
+@dataclass(frozen=True)
+class Affine3D:
+    trans: torch.Tensor
+    rot: Rotation
+    def __post_init__(self):
+        assert self.trans.shape[:-1] == self.rot.shape
+    @staticmethod
+    def identity(
+        shape_or_affine: T.Union[tuple[int, ...], "Affine3D"],
+        rotation_type: T.Type[Rotation] = RotationMatrix,
+        **tensor_kwargs,
+    ):
+        # Creates a new identity Affine3D object with a specified shape
+        # or the same shape as another Affine3D object.
+        if isinstance(shape_or_affine, Affine3D):
+            kwargs = {"dtype": shape_or_affine.dtype, "device": shape_or_affine.device}
+            kwargs.update(tensor_kwargs)
+            shape = shape_or_affine.shape
+            rotation_type = type(shape_or_affine.rot)
+        else:
+            kwargs = tensor_kwargs
+            shape = shape_or_affine
+        return Affine3D(
+            torch.zeros((*shape, 3), **kwargs), rotation_type.identity(shape, **kwargs)
+        )
+    @staticmethod
+    def random(
+        shape: tuple[int, ...],
+        std: float = 1,
+        rotation_type: T.Type[Rotation] = RotationMatrix,
+        **tensor_kwargs,
+    ) -> "Affine3D":
+        return Affine3D(
+            trans=torch.randn((*shape, 3), **tensor_kwargs).mul(std),
+            rot=rotation_type.random(shape, **tensor_kwargs),
+        )
+    def __getitem__(self, idx: T.Any) -> "Affine3D":
+        indices = (idx,) if isinstance(idx, int) or idx is None else tuple(idx)
+        return Affine3D(trans=self.trans[indices + (slice(None),)], rot=self.rot[idx])
+    @property
+    def shape(self) -> torch.Size:
+        return self.trans.shape[:-1]
+    @property
+    def dtype(self) -> torch.dtype:
+        return self.trans.dtype
+    @property
+    def device(self) -> torch.device:
+        return self.trans.device
+    @property
+    def requires_grad(self) -> bool:
+        return self.trans.requires_grad
+    def to(self, **kwargs) -> "Affine3D":
+        return Affine3D(self.trans.to(**kwargs), self.rot.to(**kwargs))
+    def detach(self, *args, **kwargs) -> "Affine3D":
+        return Affine3D(self.trans.detach(**kwargs), self.rot.detach(**kwargs))
+    def tensor_apply(self, func) -> "Affine3D":
+        # Applies a function to each last-dimension slice of the underlying tensor
+        return self.from_tensor(
+            torch.stack([func(x) for x in self.tensor.unbind(dim=-1)], dim=-1)
+        )
+    def as_matrix(self):
+        return Affine3D(trans=self.trans, rot=self.rot.as_matrix())
+    def as_quat(self, normalize: bool = False):
+        return Affine3D(trans=self.trans, rot=self.rot.as_quat(normalize))
+    def compose(self, other: "Affine3D", autoconvert: bool = False):
+        rot = self.rot
+        new_rot = (rot.convert_compose if autoconvert else rot.compose)(other.rot)
+        new_trans = rot.apply(other.trans) + self.trans
+        return Affine3D(trans=new_trans, rot=new_rot)
+    def compose_rotation(self, other: Rotation, autoconvert: bool = False):
+        return Affine3D(
+            trans=self.trans,
+            rot=(self.rot.convert_compose if autoconvert else self.rot.compose)(other),
+        )
+    def scale(self, v: torch.Tensor | float):
+        return Affine3D(self.trans * v, self.rot)
+    def mask(self, mask: torch.Tensor, with_zero=False):
+        # Returns a transform in which positions where mask is True are replaced
+        # by the identity transform (or by an all-zero tensor if with_zero is set).
+        if with_zero:
+            tensor = self.tensor
+            return Affine3D.from_tensor(
+                torch.zeros_like(tensor).where(mask[..., None], tensor)
+            )
+        else:
+            identity = self.identity(
+                self.shape,
+                rotation_type=type(self.rot),
+                device=self.device,
+                dtype=self.dtype,
+            ).tensor
+            return Affine3D.from_tensor(identity.where(mask[..., None], self.tensor))
+    def apply(self, p: torch.Tensor) -> torch.Tensor:
+        return self.rot.apply(p) + self.trans
+    def invert(self):
+        inv_rot = self.rot.invert()
+        return Affine3D(trans=-inv_rot.apply(self.trans), rot=inv_rot)
+    @property
+    def tensor(self) -> torch.Tensor:
+        return torch.cat([self.rot.tensor, self.trans], dim=-1)
+    @staticmethod
+    def from_tensor(t: torch.Tensor) -> "Affine3D":
+        match t.shape[-1]:
+            case 4:
+                # Assume tensor 4x4 for backward compat with alphafold
+                trans = t[..., :3, 3]
+                rot = RotationMatrix(t[..., :3, :3])
+            case 6:
+                # Assume quaternion representation with real part = 1
+                trans = t[..., -3:]
+                rot = RotationQuat(F.pad(t[..., :3], (1, 0), value=1))
+            case 7:
+                trans = t[..., -3:]
+                rot = RotationQuat(t[..., :4])
+            case 12:
+                trans = t[..., -3:]
+                rot = RotationMatrix(t[..., :-3].unflatten(-1, (3, 3)))
+            case _:
+                raise RuntimeError(
+                    f"Cannot detect rotation format from {t.shape[-1] - 3}-d flat vector"
+                )
+        return Affine3D(trans, rot)
+    @staticmethod
+    def from_tensor_pair(t: torch.Tensor, r: torch.Tensor) -> "Affine3D":
+        return Affine3D(t, RotationMatrix(r))
+    @staticmethod
+    def from_graham_schmidt(
+        neg_x_axis: torch.Tensor,
+        origin: torch.Tensor,
+        xy_plane: torch.Tensor,
+        eps: float = 1e-10,
+    ):
+        # The arguments of this function are ordered for parity with AlphaFold
+        x_axis = origin - neg_x_axis
+        xy_plane = xy_plane - origin
+        return Affine3D(
+            trans=origin, rot=RotationMatrix.from_graham_schmidt(x_axis, xy_plane, eps)
+        )
+    @staticmethod
+    def cat(affines: list["Affine3D"], dim: int = 0):
+        if dim < 0:
+            dim = len(affines[0].shape) + dim
+        return Affine3D.from_tensor(torch.cat([x.tensor for x in affines], dim=dim))
+def _quat_mult(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
+    """
+    Multiply two quaternions.
+    Usual torch rules for broadcasting apply.
+    Args:
+        a: Quaternions as tensor of shape (..., 4), real part first.
+        b: Quaternions as tensor of shape (..., 4), real part first.
+    Returns:
+        The product of a and b, a tensor of quaternions shape (..., 4).
+    """
+    aw, ax, ay, az = torch.unbind(a, -1)
+    bw, bx, by, bz = torch.unbind(b, -1)
+    ow = aw * bw - ax * bx - ay * by - az * bz
+    ox = aw * bx + ax * bw + ay * bz - az * by
+    oy = aw * by - ax * bz + ay * bw + az * bx
+    oz = aw * bz + ax * by - ay * bx + az * bw
+    return torch.stack((ow, ox, oy, oz), -1)
+def _quat_rotation(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
+    """
+    Rotates p by quaternion q. Usual torch rules for broadcasting apply.
+    Args:
+        q: Quaternions as tensor of shape (..., 4), real part first.
+        p: Points as tensor of shape (..., 3)
+    Returns:
+        The rotated version of p, of shape (..., 3)
+    """
+    aw, ax, ay, az = torch.unbind(q, -1)
+    bx, by, bz = torch.unbind(p, -1)
+    # fmt: off
+    ow =         - ax * bx - ay * by - az * bz
+    ox = aw * bx           + ay * bz - az * by
+    oy = aw * by - ax * bz           + az * bx
+    oz = aw * bz + ax * by - ay * bx
+    # fmt: on
+    q_mul_pts = torch.stack((ow, ox, oy, oz), -1)
+    return _quat_mult(q_mul_pts, _quat_invert(q))[..., 1:]
+def _quat_invert(q: torch.Tensor):
+    return q * torch.tensor([1, -1, -1, -1], device=q.device)
+def _sqrt_subgradient(x: torch.Tensor) -> torch.Tensor:
+    # Returns torch.sqrt(torch.max(0, x)) but with a zero subgradient where x is 0.
+    ret = torch.zeros_like(x)
+    positive_mask = x > 0
+    ret[positive_mask] = torch.sqrt(x[positive_mask])
+    return ret
+def _graham_schmidt(x_axis: torch.Tensor, xy_plane: torch.Tensor, eps: float = 1e-12):
+    # A low eps here is necessary for good stability!
+    with fp32_autocast_context(x_axis.device.type):
+        e1 = xy_plane
+        denom = torch.sqrt((x_axis**2).sum(dim=-1, keepdim=True) + eps)
+        x_axis = x_axis / denom
+        dot = (x_axis * e1).sum(dim=-1, keepdim=True)
+        e1 = e1 - x_axis * dot
+        denom = torch.sqrt((e1**2).sum(dim=-1, keepdim=True) + eps)
+        e1 = e1 / denom
+        e2 = torch.cross(x_axis, e1, dim=-1)
+        rots = torch.stack([x_axis, e1, e2], dim=-1)
+        return rots
+def build_affine3d_from_coordinates(
+    coords: torch.Tensor,  # (B, S, 3, 3): backbone atoms N, CA, C per residue.
+) -> tuple[Affine3D, torch.Tensor]:
+    _MAX_SUPPORTED_DISTANCE = 1e6
+    coord_mask = torch.all(
+        torch.all(torch.isfinite(coords) & (coords < _MAX_SUPPORTED_DISTANCE), dim=-1),
+        dim=-1,
+    )
+    def atom3_to_backbone_affine(bb_positions: torch.Tensor) -> Affine3D:
+        N, CA, C = bb_positions.unbind(dim=-2)
+        return Affine3D.from_graham_schmidt(C, CA, N)
+    coords = coords.clone().float()
+    coords[~coord_mask] = 0
+    # NOTE(thayes): If you have already normalized the coordinates, then
+    # the black hole affine translations will be zeros and the rotations will be
+    # the identity.
+    average_per_n_ca_c = coords.masked_fill(~coord_mask[..., None, None], 0).sum(1) / (
+        coord_mask.sum(-1)[..., None, None] + 1e-8
+    )
+    affine_from_average = atom3_to_backbone_affine(
+        average_per_n_ca_c.float()
+    ).as_matrix()
+    B, S, _, _ = coords.shape
+    assert isinstance(B, int)
+    assert isinstance(S, int)
+    affine_rot_mats = affine_from_average.rot.tensor[..., None, :].expand(B, S, 9)
+    affine_trans = affine_from_average.trans[..., None, :].expand(B, S, 3)
+    # We use the identity rotation wherever we have no coordinates. This is
+    # important because otherwise the rotation matrices will be all zeros, which
+    # will cause collapse in the distance/direction attention mechanism.
+    identity_rot = RotationMatrix.identity(
+        (B, S), dtype=torch.float32, device=coords.device, requires_grad=False
+    )
+    affine_rot_mats = affine_rot_mats.where(
+        coord_mask.any(-1)[..., None, None], identity_rot.tensor
+    )
+    black_hole_affine = Affine3D(affine_trans, RotationMatrix(affine_rot_mats))
+    affine = atom3_to_backbone_affine(coords.float())
+    affine = Affine3D.from_tensor(
+        affine.tensor.where(coord_mask[..., None], black_hole_affine.tensor)
+    )
+    return affine, coord_mask

esmfold2_aligner.py ADDED

	@@ -0,0 +1,102 @@

+from __future__ import annotations
+from dataclasses import Field, replace
+from typing import Any, ClassVar, Protocol, TypeVar
+import numpy as np
+import torch
+from .esmfold2_protein_structure import compute_affine_and_rmsd
+class Alignable(Protocol):
+    # Trick to detect whether an object is a dataclass
+    __dataclass_fields__: ClassVar[dict[str, Field[Any]]]
+    @property
+    def atom37_positions(self) -> np.ndarray:  # type: ignore
+        pass
+    @property
+    def atom37_mask(self) -> np.ndarray:  # type: ignore
+        pass
+    def __len__(self) -> int: ...
+T = TypeVar("T", bound=Alignable)
+class Aligner:
+    def __init__(
+        self,
+        mobile: Alignable,
+        target: Alignable,
+        only_use_backbone: bool = False,
+        use_reflection: bool = False,
+    ):
+        """
+        Aligns a mobile protein chain against a target protein chain.
+        Args:
+            mobile (ProteinChain): Protein chain to be aligned.
+            target (ProteinChain): Protein chain target.
+            only_use_backbone (bool): Whether to only use backbone atoms.
+            use_reflection (bool): Whether to align to target reflection.
+        """
+        # Both proteins must have the same number of residues
+        assert len(mobile) == len(target)
+        # Determine overlapping atoms
+        joint_atom37_mask = mobile.atom37_mask.astype(bool) & target.atom37_mask.astype(
+            bool
+        )
+        # Backbone atoms (N, CA, C) are the first three sites in the atom37 representation
+        if only_use_backbone:
+            joint_atom37_mask[:, 3:] = False
+        # Extract matching atom positions and convert to batched tensors
+        mobile_atom_tensor = (
+            torch.from_numpy(mobile.atom37_positions).type(torch.double).unsqueeze(0)
+        )
+        target_atom_tensor = (
+            torch.from_numpy(target.atom37_positions).type(torch.double).unsqueeze(0)
+        )
+        joint_atom37_mask = (
+            torch.from_numpy(joint_atom37_mask).type(torch.bool).unsqueeze(0)
+        )
+        # If using reflection, flip the target
+        if use_reflection:
+            target_atom_tensor = -target_atom_tensor
+        # Compute alignment and rmsd
+        affine3D, rmsd = compute_affine_and_rmsd(
+            mobile_atom_tensor, target_atom_tensor, atom_exists_mask=joint_atom37_mask
+        )
+        self._affine3D = affine3D
+        self._rmsd = rmsd.item()
+    @property
+    def rmsd(self):
+        return self._rmsd
+    def apply(self, mobile: T) -> T:
+        """Apply alignment to a protein chain"""
+        # Extract atom positions and convert to batched tensors
+        mobile_atom_tensor = (
+            torch.from_numpy(mobile.atom37_positions[mobile.atom37_mask])
+            .type(torch.float32)
+            .unsqueeze(0)
+        )
+        # Transform atom arrays
+        aligned_atom_tensor = self._affine3D.apply(mobile_atom_tensor).squeeze(0)
+        # Rebuild atom37 positions
+        aligned_atom37_positions = np.full_like(mobile.atom37_positions, np.nan)
+        aligned_atom37_positions[mobile.atom37_mask] = aligned_atom_tensor
+        return replace(mobile, atom37_positions=aligned_atom37_positions)
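
The atom selection in `Aligner.__init__` (joint mask, optional backbone-only restriction) can be illustrated in isolation. The sketch below is a pure-Python stand-in under stated assumptions: `masked_rmsd` is a hypothetical helper, inputs are nested lists indexed as `[residue][atom]`, and no superposition is performed. The real class delegates the fit to `compute_affine_and_rmsd`, which is not shown in this diff.

```python
import math


def masked_rmsd(mobile_xyz, target_xyz, mobile_mask, target_mask,
                only_use_backbone=False, n_backbone=3):
    """RMSD over atoms present in BOTH structures, without any alignment.

    Hypothetical helper: only atoms whose mask is set in mobile and target
    contribute, and with only_use_backbone only the first n_backbone atom
    slots of each residue are considered.
    """
    if len(mobile_xyz) != len(target_xyz):
        raise ValueError("structures must have the same number of residues")
    sq_sum, count = 0.0, 0
    for r in range(len(mobile_xyz)):
        for a in range(len(mobile_xyz[r])):
            if only_use_backbone and a >= n_backbone:
                continue
            if not (mobile_mask[r][a] and target_mask[r][a]):
                continue
            sq_sum += sum(
                (p - q) ** 2 for p, q in zip(mobile_xyz[r][a], target_xyz[r][a])
            )
            count += 1
    if count == 0:
        raise ValueError("no jointly resolved atoms")
    return math.sqrt(sq_sum / count)


# One residue with four atom slots; the fourth differs strongly
mobile = [[(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]]
target = [[(0, 0, 1), (1, 0, 1), (2, 0, 1), (3, 0, 5)]]
all_present = [[True, True, True, True]]
fourth_missing = [[True, True, True, False]]

rmsd_all = masked_rmsd(mobile, target, all_present, all_present)
rmsd_backbone = masked_rmsd(mobile, target, all_present, all_present,
                            only_use_backbone=True)
rmsd_joint = masked_rmsd(mobile, target, all_present, fourth_missing)
```

Restricting to the backbone and masking out the fourth atom give the same value here, because both remove the one outlying atom from the sum.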

esmfold2_atom_indexer.py ADDED

	@@ -0,0 +1,16 @@

+import numpy as np
+from .esmfold2_protein_structure import index_by_atom_name
+class AtomIndexer:
+    def __init__(self, structure, property: str, dim: int):
+        self.structure = structure
+        self.property = property
+        self.dim = dim
+    def __getitem__(self, atom_names: str | list[str]) -> np.ndarray:
+        return index_by_atom_name(
+            getattr(self.structure, self.property), atom_names, self.dim
+        )
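
`AtomIndexer` is a thin wrapper that turns `obj[atom_names]` into a call on a named attribute of the wrapped structure. A toy version of the pattern is sketched below. Everything in it is hypothetical: `index_by_atom_name_sketch` is a simplified stand-in for the real `index_by_atom_name` (which lives in `esmfold2_protein_structure` and is not shown here), `ATOM_ORDER` is a shortened made-up atom list, and only `dim=1` on nested lists is supported.

```python
# Hypothetical, shortened atom ordering used only for this illustration
ATOM_ORDER = ["N", "CA", "C", "O"]


def index_by_atom_name_sketch(rows, atom_names, dim):
    """Select atom columns from [residue][atom] nested lists (stand-in only)."""
    if dim != 1:
        raise NotImplementedError("this sketch only handles the atom axis at dim=1")
    if isinstance(atom_names, str):
        i = ATOM_ORDER.index(atom_names)
        return [row[i] for row in rows]
    idx = [ATOM_ORDER.index(name) for name in atom_names]
    return [[row[i] for i in idx] for row in rows]


class AtomIndexerSketch:
    """Same shape as AtomIndexer: store (structure, property, dim), index lazily."""

    def __init__(self, structure, property, dim):
        self.structure = structure
        self.property = property
        self.dim = dim

    def __getitem__(self, atom_names):
        # The attribute is read at lookup time, so later changes are visible
        values = getattr(self.structure, self.property)
        return index_by_atom_name_sketch(values, atom_names, self.dim)


class ToyStructure:
    def __init__(self, atom_mask):
        self.atom_mask = atom_mask

    @property
    def mask_by_atom(self):
        return AtomIndexerSketch(self, "atom_mask", dim=1)


toy = ToyStructure([[1, 1, 1, 0], [1, 1, 0, 0]])
ca_mask = toy.mask_by_atom["CA"]
n_and_o_mask = toy.mask_by_atom[["N", "O"]]
```

Because the indexer keeps a reference to the structure and reads the attribute inside `__getitem__`, it never holds a stale copy of the underlying array.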

esmfold2_conformers.py ADDED

	@@ -0,0 +1,292 @@

+"""CCD conformer loading utilities.
+Loads idealized conformer coordinates from a CCD pickle file containing RDKit molecules.
+Conformer priority follows AF3 Section 2.8: Computed > Ideal > first available.
+"""
+from __future__ import annotations
+import os
+import pickle
+from pathlib import Path
+import numpy as np
+from huggingface_hub import hf_hub_download
+from .esmfold2_constants import RES_TYPE_TO_CCD
+if os.environ.get("ESMCFOLD_CCD_PATH"):
+    CCD_PICKLE_PATH = Path(os.environ["ESMCFOLD_CCD_PATH"])
+else:
+    CCD_PICKLE_PATH = None
+# Lazily loaded CCD dictionary
+_CCD_MOLECULES: dict | None = None
+# Caches
+_CCD_CONFORMERS: dict[str, dict[str, np.ndarray]] = {}
+_CCD_ATOM_CACHE: dict[str, list[tuple[str, str, int]]] = {}
+_CCD_BONDS_CACHE: dict[str, list[tuple[str, str]]] = {}
+_CCD_LEAVING_ATOMS_CACHE: dict[str, set[str]] = {}
+_IDEALIZED_POS_CACHE: dict[tuple[int, str], np.ndarray | None] = {}
+_LIGAND_IDEALIZED_POS_CACHE: dict[tuple[str, str], np.ndarray | None] = {}
+def load_ccd(cache_dir: Path | str | None = None) -> dict:
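+    """Load CCD molecules from a pickle file, downloading it if needed.
+    The path is resolved in this order: the ESMCFOLD_CCD_PATH environment
+    variable (if set and the file exists), then ``cache_dir / "ccd.pkl"`` (the
+    file must already be there; nothing is downloaded into cache_dir), then a
+    download from the Hugging Face Hub via hf_hub_download.
+    Args:
+        cache_dir: Directory expected to contain ccd.pkl. If None and
+            ESMCFOLD_CCD_PATH is not usable, the file is fetched into the
+            Hugging Face Hub cache.
+    """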
+    global _CCD_MOLECULES
+    if _CCD_MOLECULES is not None:
+        return _CCD_MOLECULES
+    # Determine pickle path
+    if CCD_PICKLE_PATH is not None and CCD_PICKLE_PATH.exists():
+        pkl_path = CCD_PICKLE_PATH
+    elif cache_dir is not None:
+        cache_dir = Path(cache_dir)
+        cache_dir.mkdir(parents=True, exist_ok=True)
+        pkl_path = cache_dir / "ccd.pkl"
+    else:
+        try:
+            pkl_path = Path(
+                hf_hub_download(repo_id="biohub/ESMFold2", filename="ccd.pkl")
+            )
+        except Exception as e:
+            raise FileNotFoundError(
+                f"Failed to download CCD pickle file from Hugging Face repository: {e}"
+            ) from e
+    if not pkl_path.exists():
+        raise FileNotFoundError(
+            f"CCD pickle file not found: {pkl_path}. Please set the ESMCFOLD_CCD_PATH environment variable to the path of a valid CCD pickle file or download the file from the Hugging Face repository."
+        )
+    print(f"Loading CCD dictionary from {pkl_path}")
+    with open(pkl_path, "rb") as f:
+        _CCD_MOLECULES = pickle.load(f)
+    if _CCD_MOLECULES is None:
+        _CCD_MOLECULES = {}
+    return _CCD_MOLECULES
+def _get_ccd_molecules() -> dict:
+    """Get CCD molecules, loading lazily on first call."""
+    global _CCD_MOLECULES
+    if _CCD_MOLECULES is None:
+        return load_ccd()
+    return _CCD_MOLECULES
+def _get_ccd_mol_with_significant_h(comp_id: str):
+    """Get CCD molecule with only chemically significant hydrogens.
+    Returns (mol, conformer) tuple or (None, None) if not available.
+    """
+    ccd = _get_ccd_molecules()
+    if comp_id not in ccd:
+        return None, None
+    mol = ccd[comp_id]
+    if mol.GetNumConformers() == 0:
+        return None, None
+    # Find the "Computed" conformer (RDKit ETKDGv3), fall back to "Ideal"
+    conf_idx = 0
+    for i, c in enumerate(mol.GetConformers()):
+        props = c.GetPropsAsDict()
+        if props.get("name") == "Computed":
+            conf_idx = i
+            break
+    else:
+        for i, c in enumerate(mol.GetConformers()):
+            props = c.GetPropsAsDict()
+            if props.get("name") == "Ideal":
+                conf_idx = i
+                break
+    from rdkit import Chem
+    mol_no_h = Chem.RemoveHs(mol, sanitize=False)
+    if mol_no_h.GetNumConformers() == 0:
+        return None, None
+    return mol_no_h, mol_no_h.GetConformer(
+        min(conf_idx, mol_no_h.GetNumConformers() - 1)
+    )
+def get_ccd_conformer(comp_id: str) -> dict[str, np.ndarray] | None:
+    """Get idealized conformer as dict of atom_name -> position [3].
+    Conformer priority: Computed > Ideal > first available.
+    """
+    if comp_id in _CCD_CONFORMERS:
+        cached = _CCD_CONFORMERS[comp_id]
+        return cached if cached else None
+    mol, conf = _get_ccd_mol_with_significant_h(comp_id)
+    if mol is None or conf is None:
+        _CCD_CONFORMERS[comp_id] = {}
+        return None
+    conformer: dict[str, np.ndarray] = {}
+    for atom in mol.GetAtoms():
+        props = atom.GetPropsAsDict()
+        atom_name = props.get("name")
+        if not isinstance(atom_name, str) or not atom_name:
+            continue
+        idx = atom.GetIdx()
+        pos = conf.GetAtomPosition(idx)
+        conformer[atom_name] = np.array([pos.x, pos.y, pos.z], dtype=np.float32)
+    _CCD_CONFORMERS[comp_id] = conformer
+    return conformer if conformer else None
+def get_idealized_atom_pos(res_type: int, atom_name: str) -> np.ndarray | None:
+    """Get idealized position for a standard residue atom.
+    Uses res_type index to look up CCD component, then returns position.
+    Returns None if not found.
+    """
+    cache_key = (res_type, atom_name)
+    if cache_key in _IDEALIZED_POS_CACHE:
+        return _IDEALIZED_POS_CACHE[cache_key]
+    comp_id = RES_TYPE_TO_CCD.get(res_type)
+    if comp_id:
+        ccd_conformer = get_ccd_conformer(comp_id)
+        if ccd_conformer and atom_name in ccd_conformer:
+            pos = ccd_conformer[atom_name]
+            _IDEALIZED_POS_CACHE[cache_key] = pos
+            return pos
+    _IDEALIZED_POS_CACHE[cache_key] = None
+    return None
+def get_ligand_idealized_atom_pos(res_name: str, atom_name: str) -> np.ndarray | None:
+    """Get idealized position for a ligand/modified residue atom.
+    Returns None if not found.
+    """
+    cache_key = (res_name, atom_name)
+    if cache_key in _LIGAND_IDEALIZED_POS_CACHE:
+        return _LIGAND_IDEALIZED_POS_CACHE[cache_key]
+    ccd_conformer = get_ccd_conformer(res_name)
+    if ccd_conformer and atom_name in ccd_conformer:
+        pos = ccd_conformer[atom_name]
+        _LIGAND_IDEALIZED_POS_CACHE[cache_key] = pos
+        return pos
+    _LIGAND_IDEALIZED_POS_CACHE[cache_key] = None
+    return None
+def get_ligand_ccd_atoms_with_charges(
+    comp_id: str,
+) -> list[tuple[str, str, int]] | None:
+    """Get list of (atom_name, element, charge) for a CCD component.
+    Uses RDKit RemoveHs(sanitize=False) to keep chemically significant hydrogens.
+    Returns None if CCD data not available.
+    """
+    if comp_id in _CCD_ATOM_CACHE:
+        cached = _CCD_ATOM_CACHE[comp_id]
+        return cached if cached else None
+    mol, _ = _get_ccd_mol_with_significant_h(comp_id)
+    if mol is None:
+        _CCD_ATOM_CACHE[comp_id] = []
+        return None
+    atoms: list[tuple[str, str, int]] = []
+    for atom in mol.GetAtoms():
+        props = atom.GetPropsAsDict()
+        atom_name = props.get("name")
+        if not isinstance(atom_name, str) or not atom_name:
+            continue
+        element = atom.GetSymbol()
+        charge = atom.GetFormalCharge()
+        atoms.append((atom_name, element, charge))
+    _CCD_ATOM_CACHE[comp_id] = atoms
+    return atoms if atoms else None
+def get_ligand_ccd_bonds(comp_id: str) -> list[tuple[str, str]] | None:
+    """Get list of (atom1_name, atom2_name) bonds for a CCD component.
+    Returns None if CCD data not available.
+    """
+    if comp_id in _CCD_BONDS_CACHE:
+        cached = _CCD_BONDS_CACHE[comp_id]
+        return cached if cached else None
+    mol, _ = _get_ccd_mol_with_significant_h(comp_id)
+    if mol is None:
+        _CCD_BONDS_CACHE[comp_id] = []
+        return None
+    # Get included atom names
+    included_atoms = set()
+    for atom in mol.GetAtoms():
+        props = atom.GetPropsAsDict()
+        atom_name = props.get("name")
+        if isinstance(atom_name, str) and atom_name:
+            included_atoms.add(atom_name)
+    bonds: list[tuple[str, str]] = []
+    for bond in mol.GetBonds():
+        a1 = bond.GetBeginAtom()
+        a2 = bond.GetEndAtom()
+        n1 = a1.GetPropsAsDict().get("name")
+        n2 = a2.GetPropsAsDict().get("name")
+        if (
+            isinstance(n1, str)
+            and isinstance(n2, str)
+            and n1
+            and n2
+            and n1 in included_atoms
+            and n2 in included_atoms
+        ):
+            bonds.append((n1, n2))
+    _CCD_BONDS_CACHE[comp_id] = bonds
+    return bonds if bonds else None
+def get_ccd_leaving_atoms(comp_id: str) -> set[str]:
+    """Get set of atom names marked as leaving atoms in CCD.
+    Leaving atoms are removed during polymerization (e.g., OP3 in nucleotides).
+    """
+    if comp_id in _CCD_LEAVING_ATOMS_CACHE:
+        return _CCD_LEAVING_ATOMS_CACHE[comp_id]
+    ccd = _get_ccd_molecules()
+    if comp_id not in ccd:
+        _CCD_LEAVING_ATOMS_CACHE[comp_id] = set()
+        return set()
+    mol = ccd[comp_id]
+    leaving_atoms = set()
+    for atom in mol.GetAtoms():
+        if atom.HasProp("leaving_atom"):
+            if atom.GetProp("leaving_atom") == "1":
+                name = atom.GetProp("name") if atom.HasProp("name") else ""
+                if name:
+                    leaving_atoms.add(name)
+    _CCD_LEAVING_ATOMS_CACHE[comp_id] = leaving_atoms
+    return leaving_atoms
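
The conformer priority stated in the module docstring (Computed > Ideal > first available) can be shown without RDKit. This is a minimal sketch on plain lists of conformer names; `pick_conformer_index` is a hypothetical helper, whereas the real code reads each name from the RDKit conformer's properties inside `_get_ccd_mol_with_significant_h`.

```python
def pick_conformer_index(conformer_names):
    """Return the index of the preferred conformer, or None if there are none.

    Priority: the first "Computed" conformer, else the first "Ideal" one,
    else index 0 (the first available conformer).
    """
    if not conformer_names:
        return None
    for wanted in ("Computed", "Ideal"):
        for i, name in enumerate(conformer_names):
            if name == wanted:
                return i
    return 0


computed_wins = pick_conformer_index(["Ideal", "Computed"])
ideal_fallback = pick_conformer_index(["Model", "Ideal"])
first_fallback = pick_conformer_index(["Model", None])
no_conformers = pick_conformer_index([])
```

Looping over the wanted names in priority order expresses the same fallback chain that the original writes as a `for ... else` with a default index of 0.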

esmfold2_constants.py ADDED

	@@ -0,0 +1,563 @@

+"""Constants for the ESMFold2 input pipeline.
+Includes molecule types, residue types, vocabularies, atom lists, and element data.
+"""
+# =============================================================================
+# Molecule types
+# =============================================================================
+MOL_TYPE_PROTEIN = 0
+MOL_TYPE_DNA = 1
+MOL_TYPE_RNA = 2
+MOL_TYPE_NONPOLYMER = 3
+# =============================================================================
+# Residue type indices
+# =============================================================================
+# Standard amino acids (indices 2-21), MSE mapped to MET
+PROTEIN_RESIDUE_TO_RES_TYPE = {
+    "ALA": 2,
+    "ARG": 3,
+    "ASN": 4,
+    "ASP": 5,
+    "CYS": 6,
+    "GLN": 7,
+    "GLU": 8,
+    "GLY": 9,
+    "HIS": 10,
+    "ILE": 11,
+    "LEU": 12,
+    "LYS": 13,
+    "MET": 14,
+    "PHE": 15,
+    "PRO": 16,
+    "SER": 17,
+    "THR": 18,
+    "TRP": 19,
+    "TYR": 20,
+    "VAL": 21,
+    "MSE": 14,  # Selenomethionine -> MET
+}
+PROTEIN_UNK_RES_TYPE = 22
+# RNA nucleotides (indices 23-26, unknown=27)
+RNA_RESIDUE_TO_RES_TYPE = {"A": 23, "G": 24, "C": 25, "U": 26}
+RNA_UNK_RES_TYPE = 27
+# DNA nucleotides (indices 28-31, unknown=32)
+DNA_RESIDUE_TO_RES_TYPE = {"DA": 28, "DG": 29, "DC": 30, "DT": 31}
+DNA_UNK_RES_TYPE = 32
+GAP_RES_TYPE = 32
+# =============================================================================
+# Vocabularies
+# =============================================================================
+# 3-letter to 1-letter codes for proteins
+PROTEIN_3TO1 = {
+    "ALA": "A",
+    "ARG": "R",
+    "ASN": "N",
+    "ASP": "D",
+    "CYS": "C",
+    "GLN": "Q",
+    "GLU": "E",
+    "GLY": "G",
+    "HIS": "H",
+    "ILE": "I",
+    "LEU": "L",
+    "LYS": "K",
+    "MET": "M",
+    "PHE": "F",
+    "PRO": "P",
+    "SER": "S",
+    "THR": "T",
+    "TRP": "W",
+    "TYR": "Y",
+    "VAL": "V",
+    "MSE": "M",
+}
+# 1-letter to 3-letter codes
+PROTEIN_1TO3 = {v: k for k, v in PROTEIN_3TO1.items() if k != "MSE"}
+PROTEIN_1TO3["X"] = "UNK"
+# DNA 1-letter to CCD code
+DNA_1TO3 = {"A": "DA", "T": "DT", "C": "DC", "G": "DG"}
+# RNA 1-letter to CCD code
+RNA_1TO3 = {"A": "A", "U": "U", "C": "C", "G": "G"}
+# ESM-2 input_ids vocabulary for proteins
+ESM_PROTEIN_VOCAB = {
+    "L": 4,
+    "A": 5,
+    "G": 6,
+    "V": 7,
+    "S": 8,
+    "E": 9,
+    "R": 10,
+    "T": 11,
+    "I": 12,
+    "D": 13,
+    "P": 14,
+    "K": 15,
+    "Q": 16,
+    "N": 17,
+    "F": 18,
+    "Y": 19,
+    "M": 20,
+    "H": 21,
+    "W": 22,
+    "C": 23,
+    "X": 3,  # Unknown
+}
+# For DNA/RNA/ligands
+DNA_RNA_LIGAND_INPUT_ID = 24
+# MSA tokens
+MSA_PAD_TOKEN_ID = 0
+MSA_GAP_TOKEN_ID = 1  # Gap/insertion token for MSA
+# res_type int -> CCD component ID (for conformer lookup)
+RES_TYPE_TO_CCD = {
+    # Proteins (2-22)
+    2: "ALA",
+    3: "ARG",
+    4: "ASN",
+    5: "ASP",
+    6: "CYS",
+    7: "GLN",
+    8: "GLU",
+    9: "GLY",
+    10: "HIS",
+    11: "ILE",
+    12: "LEU",
+    13: "LYS",
+    14: "MET",
+    15: "PHE",
+    16: "PRO",
+    17: "SER",
+    18: "THR",
+    19: "TRP",
+    20: "TYR",
+    21: "VAL",
+    22: "UNK",
+    # RNA (23-27)
+    23: "A",
+    24: "G",
+    25: "C",
+    26: "U",
+    27: "N",
+    # DNA (28-32)
+    28: "DA",
+    29: "DG",
+    30: "DC",
+    31: "DT",
+    32: "DN",
+}
+# =============================================================================
+# Charged atoms at physiological pH
+# =============================================================================
+CHARGED_ATOMS: dict[tuple[str, str], int] = {
+    ("LYS", "NZ"): 1,
+    ("ARG", "NH2"): 1,
+    ("HIS", "ND1"): 1,
+    ("PO4", "O2"): -1,
+    ("PO4", "O3"): -1,
+    ("PO4", "O4"): -1,
+    ("SO4", "O3"): -1,
+    ("SO4", "O4"): -1,
+    ("MG", "MG"): 2,
+    ("ZN", "ZN"): 2,
+    ("CA", "CA"): 2,
+    ("FE2", "FE"): 2,
+    ("MN", "MN"): 2,
+    ("CO", "CO"): 2,
+    ("NCO", "CO"): 3,
+    ("CU", "CU"): 2,
+    ("NI", "NI"): 2,
+    ("K", "K"): 1,
+    ("NA", "NA"): 1,
+    ("CD", "CD"): 2,
+    ("CL", "CL"): -1,
+    ("ACT", "OXT"): -1,
+    ("NAD", "O2N"): -1,
+    ("NAD", "N1N"): 1,
+    ("NAP", "O2N"): -1,
+    ("NAP", "N1N"): 1,
+    ("IMD", "N3"): 1,
+    ("SAM", "SD"): 1,
+    ("FE", "FE"): 3,
+    ("A1BH3", "N3"): 1,
+}
+# =============================================================================
+# Element atomic numbers (Z=1 to 92)
+# =============================================================================
+ELEMENT_TO_ATOMIC_NUM = {
+    "H": 1,
+    "HE": 2,
+    "LI": 3,
+    "BE": 4,
+    "B": 5,
+    "C": 6,
+    "N": 7,
+    "O": 8,
+    "F": 9,
+    "NE": 10,
+    "NA": 11,
+    "MG": 12,
+    "AL": 13,
+    "SI": 14,
+    "P": 15,
+    "S": 16,
+    "CL": 17,
+    "AR": 18,
+    "K": 19,
+    "CA": 20,
+    "SC": 21,
+    "TI": 22,
+    "V": 23,
+    "CR": 24,
+    "MN": 25,
+    "FE": 26,
+    "CO": 27,
+    "NI": 28,
+    "CU": 29,
+    "ZN": 30,
+    "GA": 31,
+    "GE": 32,
+    "AS": 33,
+    "SE": 34,
+    "BR": 35,
+    "KR": 36,
+    "RB": 37,
+    "SR": 38,
+    "Y": 39,
+    "ZR": 40,
+    "NB": 41,
+    "MO": 42,
+    "TC": 43,
+    "RU": 44,
+    "RH": 45,
+    "PD": 46,
+    "AG": 47,
+    "CD": 48,
+    "IN": 49,
+    "SN": 50,
+    "SB": 51,
+    "TE": 52,
+    "I": 53,
+    "XE": 54,
+    "CS": 55,
+    "BA": 56,
+    "LA": 57,
+    "CE": 58,
+    "PR": 59,
+    "ND": 60,
+    "PM": 61,
+    "SM": 62,
+    "EU": 63,
+    "GD": 64,
+    "TB": 65,
+    "DY": 66,
+    "HO": 67,
+    "ER": 68,
+    "TM": 69,
+    "YB": 70,
+    "LU": 71,
+    "HF": 72,
+    "TA": 73,
+    "W": 74,
+    "RE": 75,
+    "OS": 76,
+    "IR": 77,
+    "PT": 78,
+    "AU": 79,
+    "HG": 80,
+    "TL": 81,
+    "PB": 82,
+    "BI": 83,
+    "PO": 84,
+    "AT": 85,
+    "RN": 86,
+    "FR": 87,
+    "RA": 88,
+    "AC": 89,
+    "TH": 90,
+    "PA": 91,
+    "U": 92,
+}
+# Inverse mapping: atomic number → element symbol
+ELEMENT_NUMBER_TO_SYMBOL = {v: k for k, v in ELEMENT_TO_ATOMIC_NUM.items()}
+# =============================================================================
+# Standard heavy atoms per residue type
+# =============================================================================
+PROTEIN_HEAVY_ATOMS = {
+    "ALA": ["N", "CA", "C", "O", "CB"],
+    "ARG": ["N", "CA", "C", "O", "CB", "CG", "CD", "NE", "CZ", "NH1", "NH2"],
+    "ASN": ["N", "CA", "C", "O", "CB", "CG", "OD1", "ND2"],
+    "ASP": ["N", "CA", "C", "O", "CB", "CG", "OD1", "OD2"],
+    "CYS": ["N", "CA", "C", "O", "CB", "SG"],
+    "GLN": ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "NE2"],
+    "GLU": ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "OE2"],
+    "GLY": ["N", "CA", "C", "O"],
+    "HIS": ["N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2"],
+    "ILE": ["N", "CA", "C", "O", "CB", "CG1", "CG2", "CD1"],
+    "LEU": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2"],
+    "LYS": ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"],
+    "MET": ["N", "CA", "C", "O", "CB", "CG", "SD", "CE"],
+    "PHE": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"],
+    "PRO": ["N", "CA", "C", "O", "CB", "CG", "CD"],
+    "SER": ["N", "CA", "C", "O", "CB", "OG"],
+    "THR": ["N", "CA", "C", "O", "CB", "OG1", "CG2"],
+    "TRP": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "CD1",
+        "CD2",
+        "NE1",
+        "CE2",
+        "CE3",
+        "CZ2",
+        "CZ3",
+        "CH2",
+    ],
+    "TYR": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "OH"],
+    "VAL": ["N", "CA", "C", "O", "CB", "CG1", "CG2"],
+    "MSE": ["N", "CA", "C", "O", "CB", "CG", "SD", "CE"],
+    "UNK": ["N", "CA", "C", "O"],
+}
+DNA_HEAVY_ATOMS = {
+    "DA": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "C1'",
+        "N9",
+        "C8",
+        "N7",
+        "C5",
+        "C6",
+        "N6",
+        "N1",
+        "C2",
+        "N3",
+        "C4",
+    ],
+    "DG": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "C1'",
+        "N9",
+        "C8",
+        "N7",
+        "C5",
+        "C6",
+        "O6",
+        "N1",
+        "C2",
+        "N2",
+        "N3",
+        "C4",
+    ],
+    "DC": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "C1'",
+        "N1",
+        "C2",
+        "O2",
+        "N3",
+        "C4",
+        "N4",
+        "C5",
+        "C6",
+    ],
+    "DT": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "C1'",
+        "N1",
+        "C2",
+        "O2",
+        "N3",
+        "C4",
+        "O4",
+        "C5",
+        "C7",
+        "C6",
+    ],
+}
+RNA_HEAVY_ATOMS = {
+    "A": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "O2'",
+        "C1'",
+        "N9",
+        "C8",
+        "N7",
+        "C5",
+        "C6",
+        "N6",
+        "N1",
+        "C2",
+        "N3",
+        "C4",
+    ],
+    "G": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "O2'",
+        "C1'",
+        "N9",
+        "C8",
+        "N7",
+        "C5",
+        "C6",
+        "O6",
+        "N1",
+        "C2",
+        "N2",
+        "N3",
+        "C4",
+    ],
+    "C": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "O2'",
+        "C1'",
+        "N1",
+        "C2",
+        "O2",
+        "N3",
+        "C4",
+        "N4",
+        "C5",
+        "C6",
+    ],
+    "U": [
+        "P",
+        "OP1",
+        "OP2",
+        "O5'",
+        "C5'",
+        "C4'",
+        "O4'",
+        "C3'",
+        "O3'",
+        "C2'",
+        "O2'",
+        "C1'",
+        "N1",
+        "C2",
+        "O2",
+        "N3",
+        "C4",
+        "O4",
+        "C5",
+        "C6",
+    ],
+}
+# Unknown nucleotide backbone atoms
+DNA_BACKBONE_ATOMS = [
+    "P",
+    "OP1",
+    "OP2",
+    "O5'",
+    "C5'",
+    "C4'",
+    "O4'",
+    "C3'",
+    "O3'",
+    "C2'",
+    "C1'",
+]
+RNA_BACKBONE_ATOMS = [
+    "P",
+    "OP1",
+    "OP2",
+    "O5'",
+    "C5'",
+    "C4'",
+    "O4'",
+    "C3'",
+    "O3'",
+    "C2'",
+    "O2'",
+    "C1'",
+]

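The lookup tables in the constants file above can be chained to turn a one-letter protein sequence into res_type indices. The following is a minimal sketch under stated assumptions, not code from the commit: the dictionaries are cut down to a few residues for brevity, and `sequence_to_res_types` is a hypothetical helper name. It relies only on the documented behaviour that MSE is excluded from the reverse mapping and that unknown residues fall back to PROTEIN_UNK_RES_TYPE.

```python
# Illustrative subset of the tables above; the real file lists all 20 residues.
PROTEIN_3TO1 = {"ALA": "A", "GLY": "G", "MET": "M", "MSE": "M"}
PROTEIN_RESIDUE_TO_RES_TYPE = {"ALA": 2, "GLY": 9, "MET": 14, "MSE": 14}
PROTEIN_UNK_RES_TYPE = 22

# Same construction as in the file: MSE is skipped so "M" maps back to MET only.
PROTEIN_1TO3 = {v: k for k, v in PROTEIN_3TO1.items() if k != "MSE"}
PROTEIN_1TO3["X"] = "UNK"


def sequence_to_res_types(sequence: str) -> list[int]:
    """Hypothetical helper: one-letter sequence -> res_type indices."""
    res_types = []
    for letter in sequence:
        # Letters missing from the table are treated as unknown residues.
        three_letter = PROTEIN_1TO3.get(letter, "UNK")
        res_types.append(
            PROTEIN_RESIDUE_TO_RES_TYPE.get(three_letter, PROTEIN_UNK_RES_TYPE)
        )
    return res_types


print(sequence_to_res_types("MAGX"))  # [14, 2, 9, 22]
```

"UNK" is deliberately absent from PROTEIN_RESIDUE_TO_RES_TYPE, so the `.get` fallback is what assigns index 22 to "X".
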
esmfold2_constants_esm3.py ADDED

	@@ -0,0 +1,138 @@

+import os
+from functools import cache
+from pathlib import Path
+from huggingface_hub import snapshot_download
+SEQUENCE_BOS_TOKEN = 0
+SEQUENCE_PAD_TOKEN = 1
+SEQUENCE_EOS_TOKEN = 2
+SEQUENCE_CHAINBREAK_TOKEN = 31
+SEQUENCE_MASK_TOKEN = 32
+VQVAE_CODEBOOK_SIZE = 4096
+VQVAE_SPECIAL_TOKENS = {
+    "MASK": VQVAE_CODEBOOK_SIZE,
+    "EOS": VQVAE_CODEBOOK_SIZE + 1,
+    "BOS": VQVAE_CODEBOOK_SIZE + 2,
+    "PAD": VQVAE_CODEBOOK_SIZE + 3,
+    "CHAINBREAK": VQVAE_CODEBOOK_SIZE + 4,
+}
+VQVAE_DIRECTION_LOSS_BINS = 16
+VQVAE_PAE_BINS = 64
+VQVAE_MAX_PAE_BIN = 31.0
+VQVAE_PLDDT_BINS = 50
+STRUCTURE_MASK_TOKEN = VQVAE_SPECIAL_TOKENS["MASK"]
+STRUCTURE_BOS_TOKEN = VQVAE_SPECIAL_TOKENS["BOS"]
+STRUCTURE_EOS_TOKEN = VQVAE_SPECIAL_TOKENS["EOS"]
+STRUCTURE_PAD_TOKEN = VQVAE_SPECIAL_TOKENS["PAD"]
+STRUCTURE_CHAINBREAK_TOKEN = VQVAE_SPECIAL_TOKENS["CHAINBREAK"]
+STRUCTURE_UNDEFINED_TOKEN = 955
+SASA_PAD_TOKEN = 0
+SS8_PAD_TOKEN = 0
+INTERPRO_PAD_TOKEN = 0
+RESIDUE_PAD_TOKEN = 0
+CHAIN_BREAK_STR = "|"
+SEQUENCE_BOS_STR = "<cls>"
+SEQUENCE_EOS_STR = "<eos>"
+MASK_STR_SHORT = "_"
+SEQUENCE_MASK_STR = "<mask>"
+SASA_MASK_STR = "<unk>"
+SS8_MASK_STR = "<unk>"
+# fmt: off
+SEQUENCE_VOCAB = [
+    "<cls>", "<pad>", "<eos>", "<unk>",
+    "L", "A", "G", "V", "S", "E", "R", "T", "I", "D", "P", "K",
+    "Q", "N", "F", "Y", "M", "H", "W", "C", "X", "B", "U", "Z",
+    "O", ".", "-", "|",
+    "<mask>",
+]
+# fmt: on
+SEQUENCE_STANDARD_AA_MIN_TOKEN = 4  # L
+SEQUENCE_STANDARD_AA_MAX_TOKEN = 24  # X (exclusive)
+SSE_8CLASS_VOCAB = "GHITEBSC"
+SSE_3CLASS_VOCAB = "HEC"
+SSE_8CLASS_TO_3CLASS_MAP = {
+    "G": "H",
+    "H": "H",
+    "I": "H",
+    "T": "C",
+    "E": "E",
+    "B": "E",
+    "S": "C",
+    "C": "C",
+}
+SASA_DISCRETIZATION_BOUNDARIES = [
+    0.8,
+    4.0,
+    9.6,
+    16.4,
+    24.5,
+    32.9,
+    42.0,
+    51.5,
+    61.2,
+    70.9,
+    81.6,
+    93.3,
+    107.2,
+    125.4,
+    151.4,
+]
+MAX_RESIDUE_ANNOTATIONS = 16
+TFIDF_VECTOR_SIZE = 58641
+FUNCTION_TOKENS_DEPTH = 8
+@cache
+def data_root(model: str):
+    if "INFRA_PROVIDER" in os.environ:
+        return Path("")
+    # Try to download from huggingface if it doesn't exist
+    if model.startswith("esm3"):
+        path = Path(snapshot_download(repo_id="biohub/esm3-sm-open-v1"))
+    elif model.startswith("esmc-300"):
+        path = Path(snapshot_download(repo_id="biohub/esmc-300m-2024-12"))
+    elif model.startswith("esmc-600"):
+        path = Path(snapshot_download(repo_id="biohub/esmc-600m-2024-12"))
+    elif model.startswith("esmc-6b"):
+        path = Path(snapshot_download(repo_id="biohub/esmc-6b-2024-12"))
+    else:
+        raise ValueError(f"{model=} is an invalid model name.")
+    return path
+IN_REPO_DATA_FOLDER = Path(__file__).parents[2] / "data"
+INTERPRO_ENTRY = IN_REPO_DATA_FOLDER / "entry_list_safety_29026.list"
+INTERPRO_HIERARCHY = IN_REPO_DATA_FOLDER / "ParentChildTreeFile.txt"
+# NOTE: this is the same path as INTERPRO_HIERARCHY; confirm the intended InterPro2GO file.
+INTERPRO2GO = IN_REPO_DATA_FOLDER / "ParentChildTreeFile.txt"
+INTERPRO_2ID = "data/tag_dict_4_safety_filtered.json"
+LSH_TABLE_PATHS = {"8bit": "data/hyperplanes_8bit_58641.npz"}
+KEYWORDS_VOCABULARY = (
+    IN_REPO_DATA_FOLDER / "keyword_vocabulary_safety_filtered_58641.txt"
+)
+KEYWORDS_IDF = IN_REPO_DATA_FOLDER / "keyword_idf_safety_filtered_58641.npy"
+RESID_CSV = "data/uniref90_and_mgnify90_residue_annotations_gt_1k_proteins.csv"
+INTERPRO2KEYWORDS = IN_REPO_DATA_FOLDER / "interpro_29026_to_keywords_58641.csv"

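Two of the tables in the file above lend themselves to a short illustration: collapsing 8-class secondary structure to 3 classes, and placing a SASA value into a bin. This is a rough sketch only. The helper names `ss8_to_ss3` and `sasa_bin` are hypothetical, and the use of `bisect_right` (bin i covers values from boundary i-1 up to but not including boundary i) is an assumption; the actual tokenizer may use a different edge convention or offset the bins for special tokens.

```python
from bisect import bisect_right

# Copied from the constants above.
SSE_8CLASS_TO_3CLASS_MAP = {
    "G": "H", "H": "H", "I": "H", "T": "C",
    "E": "E", "B": "E", "S": "C", "C": "C",
}
SASA_DISCRETIZATION_BOUNDARIES = [
    0.8, 4.0, 9.6, 16.4, 24.5, 32.9, 42.0, 51.5,
    61.2, 70.9, 81.6, 93.3, 107.2, 125.4, 151.4,
]


def ss8_to_ss3(ss8: str) -> str:
    """Hypothetical helper: map a DSSP-style 8-class string to 3 classes."""
    return "".join(SSE_8CLASS_TO_3CLASS_MAP[c] for c in ss8)


def sasa_bin(value: float) -> int:
    """Hypothetical helper: index of the SASA bin containing `value`.

    Assumes right-open bins, so 15 boundaries give bin indices 0 through 15.
    """
    return bisect_right(SASA_DISCRETIZATION_BOUNDARIES, value)


print(ss8_to_ss3("GHITEBSC"))  # HHHCEECC
print(sasa_bin(5.0))           # 2
```

Fifteen boundaries define sixteen bins, which is why the largest index returned here is 15.
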
esmfold2_input_builder.py ADDED

	@@ -0,0 +1,255 @@

+from dataclasses import dataclass
+from typing import Any, Sequence, TypeAlias, Union
+import numpy as np
+from .esmfold2_msa import MSA
+# fmt: off
+MSAInput: TypeAlias = Union[
+    MSA,
+    None,
+]
+# fmt: on
+@dataclass
+class Modification:
+    position: int  # zero-indexed
+    ccd: str
+    smiles: str | None = None  # TODO(mlee): add smiles support
+@dataclass
+class ProteinInput:
+    id: str | list[str]
+    sequence: str
+    modifications: list[Modification] | None = None
+    msa: MSAInput = None
+@dataclass
+class RNAInput:
+    id: str | list[str]
+    sequence: str
+    modifications: list[Modification] | None = None
+@dataclass
+class DNAInput:
+    id: str | list[str]
+    sequence: str
+    modifications: list[Modification] | None = None
+@dataclass
+class LigandInput:
+    id: str | list[str]
+    smiles: str | None = None
+    ccd: list[str] | None = None
+@dataclass
+class DistogramConditioning:
+    chain_id: str
+    distogram: np.ndarray
+@dataclass
+class PocketConditioning:
+    binder_chain_id: str
+    contacts: list[tuple[str, int]]
+@dataclass
+class CovalentBond:
+    chain_id1: str
+    res_idx1: int
+    atom_idx1: int
+    chain_id2: str
+    res_idx2: int
+    atom_idx2: int
+@dataclass
+class StructurePredictionInput:
+    sequences: Sequence[ProteinInput | RNAInput | DNAInput | LigandInput]
+    pocket: PocketConditioning | None = None
+    distogram_conditioning: list[DistogramConditioning] | None = None
+    covalent_bonds: list[CovalentBond] | None = None
+def serialize_structure_prediction_input(all_atom_input: StructurePredictionInput):
+    def create_chain_data(seq_input, chain_type: str) -> dict[str, Any]:
+        chain_data: dict[str, Any] = {
+            "sequence": seq_input.sequence,
+            "id": seq_input.id,
+            "type": chain_type,
+        }
+        if hasattr(seq_input, "modifications") and seq_input.modifications:
+            mods = [
+                {"position": mod.position, "ccd": mod.ccd}
+                for mod in seq_input.modifications
+            ]
+            chain_data["modifications"] = mods
+        if not hasattr(seq_input, "msa"):
+            pass
+        elif seq_input.msa is None:
+            chain_data["msa"] = None
+        elif isinstance(seq_input.msa, MSA):
+            chain_data["msa"] = {"sequences": seq_input.msa.sequences}
+        else:
+            error_msg = f"MSA must be None or MSA. Got {seq_input.msa} instead."
+            raise AttributeError(error_msg)
+        return chain_data
+    sequences = []
+    for seq_input in all_atom_input.sequences:
+        if isinstance(seq_input, ProteinInput):
+            sequences.append(create_chain_data(seq_input, "protein"))
+        elif isinstance(seq_input, RNAInput):
+            sequences.append(create_chain_data(seq_input, "rna"))
+        elif isinstance(seq_input, DNAInput):
+            sequences.append(create_chain_data(seq_input, "dna"))
+        elif isinstance(seq_input, LigandInput):
+            sequences.append(
+                {
+                    "smiles": seq_input.smiles,
+                    "id": seq_input.id,
+                    "ccd": seq_input.ccd,
+                    "type": "ligand",
+                }
+            )
+        else:
+            raise ValueError(f"Unsupported sequence input type: {type(seq_input)}")
+    result: dict[str, Any] = {"sequences": sequences}
+    if all_atom_input.covalent_bonds is not None:
+        result["covalent_bonds"] = [
+            {
+                "chain_id1": bond.chain_id1,
+                "res_idx1": bond.res_idx1,
+                "atom_idx1": bond.atom_idx1,
+                "chain_id2": bond.chain_id2,
+                "res_idx2": bond.res_idx2,
+                "atom_idx2": bond.atom_idx2,
+            }
+            for bond in all_atom_input.covalent_bonds
+        ]
+    if all_atom_input.pocket is not None:
+        result["pocket"] = {
+            "binder_chain_id": all_atom_input.pocket.binder_chain_id,
+            "contacts": all_atom_input.pocket.contacts,
+        }
+    if all_atom_input.distogram_conditioning is not None:
+        result["distogram_conditioning"] = [
+            {"chain_id": disto.chain_id, "distogram": disto.distogram.tolist()}
+            for disto in all_atom_input.distogram_conditioning
+        ]
+    return result
+def deserialize_structure_prediction_input(
+    data: dict[str, Any],
+) -> StructurePredictionInput:
+    """Inverse of :func:`serialize_structure_prediction_input`.
+    Reconstructs a :class:`StructurePredictionInput` from the JSON-safe dict
+    produced by ``serialize_structure_prediction_input``. Values round-trip;
+    ``DistogramConditioning.distogram`` dtype follows from JSON (``int64``
+    for integer entries, ``float64`` for floats) — cast back to the original
+    dtype if downstream code requires a specific one.
+    """
+    def _mods(chain: dict[str, Any]) -> list[Modification] | None:
+        raw = chain.get("modifications")
+        if not raw:
+            return None
+        return [Modification(position=m["position"], ccd=m["ccd"]) for m in raw]
+    def _msa(chain: dict[str, Any]) -> MSAInput:
+        if "msa" not in chain or chain["msa"] is None:
+            return None
+        msa_blk = chain["msa"]
+        if isinstance(msa_blk, str):
+            raise ValueError(f"Unexpected MSA string value: {msa_blk!r}")
+        return MSA.from_sequences(msa_blk["sequences"])
+    sequences: list[ProteinInput | RNAInput | DNAInput | LigandInput] = []
+    for chain in data["sequences"]:
+        t = chain["type"]
+        if t == "protein":
+            sequences.append(
+                ProteinInput(
+                    id=chain["id"],
+                    sequence=chain["sequence"],
+                    modifications=_mods(chain),
+                    msa=_msa(chain),
+                )
+            )
+        elif t == "rna":
+            sequences.append(
+                RNAInput(
+                    id=chain["id"],
+                    sequence=chain["sequence"],
+                    modifications=_mods(chain),
+                )
+            )
+        elif t == "dna":
+            sequences.append(
+                DNAInput(
+                    id=chain["id"],
+                    sequence=chain["sequence"],
+                    modifications=_mods(chain),
+                )
+            )
+        elif t == "ligand":
+            sequences.append(
+                LigandInput(
+                    id=chain["id"], smiles=chain.get("smiles"), ccd=chain.get("ccd")
+                )
+            )
+        else:
+            raise ValueError(f"Unsupported sequence type: {t!r}")
+    pocket: PocketConditioning | None = None
+    if (pocket_blk := data.get("pocket")) is not None:
+        pocket = PocketConditioning(
+            binder_chain_id=pocket_blk["binder_chain_id"],
+            contacts=[tuple(c) for c in pocket_blk["contacts"]],
+        )
+    distogram_conditioning: list[DistogramConditioning] | None = None
+    if (disto_blk := data.get("distogram_conditioning")) is not None:
+        distogram_conditioning = [
+            DistogramConditioning(
+                chain_id=d["chain_id"], distogram=np.asarray(d["distogram"])
+            )
+            for d in disto_blk
+        ]
+    covalent_bonds: list[CovalentBond] | None = None
+    if (bonds_blk := data.get("covalent_bonds")) is not None:
+        covalent_bonds = [
+            CovalentBond(
+                chain_id1=b["chain_id1"],
+                res_idx1=b["res_idx1"],
+                atom_idx1=b["atom_idx1"],
+                chain_id2=b["chain_id2"],
+                res_idx2=b["res_idx2"],
+                atom_idx2=b["atom_idx2"],
+            )
+            for b in bonds_blk
+        ]
+    return StructurePredictionInput(
+        sequences=sequences,
+        pocket=pocket,
+        distogram_conditioning=distogram_conditioning,
+        covalent_bonds=covalent_bonds,
+    )

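The docstring of deserialize_structure_prediction_input notes that the distogram dtype is not preserved across a round trip. The sketch below reproduces just that step with NumPy and the standard json module, mirroring the `.tolist()` call on the serialize side and the `np.asarray` call on the deserialize side. The payload dict is a hand-built stand-in for one distogram_conditioning entry, not output from the functions in the commit.

```python
import json

import numpy as np

# A float32 distogram, as a caller might plausibly supply it.
original = np.arange(6, dtype=np.float32).reshape(2, 3)

# Serialize side: the array becomes nested Python lists, then JSON text.
payload = json.dumps({"chain_id": "A", "distogram": original.tolist()})

# Deserialize side: np.asarray infers the dtype from Python floats.
restored = np.asarray(json.loads(payload)["distogram"])
print(restored.dtype)  # float64

# Cast back explicitly if downstream code expects the original dtype.
restored_f32 = restored.astype(original.dtype)
print(np.array_equal(restored_f32, original))  # True
```

The values survive unchanged; only the dtype widens, so an explicit `astype` after deserialization is enough when a specific dtype matters.
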
esmfold2_metrics.py ADDED

	@@ -0,0 +1,374 @@

+import numpy as np
+import torch
+import torch.nn.functional as F
+from einops import rearrange
+from torch import Tensor
+from torch.amp import autocast  # type: ignore
+from . import esmfold2_residue_constants
+from .esmfold2_misc import binpack, unbinpack
+from .esmfold2_protein_structure import (
+    compute_alignment_tensors,
+    compute_gdt_ts_no_alignment,
+    compute_rmsd_no_alignment,
+)
+def contact_precision(
+    predictions: Tensor,
+    targets: Tensor,
+    src_lengths: Tensor | None = None,
+    minsep: int = 6,
+    maxsep: int | None = None,
+    override_length: int | None = None,  # for casp
+):
+    """Computes contact precisions.
+    For protein contact prediction, precision is measured for the top (L/K) highest confidence predictions,
+    with L being the length of the protein sequence and K generally being equal to 1 or 5.
+    K = 5 measures the predictions of the very highest confidence contacts, while K = 1 is a more general measure
+    over all relatively high confidence predictions.
+    Since there are roughly ~L true contacts in a protein, this is a reasonable cutoff.
+    Args:
+        predictions (Tensor): Tensor of probabilities of size (B, L, L)
+        targets (Tensor): Tensor of true contacts of size (B, L, L)
+        src_lengths (Tensor, optional): Lengths of each sample in the batch, if using variable lengths.
+            If not provided, inferred from the size of the predictions.
+        minsep (int): Minimum separation distance to consider. We often want to measure contacts at a
+            certain range. Typical ranges are short [6, 12), medium [12, 24), and long [24, inf).
+        maxsep (int, optional): Used in conjunction with minsep to specify a contact range. If not provided,
+            assumes no maximum range.
+        override_length (int, optional): Used for casp evaluation where sometimes the "true" length is not
+            the same as the length of the input. Kept for posterity, we probably don't need this argument.
+    """
+    if predictions.dim() == 2:
+        predictions = predictions.unsqueeze(0)
+    if targets.dim() == 2:
+        targets = targets.unsqueeze(0)
+    # Check sizes
+    if predictions.size() != targets.size():
+        raise ValueError(
+            f"Size mismatch. Received predictions of size {predictions.size()}, "
+            f"targets of size {targets.size()}"
+        )
+    device = predictions.device
+    batch_size, seqlen, _ = predictions.size()
+    # Step 1) Construct a mask of size [B, L, L] to mask invalid contacts
+    seqlen_range = torch.arange(seqlen, device=device)
+    sep = seqlen_range.unsqueeze(0) - seqlen_range.unsqueeze(1)
+    sep = sep.unsqueeze(0)
+    # Mask contacts that are closer than minsep
+    valid_mask = sep >= minsep
+    # Mask contacts where target is negative (padding or unknown)
+    valid_mask = valid_mask & (targets >= 0)  # negative targets are invalid
+    # Mask contacts that are farther than maxsep, if provided
+    if maxsep is not None:
+        valid_mask &= sep < maxsep
+    if src_lengths is not None:
+        # If the lengths of the individual sequences are provided, mask positions
+        # that are farther than the end of the sequence.
+        valid = seqlen_range.unsqueeze(0) < src_lengths.unsqueeze(1)
+        valid_mask &= valid.unsqueeze(1) & valid.unsqueeze(2)
+    else:
+        src_lengths = torch.full([batch_size], seqlen, device=device, dtype=torch.long)
+    # Fill in the logit tensor with -inf for all invalid positions
+    predictions = predictions.masked_fill(~valid_mask, float("-inf"))
+    # Step 2) Select the top half of the prediction (should be symmetric)
+    x_ind, y_ind = np.triu_indices(seqlen, minsep)
+    predictions_upper = predictions[:, x_ind, y_ind]
+    targets_upper = targets[:, x_ind, y_ind]
+    # Step 3) Select the topk values in each batch where k = L (length of sequence)
+    topk = seqlen if override_length is None else max(seqlen, override_length)
+    # Indices are the indices into the predictions corresponding to the most confident predictions
+    indices = predictions_upper.argsort(dim=-1, descending=True)[:, :topk]
+    # topk_targets are the target values corresponding to the above indices
+    topk_targets = targets_upper[torch.arange(batch_size).unsqueeze(1), indices]
+    if topk_targets.size(1) < topk:
+        # If there aren't enough targets, pad to the output.
+        topk_targets = F.pad(topk_targets, [0, topk - topk_targets.size(1)])
+    # Step 4) Sum the accuracy of the top-i predictions for i in 1..L
+    # topk_targets => 1/0 true vs. false contact, sorted by confidence of prediction
+    # cumulative sum => Number of correct answers for the top-i predictions.
+    cumulative_dist = topk_targets.type_as(predictions).cumsum(-1)
+    # Step 5) Find the gather indices. This should be P@(L / K) for various values of K
+    # The values will differ for each batch.
+    gather_lengths = src_lengths.unsqueeze(1)
+    if override_length is not None:
+        gather_lengths = override_length * torch.ones_like(
+            gather_lengths, device=device
+        )
+    # This gets you (0.1 * L, 0.2 * L, 0.3 * L, etc.)
+    gather_indices = (
+        (torch.arange(0.1, 1.1, 0.1, device=device).unsqueeze(0) * gather_lengths).type(
+            torch.long
+        )
+        - 1
+    ).clamp_min(0)
+    # Step 6) Gather the results and divide by the number of guesses to get the precision.
+    binned_cumulative_dist = cumulative_dist.gather(1, gather_indices)
+    binned_precisions = binned_cumulative_dist / (gather_indices + 1).type_as(
+        binned_cumulative_dist
+    )
+    # Select specific P@L/k. pl5 is index 1 b/c that corresponds to L * 0.2 in
+    # gather_indices above
+    pl5 = binned_precisions[:, 1]
+    # pl2 = binned_precisions[:, 4]
+    pl = binned_precisions[:, 9]
+    # AUC is approximated by the mean of the precisions at 0.1 * L, 0.2 * L, ..., 1.0 * L
+    auc = binned_precisions.mean(-1)
+    return {"AUC": auc, "P@L": pl, "P@L5": pl5}
+def compute_lddt(
+    all_atom_pred_pos: torch.Tensor,
+    all_atom_positions: torch.Tensor,
+    all_atom_mask: torch.Tensor,
+    pairwise_all_atom_mask: torch.Tensor | None = None,
+    cutoff: float | torch.Tensor = 15.0,
+    eps: float = 1e-10,
+    per_residue: bool = True,
+    sequence_id: torch.Tensor | None = None,
+) -> torch.Tensor:
+    """
+    Computes LDDT for a protein. Tensor sizes below include some optional dimensions. Specifically:
+        Nstates:
+            all_atom_pred_pos can contain multiple states in the first dimension which corresponds to outputs from different layers of a model (e.g. each IPA block). The return size will be [Nstates x Batch size] if this is included.
+        Natoms:
+            LDDT can be computed for all atoms or some atoms. The second to last dimension should contain the *FLATTENED* representation of L x Natoms. If you want to calculate for atom37, e.g., this will be of size (L * 37). If you are only calculating CA LDDT, it will be of size L.
+    Args:
+        all_atom_pred_pos (Tensor[float], [(Nstates x) B x (L * Natoms x) 3]): Tensor of predicted positions
+        all_atom_positions (Tensor[float], [B x (L * Natoms x) 3]): Tensor of true positions
+        all_atom_mask (Tensor[float], [B x (L * Natoms)]): Tensor of masks, indicating whether an atom exists.
+        pairwise_all_atom_mask (Tensor[float], [B x (L * Natoms x L * Natoms)], optional): Tensor of masks, indicating whether a pair of atoms should be considered in the LDDT calculation.
+        cutoff (float): Max distance to score lddt over. This can either be a float, or a tensor of shape [B, L, L] to allow for per-residue cutoffs, e.g. if you want to use a different cutoff for nucleic acids.
+        per_residue (bool): Whether to return per-residue or full-protein lddt.
+        sequence_id (Tensor, optional): Sequence id tensor for binpacking. NOTE: only supported for lddt_ca calculations, not when Natoms is passed!
+    Returns:
+        LDDT Tensor:
+            if per_residue:
+                Tensor[float], [(Nstates x) B x (L * Natoms)]
+            else:
+                Tensor[float], [(Nstates x) B]
+    """
+    all_atom_mask = all_atom_mask[..., None]  # add a dimension for broadcasting
+    dmat_true = torch.sqrt(
+        eps
+        + torch.sum(
+            (all_atom_positions[..., None, :] - all_atom_positions[..., None, :, :])
+            ** 2,
+            dim=-1,
+        )
+    )
+    dmat_pred = torch.sqrt(
+        eps
+        + torch.sum(
+            (all_atom_pred_pos[..., None, :] - all_atom_pred_pos[..., None, :, :]) ** 2,
+            dim=-1,
+        )
+    )
+    mask = all_atom_mask * rearrange(all_atom_mask, "... a b -> ... b a")
+    if pairwise_all_atom_mask is not None:
+        mask = mask * pairwise_all_atom_mask
+    if sequence_id is not None:
+        # TODO: This will work for lddt_ca, but not for regular lddt
+        # Problem is that regular lddt has natoms * nres scores, so would need to repeat this mask by natoms
+        # Leaving for now because it won't fail silently so should be ok.
+        seqid_mask = sequence_id[..., None] == sequence_id[..., None, :]
+        mask = mask * seqid_mask.type_as(mask)
+    return compute_lddt_from_dmat(
+        dmat_pred, dmat_true, mask, cutoff=cutoff, eps=eps, per_residue=per_residue
+    )
+def compute_lddt_from_dmat(
+    dmat_pred: torch.Tensor,
+    dmat_true: torch.Tensor,
+    pairwise_mask: torch.Tensor,
+    cutoff: float | torch.Tensor = 15.0,
+    eps: float = 1e-10,
+    per_residue: bool = True,
+):
+    """
+    Compute LDDT from pre-computed distance matrices.
+    This is useful when you want to compute LDDT with multiple different masks or cutoffs, e.g. for different molecule types (protein, nucleic acid, etc.).
+    Args:
+        dmat_pred (Tensor[float], [B x L x L]): Predicted distance matrix
+        dmat_true (Tensor[float], [B x L x L]): True distance matrix
+        pairwise_mask (Tensor[float], [B x L x L]): Pairwise mask indicating which pairs of atoms to consider
+        cutoff (float): Max distance to score lddt over. This can either be a float, or a tensor of shape [B, L, L] to allow for per-residue cutoffs, e.g. if you want to use a different cutoff for nucleic acids.
+        per_residue (bool): Whether to return per-residue or full-protein lddt.
+    Returns:
+        LDDT Tensor:
+            if per_residue:
+                Tensor[float], [B x L]
+            else:
+                Tensor[float], [B]
+    """
+    n = dmat_true.size(-1)
+    dists_to_score = (
+        (dmat_true < cutoff)
+        * pairwise_mask
+        * (1.0 - torch.eye(n, device=dmat_true.device))
+    )
+    dist_l1 = torch.abs(dmat_true - dmat_pred)
+    score = (
+        (dist_l1 < 0.5).type(dist_l1.dtype)
+        + (dist_l1 < 1.0).type(dist_l1.dtype)
+        + (dist_l1 < 2.0).type(dist_l1.dtype)
+        + (dist_l1 < 4.0).type(dist_l1.dtype)
+    )
+    score = score * 0.25
+    dims = (-1,) if per_residue else (-2, -1)
+    norm = 1.0 / (eps + torch.sum(dists_to_score, dim=dims))
+    score = norm * (eps + torch.sum(dists_to_score * score, dim=dims))
+    return score
+def compute_lddt_ca(
+    all_atom_pred_pos: torch.Tensor,
+    all_atom_positions: torch.Tensor,
+    all_atom_mask: torch.Tensor,
+    cutoff: float = 15.0,
+    eps: float = 1e-10,
+    per_residue: bool = True,
+    sequence_id: torch.Tensor | None = None,
+) -> torch.Tensor:
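+    # The module is imported as esmfold2_residue_constants; the bare name residue_constants is undefined here.
+    ca_pos = esmfold2_residue_constants.atom_order["CA"]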
+    if all_atom_pred_pos.dim() != 3:
+        all_atom_pred_pos = all_atom_pred_pos[..., ca_pos, :]
+    all_atom_positions = all_atom_positions[..., ca_pos, :]
+    all_atom_mask = all_atom_mask[..., ca_pos]
+    return compute_lddt(
+        all_atom_pred_pos,
+        all_atom_positions,
+        all_atom_mask,
+        cutoff=cutoff,
+        eps=eps,
+        per_residue=per_residue,
+        sequence_id=sequence_id,
+    )
+# NOTE(roshan): no_grad required for stack_variable_length_tensors apparently... let's revisit if we want to backprop
+@torch.no_grad()
+@autocast("cuda", enabled=False)
+def compute_rmsd(
+    mobile: torch.Tensor,
+    target: torch.Tensor,
+    atom_exists_mask: torch.Tensor | None = None,
+    sequence_id: torch.Tensor | None = None,
+    reduction: str = "batch",
+):
+    """
+    Compute RMSD between two batches of structures with support for masking invalid atoms using PyTorch.
+    Args:
+    - mobile (torch.Tensor): Batch of coordinates of structure to be superimposed in shape (B, N, 3)
+    - target (torch.Tensor): Batch of coordinates of structure that is fixed in shape (B, N, 3)
+    - atom_exists_mask (torch.Tensor, optional): Mask of shape (B, N) indicating whether each atom exists
+    - sequence_id (torch.Tensor, optional): Sequence id tensor for binpacking.
+    - reduction (str): One of "batch", "per_sample", "per_residue".
+    Returns:
+    If reduction == "batch":
+        (torch.Tensor): 0-dim, Root Mean Square Deviation between the structures, averaged over the batch
+    If reduction == "per_sample":
+        (torch.Tensor): (B,)-dim, Root Mean Square Deviation between the structures for each sample in the batch
+    If reduction == "per_residue":
+        (torch.Tensor): (B, N)-dim, Root Mean Square Deviation between the structures for each residue in the batch
+    """
+    (centered_mobile, _, centered_target, _, rotation_matrix, num_valid_atoms) = (
+        compute_alignment_tensors(
+            mobile=mobile,
+            target=target,
+            atom_exists_mask=atom_exists_mask,
+            sequence_id=sequence_id,
+        )
+    )
+    # Apply transformation to centered structure
+    rotated_mobile = torch.matmul(centered_mobile, rotation_matrix)
+    # Compute rmsd for centered structures
+    rmsd = compute_rmsd_no_alignment(
+        rotated_mobile, centered_target, num_valid_atoms, reduction=reduction
+    )
+    if reduction == "per_residue" and sequence_id is not None:
+        rmsd = binpack(rmsd, sequence_id, pad_value=0)
+    return rmsd
+def compute_gdt_ts(
+    mobile: torch.Tensor,
+    target: torch.Tensor,
+    atom_exists_mask: torch.Tensor | None = None,
+    sequence_id: torch.Tensor | None = None,
+    reduction: str = "per_sample",
+):
+    """
+    Compute GDT_TS between two batches of structures with support for masking invalid atoms using PyTorch.
+    Args:
+    - mobile (torch.Tensor): Batch of coordinates of structure to be superimposed in shape (B, N, 3)
+    - target (torch.Tensor): Batch of coordinates of structure that is fixed in shape (B, N, 3)
+    - atom_exists_mask (torch.Tensor, optional): Mask of shape (B, N) indicating whether each atom exists
+    - sequence_id (torch.Tensor, optional): Sequence id tensor for binpacking.
+    - reduction (str): One of "batch", "per_sample", "per_residue".
+    Returns:
+    If reduction == "batch":
+        (torch.Tensor): 0-dim, GDT_TS between the structures, averaged over the batch
+    If reduction == "per_sample":
+        (torch.Tensor): (B,)-dim, GDT_TS between the structures for each sample in the batch
+    """
+    if atom_exists_mask is None:
+        atom_exists_mask = torch.isfinite(target).all(dim=-1)
+    (centered_mobile, _, centered_target, _, rotation_matrix, _) = (
+        compute_alignment_tensors(
+            mobile=mobile,
+            target=target,
+            atom_exists_mask=atom_exists_mask,
+            sequence_id=sequence_id,
+        )
+    )
+    # Apply transformation to centered structure
+    rotated_mobile = torch.matmul(centered_mobile, rotation_matrix)
+    # the coordinate tensors returned by `compute_alignment_tensors` are unbinpacked and contain zeros for invalid positions
+    # so `compute_gdt_ts_no_alignment` requires `atom_exists_mask` to be passed and be unbinpacked
+    if sequence_id is not None:
+        atom_exists_mask = unbinpack(atom_exists_mask, sequence_id, pad_value=False)
+    return compute_gdt_ts_no_alignment(
+        rotated_mobile, centered_target, atom_exists_mask, reduction
+    )
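
The thresholding in `compute_lddt` above can be illustrated with a minimal pure-Python sketch for a single structure. The helper name `lddt_from_coords` is hypothetical and not part of this commit. It assumes unmasked points, a scalar cutoff, and a global (not per-residue) score, and it mirrors only the four-threshold averaging and the `eps`-stabilised normalisation shown in the diff.

```python
import math


def lddt_from_coords(pred, true, cutoff=15.0, eps=1e-10):
    """Global lDDT for one structure, given lists of (x, y, z) points.

    Hypothetical helper for illustration only: no batching, no masks.
    """
    n = len(true)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the diagonal is excluded, like (1 - eye) in compute_lddt
            d_true = math.dist(true[i], true[j])
            if d_true >= cutoff:
                continue  # only pairs closer than the cutoff in the reference are scored
            err = abs(d_true - math.dist(pred[i], pred[j]))
            # Fraction of the four tolerance thresholds (in Angstroms) that are satisfied.
            total += 0.25 * sum(err < t for t in (0.5, 1.0, 2.0, 4.0))
            count += 1
    return (eps + total) / (eps + count)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
score = lddt_from_coords(stretched, reference)
```

Because the score depends only on pairwise distances, it is unchanged by a rigid translation or rotation of the prediction, which is why no alignment step is needed here, unlike `compute_rmsd` and `compute_gdt_ts`.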

esmfold2_misc.py ADDED Viewed

	@@ -0,0 +1,505 @@

+from __future__ import annotations
+import os
+from collections import defaultdict
+from contextlib import nullcontext
+from dataclasses import is_dataclass
+from io import BytesIO
+from typing import (
+    Any,
+    ContextManager,
+    Generator,
+    Iterable,
+    Protocol,
+    Sequence,
+    TypeVar,
+    runtime_checkable,
+)
+from warnings import warn
+import huggingface_hub
+import numpy as np
+import torch
+import zstd
+from .esmfold2_constants_esm3 import CHAIN_BREAK_STR
+from .esmfold2_utils_types import FunctionAnnotation
+MAX_SUPPORTED_DISTANCE = 1e6
+TSequence = TypeVar("TSequence", bound=Sequence)
+@runtime_checkable
+class Concatable(Protocol):
+    @classmethod
+    def concat(cls, objs: list[Concatable]) -> Concatable: ...
+def slice_python_object_as_numpy(
+    obj: TSequence, idx: int | list[int] | slice | np.ndarray
+) -> TSequence:
+    """
+    Slice a python object (like a list, string, or tuple) as if it was a numpy object.
+    Example:
+        >>> obj = "ABCDE"
+        >>> slice_python_object_as_numpy(obj, [1, 3, 4])
+        "BDE"
+        >>> obj = [1, 2, 3, 4, 5]
+        >>> slice_python_object_as_numpy(obj, np.arange(5) < 3)
+        [1, 2, 3]
+    """
+    if np.isscalar(idx):
+        idx = [int(idx)]  # type: ignore
+    if isinstance(idx, np.ndarray) and idx.dtype == bool:
+        sliced_obj = [obj[i] for i in np.where(idx)[0]]
+    elif isinstance(idx, slice):
+        sliced_obj = obj[idx]
+    else:
+        sliced_obj = [obj[i] for i in idx]  # type: ignore
+    match obj, sliced_obj:
+        case str(), list():
+            sliced_obj = "".join(sliced_obj)
+        case _:
+            sliced_obj = obj.__class__(sliced_obj)  # type: ignore
+    return sliced_obj  # type: ignore
+def slice_any_object(
+    obj: TSequence, idx: int | list[int] | slice | np.ndarray
+) -> TSequence:
+    """
+    Slice an arbitrary object (like a list, string, or tuple) as if it were a numpy object. Similar to `slice_python_object_as_numpy`, but detects if it's a numpy array or Tensor and uses the existing slice method if so.
+    If the object is a dataclass, it will simply apply the index to the object, under the assumption that the object has correctly implemented numpy indexing.
+    Example:
+        >>> obj = "ABCDE"
+        >>> slice_any_object(obj, [1, 3, 4])
+        "BDE"
+        >>> obj = np.array([1, 2, 3, 4, 5])
+        >>> slice_any_object(obj, np.arange(5) < 3)
+        np.array([1, 2, 3])
+        >>> obj = ProteinChain.from_rcsb("1a3a", "A")
+        >>> slice_any_object(obj, np.arange(len(obj)) < 10)
+        # ProteinChain w/ length 10
+    """
+    if isinstance(obj, (np.ndarray, torch.Tensor)):
+        return obj[idx]  # type: ignore
+    elif is_dataclass(obj):
+        # if passing a dataclass, assume it implements a custom slice
+        return obj[idx]  # type: ignore
+    else:
+        return slice_python_object_as_numpy(obj, idx)
+def rbf(values, v_min, v_max, n_bins=16):
+    """
+    Returns RBF encodings in a new dimension at the end.
+    """
+    rbf_centers = torch.linspace(
+        v_min, v_max, n_bins, device=values.device, dtype=values.dtype
+    )
+    rbf_centers = rbf_centers.view([1] * len(values.shape) + [-1])
+    rbf_std = (v_max - v_min) / n_bins
+    z = (values.unsqueeze(-1) - rbf_centers) / rbf_std
+    return torch.exp(-(z**2))
+def batched_gather(data, inds, dim=0, no_batch_dims=0):
+    ranges = []
+    for i, s in enumerate(data.shape[:no_batch_dims]):
+        r = torch.arange(s)
+        r = r.view(*(*((1,) * i), -1, *((1,) * (len(inds.shape) - i - 1))))
+        ranges.append(r)
+    remaining_dims = [slice(None) for _ in range(len(data.shape) - no_batch_dims)]
+    remaining_dims[dim - no_batch_dims if dim >= 0 else dim] = inds
+    ranges.extend(remaining_dims)
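+    # Index with a tuple: indexing with a list of slices/tensors is deprecated multidimensional indexing.
+    return data[tuple(ranges)]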
+def node_gather(s: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
+    return batched_gather(s.unsqueeze(-3), edges, -2, no_batch_dims=len(s.shape) - 1)
+def knn_graph(
+    coords: torch.Tensor,
+    coord_mask: torch.Tensor,
+    padding_mask: torch.Tensor,
+    sequence_id: torch.Tensor,
+    *,
+    no_knn: int,
+):
+    L = coords.shape[-2]
+    num_by_dist = min(no_knn, L)
+    device = coords.device
+    coords = coords.nan_to_num()
+    coord_mask = ~(coord_mask[..., None, :] & coord_mask[..., :, None])
+    padding_pairwise_mask = padding_mask[..., None, :] | padding_mask[..., :, None]
+    if sequence_id is not None:
+        padding_pairwise_mask |= torch.unsqueeze(sequence_id, 1) != torch.unsqueeze(
+            sequence_id, 2
+        )
+    dists = (coords.unsqueeze(-2) - coords.unsqueeze(-3)).norm(dim=-1)
+    arange = torch.arange(L, device=device)
+    seq_dists = (arange.unsqueeze(-1) - arange.unsqueeze(-2)).abs()
+    # We only support up to a certain distance, above that, we use sequence distance
+    # instead. This is so that when a large portion of the structure is masked out,
+    # the edges are built according to sequence distance.
+    max_dist = MAX_SUPPORTED_DISTANCE
+    if not (dists[~coord_mask] < max_dist).all():
+        raise ValueError(
+            f"Coordinate pairwise distances exceed max supported distance ({max_dist}). "
+        )
+    struct_then_seq_dist = (
+        seq_dists.to(dists.dtype)
+        .mul(1e2)
+        .add(max_dist)
+        .where(coord_mask, dists)
+        .masked_fill(padding_pairwise_mask, torch.inf)
+    )
+    dists, edges = struct_then_seq_dist.sort(dim=-1, descending=False)
+    # This is an L x L tensor, where we index by rows first,
+    # and columns are the edges we should pick.
+    chosen_edges = edges[..., :num_by_dist]
+    chosen_mask = dists[..., :num_by_dist].isfinite()
+    return chosen_edges, chosen_mask
+def stack_variable_length_tensors(
+    sequences: Sequence[torch.Tensor],
+    constant_value: int | float = 0,
+    dtype: torch.dtype | None = None,
+) -> torch.Tensor:
+    """Automatically stack tensors together, padding variable lengths with the
+    value in constant_value. Handles an arbitrary number of dimensions.
+    Examples:
+        >>> tensor1, tensor2 = torch.ones([2]), torch.ones([5])
+        >>> stack_variable_length_tensors([tensor1, tensor2])
+        tensor of shape [2, 5]. First row is [1, 1, 0, 0, 0]. Second row is all ones.
+        >>> tensor1, tensor2 = torch.ones([2, 4]), torch.ones([5, 3])
+        >>> stack_variable_length_tensors([tensor1, tensor2])
+        tensor of shape [2, 5, 4]
+    """
+    batch_size = len(sequences)
+    shape = [batch_size] + np.max([seq.shape for seq in sequences], 0).tolist()
+    if dtype is None:
+        dtype = sequences[0].dtype
+    device = sequences[0].device
+    array = torch.full(shape, constant_value, dtype=dtype, device=device)
+    for arr, seq in zip(array, sequences):
+        arrslice = tuple(slice(dim) for dim in seq.shape)
+        arr[arrslice] = seq
+    return array
+def binpack(
+    tensor: torch.Tensor, sequence_id: torch.Tensor | None, pad_value: int | float
+):
+    """
+    Args:
+        tensor (Tensor): [B, L, ...]
+    Returns:
+        Tensor: [B_binpacked, L_binpacked, ...]
+    """
+    if sequence_id is None:
+        return tensor
+    num_sequences = sequence_id.max(dim=-1).values + 1
+    dims = sequence_id.shape + tensor.shape[2:]
+    output_tensor = torch.full(
+        dims, fill_value=pad_value, dtype=tensor.dtype, device=tensor.device
+    )
+    idx = 0
+    for batch_idx, (batch_seqid, batch_num_sequences) in enumerate(
+        zip(sequence_id, num_sequences)
+    ):
+        for seqid in range(batch_num_sequences):
+            mask = batch_seqid == seqid
+            output_tensor[batch_idx, mask] = tensor[idx, : mask.sum()]
+            idx += 1
+    return output_tensor
+def unbinpack(
+    tensor: torch.Tensor, sequence_id: torch.Tensor | None, pad_value: int | float
+):
+    """
+    Args:
+        tensor (Tensor): [B, L, ...]
+    Returns:
+        Tensor: [B_unbinpacked, L_unbinpacked, ...]
+    """
+    if sequence_id is None:
+        return tensor
+    unpacked_tensors = []
+    num_sequences = sequence_id.max(dim=-1).values + 1
+    for batch_idx, (batch_seqid, batch_num_sequences) in enumerate(
+        zip(sequence_id, num_sequences)
+    ):
+        for seqid in range(batch_num_sequences):
+            mask = batch_seqid == seqid
+            unpacked = tensor[batch_idx, mask]
+            unpacked_tensors.append(unpacked)
+    return stack_variable_length_tensors(unpacked_tensors, pad_value)
+def fp32_autocast_context(device_type: str) -> ContextManager[Any]:  # type: ignore
+    """
+    Returns an autocast context manager that disables downcasting by AMP.
+    Args:
+        device_type: The device type ('cpu', 'cuda', or 'mps')
+    Returns:
+        An autocast context manager with the specified behavior.
+    """
+    if device_type == "cpu":
+        return torch.amp.autocast(device_type, enabled=False)  # type: ignore
+    elif device_type == "mps":
+        # For MPS, just return a no-op context manager (nullcontext) since MPS does not support autocast.
+        return nullcontext()
+    elif device_type == "cuda":
+        return torch.amp.autocast(device_type, dtype=torch.float32)  # type: ignore
+    else:
+        raise ValueError(f"Unsupported device type: {device_type}")
+def merge_ranges(ranges: list[range], merge_gap_max: int | None = None) -> list[range]:
+    """Merge overlapping ranges into sorted, non-overlapping segments.
+    Args:
+        ranges: collection of ranges to merge.
+        merge_gap_max: optionally merge neighboring ranges that are separated by a gap
+          no larger than this size.
+    Returns:
+        non-overlapping ranges merged from the inputs, sorted by position.
+    """
+    ranges = sorted(ranges, key=lambda r: r.start)
+    merge_gap_max = merge_gap_max if merge_gap_max is not None else 0
+    assert merge_gap_max >= 0, f"Invalid merge_gap_max: {merge_gap_max}"
+    merged = []
+    for r in ranges:
+        if not merged:
+            merged.append(r)
+        else:
+            last = merged[-1]
+            if last.stop + merge_gap_max >= r.start:
+                merged[-1] = range(last.start, max(last.stop, r.stop))
+            else:
+                merged.append(r)
+    return merged
+def merge_annotations(
+    annotations: list[FunctionAnnotation], merge_gap_max: int | None = None
+) -> list[FunctionAnnotation]:
+    """Merges annotations into non-overlapping segments.
+    Args:
+        annotations: annotations to merge.
+        merge_gap_max: optionally merge neighboring ranges that are separated by a gap
+          no larger than this size.
+    Returns:
+        non-overlapping annotations with gaps merged.
+    """
+    grouped: dict[str, list[range]] = defaultdict(list)
+    for a in annotations:
+        # +1 since FunctionAnnotation.end is inclusive.
+        grouped[a.label].append(range(a.start, a.end + 1))
+    merged = []
+    for label, ranges in grouped.items():
+        merged_ranges = merge_ranges(ranges, merge_gap_max=merge_gap_max)
+        for range_ in merged_ranges:
+            annotation = FunctionAnnotation(
+                label=label,
+                start=range_.start,
+                end=range_.stop - 1,  # convert range.stop exclusive -> inclusive.
+            )
+            merged.append(annotation)
+    return merged
+def replace_inf(data):
+    if data is None:
+        return None
+    array = np.asarray(data, dtype=np.float32)
+    array = np.where(np.isinf(array), 1000, array)
+    return array.tolist()
+def maybe_tensor(x, convert_none_to_nan: bool = False) -> torch.Tensor | None:
+    if x is None:
+        return None
+    if isinstance(x, torch.Tensor):
+        return x
+    if isinstance(x, list) and all(isinstance(t, torch.Tensor) for t in x):
+        return torch.stack(x)
+    if convert_none_to_nan:
+        x = np.asarray(x, dtype=np.float32)
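+        # NOTE: `x is None` is a plain Python identity check (always False here), so this
+        # np.where is a no-op; None entries were already mapped to NaN by np.asarray above.
+        x = np.where(x is None, np.nan, x)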
+    return torch.tensor(x)
+def maybe_list(x, convert_nan_to_none: bool = False) -> list | None:
+    if x is None:
+        return None
+    if not convert_nan_to_none:
+        return x.tolist()
+    # Handle both torch.tensor and np.ndarray input.
+    if isinstance(x, torch.Tensor):
+        nan_mask = torch.isnan(x).cpu().numpy()
+        np_arr = x.cpu().numpy().astype(object)
+    elif isinstance(x, np.ndarray):
+        nan_mask = np.isnan(x)
+        np_arr = x.astype(object)
+    else:
+        raise TypeError("maybe_list can only work with torch.tensor or np.ndarray.")
+    np_arr[nan_mask] = None
+    return np_arr.tolist()
+def huggingfacehub_login():
+    """Authenticates with the Hugging Face Hub using the HF_TOKEN environment
+    variable, else by prompting the user"""
+    token = os.environ.get("HF_TOKEN")
+    huggingface_hub.login(token=token)
+def get_chainbreak_boundaries_from_sequence(sequence: Sequence[str]) -> np.ndarray:
+    chain_boundaries = [0]
+    for i, aa in enumerate(sequence):
+        if aa == CHAIN_BREAK_STR:
+            if i == (len(sequence) - 1):
+                raise ValueError(
+                    "Encountered chain break token at end of sequence, this is unexpected."
+                )
+            if i == (len(sequence) - 2):
+                warn(
+                    "Encountered chain break token at penultimate position, this is unexpected."
+                )
+            chain_boundaries.append(i)
+            chain_boundaries.append(i + 1)
+    chain_boundaries.append(len(sequence))
+    assert len(chain_boundaries) % 2 == 0
+    chain_boundaries = np.array(chain_boundaries).reshape(-1, 2)
+    return chain_boundaries
+def deserialize_tensors(b: bytes) -> Any:
+    buf = BytesIO(zstd.ZSTD_uncompress(b))
+    d = torch.load(buf, map_location="cpu", weights_only=False)
+    return d
+def join_lists(
+    lists: Sequence[Sequence[Any]], separator: Sequence[Any] | None = None
+) -> list[Any]:
+    """Joins multiple lists with separator element. Like str.join but for lists.
+    Example: [[1, 2], [3], [4]], separator=[0] -> [1, 2, 0, 3, 0, 4]
+    Args:
+        lists: Lists of elements to chain
+        separator: separators to insert between chained output.
+    Returns:
+        Joined lists.
+    """
+    if not lists:
+        return []
+    joined = []
+    joined.extend(lists[0])
+    for l in lists[1:]:
+        if separator:
+            joined.extend(separator)
+        joined.extend(l)
+    return joined
+def iterate_with_intermediate(
+    lists: Iterable, intermediate
+) -> Generator[Any, None, None]:
+    """
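+    Iterate over the iterable, yielding the intermediate value between
+    every pair of consecutive elements. Useful for joining objects with
+    separator tokens.
+    """
+    it = iter(lists)
+    try:
+        first = next(it)
+    except StopIteration:
+        # An empty iterable yields nothing; a bare next() here would surface as a RuntimeError (PEP 479).
+        return
+    yield first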
+    for l in it:
+        yield intermediate
+        yield l
+def concat_objects(objs: Sequence[Any], separator: Any | None = None):
+    """
+    Concat objects with each other using a separator token.
+    Supports:
+        - Concatable (objects that implement `concat` classmethod)
+        - strings
+        - lists
+        - numpy arrays
+        - torch Tensors
+    Example:
+        >>> foo = "abc"
+        >>> bar = "def"
+        >>> concat_objects([foo, bar], "|")
+        "abc|def"
+    """
+    match objs[0]:
+        case Concatable():
+            return objs[0].__class__.concat(objs)  # type: ignore
+        case str():
+            assert isinstance(
+                separator, str
+            ), "Trying to join strings but separator is not a string"
+            return separator.join(objs)
+        case list():
+            if separator is not None:
+                return join_lists(objs, [separator])
+            else:
+                return join_lists(objs)
+        case np.ndarray():
+            if separator is not None:
+                return np.concatenate(
+                    list(iterate_with_intermediate(objs, np.array([separator])))
+                )
+            else:
+                return np.concatenate(objs)
+        case torch.Tensor():
+            if separator is not None:
+                return torch.cat(
+                    list(iterate_with_intermediate(objs, torch.tensor([separator])))
+                )
+            else:
+                return torch.cat(objs)  # type: ignore
+        case _:
+            raise TypeError(type(objs[0]))
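
The inclusive/exclusive bookkeeping in `merge_ranges` and `merge_annotations` above is easy to get wrong, so here is a small self-contained sketch of the same idea. The `Annotation` dataclass is a hypothetical stand-in for `FunctionAnnotation` (assumed to carry `label`, `start`, and an inclusive `end`), and the two functions are condensed re-statements of the logic in the diff, not the committed code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:  # hypothetical stand-in for FunctionAnnotation
    label: str
    start: int
    end: int  # inclusive, as in the diff


def merge_ranges(ranges, merge_gap_max=0):
    """Merge overlapping or near-adjacent half-open ranges, sorted by start."""
    merged = []
    for r in sorted(ranges, key=lambda r: r.start):
        if merged and merged[-1].stop + merge_gap_max >= r.start:
            last = merged[-1]
            merged[-1] = range(last.start, max(last.stop, r.stop))
        else:
            merged.append(r)
    return merged


def merge_annotations(annotations, merge_gap_max=0):
    """Merge annotations per label, converting inclusive ends to range stops and back."""
    grouped = defaultdict(list)
    for a in annotations:
        grouped[a.label].append(range(a.start, a.end + 1))  # inclusive -> exclusive
    return [
        Annotation(label, r.start, r.stop - 1)  # exclusive -> inclusive
        for label, ranges in grouped.items()
        for r in merge_ranges(ranges, merge_gap_max)
    ]


merged = merge_annotations(
    [Annotation("kinase", 1, 10), Annotation("kinase", 11, 20), Annotation("binding", 5, 6)]
)
```

With the default gap of 0, two annotations of the same label that touch end to start (residues 1 to 10 and 11 to 20) collapse into one, while annotations with different labels are never merged with each other.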

esmfold2_mmcif_parsing.py ADDED Viewed

	@@ -0,0 +1,470 @@

+from __future__ import annotations
+import functools
+import io
+import os
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Union
+import biotite.structure as bs
+import biotite.structure.io.pdbx as pdbx
+from . import esmfold2_residue_constants
+# Define PathOrBuffer for the opensource version
+PathOrBuffer = Union[str, os.PathLike, io.StringIO]
+class NoProteinError(Exception):
+    pass
+@dataclass
+class Residue:
+    residue_number: int | None = None
+    insertion_code: str = ""
+    hetflag: bool = False
+@dataclass
+class MmcifHeader:
+    release_date: datetime | None = None
+    resolution: float | None = None
+    structure_method: str = "UNKNOWN"
+class MmcifWrapper:
+    def __init__(self, id: str | None = None):
+        self.id: str = id or ""
+        self.raw: pdbx.CIFFile | None = None
+        self.structure: bs.AtomArray
+        self.header: MmcifHeader = MmcifHeader()
+        self.entities: dict[int, list[str]] = {}
+        self.chain_to_seqres: dict[str, str] = {}
+        self.seqres_to_structure: dict[str, dict[int, Residue]] = {}
+    @classmethod
+    def read(cls, path: PathOrBuffer, id: str | None = None) -> MmcifWrapper:
+        obj = cls(id=id)
+        obj._load(path)
+        return obj
+    def _load(self, path: PathOrBuffer, fileid: str | None = None):
+        """Load mmCIF data from file."""
+        self.raw = pdbx.CIFFile.read(path)
+        self._parse_structure()
+        self._parse_header()
+        self._parse_entities()
+        self._parse_sequences()
+    def _parse_structure(self):
+        """Parse the atomic structure from mmCIF."""
+        try:
+            structure = pdbx.get_structure(self.raw, model=1)
+            if structure is None or not isinstance(structure, bs.AtomArray):
+                raise NoProteinError("No structure found in mmCIF file")
+            if len(structure) == 0:
+                raise NoProteinError("Empty structure in mmCIF file")
+            self.structure = structure
+        except Exception as e:
+            raise ValueError(f"Failed to parse structure: {e}") from e
+    def _parse_header(self):
+        """Parse header information from mmCIF."""
+        if not self.raw:
+            return
+        try:
+            # Get the first (and usually only) block
+            block = self.raw.block
+            # Parse release date (approximated here by the initial deposition date)
+            if "pdbx_database_status" in block:
+                status_cat = block["pdbx_database_status"]
+                if "recvd_initial_deposition_date" in status_cat:
+                    date_str = status_cat["recvd_initial_deposition_date"].as_item()
+                    if date_str and date_str != "?":
+                        try:
+                            self.header.release_date = datetime.strptime(
+                                date_str, "%Y-%m-%d"
+                            )
+                        except ValueError:
+                            pass
+            # Parse resolution
+            if "refine" in block:
+                refine_cat = block["refine"]
+                if "ls_d_res_high" in refine_cat:
+                    res_str = refine_cat["ls_d_res_high"].as_item()
+                    if res_str and res_str != "?":
+                        try:
+                            self.header.resolution = float(res_str)
+                        except ValueError:
+                            pass
+            # Parse structure method
+            if "exptl" in block:
+                exptl_cat = block["exptl"]
+                if "method" in exptl_cat:
+                    method = exptl_cat["method"].as_item()
+                    if method and method != "?":
+                        self.header.structure_method = method.upper()
+        except Exception:
+            # If parsing fails, keep default values
+            pass
+    def _parse_entities(self):
+        """Parse entity information and map to chains."""
+        if not self.raw:
+            return
+        try:
+            block = self.raw.block
+            # Parse entity information
+            if "entity" in block:
+                entity_cat = block["entity"]
+                entity_ids = entity_cat["id"].as_array(str)
+                entity_types = entity_cat["type"].as_array(str)
+                # Initialize entities dict with all entities (not just polymers)
+                for i, (entity_id, entity_type) in enumerate(
+                    zip(entity_ids, entity_types)
+                ):
+                    self.entities[int(entity_id)] = []
+            # Map polymer chains to entities using entity_poly
+            if "entity_poly" in block:
+                poly_cat = block["entity_poly"]
+                entity_ids = poly_cat["entity_id"].as_array(str)
+                chain_lists = poly_cat["pdbx_strand_id"].as_array(str)
+                for entity_id, chain_list in zip(entity_ids, chain_lists):
+                    entity_id = int(entity_id)
+                    # Chain list is comma-separated
+                    chains = [c.strip() for c in chain_list.split(",") if c.strip()]
+                    if entity_id in self.entities:
+                        self.entities[entity_id] = chains
+            # Map non-polymer chains using struct_asym for entities not covered by entity_poly
+            if "struct_asym" in block:
+                asym_cat = block["struct_asym"]
+                asym_ids = asym_cat["id"].as_array(str)
+                entity_ids = asym_cat["entity_id"].as_array(str)
+                for asym_id, entity_id in zip(asym_ids, entity_ids):
+                    entity_id = int(entity_id)
+                    # Only add if entity exists but has no chains yet (non-polymer entities)
+                    if entity_id in self.entities and not self.entities[entity_id]:
+                        self.entities[entity_id].append(asym_id)
+        except Exception:
+            # If parsing fails, try to infer from structure
+            if (
+                self.structure
+                and hasattr(self.structure, "chain_id")
+                and self.structure.chain_id is not None
+                and hasattr(self.structure.chain_id, "__iter__")
+            ):
+                chain_ids = list(set(self.structure.chain_id))
+                self.entities = {1: chain_ids}
+    def _parse_sequences(self):
+        """Parse sequence information from mmCIF."""
+        if not self.raw:
+            return
+        block = self.raw.block
+        # Parse polymer sequences
+        if "entity_poly" in block:
+            poly_cat = block["entity_poly"]
+            entity_ids = poly_cat["entity_id"].as_array(str)
+            sequences = poly_cat["pdbx_seq_one_letter_code_can"].as_array(str)
+            chain_lists = poly_cat["pdbx_strand_id"].as_array(str)
+            for entity_id, sequence, chain_list in zip(
+                entity_ids, sequences, chain_lists
+            ):
+                # Clean up sequence (remove whitespace and newlines)
+                clean_seq = "".join(sequence.split())
+                chains = [c.strip() for c in chain_list.split(",") if c.strip()]
+                for chain_id in chains:
+                    self.chain_to_seqres[chain_id] = clean_seq
+        # Parse sequence to structure mapping
+        if "pdbx_poly_seq_scheme" in block:
+            seq_cat = block["pdbx_poly_seq_scheme"]
+            asym_ids = seq_cat["asym_id"].as_array(str)  # Internal chain IDs
+            seq_positions = seq_cat["seq_id"].as_array(str)
+            auth_seq_nums = seq_cat["auth_seq_num"].as_array(str)
+            ins_codes = (
+                seq_cat["pdb_ins_code"].as_array(str)
+                if "pdb_ins_code" in seq_cat
+                else [""] * len(asym_ids)
+            )
+            hetflags = (
+                seq_cat["hetflag"].as_array(str)
+                if "hetflag" in seq_cat
+                else ["N"] * len(asym_ids)
+            )
+            # Get author chain IDs if available
+            auth_chain_ids = (
+                seq_cat["pdb_strand_id"].as_array(str)
+                if "pdb_strand_id" in seq_cat
+                else asym_ids  # Fallback to internal IDs
+            )
+            # Build mapping from internal chain ID to author chain ID
+            asym_to_auth_mapping = {}
+            for asym_id, auth_id in zip(asym_ids, auth_chain_ids):
+                asym_to_auth_mapping[asym_id] = auth_id
+            # Group by internal chain ID first, then map to author chain ID
+            chain_data = {}
+            for asym_id, seq_pos, auth_seq, ins_code, hetflag in zip(
+                asym_ids, seq_positions, auth_seq_nums, ins_codes, hetflags
+            ):
+                if asym_id not in chain_data:
+                    chain_data[asym_id] = {}
+                try:
+                    seq_index = int(seq_pos) - 1  # Convert to 0-based indexing
+                    res_num = int(auth_seq) if auth_seq != "?" else None
+                except ValueError:
+                    continue
+                if res_num is not None:
+                    # Convert mmCIF "." and "?" to empty string
+                    clean_ins_code = "" if ins_code in [".", "?"] else ins_code
+                else:
+                    clean_ins_code = ""
+                    res_num = None
+                is_het = hetflag.upper() == "Y"  # type: ignore
+                chain_data[asym_id][seq_index] = Residue(
+                    residue_number=res_num,
+                    insertion_code=clean_ins_code,  # type: ignore
+                    hetflag=is_het,
+                )
+            # Handle cases where multiple residues have the same auth_seq_num
+            # by adjusting residue numbers to be unique within each chain
+            for asym_id, residue_data in chain_data.items():
+                # Check if there are duplicate residue numbers in this chain
+                positions_with_same_num = {}
+                for seq_idx, res_at_pos in residue_data.items():
+                    if res_at_pos.residue_number is not None:
+                        res_num = res_at_pos.residue_number
+                        if res_num not in positions_with_same_num:
+                            positions_with_same_num[res_num] = []
+                        positions_with_same_num[res_num].append(seq_idx)
+                # Fix duplicate residue numbers by making them sequential
+                for res_num, seq_indices in positions_with_same_num.items():
+                    if len(seq_indices) > 1:
+                        # Multiple residues have the same residue number
+                        # Make them sequential starting from the original number
+                        seq_indices.sort()  # Ensure consistent ordering
+                        for i, seq_idx in enumerate(seq_indices):
+                            original_pos = residue_data[seq_idx]
+                            new_pos = Residue(
+                                residue_number=res_num + i,
+                                insertion_code=original_pos.insertion_code,
+                                hetflag=original_pos.hetflag,
+                            )
+                            residue_data[seq_idx] = new_pos
+            # Create ordered mappings using author chain IDs
+            for asym_id in chain_data:
+                auth_chain_id = asym_to_auth_mapping.get(asym_id, asym_id)
+                if auth_chain_id in self.chain_to_seqres:
+                    seq_len = len(self.chain_to_seqres[auth_chain_id])
+                    ordered_mapping = {}
+                    for i in range(seq_len):
+                        if i in chain_data[asym_id]:
+                            ordered_mapping[i] = chain_data[asym_id][i]
+                        else:
+                            # Missing residue - no structure coordinates
+                            ordered_mapping[i] = Residue(
+                                residue_number=None, insertion_code="", hetflag=False
+                            )
+                    self.seqres_to_structure[auth_chain_id] = ordered_mapping
+                else:
+                    # Handle case where auth_chain_id is not in chain_to_seqres
+                    # This can happen if the chain is not a polymer or if there's a parsing issue
+                    # Create a basic mapping based on the chain_data
+                    if chain_data[asym_id]:
+                        # Sort by sequence index to create ordered mapping
+                        sorted_indices = sorted(chain_data[asym_id].keys())
+                        ordered_mapping = {}
+                        for i, seq_idx in enumerate(sorted_indices):
+                            ordered_mapping[i] = chain_data[asym_id][seq_idx]
+                        self.seqres_to_structure[auth_chain_id] = ordered_mapping
+        # Ensure all chains have complete mappings
+        for chain_id in self.chain_to_seqres:
+            if chain_id not in self.seqres_to_structure:
+                seq_len = len(self.chain_to_seqres[chain_id])
+                self.seqres_to_structure[chain_id] = {
+                    i: Residue(residue_number=None, insertion_code="", hetflag=False)
+                    for i in range(seq_len)
+                }
+            else:
+                # Fill in any missing indices
+                seq_len = len(self.chain_to_seqres[chain_id])
+                mapping = self.seqres_to_structure[chain_id]
+                for i in range(seq_len):
+                    if i not in mapping:
+                        mapping[i] = Residue(
+                            residue_number=None, insertion_code="", hetflag=False
+                        )
+        # Fallback: create basic mappings from structure for missing chains
+        if (
+            self.structure
+            and hasattr(self.structure, "chain_id")
+            and self.structure.chain_id is not None
+            and hasattr(self.structure.chain_id, "__iter__")
+        ):
+            for chain_id in set(self.structure.chain_id):
+                if chain_id not in self.seqres_to_structure:
+                    chain_structure = self.structure[
+                        self.structure.chain_id == chain_id
+                    ]
+                    if (
+                        hasattr(chain_structure, "res_id")
+                        and chain_structure.res_id is not None
+                        and hasattr(chain_structure.res_id, "__iter__")
+                    ):
+                        residue_ids = list(set(chain_structure.res_id))
+                        residue_ids.sort()
+                        self.seqres_to_structure[chain_id] = {
+                            i: Residue(
+                                residue_number=res_id, insertion_code="", hetflag=False
+                            )
+                            for i, res_id in enumerate(residue_ids)
+                        }
+    def _parse_nonpoly_from_mmcif(self) -> dict[tuple, bs.AtomArray]:
+        """Parse non-polymer coordinates from mmCIF block data."""
+        nonpoly_coords = {}
+        # Get non-polymer entities from the mmCIF block
+        assert self.raw is not None
+        block = self.raw.block
+        nonpoly_entities = set()
+        # Find non-polymer entities
+        if "entity" in block:
+            entity_cat = block["entity"]
+            entity_ids = entity_cat["id"].as_array(str)
+            entity_types = entity_cat["type"].as_array(str)
+            for entity_id, entity_type in zip(entity_ids, entity_types):
+                if entity_type.upper() in ["NON-POLYMER", "WATER", "BRANCHED"]:
+                    nonpoly_entities.add(entity_id)
+        # Map non-polymer entities to their component IDs
+        entity_to_comp_id = {}
+        if "pdbx_entity_nonpoly" in block:
+            nonpoly_cat = block["pdbx_entity_nonpoly"]
+            entity_ids = nonpoly_cat["entity_id"].as_array(str)
+            comp_ids = nonpoly_cat["comp_id"].as_array(str)
+            for entity_id, comp_id in zip(entity_ids, comp_ids):
+                if entity_id in nonpoly_entities:
+                    entity_to_comp_id[entity_id] = comp_id
+        # Get atom site information for non-polymers
+        if "atom_site" in block:
+            atom_cat = block["atom_site"]
+            atom_chain_ids = atom_cat["label_asym_id"].as_array(str)
+            atom_entity_ids = atom_cat["label_entity_id"].as_array(str)
+            atom_comp_ids = atom_cat["label_comp_id"].as_array(str)
+            # Group non-polymer atoms by entity and chain
+            nonpoly_atom_groups = {}
+            for i, (chain_id, entity_id, comp_id) in enumerate(
+                zip(atom_chain_ids, atom_entity_ids, atom_comp_ids)
+            ):
+                if entity_id in nonpoly_entities:
+                    key = (comp_id, chain_id)
+                    if key not in nonpoly_atom_groups:
+                        nonpoly_atom_groups[key] = []
+                    nonpoly_atom_groups[key].append(i)
+            # Extract coordinates for each non-polymer group
+            for (comp_id, chain_id), atom_indices in nonpoly_atom_groups.items():
+                # Match atoms by comparing chain_id and residue name
+                structure_mask = (self.structure.chain_id == chain_id) & (
+                    self.structure.res_name == comp_id
+                )
+                if structure_mask.any():
+                    nonpoly_array = self.structure[structure_mask]
+                    if (
+                        isinstance(nonpoly_array, (bs.AtomArray, bs.AtomArrayStack))
+                        and len(nonpoly_array) > 0
+                    ):
+                        nonpoly_coords[(comp_id, chain_id)] = nonpoly_array
+        return nonpoly_coords
+    def _parse_nonpoly_fallback(self) -> dict[tuple, bs.AtomArray]:
+        """Fallback method to extract heteroatoms directly from structure."""
+        nonpoly_coords = {}
+        if not (self.structure and hasattr(self.structure, "chain_id")):
+            return nonpoly_coords
+        # Create set of standard residues from residue_constants
+        standard_residues = set(residue_constants.resnames[:-1])  # Exclude 'UNK'
+        # Add RNA and DNA nucleotide names
+        standard_residues.update({"A", "C", "G", "T", "U", "DA", "DC", "DG", "DT"})
+        if hasattr(self.structure, "chain_id") and self.structure.chain_id is not None:
+            for chain_id in set(self.structure.chain_id):
+                chain_structure = self.structure[self.structure.chain_id == chain_id]
+                # Find non-standard residues
+                if (
+                    hasattr(chain_structure, "res_name")
+                    and chain_structure.res_name is not None
+                    and hasattr(chain_structure.res_name, "__iter__")
+                ):
+                    for res_name in set(chain_structure.res_name):
+                        if res_name not in standard_residues:
+                            res_mask = (chain_structure.chain_id == chain_id) & (
+                                chain_structure.res_name == res_name
+                            )
+                            if res_mask.any() and isinstance(
+                                chain_structure, (bs.AtomArray, bs.AtomArrayStack)
+                            ):
+                                nonpoly_array = chain_structure[res_mask]
+                                nonpoly_coords[(res_name, chain_id)] = nonpoly_array
+        return nonpoly_coords
+    @functools.cached_property
+    def non_polymer_coords(self) -> dict[tuple, bs.AtomArray]:
+        """
+        Extract non-polymer coordinates (ligands, cofactors, etc.) from mmCIF structure.
+        Returns a dictionary mapping (comp_id, chain_id) tuples to AtomArrays.
+        """
+        if not self.structure or not self.raw:
+            return {}
+        try:
+            return self._parse_nonpoly_from_mmcif()
+        except Exception:
+            return self._parse_nonpoly_fallback()
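
The duplicate-number handling in the parser above (`residue_number=res_num + i`) can be tried in isolation. The sketch below is not part of the commit: it assumes a plain dict from sequence index to residue number instead of `Residue` objects, and the helper name `renumber_duplicates` is hypothetical. It mirrors the parser's rule and also shows that the rule can still produce a collision when the next number is already in use.

```python
from __future__ import annotations

from collections import defaultdict


def renumber_duplicates(numbers: dict[int, int | None]) -> dict[int, int | None]:
    """Hypothetical helper: the i-th holder of a duplicated number n gets n + i."""
    # Group sequence indices by the residue number they carry.
    positions: dict[int, list[int]] = defaultdict(list)
    for seq_idx, res_num in numbers.items():
        if res_num is not None:
            positions[res_num].append(seq_idx)

    result = dict(numbers)
    for res_num, seq_indices in positions.items():
        if len(seq_indices) > 1:
            # Sort for a deterministic order, then shift each holder by its rank.
            for offset, seq_idx in enumerate(sorted(seq_indices)):
                result[seq_idx] = res_num + offset
    return result


# Sequence positions 1 and 2 both carry number 11; position 3 is unresolved.
fixed = renumber_duplicates({0: 10, 1: 11, 2: 11, 3: None})
print(fixed)

# Caveat: if the following number is already taken, the shifted value collides.
collided = renumber_duplicates({0: 5, 1: 5, 2: 6})
print(collided)
```

In the second call, positions 1 and 2 both end up with number 6, so a caller that needs strictly unique numbers per chain would have to add its own check.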

esmfold2_molecular_complex.py ADDED

	@@ -0,0 +1,1226 @@

+from __future__ import annotations
+import io
+import os
+import re
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from subprocess import check_output
+from tempfile import TemporaryDirectory
+from typing import TYPE_CHECKING, Any
+import biotite.structure as bs
+import biotite.structure.io.pdbx as pdbx
+import brotli
+import msgpack
+import numpy as np
+import torch
+from biotite.structure.io.pdbx import (
+    CIFCategory,
+    CIFColumn,
+    CIFData,
+    CIFFile,
+    set_structure,
+)
+from . import esmfold2_residue_constants as residue_constants
+from .esmfold2_metrics import compute_lddt, compute_rmsd
+from .esmfold2_protein_complex import ProteinComplex, ProteinComplexMetadata
+@dataclass
+class MolecularComplexResult:
+    """Result of molecular complex folding"""
+    complex: MolecularComplex
+    plddt: torch.Tensor | None = None
+    ptm: float | None = None
+    iptm: float | None = None
+    pae: torch.Tensor | None = None
+    distogram: torch.Tensor | None = None
+    pair_chains_iptm: torch.Tensor | None = None
+    output_embedding_sequence: torch.Tensor | None = None
+    output_embedding_pair_pooled: torch.Tensor | None = None
+    residue_index: torch.Tensor | None = None
+    entity_id: torch.Tensor | None = None
+    sae_features: np.ndarray | None = None  # [L, n_features]
+@dataclass
+class MolecularComplexMetadata:
+    """Metadata for MolecularComplex objects."""
+    entity_lookup: dict[int, str]
+    chain_lookup: dict[int, str]
+    assembly_composition: dict[str, list[str]] | None = None
+@dataclass
+class Molecule:
+    """Represents a single molecule/token within a MolecularComplex."""
+    token: str
+    token_idx: int
+    atom_positions: np.ndarray  # [N_atoms, 3]
+    atom_elements: np.ndarray  # [N_atoms] element strings
+    atom_names: np.ndarray | None = None  # [N_atoms] atom names (optional)
+    atom_hetero: np.ndarray | None = None  # [N_atoms] hetero flags (optional)
+    residue_type: int = 0
+    molecule_type: int = 0  # PROTEIN=0, RNA=1, DNA=2, LIGAND=3
+    confidence: float = 0.0
+@dataclass(frozen=True)
+class MolecularComplex:
+    """
+    Dataclass representing a molecular complex with support for proteins, nucleic acids, and ligands.
+    Uses a flat atom representation with token-based sequence indexing, supporting all atom types
+    beyond the traditional atom37 protein representation.
+    """
+    id: str
+    sequence: list[str]  # Token sequence like ['MET', 'LYS', 'A', 'G', 'ATP']
+    # Flat atom arrays - simplified representation
+    atom_positions: np.ndarray  # [N_atoms, 3] 3D coordinates
+    atom_elements: np.ndarray  # [N_atoms] element strings
+    # Token-to-atom mapping for efficient access
+    token_to_atoms: np.ndarray  # [N_tokens, 2] start/end indices into atoms array
+    # Chain information
+    chain_id: np.ndarray  # [N_tokens] chain identifier for each token
+    # Confidence data
+    plddt: np.ndarray  # Per-token confidence scores [N_tokens]
+    # Metadata
+    metadata: MolecularComplexMetadata
+    # Optional atom names and hetero flags (preserved from original structures)
+    atom_names: np.ndarray | None = None  # [N_atoms] atom names (optional)
+    atom_hetero: np.ndarray | None = None  # [N_atoms] hetero flags (optional)
+    def __post_init__(self):
+        """Validate array dimensions."""
+        n_tokens = len(self.sequence)
+        n_atoms = len(self.atom_positions)
+        assert (
+            self.token_to_atoms.shape[0] == n_tokens
+        ), f"token_to_atoms shape {self.token_to_atoms.shape} != {n_tokens} tokens"
+        assert (
+            self.chain_id.shape[0] == n_tokens
+        ), f"chain_id shape {self.chain_id.shape} != {n_tokens} tokens"
+        assert (
+            self.plddt.shape[0] == n_tokens
+        ), f"plddt shape {self.plddt.shape} != {n_tokens} tokens"
+        if self.atom_names is not None:
+            assert (
+                self.atom_names.shape[0] == n_atoms
+            ), f"atom_names shape {self.atom_names.shape} != {n_atoms} atoms"
+        if self.atom_hetero is not None:
+            assert (
+                self.atom_hetero.shape[0] == n_atoms
+            ), f"atom_hetero shape {self.atom_hetero.shape} != {n_atoms} atoms"
+    def __len__(self) -> int:
+        """Return number of tokens."""
+        return len(self.sequence)
+    def __getitem__(self, idx: int) -> Molecule:
+        """Access individual molecules/tokens by index."""
+        if idx >= len(self.sequence) or idx < 0:
+            raise IndexError(
+                f"Token index {idx} out of range for {len(self.sequence)} tokens"
+            )
+        token = self.sequence[idx]
+        start_atom, end_atom = self.token_to_atoms[idx]
+        # Extract atom data for this token
+        token_atom_positions = self.atom_positions[start_atom:end_atom]
+        token_atom_elements = self.atom_elements[start_atom:end_atom]
+        token_atom_names = None
+        if self.atom_names is not None:
+            token_atom_names = self.atom_names[start_atom:end_atom]
+        token_atom_hetero = None
+        if self.atom_hetero is not None:
+            token_atom_hetero = self.atom_hetero[start_atom:end_atom]
+        # Default values for residue/molecule type (would be extended based on actual implementation)
+        residue_type = 0  # Default to standard residue
+        molecule_type = 0  # Default to protein
+        return Molecule(
+            token=token,
+            token_idx=idx,
+            atom_positions=token_atom_positions,
+            atom_elements=token_atom_elements,
+            atom_names=token_atom_names,
+            atom_hetero=token_atom_hetero,
+            residue_type=residue_type,
+            molecule_type=molecule_type,
+            confidence=float(self.plddt[idx]),
+        )
+    @property
+    def atom_coordinates(self) -> np.ndarray:
+        """Get flat array of all atom coordinates [N_atoms, 3]."""
+        return self.atom_positions
+    # Conversion methods
+    @classmethod
+    def from_protein_complex(cls, pc: ProteinComplex) -> "MolecularComplex":
+        """Convert a ProteinComplex to MolecularComplex.
+        Args:
+            pc: ProteinComplex object with atom37 representation
+        Returns:
+            MolecularComplex with flat atom arrays and token-based indexing
+        """
+        from . import esmfold2_residue_constants as residue_constants
+        # Extract sequence without chain breaks
+        sequence_no_breaks = pc.sequence.replace("|", "")
+        sequence_tokens = [
+            residue_constants.restype_1to3.get(aa, "UNK") for aa in sequence_no_breaks
+        ]
+        # Convert atom37 to flat arrays
+        flat_positions = []
+        flat_elements = []
+        flat_names = []
+        flat_hetero = []
+        token_to_atoms = []
+        atom_idx = 0
+        for i, aa in enumerate(pc.sequence):
+            if aa == "|":
+                # Skip chain break tokens
+                continue
+            # Get atom37 positions and mask for this residue.
+            # ProteinComplex arrays are indexed by sequence position (including |),
+            # so use `i` not a separate residue counter.
+            res_positions = pc.atom37_positions[i]  # [37, 3]
+            res_mask = pc.atom37_mask[i]  # [37]
+            # Track start position for this token
+            token_start = atom_idx
+            # Process each atom type in atom37 representation
+            for atom_type_idx, atom_name in enumerate(residue_constants.atom_types):
+                if res_mask[atom_type_idx]:  # Atom is present
+                    # Add position
+                    flat_positions.append(res_positions[atom_type_idx])
+                    # Determine element from atom name
+                    element = (
+                        atom_name[0] if atom_name else "C"
+                    )  # First character is element
+                    flat_elements.append(element)
+                    # Add atom name
+                    flat_names.append(atom_name)
+                    # Add hetero flag (all proteins are non-hetero)
+                    flat_hetero.append(False)
+                    atom_idx += 1
+            # Record token-to-atom mapping [start_idx, end_idx)
+            token_to_atoms.append([token_start, atom_idx])
+        # Convert to numpy arrays
+        atom_positions = np.array(flat_positions, dtype=np.float32)
+        atom_elements = np.array(flat_elements, dtype=object)
+        atom_names = np.array(flat_names, dtype=object)
+        atom_hetero = np.array(flat_hetero, dtype=bool)
+        token_to_atoms_array = np.array(token_to_atoms, dtype=np.int32)
+        # Extract confidence scores and chain_ids (skip chain breaks)
+        confidence_scores = []
+        chain_ids = []
+        for seq_idx, aa in enumerate(pc.sequence):
+            if aa != "|":
+                confidence_scores.append(pc.confidence[seq_idx])
+                chain_ids.append(pc.chain_id[seq_idx])
+        confidence_array = np.array(confidence_scores, dtype=np.float32)
+        chain_id_array = np.array(chain_ids, dtype=np.int64)
+        # Create metadata - convert entity IDs to strings for MolecularComplexMetadata
+        entity_lookup_str = {k: str(v) for k, v in pc.metadata.entity_lookup.items()}
+        metadata = MolecularComplexMetadata(
+            entity_lookup=entity_lookup_str,
+            chain_lookup=pc.metadata.chain_lookup,
+            assembly_composition=pc.metadata.assembly_composition,
+        )
+        return cls(
+            id=pc.id,
+            sequence=sequence_tokens,
+            atom_positions=atom_positions,
+            atom_elements=atom_elements,
+            token_to_atoms=token_to_atoms_array,
+            chain_id=chain_id_array,
+            plddt=confidence_array,
+            metadata=metadata,
+            atom_names=atom_names,
+            atom_hetero=atom_hetero,
+        )
+    def to_protein_complex(self) -> ProteinComplex:
+        """Convert MolecularComplex back to ProteinComplex format.
+        Extracts only protein tokens and converts from flat atom representation
+        back to atom37 format used by ProteinComplex.
+        Returns:
+            ProteinComplex with protein residues only, excluding ligands/nucleic acids
+        """
+        from . import esmfold2_residue_constants as residue_constants
+        # No need for element mapping - already using element characters
+        # Filter for protein tokens only (skip ligands, nucleic acids)
+        protein_tokens = []
+        protein_indices = []
+        for i, token in enumerate(self.sequence):
+            # Check if token is a standard 3-letter amino acid code
+            if token in residue_constants.restype_3to1:
+                protein_tokens.append(token)
+                protein_indices.append(i)
+        if not protein_tokens:
+            raise ValueError("No protein tokens found in MolecularComplex")
+        n_residues = len(protein_tokens)
+        # Initialize atom37 arrays
+        atom37_positions = np.full((n_residues, 37, 3), np.nan, dtype=np.float32)
+        atom37_mask = np.zeros((n_residues, 37), dtype=bool)
+        # Extract confidence scores and chain_ids for protein residues only
+        protein_confidence = self.plddt[protein_indices]
+        protein_chain_ids = self.chain_id[protein_indices]
+        # Convert tokens back to single-letter sequence with chain breaks
+        single_letter_residues = []
+        prev_chain_id = None
+        for i, (token, chain_id_val) in enumerate(
+            zip(protein_tokens, protein_chain_ids)
+        ):
+            # Add chain break if we're switching to a new chain
+            if prev_chain_id is not None and chain_id_val != prev_chain_id:
+                single_letter_residues.append("|")
+            single_letter_residues.append(residue_constants.restype_3to1[token])
+            prev_chain_id = chain_id_val
+        single_letter_sequence = "".join(single_letter_residues)
+        # Calculate final sequence length (includes chain breaks)
+        sequence_length = len(single_letter_sequence)
+        # Convert flat atoms back to atom37 representation using atom names
+        for res_idx, token_idx in enumerate(protein_indices):
+            token = self.sequence[token_idx]
+            start_atom, end_atom = self.token_to_atoms[token_idx]
+            res_atom_positions = self.atom_positions[start_atom:end_atom]
+            res_atom_names = (
+                np.array(self.atom_names[start_atom:end_atom], dtype=str)
+                if self.atom_names is not None
+                else np.array([], dtype=str)
+            )
+            # Build a mapping from normalized atom name -> position for this residue
+            # Normalize to uppercase and strip whitespace for robust matching
+            name_to_pos: dict[str, np.ndarray] = {}
+            for i, nm in enumerate(res_atom_names):
+                key = nm.upper().strip()
+                # Prefer first occurrence; ignore duplicates/altlocs
+                if key not in name_to_pos:
+                    name_to_pos[key] = res_atom_positions[i]
+            # Place atoms into atom37 by matching stored atom names to atom37 indices.
+            # This handles all atoms present in the flat representation, not just
+            # the canonical residue_atoms for this residue type. This preserves
+            # atoms that were in the original atom37_mask even if they're atypical
+            # for the residue (e.g., from alternate conformations or data quirks).
+            for atom_name_str, pos in name_to_pos.items():
+                idx37 = residue_constants.atom_order.get(atom_name_str)
+                if idx37 is not None:
+                    atom37_positions[res_idx, idx37] = pos
+                    atom37_mask[res_idx, idx37] = True
+        # Create arrays that match sequence length (including chain breaks)
+        # Initialize arrays with proper size
+        chain_id_expanded = np.full(sequence_length, -1, dtype=np.int64)
+        entity_id_expanded = np.full(sequence_length, -1, dtype=np.int64)
+        sym_id_expanded = np.zeros(sequence_length, dtype=np.int64)
+        residue_index_expanded = np.zeros(sequence_length, dtype=np.int64)
+        insertion_code_expanded = np.array([""] * sequence_length, dtype=object)
+        confidence_expanded = np.zeros(sequence_length, dtype=np.float32)
+        atom37_positions_expanded = np.full(
+            (sequence_length, 37, 3), np.nan, dtype=np.float32
+        )
+        atom37_mask_expanded = np.zeros((sequence_length, 37), dtype=bool)
+        # Map residue data to sequence positions (skipping chain breaks)
+        residue_idx = 0
+        residue_counter_per_chain = {}
+        for seq_pos, char in enumerate(single_letter_sequence):
+            if char != "|":
+                # This is a residue position
+                chain_id_val = protein_chain_ids[residue_idx]
+                chain_id_expanded[seq_pos] = chain_id_val
+                entity_id_expanded[seq_pos] = chain_id_val  # Simplified mapping
+                # Track residue numbering per chain
+                if chain_id_val not in residue_counter_per_chain:
+                    residue_counter_per_chain[chain_id_val] = 1
+                else:
+                    residue_counter_per_chain[chain_id_val] += 1
+                residue_index_expanded[seq_pos] = residue_counter_per_chain[
+                    chain_id_val
+                ]
+                confidence_expanded[seq_pos] = protein_confidence[residue_idx]
+                atom37_positions_expanded[seq_pos] = atom37_positions[residue_idx]
+                atom37_mask_expanded[seq_pos] = atom37_mask[residue_idx]
+                residue_idx += 1
+            # Chain break positions keep default values (-1, False, etc.)
+        # Use the expanded arrays
+        chain_id = chain_id_expanded
+        entity_id = entity_id_expanded
+        sym_id = sym_id_expanded
+        residue_index = residue_index_expanded
+        insertion_code = insertion_code_expanded
+        protein_confidence = confidence_expanded
+        atom37_positions = atom37_positions_expanded
+        atom37_mask = atom37_mask_expanded
+        # Create protein complex metadata preserving chain information
+        # Convert MolecularComplex metadata to ProteinComplex format
+        unique_chain_ids = np.unique(protein_chain_ids)
+        entity_lookup = {int(cid): int(cid) for cid in unique_chain_ids}
+        chain_lookup = {
+            int(cid): self.metadata.chain_lookup.get(int(cid), chr(65 + int(cid)))
+            for cid in unique_chain_ids
+        }
+        protein_metadata = ProteinComplexMetadata(
+            entity_lookup=entity_lookup,
+            chain_lookup=chain_lookup,
+            assembly_composition=self.metadata.assembly_composition,
+        )
+        return ProteinComplex(
+            id=self.id,
+            sequence=single_letter_sequence,
+            entity_id=entity_id,
+            chain_id=chain_id,
+            sym_id=sym_id,
+            residue_index=residue_index,
+            insertion_code=insertion_code,
+            atom37_positions=atom37_positions,
+            atom37_mask=atom37_mask,
+            confidence=protein_confidence,
+            metadata=protein_metadata,
+        )
+    @classmethod
+    def from_mmcif(cls, inp: str, id: str | None = None) -> "MolecularComplex":
+        """Read MolecularComplex from mmcif file or string.
+        Args:
+            inp: Path to mmCIF file or mmCIF content as string
+            id: Optional identifier to assign to the complex
+        Returns:
+            MolecularComplex with all molecules (proteins, ligands, nucleic acids)
+        """
+        from io import StringIO
+        # Check if input is a file path or mmCIF string content
+        if os.path.exists(inp):
+            # Input is a file path
+            mmcif_file = pdbx.CIFFile.read(inp)
+        else:
+            # Input is mmCIF string content
+            mmcif_file = pdbx.CIFFile.read(StringIO(inp))
+        # Get structure - handle missing model information gracefully
+        try:
+            structure = pdbx.get_structure(
+                mmcif_file, model=1, extra_fields=["b_factor"]
+            )
+        except (KeyError, ValueError):
+            # Fallback for mmCIF files without model information.
+            # Without a model argument biotite returns an AtomArrayStack, so
+            # take the first model to keep the per-atom iteration below valid.
+            structure = pdbx.get_structure(mmcif_file)
+            if isinstance(structure, bs.AtomArrayStack):
+                structure = structure[0]
+        # Type hint for pyright - structure is an AtomArray which is iterable
+        if TYPE_CHECKING:
+            structure: Any = structure
+        # Read label_asym_id from the raw CIF atom_site category.
+        # Biotite's atom.chain_id uses auth_asym_id, which collapses ligands
+        # onto their parent protein chain. label_asym_id gives each entity a
+        # distinct chain identifier.
+        block = mmcif_file.block
+        label_asym_ids: list[str] | None = None
+        if "atom_site" in block:
+            atom_site = block["atom_site"]
+            if "label_asym_id" in atom_site:
+                _col = atom_site["label_asym_id"]
+                _raw = (
+                    _col.as_array(str)
+                    if hasattr(_col, "as_array")
+                    else np.array(list(_col), dtype=str)  # type: ignore[arg-type]
+                )
+                # biotite's get_structure(model=1) filters to model 1 AND
+                # removes alternate conformations. We must apply the same
+                # filters to label_asym_id to keep arrays aligned.
+                keep = np.ones(len(_raw), dtype=bool)
+                if "pdbx_PDB_model_num" in atom_site:
+                    _mc = atom_site["pdbx_PDB_model_num"]
+                    _models = (
+                        _mc.as_array(str)
+                        if hasattr(_mc, "as_array")
+                        else np.array(list(_mc), dtype=str)  # type: ignore[arg-type]
+                    )
+                    keep &= _models == "1"
+                if "label_alt_id" in atom_site:
+                    _ac = atom_site["label_alt_id"]
+                    _alts = (
+                        _ac.as_array(str)
+                        if hasattr(_ac, "as_array")
+                        else np.array(list(_ac), dtype=str)  # type: ignore[arg-type]
+                    )
+                    keep &= np.isin(_alts, [".", "?", "", "A"])
+                filtered = _raw[keep]
+                if len(filtered) == len(structure):
+                    label_asym_ids = filtered.tolist()
+                # If lengths still don't match, fall back to atom.chain_id
+        # Get entity information from mmCIF
+        entity_info = {}
+        try:
+            if "entity" in block:
+                entity_category = block["entity"]
+                if "id" in entity_category and "type" in entity_category:
+                    # Read the columns with as_array(), as the other mmCIF
+                    # parsing code does
+                    entity_ids_list = entity_category["id"].as_array(str).tolist()
+                    entity_types_list = entity_category["type"].as_array(str).tolist()
+                    for eid, etype in zip(entity_ids_list, entity_types_list):
+                        entity_info[eid] = etype
+        except Exception:
+            pass
+        # Initialize arrays for flat atom representation
+        sequence_tokens = []
+        flat_positions = []
+        flat_elements = []
+        flat_names = []
+        flat_hetero = []
+        token_to_atoms = []
+        confidence_scores = []
+        chain_ids = []  # Track chain IDs for each token
+        atom_idx = 0
+        # Group atoms by chain and residue.
+        # Use label_asym_id (distinct per entity) when available, otherwise
+        # fall back to biotite's chain_id (auth_asym_id).
+        chain_residue_groups: dict[str, dict[tuple[int, str], dict]] = {}
+        for atom_i, atom in enumerate(structure):
+            chain_id = (
+                label_asym_ids[atom_i] if label_asym_ids is not None else atom.chain_id
+            )
+            res_id = atom.res_id
+            res_name = atom.res_name
+            if chain_id not in chain_residue_groups:
+                chain_residue_groups[chain_id] = {}
+            # Key by (res_id, res_name) to distinguish residues that share
+            # the same res_id but have different res_name (e.g. a protein
+            # residue and a ligand that were on the same auth chain).
+            res_key = (res_id, res_name)
+            if res_key not in chain_residue_groups[chain_id]:
+                chain_residue_groups[chain_id][res_key] = {
+                    "atoms": [],
+                    "res_name": res_name,
+                    "is_hetero": atom.hetero,
+                }
+            chain_residue_groups[chain_id][res_key]["atoms"].append(atom)
+        # Create a mapping from chain_id to numeric indices
+        chain_id_to_numeric = {
+            chain_id: idx
+            for idx, chain_id in enumerate(sorted(chain_residue_groups.keys()))
+        }
+        # Process each chain and residue
+        for chain_id in sorted(chain_residue_groups.keys()):
+            residues = chain_residue_groups[chain_id]
+            numeric_chain_id = chain_id_to_numeric[chain_id]
+            for res_key in sorted(residues.keys()):
+                residue_data = residues[res_key]
+                res_name = residue_data["res_name"]
+                atoms = residue_data["atoms"]
+                is_hetero = residue_data["is_hetero"]
+                # Skip water molecules
+                if res_name == "HOH":
+                    continue
+                # Determine token name. Standard amino acids, nucleotides, and
+                # ligands/other molecules all use the residue name as the token.
+                token_name = res_name
+                sequence_tokens.append(token_name)
+                chain_ids.append(
+                    numeric_chain_id
+                )  # Store the numeric chain ID for this token
+                token_start = atom_idx
+                # Add all atoms from this residue
+                for atom in atoms:
+                    flat_positions.append(atom.coord)
+                    # Get element character
+                    element = atom.element
+                    flat_elements.append(element)
+                    # Get atom name
+                    atom_name = atom.atom_name
+                    flat_names.append(atom_name)
+                    # Get hetero flag
+                    hetero_flag = atom.hetero
+                    flat_hetero.append(hetero_flag)
+                    atom_idx += 1
+                # Record token-to-atom mapping
+                token_to_atoms.append([token_start, atom_idx])
+                # Add confidence score (B-factor / 100, capped at 1.0; defaults to 0.5 when no B-factor is available)
+                bfactor = getattr(atoms[0], "b_factor", 50.0) if atoms else 50.0
+                confidence_scores.append(min(bfactor / 100.0, 1.0))
+        # Convert to numpy arrays
+        if not flat_positions:
+            # Create minimal arrays if no atoms found
+            atom_positions = np.zeros((0, 3), dtype=np.float32)
+            atom_elements = np.zeros(0, dtype=object)
+            atom_names = np.zeros(0, dtype=object)
+            atom_hetero = np.zeros(0, dtype=bool)
+            token_to_atoms_array = np.zeros((len(sequence_tokens), 2), dtype=np.int32)
+            chain_id_array = (
+                np.array(chain_ids, dtype=np.int64)
+                if chain_ids
+                else np.zeros(len(sequence_tokens), dtype=np.int64)
+            )
+        else:
+            atom_positions = np.array(flat_positions, dtype=np.float32)
+            atom_elements = np.array(flat_elements, dtype=object)
+            atom_names = np.array(flat_names, dtype=object)
+            atom_hetero = np.array(flat_hetero, dtype=bool)
+            token_to_atoms_array = np.array(token_to_atoms, dtype=np.int32)
+            chain_id_array = np.array(chain_ids, dtype=np.int64)
+        confidence_array = np.array(confidence_scores, dtype=np.float32)
+        # Create metadata using the chain_id_to_numeric mapping
+        if chain_residue_groups:
+            chain_lookup = {
+                numeric_id: chain_id
+                for chain_id, numeric_id in chain_id_to_numeric.items()
+            }
+        else:
+            chain_lookup = {}
+        metadata = MolecularComplexMetadata(
+            entity_lookup=entity_info,
+            chain_lookup=chain_lookup,
+            assembly_composition=None,
+        )
+        # Set complex ID - if input was a path, use the stem; otherwise use default
+        if os.path.exists(inp):
+            complex_id = id or Path(inp).stem
+        else:
+            complex_id = id or "complex_from_string"
+        return cls(
+            id=complex_id,
+            sequence=sequence_tokens,
+            atom_positions=atom_positions,
+            atom_elements=atom_elements,
+            token_to_atoms=token_to_atoms_array,
+            chain_id=chain_id_array,
+            plddt=confidence_array,
+            metadata=metadata,
+            atom_names=atom_names,
+            atom_hetero=atom_hetero,
+        )
+    def _get_entity_mapping(
+        self,
+    ) -> tuple[dict[str, list[str]], dict[str, int], dict[int, tuple[str, ...]]]:
+        """Compute chain→sequence, chain→entity_id, and entity_id→sequence mappings.
+        Returns:
+            (chain_sequences, chain_to_entity, entity_sequences)
+        """
+        chain_sequences: dict[str, list[str]] = {}
+        for token_idx in range(len(self.token_to_atoms)):
+            chain_id_numeric = self.chain_id[token_idx]
+            chain_id_str = self.metadata.chain_lookup.get(
+                int(chain_id_numeric), chr(65 + int(chain_id_numeric))
+            )
+            if chain_id_str not in chain_sequences:
+                chain_sequences[chain_id_str] = []
+            chain_sequences[chain_id_str].append(self.sequence[token_idx])
+        sequence_to_entity: dict[tuple[str, ...], int] = {}
+        chain_to_entity: dict[str, int] = {}
+        entity_sequences: dict[int, tuple[str, ...]] = {}
+        entity_id_counter = 1
+        for chain_id_str, sequence in chain_sequences.items():
+            seq_tuple = tuple(sequence)
+            if seq_tuple not in sequence_to_entity:
+                sequence_to_entity[seq_tuple] = entity_id_counter
+                entity_sequences[entity_id_counter] = seq_tuple
+                entity_id_counter += 1
+            chain_to_entity[chain_id_str] = sequence_to_entity[seq_tuple]
+        return chain_sequences, chain_to_entity, entity_sequences
+    def _add_entity_information(
+        self, cif_file: CIFFile, entity_sequences: dict[int, tuple[str, ...]]
+    ) -> None:
+        """Add _entity category to CIF file so OST can identify ligands vs polymers."""
+        entity_ids: list[str] = []
+        entity_types: list[str] = []
+        entity_descriptions: list[str] = []
+        for eid in sorted(entity_sequences.keys()):
+            seq = entity_sequences[eid]
+            entity_ids.append(str(eid))
+            has_protein = any(t in residue_constants.restype_3to1 for t in seq)
+            has_na = any(
+                t in ("A", "T", "G", "C", "U", "DA", "DT", "DG", "DC") for t in seq
+            )
+            if has_protein or has_na:
+                entity_types.append("polymer")
+                if has_protein:
+                    entity_descriptions.append(f"Polymer entity {eid} (protein)")
+                else:
+                    entity_descriptions.append(f"Polymer entity {eid} (nucleic acid)")
+            else:
+                entity_types.append("non-polymer")
+                entity_descriptions.append(f"Non-polymer entity {eid}")
+        if entity_ids:
+            cif_file.block["entity"] = CIFCategory(
+                name="entity",
+                columns={
+                    "id": CIFColumn(
+                        data=CIFData(array=np.array(entity_ids), dtype=np.str_)
+                    ),
+                    "type": CIFColumn(
+                        data=CIFData(array=np.array(entity_types), dtype=np.str_)
+                    ),
+                    "pdbx_description": CIFColumn(
+                        data=CIFData(array=np.array(entity_descriptions), dtype=np.str_)
+                    ),
+                },
+            )
+        # Add _struct_asym to map chain IDs to entity IDs
+        _, chain_to_entity, _ = self._get_entity_mapping()
+        if chain_to_entity:
+            asym_ids = sorted(chain_to_entity.keys())
+            asym_entity_ids = [str(chain_to_entity[c]) for c in asym_ids]
+            cif_file.block["struct_asym"] = CIFCategory(
+                name="struct_asym",
+                columns={
+                    "id": CIFColumn(
+                        data=CIFData(array=np.array(asym_ids), dtype=np.str_)
+                    ),
+                    "entity_id": CIFColumn(
+                        data=CIFData(array=np.array(asym_entity_ids), dtype=np.str_)
+                    ),
+                },
+            )
+    def to_mmcif(self) -> str:
+        """Write MolecularComplex to mmcif string using biotite.
+        Returns:
+            String representation of the complex in mmCIF format
+        """
+        # Pre-allocate AtomArray
+        n_atoms = len(self.atom_positions)
+        atom_array = bs.AtomArray(length=n_atoms)
+        # Set coordinates directly (already vectorized)
+        atom_array.coord = self.atom_positions
+        # Pre-allocate per-atom arrays
+        atom_res_ids = np.zeros(n_atoms, dtype=np.int32)
+        atom_chain_ids = np.empty(n_atoms, dtype=object)
+        atom_res_names = np.empty(n_atoms, dtype=object)
+        atom_hetero = np.zeros(n_atoms, dtype=bool)
+        atom_bfactors = np.zeros(n_atoms, dtype=np.float32)
+        atom_names = np.empty(n_atoms, dtype=object)
+        # Build entity mappings: chains with identical sequences share entity ID
+        _, chain_to_entity, entity_sequences = self._get_entity_mapping()
+        atom_entity_ids = np.zeros(n_atoms, dtype=np.int32)
+        # Track residue IDs per chain
+        chain_res_counters: dict[int, int] = {}
+        # Expand token-level annotations to atom level (one slice assignment per token)
+        for token_idx, (start, end) in enumerate(self.token_to_atoms):
+            token = self.sequence[token_idx]
+            chain_id_numeric = self.chain_id[token_idx]
+            chain_id_str = self.metadata.chain_lookup.get(
+                int(chain_id_numeric), chr(65 + int(chain_id_numeric))
+            )
+            # Track residue numbering per chain
+            if chain_id_numeric not in chain_res_counters:
+                chain_res_counters[chain_id_numeric] = 1
+            res_id = chain_res_counters[chain_id_numeric]
+            chain_res_counters[chain_id_numeric] += 1
+            # Determine if protein
+            is_protein = token in residue_constants.restype_3to1
+            # Get atom names for this residue
+            if self.atom_names is not None:
+                # Use stored atom names (preserves original names from mmCIF)
+                names = list(self.atom_names[start:end])
+            elif is_protein:
+                # Fallback: use standard protein atom names
+                standard_names = residue_constants.residue_atoms.get(
+                    token, ["N", "CA", "C", "O"]
+                )
+                names = standard_names[: end - start]
+                # Pad if needed
+                while len(names) < (end - start):
+                    names.append(f"X{len(names)+1}")
+            else:
+                # Fallback: generate names for ligands/nucleic acids
+                names = [f"C{i+1}" for i in range(end - start)]
+            # Vectorized assignment for this token's atoms
+            atom_res_ids[start:end] = res_id
+            atom_chain_ids[start:end] = chain_id_str
+            atom_res_names[start:end] = token
+            # Use stored hetero flags if available, otherwise guess based on protein status
+            if self.atom_hetero is not None:
+                atom_hetero[start:end] = self.atom_hetero[start:end]
+            else:
+                atom_hetero[start:end] = not is_protein
+            atom_bfactors[start:end] = self.plddt[token_idx] * 100.0
+            atom_names[start:end] = names
+            atom_entity_ids[start:end] = chain_to_entity.get(chain_id_str, 1)
+        # Set all AtomArray attributes at once (convert object arrays to proper string arrays)
+        # res_name uses U8 to accommodate CCD codes up to 5 characters (e.g., A1AZ2);
+        # chain_id uses U16 because chain names like ``ligand_1`` / ``ligand_2`` /
+        # auth-asym ids of arbitrary length are possible.
+        atom_array.res_id = atom_res_ids
+        atom_array.chain_id = np.array(atom_chain_ids, dtype="U16")
+        atom_array.res_name = np.array(atom_res_names, dtype="U8")
+        atom_array.hetero = atom_hetero
+        atom_array.atom_name = np.array(atom_names, dtype="U4")
+        atom_array.add_annotation("b_factor", dtype=float)
+        atom_array.b_factor = atom_bfactors
+        atom_array.add_annotation("entity_id", dtype=int)
+        atom_array.entity_id = atom_entity_ids
+        # Use existing elements or infer them from atom names
+        if self.atom_elements is not None and len(self.atom_elements) == n_atoms:
+            # Convert object array to proper string array for biotite
+            atom_array.element = np.array(self.atom_elements, dtype="U4")
+        else:
+            # Use biotite's built-in element inference
+            atom_array.element = bs.infer_elements(atom_array)
+        # Create CIF file and set structure
+        cif_file = CIFFile()
+        set_structure(cif_file, atom_array, data_block=self.id)
+        # Manually fix label_entity_id (biotite doesn't use entity_id annotation correctly)
+        if "atom_site" in cif_file.block:
+            atom_site = cif_file.block["atom_site"]
+            if "label_asym_id" in atom_site and "label_entity_id" in atom_site:
+                label_asym_ids = atom_site["label_asym_id"]
+                if hasattr(label_asym_ids, "as_array"):
+                    chain_ids_list = label_asym_ids.as_array(str).tolist()
+                elif hasattr(label_asym_ids, "__iter__"):
+                    chain_ids_list = list(label_asym_ids)  # type: ignore[arg-type]
+                else:
+                    chain_ids_list = []
+                updated_entity_ids = [
+                    str(chain_to_entity.get(cid, 1)) for cid in chain_ids_list
+                ]
+                if updated_entity_ids:
+                    atom_site["label_entity_id"] = CIFColumn(
+                        data=CIFData(array=np.array(updated_entity_ids), dtype=np.str_)
+                    )
+        # Add _entity category for OST compatibility
+        self._add_entity_information(cif_file, entity_sequences)
+        # Convert to string
+        output = io.StringIO()
+        cif_file.write(output)
+        return output.getvalue()
+    def dockq(self, native: "MolecularComplex") -> Any:
+        """Compute DockQ score against native structure.
+        Args:
+            native: Native MolecularComplex to compute DockQ against
+        Returns:
+            DockQ result containing score and alignment information
+        """
+        # Imports moved to top of file
+        # Convert both complexes to ProteinComplex format for DockQ computation
+        # This extracts only the protein portion and converts to PDB format
+        try:
+            self_pc = self.to_protein_complex()
+            native_pc = native.to_protein_complex()
+        except ValueError as e:
+            raise ValueError(
+                f"Cannot convert MolecularComplex to ProteinComplex for DockQ: {e}"
+            ) from e
+        # Normalize chain IDs for PDB compatibility
+        self_pc = self_pc.normalize_chain_ids_for_pdb()
+        native_pc = native_pc.normalize_chain_ids_for_pdb()
+        # Use the existing ProteinComplex.dockq() method
+        try:
+            dockq_result = self_pc.dockq(native_pc)
+            return dockq_result
+        except Exception:
+            # Fallback to manual DockQ computation if ProteinComplex.dockq() fails
+            return self._compute_dockq_manual(native)
+    def _compute_dockq_manual(self, native: "MolecularComplex") -> Any:
+        """Manual DockQ computation fallback."""
+        # Imports moved to top of file
+        # Convert both complexes to ProteinComplex format
+        try:
+            self_pc = self.to_protein_complex()
+            native_pc = native.to_protein_complex()
+        except ValueError as e:
+            raise ValueError(
+                f"Cannot convert MolecularComplex to ProteinComplex for DockQ: {e}"
+            ) from e
+        # Normalize chain IDs for PDB compatibility
+        self_pc = self_pc.normalize_chain_ids_for_pdb()
+        native_pc = native_pc.normalize_chain_ids_for_pdb()
+        # Write temporary PDB files and run DockQ
+        with TemporaryDirectory() as tdir:
+            dir_path = Path(tdir)
+            self_pdb = dir_path / "self.pdb"
+            native_pdb = dir_path / "native.pdb"
+            # Write PDB files
+            self_pc.to_pdb(self_pdb)
+            native_pc.to_pdb(native_pdb)
+            # Run DockQ
+            try:
+                output = check_output(["DockQ", str(self_pdb), str(native_pdb)])
+                output_text = output.decode()
+                # Parse DockQ output
+                lines = output_text.split("\n")
+                # Find the total DockQ score
+                dockq_score = None
+                for line in lines:
+                    if "Total DockQ" in line:
+                        match = re.search(r"Total DockQ.*: ([\d.]+)", line)
+                        if match:
+                            dockq_score = float(match.group(1))
+                            break
+                if dockq_score is None:
+                    # Try to find individual DockQ scores
+                    for line in lines:
+                        if line.startswith("DockQ") and ":" in line:
+                            try:
+                                dockq_score = float(line.split(":")[1].strip())
+                                break
+                            except (ValueError, IndexError):
+                                continue
+                if dockq_score is None:
+                    raise ValueError("Could not parse DockQ score from output")
+                # Return a simple result structure
+                return {
+                    "total_dockq": dockq_score,
+                    "raw_output": output_text,
+                    "aligned": self,  # Return self as aligned structure
+                }
+            except FileNotFoundError as e:
+                raise RuntimeError(
+                    "DockQ is not installed. Please install DockQ to use this method."
+                ) from e
+            except Exception as e:
+                raise RuntimeError(f"DockQ computation failed: {e}") from e
+    def rmsd(self, target: "MolecularComplex", **kwargs) -> float:
+        """Compute RMSD against target structure.
+        Args:
+            target: Target MolecularComplex to compute RMSD against
+            **kwargs: Additional arguments passed to compute_rmsd
+        Returns:
+            float: RMSD value between the two structures
+        """
+        # Imports moved to top of file
+        # Ensure both complexes have the same number of tokens
+        if len(self) != len(target):
+            raise ValueError(
+                f"Complexes must have the same number of tokens: {len(self)} vs {len(target)}"
+            )
+        # Extract center positions for each token (using centroid of atoms)
+        mobile_coords = []
+        target_coords = []
+        atom_mask = []
+        for i in range(len(self)):
+            # Get atom positions for this token
+            mobile_start, mobile_end = self.token_to_atoms[i]
+            target_start, target_end = target.token_to_atoms[i]
+            # Extract atom positions
+            mobile_atoms = self.atom_positions[mobile_start:mobile_end]
+            target_atoms = target.atom_positions[target_start:target_end]
+            # Check if both tokens have atoms
+            if len(mobile_atoms) == 0 or len(target_atoms) == 0:
+                # Skip tokens with no atoms
+                continue
+            # For simplicity, use the centroid of atoms as the representative position
+            mobile_center = mobile_atoms.mean(axis=0)
+            target_center = target_atoms.mean(axis=0)
+            mobile_coords.append(mobile_center)
+            target_coords.append(target_center)
+            atom_mask.append(True)
+        if len(mobile_coords) == 0:
+            raise ValueError("No valid atoms found for RMSD computation")
+        # Convert to tensors
+        mobile_tensor = torch.from_numpy(np.stack(mobile_coords, axis=0)).unsqueeze(
+            0
+        )  # [1, N, 3]
+        target_tensor = torch.from_numpy(np.stack(target_coords, axis=0)).unsqueeze(
+            0
+        )  # [1, N, 3]
+        mask_tensor = torch.tensor(atom_mask, dtype=torch.bool).unsqueeze(0)  # [1, N]
+        # Compute RMSD using existing infrastructure
+        rmsd_value = compute_rmsd(
+            mobile=mobile_tensor,
+            target=target_tensor,
+            atom_exists_mask=mask_tensor,
+            reduction="batch",
+            **kwargs,
+        )
+        return float(rmsd_value)
+    def lddt_ca(self, target: "MolecularComplex", **kwargs) -> float:
+        """Compute LDDT against target structure using per-token atom centroids (not CA atoms).
+        Args:
+            target: Target MolecularComplex to compute LDDT against
+            **kwargs: Additional arguments passed to compute_lddt
+        Returns:
+            float: LDDT value between the two structures
+        """
+        # Imports moved to top of file
+        # Ensure both complexes have the same number of tokens
+        if len(self) != len(target):
+            raise ValueError(
+                f"Complexes must have the same number of tokens: {len(self)} vs {len(target)}"
+            )
+        # Extract center positions for each token (using centroid of atoms)
+        mobile_coords = []
+        target_coords = []
+        atom_mask = []
+        for i in range(len(self)):
+            # Get atom positions for this token
+            mobile_start, mobile_end = self.token_to_atoms[i]
+            target_start, target_end = target.token_to_atoms[i]
+            # Extract atom positions
+            mobile_atoms = self.atom_positions[mobile_start:mobile_end]
+            target_atoms = target.atom_positions[target_start:target_end]
+            # Check if both tokens have atoms
+            if len(mobile_atoms) == 0 or len(target_atoms) == 0:
+                # Mask out tokens with no atoms; NaN placeholders keep token indices aligned
+                mobile_coords.append(np.full(3, np.nan))
+                target_coords.append(np.full(3, np.nan))
+                atom_mask.append(False)
+                continue
+            # For simplicity, use the centroid of atoms as the representative position
+            mobile_center = mobile_atoms.mean(axis=0)
+            target_center = target_atoms.mean(axis=0)
+            mobile_coords.append(mobile_center)
+            target_coords.append(target_center)
+            atom_mask.append(True)
+        if not any(atom_mask):
+            raise ValueError("No valid atoms found for LDDT computation")
+        # Convert to tensors
+        mobile_tensor = torch.from_numpy(np.stack(mobile_coords, axis=0)).unsqueeze(
+            0
+        )  # [1, N, 3]
+        target_tensor = torch.from_numpy(np.stack(target_coords, axis=0)).unsqueeze(
+            0
+        )  # [1, N, 3]
+        mask_tensor = torch.tensor(atom_mask, dtype=torch.bool).unsqueeze(0)  # [1, N]
+        # Compute LDDT using existing infrastructure
+        lddt_value = compute_lddt(
+            all_atom_pred_pos=mobile_tensor,
+            all_atom_positions=target_tensor,
+            all_atom_mask=mask_tensor,
+            per_residue=False,  # Return overall LDDT score
+            **kwargs,
+        )
+        return float(lddt_value)
+    def state_dict(self):
+        """Return a state dict optimized for storage: float arrays are cast to fp16, int64
+        arrays to int32, and all numpy arrays are converted to lists for serialization.
+        """
+        dct = {k: v for k, v in vars(self).items()}
+        for k, v in dct.items():
+            if isinstance(v, np.ndarray):
+                match v.dtype:
+                    case np.int64:
+                        dct[k] = v.astype(np.int32).tolist()
+                    case np.float64 | np.float32:
+                        dct[k] = v.astype(np.float16).tolist()
+                    case _:
+                        dct[k] = v.tolist()
+            elif isinstance(v, MolecularComplexMetadata):
+                dct[k] = asdict(v)
+        return dct
+    def to_blob(self) -> bytes:
+        return brotli.compress(msgpack.dumps(self.state_dict()), quality=5)
+    @classmethod
+    def from_state_dict(cls, dct):
+        for k, v in dct.items():
+            if isinstance(v, list) and k in [
+                "atom_positions",
+                "atom_elements",
+                "atom_names",
+                "atom_hetero",
+                "token_to_atoms",
+                "chain_id",
+                "plddt",
+            ]:
+                dct[k] = np.array(v)
+        for k, v in dct.items():
+            if isinstance(v, np.ndarray):
+                if k in ["atom_positions", "plddt"]:
+                    dct[k] = v.astype(np.float32)
+                elif k in ["token_to_atoms", "chain_id"]:
+                    dct[k] = (
+                        v.astype(np.int32)
+                        if k == "token_to_atoms"
+                        else v.astype(np.int64)
+                    )
+        dct["metadata"] = MolecularComplexMetadata(**dct["metadata"])
+        # Backward compatibility: if chain_id is missing, create default array
+        if "chain_id" not in dct:
+            # Default all tokens to chain 0
+            dct["chain_id"] = np.zeros(len(dct["sequence"]), dtype=np.int64)
+        return cls(**dct)
+    @classmethod
+    def from_blob(cls, input: Path | str | io.BytesIO | bytes):
+        match input:
+            case Path() | str():
+                data = Path(input).read_bytes()
+            case io.BytesIO():
+                data = input.getvalue()
+            case _:
+                data = input
+        return cls.from_state_dict(
+            msgpack.loads(brotli.decompress(data), strict_map_key=False)
+        )
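
The entity assignment in `_get_entity_mapping` above gives every chain with an identical token sequence the same 1-based entity ID, in order of first appearance. A minimal standalone sketch of that grouping rule follows. The helper name `assign_entities` and the example chains are hypothetical, written for illustration only, and are not part of the committed file.

```python
def assign_entities(chain_sequences: dict[str, list[str]]) -> dict[str, int]:
    """Map each chain ID to an entity ID shared by chains with identical sequences.

    Hypothetical helper: mirrors the grouping rule only, not the full method.
    """
    sequence_to_entity: dict[tuple[str, ...], int] = {}
    chain_to_entity: dict[str, int] = {}
    for chain_id, tokens in chain_sequences.items():
        key = tuple(tokens)  # lists are unhashable, so key on a tuple
        # A sequence seen for the first time gets the next 1-based entity ID;
        # a repeated sequence reuses the ID that was already assigned.
        entity_id = sequence_to_entity.setdefault(key, len(sequence_to_entity) + 1)
        chain_to_entity[chain_id] = entity_id
    return chain_to_entity


# Illustrative input: two identical protein chains and one ligand chain.
chains = {
    "A": ["MET", "ALA", "GLY"],
    "B": ["MET", "ALA", "GLY"],
    "C": ["HEM"],
}
print(assign_entities(chains))
```

Because IDs follow dictionary insertion order, the numbering depends on the order in which chains are first encountered, not on the chain names themselves.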

esmfold2_msa.py ADDED

	@@ -0,0 +1,507 @@

+from __future__ import annotations
+import dataclasses
+import string
+from dataclasses import dataclass
+from functools import cached_property
+from itertools import islice
+from typing import Sequence
+import numpy as np
+from Bio import SeqIO
+from scipy.spatial.distance import cdist
+from .esmfold2_misc import slice_any_object
+from .esmfold2_msa_filter_sequences import greedy_select_indices, hhfilter
+from .esmfold2_parsing import FastaEntry, read_sequences, write_sequences
+from .esmfold2_sequential_dataclass import SequentialDataclass
+from .esmfold2_system import PathOrBuffer
+REMOVE_LOWERCASE_TRANSLATION = str.maketrans(dict.fromkeys(string.ascii_lowercase))
+def remove_insertions_from_sequence(seq: str) -> str:
+    return seq.translate(REMOVE_LOWERCASE_TRANSLATION)
+@dataclass(frozen=True)
+class MSA(SequentialDataclass):
+    """Object-oriented interface to an MSA.
+    Args:
+        entries (list[FastaEntry]): Header/sequence pairs, with the query sequence first
+    """
+    entries: list[FastaEntry]
+    @cached_property
+    def sequences(self) -> list[str]:
+        return [entry.sequence for entry in self.entries]
+    @cached_property
+    def headers(self) -> list[str]:
+        return [entry.header for entry in self.entries]
+    def __repr__(self):
+        return (
+            f"MSA({self.entries[0].header}: Depth={self.depth}, Length={self.seqlen})"
+        )
+    def to_fast_msa(self) -> FastMSA:
+        return FastMSA(self.array, self.headers)
+    @classmethod
+    def from_a3m(
+        cls,
+        path: PathOrBuffer,
+        remove_insertions: bool = True,
+        max_sequences: int | None = None,
+    ) -> MSA:
+        entries = []
+        for header, seq in islice(read_sequences(path), max_sequences):
+            if remove_insertions:
+                seq = remove_insertions_from_sequence(seq)
+            if entries:
+                assert (
+                    len(seq) == len(entries[0].sequence)
+                ), f"Sequence length mismatch. Expected: {len(entries[0].sequence)}, Received: {len(seq)}"
+            entries.append(FastaEntry(header, seq))
+        return cls(entries)
+    def to_a3m(self, path: PathOrBuffer) -> None:
+        write_sequences(self.entries, path)
+    @classmethod
+    def from_stockholm(
+        cls,
+        path: PathOrBuffer,
+        remove_insertions: bool = True,
+        max_sequences: int | None = None,
+    ) -> MSA:
+        entries = []
+        for record in islice(SeqIO.parse(path, "stockholm"), max_sequences):
+            header = f"{record.id} {record.description}"
+            seq = str(record.seq)
+            if entries:
+                assert (
+                    len(seq) == len(entries[0].sequence)
+                ), f"Sequence length mismatch. Expected: {len(entries[0].sequence)}, Received: {len(seq)}"
+            entries.append(FastaEntry(header, seq))
+        msa = cls(entries)
+        if remove_insertions:
+            keep_inds = [i for i, aa in enumerate(msa.query) if aa != "-"]
+            msa = msa.select_positions(keep_inds)
+        return msa
+    def to_bytes(self) -> bytes:
+        version = 1
+        version_bytes = version.to_bytes(1, "little")
+        seqlen_bytes = self.seqlen.to_bytes(4, "little")
+        depth_bytes = self.depth.to_bytes(4, "little")
+        array_bytes = self.array.tobytes()
+        header_bytes = "\n".join(entry.header for entry in self.entries).encode()
+        all_bytes = (
+            version_bytes + seqlen_bytes + depth_bytes + array_bytes + header_bytes
+        )
+        return all_bytes
+    @classmethod
+    def from_bytes(cls, data: bytes) -> MSA:
+        version_bytes, seqlen_bytes, depth_bytes, data = (
+            data[:1],
+            data[1:5],
+            data[5:9],
+            data[9:],
+        )
+        version = int.from_bytes(version_bytes, "little")
+        if version != 1:
+            raise ValueError(f"Unsupported version: {version}")
+        seqlen = int.from_bytes(seqlen_bytes, "little")
+        depth = int.from_bytes(depth_bytes, "little")
+        array_bytes, header_bytes = data[: seqlen * depth], data[seqlen * depth :]
+        array = np.frombuffer(array_bytes, dtype="|S1")
+        array = array.reshape(depth, seqlen)
+        headers = header_bytes.decode().split("\n")
+        # Sometimes the separation is two newlines, which results in an empty header.
+        headers = [header for header in headers if header]
+        # If all headers were empty (e.g., saved from from_sequences), use empty headers
+        if len(headers) == 0 and depth > 0:
+            headers = [""] * depth
+        entries = [
+            FastaEntry(header, b"".join(row).decode())
+            for header, row in zip(headers, array)
+        ]
+        return cls(entries)
+    # TODO(jmaccarl): set remove_insertions to True by default here to match other utils
+    @classmethod
+    def from_sequences(
+        cls, sequences: list[str], remove_insertions: bool = False
+    ) -> MSA:
+        if remove_insertions:
+            entries = [
+                FastaEntry("", remove_insertions_from_sequence(seq))
+                for seq in sequences
+            ]
+        else:
+            entries = [FastaEntry("", seq) for seq in sequences]
+        return cls(entries)
+    def to_sequence_bytes(self) -> bytes:
+        """Stores ONLY SEQUENCES in array format as bytes. Header information will be lost."""
+        seqlen_bytes = self.seqlen.to_bytes(4, "little")
+        array_bytes = self.array.tobytes()
+        all_bytes = seqlen_bytes + array_bytes
+        return all_bytes
+    @classmethod
+    def from_sequence_bytes(cls, data: bytes) -> MSA:
+        seqlen_bytes, array_bytes = data[:4], data[4:]
+        seqlen = int.from_bytes(seqlen_bytes, "little")
+        array = np.frombuffer(array_bytes, dtype="|S1")
+        array = array.reshape(-1, seqlen)
+        entries = [FastaEntry("", b"".join(row).decode()) for row in array]
+        return cls(entries)
+    @property
+    def depth(self) -> int:
+        return len(self.entries)
+    @property
+    def seqlen(self) -> int:
+        return len(self.entries[0].sequence)
+    @cached_property
+    def array(self) -> np.ndarray:
+        return np.array([list(seq) for seq in self.sequences], dtype="|S1")
+    @property
+    def query(self) -> str:
+        return self.entries[0].sequence
+    def select_sequences(self, indices: Sequence[int] | np.ndarray) -> MSA:
+        """Subselect rows of the MSA."""
+        entries = [self.entries[idx] for idx in indices]
+        return dataclasses.replace(self, entries=entries)
+    def select_positions(self, indices: Sequence[int] | np.ndarray) -> MSA:
+        """Subselect columns of the MSA."""
+        entries = [
+            FastaEntry(header, "".join(seq[idx] for idx in indices))
+            for header, seq in self.entries
+        ]
+        return dataclasses.replace(self, entries=entries)
+    def __getitem__(self, indices: int | list[int] | slice | np.ndarray):
+        if isinstance(indices, int):
+            indices = [indices]
+        entries = [
+            FastaEntry(header, slice_any_object(seq, indices))
+            for header, seq in self.entries
+        ]
+        return dataclasses.replace(self, entries=entries)
+    def __len__(self):
+        return self.seqlen
+    def greedy_select(self, num_seqs: int, mode: str = "max") -> MSA:
+        """Greedily select sequences that either maximize or minimize hamming distance.
+        Algorithm proposed in the MSA Transformer paper. Starting from the query sequence,
+        iteratively add sequences to the list with the maximum (minimum) average Hamming
+        distance to the existing set of sequences.
+        Args:
+            num_seqs (int): Number of sequences to select.
+            mode (str): Whether to maximize or minimize diversity. DO NOT pick 'min' unless
+                you're doing it to prove a point for a paper.
+        Returns:
+            MSA object w/ subselected sequences.
+        """
+        assert mode in ("max", "min")
+        if self.depth <= num_seqs:
+            return self
+        indices = greedy_select_indices(self.array, num_seqs, mode)
+        return self.select_sequences(indices)
+    def hhfilter(
+        self,
+        seqid: int = 90,
+        diff: int = 0,
+        cov: int = 0,
+        qid: int = 0,
+        qsc: float = -20.0,
+        binary: str = "hhfilter",
+    ) -> MSA:
+        """Apply hhfilter to the sequences in the MSA and return a filtered MSA."""
+        indices = hhfilter(
+            self.sequences,
+            seqid=seqid,
+            diff=diff,
+            cov=cov,
+            qid=qid,
+            qsc=qsc,
+            binary=binary,
+        )
+        return self.select_sequences(indices)
+    def select_random_sequences(self, num_seqs: int) -> MSA:
+        """Uses random sampling to subselect sequences from the MSA. Always
+        keeps the query sequence.
+        """
+        if num_seqs >= self.depth:
+            return self
+        # Subselect random, always keeping the query sequence.
+        indices = np.sort(
+            np.append(
+                0, np.random.choice(self.depth - 1, num_seqs - 1, replace=False) + 1
+            )
+        )
+        msa = self.select_sequences(indices)  # type: ignore
+        return msa
+    def select_diverse_sequences(self, num_seqs: int) -> MSA:
+        """Applies hhfilter to select ~num_seqs sequences, then uses random sampling
+        to subselect if necessary.
+        """
+        if num_seqs >= self.depth:
+            return self
+        msa = self.hhfilter(diff=num_seqs)
+        if num_seqs < msa.depth:
+            msa = msa.select_random_sequences(num_seqs)
+        return msa
+    def pad_to_depth(self, depth: int) -> MSA:
+        if depth < self.depth:
+            raise ValueError(f"Cannot pad to depth {depth} when depth is {self.depth}")
+        elif depth == self.depth:
+            return self
+        num_to_add = depth - self.depth
+        extra_entries = [FastaEntry("", "-" * self.seqlen) for _ in range(num_to_add)]
+        return dataclasses.replace(self, entries=self.entries + extra_entries)
+    @classmethod
+    def stack(
+        cls, msas: Sequence[MSA], remove_query_from_later_msas: bool = True
+    ) -> MSA:
+        """Stack a series of MSAs. Optionally remove the query from msas after the first."""
+        all_entries = []
+        for i, msa in enumerate(msas):
+            entries = msa.entries
+            if i > 0 and remove_query_from_later_msas:
+                entries = entries[1:]
+            all_entries.extend(entries)
+        return cls(entries=all_entries)
+    @cached_property
+    def seqid(self) -> np.ndarray:
+        array = self.array.view(np.uint8)
+        seqid = 1 - cdist(array[0][None], array, "hamming")
+        return seqid[0]
+    @classmethod
+    def concat(
+        cls,
+        msas: Sequence[MSA],
+        join_token: str | None = "|",
+        allow_depth_mismatch: bool = False,
+    ) -> MSA:
+        """Concatenate a series of MSAs horizontally, along the sequence dimension."""
+        if not msas:
+            raise ValueError("Cannot concatenate an empty list of MSAs")
+        msa_depths = [msa.depth for msa in msas]
+        if len(set(msa_depths)) != 1:
+            if not allow_depth_mismatch:
+                raise ValueError("Depth mismatch in concatenating MSAs")
+            else:
+                max_depth = max(msa_depths)
+                msas = [msa.pad_to_depth(max_depth) for msa in msas]
+        headers = [
+            "|".join([str(h) for h in headers])
+            for headers in zip(*(msa.headers for msa in msas))
+        ]
+        if join_token is None:
+            join_token = ""
+        seqs = [join_token.join(vals) for vals in zip(*(msa.sequences for msa in msas))]
+        entries = [FastaEntry(header, seq) for header, seq in zip(headers, seqs)]
+        return cls(entries)
+@dataclass(frozen=True)
+class FastMSA(SequentialDataclass):
+    """Object-oriented interface to an MSA stored as a numpy uint8 array."""
+    array: np.ndarray
+    headers: list[str] | None = None
+    def __post_init__(self):
+        if self.headers is not None:
+            assert (
+                len(self.headers) == self.depth
+            ), "Number of headers must match depth."
+    @classmethod
+    def from_bytes(cls, data: bytes) -> FastMSA:
+        version_bytes, seqlen_bytes, depth_bytes, data = (
+            data[:1],
+            data[1:5],
+            data[5:9],
+            data[9:],
+        )
+        version = int.from_bytes(version_bytes, "little")
+        if version != 1:
+            raise ValueError(f"Unsupported version: {version}")
+        seqlen = int.from_bytes(seqlen_bytes, "little")
+        depth = int.from_bytes(depth_bytes, "little")
+        array_bytes, header_bytes = data[: seqlen * depth], data[seqlen * depth :]
+        array = np.frombuffer(array_bytes, dtype="|S1")
+        array = array.reshape(depth, seqlen)
+        headers = header_bytes.decode().split("\n")
+        # Sometimes the separation is two newlines, which results in an empty header.
+        headers = [header for header in headers if header]
+        # If all headers were empty (e.g., saved from from_sequences), use empty headers
+        if len(headers) == 0 and depth > 0:
+            headers = [""] * depth
+        return cls(array, headers)
+    @classmethod
+    def from_sequence_bytes(cls, data: bytes) -> FastMSA:
+        seqlen_bytes, array_bytes = data[:4], data[4:]
+        seqlen = int.from_bytes(seqlen_bytes, "little")
+        array = np.frombuffer(array_bytes, dtype="|S1")
+        array = array.reshape(-1, seqlen)
+        return cls(array)
+    @property
+    def depth(self) -> int:
+        return self.array.shape[0]
+    @property
+    def seqlen(self) -> int:
+        return self.array.shape[1]
+    def __len__(self):
+        return self.seqlen
+    def __getitem__(self, indices: int | list[int] | slice | np.ndarray):
+        if isinstance(indices, int):
+            indices = [indices]
+        return dataclasses.replace(self, array=self.array[:, indices])
+    def select_sequences(self, indices: Sequence[int] | np.ndarray) -> FastMSA:
+        """Subselect rows of the MSA."""
+        array = self.array[indices]
+        headers = (
+            [self.headers[idx] for idx in indices] if self.headers is not None else None
+        )
+        return dataclasses.replace(self, array=array, headers=headers)
+    def select_random_sequences(self, num_seqs: int) -> FastMSA:
+        """Uses random sampling to subselect sequences from the MSA. Always
+        keeps the query sequence.
+        """
+        if num_seqs >= self.depth:
+            return self
+        # Subselect random, always keeping the query sequence.
+        indices = np.sort(
+            np.append(
+                0, np.random.choice(self.depth - 1, num_seqs - 1, replace=False) + 1
+            )
+        )
+        msa = self.select_sequences(indices)  # type: ignore
+        return msa
+    def pad_to_depth(self, depth: int) -> FastMSA:
+        if depth < self.depth:
+            raise ValueError(f"Cannot pad to depth {depth} when depth is {self.depth}")
+        elif depth == self.depth:
+            return self
+        num_to_add = depth - self.depth
+        array = np.pad(
+            self.array,
+            [(0, num_to_add), (0, 0)],
+            constant_values=ord("-") if self.array.dtype == np.uint8 else b"-",
+        )
+        headers = self.headers
+        if headers is not None:
+            headers = headers + [""] * num_to_add
+        return dataclasses.replace(self, array=array, headers=headers)
+    @classmethod
+    def concat(
+        cls,
+        msas: Sequence[FastMSA],
+        join_token: str | None = None,
+        allow_depth_mismatch: bool = False,
+    ) -> FastMSA:
+        """Concatenate a series of MSAs horizontally, along the sequence dimension."""
+        if not msas:
+            raise ValueError("Cannot concatenate an empty list of MSAs")
+        if join_token is not None and join_token != "":
+            raise NotImplementedError("join_token is not supported for FastMSA")
+        msa_depths = [msa.depth for msa in msas]
+        if len(set(msa_depths)) != 1:
+            if not allow_depth_mismatch:
+                raise ValueError("Depth mismatch in concatenating MSAs")
+            else:
+                max_depth = max(msa_depths)
+                msas = [msa.pad_to_depth(max_depth) for msa in msas]
+        headers = [
+            "|".join([str(h) for h in headers])
+            for headers in zip(
+                *(
+                    msa.headers if msa.headers is not None else [""] * msa.depth
+                    for msa in msas
+                )
+            )
+        ]
+        array = np.concatenate([msa.array for msa in msas], axis=1)
+        return cls(array, headers)
+    def to_msa(self) -> MSA:
+        headers = (
+            self.headers
+            if self.headers is not None
+            else [f"seq{i}" for i in range(self.depth)]
+        )
+        entries = [
+            FastaEntry(header, b"".join(row).decode())
+            for header, row in zip(headers, self.array)
+        ]
+        return MSA(entries)
+    @classmethod
+    def stack(
+        cls, msas: Sequence[FastMSA], remove_query_from_later_msas: bool = True
+    ) -> FastMSA:
+        """Stack a series of MSAs. Optionally remove the query from msas after the first."""
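+        arrays = []
+        all_headers: list[str] = []
+        any_headers = any(msa.headers is not None for msa in msas)
+        for i, msa in enumerate(msas):
+            array = msa.array
+            # Substitute empty headers so the header count always matches depth.
+            headers = msa.headers if msa.headers is not None else [""] * msa.depth
+            if i > 0 and remove_query_from_later_msas:
+                array = array[1:]
+                headers = headers[1:]
+            arrays.append(array)
+            all_headers.extend(headers)
+        # Keep headers=None when no input MSA carried headers.
+        return cls(
+            np.concatenate(arrays, axis=0), all_headers if any_headers else None
+        )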
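
Review note on the byte layout used by `MSA.to_bytes` / `from_bytes` above: one version byte, a 4-byte little-endian `seqlen`, a 4-byte little-endian `depth`, the `depth * seqlen` sequence characters, then the newline-joined headers. The sketch below reproduces that layout with the standard library only. `pack_msa` and `unpack_msa` are hypothetical names used for illustration and are not part of this commit.

```python
def pack_msa(headers: list[str], sequences: list[str], version: int = 1) -> bytes:
    """Serialize aligned sequences using the layout described above (illustrative)."""
    seqlen = len(sequences[0]) if sequences else 0
    if any(len(seq) != seqlen for seq in sequences):
        raise ValueError("All sequences must have the same aligned length")
    body = "".join(sequences).encode("ascii")  # row-major, one byte per column
    header_blob = "\n".join(headers).encode()
    return (
        version.to_bytes(1, "little")
        + seqlen.to_bytes(4, "little")
        + len(sequences).to_bytes(4, "little")
        + body
        + header_blob
    )


def unpack_msa(data: bytes) -> tuple[list[str], list[str]]:
    """Inverse of pack_msa; mirrors the empty-header handling in from_bytes."""
    version = data[0]
    if version != 1:
        raise ValueError(f"Unsupported version: {version}")
    seqlen = int.from_bytes(data[1:5], "little")
    depth = int.from_bytes(data[5:9], "little")
    end = 9 + seqlen * depth
    body, header_blob = data[9:end], data[end:]
    sequences = [
        body[i * seqlen : (i + 1) * seqlen].decode("ascii") for i in range(depth)
    ]
    headers = [h for h in header_blob.decode().split("\n") if h]
    if not headers and depth > 0:
        headers = [""] * depth  # headerless MSAs round-trip with empty headers
    return headers, sequences


blob = pack_msa(["query", "hit1"], ["ACD-", "AC-E"])
headers, sequences = unpack_msa(blob)
```

Because empty headers are filtered out on read, an MSA with a mix of empty and non-empty headers would not round-trip one-to-one under this layout; only the all-empty case is restored.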

esmfold2_msa_filter_sequences.py ADDED

	@@ -0,0 +1,83 @@

+import os
+import tempfile
+from pathlib import Path
+import numpy as np
+from scipy.spatial.distance import cdist
+from .esmfold2_system import run_subprocess_with_errorcheck
+def greedy_select_indices(array, num_seqs: int, mode: str = "max") -> list[int]:
+    """Greedily select sequences that either maximize or minimize hamming distance.
+    Algorithm proposed in the MSA Transformer paper. Starting from the query sequence,
+    iteratively add sequences to the list with the maximum (minimum) average Hamming
+    distance to the existing set of sequences.
+    Args:
+        array (np.ndarray): Character array representing the sequences in the MSA
+        num_seqs (int): Number of sequences to select.
+        mode (str): Whether to maximize or minimize diversity. DO NOT pick 'min' unless
+            you're doing it to prove a point for a paper.
+    Returns:
+        list[int]: List of indices to select from the array
+    """
+    assert mode in ("max", "min")
+    depth = array.shape[0]
+    if depth <= num_seqs:
+        return list(range(depth))
+    array = array.view(np.uint8)
+    optfunc = np.argmax if mode == "max" else np.argmin
+    all_indices = np.arange(depth)
+    indices = [0]
+    pairwise_distances = np.zeros((0, depth))
+    for _ in range(num_seqs - 1):
+        dist = cdist(array[indices[-1:]], array, "hamming")
+        pairwise_distances = np.concatenate([pairwise_distances, dist])
+        shifted_distance = np.delete(pairwise_distances, indices, axis=1).mean(0)
+        shifted_index = optfunc(shifted_distance)
+        index = np.delete(all_indices, indices)[shifted_index]
+        indices.append(index)
+    indices = sorted(indices)
+    return indices
+def hhfilter(
+    sequences: list[str],
+    seqid: int = 90,
+    diff: int = 0,
+    cov: int = 0,
+    qid: int = 0,
+    qsc: float = -20.0,
+    binary: str = "hhfilter",
+) -> list[int]:
+    with tempfile.TemporaryDirectory(
+        dir="/dev/shm" if os.path.exists("/dev/shm") else None
+    ) as tempdirname:
+        tempdir = Path(tempdirname)
+        fasta_file = tempdir / "input.fasta"
+        fasta_file.write_text(
+            "\n".join(f">{i}\n{seq}" for i, seq in enumerate(sequences))
+        )
+        output_file = tempdir / "output.fasta"
+        # Build argv as a list so paths containing spaces are not split apart.
+        command = [
+            binary,
+            "-i",
+            str(fasta_file),
+            "-M",
+            "a3m",
+            "-o",
+            str(output_file),
+            "-id",
+            str(seqid),
+            "-diff",
+            str(diff),
+            "-cov",
+            str(cov),
+            "-qid",
+            str(qid),
+            "-qsc",
+            str(qsc),
+        ]
+        run_subprocess_with_errorcheck(command, capture_output=True)
+        with output_file.open() as f:
+            indices = [int(line[1:].strip()) for line in f if line.startswith(">")]
+        return indices
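
Review note on `greedy_select_indices` above: the selection rule can be illustrated without numpy or scipy. The sketch below is a minimal pure-Python rendering of the same idea (start from the query at index 0, then repeatedly add the remaining sequence whose average normalized Hamming distance to the chosen set is largest, or smallest for `mode="min"`). `greedy_select_sketch` and `hamming` are hypothetical helper names, and this version recomputes distances each round instead of caching them as the committed code does.

```python
def hamming(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_select_sketch(seqs: list[str], num_seqs: int, mode: str = "max") -> list[int]:
    """Greedy diversity selection; returns sorted row indices (illustrative)."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    if len(seqs) <= num_seqs:
        return list(range(len(seqs)))
    pick = max if mode == "max" else min
    chosen = [0]  # the query row is always kept
    while len(chosen) < num_seqs:
        remaining = [i for i in range(len(seqs)) if i not in chosen]

        def avg_dist(i: int) -> float:
            # Mean distance from candidate i to every already-selected row.
            return sum(hamming(seqs[i], seqs[j]) for j in chosen) / len(chosen)

        chosen.append(pick(remaining, key=avg_dist))
    return sorted(chosen)


rows = ["AAAA", "AAAT", "TTTT", "AATT"]
most_diverse = greedy_select_sketch(rows, 2, mode="max")
least_diverse = greedy_select_sketch(rows, 2, mode="min")
```

With these four rows, the `max` mode pairs the query with the all-`T` row and the `min` mode pairs it with its single-mismatch neighbor.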

esmfold2_normalize_coordinates.py ADDED

	@@ -0,0 +1,80 @@

+from typing import TypeVar
+import numpy as np
+import torch
+from torch import Tensor
+from . import esmfold2_residue_constants as RC
+from .esmfold2_affine3d import Affine3D
+ArrayOrTensor = TypeVar("ArrayOrTensor", np.ndarray, Tensor)
+def atom3_to_backbone_frames(bb_positions: torch.Tensor) -> Affine3D:
+    N, CA, C = bb_positions.unbind(dim=-2)
+    return Affine3D.from_graham_schmidt(C, CA, N)
+def index_by_atom_name(
+    atom37: ArrayOrTensor, atom_names: str | list[str], dim: int = -2
+) -> ArrayOrTensor:
+    squeeze = False
+    if isinstance(atom_names, str):
+        atom_names = [atom_names]
+        squeeze = True
+    indices = [RC.atom_order[atom_name] for atom_name in atom_names]
+    dim = dim % atom37.ndim
+    index = tuple(slice(None) if dim != i else indices for i in range(atom37.ndim))
+    result = atom37[index]  # type: ignore
+    if squeeze:
+        result = result.squeeze(dim)
+    return result
+def get_protein_normalization_frame(coords: Tensor) -> Affine3D:
+    """Given a set of coordinates for a protein, compute a single frame that can be used to normalize the coordinates.
+    Specifically, we compute the average position of the N, CA, and C atoms and use those 3 points to construct
+    a frame using the Gram-Schmidt algorithm. The average CA position is used as the origin of the frame.
+    Args:
+        coords (torch.FloatTensor): [L, 37, 3] tensor of coordinates
+    Returns:
+        Affine3D: tensor of Affine3D frame
+    """
+    bb_coords = index_by_atom_name(coords, ["N", "CA", "C"], dim=-2)
+    coord_mask = torch.all(torch.all(torch.isfinite(bb_coords), dim=-1), dim=-1)
+    average_position_per_n_ca_c = bb_coords.masked_fill(
+        ~coord_mask[..., None, None], 0
+    ).sum(-3) / (coord_mask.sum(-1)[..., None, None] + 1e-8)
+    frame = atom3_to_backbone_frames(average_position_per_n_ca_c.float())
+    return frame
+def apply_frame_to_coords(coords: Tensor, frame: Affine3D) -> Tensor:
+    """Given a set of coordinates and a single frame, apply the frame to the coordinates.
+    Args:
+        coords (torch.FloatTensor): [L, 37, 3] tensor of coordinates
+        frame (Affine3D): Affine3D frame
+    Returns:
+        torch.FloatTensor: [L, 37, 3] tensor of transformed coordinates
+    """
+    coords_trans_rot = frame[..., None, None].invert().apply(coords)
+    # Only transform coordinates whose frame is valid; validity is checked via a nonzero translation.
+    valid_frame = frame.trans.norm(dim=-1) > 0
+    is_inf = torch.isinf(coords)
+    coords = coords_trans_rot.where(valid_frame[..., None, None, None], coords)
+    coords.masked_fill_(is_inf, torch.inf)
+    return coords
+def normalize_coordinates(coords: Tensor) -> Tensor:
+    return apply_frame_to_coords(coords, get_protein_normalization_frame(coords))

esmfold2_output.py ADDED

	@@ -0,0 +1,225 @@

+from itertools import groupby
+from typing import Any
+import numpy as np
+import torch
+from .esmfold2_constants import ELEMENT_NUMBER_TO_SYMBOL, MOL_TYPE_NONPOLYMER
+from .esmfold2_molecular_complex import (
+    MolecularComplex,
+    MolecularComplexMetadata,
+)
+def get_element_symbol(atomic_num: int) -> str:
+    return ELEMENT_NUMBER_TO_SYMBOL.get(atomic_num, "X")
+def build_molecular_complex_from_features(
+    coords: torch.Tensor,
+    plddt: torch.Tensor,
+    atom_mask: torch.Tensor,
+    ref_element: torch.Tensor,
+    ref_atom_name_chars: torch.Tensor,
+    chain_infos: list,
+    complex_id: str,
+) -> MolecularComplex:
+    """Construct a MolecularComplex from feature-dict tensors and chain metadata.
+    Non-polymer chains (ligands) collapse all per-atom tokens into a single
+    residue token whose pLDDT is the per-token average and whose hetero flag
+    is True.
+    """
+    mask_np = atom_mask.bool().cpu().numpy()
+    coords_np = coords.float().cpu().numpy()
+    name_chars_np = ref_atom_name_chars.cpu().numpy()
+    elements_np = ref_element.cpu().numpy()
+    plddt_np = plddt.float().cpu().numpy()
+    sequence_tokens: list[str] = []
+    chain_ids_per_token: list[int] = []
+    token_to_atoms: list[list[int]] = []
+    confidence: list[float] = []
+    flat_positions: list[list[float]] = []
+    flat_elements: list[str] = []
+    flat_names: list[str] = []
+    flat_hetero: list[bool] = []
+    chain_lookup: dict[int, str] = {}
+    entity_info: dict[int, str] = {}
+    out_atom_cursor = 0
+    for ci in chain_infos:
+        chain_lookup[ci.asym_id] = ci.chain_id
+        is_nonpolymer = ci.mol_type == MOL_TYPE_NONPOLYMER
+        entity_info[ci.entity_id] = "non-polymer" if is_nonpolymer else "polymer"
+        if is_nonpolymer:
+            residue_name = ci.tokens[0].residue_name if ci.tokens else "LIG"
+            sequence_tokens.append(residue_name)
+            chain_ids_per_token.append(ci.asym_id)
+            avg_plddt = (
+                float(np.mean([plddt_np[ti.token_index] for ti in ci.tokens]))
+                if ci.tokens
+                else 0.0
+            )
+            confidence.append(avg_plddt)
+            token_atom_start = out_atom_cursor
+            for ti in ci.tokens:
+                for atom_idx in range(ti.atom_start, ti.atom_start + ti.atom_count):
+                    if not mask_np[atom_idx]:
+                        continue
+                    flat_positions.append(coords_np[atom_idx].tolist())
+                    flat_elements.append(get_element_symbol(int(elements_np[atom_idx])))
+                    chars = name_chars_np[atom_idx]
+                    name = "".join(
+                        chr(int(c) + 32) for c in chars if int(c) != 0
+                    ).strip()
+                    flat_names.append(name)
+                    flat_hetero.append(True)
+                    out_atom_cursor += 1
+            token_to_atoms.append([token_atom_start, out_atom_cursor])
+            continue
+        # Atom-tokenized modified residues (HYP, MSE, ...) span multiple
+        # tokens per residue; collapse them back to one mmCIF residue.
+        for _residue_index, ti_iter in groupby(
+            ci.tokens, key=lambda t: t.residue_index
+        ):
+            ti_group = list(ti_iter)
+            sequence_tokens.append(ti_group[0].residue_name)
+            chain_ids_per_token.append(ci.asym_id)
+            confidence.append(
+                float(np.mean([plddt_np[ti.token_index] for ti in ti_group]))
+            )
+            token_atom_start = out_atom_cursor
+            for ti in ti_group:
+                for atom_idx in range(ti.atom_start, ti.atom_start + ti.atom_count):
+                    if not mask_np[atom_idx]:
+                        continue
+                    flat_positions.append(coords_np[atom_idx].tolist())
+                    flat_elements.append(get_element_symbol(int(elements_np[atom_idx])))
+                    chars = name_chars_np[atom_idx]
+                    name = "".join(
+                        chr(int(c) + 32) for c in chars if int(c) != 0
+                    ).strip()
+                    flat_names.append(name)
+                    flat_hetero.append(False)
+                    out_atom_cursor += 1
+            token_to_atoms.append([token_atom_start, out_atom_cursor])
+    return MolecularComplex(
+        id=complex_id,
+        sequence=sequence_tokens,
+        atom_positions=np.array(flat_positions, dtype=np.float32).reshape(-1, 3),
+        atom_elements=np.array(flat_elements, dtype=object),
+        token_to_atoms=np.array(token_to_atoms, dtype=np.int32).reshape(-1, 2),
+        chain_id=np.array(chain_ids_per_token, dtype=np.int64),
+        plddt=np.array(confidence, dtype=np.float32),
+        atom_names=np.array(flat_names, dtype=object),
+        atom_hetero=np.array(flat_hetero, dtype=bool),
+        metadata=MolecularComplexMetadata(
+            entity_lookup=entity_info,
+            chain_lookup=chain_lookup,
+            assembly_composition=None,
+        ),
+    )
+def build_molecular_complex(
+    structure: Any, coords: torch.Tensor, plddt: torch.Tensor, complex_id: str
+) -> MolecularComplex:
+    """Directly constructs a MolecularComplex from model outputs without intermediate files.
+    Args:
+        structure: Object with .chains, .residues, .atoms numpy structured arrays.
+        coords: [N_atoms, 3] predicted atom coordinates.
+        plddt: [N_residues] per-residue confidence scores.
+        complex_id: Identifier string for the resulting complex.
+    """
+    flat_positions = []
+    flat_elements = []
+    flat_names = []
+    flat_hetero = []
+    sequence_tokens = []
+    token_to_atoms = []
+    chain_ids_per_token = []
+    confidence_scores = []
+    chain_lookup = {}
+    entity_info = {}
+    global_atom_cursor = 0
+    global_res_cursor = 0
+    atom_array_idx = 0
+    for chain in structure.chains:
+        chain_idx_numeric = chain["asym_id"]
+        chain_name_str = str(chain["name"])
+        mol_type = chain["mol_type"]
+        chain_lookup[chain_idx_numeric] = chain_name_str
+        entity_info[chain["entity_id"]] = (
+            "polymer" if mol_type != MOL_TYPE_NONPOLYMER else "non-polymer"
+        )
+        res_start = chain["res_idx"]
+        res_end = chain["res_idx"] + chain["res_num"]
+        residues = structure.residues[res_start:res_end]
+        for residue in residues:
+            res_name = str(residue["name"])
+            sequence_tokens.append(res_name)
+            chain_ids_per_token.append(chain_idx_numeric)
+            score = plddt[global_res_cursor].item()
+            confidence_scores.append(score)
+            token_start_idx = atom_array_idx
+            atom_start = residue["atom_idx"]
+            atom_end = residue["atom_idx"] + residue["atom_num"]
+            atoms = structure.atoms[atom_start:atom_end]
+            for atom in atoms:
+                if not atom["is_present"]:
+                    continue
+                pos = coords[global_atom_cursor].tolist()
+                flat_positions.append(pos)
+                elem = get_element_symbol(atom["element"].item())
+                flat_elements.append(elem)
+                raw_name = atom["name"]
+                if hasattr(raw_name, "tolist"):
+                    raw_name = raw_name.tolist()
+                name_str = "".join([chr(c + 32) for c in raw_name if c != 0])
+                flat_names.append(name_str)
+                flat_hetero.append(mol_type == MOL_TYPE_NONPOLYMER)
+                global_atom_cursor += 1
+                atom_array_idx += 1
+            token_to_atoms.append([token_start_idx, atom_array_idx])
+            global_res_cursor += 1
+    return MolecularComplex(
+        id=complex_id,
+        sequence=sequence_tokens,
+        atom_positions=np.array(flat_positions, dtype=np.float32),
+        atom_elements=np.array(flat_elements, dtype=object),
+        token_to_atoms=np.array(token_to_atoms, dtype=np.int32),
+        chain_id=np.array(chain_ids_per_token, dtype=np.int64),
+        plddt=np.array(confidence_scores, dtype=np.float32),
+        atom_names=np.array(flat_names, dtype=object),
+        atom_hetero=np.array(flat_hetero, dtype=bool),
+        metadata=MolecularComplexMetadata(
+            entity_lookup=entity_info,
+            chain_lookup=chain_lookup,
+            assembly_composition=None,
+        ),
+    )
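
Review note on the atom-name decoding used in both builders above: each name is stored as a small integer array where `0` means padding and every other value is the character code minus 32, so decoding is `chr(c + 32)` over the nonzero entries. The sketch below shows the decode step alongside an inverse. The encoder and the fixed width of 4 are assumptions inferred from the decoder and are not taken from this commit; `encode_atom_name` and `decode_atom_name` are hypothetical names.

```python
def decode_atom_name(codes: list[int]) -> str:
    """Decode offset-32 character codes, skipping zero padding."""
    return "".join(chr(int(c) + 32) for c in codes if int(c) != 0).strip()


def encode_atom_name(name: str, width: int = 4) -> list[int]:
    """Assumed inverse: offset each character by -32 and zero-pad to `width`."""
    if len(name) > width:
        raise ValueError(f"Atom name {name!r} is longer than {width} characters")
    codes = [ord(ch) - 32 for ch in name]
    return codes + [0] * (width - len(codes))


ca_codes = encode_atom_name("CA")
ca_name = decode_atom_name(ca_codes)
```

One consequence of this scheme is that a space character encodes to `0` and is therefore indistinguishable from padding, which is harmless here because decoded names are stripped anyway in `build_molecular_complex_from_features`.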

esmfold2_paired_msa.py ADDED

	@@ -0,0 +1,246 @@

+"""Taxonomy-paired MSA construction for ESMFold2 inference.
+Taxonomy IDs are read from FASTA headers as ``key=N`` tokens. Rows
+where any chain has ``key=-1`` (or no ``key=`` at all) are treated as
+unpaired and assigned to that chain's block-diagonal section after
+the paired rows.
+"""
+import re
+import numpy as np
+from .esmfold2_constants import (
+    MSA_GAP_TOKEN_ID,
+    PROTEIN_3TO1,
+    PROTEIN_RESIDUE_TO_RES_TYPE,
+    PROTEIN_UNK_RES_TYPE,
+)
+from .esmfold2_msa import MSA
+_KEY_RE = re.compile(r"key=(-?\d+)")
+def protein_letter_to_res_type() -> dict[str, int]:
+    """Return the protein 1-letter → res_type mapping used by the MSA encoder."""
+    mapping: dict[str, int] = {}
+    for three, one in PROTEIN_3TO1.items():
+        if three in PROTEIN_RESIDUE_TO_RES_TYPE:
+            mapping[one] = PROTEIN_RESIDUE_TO_RES_TYPE[three]
+    mapping["-"] = MSA_GAP_TOKEN_ID
+    mapping["X"] = PROTEIN_UNK_RES_TYPE
+    return mapping
+def _taxonomy_from_header(header: str) -> int:
+    if not header:
+        return -1
+    m = _KEY_RE.search(header)
+    return int(m.group(1)) if m else -1
+def msa_to_res_type_and_deletions(
+    msa: MSA, letter_to_res_type: dict[str, int]
+) -> tuple[np.ndarray, np.ndarray]:
+    """Convert an :class:`MSA` to ``(res_type[M, L], deletion_count[M, L])``.
+    Handles a3m insertion convention: lowercase letters and ``.`` are
+    insertions and are not emitted; their count is accumulated into the
+    next non-insertion position's deletion value. ``L`` is the query
+    length after stripping insertions from row 0.
+    """
+    query = msa.entries[0].sequence
+    L = sum(1 for ch in query if not (ch.islower() or ch == "."))
+    M = msa.depth
+    res_type = np.full((M, L), MSA_GAP_TOKEN_ID, dtype=np.int64)
+    deletions = np.zeros((M, L), dtype=np.float32)
+    for r, entry in enumerate(msa.entries):
+        col = 0
+        ins = 0
+        for ch in entry.sequence:
+            if ch == "." or (ch.islower() and ch != "-"):
+                ins += 1
+                continue
+            if col >= L:
+                break
+            if ch == "-":
+                res_type[r, col] = MSA_GAP_TOKEN_ID
+            else:
+                res_type[r, col] = letter_to_res_type.get(
+                    ch.upper(), PROTEIN_UNK_RES_TYPE
+                )
+            if ins > 0:
+                deletions[r, col] = float(ins)
+                ins = 0
+            col += 1
+    return res_type, deletions
+def _dummy_msa_residues(query_res_types: np.ndarray) -> np.ndarray:
+    """Single-row 'MSA' for chains without one — just the query."""
+    return query_res_types[None, :]  # [1, L]
+def construct_paired_msa(
+    chain_msas: dict[int, MSA | None],
+    chain_query_res_types: dict[int, np.ndarray],
+    token_asym_ids: np.ndarray,
+    token_res_ids: np.ndarray,
+    letter_to_res_type: dict[str, int] | None = None,
+    *,
+    max_pairs: int = 8192,
+    max_total: int = 16384,
+    max_seqs: int = 16384,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """Build paired MSA features.
+    Parameters
+    ----------
+    chain_msas
+        ``asym_id -> MSA`` (or ``None`` for chains without an MSA).
+    chain_query_res_types
+        ``asym_id -> np.ndarray[L_c]`` of res-type ids for the chain's
+        query. Used to build dummy MSAs when a chain has no MSA.
+    token_asym_ids
+        Per-token asym_id, length ``T``. Must be non-decreasing.
+    token_res_ids
+        Per-token residue index within chain, length ``T``.
+    letter_to_res_type
+        1-letter → res-type mapping. Defaults to
+        :func:`protein_letter_to_res_type`.
+    Returns
+    -------
+    msa_residues : ``np.ndarray[M, T]`` int64
+    deletion_value : ``np.ndarray[M, T]`` float32 (raw deletion counts; the
+        ``arctan(count / 3) * pi/2`` transform is applied by the caller)
+    is_paired : ``np.ndarray[M, T]`` float32 broadcast of per-row,
+        per-chain paired flags.
+    """
+    if letter_to_res_type is None:
+        letter_to_res_type = protein_letter_to_res_type()
+    chain_ids: list[int] = sorted(chain_msas.keys())
+    # Build per-chain (res_type, deletions, taxonomy) tables.
+    chain_res_type: dict[int, np.ndarray] = {}
+    chain_deletions: dict[int, np.ndarray] = {}
+    chain_taxonomies: dict[int, list[int]] = {}
+    for c in chain_ids:
+        m = chain_msas.get(c)
+        if m is None or m.depth == 0:
+            qres = chain_query_res_types[c]
+            chain_res_type[c] = _dummy_msa_residues(qres)
+            chain_deletions[c] = np.zeros((1, qres.shape[0]), dtype=np.float32)
+            chain_taxonomies[c] = [-1]
+            continue
+        rt, dl = msa_to_res_type_and_deletions(m, letter_to_res_type)
+        chain_res_type[c] = rt
+        chain_deletions[c] = dl
+        chain_taxonomies[c] = [_taxonomy_from_header(e.header) for e in m.entries]
+    # Group by taxonomy, skip query row and unpaired (-1) entries.
+    taxonomy_map: dict[int, list[tuple[int, int]]] = {}
+    for c in chain_ids:
+        for seq_idx, taxon in enumerate(chain_taxonomies[c]):
+            if seq_idx == 0 or taxon == -1:
+                continue
+            taxonomy_map.setdefault(taxon, []).append((c, seq_idx))
+    taxonomy_map = {k: v for k, v in taxonomy_map.items() if len(v) > 1}
+    # Order taxonomies by number of distinct chains, descending.
+    sorted_taxa = sorted(
+        taxonomy_map.items(), key=lambda kv: len({c for c, _ in kv[1]}), reverse=True
+    )
+    visited = {s for _, items in taxonomy_map.items() for s in items}
+    available: dict[int, list[int]] = {
+        c: [i for i in range(1, len(chain_taxonomies[c])) if (c, i) not in visited]
+        for c in chain_ids
+    }
+    pairing: list[dict[int, int]] = [{c: 0 for c in chain_ids}]
+    is_paired: list[dict[int, int]] = [{c: 1 for c in chain_ids}]
+    for _, pairs in sorted_taxa:
+        per_chain: dict[int, list[int]] = {}
+        for c, seq_idx in pairs:
+            per_chain.setdefault(c, []).append(seq_idx)
+        max_occ = max(len(v) for v in per_chain.values())
+        for i in range(max_occ):
+            row_pairing: dict[int, int] = {}
+            row_is_paired: dict[int, int] = {}
+            for c, seq_idxs in per_chain.items():
+                row_pairing[c] = seq_idxs[i % len(seq_idxs)]
+                row_is_paired[c] = 1
+            for c in chain_ids:
+                if c in row_pairing:
+                    continue
+                row_is_paired[c] = 0
+                if available[c]:
+                    row_pairing[c] = available[c].pop(0)
+                else:
+                    row_pairing[c] = -1
+            pairing.append(row_pairing)
+            is_paired.append(row_is_paired)
+            if len(pairing) >= max_pairs:
+                break
+        if len(pairing) >= max_pairs:
+            break
+    max_left = max((len(v) for v in available.values()), default=0)
+    for _ in range(min(max_total - len(pairing), max_left)):
+        row_pairing = {}
+        row_is_paired = {}
+        for c in chain_ids:
+            row_is_paired[c] = 0
+            if available[c]:
+                row_pairing[c] = available[c].pop(0)
+            else:
+                row_pairing[c] = -1
+        pairing.append(row_pairing)
+        is_paired.append(row_is_paired)
+        if len(pairing) >= max_total:
+            break
+    pairing = pairing[:max_seqs]
+    is_paired = is_paired[:max_seqs]
+    M = len(pairing)
+    T = len(token_asym_ids)
+    msa_residues = np.full((M, T), MSA_GAP_TOKEN_ID, dtype=np.int64)
+    deletion_value = np.zeros((M, T), dtype=np.float32)
+    paired_mask = np.zeros((M, T), dtype=np.float32)
+    # Vectorize per chain: gather chain rows according to pairing[c], then
+    # index into them by the chain's token residue ids.
+    for c in chain_ids:
+        rt = chain_res_type[c]
+        dl = chain_deletions[c]
+        Lc = rt.shape[1]
+        chain_pairing = np.array([row[c] for row in pairing], dtype=np.int64)
+        chain_paired = np.array([row[c] for row in is_paired], dtype=np.float32)
+        token_mask = token_asym_ids == c
+        if not token_mask.any():
+            continue
+        token_res_in_chain = token_res_ids[token_mask]
+        # Clamp residue indices to the MSA's column range. Modified-residue
+        # tokens that exceed the query length fall back to the last column.
+        cols = np.minimum(token_res_in_chain, Lc - 1)
+        # Rows where pairing == -1 fall back to gap (already initialized).
+        valid_rows = chain_pairing >= 0
+        if valid_rows.any():
+            gathered_rt = rt[chain_pairing[valid_rows]][:, cols]
+            gathered_dl = dl[chain_pairing[valid_rows]][:, cols]
+            valid_idx = np.where(valid_rows)[0]
+            token_idx = np.where(token_mask)[0]
+            msa_residues[np.ix_(valid_idx, token_idx)] = gathered_rt
+            deletion_value[np.ix_(valid_idx, token_idx)] = gathered_dl
+        paired_mask[:, token_mask] = chain_paired[:, None]
+    return msa_residues, deletion_value, paired_mask
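
The taxonomy grouping that `construct_paired_msa` performs can be illustrated with a much smaller sketch. The helper name `pair_rows_by_taxonomy` and the toy taxonomy ids are hypothetical, and the sketch simplifies the real function: it keeps only taxa seen in at least two chains, takes the first hit per chain, and does not backfill unpaired rows or apply any row caps.

```python
from collections import defaultdict


def pair_rows_by_taxonomy(chain_taxa: dict[int, list[int]]) -> list[dict[int, int]]:
    """Map each output row to one MSA row index per chain (-1 means gap).

    chain_taxa maps chain id -> taxonomy id of every MSA row; row 0 is the
    query and -1 marks a row without a usable taxonomy id.
    """
    chain_ids = sorted(chain_taxa)
    by_taxon: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for c in chain_ids:
        for row, taxon in enumerate(chain_taxa[c]):
            if row == 0 or taxon == -1:
                continue  # skip the query row and unlabelled rows
            by_taxon[taxon].append((c, row))

    # Simplification: keep taxa that occur in at least two distinct chains.
    shared = {
        t: hits for t, hits in by_taxon.items() if len({c for c, _ in hits}) > 1
    }
    # Taxa covering more chains come first; ties are broken by taxonomy id.
    ordered = sorted(
        shared.items(), key=lambda kv: (-len({c for c, _ in kv[1]}), kv[0])
    )

    rows = [{c: 0 for c in chain_ids}]  # row 0 pairs every chain's query
    for _, hits in ordered:
        first_hit: dict[int, int] = {}
        for c, row in hits:
            first_hit.setdefault(c, row)
        rows.append({c: first_hit.get(c, -1) for c in chain_ids})
    return rows


chain_taxa = {
    0: [-1, 9606, 10090, 562],
    1: [-1, 10090, 9606],
    2: [-1, 9606],
}
rows = pair_rows_by_taxonomy(chain_taxa)
```

In this toy input, taxon 9606 appears in all three chains and is paired first, taxon 10090 covers two chains and leaves chain 2 as a gap, and taxon 562 is dropped because only one chain has it.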

esmfold2_parsing.py ADDED

	@@ -0,0 +1,113 @@

+import io
+from pathlib import Path
+from typing import Generator, Iterable, NamedTuple
+PathOrBuffer = str | Path | io.TextIOBase
+FastaEntry = NamedTuple("FastaEntry", [("header", str), ("sequence", str)])
+def parse_fasta(fasta_string: str) -> Generator[FastaEntry, None, None]:
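+    """
+    Parse FASTA-formatted text and yield one FastaEntry per record.
+    Blank lines and lines starting with "#" are skipped.
+    Args:
+        fasta_string: The contents of a FASTA file as a string
+    Yields:
+        FastaEntry(header, sequence) tuples, in file order
+    Raises:
+        ValueError: If the input contains no sequences
+    """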
+    header = None
+    seq = []
+    num_sequences = 0
+    for line in fasta_string.splitlines():
+        if not line or line[0] == "#":
+            continue
+        if line.startswith(">"):
+            if header is not None:
+                # Count every record, not only the final one.
+                num_sequences += 1
+                yield FastaEntry(header, "".join(seq))
+                seq = []
+            header = line[1:].strip()
+        else:
+            seq.append(line)
+    if header is not None:
+        num_sequences += 1
+        yield FastaEntry(header, "".join(seq))
+    if num_sequences == 0:
+        raise ValueError("Found no sequences in input")
+def read_sequences(path: PathOrBuffer) -> Generator[FastaEntry, None, None]:
+    # Uses duck typing to try and call the right method
+    # Doesn't use explicit isinstance check to support
+    # inputs that are not explicitly str/Path/TextIOBase but
+    # may support similar functionality
+    data = None  # type: ignore
+    try:
+        if str(path).endswith(".gz"):
+            import gzip
+            data = gzip.open(path, "rt")  # type: ignore
+        else:
+            try:
+                data = open(path)  # type: ignore
+            except TypeError:
+                data: io.TextIOBase = path  # type: ignore
+        yield from parse_fasta(data.read())
+    finally:
+        if data is not None:
+            data.close()
+def read_first_sequence(path: PathOrBuffer) -> FastaEntry:
+    return next(iter(read_sequences(path)))
+def count_fasta_sequences(path: str | Path) -> int:
+    """Count sequences in a FASTA file by counting header lines.
+    Faster than parsing the full file — only scans for '>' prefixes.
+    Returns 0 if the file does not exist.
+    """
+    path = Path(path)
+    if not path.exists():
+        return 0
+    with open(path) as f:
+        return sum(1 for line in f if line.startswith(">"))
+def append_fasta_sequence(header: str, sequence: str, path: str | Path) -> None:
+    """Append a single sequence to a FASTA file (creating it if needed)."""
+    path = Path(path)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    # The existing file may not end with a newline (e.g., write_sequences()
+    # explicitly avoids writing a newline at the end), so we insert one before
+    # appending to avoid merging with the last line.
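+    needs_newline = False
+    if path.exists() and path.stat().st_size > 0:
+        # Check only the final byte instead of reading the whole file.
+        with open(path, "rb") as existing:
+            existing.seek(-1, io.SEEK_END)
+            needs_newline = existing.read(1) != b"\n"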
+    with open(path, "a") as f:
+        if needs_newline:
+            f.write("\n")
+        f.write(f">{header}\n{sequence}\n")
+def write_sequences(sequences: Iterable[tuple[str, str]], path: PathOrBuffer) -> None:
+    needs_closing = False
+    handle = None
+    try:
+        try:
+            handle = open(path, "w")  # type: ignore
+            needs_closing = True
+        except TypeError:
+            handle = path
+        has_prev = False
+        for header, seq in sequences:
+            if has_prev:
+                handle.write("\n")  # type: ignore
+            handle.write(f">{header}\n{seq}")  # type: ignore
+            has_prev = True
+    finally:
+        if needs_closing:
+            handle.close()  # type: ignore
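
As an illustration of the record format these helpers read and write, here is a minimal standalone sketch. `FastaRecord` and `iter_fasta` are hypothetical names that only mirror the general shape of `parse_fasta`; they are not the module's API, and unlike the module they strip whitespace from sequence lines and ignore any text before the first header.

```python
from typing import Iterator, NamedTuple, Optional


class FastaRecord(NamedTuple):
    header: str
    sequence: str


def iter_fasta(text: str) -> Iterator[FastaRecord]:
    """Yield one record per '>' header, joining wrapped sequence lines."""
    header: Optional[str] = None
    chunks: list[str] = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                yield FastaRecord(header, "".join(chunks))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield FastaRecord(header, "".join(chunks))


example = ">chainA desc\nMKT\nAYI\n\n# comment\n>chainB\nGSH"
records = list(iter_fasta(example))
```

The wrapped lines `MKT` and `AYI` are joined into a single sequence for `chainA desc`, and the final record is emitted even though the text has no trailing newline.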

esmfold2_predicted_aligned_error.py ADDED

	@@ -0,0 +1,105 @@

+import torch
+import torch.nn.functional as F
+from .esmfold2_affine3d import Affine3D
+def masked_mean(
+    mask: torch.Tensor,
+    value: torch.Tensor,
+    dim: int | None | tuple[int, ...] = None,
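+    eps: float = 1e-10,
+) -> torch.Tensor:
+    """Compute the mean of `value` over the positions where `mask` is true.
+    `mask` is broadcast to the shape of `value`; `eps` guards against
+    division by zero when no position is selected.
+    """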
+    mask = mask.expand(*value.shape)
+    return torch.sum(mask * value, dim=dim) / (eps + torch.sum(mask, dim=dim))
+def _pae_bins(
+    max_bin: float = 31, num_bins: int = 64, device: torch.device = torch.device("cpu")
+):
+    bins = torch.linspace(0, max_bin, steps=(num_bins - 1), device=device)
+    step = max_bin / (num_bins - 2)
+    bin_centers = bins + step / 2
+    bin_centers = torch.cat(
+        [bin_centers, (bin_centers[-1] + step).unsqueeze(-1)], dim=0
+    )
+    return bin_centers
+def _compute_pae_masks(mask: torch.Tensor):
+    square_mask = (mask.unsqueeze(-1) * mask.unsqueeze(-2)).bool()
+    return square_mask
+def compute_predicted_aligned_error(
+    logits: torch.Tensor,
+    aa_mask: torch.Tensor,
+    sequence_id: torch.Tensor | None = None,
+    max_bin: float = 31,
+) -> torch.Tensor:
+    """Return the expected aligned error for every residue pair: the softmax
+    over the error bins, weighted by the bin centers from `_pae_bins`.
+    """
+    bins = _pae_bins(max_bin, logits.shape[-1], logits.device)
+    square_mask = _compute_pae_masks(aa_mask)
+    min_v = torch.finfo(logits.dtype).min
+    probs = logits.masked_fill(~square_mask.unsqueeze(-1), min_v).softmax(dim=-1)
+    return (probs * bins).sum(dim=-1)
+@torch.no_grad()
+def compute_tm(logits: torch.Tensor, aa_mask: torch.Tensor, max_bin: float = 31.0):
+    square_mask = _compute_pae_masks(aa_mask)
+    seqlens = aa_mask.sum(-1, keepdim=True)
+    bins = _pae_bins(max_bin, logits.shape[-1], logits.device)
+    d0 = 1.24 * (seqlens.clamp_min(19) - 15) ** (1 / 3) - 1.8
+    f_d = 1.0 / (1 + (bins / d0.unsqueeze(-1)) ** 2)
+    min_v = torch.finfo(logits.dtype).min
+    probs = logits.masked_fill(~square_mask.unsqueeze(-1), min_v).softmax(dim=-1)
+    # This is the sum over bins
+    ptm = (probs * f_d.unsqueeze(-2)).sum(dim=-1)
+    # This is the mean over residues j
+    ptm = masked_mean(square_mask, ptm, dim=-1)
+    # Then we take the max over residues i
+    return ptm.max(dim=-1).values
+def tm_loss(
+    logits: torch.Tensor,
+    pred_affine: torch.Tensor,
+    targ_affine: torch.Tensor,
+    targ_mask: torch.Tensor,
+    tm_mask: torch.Tensor | None = None,
+    sequence_id: torch.Tensor | None = None,
+    max_bin: float = 31,
+):
+    pred = Affine3D.from_tensor(pred_affine)
+    targ = Affine3D.from_tensor(targ_affine)
+    def transform(affine: Affine3D):
+        pts = affine.trans[..., None, :, :]
+        return affine.invert()[..., None].apply(pts)
+    with torch.no_grad():
+        sq_diff = (transform(pred) - transform(targ)).square().sum(dim=-1)
+        num_bins = logits.shape[-1]
+        sq_bins = torch.linspace(
+            0, max_bin, num_bins - 1, device=logits.device
+        ).square()
+        # Gets the bin id by using a sum.
+        true_bins = (sq_diff[..., None] > sq_bins).sum(dim=-1).long()
+    errors = F.cross_entropy(logits.movedim(3, 1), true_bins, reduction="none")
+    square_mask = _compute_pae_masks(targ_mask)
+    loss = masked_mean(square_mask, errors, dim=(-1, -2))
+    if tm_mask is not None:
+        loss = masked_mean(tm_mask, loss, dim=None)
+    else:
+        loss = loss.mean()
+    return loss
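
The arithmetic behind `_pae_bins`, the expected-error reduction, and the `d0` term in `compute_tm` can be checked without PyTorch. The following is a pure-Python sketch under that assumption; the function names `pae_bin_centers`, `expected_pae`, and `tm_d0` are hypothetical and operate on a single residue pair or length rather than on batched tensors.

```python
import math


def pae_bin_centers(max_bin: float = 31.0, num_bins: int = 64) -> list[float]:
    """Bin centers: num_bins - 1 evenly spaced bins plus one overflow bin."""
    step = max_bin / (num_bins - 2)
    edges = [i * step for i in range(num_bins - 1)]
    centers = [e + step / 2 for e in edges]
    centers.append(centers[-1] + step)  # center of the overflow bin
    return centers


def expected_pae(logits: list[float], centers: list[float]) -> float:
    """Softmax over bins, then the probability-weighted mean of the centers."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return sum(e / z * c for e, c in zip(exps, centers))


def tm_d0(num_residues: int) -> float:
    """TM-score length scale, with the length clamped to at least 19."""
    return 1.24 * (max(num_residues, 19) - 15) ** (1 / 3) - 1.8


centers = pae_bin_centers()
uniform_pae = expected_pae([0.0] * 64, centers)
```

With the defaults the bin width is 0.5, so the centers run from 0.25 to 31.75, and uniform logits give the midpoint of that range as the expected error.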

esmfold2_prepare_input.py ADDED

	@@ -0,0 +1,1464 @@

+"""Prepare ESMFold2 model inputs from sequence-level StructurePredictionInput.
+This module converts StructurePredictionInput (protein/DNA/RNA/ligand sequences)
+into the tensor dict expected by the ESMFold2 model forward pass.
+"""
+from __future__ import annotations
+import math
+import warnings
+from collections import defaultdict
+from dataclasses import dataclass, field
+import numpy as np
+import torch
+from .esmfold2_conformers import (
+    get_ccd_leaving_atoms,
+    get_idealized_atom_pos,
+    get_ligand_ccd_atoms_with_charges,
+    get_ligand_ccd_bonds,
+    get_ligand_idealized_atom_pos,
+)
+from .esmfold2_constants import (
+    CHARGED_ATOMS,
+    DNA_1TO3,
+    DNA_BACKBONE_ATOMS,
+    DNA_HEAVY_ATOMS,
+    DNA_RESIDUE_TO_RES_TYPE,
+    DNA_RNA_LIGAND_INPUT_ID,
+    DNA_UNK_RES_TYPE,
+    ELEMENT_TO_ATOMIC_NUM,
+    ESM_PROTEIN_VOCAB,
+    MOL_TYPE_DNA,
+    MOL_TYPE_NONPOLYMER,
+    MOL_TYPE_PROTEIN,
+    MOL_TYPE_RNA,
+    MSA_GAP_TOKEN_ID,
+    PROTEIN_1TO3,
+    PROTEIN_3TO1,
+    PROTEIN_HEAVY_ATOMS,
+    PROTEIN_RESIDUE_TO_RES_TYPE,
+    PROTEIN_UNK_RES_TYPE,
+    RNA_1TO3,
+    RNA_BACKBONE_ATOMS,
+    RNA_HEAVY_ATOMS,
+    RNA_RESIDUE_TO_RES_TYPE,
+    RNA_UNK_RES_TYPE,
+)
+from .esmfold2_types import (
+    MSA,
+    DNAInput,
+    LigandInput,
+    Modification,
+    ProteinInput,
+    RNAInput,
+    StructurePredictionInput,
+)
+# =============================================================================
+# Lightweight data model
+# =============================================================================
+_ZERO_POS = np.array([0.0, 0.0, 0.0], dtype=np.float32)
+@dataclass
+class AtomInfo:
+    name: str
+    element: str
+    charge: int
+    ref_pos: np.ndarray  # Idealized position from CCD [3]
+    pos: np.ndarray  # Experimental position [3] (zeros for inference)
+    token_index: int = -1
+    atom_index: int = -1
+    space_uid: int = -1
+    is_valid: bool = True
+@dataclass
+class TokenInfo:
+    token_index: int
+    residue_index: int  # Within chain (0-based)
+    residue_name: str  # 3-letter code
+    mol_type: int  # 0=protein, 1=DNA, 2=RNA, 3=nonpolymer
+    res_type: int  # Residue type index (2-32)
+    input_id: int  # ESM vocab ID
+    asym_id: int
+    sym_id: int
+    entity_id: int
+    atom_start: int  # Index into atoms list
+    atom_count: int
+@dataclass
+class ChainInfo:
+    chain_id: str
+    asym_id: int
+    entity_id: int
+    sym_id: int
+    mol_type: int
+    tokens: list[TokenInfo] = field(default_factory=list)
+# =============================================================================
+# Helper functions
+# =============================================================================
+# Caches for hot-path functions
+_ENCODE_ATOM_NAME_CACHE: dict[str, list[int]] = {}
+_ELEMENT_ATOMIC_NUM_CACHE: dict[str, int] = {}
+def encode_atom_name(name: str) -> list[int]:
+    """Encode atom name as 4 character indices (offset by 32 from ASCII)."""
+    if name in _ENCODE_ATOM_NAME_CACHE:
+        return _ENCODE_ATOM_NAME_CACHE[name]
+    padded = name.ljust(4)[:4]
+    result = [ord(c) - 32 if c != " " else 0 for c in padded]
+    _ENCODE_ATOM_NAME_CACHE[name] = result
+    return result
+def get_element_atomic_num(element: str) -> int:
+    """Get atomic number for an element symbol."""
+    if element in _ELEMENT_ATOMIC_NUM_CACHE:
+        return _ELEMENT_ATOMIC_NUM_CACHE[element]
+    result = ELEMENT_TO_ATOMIC_NUM.get(element.upper(), 0)
+    _ELEMENT_ATOMIC_NUM_CACHE[element] = result
+    return result
+def _infer_element(atom_name: str) -> str:
+    """Infer element from atom name."""
+    name = atom_name.strip()
+    if not name:
+        return "C"
+    if name[0].isdigit():
+        return name[1] if len(name) > 1 else "H"
+    if len(name) == 2 and name in (
+        "FE",
+        "ZN",
+        "MG",
+        "MN",
+        "CO",
+        "NI",
+        "CU",
+        "SE",
+        "BR",
+    ):
+        return name
+    return name[0]
+def _compute_res_type(name: str, mol_type: int) -> int:
+    """Compute residue type index from residue name and mol_type."""
+    if mol_type == MOL_TYPE_PROTEIN:
+        return PROTEIN_RESIDUE_TO_RES_TYPE.get(name, PROTEIN_UNK_RES_TYPE)
+    elif mol_type == MOL_TYPE_DNA:
+        if name in DNA_RESIDUE_TO_RES_TYPE:
+            return DNA_RESIDUE_TO_RES_TYPE[name]
+        if name in RNA_RESIDUE_TO_RES_TYPE:
+            return RNA_RESIDUE_TO_RES_TYPE[name]
+        return DNA_UNK_RES_TYPE
+    elif mol_type == MOL_TYPE_RNA:
+        if name in RNA_RESIDUE_TO_RES_TYPE:
+            return RNA_RESIDUE_TO_RES_TYPE[name]
+        if name in DNA_RESIDUE_TO_RES_TYPE:
+            return DNA_RESIDUE_TO_RES_TYPE[name]
+        return RNA_UNK_RES_TYPE
+    return PROTEIN_UNK_RES_TYPE
+def _compute_esm_input_id(name: str, mol_type: int) -> int:
+    """Compute ESM vocabulary input ID."""
+    if mol_type == MOL_TYPE_PROTEIN:
+        letter = PROTEIN_3TO1.get(name)
+        if letter is None:
+            return DNA_RNA_LIGAND_INPUT_ID
+        return ESM_PROTEIN_VOCAB.get(letter, ESM_PROTEIN_VOCAB["X"])
+    return DNA_RNA_LIGAND_INPUT_ID
+# =============================================================================
+# Tokenization functions — build tokens and atoms from sequences
+# =============================================================================
+def tokenize_protein(
+    sequence: str,
+    modifications: list[Modification] | None,
+    entity_id: int,
+    asym_id: int,
+    sym_id: int,
+    token_offset: int,
+    atom_offset: int,
+    space_uid_offset: int,
+) -> tuple[list[TokenInfo], list[AtomInfo]]:
+    """Tokenize a protein sequence into tokens and atoms.
+    Standard residues produce 1 token with all heavy atoms.
+    Modified residues (from modifications) are atom-tokenized (1 token per atom).
+    """
+    tokens: list[TokenInfo] = []
+    atoms: list[AtomInfo] = []
+    # Build 3-letter sequence, applying modifications
+    seq_3letter = [PROTEIN_1TO3.get(c, "UNK") for c in sequence]
+    modified_positions: set[int] = set()
+    if modifications:
+        for mod in modifications:
+            seq_3letter[mod.position] = mod.ccd
+            modified_positions.add(mod.position)
+    token_idx = token_offset
+    atom_idx = atom_offset
+    space_uid = space_uid_offset
+    for res_idx, res_name in enumerate(seq_3letter):
+        # MSE → MET for atom lookup
+        res_corrected = "MET" if res_name == "MSE" else res_name
+        is_modified = res_idx in modified_positions
+        # Check if standard residue (has predefined atom list)
+        if not is_modified and res_corrected in PROTEIN_HEAVY_ATOMS:
+            # Standard residue: 1 token, multiple atoms
+            atom_names = PROTEIN_HEAVY_ATOMS[res_corrected]
+            res_type = _compute_res_type(res_corrected, MOL_TYPE_PROTEIN)
+            input_id = _compute_esm_input_id(res_corrected, MOL_TYPE_PROTEIN)
+            atom_start = atom_idx
+            for a_name in atom_names:
+                ref_pos = get_idealized_atom_pos(res_type, a_name)
+                atoms.append(
+                    AtomInfo(
+                        name=a_name,
+                        element=_infer_element(a_name),
+                        charge=CHARGED_ATOMS.get((res_corrected, a_name), 0),
+                        ref_pos=ref_pos.copy()
+                        if ref_pos is not None
+                        else _ZERO_POS.copy(),
+                        pos=_ZERO_POS.copy(),
+                        token_index=token_idx,
+                        atom_index=atom_idx,
+                        space_uid=space_uid,
+                    )
+                )
+                atom_idx += 1
+            tokens.append(
+                TokenInfo(
+                    token_index=token_idx,
+                    residue_index=res_idx,
+                    residue_name=res_corrected,
+                    mol_type=MOL_TYPE_PROTEIN,
+                    res_type=res_type,
+                    input_id=input_id,
+                    asym_id=asym_id,
+                    sym_id=sym_id,
+                    entity_id=entity_id,
+                    atom_start=atom_start,
+                    atom_count=len(atom_names),
+                )
+            )
+            token_idx += 1
+            space_uid += 1
+        else:
+            # Modified or unknown residue: atom-tokenized
+            ccd_atoms = get_ligand_ccd_atoms_with_charges(res_name)
+            if ccd_atoms is None:
+                # Fallback: backbone only, as (name, element, charge) tuples.
+                # Keep the full atom name so that "CA" is not collapsed to "C".
+                ccd_atoms = [
+                    (n, _infer_element(n), 0) for n in ["N", "CA", "C", "O"]
+                ]
+            # Filter leaving atoms if not terminal
+            is_terminal = res_idx == len(seq_3letter) - 1
+            leaving_atoms = set() if is_terminal else get_ccd_leaving_atoms(res_name)
+            kept_atoms = [a for a in ccd_atoms if a[0] not in leaving_atoms]
+            # Single-atom residues (e.g. NH2 cap): the local frame is
+            # ill-defined with one atom; place at origin.
+            single_atom_residue = len(kept_atoms) == 1
+            for a_name, a_element, a_charge in kept_atoms:
+                ref_pos = get_ligand_idealized_atom_pos(res_name, a_name)
+                atoms.append(
+                    AtomInfo(
+                        name=a_name,
+                        element=a_element,
+                        charge=a_charge,
+                        ref_pos=_ZERO_POS.copy()
+                        if single_atom_residue
+                        else (
+                            ref_pos.copy() if ref_pos is not None else _ZERO_POS.copy()
+                        ),
+                        pos=_ZERO_POS.copy(),
+                        token_index=token_idx,
+                        atom_index=atom_idx,
+                        space_uid=space_uid,
+                    )
+                )
+                tokens.append(
+                    TokenInfo(
+                        token_index=token_idx,
+                        residue_index=res_idx,
+                        residue_name=res_name,
+                        mol_type=MOL_TYPE_PROTEIN,
+                        res_type=PROTEIN_UNK_RES_TYPE,
+                        input_id=DNA_RNA_LIGAND_INPUT_ID,
+                        asym_id=asym_id,
+                        sym_id=sym_id,
+                        entity_id=entity_id,
+                        atom_start=atom_idx,
+                        atom_count=1,
+                    )
+                )
+                token_idx += 1
+                atom_idx += 1
+            space_uid += 1
+    return tokens, atoms
+def tokenize_nucleotide(
+    sequence: str,
+    modifications: list[Modification] | None,
+    mol_type: int,
+    entity_id: int,
+    asym_id: int,
+    sym_id: int,
+    token_offset: int,
+    atom_offset: int,
+    space_uid_offset: int,
+) -> tuple[list[TokenInfo], list[AtomInfo]]:
+    """Tokenize a DNA or RNA sequence into tokens and atoms."""
+    tokens: list[TokenInfo] = []
+    atoms: list[AtomInfo] = []
+    letter_to_3 = DNA_1TO3 if mol_type == MOL_TYPE_DNA else RNA_1TO3
+    heavy_atoms = DNA_HEAVY_ATOMS if mol_type == MOL_TYPE_DNA else RNA_HEAVY_ATOMS
+    backbone_atoms = (
+        DNA_BACKBONE_ATOMS if mol_type == MOL_TYPE_DNA else RNA_BACKBONE_ATOMS
+    )
+    unk_res_type = DNA_UNK_RES_TYPE if mol_type == MOL_TYPE_DNA else RNA_UNK_RES_TYPE
+    seq_3letter = [letter_to_3.get(c, "UNK") for c in sequence]
+    modified_positions: set[int] = set()
+    if modifications:
+        for mod in modifications:
+            seq_3letter[mod.position] = mod.ccd
+            modified_positions.add(mod.position)
+    token_idx = token_offset
+    atom_idx = atom_offset
+    space_uid = space_uid_offset
+    for res_idx, res_name in enumerate(seq_3letter):
+        is_modified = res_idx in modified_positions
+        if not is_modified and res_name in heavy_atoms:
+            # Standard nucleotide
+            atom_names = heavy_atoms[res_name]
+            res_type = _compute_res_type(res_name, mol_type)
+            input_id = DNA_RNA_LIGAND_INPUT_ID
+            atom_start = atom_idx
+            for a_name in atom_names:
+                ref_pos = get_idealized_atom_pos(res_type, a_name)
+                atoms.append(
+                    AtomInfo(
+                        name=a_name,
+                        element=_infer_element(a_name),
+                        charge=CHARGED_ATOMS.get((res_name, a_name), 0),
+                        ref_pos=ref_pos.copy()
+                        if ref_pos is not None
+                        else _ZERO_POS.copy(),
+                        pos=_ZERO_POS.copy(),
+                        token_index=token_idx,
+                        atom_index=atom_idx,
+                        space_uid=space_uid,
+                    )
+                )
+                atom_idx += 1
+            tokens.append(
+                TokenInfo(
+                    token_index=token_idx,
+                    residue_index=res_idx,
+                    residue_name=res_name,
+                    mol_type=mol_type,
+                    res_type=res_type,
+                    input_id=input_id,
+                    asym_id=asym_id,
+                    sym_id=sym_id,
+                    entity_id=entity_id,
+                    atom_start=atom_start,
+                    atom_count=len(atom_names),
+                )
+            )
+            token_idx += 1
+            space_uid += 1
+        elif not is_modified and res_name == "UNK":
+            # Unknown nucleotide: backbone only
+            atom_names = backbone_atoms
+            atom_start = atom_idx
+            for a_name in atom_names:
+                # No idealized positions for UNK; ref_pos stays at the origin.
+                atoms.append(
+                    AtomInfo(
+                        name=a_name,
+                        element=_infer_element(a_name),
+                        charge=0,
+                        ref_pos=_ZERO_POS.copy(),
+                        pos=_ZERO_POS.copy(),
+                        token_index=token_idx,
+                        atom_index=atom_idx,
+                        space_uid=space_uid,
+                    )
+                )
+                atom_idx += 1
+            tokens.append(
+                TokenInfo(
+                    token_index=token_idx,
+                    residue_index=res_idx,
+                    residue_name=res_name,
+                    mol_type=mol_type,
+                    res_type=unk_res_type,
+                    input_id=DNA_RNA_LIGAND_INPUT_ID,
+                    asym_id=asym_id,
+                    sym_id=sym_id,
+                    entity_id=entity_id,
+                    atom_start=atom_start,
+                    atom_count=len(atom_names),
+                )
+            )
+            token_idx += 1
+            space_uid += 1
+        else:
+            # Modified nucleotide: atom-tokenized
+            ccd_atoms = get_ligand_ccd_atoms_with_charges(res_name)
+            if ccd_atoms is None:
+                # Fallback: backbone only, as (name, element, charge) tuples.
+                ccd_atoms = [(n, _infer_element(n), 0) for n in backbone_atoms]
+            is_terminal = res_idx == len(seq_3letter) - 1
+            leaving_atoms = set() if is_terminal else get_ccd_leaving_atoms(res_name)
+            for a_name, a_element, a_charge in ccd_atoms:
+                if a_name in leaving_atoms:
+                    continue
+                ref_pos = get_ligand_idealized_atom_pos(res_name, a_name)
+                atoms.append(
+                    AtomInfo(
+                        name=a_name,
+                        element=a_element,
+                        charge=a_charge,
+                        ref_pos=ref_pos.copy()
+                        if ref_pos is not None
+                        else _ZERO_POS.copy(),
+                        pos=_ZERO_POS.copy(),
+                        token_index=token_idx,
+                        atom_index=atom_idx,
+                        space_uid=space_uid,
+                    )
+                )
+                tokens.append(
+                    TokenInfo(
+                        token_index=token_idx,
+                        residue_index=res_idx,
+                        residue_name=res_name,
+                        mol_type=mol_type,
+                        res_type=PROTEIN_UNK_RES_TYPE,
+                        input_id=DNA_RNA_LIGAND_INPUT_ID,
+                        asym_id=asym_id,
+                        sym_id=sym_id,
+                        entity_id=entity_id,
+                        atom_start=atom_idx,
+                        atom_count=1,
+                    )
+                )
+                token_idx += 1
+                atom_idx += 1
+            space_uid += 1
+    return tokens, atoms
+def tokenize_ligand_ccd(
+    ccd_codes: list[str],
+    entity_id: int,
+    asym_id: int,
+    sym_id: int,
+    token_offset: int,
+    atom_offset: int,
+    space_uid_offset: int,
+    has_covalent_bond: bool,
+) -> tuple[list[TokenInfo], list[AtomInfo]]:
+    """Tokenize a ligand from CCD codes (1 token per atom)."""
+    tokens: list[TokenInfo] = []
+    atoms: list[AtomInfo] = []
+    token_idx = token_offset
+    atom_idx = atom_offset
+    space_uid = space_uid_offset
+    for res_idx, code in enumerate(ccd_codes):
+        ccd_atoms = get_ligand_ccd_atoms_with_charges(code)
+        if ccd_atoms is None:
+            raise ValueError(f"CCD component {code} not found")
+        leaving_atoms = get_ccd_leaving_atoms(code) if has_covalent_bond else set()
+        for a_name, a_element, a_charge in ccd_atoms:
+            if a_name in leaving_atoms:
+                continue
+            ref_pos = get_ligand_idealized_atom_pos(code, a_name)
+            atoms.append(
+                AtomInfo(
+                    name=a_name,
+                    element=a_element,
+                    charge=a_charge,
+                    ref_pos=ref_pos.copy() if ref_pos is not None else _ZERO_POS.copy(),
+                    pos=_ZERO_POS.copy(),
+                    token_index=token_idx,
+                    atom_index=atom_idx,
+                    space_uid=space_uid,
+                )
+            )
+            tokens.append(
+                TokenInfo(
+                    token_index=token_idx,
+                    residue_index=res_idx,
+                    residue_name=code,
+                    mol_type=MOL_TYPE_NONPOLYMER,
+                    res_type=PROTEIN_UNK_RES_TYPE,
+                    input_id=DNA_RNA_LIGAND_INPUT_ID,
+                    asym_id=asym_id,
+                    sym_id=sym_id,
+                    entity_id=entity_id,
+                    atom_start=atom_idx,
+                    atom_count=1,
+                )
+            )
+            token_idx += 1
+            atom_idx += 1
+        space_uid += 1
+    return tokens, atoms
+def tokenize_ligand_smiles(
+    smiles: str,
+    entity_id: int,
+    asym_id: int,
+    sym_id: int,
+    token_offset: int,
+    atom_offset: int,
+    space_uid_offset: int,
+    seed: int | None = None,
+) -> tuple[list[TokenInfo], list[AtomInfo]]:
+    """Tokenize a ligand from SMILES (1 token per heavy atom)."""
+    from rdkit import Chem
+    from rdkit.Chem import AllChem
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        raise ValueError(f"Failed to parse SMILES: {smiles}")
+    mol = Chem.AddHs(mol)
+    # Assign atom names using canonical ranking
+    canonical_order = AllChem.CanonicalRankAtoms(mol)  # type: ignore[attr-defined]
+    for atom, can_idx in zip(mol.GetAtoms(), canonical_order):
+        atom_name = atom.GetSymbol().upper() + str(can_idx + 1)
+        if len(atom_name) > 4:
+            raise ValueError(
+                f"SMILES {smiles} has atom name longer than 4 chars: {atom_name}"
+            )
+        atom.SetProp("name", atom_name)
+    # Generate 3D conformer
+    options = AllChem.ETKDGv3()  # type: ignore[attr-defined]
+    options.clearConfs = False
+    if seed is not None:
+        options.randomSeed = seed
+    conf_id = AllChem.EmbedMolecule(mol, options)  # type: ignore[attr-defined]
+    if conf_id == -1:
+        options.useRandomCoords = True
+        conf_id = AllChem.EmbedMolecule(mol, options)  # type: ignore[attr-defined]
+    if conf_id != -1:
+        try:
+            AllChem.UFFOptimizeMolecule(mol, confId=conf_id, maxIters=1000)  # type: ignore[attr-defined]
+        except (RuntimeError, ValueError):
+            # UFF refinement is best-effort; keep the embedded conformer as-is.
+            pass
+    # Remove hydrogens
+    mol_no_h = Chem.RemoveHs(mol)
+    if mol_no_h.GetNumConformers() == 0:
+        raise ValueError(f"Failed to generate conformer for SMILES: {smiles}")
+    conformer = mol_no_h.GetConformer(0)
+    tokens: list[TokenInfo] = []
+    atoms_list: list[AtomInfo] = []
+    token_idx = token_offset
+    atom_idx = atom_offset
+    space_uid = space_uid_offset
+    for atom in mol_no_h.GetAtoms():
+        a_name = atom.GetProp("name")
+        a_element = atom.GetSymbol()
+        a_charge = atom.GetFormalCharge()
+        pos_3d = conformer.GetAtomPosition(atom.GetIdx())
+        ref_pos = np.array([pos_3d.x, pos_3d.y, pos_3d.z], dtype=np.float32)
+        atoms_list.append(
+            AtomInfo(
+                name=a_name,
+                element=a_element,
+                charge=a_charge,
+                ref_pos=ref_pos,
+                pos=_ZERO_POS.copy(),
+                token_index=token_idx,
+                atom_index=atom_idx,
+                space_uid=space_uid,
+            )
+        )
+        tokens.append(
+            TokenInfo(
+                token_index=token_idx,
+                residue_index=0,
+                residue_name="LIG",
+                mol_type=MOL_TYPE_NONPOLYMER,
+                res_type=PROTEIN_UNK_RES_TYPE,
+                input_id=DNA_RNA_LIGAND_INPUT_ID,
+                asym_id=asym_id,
+                sym_id=sym_id,
+                entity_id=entity_id,
+                atom_start=atom_idx,
+                atom_count=1,
+            )
+        )
+        token_idx += 1
+        atom_idx += 1
+    return tokens, atoms_list
+# =============================================================================
+# Build chains from StructurePredictionInput
+# =============================================================================
+def _get_sequence_key(item) -> str:
+    """Get a hashable key for entity deduplication."""
+    if isinstance(item, ProteinInput):
+        return f"PROTEIN:{item.sequence}"
+    elif isinstance(item, DNAInput):
+        return f"DNA:{item.sequence}"
+    elif isinstance(item, RNAInput):
+        return f"RNA:{item.sequence}"
+    elif isinstance(item, LigandInput):
+        if item.ccd:
+            return f"LIGAND_CCD:{','.join(item.ccd)}"
+        return f"LIGAND_SMILES:{item.smiles}"
+    raise ValueError(f"Unknown input type: {type(item)}")
+def build_chains_from_input(
+    input: StructurePredictionInput, seed: int | None = None
+) -> tuple[list[ChainInfo], list[TokenInfo], list[AtomInfo]]:
+    """Build chains, tokens, and atoms from StructurePredictionInput.
+    Handles entity deduplication (identical sequences get same entity_id),
+    sym_id assignment, and delegates to type-specific tokenization functions.
+    """
+    chains: list[ChainInfo] = []
+    all_tokens: list[TokenInfo] = []
+    all_atoms: list[AtomInfo] = []
+    # Entity deduplication
+    sequence_to_entity: dict[str, int] = {}
+    entity_sym_count: dict[int, int] = {}
+    next_entity_id = 0
+    # Gather chain IDs involved in covalent bonds
+    covalent_chain_ids: set[str] = set()
+    if input.covalent_bonds:
+        for cb in input.covalent_bonds:
+            covalent_chain_ids.update([cb.chain_id1, cb.chain_id2])
+    token_offset = 0
+    atom_offset = 0
+    space_uid_offset = 0
+    asym_id = 0
+    for item in input.sequences:
+        # Entity deduplication
+        seq_key = _get_sequence_key(item)
+        if seq_key in sequence_to_entity:
+            entity_id = sequence_to_entity[seq_key]
+        else:
+            entity_id = next_entity_id
+            sequence_to_entity[seq_key] = entity_id
+            next_entity_id += 1
+        # Get all chain IDs for this item
+        ids = [item.id] if isinstance(item.id, str) else item.id
+        for chain_id_str in ids:
+            # sym_id is the per-entity copy index; increment per chain so
+            # ProteinInput(id=['A','B']) gives chain A sym_id=0, chain B sym_id=1.
+            sym_id = entity_sym_count.get(entity_id, 0)
+            entity_sym_count[entity_id] = sym_id + 1
+            if isinstance(item, ProteinInput):
+                if item.msa is None:
+                    warnings.warn(
+                        f"No MSA provided for {item.id}, using single sequence mode"
+                    )
+                new_tokens, new_atoms = tokenize_protein(
+                    sequence=item.sequence,
+                    modifications=item.modifications,
+                    entity_id=entity_id,
+                    asym_id=asym_id,
+                    sym_id=sym_id,
+                    token_offset=token_offset,
+                    atom_offset=atom_offset,
+                    space_uid_offset=space_uid_offset,
+                )
+            elif isinstance(item, (DNAInput, RNAInput)):
+                mol_type = MOL_TYPE_DNA if isinstance(item, DNAInput) else MOL_TYPE_RNA
+                new_tokens, new_atoms = tokenize_nucleotide(
+                    sequence=item.sequence,
+                    modifications=item.modifications,
+                    mol_type=mol_type,
+                    entity_id=entity_id,
+                    asym_id=asym_id,
+                    sym_id=sym_id,
+                    token_offset=token_offset,
+                    atom_offset=atom_offset,
+                    space_uid_offset=space_uid_offset,
+                )
+            elif isinstance(item, LigandInput):
+                has_cov = chain_id_str in covalent_chain_ids
+                if item.ccd is not None:
+                    if item.smiles is not None:
+                        warnings.warn("Both ccd and smiles provided, using ccd")
+                    new_tokens, new_atoms = tokenize_ligand_ccd(
+                        ccd_codes=item.ccd,
+                        entity_id=entity_id,
+                        asym_id=asym_id,
+                        sym_id=sym_id,
+                        token_offset=token_offset,
+                        atom_offset=atom_offset,
+                        space_uid_offset=space_uid_offset,
+                        has_covalent_bond=has_cov,
+                    )
+                elif item.smiles is not None:
+                    new_tokens, new_atoms = tokenize_ligand_smiles(
+                        smiles=item.smiles,
+                        entity_id=entity_id,
+                        asym_id=asym_id,
+                        sym_id=sym_id,
+                        token_offset=token_offset,
+                        atom_offset=atom_offset,
+                        space_uid_offset=space_uid_offset,
+                        seed=seed,
+                    )
+                else:
+                    raise ValueError("LigandInput must have either ccd or smiles")
+            else:
+                raise ValueError(f"Unknown input type: {type(item)}")
+            chain = ChainInfo(
+                chain_id=chain_id_str,
+                asym_id=asym_id,
+                entity_id=entity_id,
+                sym_id=sym_id,
+                mol_type=new_tokens[0].mol_type if new_tokens else MOL_TYPE_PROTEIN,
+                tokens=new_tokens,
+            )
+            chains.append(chain)
+            all_tokens.extend(new_tokens)
+            all_atoms.extend(new_atoms)
+            token_offset += len(new_tokens)
+            atom_offset += len(new_atoms)
+            space_uid_offset += len({a.space_uid for a in new_atoms})
+            asym_id += 1
+    return chains, all_tokens, all_atoms
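The entity, copy-index and chain-index bookkeeping in `build_chains_from_input` can be sketched in isolation. This is a minimal stand-alone illustration, not the module's code: `assign_ids` and the `(sequence_key, chain_ids)` pairs are hypothetical stand-ins for the real input objects and for `_get_sequence_key`.

```python
def assign_ids(items):
    """Return (chain_id, asym_id, entity_id, sym_id) for every chain.

    `items` is a list of (sequence_key, chain_ids) pairs. That shape is an
    assumption made for this sketch, not the module's actual input type.
    """
    key_to_entity = {}  # identical keys share one entity_id
    copies_seen = {}    # entity_id -> number of chains emitted so far
    rows = []
    asym_id = 0         # unique per chain, in input order
    for key, chain_ids in items:
        # len() is evaluated before insertion, so new keys get the next free id.
        entity_id = key_to_entity.setdefault(key, len(key_to_entity))
        for chain_id in chain_ids:
            sym_id = copies_seen.get(entity_id, 0)
            copies_seen[entity_id] = sym_id + 1
            rows.append((chain_id, asym_id, entity_id, sym_id))
            asym_id += 1
    return rows


rows = assign_ids(
    [
        ("PROTEIN:MKT", ["A", "B"]),
        ("LIGAND_CCD:ATP", ["C"]),
        ("PROTEIN:MKT", ["D"]),
    ]
)
```

In this example chain D reuses entity 0 and continues its copy count at `sym_id=2`, even though a different entity was declared in between.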
+# =============================================================================
+# Feature tensor building
+# =============================================================================
+def compute_frame_indices(
+    tokens: list[TokenInfo], atoms: list[AtomInfo]
+) -> tuple[np.ndarray, np.ndarray]:
+    """Compute backbone frame indices for each token.
+    Protein: [N, CA, C]; DNA/RNA: [C1', C3', C4']; Ligand: distance-based.
+    """
+    # Build atom name -> atom_index lookup per token
+    token_atoms: dict[int, dict[str, int]] = defaultdict(dict)
+    for atom in atoms:
+        if atom.is_valid:
+            token_atoms[atom.token_index][atom.name] = atom.atom_index
+    # Ligand-token frames come from CCD reference-conformer geometry,
+    # grouped per residue. For each token, the frame is the 3 atoms nearest
+    # to its own atom in the residue's ref-pos space, ordered
+    # (1st-nearest, self, 2nd-nearest).
+    ligand_token_to_atom: dict[int, int] = {}
+    ligand_tokens_by_res: dict[tuple[int, int], list[int]] = defaultdict(list)
+    for t in tokens:
+        if t.mol_type == MOL_TYPE_NONPOLYMER:
+            ad = token_atoms.get(t.token_index)
+            if ad:
+                ligand_token_to_atom[t.token_index] = next(iter(ad.values()))
+            ligand_tokens_by_res[(t.asym_id, t.residue_index)].append(t.token_index)
+    ligand_token_frames: dict[int, tuple[int, int, int]] = {}
+    for tok_indices in ligand_tokens_by_res.values():
+        atom_indices = [
+            ligand_token_to_atom[ti] for ti in tok_indices if ti in ligand_token_to_atom
+        ]
+        if len(atom_indices) < 3:
+            for ti in tok_indices:
+                if ti in ligand_token_to_atom:
+                    ai = ligand_token_to_atom[ti]
+                    ligand_token_frames[ti] = (ai, ai, ai)
+            continue
+        ref_pos_chain = np.array([atoms[ai].ref_pos for ai in atom_indices])
+        dist_mat = np.sqrt(
+            ((ref_pos_chain[:, None] - ref_pos_chain[None]) ** 2).sum(-1)
+        )
+        sort_indices = np.argsort(dist_mat, axis=1)
+        local_frames = np.column_stack(
+            [sort_indices[:, 1], sort_indices[:, 0], sort_indices[:, 2]]
+        )
+        for ti in tok_indices:
+            if ti not in ligand_token_to_atom:
+                continue
+            ai = ligand_token_to_atom[ti]
+            local_idx = atom_indices.index(ai)
+            fl = local_frames[local_idx]
+            ligand_token_frames[ti] = (
+                atom_indices[fl[0]],
+                atom_indices[fl[1]],
+                atom_indices[fl[2]],
+            )
+    # Build frames for all tokens
+    frames_list: list[tuple[int, int, int]] = []
+    for t in tokens:
+        ad = token_atoms.get(t.token_index, {})
+        fallback = next(iter(ad.values())) if ad else 0
+        if t.mol_type == MOL_TYPE_PROTEIN:
+            if t.res_type == PROTEIN_UNK_RES_TYPE:
+                frames_list.append((fallback, fallback, fallback))
+            else:
+                frames_list.append((ad.get("N", 0), ad.get("CA", 0), ad.get("C", 0)))
+        elif t.mol_type in (MOL_TYPE_DNA, MOL_TYPE_RNA):
+            if t.res_type == PROTEIN_UNK_RES_TYPE:
+                frames_list.append((fallback, fallback, fallback))
+            else:
+                frames_list.append(
+                    (ad.get("C1'", 0), ad.get("C3'", 0), ad.get("C4'", 0))
+                )
+        elif t.mol_type == MOL_TYPE_NONPOLYMER:
+            if t.token_index in ligand_token_frames:
+                frames_list.append(ligand_token_frames[t.token_index])
+            else:
+                frames_list.append((fallback, fallback, fallback))
+        else:
+            frames_list.append((fallback, fallback, fallback))
+    frames = np.array(frames_list, dtype=np.int64)
+    # Compute resolved mask (vectorized)
+    n_atoms = len(atoms)
+    atom_positions = (
+        np.array([a.pos for a in atoms], dtype=np.float32)
+        if atoms
+        else np.zeros((0, 3), dtype=np.float32)
+    )
+    atom_is_valid = (
+        np.array([a.is_valid for a in atoms], dtype=bool)
+        if atoms
+        else np.zeros(0, dtype=bool)
+    )
+    atom_is_resolved = (
+        atom_is_valid & np.any(atom_positions != 0, axis=1)
+        if n_atoms > 0
+        else np.zeros(0, dtype=bool)
+    )
+    n_tokens = len(tokens)
+    if n_tokens == 0:
+        return frames, np.zeros(0, dtype=bool)
+    pos1 = atom_positions[frames[:, 0]]
+    pos2 = atom_positions[frames[:, 1]]
+    pos3 = atom_positions[frames[:, 2]]
+    all_resolved = (
+        atom_is_resolved[frames[:, 0]]
+        & atom_is_resolved[frames[:, 1]]
+        & atom_is_resolved[frames[:, 2]]
+    )
+    all_same = (frames[:, 0] == frames[:, 1]) & (frames[:, 1] == frames[:, 2])
+    v1 = pos1 - pos2
+    v2 = pos3 - pos2
+    norm1 = np.linalg.norm(v1, axis=1)
+    norm2 = np.linalg.norm(v2, axis=1)
+    valid_norms = (norm1 >= 1e-6) & (norm2 >= 1e-6)
+    cos_angle = np.zeros(n_tokens, dtype=np.float32)
+    mask = valid_norms
+    if np.any(mask):
+        cos_angle[mask] = np.sum(v1[mask] * v2[mask], axis=1) / (
+            norm1[mask] * norm2[mask]
+        )
+    cos_angle = np.clip(cos_angle, -1, 1)
+    angle_deg = np.degrees(np.arccos(np.abs(cos_angle)))
+    not_colinear = angle_deg >= 25
+    resolved_mask = all_resolved & ~all_same & valid_norms & not_colinear
+    return frames, resolved_mask
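The two geometric rules used by `compute_frame_indices` (nearest-neighbour ligand frames and the colinearity filter) can be illustrated with plain Python. This is a sketch under assumptions: `ligand_frames` and `frame_is_usable` are hypothetical names, coordinates are plain tuples, and self is excluded explicitly instead of relying on it sorting first at distance zero.

```python
import math


def ligand_frames(ref_pos):
    """For each atom return (nearest, self, second-nearest) indices.

    Needs at least three atoms; the diff falls back to (self, self, self)
    for smaller groups.
    """
    frames = []
    for i, p in enumerate(ref_pos):
        # Sort the other atoms by distance to atom i.
        others = sorted(
            (j for j in range(len(ref_pos)) if j != i),
            key=lambda j: math.dist(p, ref_pos[j]),
        )
        frames.append((others[0], i, others[1]))
    return frames


def frame_is_usable(a, b, c, min_angle_deg=25.0, eps=1e-6):
    """Reject degenerate or near-colinear frames (b is the centre atom)."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 < eps or n2 < eps:
        return False
    cos = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    cos = max(-1.0, min(1.0, cos))
    # abs() folds the angle into [0, 90] degrees, so 0 and 180 both count
    # as colinear.
    return math.degrees(math.acos(abs(cos))) >= min_angle_deg


points = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (7, 0, 0)]
frames = ligand_frames(points)
```

Because the four sample points lie on one line, every frame built from them would fail `frame_is_usable`, which is the case the 25 degree threshold is there to catch.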
+def compute_token_bonds(
+    tokens: list[TokenInfo],
+    atoms: list[AtomInfo],
+    input: StructurePredictionInput,
+    chains: list[ChainInfo],
+) -> torch.Tensor:
+    """Compute dense token bond matrix [L, L, 1].
+    Includes ligand intra-residue bonds (from CCD), user-specified covalent
+    bonds, and peptide bonds at the boundaries of atom-tokenized residues.
+    """
+    n_tokens = len(tokens)
+    edge_set: set[tuple[int, int]] = set()
+    def add_bond(i: int, j: int) -> None:
+        if i != j:
+            edge_set.add((min(i, j), max(i, j)))
+    # Build per-residue atom name -> token_index mapping for ligands and modified residues
+    # Key: (asym_id, residue_index, atom_name) -> token_index
+    atom_name_to_token: dict[tuple[int, int, str], int] = {}
+    for atom in atoms:
+        if atom.is_valid:
+            t = tokens[atom.token_index] if atom.token_index < len(tokens) else None
+            if t and (
+                t.mol_type == MOL_TYPE_NONPOLYMER or t.res_type == PROTEIN_UNK_RES_TYPE
+            ):
+                atom_name_to_token[(t.asym_id, t.residue_index, atom.name)] = (
+                    atom.token_index
+                )
+    # Group atom-tokenized tokens by (asym_id, residue_index)
+    residue_tokens: dict[tuple[int, int], list[tuple[str, int]]] = defaultdict(list)
+    for atom in atoms:
+        if not atom.is_valid:
+            continue
+        t = tokens[atom.token_index] if atom.token_index < len(tokens) else None
+        if t and (
+            t.mol_type == MOL_TYPE_NONPOLYMER or t.res_type == PROTEIN_UNK_RES_TYPE
+        ):
+            residue_tokens[(t.asym_id, t.residue_index)].append(
+                (atom.name, atom.token_index)
+            )
+    # Add intra-residue bonds from CCD
+    for (asym_id_val, res_idx), atom_list in residue_tokens.items():
+        if not atom_list:
+            continue
+        res_name = tokens[atom_list[0][1]].residue_name
+        ccd_bonds = get_ligand_ccd_bonds(res_name)
+        atom_to_tok = {name: ti for name, ti in atom_list}
+        if ccd_bonds:
+            for a1, a2 in ccd_bonds:
+                if a1 in atom_to_tok and a2 in atom_to_tok:
+                    add_bond(atom_to_tok[a1], atom_to_tok[a2])
+        else:
+            # Fallback: fully connected within residue
+            tok_indices = [ti for _, ti in atom_list]
+            for i_idx in tok_indices:
+                for j_idx in tok_indices:
+                    add_bond(i_idx, j_idx)
+    # Add covalent bonds from input
+    if input.covalent_bonds:
+        # Build chain_id -> chain mapping
+        chain_by_id: dict[str, ChainInfo] = {c.chain_id: c for c in chains}
+        # Build (asym_id, residue_index) -> list of atoms for atom index lookup
+        chain_res_atoms: dict[tuple[int, int], list[AtomInfo]] = defaultdict(list)
+        for atom in atoms:
+            if atom.is_valid and atom.token_index < len(tokens):
+                t = tokens[atom.token_index]
+                chain_res_atoms[(t.asym_id, t.residue_index)].append(atom)
+        for cb in input.covalent_bonds:
+            c1 = chain_by_id.get(cb.chain_id1)
+            c2 = chain_by_id.get(cb.chain_id2)
+            if c1 is None or c2 is None:
+                continue
+            atoms_1 = chain_res_atoms.get((c1.asym_id, cb.res_idx1), [])
+            atoms_2 = chain_res_atoms.get((c2.asym_id, cb.res_idx2), [])
+            if cb.atom_idx1 < len(atoms_1) and cb.atom_idx2 < len(atoms_2):
+                add_bond(
+                    atoms_1[cb.atom_idx1].token_index, atoms_2[cb.atom_idx2].token_index
+                )
+    # Add peptide bonds at modified-residue boundaries: an atom-tokenized
+    # residue's N atom connects to the prev residue's C atom (and same for
+    # the C side to the next residue's N).
+    tokens_by_chain_res: dict[tuple[int, int], list[TokenInfo]] = defaultdict(list)
+    for t in tokens:
+        if t.mol_type == MOL_TYPE_PROTEIN:
+            tokens_by_chain_res[(t.asym_id, t.residue_index)].append(t)
+    def _backbone_token(res_tokens: list[TokenInfo], atom_name: str) -> int | None:
+        # Standard residue (single token wrapping all atoms): return that token.
+        if len(res_tokens) == 1 and res_tokens[0].res_type != PROTEIN_UNK_RES_TYPE:
+            return res_tokens[0].token_index
+        for t in res_tokens:
+            for a_idx in range(t.atom_start, t.atom_start + t.atom_count):
+                if a_idx < len(atoms) and atoms[a_idx].name == atom_name:
+                    return t.token_index
+        # Atom-tokenized residue without an atom of that name (e.g. ACE has
+        # no N, NH2 has no C). Fall back to the first atom-tokenized token.
+        return res_tokens[0].token_index if res_tokens else None
+    for (asym_id_val, res_idx), res_tokens in tokens_by_chain_res.items():
+        is_atom_tokenized = any(t.res_type == PROTEIN_UNK_RES_TYPE for t in res_tokens)
+        if not is_atom_tokenized:
+            continue  # Standard residue — no peptide bond added here.
+        n_tok = _backbone_token(res_tokens, "N")
+        c_tok = _backbone_token(res_tokens, "C")
+        prev_tokens = tokens_by_chain_res.get((asym_id_val, res_idx - 1))
+        if prev_tokens and n_tok is not None:
+            prev_c = _backbone_token(prev_tokens, "C")
+            if prev_c is not None:
+                add_bond(prev_c, n_tok)
+        next_tokens = tokens_by_chain_res.get((asym_id_val, res_idx + 1))
+        if next_tokens and c_tok is not None:
+            next_n = _backbone_token(next_tokens, "N")
+            if next_n is not None:
+                add_bond(c_tok, next_n)
+    # Expand to dense matrix
+    bonds = torch.zeros(n_tokens, n_tokens, 1, dtype=torch.float32)
+    for i, j in edge_set:
+        bonds[i, j, 0] = 1.0
+        bonds[j, i, 0] = 1.0
+    return bonds
+def compute_representative_atoms(
+    tokens: list[TokenInfo], atoms: list[AtomInfo]
+) -> torch.Tensor:
+    """Compute representative atom index per token (for token_to_rep_atom).
+    Returns:
+        distogram_atom_idx: [L] — representative atom per token
+            Protein: CB (or CA for GLY), DNA/RNA: C4/C2/C1', Ligand: first atom.
+    """
+    n_tokens = len(tokens)
+    # Build atom name -> index lookup per token
+    token_atoms: dict[int, dict[str, int]] = defaultdict(dict)
+    for atom in atoms:
+        if atom.is_valid:
+            token_atoms[atom.token_index][atom.name] = atom.atom_index
+    distogram_atom_idx = torch.zeros(n_tokens, dtype=torch.int64)
+    for t in tokens:
+        ad = token_atoms.get(t.token_index, {})
+        fallback_idx = next(iter(ad.values())) if ad else 0
+        if t.mol_type == MOL_TYPE_PROTEIN:
+            rep_idx = ad.get("CB", ad.get("CA", fallback_idx))
+        elif t.mol_type in (MOL_TYPE_DNA, MOL_TYPE_RNA):
+            if t.res_type in (27, 32):  # Unknown nucleotides
+                rep_idx = ad.get("C1'", fallback_idx)
+            elif t.res_type in (23, 24, 28, 29):  # Purines (A, G)
+                rep_idx = ad.get("C4", ad.get("C1'", fallback_idx))
+            else:  # Pyrimidines (C, U, T)
+                rep_idx = ad.get("C2", ad.get("C1'", fallback_idx))
+        else:
+            rep_idx = fallback_idx
+        distogram_atom_idx[t.token_index] = rep_idx
+    return distogram_atom_idx
+def compute_msa_features(
+    input: StructurePredictionInput,
+    chains: list[ChainInfo],
+    tokens: list[TokenInfo],
+    max_seqs: int = 16384,
+) -> dict[str, torch.Tensor]:
+    """Compute MSA features from protein MSAs.
+    Uses taxonomy-based pairing across chains
+    (:func:`paired_msa.construct_paired_msa`): rows whose FASTA header
+    contains ``key=N`` get paired across chains sharing the same ``N``.
+    Output: msa [M, L], deletion_value [M, L], has_deletion [M, L],
+            deletion_mean [L], msa_attention_mask [M, L]
+    """
+    from .esmfold2_paired_msa import (
+        construct_paired_msa,
+        protein_letter_to_res_type,
+    )
+    n_tokens = len(tokens)
+    # A single ProteinInput with id=['A','B','C',...] yields one item but
+    # multiple chains (one per id); broadcast the MSA across all of them.
+    chain_msas: dict[int, MSA | None] = {}
+    item_idx = 0
+    for item in input.sequences:
+        ids = [item.id] if isinstance(item.id, str) else list(item.id)
+        for _ in ids:
+            chain = chains[item_idx]
+            if isinstance(item, ProteinInput):
+                msa = item.msa
+                if msa is None:
+                    msa = MSA.from_sequences([item.sequence])
+                chain_msas[chain.asym_id] = msa
+            else:
+                chain_msas[chain.asym_id] = None
+            item_idx += 1
+    letter_to_res_type = protein_letter_to_res_type()
+    # Build per-chain query res_types (used for chains without an MSA).
+    chain_query_res_types: dict[int, np.ndarray] = {}
+    for chain in chains:
+        chain_tokens = [t for t in tokens if t.asym_id == chain.asym_id]
+        chain_query_res_types[chain.asym_id] = np.array(
+            [t.res_type for t in chain_tokens], dtype=np.int64
+        )
+    token_asym_ids = np.array([t.asym_id for t in tokens], dtype=np.int64)
+    token_res_ids = np.array([t.residue_index for t in tokens], dtype=np.int64)
+    msa_res, del_counts, paired = construct_paired_msa(
+        chain_msas,
+        chain_query_res_types,
+        token_asym_ids,
+        token_res_ids,
+        letter_to_res_type=letter_to_res_type,
+        max_seqs=max_seqs,
+    )
+    # Tokens for chains without an MSA get their res_type at row 0 and gap
+    # elsewhere; this mirrors the prior non-protein-token branch.
+    for t in tokens:
+        if chain_msas.get(t.asym_id) is None:
+            msa_res[:, t.token_index] = MSA_GAP_TOKEN_ID
+            msa_res[0, t.token_index] = t.res_type
+    if msa_res.shape[0] == 0:
+        msa_res = np.full((1, n_tokens), MSA_GAP_TOKEN_ID, dtype=np.int64)
+        del_counts = np.zeros((1, n_tokens), dtype=np.float32)
+    msa_data = torch.from_numpy(msa_res)
+    del_data = torch.from_numpy(del_counts)
+    has_deletion = del_data > 0
+    deletion_value = (np.pi / 2) * torch.arctan(del_data / 3)
+    deletion_mean = deletion_value.mean(dim=0)
+    msa_mask = torch.ones_like(msa_data, dtype=torch.bool)
+    return {
+        "msa": msa_data,
+        "deletion_value": deletion_value,
+        "has_deletion": has_deletion,
+        "deletion_mean": deletion_mean,
+        "msa_attention_mask": msa_mask,
+    }
+def compute_distogram_conditioning(
+    input: StructurePredictionInput,
+    chains: list[ChainInfo],
+    tokens: list[TokenInfo],
+    disto_center: torch.Tensor,
+    min_dist: float = 2.0,
+    max_dist: float = 22.0,
+    num_bins: int = 64,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """Compute distogram conditioning from user-provided distograms.
+    Returns:
+        disto_cond: [L, L] int64 (bin indices)
+        disto_cond_mask: [L, L] bool
+    """
+    n_tokens = len(tokens)
+    disto_cond = torch.zeros(n_tokens, n_tokens, dtype=torch.long)
+    disto_cond_mask = torch.zeros(n_tokens, n_tokens, dtype=torch.bool)
+    if not input.distogram_conditioning:
+        return disto_cond, disto_cond_mask
+    # Build chain_id -> asym_id mapping
+    chain_id_to_asym: dict[str, int] = {c.chain_id: c.asym_id for c in chains}
+    # Build asym_id -> token indices mapping
+    asym_to_tokens: dict[int, list[int]] = defaultdict(list)
+    for t in tokens:
+        asym_to_tokens[t.asym_id].append(t.token_index)
+    boundaries = torch.linspace(min_dist, max_dist, num_bins + 1)
+    for dc in input.distogram_conditioning:
+        asym_id_val = chain_id_to_asym.get(dc.chain_id)
+        if asym_id_val is None:
+            continue
+        tok_indices = asym_to_tokens[asym_id_val]
+        n_chain = len(tok_indices)
+        distogram = torch.tensor(dc.distogram, dtype=torch.float32)
+        if distogram.shape != (n_chain, n_chain):
+            raise ValueError(
+                f"Distogram shape {distogram.shape} doesn't match chain length {n_chain}"
+            )
+        # Bin the distogram
+        binned = torch.bucketize(distogram, boundaries[:-1]) - 1
+        binned = binned.clamp(0, num_bins - 1)
+        for i, ti in enumerate(tok_indices):
+            for j, tj in enumerate(tok_indices):
+                disto_cond[ti, tj] = binned[i, j]
+                disto_cond_mask[ti, tj] = True
+    return disto_cond, disto_cond_mask
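The binning rule in `compute_distogram_conditioning` is easier to check on scalar values. The sketch below reproduces it with the standard library only. `bin_distance` is a hypothetical helper, and it should match the tensor version only up to floating-point details, since `torch.linspace` here works in float32.

```python
from bisect import bisect_left


def bin_distance(d, min_dist=2.0, max_dist=22.0, num_bins=64):
    """Map a distance to a bin index in [0, num_bins - 1]."""
    step = (max_dist - min_dist) / num_bins
    # Lower edges of the bins: the first num_bins of the num_bins + 1
    # evenly spaced boundaries.
    lower_edges = [min_dist + k * step for k in range(num_bins)]
    # bisect_left finds i with lower_edges[i-1] < d <= lower_edges[i], the
    # same rule torch.bucketize applies with its default right=False.
    idx = bisect_left(lower_edges, d) - 1
    # Distances outside the range are clamped into the first or last bin.
    return max(0, min(num_bins - 1, idx))


examples = [bin_distance(d) for d in (0.0, 2.1, 2.3125, 2.4, 21.7, 100.0)]
```

With the default arguments each bin is 0.3125 wide, and a distance that sits exactly on an interior edge (2.3125 here) lands in the lower bin.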
+def build_feature_tensors(
+    chains: list[ChainInfo],
+    tokens: list[TokenInfo],
+    atoms: list[AtomInfo],
+    input: StructurePredictionInput,
+) -> dict[str, torch.Tensor]:
+    """Build all model input tensors from tokens and atoms."""
+    n_tokens = len(tokens)
+    n_real_atoms = len(atoms)
+    # Pad atoms to nearest multiple of 32
+    target_atoms = math.ceil(n_real_atoms / 32) * 32 if n_real_atoms > 0 else 32
+    n_padding = target_atoms - n_real_atoms
+    padding_atoms = [
+        AtomInfo(
+            name="",
+            element="",
+            charge=0,
+            ref_pos=_ZERO_POS.copy(),
+            pos=_ZERO_POS.copy(),
+            token_index=0,
+            atom_index=n_real_atoms + i,
+            space_uid=0,
+            is_valid=False,
+        )
+        for i in range(n_padding)
+    ]
+    all_atoms = atoms + padding_atoms
+    n_atoms = len(all_atoms)
+    # --- Token-level tensors ---
+    token_index_arr = np.empty(n_tokens, dtype=np.int64)
+    residue_index_arr = np.empty(n_tokens, dtype=np.int64)
+    asym_id_arr = np.empty(n_tokens, dtype=np.int64)
+    sym_id_arr = np.empty(n_tokens, dtype=np.int64)
+    entity_id_arr = np.empty(n_tokens, dtype=np.int64)
+    mol_type_arr = np.empty(n_tokens, dtype=np.int64)
+    res_type_arr = np.empty(n_tokens, dtype=np.int64)
+    input_ids_arr = np.empty(n_tokens, dtype=np.int64)
+    for i, t in enumerate(tokens):
+        token_index_arr[i] = t.token_index
+        residue_index_arr[i] = t.residue_index
+        asym_id_arr[i] = t.asym_id
+        sym_id_arr[i] = t.sym_id
+        entity_id_arr[i] = t.entity_id
+        mol_type_arr[i] = t.mol_type
+        res_type_arr[i] = t.res_type
+        input_ids_arr[i] = t.input_id
+    token_index = torch.from_numpy(token_index_arr)
+    residue_index = torch.from_numpy(residue_index_arr)
+    asym_id = torch.from_numpy(asym_id_arr)
+    sym_id = torch.from_numpy(sym_id_arr)
+    entity_id = torch.from_numpy(entity_id_arr)
+    mol_type = torch.from_numpy(mol_type_arr)
+    res_type = torch.from_numpy(res_type_arr)
+    input_ids = torch.from_numpy(input_ids_arr)
+    token_pad_mask = torch.ones(n_tokens, dtype=torch.bool)
+    # --- Atom-level tensors ---
+    ref_pos_arr = np.zeros((n_atoms, 3), dtype=np.float32)
+    ref_element_arr = np.zeros(n_atoms, dtype=np.int64)
+    ref_charge_arr = np.zeros(n_atoms, dtype=np.int8)
+    ref_atom_name_chars_arr = np.zeros((n_atoms, 4), dtype=np.int64)
+    ref_space_uid_arr = np.zeros(n_atoms, dtype=np.int64)
+    atom_pad_mask_arr = np.zeros(n_atoms, dtype=np.bool_)
+    atom_to_token_arr = np.zeros(n_atoms, dtype=np.int64)
+    all_positions = np.zeros((n_atoms, 3), dtype=np.float64)
+    is_valid_arr = np.zeros(n_atoms, dtype=np.bool_)
+    for i, atom in enumerate(all_atoms):
+        if atom.ref_pos is not None:
+            ref_pos_arr[i] = atom.ref_pos
+        ref_charge_arr[i] = atom.charge
+        ref_space_uid_arr[i] = (
+            atom.space_uid if atom.space_uid >= 0 else atom.token_index
+        )
+        atom_pad_mask_arr[i] = atom.is_valid
+        is_valid_arr[i] = atom.is_valid
+        all_positions[i] = atom.pos
+        if atom.is_valid:
+            ref_element_arr[i] = get_element_atomic_num(atom.element)
+            name_indices = encode_atom_name(atom.name)
+            ref_atom_name_chars_arr[i] = name_indices
+            atom_to_token_arr[i] = atom.token_index
+    ref_pos = torch.from_numpy(ref_pos_arr)
+    ref_element = torch.from_numpy(ref_element_arr)
+    ref_charge = torch.from_numpy(ref_charge_arr)
+    ref_atom_name_chars = torch.from_numpy(ref_atom_name_chars_arr)
+    ref_space_uid = torch.from_numpy(ref_space_uid_arr)
+    atom_pad_mask = torch.from_numpy(atom_pad_mask_arr)
+    atom_to_token = torch.from_numpy(atom_to_token_arr)
+    # Coordinates — center on resolved atoms
+    raw_coords = torch.from_numpy(all_positions)
+    is_nonzero = np.any(all_positions != 0, axis=1)
+    atom_resolved_arr = is_valid_arr & is_nonzero
+    resolved_mask = torch.from_numpy(atom_resolved_arr)
+    valid_mask = torch.from_numpy(is_valid_arr)
+    if resolved_mask.any():
+        centroid = raw_coords[resolved_mask].mean(dim=0, keepdim=True)
+        raw_coords = raw_coords - centroid
+        raw_coords[~valid_mask] = 0.0
+    coords = raw_coords.float().unsqueeze(0)  # [1, A, 3]
+    atom_resolved_mask = torch.tensor(atom_resolved_arr, dtype=torch.bool)
+    # --- Frames ---
+    frames, _ = compute_frame_indices(tokens, atoms)
+    frames_idx = torch.from_numpy(frames).to(torch.int64)
+    # --- Token bonds ---
+    token_bonds = compute_token_bonds(tokens, atoms, input, chains)
+    # --- Representative atoms ---
+    distogram_atom_idx = compute_representative_atoms(tokens, atoms)
+    # --- MSA features ---
+    msa_features = compute_msa_features(input, chains, tokens)
+    # --- Distogram conditioning ---
+    # disto_center is not needed for inference (no experimental coords)
+    disto_center = torch.zeros(n_tokens, 3, dtype=torch.float32)
+    disto_cond, disto_cond_mask = compute_distogram_conditioning(
+        input, chains, tokens, disto_center
+    )
+    # ref_pos: CCD conformer positions, used as-is for inference.
+    # No random rotation or masking — at inference there are no resolved
+    # experimental coordinates, so atom_resolved_mask is all False.
+    # The model uses ref_pos for atom feature embedding.
+    # --- Pocket (dropped) ---
+    pocket_feature = torch.zeros(n_tokens, dtype=torch.long)
+    return {
+        # Token-level
+        "token_index": token_index,
+        "residue_index": residue_index,
+        "asym_id": asym_id,
+        "entity_id": entity_id,
+        "sym_id": sym_id,
+        "mol_type": mol_type,
+        "res_type": res_type,
+        "input_ids": input_ids,
+        "token_bonds": token_bonds,
+        "token_attention_mask": token_pad_mask,
+        "pocket_feature": pocket_feature,
+        # Atom-level
+        "ref_pos": ref_pos,
+        "ref_element": ref_element,
+        "ref_charge": ref_charge,
+        "ref_atom_name_chars": ref_atom_name_chars,
+        "ref_space_uid": ref_space_uid,
+        "gt_coords": coords,
+        "atom_attention_mask": atom_pad_mask,
+        "atom_to_token": atom_to_token,
+        "is_resolved": atom_resolved_mask,
+        "distogram_atom_idx": distogram_atom_idx,
+        # Frames
+        "frames_idx": frames_idx,
+        # Distogram
+        "disto_cond": disto_cond,
+        "disto_cond_mask": disto_cond_mask,
+        # MSA
+        **msa_features,
+    }
+# =============================================================================
+# Top-level entry point
+# =============================================================================
+def prepare_esmfold2_input(
+    input: StructurePredictionInput, seed: int | None = None
+) -> tuple[dict[str, torch.Tensor], list[ChainInfo]]:
+    """Prepare ESMFold2 model inputs from StructurePredictionInput.
+    Args:
+        input: The structure prediction input (sequences, conditioning, etc.)
+        seed: Random seed for SMILES conformer generation and augmentation.
+    Returns:
+        Tuple of (feature_dict, chain_infos) where feature_dict contains
+        all tensors for the model forward pass, and chain_infos contains
+        metadata for output processing.
+    """
+    chains, tokens, atoms = build_chains_from_input(input, seed)
+    features = build_feature_tensors(chains, tokens, atoms, input)
+    return features, chains

esmfold2_processor.py ADDED

	@@ -0,0 +1,356 @@

+import random
+from contextlib import contextmanager, nullcontext
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+from .esmfold2_conformers import load_ccd
+from .esmfold2_output import build_molecular_complex_from_features
+from .esmfold2_prepare_input import ChainInfo, prepare_esmfold2_input
+from .esmfold2_types import (
+    MSA,
+    Modification,
+    ProteinInput,
+    StructurePredictionInput,
+)
+from .esmfold2_molecular_complex import MolecularComplexResult
+@contextmanager
+def _seed_context(seed: int | None):
+    if seed is None:
+        yield
+        return
+    py_state = random.getstate()
+    np_state = np.random.get_state()
+    torch_state = torch.random.get_rng_state()
+    cuda_state = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+    try:
+        yield
+    finally:
+        random.setstate(py_state)
+        np.random.set_state(np_state)
+        torch.random.set_rng_state(torch_state)
+        if cuda_state is not None:
+            torch.cuda.set_rng_state_all(cuda_state)
+def clean_esmfold2_input(input: StructurePredictionInput) -> StructurePredictionInput:
+    """Group identical protein sequences into the same ProteinInput with multiple ids.
+    Example: Passing a tetramer like [ProteinInput(id=["0"], sequence="AAA|AAA|BBB|BBB")]
+    gets converted into [ProteinInput(id=["0_0", "0_1"], sequence="AAA"),
+                         ProteinInput(id=["0_2", "0_3"], sequence="BBB")]
+    Preserves the original order of unique sequences. Also converts "|" chainbreak
+    tokens to ":" in the sequence.
+    """
+    cleaned_sequences: list = []
+    chain_to_ids: dict[str, list[str]] = {}
+    chain_to_modifications: dict[str, list] = {}
+    chain_to_msa: dict[str, MSA | None] = {}
+    for item in input.sequences:
+        if isinstance(item, ProteinInput):
+            sequence = ":".join(item.sequence.split("|"))
+            if ":" not in sequence:
+                cleaned_sequences.append(item)
+                continue
+            if ":" in sequence and input.covalent_bonds is not None:
+                raise ValueError(
+                    "Covalent bonds are not supported when using chainbreaks. "
+                    "Chains must be separated into multiple ProteinInput objects."
+                )
+            base_id = item.id[0] if isinstance(item.id, list) else item.id
+            chain_to_ids = {}
+            chain_to_modifications = {}
+            chain_to_msa = {}
+            chains = sequence.split(":")
+            chain_start_positions = []
+            pos = 0
+            for chain in chains:
+                chain_start_positions.append(pos)
+                pos += len(chain) + 1
+            if item.modifications is not None:
+                for chain_idx, chain in enumerate(chains):
+                    chain_start = chain_start_positions[chain_idx]
+                    chain_end = chain_start + len(chain)
+                    chain_modifications = []
+                    for mod in item.modifications:
+                        if chain_start <= mod.position < chain_end:
+                            adjusted_mod = Modification(
+                                position=mod.position - chain_start, ccd=mod.ccd
+                            )
+                            chain_modifications.append(adjusted_mod)
+                    if chain not in chain_to_modifications:
+                        chain_to_modifications[chain] = chain_modifications
+                    else:
+                        chain_to_modifications[chain].extend(chain_modifications)
+            if item.msa is not None:
+                for chain_idx, chain in enumerate(chains):
+                    if chain not in chain_to_msa:
+                        chain_start = chain_start_positions[chain_idx]
+                        chain_end = chain_start + len(chain)
+                        chain_msa = item.msa.select_positions(  # type: ignore
+                            np.arange(chain_start, chain_end)
+                        )
+                        chain_to_msa[chain] = chain_msa
+            for i, chain in enumerate(chains):
+                chain_id = base_id + "_" + str(i)
+                if chain in chain_to_ids:
+                    chain_to_ids[chain].append(chain_id)
+                else:
+                    chain_to_ids[chain] = [chain_id]
+                    cleaned_sequences.append((item, chain))
+        else:
+            cleaned_sequences.append(item)
+    for i in range(len(cleaned_sequences)):
+        if isinstance(cleaned_sequences[i], tuple):
+            item, chain = cleaned_sequences[i]
+            chain_ids = chain_to_ids[chain]
+            chain_modifications = (
+                chain_to_modifications.get(chain) if item.modifications else None
+            )
+            chain_msa = chain_to_msa.get(chain) if item.msa else None
+            cleaned_sequences[i] = ProteinInput(
+                id=chain_ids,
+                sequence=chain,
+                msa=chain_msa,
+                modifications=chain_modifications,
+            )
+    return StructurePredictionInput(
+        sequences=cleaned_sequences,
+        distogram_conditioning=input.distogram_conditioning,
+        covalent_bonds=input.covalent_bonds,
+    )
+class ESMFold2InputBuilder:
+    def __init__(self, ccd_cache: Path | None = None):
+        load_ccd(ccd_cache)
+    def prepare_input(
+        self,
+        input: StructurePredictionInput,
+        seed: int | None = None,
+        device: torch.device | str | None = None,
+    ) -> tuple[dict, list[ChainInfo]]:
+        """Prepare raw input for the folding model.
+        Converts user-provided StructurePredictionInput into batched tensors
+        ready for model inference.
+        Parameters
+        ----------
+        input : StructurePredictionInput
+            Input specification (sequences, structures, constraints, etc.).
+        seed : int, optional
+            Random seed for reproducibility.
+        device : torch.device or str, optional
+            Target device for the returned tensors. Defaults to CPU; pass
+            ``model.device`` to skip a separate ``.to(...)`` step. ``fold()``
+            forwards ``model.device`` automatically.
+        Returns
+        -------
+        tuple[dict, list[ChainInfo]]
+            Batched input tensors and chain metadata for output processing.
+        """
+        structure_prediction_input = clean_esmfold2_input(input)
+        with _seed_context(seed) if seed is not None else nullcontext():
+            features, chain_infos = prepare_esmfold2_input(
+                structure_prediction_input, seed=seed
+            )
+            features = {
+                k: (v[None].to(device) if device is not None else v[None])
+                if isinstance(v, torch.Tensor)
+                else v
+                for k, v in features.items()
+            }
+        return features, chain_infos
+    def __call__(
+        self,
+        input: StructurePredictionInput,
+        seed: int | None = None,
+        device: torch.device | str | None = None,
+    ) -> tuple[dict, list[ChainInfo]]:
+        return self.prepare_input(input, seed=seed, device=device)
+    def decode(
+        self,
+        output: dict[str, torch.Tensor],
+        features: dict[str, torch.Tensor],
+        chain_infos: list[ChainInfo],
+        *,
+        num_diffusion_samples: int = 1,
+        complex_id: str = "pred",
+    ) -> MolecularComplexResult | list[MolecularComplexResult]:
+        """Convert raw model outputs into one MolecularComplexResult per sample.
+        Parameters
+        ----------
+        output : dict[str, Tensor]
+            Output dict returned by ESMFold2Model.forward.
+        features : dict[str, Tensor]
+            Feature dict from :meth:`prepare_input` (batched, on the model device).
+        chain_infos : list[ChainInfo]
+            Chain metadata returned alongside `features`.
+        num_diffusion_samples : int
+            Number of diffusion samples present in the output (Bm = B * num_diffusion_samples).
+        complex_id : str
+            Identifier assigned to each MolecularComplex.
+        Returns
+        -------
+        MolecularComplexResult or list[MolecularComplexResult]
+            A single result when num_diffusion_samples == 1 and the output holds
+            exactly one sample; otherwise a list of length Bm.
+        """
+        atom_mask = features["atom_attention_mask"][0]
+        ref_element = features["ref_element"][0]
+        ref_atom_name_chars = features["ref_atom_name_chars"][0]
+        sample_coords = output["sample_atom_coords"]
+        plddts = output["plddt"]
+        Bm = sample_coords.shape[0]
+        ptm_t = output.get("ptm")
+        iptm_t = output.get("iptm")
+        pae_t = output.get("pae")
+        distogram_t = output.get("distogram_logits")
+        pair_chains_t = output.get("pair_chains_iptm")
+        residue_index_t = output.get("residue_index")
+        entity_id_t = output.get("entity_id")
+        results: list[MolecularComplexResult] = []
+        for i in range(Bm):
+            mc = build_molecular_complex_from_features(
+                coords=sample_coords[i],
+                plddt=plddts[i],
+                atom_mask=atom_mask,
+                ref_element=ref_element,
+                ref_atom_name_chars=ref_atom_name_chars,
+                chain_infos=chain_infos,
+                complex_id=complex_id,
+            )
+            results.append(
+                MolecularComplexResult(
+                    complex=mc,
+                    plddt=plddts[i].detach().cpu(),
+                    ptm=float(ptm_t[i].item()) if ptm_t is not None else None,
+                    iptm=float(iptm_t[i].item()) if iptm_t is not None else None,
+                    pae=pae_t[i].detach().cpu() if pae_t is not None else None,
+                    distogram=(
+                        distogram_t[0].detach().cpu()
+                        if distogram_t is not None
+                        else None
+                    ),
+                    pair_chains_iptm=(
+                        pair_chains_t[i].detach().cpu()
+                        if pair_chains_t is not None
+                        else None
+                    ),
+                    residue_index=(
+                        residue_index_t[0].detach().cpu()
+                        if residue_index_t is not None
+                        else None
+                    ),
+                    entity_id=(
+                        entity_id_t[0].detach().cpu()
+                        if entity_id_t is not None
+                        else None
+                    ),
+                )
+            )
+        if num_diffusion_samples == 1 and len(results) == 1:
+            return results[0]
+        return results
+    def fold(
+        self,
+        model: Any,
+        input: StructurePredictionInput,
+        *,
+        num_loops: int = 3,
+        num_sampling_steps: int = 200,
+        num_diffusion_samples: int = 1,
+        seed: int | None = None,
+        noise_scale: float | None = None,
+        step_scale: float | None = None,
+        max_inference_sigma: int | None = None,
+        early_exit: bool = False,
+        complex_id: str = "pred",
+    ) -> MolecularComplexResult | list[MolecularComplexResult]:
+        """Fold a structure end-to-end: encode → model → decode.
+        Parameters
+        ----------
+        model : ESMFold2Model
+            The folding model. Must already be on the target device and in eval mode.
+        input : StructurePredictionInput
+            User-facing input specification.
+        num_loops, num_sampling_steps, num_diffusion_samples : int
+            Inference knobs forwarded to the model.
+        seed : int, optional
+            Seeds both input prep (SMILES conformer generation) and diffusion sampling.
+        noise_scale, step_scale, max_inference_sigma
+            Optional sampler overrides, forwarded to the model only when not None.
+        early_exit : bool
+            Always forwarded to the model.
+        complex_id : str
+            Identifier assigned to the predicted MolecularComplex(es).
+        Returns
+        -------
+        MolecularComplexResult or list[MolecularComplexResult]
+            A single result when num_diffusion_samples == 1, otherwise a list.
+        """
+        features, chain_infos = self.prepare_input(
+            input, seed=seed, device=model.device
+        )
+        sampler_kwargs: dict[str, Any] = {}
+        if noise_scale is not None:
+            sampler_kwargs["noise_scale"] = noise_scale
+        if step_scale is not None:
+            sampler_kwargs["step_scale"] = step_scale
+        if max_inference_sigma is not None:
+            sampler_kwargs["max_inference_sigma"] = max_inference_sigma
+        with torch.no_grad():
+            with _seed_context(seed) if seed is not None else nullcontext():
+                output = model(
+                    **features,
+                    num_loops=num_loops,
+                    num_sampling_steps=num_sampling_steps,
+                    num_diffusion_samples=num_diffusion_samples,
+                    early_exit=early_exit,
+                    **sampler_kwargs,
+                )
+        return self.decode(
+            output,
+            features,
+            chain_infos,
+            num_diffusion_samples=num_diffusion_samples,
+            complex_id=complex_id,
+        )
+__all__ = ["ESMFold2InputBuilder", "clean_esmfold2_input"]
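
Note on `_seed_context`: the helper above snapshots each global RNG state, reseeds, and restores the snapshot in a `finally` block, so seeding a fold does not disturb the caller's random stream. A minimal sketch of the same pattern follows, reduced to Python's `random` module so it runs without NumPy or PyTorch. The name `temporary_seed` is hypothetical and is not part of this commit.

```python
import random
from contextlib import contextmanager


@contextmanager
def temporary_seed(seed):
    """Seed Python's global RNG inside the block, then restore the old state.

    Hypothetical, stdlib-only stand-in for the commit's _seed_context.
    """
    if seed is None:
        # No seed requested: leave the global RNG untouched.
        yield
        return
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Runs even if the body raises, so the caller's stream is preserved.
        random.setstate(saved)


random.seed(123)
expected_next = random.random()  # what the outer stream would produce next

random.seed(123)  # rewind the outer stream to the same point
with temporary_seed(7):
    first = random.random()
with temporary_seed(7):
    second = random.random()
outer_next = random.random()  # unaffected by the two seeded blocks
```

Here `first` and `second` are equal because both blocks start from seed 7, and `outer_next` equals `expected_next` because the outer state is restored on exit.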

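Note on `clean_esmfold2_input`: the chain-break handling can be illustrated on plain strings. The sketch below is a simplified illustration, not the commit's implementation: it ignores MSAs and covalent bonds, and the helper names `group_identical_chains` and `localize_position` are hypothetical. It shows the two ideas the function relies on: identical chains share one entry with several ids, and a position in the full string maps to a chain-local position by skipping one separator character per break.

```python
def group_identical_chains(base_id, sequence):
    """Split on chain breaks and group identical chains under shared ids.

    Returns (ids, chain_sequence) pairs in first-seen order.
    """
    chains = sequence.replace("|", ":").split(":")
    ids_by_chain = {}  # dicts preserve insertion order
    for i, chain in enumerate(chains):
        ids_by_chain.setdefault(chain, []).append(f"{base_id}_{i}")
    return [(ids, chain) for chain, ids in ids_by_chain.items()]


def localize_position(sequence, position):
    """Map an index into the full string to (chain_index, index_within_chain)."""
    start = 0
    for chain_index, chain in enumerate(sequence.replace("|", ":").split(":")):
        if start <= position < start + len(chain):
            return chain_index, position - start
        start += len(chain) + 1  # +1 skips the separator character
    raise ValueError(f"position {position} does not fall inside any chain")


groups = group_identical_chains("0", "AAA|AAA|BBB|BBB")
location = localize_position("AAA|AAA|BBB|BBB", 9)
```

For the tetramer from the docstring, `groups` pairs ids `0_0` and `0_1` with `AAA` and ids `0_2` and `0_3` with `BBB`; index 9 lands in the third chain at local index 1.
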
esmfold2_protein_chain.py ADDED

	@@ -0,0 +1,1376 @@

+from __future__ import annotations
+import io
+import warnings
+from dataclasses import asdict, dataclass, replace
+from functools import cached_property
+from pathlib import Path
+from typing import Any, Mapping, Sequence
+import biotite.structure as bs
+import brotli
+import msgpack
+import msgpack_numpy
+import numpy as np
+import torch
+from biotite.database import rcsb
+from biotite.structure.io.pdb import PDBFile
+from biotite.structure.io.pdbx import CIFCategory, CIFColumn, CIFData, CIFFile
+from biotite.structure.io.pdbx import set_structure as set_structure_pdbx
+from scipy.spatial import ConvexHull, KDTree
+from scipy.spatial.distance import cdist, pdist, squareform
+from . import esmfold2_residue_constants as residue_constants
+from .esmfold2_misc import slice_python_object_as_numpy
+from .esmfold2_affine3d import Affine3D
+from .esmfold2_aligner import Aligner
+from .esmfold2_atom_indexer import AtomIndexer
+from .esmfold2_metrics import compute_gdt_ts, compute_lddt_ca
+from .esmfold2_mmcif_parsing import MmcifWrapper, Residue
+from .esmfold2_normalize_coordinates import (
+    apply_frame_to_coords,
+    get_protein_normalization_frame,
+)
+from .esmfold2_protein_structure import index_by_atom_name
+from .esmfold2_utils_types import PathOrBuffer
+msgpack_numpy.patch()
+CHAIN_ID_CONST = "A"
+def _str_key_to_int_key(dct: dict, ignore_keys: list[str] | None = None) -> dict:
+    new_dict = {}
+    for k, v in dct.items():
+        v_new = v
+        if k not in (ignore_keys or []) and isinstance(v, dict):
+            v_new = _str_key_to_int_key(v, ignore_keys=ignore_keys)
+        # Note assembly_composition is *supposed* to have string keys.
+        if isinstance(k, str) and k.isdigit():
+            new_dict[int(k)] = v_new
+        else:
+            new_dict[k] = v_new
+    return new_dict
+def _num_non_null_residues(seqres_to_structure_chain: Mapping[int, Residue]) -> int:
+    return sum(
+        residue.residue_number is not None
+        for residue in seqres_to_structure_chain.values()
+    )
+def infer_CB(C, N, Ca, L: float = 1.522, A: float = 1.927, D: float = -2.143):
+    """
+    Inspired by a util in trDesign:
+    https://github.com/gjoni/trDesign/blob/f2d5930b472e77bfacc2f437b3966e7a708a8d37/02-GD/utils.py#L92
+    input:  3 coords (a,b,c), (L)ength, (A)ngle, and (D)ihedral
+    output: 4th coord
+    """
+    norm = lambda x: x / np.sqrt(np.square(x).sum(-1, keepdims=True) + 1e-8)
+    with np.errstate(invalid="ignore"):  # inf - inf = nan is ok here
+        vec_bc = N - Ca
+        vec_ba = N - C
+    bc = norm(vec_bc)
+    n = norm(np.cross(vec_ba, bc))
+    m = [bc, np.cross(n, bc), n]
+    d = [L * np.cos(A), L * np.sin(A) * np.cos(D), -L * np.sin(A) * np.sin(D)]
+    return Ca + sum([m * d for m, d in zip(m, d)])
+def chain_to_ndarray(
+    atom_array: bs.AtomArray, mmcif: MmcifWrapper, chain_id: str, is_predicted=False
+):
+    entity_id = None
+    for entity, chains in mmcif.entities.items():
+        if chain_id in chains:
+            entity_id = entity
+    num_res = len(mmcif.chain_to_seqres[chain_id])
+    sequence = mmcif.chain_to_seqres[chain_id]
+    atom_positions = np.full([num_res, residue_constants.atom_type_num, 3], np.nan)
+    atom_mask = np.full([num_res, residue_constants.atom_type_num], False, dtype=bool)
+    residue_index = np.full([num_res], -1, dtype=np.int64)
+    insertion_code = np.full([num_res], "", dtype="<U4")
+    confidence = np.ones([num_res], dtype=np.float32)
+    # Select the chain once; the selection does not depend on the residue index.
+    chain = atom_array[atom_array.chain_id == chain_id]
+    assert isinstance(chain, bs.AtomArray)
+    for res_index in range(num_res):
+        res_at_position = mmcif.seqres_to_structure[chain_id][res_index]
+        if res_at_position.residue_number is None:
+            continue
+        residue_index[res_index] = res_at_position.residue_number
+        insertion_code[res_index] = res_at_position.insertion_code
+        res = chain[
+            (chain.res_id == res_at_position.residue_number)
+            & (chain.ins_code == res_at_position.insertion_code)
+            & (chain.hetero == res_at_position.hetflag)
+        ]
+        assert isinstance(res, bs.AtomArray)
+        # Atom level features
+        for atom in res:
+            atom_name = atom.atom_name
+            if atom_name == "SE" and atom.res_name == "MSE":
+                # Put the coords of the selenium atom in the sulphur column
+                atom_name = "SD"
+            if atom_name in residue_constants.atom_order:
+                atom_positions[res_index, residue_constants.atom_order[atom_name]] = (
+                    atom.coord
+                )
+                atom_mask[res_index, residue_constants.atom_order[atom_name]] = True
+                if is_predicted and atom_name == "CA":
+                    confidence[res_index] = atom.b_factor
+    assert all(sequence), "Some residue name was not specified correctly"
+    return (
+        sequence,
+        atom_positions,
+        atom_mask,
+        residue_index,
+        insertion_code,
+        confidence,
+        entity_id,
+    )
+@dataclass(frozen=True)
+class ProteinChain:
+    """Dataclass with atom37 representation of a single protein chain."""
+    id: str
+    sequence: str
+    chain_id: str  # author chain id - mutable
+    entity_id: int | None
+    residue_index: np.ndarray
+    insertion_code: np.ndarray
+    atom37_positions: np.ndarray
+    atom37_mask: np.ndarray
+    confidence: np.ndarray
+    mmcif: MmcifWrapper | None = None
+    atom37_confidence: np.ndarray | None = None  # [L, 37] per-atom pLDDT
+    def __post_init__(self):
+        assert self.atom37_mask.dtype == bool, self.atom37_mask.dtype
+        assert self.atom37_positions.shape[0] == len(self.sequence), (
+            self.atom37_positions.shape,
+            len(self.sequence),
+        )
+        assert self.atom37_mask.shape[0] == len(self.sequence), (
+            self.atom37_mask.shape,
+            len(self.sequence),
+        )
+        assert self.residue_index.shape[0] == len(self.sequence), (
+            self.residue_index.shape,
+            len(self.sequence),
+        )
+        assert self.insertion_code.shape[0] == len(self.sequence), (
+            self.insertion_code.shape,
+            len(self.sequence),
+        )
+        assert self.confidence.shape[0] == len(self.sequence), (
+            self.confidence.shape,
+            len(self.sequence),
+        )
+        if self.atom37_confidence is not None:
+            assert self.atom37_confidence.shape == self.atom37_mask.shape, (
+                self.atom37_confidence.shape,
+                self.atom37_mask.shape,
+            )
+    @cached_property
+    def atoms(self) -> AtomIndexer:
+        return AtomIndexer(self, property="atom37_positions", dim=-2)
+    @cached_property
+    def atom_mask(self) -> AtomIndexer:
+        return AtomIndexer(self, property="atom37_mask", dim=-1)
+    @cached_property
+    def atom_array(self) -> bs.AtomArray:
+        atoms = []
+        for res_idx_i, (
+            res_name,
+            res_idx,
+            ins_code,
+            positions,
+            mask,
+            conf,
+        ) in enumerate(
+            zip(
+                self.sequence,
+                self.residue_index,
+                self.insertion_code,
+                self.atom37_positions,
+                self.atom37_mask.astype(bool),
+                self.confidence,
+            )
+        ):
+            for i, pos in zip(np.where(mask)[0], positions[mask]):
+                b_factor = (
+                    self.atom37_confidence[res_idx_i, i]
+                    if self.atom37_confidence is not None
+                    else conf
+                )
+                atom = bs.Atom(
+                    coord=pos,
+                    chain_id="A" if self.chain_id is None else self.chain_id,
+                    res_id=res_idx,
+                    ins_code=ins_code,
+                    res_name=residue_constants.restype_1to3.get(res_name, "UNK"),
+                    hetero=False,
+                    atom_name=residue_constants.atom_types[i],
+                    element=residue_constants.atom_types[i][0],
+                    b_factor=float(b_factor),
+                )
+                atoms.append(atom)
+        return bs.array(atoms)
+    @cached_property
+    def residue_index_no_insertions(self) -> np.ndarray:
+        return self.residue_index + np.cumsum(self.insertion_code != "")
+    @cached_property
+    def atom_array_no_insertions(self) -> bs.AtomArray:
+        atoms = []
+        for res_idx, (res_name, positions, mask, conf) in enumerate(
+            zip(
+                self.sequence,
+                self.atom37_positions,
+                self.atom37_mask.astype(bool),
+                self.confidence,
+            )
+        ):
+            for i, pos in zip(np.where(mask)[0], positions[mask]):
+                b_factor = (
+                    self.atom37_confidence[res_idx, i]
+                    if self.atom37_confidence is not None
+                    else conf
+                )
+                atom = bs.Atom(
+                    coord=pos,
+                    # hard-coded, as we currently only support single-chain structures
+                    chain_id=CHAIN_ID_CONST,
+                    res_id=res_idx + 1,
+                    res_name=residue_constants.restype_1to3.get(res_name, "UNK"),
+                    hetero=False,
+                    atom_name=residue_constants.atom_types[i],
+                    element=residue_constants.atom_types[i][0],
+                    b_factor=float(b_factor),
+                )
+                atoms.append(atom)
+        return bs.array(atoms)
+    def __getitem__(self, idx: int | list[int] | slice | np.ndarray | torch.Tensor):
+        if isinstance(idx, int):
+            idx = [idx]
+        if isinstance(idx, torch.Tensor):
+            idx = idx.cpu().numpy()
+        sequence = slice_python_object_as_numpy(self.sequence, idx)
+        return replace(
+            self,
+            sequence=sequence,
+            residue_index=self.residue_index[..., idx],
+            insertion_code=self.insertion_code[..., idx],
+            atom37_positions=self.atom37_positions[..., idx, :, :],
+            atom37_mask=self.atom37_mask[..., idx, :],
+            confidence=self.confidence[..., idx],
+            atom37_confidence=self.atom37_confidence[..., idx, :]
+            if self.atom37_confidence is not None
+            else None,
+        )
+    def __len__(self):
+        return len(self.sequence)
+    def cbeta_contacts(self, distance_threshold: float = 8.0) -> np.ndarray:
+        distance = self.pdist_CB
+        contacts = (distance < distance_threshold).astype(np.int64)
+        contacts[np.isnan(distance)] = -1
+        np.fill_diagonal(contacts, -1)
+        return contacts
+    def to_pdb(self, path: PathOrBuffer, include_insertions: bool = True):
+        """Write the chain as a PDB file. DSSP works better without insertion codes."""
+        f = PDBFile()
+        if not include_insertions:
+            f.set_structure(self.atom_array_no_insertions)
+        else:
+            f.set_structure(self.atom_array)
+        f.write(path)
+    def to_pdb_string(self, include_insertions: bool = True) -> str:
+        buf = io.StringIO()
+        self.to_pdb(buf, include_insertions=include_insertions)
+        buf.seek(0)
+        return buf.read()
+    def to_mmcif(self, path: PathOrBuffer):
+        f = CIFFile()
+        set_structure_pdbx(f, self.atom_array, data_block=self.id)
+        # incantations molstar needs to render pLDDT / confidence onto
+        # the structure with "alphafold-view"
+        f.block["ma_qa_metric"] = CIFCategory(
+            name="ma_qa_metric",
+            columns={
+                "id": CIFColumn(data=CIFData(array=np.array([1, 2]), dtype=np.int64)),
+                "mode": CIFColumn(
+                    data=CIFData(array=np.array(["global", "local"]), dtype=np.str_)
+                ),
+                "name": CIFColumn(
+                    data=CIFData(array=np.array(["pLDDT", "pLDDT"]), dtype=np.str_)
+                ),
+            },
+        )
+        # table is a duplicate of data already in the atom array, but
+        # needed by molstar to render pLDDT / confidence
+        resid_pldd_table = {
+            # hard-coded, as we currently only support single-chain structures
+            "label_asym_id": CIFColumn(
+                data=CIFData(
+                    array=[CHAIN_ID_CONST] * len(self.residue_index), dtype=np.str_
+                )
+            ),
+            "label_comp_id": CIFColumn(
+                data=CIFData(
+                    array=[
+                        residue_constants.restype_1to3.get(c, "UNK")
+                        for c in self.sequence
+                    ],
+                    dtype=np.str_,
+                )
+            ),
+            "label_seq_id": CIFColumn(
+                data=CIFData(array=self.residue_index, dtype=np.int64)
+            ),
+            "ordinal_id": CIFColumn(
+                data=CIFData(array=self.residue_index, dtype=np.int64)
+            ),
+            # hard-coded to show these are all local pLDDT values
+            "metric_id": CIFColumn(
+                data=CIFData(array=["2"] * len(self.residue_index), dtype=np.str_)
+            ),
+            "metric_value": CIFColumn(
+                data=CIFData(array=self.confidence, dtype=np.float32)
+            ),
+            # hard-coded to show this is the initial version; there are no revisions
+            "model_id": CIFColumn(
+                data=CIFData(array=["1"] * len(self.residue_index), dtype=np.str_)
+            ),
+        }
+        f.block["ma_qa_metric_local"] = CIFCategory(
+            name="ma_qa_metric_local", columns=resid_pldd_table
+        )
+        f.write(path)
+    def to_mmcif_string(self) -> str:
+        buf = io.StringIO()
+        self.to_mmcif(buf)
+        buf.seek(0)
+        return buf.read()
+    def state_dict(self, backbone_only=False, json_serializable=False):
+        """This state dict is optimized for storage, so it casts floats to fp16 and
+        int64 arrays to int32 whenever possible. Residue indices are therefore limited
+        to the int32 range (below 2**31), which should be far more than we need."""
+        dct = {k: v for k, v in asdict(self).items() if k not in ["mmcif"]}
+        if backbone_only:
+            dct["atom37_mask"][:, 3:] = False
+        dct["atom37_positions"] = dct["atom37_positions"][dct["atom37_mask"]]
+        if dct.get("atom37_confidence") is not None:
+            dct["atom37_confidence"] = dct["atom37_confidence"][dct["atom37_mask"]]
+        else:
+            dct.pop("atom37_confidence", None)
+        for k, v in dct.items():
+            if isinstance(v, np.ndarray):
+                match v.dtype:
+                    case np.int64:
+                        dct[k] = v.astype(np.int32)
+                    case np.float64 | np.float32:
+                        dct[k] = v.astype(np.float16)
+                    case _:
+                        pass
+                if json_serializable:
+                    dct[k] = v.tolist()
+        return dct
+    def to_blob(self, backbone_only=False) -> bytes:
+        return brotli.compress(msgpack.dumps(self.state_dict(backbone_only)), quality=5)
+    @classmethod
+    def from_open_source(cls, pc: ProteinChain):
+        return cls(**vars(pc))
+    @classmethod
+    def from_state_dict(cls, dct):
+        # Note: assembly_composition is *supposed* to have string keys.
+        dct = _str_key_to_int_key(dct, ignore_keys=["assembly_composition"])
+        for k, v in dct.items():
+            if isinstance(v, list):
+                dct[k] = np.array(v)
+        atom37 = np.full((*dct["atom37_mask"].shape, 3), np.nan)
+        atom37[dct["atom37_mask"]] = dct["atom37_positions"]
+        dct["atom37_positions"] = atom37
+        if "atom37_confidence" in dct:
+            atom37_conf = np.full(dct["atom37_mask"].shape, np.nan, dtype=np.float32)
+            atom37_conf[dct["atom37_mask"]] = dct["atom37_confidence"]
+            dct["atom37_confidence"] = atom37_conf
+        dct = {
+            k: (
+                v.astype(np.float32)
+                if k in ["atom37_positions", "confidence", "atom37_confidence"]
+                else v
+            )
+            for k, v in dct.items()
+            if not (k == "atom37_confidence" and v is None)
+        }
+        return cls(**dct, mmcif=None)
+    @classmethod
+    def from_blob(cls, input: Path | str | io.BytesIO | bytes):
+        """NOTE(@zlin): blob + sparse coding + brotli + fp16 reduces memory
+        of chains from 52G/1M chains to 20G/1M chains, I think this is a good first
+        shot at compressing and dumping chains to disk. I'm sure there are better ways."""
+        # Use a local name that does not shadow the builtin `bytes`.
+        match input:
+            case Path() | str():
+                data = Path(input).read_bytes()
+            case io.BytesIO():
+                data = input.getvalue()
+            case _:
+                data = input
+        return cls.from_state_dict(msgpack.loads(brotli.decompress(data)))
+    def sasa(self, by_residue: bool = True):
+        arr = self.atom_array_no_insertions
+        sasa_per_atom = bs.sasa(arr)  # type: ignore
+        if by_residue:
+            # Sum per-atom SASA into residue "bins", with np.bincount.
+            assert arr.res_id is not None
+            # NOTE(rverkuil): arr.res_id is 1-indexed, but np.bincount returns a sum for bin 0, so we strip.
+            # NOTE(aderry): We compute only for residues with coordinates, return NaN otherwise.
+            num_trailing_residues = len(self) - arr.res_id.max()
+            sasa_per_residue = np.concatenate(
+                [
+                    np.bincount(arr.res_id, weights=sasa_per_atom)[1:],
+                    np.zeros(num_trailing_residues),
+                ]
+            )
+            sasa_per_residue[~self.atom37_mask.any(-1)] = np.nan
+            assert len(sasa_per_residue) == len(self)
+            return sasa_per_residue
+        return sasa_per_atom
+    def sap_score(self, aggregation: str = "atom") -> np.ndarray:
+        """Computes per-atom SAP score.
+        Can optionally aggregate by residue (by averaging over atoms. NOTE: this returns values only for residues that have coordinates!)
+        or full-protein (sum of SAP score for atoms with SAP > 0, as in Lauer et al. 2011)."""
+        sap_radius = 5.0
+        arr = self.atom_array_no_insertions
+        # asserts to avoid type errors
+        assert arr.res_id is not None
+        assert arr.res_name is not None
+        assert arr.atom_name is not None
+        assert arr.coord is not None
+        # compute SASA and residue-specific properties
+        sasa_per_atom = self.sasa(by_residue=False)
+        resid_to_resname = dict(zip(arr.res_id, arr.res_name))
+        max_side_chain_asa = np.full(len(self), np.nan)
+        res_hydrophobicity = np.full(len(self), np.nan)
+        resolved_res_mask = self.atom37_mask.any(-1)
+        num_trailing_residues = len(self) - arr.res_id.max()
+        max_side_chain_asa[resolved_res_mask] = np.array(
+            [
+                residue_constants.side_chain_asa[resid_to_resname[i]]
+                for i in np.unique(arr.res_id)
+            ]
+        )
+        res_hydrophobicity[resolved_res_mask] = np.array(
+            [
+                residue_constants.hydrophobicity[resid_to_resname[i]]
+                for i in np.unique(arr.res_id)
+            ]
+        )
+        assert len(max_side_chain_asa) == len(self)
+        assert len(res_hydrophobicity) == len(self)
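+        # compute SAP score
+        # SAP is defined on side-chain SASA (normalized by max_side_chain_asa),
+        # so zero out the backbone atoms and keep the side-chain atoms.
+        is_side_chain = ~bs.filter_peptide_backbone(arr)
+        sasa_per_atom[~is_side_chain] = 0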
+        kdtree = KDTree(arr.coord)
+        neighbors = kdtree.query_ball_tree(kdtree, sap_radius, p=2.0)
+        sap_by_atom = np.zeros_like(sasa_per_atom)
+        for i, nn_list in enumerate(neighbors):
+            saa_nn = np.zeros_like(sasa_per_atom)
+            saa_nn[nn_list] = sasa_per_atom[nn_list]
+            sasa_within_r = np.concatenate(
+                [
+                    np.bincount(arr.res_id, weights=saa_nn)[1:],
+                    np.zeros(num_trailing_residues),
+                ]
+            )
+            sap = np.nansum((sasa_within_r / max_side_chain_asa) * res_hydrophobicity)
+            sap_by_atom[i] = sap
+        match aggregation:
+            case "atom":
+                return sap_by_atom
+            case "residue":
+                sap_by_residue = np.concatenate(
+                    [
+                        np.bincount(arr.res_id, weights=sap_by_atom)[1:],
+                        np.zeros(num_trailing_residues),
+                    ]
+                ) / (
+                    np.concatenate(
+                        [np.bincount(arr.res_id)[1:], np.zeros(num_trailing_residues)]
+                    )
+                    + 1e-8
+                )
+                sap_by_residue[~resolved_res_mask] = np.nan
+                assert len(sap_by_residue) == len(self)
+                return sap_by_residue
+            case "protein":
+                return sum(sap_by_atom[sap_by_atom > 0])  # pyright: ignore[reportReturnType]
+            case _:
+                raise ValueError(
+                    f"Invalid aggregation method: {aggregation}. Must be one of 'atom', 'residue', or 'protein'"
+                )
+    def globularity(self) -> float:
+        # Computes globularity using total volumes divided by MVEE.
+        # We make the simplifying approximation that atoms never overlap.
+        # The globularity is only computed where structure exists.
+        # Besides the approximation above, this is inspired by:
+        # https://www.mdpi.com/2073-4352/11/12/1539
+        # NOTE(@zeming): due to the approximation we make here, that atoms never overlap, you might get >1 globularity
+        mask = self.atom37_mask.any(-1)
+        points = self.atom37_positions[self.atom37_mask]
+        sequence = [aa for aa, m in zip(self.sequence, mask) if m]  # type: ignore
+        A, _ = self._mvee(points, tol=1e-3)
+        mvee_volume = (4 * np.pi) / (3 * np.sqrt(np.linalg.det(A)))
+        volume = sum(residue_constants.amino_acid_volumes[x] for x in sequence)
+        ratio = volume / mvee_volume
+        # The paper says you must compare the ellipsoidal profile with T, a measurement of
+        # how elongated the ellipsoid is. We want a single number, so we scale the ratio by
+        # 1 / max(T, 1): the ratio is left unchanged while T <= 1 and shrinks once one
+        # radius exceeds the sum of the other two.
+        # A is symmetric, so eigvalsh returns real eigenvalues.
+        eigenvalues = np.linalg.eigvalsh(A)
+        R = 1 / np.sqrt(eigenvalues)
+        # ellipsoid radii length triangle inequality coefficient
+        T = max(R[0] / (R[1] + R[2]), R[1] / (R[0] + R[2]), R[2] / (R[0] + R[1]))
+        elongation_metric = 1 / max(T, 1)
+        return ratio * elongation_metric
+    @staticmethod
+    def _mvee(P: np.ndarray, tol, max_iter=10000):
+        # Finds the minimum volume enclosing ellipsoid (MVEE) of a set of points.
+        # Returns A, c where the ellipsoid is defined as:
+        #    (x-c).T @ A @ (x-c) = 1
+        hull = ConvexHull(P)
+        P = P[hull.vertices]
+        P = P.T
+        # Data points
+        d, N = P.shape
+        Q = np.zeros((d + 1, N))
+        Q[:d, :] = P[:d, :N]
+        Q[d, :] = np.ones((1, N))
+        # Initializations
+        count = 1
+        err = 1.0
+        u = np.full((N, 1), 1 / N)  # 1st iteration
+        # Khachiyan Algorithm
+        for i in range(max_iter):
+            X = Q.dot(np.diag(u.squeeze())) @ Q.T
+            M = np.diag(Q.T @ np.linalg.inv(X) @ Q)
+            maximum, j = np.max(M), np.argmax(M)
+            step_size = (maximum - d - 1) / ((d + 1) * (maximum - 1))
+            new_u = (1 - step_size) * u
+            new_u[j] += step_size
+            count += 1
+            err = np.linalg.norm(new_u - u)
+            u = new_u
+            if err < tol:
+                break
+        else:
+            raise ValueError("MVEE did not converge")
+        d = P.shape[0]  # spatial dimension of the points (number of rows of P)
+        U = np.diag(u.squeeze())
+        # The A matrix for the ellipse
+        A = (1 / d) * np.linalg.inv(P @ U @ P.T - (P @ u) @ (P @ u).T)
+        # Center of the ellipse
+        c = P @ u
+        return A, c
+    def radius_of_gyration(self):
+        arr = self.atom_array_no_insertions
+        return bs.gyration_radius(arr)
+    def align(
+        self,
+        target: ProteinChain,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        only_use_backbone: bool = False,
+    ):
+        """
+        Aligns the current protein to the provided target.
+        Args:
+            target (ProteinChain): The target protein to align to.
+            mobile_inds (list[int], np.ndarray, optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], np.ndarray, optional): The indices of the target atoms to align. These are NOT residue indices
+            only_use_backbone (bool, optional): If True, only align the backbone atoms.
+        """
+        aligner = Aligner(
+            self if mobile_inds is None else self[mobile_inds],
+            target if target_inds is None else target[target_inds],
+            only_use_backbone,
+        )
+        return aligner.apply(self)
+    def rmsd(
+        self,
+        target: ProteinChain,
+        also_check_reflection: bool = False,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        only_compute_backbone_rmsd: bool = False,
+    ):
+        """
+        Compute the RMSD between this protein chain and another.
+        Args:
+            target (ProteinChain): The target (other) protein chain to compare to.
+            also_check_reflection (bool, optional): If True, also check if the reflection of the mobile atoms has a lower RMSD.
+            mobile_inds (list[int], optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], optional): The indices of the target atoms to align. These are NOT residue indices
+            only_compute_backbone_rmsd (bool, optional): If True, only compute the RMSD of the backbone atoms.
+        """
+        if isinstance(target, bs.AtomArray):
+            raise ValueError(
+                "Support for bs.AtomArray removed, use "
+                "ProteinChain.from_atomarray for ProteinChain."
+            )
+        aligner = Aligner(
+            self if mobile_inds is None else self[mobile_inds],
+            target if target_inds is None else target[target_inds],
+            only_compute_backbone_rmsd,
+        )
+        avg_rmsd = aligner.rmsd
+        if not also_check_reflection:
+            return avg_rmsd
+        aligner = Aligner(
+            self if mobile_inds is None else self[mobile_inds],
+            target if target_inds is None else target[target_inds],
+            only_compute_backbone_rmsd,
+            use_reflection=True,
+        )
+        avg_rmsd_neg = aligner.rmsd
+        return min(avg_rmsd, avg_rmsd_neg)
+    def lddt_ca(
+        self,
+        native: ProteinChain,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        **kwargs,
+    ) -> float | np.ndarray:
+        """Compute the LDDT between this protein chain and another. NOTE: LDDT IS NOT SYMMETRIC.
+        The call should always be prediction.lddt_ca(native).
+        Arguments:
+            native (ProteinChain): The ground truth protein chain
+            mobile_inds (list[int], np.ndarray, optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], np.ndarray, optional): The indices of the target atoms to align. These are NOT residue indices
+        Returns:
+            float | np.ndarray: The LDDT score between the two protein chains, either
+                a single float or per-residue LDDT scores if `per_residue` is True.
+        """
+        lddt = compute_lddt_ca(
+            torch.tensor(self.atom37_positions[mobile_inds]).unsqueeze(0),
+            torch.tensor(native.atom37_positions[target_inds]).unsqueeze(0),
+            # Index the native mask with target_inds so it lines up with the native positions above.
+            torch.tensor(native.atom37_mask[target_inds]).unsqueeze(0),
+            **kwargs,
+        )
+        return float(lddt) if lddt.numel() == 1 else lddt.numpy().flatten()
+    def gdt_ts(
+        self,
+        target: ProteinChain,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        **kwargs,
+    ) -> float | np.ndarray:
+        """Compute the GDT_TS between this protein chain and another.
+        Arguments:
+            target (ProteinChain): The other protein chain to compare to.
+            mobile_inds (list[int], np.ndarray, optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], np.ndarray, optional): The indices of the target atoms to align. These are NOT residue indices
+        Returns:
+            float: The GDT_TS score between the two protein chains.
+        """
+        gdt_ts = compute_gdt_ts(
+            mobile=torch.tensor(
+                index_by_atom_name(self.atom37_positions[mobile_inds], "CA"),
+                dtype=torch.float32,
+            ).unsqueeze(0),
+            target=torch.tensor(
+                index_by_atom_name(target.atom37_positions[target_inds], "CA"),
+                dtype=torch.float32,
+            ).unsqueeze(0),
+            atom_exists_mask=torch.tensor(
+                index_by_atom_name(self.atom37_mask[mobile_inds], "CA", dim=-1)
+                & index_by_atom_name(target.atom37_mask[target_inds], "CA", dim=-1)
+            ).unsqueeze(0),
+            **kwargs,
+        )
+        return float(gdt_ts) if gdt_ts.numel() == 1 else gdt_ts.numpy().flatten()
+    @classmethod
+    def chain_iterable_from_mmcif(
+        cls,
+        path: PathOrBuffer | MmcifWrapper,
+        id: str | None = None,
+        is_predicted: bool = False,
+        keep_source: bool = False,
+    ):
+        """Yield a ProteinChain for each protein chain found in an mmCIF file.
+        Chains with no amino acid (non-hetero) atoms are skipped.
+        """
+        if isinstance(path, MmcifWrapper):
+            mmcif = path
+        else:
+            mmcif = MmcifWrapper.read(path, id)
+        for chain in bs.chain_iter(mmcif.structure):
+            chain = chain[bs.filter_amino_acids(chain) & ~chain.hetero]
+            if len(chain) == 0:
+                continue
+            chain_id = chain.chain_id[0]
+            entity_id = None
+            for entity, chains in mmcif.entities.items():
+                if chain_id in chains:
+                    entity_id = entity
+            assert entity_id is not None
+            (
+                sequence,
+                atom_positions,
+                atom_mask,
+                residue_index,
+                insertion_code,
+                confidence,
+                _,
+            ) = chain_to_ndarray(chain, mmcif, chain_id, is_predicted)
+            assert all(sequence), "Some residue name was not specified correctly"
+            yield cls(
+                id=mmcif.id,
+                sequence=sequence,
+                chain_id=chain_id,
+                entity_id=entity_id,
+                atom37_positions=atom_positions,
+                atom37_mask=atom_mask,
+                residue_index=residue_index,
+                insertion_code=insertion_code,
+                confidence=confidence,
+                mmcif=mmcif if keep_source else None,
+            )
+    @classmethod
+    def from_mmcif(
+        cls,
+        path: PathOrBuffer | MmcifWrapper,
+        chain_id: str | None = None,
+        entity_id: int | None = None,
+        id: str | None = None,
+        is_predicted: bool = False,
+        keep_source: bool = False,
+    ):
+        """Return a ProteinChain object from an mmCIF file.
+        Args:
+            path (str | Path | io.TextIO | MmcifWrapper): Path or buffer to read the mmCIF file from, or an
+                already-parsed MmcifWrapper. Should be uncompressed.
+            chain_id (str, optional): Select a chain corresponding to (author) chain id.
+            entity_id (int, optional): Select a chain corresponding to a particular entity.
+            id (str, optional): String identifier to assign to structure. Will attempt to infer otherwise.
+            is_predicted (bool): If True, reads the B-factor as the confidence readout. Default: False.
+            keep_source (bool): If True, keep the parsed MmcifWrapper on the returned chain. Default: False.
+        If neither `chain_id` nor `entity_id` is specified, defaults to the first entity.
+        Specify at most one of `chain_id` and `entity_id`.
+        """
+        if isinstance(path, MmcifWrapper):
+            mmcif = path
+        else:
+            mmcif = MmcifWrapper.read(path, id)
+        # If neither chain_id nor entity_id is specified, default to the first entity
+        if chain_id is None and entity_id is None:
+            if not mmcif.entities:
+                raise ValueError("Structure contains no entities")
+            entity_id = min(mmcif.entities.keys())  # Pick the first entity by ID
+        if entity_id is not None:
+            assert chain_id is None
+            if entity_id not in mmcif.entities:
+                raise ValueError(
+                    f"Structure does not contain entity `{entity_id}`. Valid entities: {mmcif.entities.keys()}"
+                )
+            chains = mmcif.entities[entity_id]
+            # Select the chain id corresponding to the longest chain. If all are equal length, selects the first.
+            chain_id = max(
+                chains,
+                key=lambda chain: _num_non_null_residues(
+                    mmcif.seqres_to_structure[chain]
+                ),
+            )
+        else:
+            assert chain_id is not None
+            for entity, chains in mmcif.entities.items():
+                if chain_id in chains:
+                    entity_id = entity
+        if entity_id is None:
+            warnings.warn(
+                "Failed to detect entity_id from mmcif file, it may be malformed."
+            )
+        atom_array = mmcif.structure
+        (
+            sequence,
+            atom_positions,
+            atom_mask,
+            residue_index,
+            insertion_code,
+            confidence,
+            _,
+        ) = chain_to_ndarray(atom_array, mmcif, chain_id, is_predicted)
+        assert all(sequence), "Some residue name was not specified correctly"
+        return cls(
+            id=mmcif.id,
+            sequence=sequence,
+            chain_id=chain_id,
+            entity_id=entity_id,
+            atom37_positions=atom_positions,
+            atom37_mask=atom_mask.astype(bool),
+            residue_index=residue_index,
+            insertion_code=insertion_code,
+            confidence=confidence,
+            mmcif=mmcif if keep_source else None,
+        )
+    @classmethod
+    def from_atom37(
+        cls,
+        atom37_positions: np.ndarray | torch.Tensor,
+        *,
+        id: str | None = None,
+        sequence: str | None = None,
+        chain_id: str | None = None,
+        entity_id: int | None = None,
+        residue_index: np.ndarray | torch.Tensor | None = None,
+        insertion_code: np.ndarray | None = None,
+        confidence: np.ndarray | torch.Tensor | None = None,
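+    ):
+        """Create a ProteinChain from a seqlen x 37 x 3 array of atom37 coordinates.
+        Atoms whose coordinates are not all finite (NaN or inf) are marked as missing in
+        atom37_mask. Tensor inputs may carry a leading batch dimension of size 1.
+        Defaults: empty id, an all-alanine sequence, chain "A", 1-based residue_index,
+        empty insertion codes, and a confidence of 1.0 for every residue.
+        """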
+        if isinstance(atom37_positions, torch.Tensor):
+            atom37_positions = atom37_positions.cpu().numpy()
+            if atom37_positions.ndim == 4:
+                if atom37_positions.shape[0] != 1:
+                    raise ValueError(
+                        f"Cannot handle batched inputs, atom37_positions has shape {atom37_positions.shape}"
+                    )
+                atom37_positions = atom37_positions[0]
+        assert isinstance(atom37_positions, np.ndarray)
+        seqlen = atom37_positions.shape[0]
+        atom_mask = np.isfinite(atom37_positions).all(-1)
+        if id is None:
+            id = ""
+        if sequence is None:
+            sequence = "A" * seqlen
+        if chain_id is None:
+            chain_id = "A"
+        if residue_index is None:
+            residue_index = np.arange(1, seqlen + 1)
+        elif isinstance(residue_index, torch.Tensor):
+            residue_index = residue_index.cpu().numpy()
+            assert isinstance(residue_index, np.ndarray)
+            if residue_index.ndim == 2:
+                if residue_index.shape[0] != 1:
+                    raise ValueError(
+                        f"Cannot handle batched inputs, residue_index has shape {residue_index.shape}"
+                    )
+                residue_index = residue_index[0]
+        assert isinstance(residue_index, np.ndarray)
+        if insertion_code is None:
+            insertion_code = np.array(["" for _ in range(seqlen)])
+        if confidence is None:
+            confidence = np.ones(seqlen, dtype=np.float32)
+        elif isinstance(confidence, torch.Tensor):
+            confidence = confidence.cpu().numpy()
+            assert isinstance(confidence, np.ndarray)
+            if confidence.ndim == 2:
+                if confidence.shape[0] != 1:
+                    raise ValueError(
+                        f"Cannot handle batched inputs, confidence has shape {confidence.shape}"
+                    )
+                confidence = confidence[0]
+        assert isinstance(confidence, np.ndarray)
+        return cls(
+            id=id,
+            sequence=sequence,  # type: ignore
+            chain_id=chain_id,
+            entity_id=entity_id,
+            atom37_positions=atom37_positions,
+            atom37_mask=atom_mask.astype(bool),
+            residue_index=residue_index,
+            insertion_code=insertion_code,
+            confidence=confidence,
+        )
+    @classmethod
+    def from_backbone_atom_coordinates(
+        cls, backbone_atom_coordinates: np.ndarray | torch.Tensor, **kwargs
+    ):
+        """Create a ProteinChain from a set of backbone atom coordinates.
+        This function simply expands the seqlen x 3 x 3 array of backbone atom
+        coordinates to a seqlen x 37 x 3 array of all atom coordinates, with the padded
+        positions set to infinity. This allows us to use from_atom37 to create the
+        appropriate ProteinChain object with the appropriate atom37_mask.
+        This function passes all kwargs to from_atom37.
+        """
+        if isinstance(backbone_atom_coordinates, torch.Tensor):
+            backbone_atom_coordinates = backbone_atom_coordinates.cpu().numpy()
+            if backbone_atom_coordinates.ndim == 4:
+                if backbone_atom_coordinates.shape[0] != 1:
+                    raise ValueError(
+                        f"Cannot handle batched inputs, backbone_atom_coordinates has "
+                        f"shape {backbone_atom_coordinates.shape}"
+                    )
+                backbone_atom_coordinates = backbone_atom_coordinates[0]
+        assert isinstance(backbone_atom_coordinates, np.ndarray)
+        assert backbone_atom_coordinates.ndim == 3
+        assert backbone_atom_coordinates.shape[-2] == 3
+        assert backbone_atom_coordinates.shape[-1] == 3
+        atom37_positions = np.full(
+            (backbone_atom_coordinates.shape[0], 37, 3),
+            np.inf,
+            dtype=backbone_atom_coordinates.dtype,
+        )
+        atom37_positions[:, :3, :] = backbone_atom_coordinates
+        return cls.from_atom37(atom37_positions=atom37_positions, **kwargs)
+    @classmethod
+    def from_pdb(
+        cls,
+        path: PathOrBuffer,
+        chain_id: str = "detect",
+        id: str | None = None,
+        is_predicted: bool = False,
+    ) -> "ProteinChain":
+        """Return a ProteinChain object from a PDB file. NOTE: prefer mmCIF for RCSB PDB files.
+        This function is mostly to interface with old PDB files and predicted structures;
+        it will not fill out the entity id correctly.
+        Args:
+            path (str | Path | io.TextIO): Path or buffer to read the PDB file from. Should be uncompressed.
+            chain_id (str, optional): Select a chain corresponding to (author) chain id. "detect" uses the
+                first detected chain.
+            id (str, optional): String identifier to assign to structure. Will attempt to infer otherwise.
+            is_predicted (bool): If True, reads the B-factor as the confidence readout. Default: False.
+        """
+        if id is not None:
+            file_id = id
+        else:
+            match path:
+                case Path() | str():
+                    file_id = Path(path).with_suffix("").name
+                case _:
+                    file_id = "null"
+        atom_array = PDBFile.read(path).get_structure(
+            model=1, extra_fields=["b_factor"]
+        )
+        if chain_id == "detect":
+            chain_id = atom_array.chain_id[0]
+        atom_array = atom_array[
+            bs.filter_amino_acids(atom_array)
+            & ~atom_array.hetero
+            & (atom_array.chain_id == chain_id)
+        ]
+        entity_id = 1  # Not supplied in PDB files
+        sequence = "".join(
+            residue_constants.restype_3to1.get(monomer[0].res_name, "X")
+            for monomer in bs.residue_iter(atom_array)
+        )
+        num_res = len(sequence)
+        atom_positions = np.full(
+            [num_res, residue_constants.atom_type_num, 3], np.nan, dtype=np.float32
+        )
+        atom_mask = np.full(
+            [num_res, residue_constants.atom_type_num], False, dtype=bool
+        )
+        residue_index = np.full([num_res], -1, dtype=np.int64)
+        insertion_code = np.full([num_res], "", dtype="<U4")
+        confidence = np.ones([num_res], dtype=np.float32)
+        for i, res in enumerate(bs.residue_iter(atom_array)):
+            chain = atom_array[atom_array.chain_id == chain_id]
+            assert isinstance(chain, bs.AtomArray)
+            res_index = res[0].res_id
+            residue_index[i] = res_index
+            insertion_code[i] = res[0].ins_code
+            # Atom level features
+            for atom in res:
+                atom_name = atom.atom_name
+                if atom_name == "SE" and atom.res_name == "MSE":
+                    # Put the coords of the selenium atom in the sulphur column
+                    atom_name = "SD"
+                if atom_name in residue_constants.atom_order:
+                    atom_positions[i, residue_constants.atom_order[atom_name]] = (
+                        atom.coord
+                    )
+                    atom_mask[i, residue_constants.atom_order[atom_name]] = True
+                    if is_predicted and atom_name == "CA":
+                        confidence[i] = atom.b_factor
+        assert all(sequence), "Some residue name was not specified correctly"
+        return cls(
+            id=file_id,
+            sequence=sequence,
+            chain_id=chain_id,
+            entity_id=entity_id,
+            atom37_positions=atom_positions,
+            atom37_mask=atom_mask.astype(bool),
+            residue_index=residue_index,
+            insertion_code=insertion_code,
+            confidence=confidence,
+            mmcif=None,
+        )
+    @classmethod
+    def from_mds(cls, data: dict[str, Any]) -> "ProteinChain":
+        return cls(
+            id=data["id"],
+            chain_id=data["chain_id"],
+            entity_id=data["entity_id"],
+            sequence=data["sequence"],
+            residue_index=data["residue_index"],
+            insertion_code=np.asarray(data["insertion_code"]),
+            atom37_positions=data["atom37_positions"],
+            atom37_mask=data["atom37_mask"].astype(bool),
+            confidence=data["confidence"],
+            mmcif=None,
+        )
+    @classmethod
+    def from_rcsb(
+        cls,
+        pdb_id: str,
+        chain_id: str | None = None,
+        entity_id: int | None = None,
+        keep_source: bool = False,
+    ) -> ProteinChain:
+        f: io.StringIO = rcsb.fetch(pdb_id, "cif")  # type: ignore
+        return cls.from_mmcif(
+            f,
+            id=pdb_id,
+            chain_id=chain_id,
+            entity_id=entity_id,
+            keep_source=keep_source,
+            is_predicted=False,
+        )
+    @classmethod
+    def from_atomarray(
+        cls, atom_array: bs.AtomArray, id: str | None = None, is_predicted: bool = False
+    ) -> "ProteinChain":
+        """A simple converter from bs.AtomArray -> ProteinChain.
+        Uses PDB file format as intermediate."""
+        atom_array = atom_array.copy()
+        atom_array.box = None  # remove surrounding box, from_pdb won't handle this
+        pdb_file = PDBFile()  # pyright: ignore
+        pdb_file.set_structure(atom_array)
+        buf = io.StringIO()
+        pdb_file.write(buf)
+        buf.seek(0)
+        return cls.from_pdb(buf, id=id, is_predicted=is_predicted)
+    def get_normalization_frame(self) -> Affine3D:
+        """Given a set of coordinates, compute a single frame.
+        Specifically, we compute the average position of the N, CA, and C atoms and use those 3 points to construct a frame using the Gram-Schmidt algorithm. The average CA position is used as the origin of the frame.
+        Returns:
+            Affine3D: [] tensor of Affine3D frame
+        """
+        coords = torch.from_numpy(self.atom37_positions)
+        frame = get_protein_normalization_frame(coords)
+        return frame
+    def apply_frame(self, frame: Affine3D) -> ProteinChain:
+        """Given a frame, apply the frame to the protein's coordinates.
+        Args:
+            frame (Affine3D): [] tensor of Affine3D frame
+        Returns:
+            ProteinChain: Transformed protein chain
+        """
+        coords = torch.from_numpy(self.atom37_positions).to(frame.trans.dtype)
+        coords = apply_frame_to_coords(coords, frame)
+        atom37_positions = coords.numpy()
+        return replace(self, atom37_positions=atom37_positions)
+    def normalize_coordinates(self) -> ProteinChain:
+        """Normalize the coordinates of the protein chain."""
+        return self.apply_frame(self.get_normalization_frame())
+    def infer_oxygen(self) -> ProteinChain:
+        """Oxygen position is fixed given N, CA, C atoms. Infer it if not provided."""
+        O_missing_indices = np.argwhere(
+            ~np.isfinite(self.atoms["O"]).all(axis=1)
+        ).squeeze()
+        O_vector = torch.tensor([0.6240, -1.0613, 0.0103], dtype=torch.float32)
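+        N, CA, C = torch.from_numpy(self.atoms[["N", "CA", "C"]]).float().unbind(dim=1)
+        # Shift N by one residue so that row i holds the N of residue i + 1. With no `dims`
+        # argument, torch.roll flattens the (L, 3) tensor, so a shift of -3 moves one full row.
+        N = torch.roll(N, -3)
+        # The last residue has no successor, so its frame (and inferred O) is undefined.
+        N[..., -1, :] = torch.nan
+        # Get the frame defined by the CA, C, and next-residue N atoms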
+        frames = Affine3D.from_graham_schmidt(CA, C, N)
+        O = frames.apply(O_vector)
+        atom37_positions = self.atom37_positions.copy()
+        atom37_mask = self.atom37_mask.copy()
+        atom37_positions[O_missing_indices, residue_constants.atom_order["O"]] = O[
+            O_missing_indices
+        ].numpy()
+        atom37_mask[O_missing_indices, residue_constants.atom_order["O"]] = ~np.isnan(
+            atom37_positions[O_missing_indices, residue_constants.atom_order["O"]]
+        ).any(-1)
+        new_chain = replace(
+            self, atom37_positions=atom37_positions, atom37_mask=atom37_mask
+        )
+        return new_chain
+    @cached_property
+    def inferred_cbeta(self) -> np.ndarray:
+        """Infer cbeta positions based on N, C, CA."""
+        N, CA, C = np.moveaxis(self.atoms[["N", "CA", "C"]], 1, 0)
+        # See usage in trDesign codebase.
+        # https://github.com/gjoni/trDesign/blob/f2d5930b472e77bfacc2f437b3966e7a708a8d37/02-GD/utils.py#L140
+        CB = infer_CB(C, N, CA, 1.522, 1.927, -2.143)
+        return CB
+    def infer_cbeta(self, infer_cbeta_for_glycine: bool = False) -> ProteinChain:
+        """Return a new chain with inferred CB atoms at all residues except GLY.
+        Args:
+            infer_cbeta_for_glycine (bool): If True, infers a beta carbon for glycine
+                residues, even though that residue doesn't have one.  Default off.
+                NOTE(rverkuil): The reason for having this switch in the first place
+                is that sometimes we want a (inferred) CB coordinate for every residue,
+                for example for making a pairwise distance matrix, or doing an RMSD
+                calculation between two designs for a given structural template, w/
+                CB atoms.
+        """
+        atom37_positions = self.atom37_positions.copy()
+        atom37_mask = self.atom37_mask.copy()
+        # Copy so the in-place NaN assignment below does not corrupt the cached property.
+        inferred_cbeta_positions = self.inferred_cbeta.copy()
+        if not infer_cbeta_for_glycine:
+            inferred_cbeta_positions[np.array(list(self.sequence)) == "G", :] = np.nan
+        atom37_positions[:, residue_constants.atom_order["CB"]] = (
+            inferred_cbeta_positions
+        )
+        atom37_mask[:, residue_constants.atom_order["CB"]] = ~np.isnan(
+            atom37_positions[:, residue_constants.atom_order["CB"]]
+        ).any(-1)
+        new_chain = replace(
+            self, atom37_positions=atom37_positions, atom37_mask=atom37_mask
+        )
+        return new_chain
+    @cached_property
+    def pdist_CA(self) -> np.ndarray:
+        CA = self.atoms["CA"]
+        pdist_CA = squareform(pdist(CA))
+        return pdist_CA
+    @cached_property
+    def pdist_CB(self) -> np.ndarray:
+        pdist_CB = squareform(pdist(self.inferred_cbeta))
+        return pdist_CB
+    @classmethod
+    def as_complex(cls, chains: Sequence[ProteinChain]):
+        raise RuntimeError(
+            ".as_complex() has been deprecated in favor of .concat(). "
+            ".concat() will eventually be deprecated in favor of ProteinComplex..."
+        )
+    @classmethod
+    def concat(cls, chains: Sequence[ProteinChain], use_chainbreak: bool = True):
+        sep_tokens = {
+            "residue_index": np.array([-1]),
+            "insertion_code": np.array([""]),
+            "atom37_positions": np.full([1, 37, 3], np.inf),
+            "atom37_mask": np.zeros([1, 37], dtype=bool),
+            "confidence": np.array([0]),
+        }
+        def join_arrays(arrays: Sequence[np.ndarray], sep: np.ndarray):
+            if use_chainbreak:
+                full_array = []
+                for array in arrays:
+                    full_array.append(array)
+                    full_array.append(sep)
+                full_array = full_array[:-1]
+                return np.concatenate(full_array, 0)
+            else:
+                return np.concatenate(arrays, 0)
+        array_args: dict[str, np.ndarray] = {
+            name: join_arrays([getattr(chain, name) for chain in chains], sep)
+            for name, sep in sep_tokens.items()
+        }
+        chain_break = residue_constants.CHAIN_BREAK_TOKEN if use_chainbreak else ""
+        return cls(
+            id=chains[0].id,
+            sequence=chain_break.join(chain.sequence for chain in chains),
+            chain_id="A",
+            entity_id=None,
+            mmcif=None,
+            **array_args,
+        )
+    def find_nonpolymer_contacts(self):
+        assert self.mmcif is not None
+        nonpolymer_and_chain_id_to_array = self.mmcif.non_polymer_coords
+        results = []
+        for (
+            nonpolymer,
+            _,
+        ), nonpolymer_array in nonpolymer_and_chain_id_to_array.items():
+            assert nonpolymer_array.coord is not None
+            chain_coords = self.atom37_positions[self.atom37_mask]
+            distance = cdist(nonpolymer_array.coord, chain_coords)
+            is_contact = distance < 5
+            if not is_contact.any():
+                continue
+            contacting_atoms = np.where(is_contact.any(0))[0]
+            chain_index = np.where(self.atom37_mask)[0]
+            contacting_residues = np.unique(chain_index[contacting_atoms])
+            result = {
+                "ligand": nonpolymer.name,
+                "ligand_id": nonpolymer.comp_id,
+                "contacting_residues": contacting_residues.tolist(),
+            }
+            results.append(result)
+        return results
+    def select_residue_indices(
+        self, indices: list[int | str], ignore_x_mismatch: bool = False
+    ) -> ProteinChain:
+        numeric_indices = [
+            idx if isinstance(idx, int) else int(idx[1:]) for idx in indices
+        ]
+        mask = np.isin(self.residue_index, numeric_indices)
+        new = self[mask]
+        mismatches = []
+        for aa, idx in zip(new.sequence, indices):
+            if isinstance(idx, int):
+                continue
+            if aa == "X" and ignore_x_mismatch:
+                continue
+            if aa != idx[0]:
+                mismatches.append((aa, idx))
+        if mismatches:
+            mismatch_str = "; ".join(
+                f"Position {idx[1:]}, Expected: {idx[0]}, Received: {aa}"
+                for aa, idx in mismatches
+            )
+            raise RuntimeError(mismatch_str)
+        return new
+    def to_structure_encoder_inputs(
+        self,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """Convert protein chain to structure encoder inputs.
+        Returns:
+            tuple: (coordinates, plddt, residue_index) where:
+                - coordinates: (1, L, 37, 3) tensor of atom positions
+                - plddt: (1, L) tensor of confidence scores
+                - residue_index: (1, L) tensor of residue indices
+        """
+        # Convert to tensors and add batch dimension
+        coordinates = (
+            torch.from_numpy(self.atom37_positions).float().unsqueeze(0)
+        )  # (1, L, 37, 3)
+        plddt = torch.from_numpy(self.confidence).float().unsqueeze(0)  # (1, L)
+        residue_index = (
+            torch.from_numpy(self.residue_index).long().unsqueeze(0)
+        )  # (1, L)
+        return coordinates, plddt, residue_index

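A note on the chain-break convention used by `ProteinChain.concat` above: per-residue arrays are joined with a one-row separator between chains, and the sequences are joined with a chain-break character. The sketch below is a minimal standalone illustration of that idea, not the repository's implementation. It assumes `CHAIN_BREAK_TOKEN` is `"|"` (consistent with the `"|"` checks in `ProteinComplex`), and the helper names `join_with_separator` and `chain_boundaries` are hypothetical.

```python
import numpy as np

CHAIN_BREAK = "|"  # assumption: the value of residue_constants.CHAIN_BREAK_TOKEN


def join_with_separator(arrays, sep):
    """Concatenate arrays along axis 0, placing `sep` between neighbours."""
    pieces = []
    for array in arrays:
        pieces.append(array)
        pieces.append(sep)
    # Drop the trailing separator so it only appears between chains.
    return np.concatenate(pieces[:-1], axis=0)


def chain_boundaries(sequence):
    """Return half-open (start, end) spans of each chain in a joined sequence."""
    breaks = [-1]
    breaks += [i for i, s in enumerate(sequence) if s == CHAIN_BREAK]
    breaks.append(len(sequence))
    return [(breaks[i] + 1, breaks[i + 1]) for i in range(len(breaks) - 1)]


sequences = ["MKT", "GA"]
joined_sequence = CHAIN_BREAK.join(sequences)
joined_residue_index = join_with_separator(
    [np.array([1, 2, 3]), np.array([1, 2])], np.array([-1])
)
spans = chain_boundaries(joined_sequence)
```

Because the separator occupies one position in every per-residue array, the joined arrays stay the same length as the joined sequence, which is what lets a chain be recovered later by slicing between break positions.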
esmfold2_protein_complex.py ADDED

	@@ -0,0 +1,1241 @@

+from __future__ import annotations
+import io
+import itertools
+import random
+import re
+import warnings
+from dataclasses import asdict, dataclass, replace
+from functools import cached_property
+from pathlib import Path
+from subprocess import check_output
+from tempfile import TemporaryDirectory
+from typing import Any, Iterable, Sequence
+import biotite.structure as bs
+import brotli
+import msgpack
+import msgpack_numpy
+import numpy as np
+import torch
+from biotite.database import rcsb
+from biotite.file import InvalidFileError
+from biotite.structure.io.pdb import PDBFile
+from biotite.structure.io.pdbx import CIFCategory, CIFColumn, CIFData, CIFFile
+from biotite.structure.io.pdbx import set_structure as set_structure_pdbx
+from biotite.structure.io.pdbx.convert import _get_transformations, get_structure
+from biotite.structure.util import matrix_rotate
+from scipy.spatial import KDTree
+from . import esmfold2_residue_constants as residue_constants
+from .esmfold2_misc import slice_python_object_as_numpy
+from .esmfold2_affine3d import Affine3D
+from .esmfold2_aligner import Aligner
+from .esmfold2_atom_indexer import AtomIndexer
+from .esmfold2_metrics import compute_gdt_ts, compute_lddt_ca
+from .esmfold2_mmcif_parsing import MmcifWrapper, NoProteinError
+from .esmfold2_protein_chain import (
+    ProteinChain,
+    _str_key_to_int_key,
+    chain_to_ndarray,
+    index_by_atom_name,
+    infer_CB,
+)
+from .esmfold2_utils_types import PathOrBuffer
+msgpack_numpy.patch()
+SINGLE_LETTER_CHAIN_IDS = (
+    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
+)
+def _parse_operation_expression(expression):
+    """
+    Get successive operation steps (IDs) for the given
+    ``oper_expression``.
+    Form the cartesian product, if necessary.
+    Copied from biotite and fixed a bug
+    """
+    # Split groups by parentheses:
+    # use the opening parenthesis as delimiter
+    # and just remove the closing parenthesis
+    expressions_per_step = expression.replace(")", "").split("(")
+    expressions_per_step = [e for e in expressions_per_step if len(e) > 0]
+    # Important: Operations are applied from right to left
+    expressions_per_step.reverse()
+    operations = []
+    for expr in expressions_per_step:
+        cur_expr = expr.split(",")
+        cur_op = []
+        # Deal with e='1-10,20-30,40-50' type expressions
+        for e in cur_expr:
+            if "-" in e:
+                first, last = e.split("-")
+                cur_op.extend(str(id) for id in range(int(first), int(last) + 1))
+            else:
+                cur_op.append(e)
+        operations.append(cur_op)
+    # Cartesian product of operations
+    return list(itertools.product(*operations))
+def _apply_transformations_fast(chains, transformation_dict, operations):
+    """
+    Get subassembly by applying the given operations to the input
+    structure containing affected asym IDs.
+    """
+    # Additional first dimension for 'structure.repeat()'
+    results = []
+    # Apply corresponding transformation for each copy in the assembly
+    for c in chains:
+        for operation in operations:
+            coord = c.atom37_positions.copy()
+            # Execute for each transformation step
+            # in the operation expression
+            for op_step in operation:
+                T = transformation_dict[op_step]
+                # Rotate
+                coord = matrix_rotate(coord, T.rotation)
+                # Translate
+                coord += T.target_translation
+            new_chain = replace(c, atom37_positions=coord)
+            results.append(new_chain)
+    return results
+@dataclass
+class ProteinComplexMetadata:
+    entity_lookup: dict[int, int]
+    chain_lookup: dict[int, str]
+    mmcif: MmcifWrapper | None = None
+    # This is a dictionary that maps assembly ids to the list of unique chains
+    # in that assembly. Allows for usage of `switch_assembly`.
+    assembly_composition: dict[str, list[str]] | None = None
+@dataclass
+class DockQSingleScore:
+    native_chains: tuple[str, str]
+    DockQ: float
+    interface_rms: float
+    ligand_rms: float
+    fnat: float
+    fnonnat: float
+    clashes: float
+    F1: float
+    DockQ_F1: float
+@dataclass
+class DockQResult:
+    total_dockq: float
+    native_interfaces: int
+    chain_mapping: dict[str, str]
+    interfaces: dict[tuple[str, str], DockQSingleScore]
+    # zip(aligned.chain_iter(), native.chain_iter()) gives you the pairing
+    # aligned.rmsd(native) should give you a low rmsd irrespective of shuffling
+    aligned: ProteinComplex
+    aligned_rmsd: float
+@dataclass(frozen=True)
+class ProteinComplex:
+    """Dataclass with atom37 representation of an entire protein complex."""
+    id: str
+    sequence: str
+    entity_id: np.ndarray  # entities map to unique sequences
+    chain_id: np.ndarray  # multiple chains might share an entity id
+    sym_id: np.ndarray  # complexes might be copies of the same chain
+    residue_index: np.ndarray
+    insertion_code: np.ndarray
+    atom37_positions: np.ndarray
+    atom37_mask: np.ndarray
+    confidence: np.ndarray
+    # This metadata is parsed from the MMCIF file. For synthetic data, we do a best effort.
+    metadata: ProteinComplexMetadata
+    atom37_confidence: np.ndarray | None = None  # [L, 37] per-atom pLDDT
+    def __post_init__(self):
+        l = len(self.sequence)
+        assert self.atom37_positions.shape[0] == l, (self.atom37_positions.shape, l)
+        assert self.atom37_mask.shape[0] == l, (self.atom37_mask.shape, l)
+        assert self.residue_index.shape[0] == l, (self.residue_index.shape, l)
+        assert self.insertion_code.shape[0] == l, (self.insertion_code.shape, l)
+        assert self.confidence.shape[0] == l, (self.confidence.shape, l)
+        assert self.entity_id.shape[0] == l, (self.entity_id.shape, l)
+        assert self.chain_id.shape[0] == l, (self.chain_id.shape, l)
+        assert self.sym_id.shape[0] == l, (self.sym_id.shape, l)
+        if self.atom37_confidence is not None:
+            assert self.atom37_confidence.shape == self.atom37_mask.shape, (
+                self.atom37_confidence.shape,
+                self.atom37_mask.shape,
+            )
+    def __getitem__(self, idx: int | list[int] | slice | np.ndarray):
+        """This function slices protein complexes without consideration of chain breaks
+        NOTE: When slicing with a boolean mask, it's possible that the output array won't
+        be the expected length. This is because we do our best to preserve chainbreak tokens.
+        """
+        if isinstance(idx, int):
+            idx = [idx]
+        if isinstance(idx, list):
+            raise ValueError(
+                "ProteinComplex doesn't support indexing with lists of indices"
+            )
+        if isinstance(idx, np.ndarray):
+            is_chainbreak = np.asarray([s == "|" for s in self.sequence])
+            idx = idx.astype(bool) | is_chainbreak
+        complex = self._unsafe_slice(idx)
+        if len(complex) == 0:
+            return complex
+        # detect runs of chainbreaks by searching for instances of '||' in complex.sequence
+        chainbreak_runs = np.asarray(
+            [
+                complex.sequence[i : i + 2] == "||"
+                for i in range(len(complex.sequence) - 1)
+            ]
+            + [complex.sequence[-1] == "|"]
+        )
+        # We should remove as many chainbreaks as possible from the start of the sequence
+        for i in range(len(chainbreak_runs)):
+            if complex.sequence[i] == "|":
+                chainbreak_runs[i] = True
+            else:
+                break
+        complex = complex._unsafe_slice(~chainbreak_runs)
+        return complex
+    def _unsafe_slice(self, idx: int | list[int] | slice | np.ndarray):
+        sequence = slice_python_object_as_numpy(self.sequence, idx)
+        return replace(
+            self,
+            sequence=sequence,
+            entity_id=self.entity_id[..., idx],
+            chain_id=self.chain_id[..., idx],
+            sym_id=self.sym_id[..., idx],
+            residue_index=self.residue_index[..., idx],
+            insertion_code=self.insertion_code[..., idx],
+            atom37_positions=self.atom37_positions[..., idx, :, :],
+            atom37_mask=self.atom37_mask[..., idx, :],
+            confidence=self.confidence[..., idx],
+            atom37_confidence=self.atom37_confidence[..., idx, :]
+            if self.atom37_confidence is not None
+            else None,
+        )
+    def __len__(self):
+        return len(self.sequence)
+    @property
+    def num_chains(self):
+        return len(self.chain_boundaries)
+    @cached_property
+    def atoms(self) -> AtomIndexer:
+        return AtomIndexer(self, property="atom37_positions", dim=-2)
+    @cached_property
+    def atom_mask(self) -> AtomIndexer:
+        return AtomIndexer(self, property="atom37_mask", dim=-1)
+    @cached_property
+    def chain_lengths(self) -> np.ndarray:
+        return np.diff(self.chain_boundaries, axis=1).flatten()
+    @cached_property
+    def chain_boundaries(self) -> list[tuple[int, int]]:
+        cb = [-1]
+        for i, s in enumerate(self.sequence):
+            if s == "|":
+                cb.append(i)
+        cb.append(len(self))
+        return [(cb[i] + 1, cb[i + 1]) for i in range(len(cb) - 1)]
+    def get_chain_by_index(self, index: int) -> ProteinChain:
+        try:
+            start, end = self.chain_boundaries[index]
+            return self[start:end].as_chain()
+        except IndexError:
+            raise IndexError(f"Chain index {index} out of bounds")
+    def get_chain_by_id(
+        self, chain_id: str, sample_chain_if_duplicate: bool = True
+    ) -> ProteinChain:
+        valid_indices = [
+            index
+            for index, id_of_index in self.metadata.chain_lookup.items()
+            if id_of_index == chain_id
+        ]
+        if not valid_indices:
+            raise KeyError(f"Chain ID {chain_id} not found")
+        if sample_chain_if_duplicate:
+            index_to_return = random.choice(valid_indices)
+            return self.get_chain_by_index(index_to_return)
+        else:
+            if len(valid_indices) > 1:
+                raise ValueError(f"Multiple chains with chain ID {chain_id} found")
+            return self.get_chain_by_index(valid_indices[0])
+    def chain_iter(self) -> Iterable[ProteinChain]:
+        for start, end in self.chain_boundaries:
+            c = self[start:end]
+            yield c.as_chain()
+    def as_chain(self, force_conversion: bool = False) -> ProteinChain:
+        """Convert the ProteinComplex to a ProteinChain.
+        Args:
+            force_conversion (bool): Forces the conversion into a protein chain even if the complex has multiple chains.
+                The purpose of this is to use ProteinChain specific functions (like cbeta_contacts).
+        """
+        if not force_conversion:
+            assert len(np.unique(self.chain_id)) == 1, f"{self.id}"
+            assert len(np.unique(self.entity_id)) == 1, f"{self.id}"
+            if self.chain_id[0] not in self.metadata.chain_lookup:
+                warnings.warn("Chain ID not found in metadata, using 'A' as default")
+            if self.entity_id[0] not in self.metadata.entity_lookup:
+                warnings.warn("Entity ID not found in metadata, using None as default")
+            chain_id = self.metadata.chain_lookup.get(self.chain_id[0], "A")
+            entity_id = self.metadata.entity_lookup.get(self.entity_id[0], None)
+        else:
+            chain_id = "A"
+            entity_id = None
+        return ProteinChain(
+            id=self.id,
+            sequence=self.sequence,
+            chain_id=chain_id,
+            entity_id=entity_id,
+            atom37_positions=self.atom37_positions,
+            atom37_mask=self.atom37_mask,
+            residue_index=self.residue_index,
+            insertion_code=self.insertion_code,
+            confidence=self.confidence,
+            mmcif=self.metadata.mmcif,
+            atom37_confidence=self.atom37_confidence,
+        )
+    @classmethod
+    def from_pdb(
+        cls, path: PathOrBuffer, id: str | None = None, is_predicted: bool = False
+    ) -> "ProteinComplex":
+        atom_array = PDBFile.read(path).get_structure(
+            model=1, extra_fields=["b_factor"]
+        )
+        chains = []
+        for chain in bs.chain_iter(atom_array):
+            chain = chain[~chain.hetero]
+            if len(chain) == 0:
+                continue
+            chains.append(ProteinChain.from_atomarray(chain, id, is_predicted))
+        return ProteinComplex.from_chains(chains)
+    def to_pdb(self, path: PathOrBuffer, include_insertions: bool = True):
+        atom_array = None
+        for chain in self.chain_iter():
+            carr = (
+                chain.atom_array
+                if include_insertions
+                else chain.atom_array_no_insertions
+            )
+            atom_array = carr if atom_array is None else atom_array + carr
+        f = PDBFile()
+        f.set_structure(atom_array)
+        f.write(path)
+    def to_pdb_string(self, include_insertions: bool = True) -> str:
+        buf = io.StringIO()
+        self.to_pdb(buf, include_insertions=include_insertions)
+        buf.seek(0)
+        return buf.read()
+    def normalize_chain_ids_for_pdb(self):
+        # Since PDB files have 1-letter chain IDs and don't support the idea of a symmetric index,
+        # we can normalize it instead which might be necessary for DockQ and to_pdb.
+        ids = SINGLE_LETTER_CHAIN_IDS
+        chains = []
+        for i, chain in enumerate(self.chain_iter()):
+            # Check before indexing: ids[i] would raise IndexError once i == len(ids).
+            if i >= len(ids):
+                raise RuntimeError("Too many chains to write to PDB file")
+            chain = replace(chain, chain_id=ids[i])
+            chains.append(chain)
+        return ProteinComplex.from_chains(chains)
+    def find_assembly_ids_with_chain(self, id: str) -> list[str]:
+        good_chains = []
+        if (comp := self.metadata.assembly_composition) is not None:
+            for assembly_id, chain_ids in comp.items():
+                if id in chain_ids:
+                    good_chains.append(assembly_id)
+        else:
+            raise ValueError(
+                "Cannot switch assemblies on this ProteinComplex, you must create the assembly from mmcif to support this"
+            )
+        return good_chains
+    def switch_assembly(self, id: str):
+        assert self.metadata.mmcif is not None
+        return get_assembly_fast(self.metadata.mmcif, assembly_id=id)
+    def state_dict(self, backbone_only=False, json_serializable=False):
+        """This state dict is optimized for storage, so it turns things to fp16 whenever
+        possible. Note that we also only support int32 residue indices, I'm hoping we don't
+        need more than 2**31 residues..."""
+        dct = {k: v for k, v in vars(self).items()}
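+        if backbone_only:
+            # Copy first: vars(self) exposes the live array, so masking in place would alter self.
+            dct["atom37_mask"] = dct["atom37_mask"].copy()
+            dct["atom37_mask"][:, 3:] = False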
+        dct["atom37_positions"] = dct["atom37_positions"][dct["atom37_mask"]]
+        if dct.get("atom37_confidence") is not None:
+            dct["atom37_confidence"] = dct["atom37_confidence"][dct["atom37_mask"]]
+        else:
+            dct.pop("atom37_confidence", None)
+        for k, v in dct.items():
+            if isinstance(v, np.ndarray):
+                match v.dtype:
+                    case np.int64:
+                        dct[k] = v.astype(np.int32)
+                    case np.float64 | np.float32:
+                        dct[k] = v.astype(np.float16)
+                    case _:
+                        pass
+                if json_serializable:
+                    dct[k] = v.tolist()
+            elif isinstance(v, ProteinComplexMetadata):
+                dct[k] = asdict(v)
+        dct["metadata"]["mmcif"] = None
+        # These can be populated with non-serializable objects and are not needed for reconstruction
+        dct.pop("atoms", None)
+        dct.pop("atom_mask", None)
+        dct.pop("per_chain_kd_trees", None)
+        return dct
+    def to_blob(self, backbone_only=False) -> bytes:
+        return brotli.compress(msgpack.dumps(self.state_dict(backbone_only)), quality=5)
+    @classmethod
+    def from_state_dict(cls, dct):
+        # Note: assembly_composition is *supposed* to have string keys.
+        dct = _str_key_to_int_key(dct, ignore_keys=["assembly_composition"])
+        for k, v in dct.items():
+            if isinstance(v, list):
+                dct[k] = np.array(v)
+        atom37 = np.full((*dct["atom37_mask"].shape, 3), np.nan)
+        atom37[dct["atom37_mask"]] = dct["atom37_positions"]
+        dct["atom37_positions"] = atom37
+        if "atom37_confidence" in dct:
+            atom37_conf = np.full(dct["atom37_mask"].shape, np.nan, dtype=np.float32)
+            atom37_conf[dct["atom37_mask"]] = dct["atom37_confidence"]
+            dct["atom37_confidence"] = atom37_conf
+        dct = {
+            k: (
+                v.astype(np.float32)
+                if k in ["atom37_positions", "confidence", "atom37_confidence"]
+                else v
+            )
+            for k, v in dct.items()
+        }
+        if "chain_boundaries" in dct:
+            del dct["chain_boundaries"]
+        if "chain_boundaries" in dct["metadata"]:
+            del dct["metadata"]["chain_boundaries"]
+        dct["metadata"] = ProteinComplexMetadata(**dct["metadata"])
+        return cls(**dct)
+    @classmethod
+    def from_blob(cls, input: Path | str | io.BytesIO | bytes):
+        """NOTE(@zlin): blob + sparse coding + brotli + fp16 reduces memory
+        of chains from 52G/1M chains to 20G/1M chains, I think this is a good first
+        shot at compressing and dumping chains to disk. I'm sure there's better ways."""
+        match input:
+            case Path() | str():
+                bytes = Path(input).read_bytes()
+            case io.BytesIO():
+                bytes = input.getvalue()
+            case _:
+                bytes = input
+        return cls.from_state_dict(
+            msgpack.loads(brotli.decompress(bytes), strict_map_key=False)
+        )
+    @classmethod
+    def from_rcsb(cls, pdb_id: str, keep_source: bool = False) -> ProteinComplex:
+        f: io.StringIO = rcsb.fetch(pdb_id, "cif")  # type: ignore
+        return cls.from_mmcif(f, id=pdb_id, keep_source=keep_source, is_predicted=False)
+    @classmethod
+    def from_mmcif(
+        cls,
+        path: PathOrBuffer,
+        id: str | None = None,
+        assembly_id: str | None = None,
+        is_predicted: bool = False,
+        keep_source: bool = False,
+    ):
+        """Return a ProteinComplex object from an mmcif file.
+        TODO(@zeming): there's actually multiple complexes per file, but for ease of implementation,
+        we only consider the first defined complex!
+        Args:
+            path (str | Path | io.TextIO): Path or buffer to read mmcif file from. Should be uncompressed.
+            id (str, optional): String identifier to assign to structure. Will attempt to infer otherwise.
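+            assembly_id (str, optional): Assembly to build; forwarded to get_assembly_fast.
+            is_predicted (bool): If True, reads b factor as the confidence readout. Default: False.
+                NOTE: accepted but not forwarded by this method.
+            keep_source (bool): Accepted but not forwarded by this method. Default: False.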
+        """
+        mmcif = MmcifWrapper.read(path, id)
+        return get_assembly_fast(mmcif, assembly_id=assembly_id)
+    @classmethod
+    def from_chains(
+        cls,
+        chains: Sequence[ProteinChain],
+        mmcif: MmcifWrapper | None = None,
+        all_assembly_metadata_dictionary: dict[str, list[str]] | None = None,
+    ):
+        if not chains:
+            raise ValueError(
+                "Cannot create a ProteinComplex from an empty list of chains"
+            )
+        # TODO(roshan): Make a proper protein complex class
+        def join_arrays(arrays: Sequence[np.ndarray], sep: np.ndarray):
+            full_array = []
+            for array in arrays:
+                full_array.append(array)
+                full_array.append(sep)
+            full_array = full_array[:-1]
+            return np.concatenate(full_array, 0)
+        sep_tokens = {
+            "residue_index": np.array([-1]),
+            "insertion_code": np.array([""]),
+            "atom37_positions": np.full([1, 37, 3], np.nan),
+            "atom37_mask": np.zeros([1, 37], dtype=bool),
+            "confidence": np.array([0]),
+        }
+        any_has_atom37_conf = any(c.atom37_confidence is not None for c in chains)
+        if any_has_atom37_conf:
+            sep_tokens["atom37_confidence"] = np.full([1, 37], np.nan, dtype=np.float32)
+        def _get_chain_attr(chain: ProteinChain, name: str) -> np.ndarray:
+            val = getattr(chain, name)
+            if val is None and name == "atom37_confidence":
+                return np.full([len(chain), 37], np.nan, dtype=np.float32)
+            return val
+        array_args: dict[str, np.ndarray] = {
+            name: join_arrays([_get_chain_attr(chain, name) for chain in chains], sep)
+            for name, sep in sep_tokens.items()
+        }
+        multimer_arrays = []
+        chain2num_max = -1
+        chain2num = {}
+        ent2num_max = -1
+        ent2num = {}
+        total_index = 0
+        for i, c in enumerate(chains):
+            num_res = c.residue_index.shape[0]
+            if c.chain_id not in chain2num:
+                chain2num[c.chain_id] = (chain2num_max := chain2num_max + 1)
+            chain_id_array = np.full([num_res], chain2num[c.chain_id], dtype=np.int64)
+            if c.entity_id is None:
+                entity_num = (ent2num_max := ent2num_max + 1)
+            else:
+                if c.entity_id not in ent2num:
+                    ent2num[c.entity_id] = (ent2num_max := ent2num_max + 1)
+                entity_num = ent2num[c.entity_id]
+            entity_id_array = np.full([num_res], entity_num, dtype=np.int64)
+            sym_id_array = np.full([num_res], i, dtype=np.int64)
+            multimer_arrays.append(
+                {
+                    "chain_id": chain_id_array,
+                    "entity_id": entity_id_array,
+                    "sym_id": sym_id_array,
+                }
+            )
+            total_index += num_res + 1
+        sep = np.array([-1])
+        update = {
+            name: join_arrays([dct[name] for dct in multimer_arrays], sep=sep)
+            for name in ["chain_id", "entity_id", "sym_id"]
+        }
+        array_args.update(update)
+        metadata = ProteinComplexMetadata(
+            mmcif=mmcif,
+            chain_lookup={v: k for k, v in chain2num.items()},
+            entity_lookup={v: k for k, v in ent2num.items()},
+            assembly_composition=all_assembly_metadata_dictionary,
+        )
+        return cls(
+            id=chains[0].id,
+            sequence=residue_constants.CHAIN_BREAK_TOKEN.join(
+                chain.sequence for chain in chains
+            ),
+            metadata=metadata,
+            **array_args,
+        )
+    def infer_oxygen(self) -> ProteinComplex:
+        """Oxygen position is fixed given N, CA, C atoms. Infer it if not provided."""
+        O_missing_indices = np.argwhere(
+            ~np.isfinite(self.atoms["O"]).all(axis=1)
+        ).squeeze()
+        O_vector = torch.tensor([0.6240, -1.0613, 0.0103], dtype=torch.float32)
+        N, CA, C = torch.from_numpy(self.atoms[["N", "CA", "C"]]).float().unbind(dim=1)
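+        # N is (L, 3); roll() without dims flattens first, so a shift of -3
+        # moves every row up by one residue (row i then holds the next residue's N).
+        N = torch.roll(N, -3)
+        N[..., -1, :] = torch.nan
+        # Get the frame defined by the CA, C and next-residue N atoms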
+        frames = Affine3D.from_graham_schmidt(CA, C, N)
+        O = frames.apply(O_vector)
+        atom37_positions = self.atom37_positions.copy()
+        atom37_mask = self.atom37_mask.copy()
+        atom37_positions[O_missing_indices, residue_constants.atom_order["O"]] = O[
+            O_missing_indices
+        ].numpy()
+        atom37_mask[O_missing_indices, residue_constants.atom_order["O"]] = ~np.isnan(
+            atom37_positions[O_missing_indices, residue_constants.atom_order["O"]]
+        ).any(-1)
+        new_chain = replace(
+            self, atom37_positions=atom37_positions, atom37_mask=atom37_mask
+        )
+        return new_chain
+    def infer_cbeta(self, infer_cbeta_for_glycine: bool = False) -> ProteinComplex:
+        """Return a new complex with inferred CB atoms at all residues except GLY.
+        Args:
+            infer_cbeta_for_glycine (bool): If True, infers a beta carbon for glycine
+                residues, even though that residue doesn't have one.  Default off.
+                NOTE(rverkuil): The reason for having this switch in the first place
+                is that sometimes we want a (inferred) CB coordinate for every residue,
+                for example for making a pairwise distance matrix, or doing an RMSD
+                calculation between two designs for a given structural template, w/
+                CB atoms.
+        """
+        atom37_positions = self.atom37_positions.copy()
+        atom37_mask = self.atom37_mask.copy()
+        N, CA, C = np.moveaxis(self.atoms[["N", "CA", "C"]], 1, 0)
+        # See usage in trDesign codebase.
+        # https://github.com/gjoni/trDesign/blob/f2d5930b472e77bfacc2f437b3966e7a708a8d37/02-GD/utils.py#L140
+        inferred_cbeta_positions = infer_CB(C, N, CA, 1.522, 1.927, -2.143)
+        if not infer_cbeta_for_glycine:
+            inferred_cbeta_positions[np.array(list(self.sequence)) == "G", :] = np.nan
+        atom37_positions[:, residue_constants.atom_order["CB"]] = (
+            inferred_cbeta_positions
+        )
+        atom37_mask[:, residue_constants.atom_order["CB"]] = ~np.isnan(
+            atom37_positions[:, residue_constants.atom_order["CB"]]
+        ).any(-1)
+        new_chain = replace(
+            self, atom37_positions=atom37_positions, atom37_mask=atom37_mask
+        )
+        return new_chain
+    @classmethod
+    def from_open_source(cls, pc: ProteinComplex):
+        # TODO(@zeming): deprecated, should delete
+        return pc
+    @classmethod
+    def concat(cls, objs: list[ProteinComplex]) -> ProteinComplex:
+        pdb_ids = [obj.id for obj in objs]
+        if len(set(pdb_ids)) > 1:
+            raise RuntimeError(
+                "Concatenation of protein complexes across different PDB ids is unsupported"
+            )
+        return ProteinComplex.from_chains(
+            list(itertools.chain.from_iterable(obj.chain_iter() for obj in objs))
+        )
+    def _sanity_check_complexes_are_comparable(self, other: ProteinComplex):
+        assert len(self) == len(other), "Protein complexes must have the same length"
+        assert len(list(self.chain_iter())) == len(
+            list(other.chain_iter())
+        ), "Protein complexes must have the same number of chains"
+    def rmsd(
+        self,
+        target: ProteinComplex,
+        also_check_reflection: bool = False,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        only_compute_backbone_rmsd: bool = False,
+        compute_chain_assignment: bool = True,
+    ):
+        """
+        Compute the RMSD between this protein complex and another.
+        Args:
+            target (ProteinComplex): The target (other) protein complex to compare to.
+            also_check_reflection (bool, optional): If True, also check if the reflection of the mobile atoms has a lower RMSD.
+            mobile_inds (list[int], optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], optional): The indices of the target atoms to align. These are NOT residue indices
+            only_compute_backbone_rmsd (bool, optional): If True, only compute the RMSD of the backbone atoms.
+            compute_chain_assignment (bool, optional): If True, first reorder and align this complex's chains to the target via `dockq`.
+        """
+        if compute_chain_assignment:
+            aligned = self.dockq(target).aligned
+        else:
+            aligned = self
+        aligner = Aligner(
+            aligned if mobile_inds is None else aligned[mobile_inds],
+            target if target_inds is None else target[target_inds],
+            only_compute_backbone_rmsd,
+        )
+        avg_rmsd = aligner.rmsd
+        if not also_check_reflection:
+            return avg_rmsd
+        aligner = Aligner(
+            aligned if mobile_inds is None else aligned[mobile_inds],
+            target if target_inds is None else target[target_inds],
+            only_compute_backbone_rmsd,
+            use_reflection=True,
+        )
+        avg_rmsd_neg = aligner.rmsd
+        return min(avg_rmsd, avg_rmsd_neg)
+    def lddt_ca(
+        self,
+        target: ProteinComplex,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        compute_chain_assignment: bool = True,
+        **kwargs,
+    ) -> float | np.ndarray:
+        """Compute the LDDT between this protein complex and another.
+        Arguments:
+            target (ProteinComplex): The other protein complex to compare to.
+            mobile_inds (list[int], np.ndarray, optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], np.ndarray, optional): The indices of the target atoms to align. These are NOT residue indices
+        Returns:
+            float | np.ndarray: The LDDT score between the two protein complexes, either
+                a single float or per-residue LDDT scores if `per_residue` is True.
+        """
+        if compute_chain_assignment:
+            aligned = self.dockq(target).aligned
+        else:
+            aligned = self
+        lddt = compute_lddt_ca(
+            torch.tensor(aligned.atom37_positions[mobile_inds]).unsqueeze(0),
+            torch.tensor(target.atom37_positions[target_inds]).unsqueeze(0),
+            torch.tensor(aligned.atom37_mask[mobile_inds]).unsqueeze(0),
+            **kwargs,
+        )
+        return float(lddt) if lddt.numel() == 1 else lddt.numpy().flatten()
+    def gdt_ts(
+        self,
+        target: ProteinComplex,
+        mobile_inds: list[int] | np.ndarray | None = None,
+        target_inds: list[int] | np.ndarray | None = None,
+        compute_chain_assignment: bool = True,
+        **kwargs,
+    ) -> float | np.ndarray:
+        """Compute the GDT_TS between this protein complex and another.
+        Arguments:
+            target (ProteinComplex): The other protein complex to compare to.
+            mobile_inds (list[int], np.ndarray, optional): The indices of the mobile atoms to align. These are NOT residue indices
+            target_inds (list[int], np.ndarray, optional): The indices of the target atoms to align. These are NOT residue indices
+        Returns:
+            float: The GDT_TS score between the two protein complexes.
+        """
+        if compute_chain_assignment:
+            aligned = self.dockq(target).aligned
+        else:
+            aligned = self
+        gdt_ts = compute_gdt_ts(
+            mobile=torch.tensor(
+                index_by_atom_name(aligned.atom37_positions[mobile_inds], "CA"),
+                dtype=torch.float32,
+            ).unsqueeze(0),
+            target=torch.tensor(
+                index_by_atom_name(target.atom37_positions[target_inds], "CA"),
+                dtype=torch.float32,
+            ).unsqueeze(0),
+            atom_exists_mask=torch.tensor(
+                index_by_atom_name(aligned.atom37_mask[mobile_inds], "CA", dim=-1)
+                & index_by_atom_name(target.atom37_mask[target_inds], "CA", dim=-1)
+            ).unsqueeze(0),
+            **kwargs,
+        )
+        return float(gdt_ts) if gdt_ts.numel() == 1 else gdt_ts.numpy().flatten()
+    def dockq(self, native: ProteinComplex):
+        # This function uses dockqv2 to compute the DockQ score. Because it does a mapping
+        # over all possible chains, it's quite slow. Be careful not to use this in an inference loop
+        # or something that requires fast scoring. It defaults to 8 CPUs.
+        #
+        # TODO(@zeming): Because we haven't properly implemented protein complexes for mmcif,
+        # if your protein has multi-letter or repeated chain IDs, this will fail. Please call
+        # pc = pc.normalize_chain_ids_for_pdb() before calling this function in that case (limit is 62 chains)
+        try:
+            pass
+        except BaseException:
+            raise RuntimeError(
+                "DockQ is not installed. Please update your environment."
+            )
+        self._sanity_check_complexes_are_comparable(native)
+        def sanity_check_chain_ids(pc: ProteinComplex):
+            ids = []
+            for i, chain in enumerate(pc.chain_iter()):
+                if i >= len(SINGLE_LETTER_CHAIN_IDS):
+                    raise ValueError("Too many chains to write to PDB file")
+                if len(chain.chain_id) > 1:
+                    raise ValueError(
+                        "We only support single-letter chain IDs for DockQ"
+                    )
+                ids.append(chain.chain_id)
+            if len(set(ids)) != len(ids):
+                raise ValueError(f"Duplicate chain IDs in protein complex: {ids}")
+            return ids
+        sanity_check_chain_ids(self)
+        sanity_check_chain_ids(native)
+        with TemporaryDirectory() as tdir:
+            dir = Path(tdir)
+            self.to_pdb(dir / "self.pdb")
+            native.to_pdb(dir / "native.pdb")
+            output = check_output(["DockQ", dir / "self.pdb", dir / "native.pdb"])
+        lines = output.decode().split("\n")
+        # Remove the header comments
+        start_index = next(
+            i for i, line in enumerate(lines) if line.startswith("Model")
+        )
+        lines = lines[start_index:]
+        result = {}
+        interfaces = []
+        current_interface: dict = {}
+        for line in lines:
+            line = line.strip()
+            if not line:
+                continue
+            if line.startswith("Model  :"):
+                pass  # Tmp pdb file location, it's useless...
+            elif line.startswith("Native :"):
+                pass  # Tmp pdb file location, it's useless...
+            elif line.startswith("Total DockQ"):
+                total_dockq_match = re.search(
+                    r"Total DockQ over (\d+) native interfaces: ([\d.]+) with (.*) model:native mapping",
+                    line,
+                )
+                if total_dockq_match:
+                    result["value"] = float(total_dockq_match.group(2))
+                    result["native interfaces"] = int(total_dockq_match.group(1))
+                    native_chains, self_chains = total_dockq_match.group(3).split(":")
+                    result["mapping"] = dict(zip(native_chains, self_chains))
+                else:
+                    raise RuntimeError(
+                        "Failed to parse DockQ output, maybe your DockQ version is wrong?"
+                    )
+            elif line.startswith("Native chains:"):
+                if current_interface:
+                    interfaces.append(current_interface)
+                current_interface = {
+                    "Native chains": line.split(":")[1].strip().split(", ")
+                }
+            elif line.startswith("Model chains:"):
+                current_interface["Model chains"] = (
+                    line.split(":")[1].strip().split(", ")
+                )
+            elif ":" in line:
+                key, value = line.split(":", 1)
+                current_interface[key.strip()] = float(value.strip())
+        if current_interface:
+            interfaces.append(current_interface)
+        def parse_dict(d: dict[str, Any]) -> DockQSingleScore:
+            return DockQSingleScore(
+                native_chains=tuple(d["Native chains"]),  # type: ignore
+                DockQ=float(d["DockQ"]),
+                interface_rms=float(d["irms"]),
+                ligand_rms=float(d["Lrms"]),  # Note the capitalization difference
+                fnat=float(d["fnat"]),
+                fnonnat=float(d["fnonnat"]),
+                clashes=float(d["clashes"]),
+                F1=float(d["F1"]),
+                DockQ_F1=float(d["DockQ_F1"]),
+            )
+        inv_mapping = {v: k for k, v in result["mapping"].items()}
+        self_chain_map = {c.chain_id: c for c in self.chain_iter()}
+        realigned = []
+        for chain in native.chain_iter():
+            realigned.append(self_chain_map[inv_mapping[chain.chain_id]])
+        realigned = ProteinComplex.from_chains(realigned)
+        aligner = Aligner(realigned, native)
+        realigned = aligner.apply(realigned)
+        result = DockQResult(
+            total_dockq=result["value"],
+            native_interfaces=result["native interfaces"],
+            chain_mapping=result["mapping"],
+            interfaces={
+                (i["Model chains"][0], i["Model chains"][1]): parse_dict(i)
+                for i in interfaces
+            },
+            aligned=realigned,
+            aligned_rmsd=aligner.rmsd,
+        )
+        return result
+    @cached_property
+    def per_chain_kd_trees(self):
+        # Iterate over chains, build KDTree for each chain
+        kdtrees = []
+        CA = self.atoms["CA"]
+        for start, end in self.chain_boundaries:
+            chain_CA = CA[start:end]
+            chain_CA = chain_CA[np.isfinite(chain_CA).all(axis=-1)]
+            kdtrees.append(KDTree(chain_CA))
+        return kdtrees
+    def chain_adjacency(self, cutoff: float = 8.0) -> np.ndarray:
+        # Compute adjacency matrix for protein complex
+        num_chains = self.num_chains
+        adjacency = np.zeros((num_chains, num_chains), dtype=bool)
+        for (i, kdtree), (j, kdtree2) in itertools.combinations(
+            enumerate(self.per_chain_kd_trees), 2
+        ):
+            adj = kdtree.query_ball_tree(kdtree2, cutoff)
+            any_is_adjacent = any(len(a) > 0 for a in adj)
+            adjacency[i, j] = any_is_adjacent
+            adjacency[j, i] = any_is_adjacent
+        return adjacency
+    def chain_adjacency_by_index(self, index: int, cutoff: float = 8.0) -> np.ndarray:
+        num_chains = len(self.chain_boundaries)
+        adjacency = np.zeros(num_chains, dtype=bool)
+        for i, kdtree in enumerate(self.per_chain_kd_trees):
+            if i == index:
+                continue
+            adj = kdtree.query_ball_tree(self.per_chain_kd_trees[index], cutoff)
+            adjacency[i] = any(len(a) > 0 for a in adj)
+        return adjacency
+    def add_prefix_to_chain_ids(self, prefix: str) -> ProteinComplex:
+        """Rename all chains in the complex with a given prefix.
+        Args:
+            prefix (str): The prefix to use for the new chain IDs. Each chain will be
+                named as "{prefix}_{chain_id}".
+        Returns:
+            ProteinComplex: A new protein complex with renamed chains.
+        """
+        new_chains = []
+        for chain in self.chain_iter():
+            # Create new chain with updated chain_id
+            new_chain = replace(chain, chain_id=f"{prefix}_{chain.chain_id}")
+            new_chains.append(new_chain)
+        return ProteinComplex.from_chains(new_chains)
+    def sasa(self, by_residue: bool = True):
+        chain = self.as_chain(force_conversion=True)
+        return chain.sasa(by_residue=by_residue)
+    def to_mmcif_string(self) -> str:
+        """Convert the ProteinComplex to mmCIF format.
+        Returns:
+            str: The mmCIF content as a string.
+        """
+        # Convert the ProteinComplex to a biotite AtomArray
+        # Collect all atoms from all chains
+        all_atoms = []
+        for chain in self.chain_iter():
+            chain_atom_array = chain.atom_array
+            # Convert AtomArray to list of atoms and add to collection
+            all_atoms.extend(chain_atom_array)
+        # Create combined AtomArray from all atoms
+        if not all_atoms:
+            raise ValueError("No atoms found in protein complex")
+        atom_array = bs.array(all_atoms)
+        # Create CIF file
+        f = CIFFile()
+        set_structure_pdbx(f, atom_array, data_block=self.id)
+        # Add entity information for proper mmCIF structure
+        self._add_entity_information(f)
+        # Write to string
+        output = io.StringIO()
+        f.write(output)
+        return output.getvalue()
+    def _add_entity_information(self, cif_file: CIFFile) -> None:
+        """Add entity, entity_poly, and struct_asym sections to CIF file."""
+        # Group chains by sequence to create unique entities
+        entity_map = {}  # sequence -> entity_id
+        chain_to_entity = {}  # chain_id -> entity_id
+        entity_sequences = {}  # entity_id -> sequence
+        entity_id_counter = 1
+        for chain in self.chain_iter():
+            sequence = chain.sequence
+            if sequence not in entity_map:
+                entity_map[sequence] = entity_id_counter
+                entity_sequences[entity_id_counter] = sequence
+                entity_id_counter += 1
+            chain_to_entity[chain.chain_id] = entity_map[sequence]
+        # Create _entity section
+        entity_ids = []
+        entity_types = []
+        entity_descriptions = []
+        for entity_id in sorted(entity_sequences.keys()):
+            entity_ids.append(str(entity_id))
+            entity_types.append("polymer")
+            entity_descriptions.append(f"Protein chain (entity {entity_id})")
+        cif_file.block["entity"] = CIFCategory(
+            name="entity",
+            columns={
+                "id": CIFColumn(
+                    data=CIFData(array=np.array(entity_ids), dtype=np.str_)
+                ),
+                "type": CIFColumn(
+                    data=CIFData(array=np.array(entity_types), dtype=np.str_)
+                ),
+                "pdbx_description": CIFColumn(
+                    data=CIFData(array=np.array(entity_descriptions), dtype=np.str_)
+                ),
+            },
+        )
+        # Create _entity_poly section
+        poly_entity_ids = []
+        poly_types = []
+        poly_nstd_linkages = []
+        poly_sequences = []
+        for entity_id in sorted(entity_sequences.keys()):
+            poly_entity_ids.append(str(entity_id))
+            poly_types.append("polypeptide(L)")
+            poly_nstd_linkages.append("no")
+            poly_sequences.append(entity_sequences[entity_id])
+        cif_file.block["entity_poly"] = CIFCategory(
+            name="entity_poly",
+            columns={
+                "entity_id": CIFColumn(
+                    data=CIFData(array=np.array(poly_entity_ids), dtype=np.str_)
+                ),
+                "type": CIFColumn(
+                    data=CIFData(array=np.array(poly_types), dtype=np.str_)
+                ),
+                "nstd_linkage": CIFColumn(
+                    data=CIFData(array=np.array(poly_nstd_linkages), dtype=np.str_)
+                ),
+                "pdbx_seq_one_letter_code": CIFColumn(
+                    data=CIFData(array=np.array(poly_sequences), dtype=np.str_)
+                ),
+            },
+        )
+        # Create _struct_asym section
+        asym_ids = []
+        asym_entity_ids = []
+        asym_details = []
+        for chain in self.chain_iter():
+            asym_ids.append(chain.chain_id)
+            asym_entity_ids.append(str(chain_to_entity[chain.chain_id]))
+            asym_details.append("")
+        cif_file.block["struct_asym"] = CIFCategory(
+            name="struct_asym",
+            columns={
+                "id": CIFColumn(data=CIFData(array=np.array(asym_ids), dtype=np.str_)),
+                "entity_id": CIFColumn(
+                    data=CIFData(array=np.array(asym_entity_ids), dtype=np.str_)
+                ),
+                "details": CIFColumn(
+                    data=CIFData(array=np.array(asym_details), dtype=np.str_)
+                ),
+            },
+        )
+def get_assembly_fast(
+    mmcif: MmcifWrapper,
+    assembly_id=None,
+    model=None,
+    data_block=None,
+    altloc="first",
+    use_author_fields=True,
+):
+    pdbx_file = mmcif.raw
+    if pdbx_file is None:
+        raise InvalidFileError("No mmCIF data loaded")
+    assembly_gen_category = pdbx_file.block["pdbx_struct_assembly_gen"]
+    if assembly_gen_category is None:
+        raise InvalidFileError("File has no 'pdbx_struct_assembly_gen' category")
+    struct_oper_category = pdbx_file.block["pdbx_struct_oper_list"]
+    if struct_oper_category is None:
+        raise InvalidFileError("File has no 'pdbx_struct_oper_list' category")
+    if assembly_id is None:
+        assembly_id = assembly_gen_category["assembly_id"].data.array[0]
+    elif assembly_id not in assembly_gen_category["assembly_id"].data.array:
+        raise KeyError(f"File has no Assembly ID '{assembly_id}'")
+    ### Calculate all possible transformations
+    transformations = _get_transformations(struct_oper_category)
+    ### Get structure according to additional parameters
+    structure = get_structure(
+        pdbx_file, model, data_block, altloc, ["label_asym_id"], use_author_fields
+    )[0]  # type: ignore
+    # TODO(@zeming) This line will remove all non-protein structural elements,
+    # we should remove this when we want to parse these too.
+    structure: bs.AtomArray = structure[
+        bs.filter_amino_acids(structure) & ~structure.hetero  # type: ignore
+    ]
+    if len(structure) == 0:
+        raise NoProteinError
+    unique_asym_ids = np.unique(structure.label_asym_id)  # type: ignore
+    asym2chain = {}
+    asym2auth = {}
+    for asym_id in unique_asym_ids:
+        sub_structure: bs.AtomArray = structure[structure.label_asym_id == asym_id]  # type: ignore
+        chain_id: str = sub_structure[0].chain_id  # type: ignore
+        (
+            sequence,
+            atom_positions,
+            atom_mask,
+            residue_index,
+            insertion_code,
+            confidence,
+            entity_id,
+        ) = chain_to_ndarray(sub_structure, mmcif, chain_id, False)
+        asym2chain[asym_id] = ProteinChain(
+            id=mmcif.id or "unknown",
+            sequence=sequence,
+            chain_id=chain_id,
+            entity_id=entity_id,
+            atom37_positions=atom_positions,
+            atom37_mask=atom_mask,
+            residue_index=residue_index,
+            insertion_code=insertion_code,
+            confidence=confidence,
+            mmcif=None,
+        )
+        asym2auth[asym_id] = chain_id
+    ### Get transformations and apply them to the affected asym IDs
+    assembly = []
+    assembly_id_dict: dict[str, list[str]] = {}
+    # Process the target assembly ID
+    for aid, op_expr, asym_id_expr in zip(
+        assembly_gen_category["assembly_id"].data.array,
+        assembly_gen_category["oper_expression"].data.array,
+        assembly_gen_category["asym_id_list"].data.array,
+    ):
+        if aid == assembly_id:
+            # Parse operations and asym IDs for this specific entry
+            operations = _parse_operation_expression(op_expr)
+            asym_ids = asym_id_expr.split(",")
+            # Filter affected asym IDs to only protein chains, preserving order
+            sub_structures = [
+                asym2chain[asym_id] for asym_id in asym_ids if asym_id in asym2chain
+            ]
+            # Apply transformations
+            sub_assembly = _apply_transformations_fast(
+                sub_structures, transformations, operations
+            )
+            assembly.extend(sub_assembly)
+            # Build assembly_id_dict for this entry
+            assembly_id_dict[aid] = assembly_id_dict.get(aid, []) + [
+                asym2auth[id_] for id_ in asym_ids if id_ in asym2auth
+            ]
+    if len(assembly) == 0:
+        raise NoProteinError
+    return ProteinComplex.from_chains(assembly, mmcif, assembly_id_dict)
+def protein_chain_to_protein_complex(chain: ProteinChain) -> ProteinComplex:
+    if "|" not in chain.sequence:
+        return ProteinComplex.from_chains([chain])
+    chain_breaks = np.array(list(chain.sequence)) == "|"
+    chain_break_inds = np.where(chain_breaks)[0]
+    chain_break_inds = np.concatenate([[0], chain_break_inds, [len(chain)]])
+    chain_break_inds = np.array(list(zip(chain_break_inds[:-1], chain_break_inds[1:])))
+    complex_chains = []
+    for start, end in chain_break_inds:
+        if start != 0:
+            start += 1
+        complex_chains.append(chain[start:end])
+    complex_chains = [
+        ProteinChain.from_atom37(
+            chain.atom37_positions,
+            sequence=chain.sequence,
+            chain_id=SINGLE_LETTER_CHAIN_IDS[i],
+            entity_id=i,
+        )
+        for i, chain in enumerate(complex_chains)
+    ]
+    return ProteinComplex.from_chains(complex_chains)
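
The chain-break handling in `protein_chain_to_protein_complex` above can be illustrated on a plain string. This is a minimal sketch, not part of the commit: `split_on_chain_breaks` is a hypothetical helper, and it assumes only that `|` marks a chain break, as in the function above.

```python
import numpy as np


def split_on_chain_breaks(sequence: str, sep: str = "|") -> list[tuple[int, int]]:
    """Hypothetical helper: half-open (start, end) spans of each chain in `sequence`."""
    # Positions of the separator characters.
    breaks = np.where(np.array(list(sequence)) == sep)[0]
    # Bracket the break positions with the sequence start and end.
    bounds = np.concatenate([[0], breaks, [len(sequence)]])
    spans = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if start != 0:
            start += 1  # step past the separator itself
        spans.append((int(start), int(end)))
    return spans


sequence = "MKT|GAVL|S"
spans = split_on_chain_breaks(sequence)
segments = [sequence[s:e] for s, e in spans]
print(spans)
print(segments)
```

In this sketch the separator is skipped only when a span does not start at index 0, so a sequence that begins with `|` would keep that leading separator in its second span.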

esmfold2_protein_structure.py ADDED

	@@ -0,0 +1,307 @@

+from __future__ import annotations
+from typing import Tuple, TypeVar
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch import Tensor
+from torch.amp import autocast  # type: ignore
+from . import esmfold2_residue_constants as residue_constants
+from .esmfold2_misc import unbinpack
+from .esmfold2_affine3d import Affine3D
+ArrayOrTensor = TypeVar("ArrayOrTensor", np.ndarray, Tensor)
+def index_by_atom_name(
+    atom37: ArrayOrTensor, atom_names: str | list[str], dim: int = -2
+) -> ArrayOrTensor:
+    squeeze = False
+    if isinstance(atom_names, str):
+        atom_names = [atom_names]
+        squeeze = True
+    indices = [residue_constants.atom_order[atom_name] for atom_name in atom_names]
+    dim = dim % atom37.ndim
+    index = tuple(slice(None) if dim != i else indices for i in range(atom37.ndim))
+    result = atom37[index]  # type: ignore
+    if squeeze:
+        result = result.squeeze(dim)
+    return result
+def infer_cbeta_from_atom37(
+    atom37: ArrayOrTensor, L: float = 1.522, A: float = 1.927, D: float = -2.143
+):
+    """
+    Inspired by a util in trDesign:
+    https://github.com/gjoni/trDesign/blob/f2d5930b472e77bfacc2f437b3966e7a708a8d37/02-GD/utils.py#L92
+    input:  atom37, (L)ength, (A)ngle, and (D)ihedral
+    output: 4th coord
+    """
+    N = index_by_atom_name(atom37, "N", dim=-2)
+    CA = index_by_atom_name(atom37, "CA", dim=-2)
+    C = index_by_atom_name(atom37, "C", dim=-2)
+    if isinstance(atom37, np.ndarray):
+        def normalize(x: ArrayOrTensor):
+            return x / np.linalg.norm(x, axis=-1, keepdims=True)
+        cross = np.cross
+    else:
+        normalize = F.normalize  # type: ignore
+        cross = torch.cross
+    with np.errstate(invalid="ignore"):  # inf - inf = nan is ok here
+        vec_nca = N - CA
+        vec_nc = N - C
+    nca = normalize(vec_nca)
+    n = normalize(cross(vec_nc, nca))  # type: ignore
+    m = [nca, cross(n, nca), n]
+    d = [L * np.cos(A), L * np.sin(A) * np.cos(D), -L * np.sin(A) * np.sin(D)]
+    return CA + sum([m * d for m, d in zip(m, d)])
+@torch.no_grad()
+@autocast("cuda", enabled=False)
+def compute_alignment_tensors(
+    mobile: torch.Tensor,
+    target: torch.Tensor,
+    atom_exists_mask: torch.Tensor | None = None,
+    sequence_id: torch.Tensor | None = None,
+):
+    """
+    Align two batches of structures with support for masking invalid atoms using PyTorch.
+    Args:
+    - mobile (torch.Tensor): Batch of coordinates of structure to be superimposed in shape (B, N, 3)
+    - target (torch.Tensor): Batch of coordinates of structure that is fixed in shape (B, N, 3)
+    - atom_exists_mask (torch.Tensor, optional): Mask for whether an atom exists of shape (B, N)
+    - sequence_id (torch.Tensor, optional): Sequence id tensor for binpacking.
+    Returns:
+    - centered_mobile (torch.Tensor): Batch of coordinates of structure centered mobile (B, N, 3)
+    - centroid_mobile (torch.Tensor): Batch of coordinates of mobile centroid (B, 3)
+    - centered_target (torch.Tensor): Batch of coordinates of structure centered target (B, N, 3)
+    - centroid_target (torch.Tensor): Batch of coordinates of target centroid (B, 3)
+    - rotation_matrix (torch.Tensor): Batch of coordinates of rotation matrix (B, 3, 3)
+    - num_valid_atoms (torch.Tensor): Batch of number of valid atoms for alignment (B,)
+    """
+    # Unpack bin-packed inputs (if any), then ensure both batches have the same shape
+    if sequence_id is not None:
+        mobile = unbinpack(mobile, sequence_id, pad_value=torch.nan)
+        target = unbinpack(target, sequence_id, pad_value=torch.nan)
+        if atom_exists_mask is not None:
+            atom_exists_mask = unbinpack(atom_exists_mask, sequence_id, pad_value=0)
+        else:
+            atom_exists_mask = torch.isfinite(target).all(-1)
+    assert mobile.shape == target.shape, "Batch structure shapes do not match!"
+    # Number of structures in the batch
+    batch_size = mobile.shape[0]
+    # if [B, Nres, Natom, 3], resize
+    if mobile.dim() == 4:
+        mobile = mobile.view(batch_size, -1, 3)
+    if target.dim() == 4:
+        target = target.view(batch_size, -1, 3)
+    if atom_exists_mask is not None and atom_exists_mask.dim() == 3:
+        atom_exists_mask = atom_exists_mask.view(batch_size, -1)
+    # Number of atoms
+    num_atoms = mobile.shape[1]
+    # Apply masks if provided
+    if atom_exists_mask is not None:
+        mobile = mobile.masked_fill(~atom_exists_mask.unsqueeze(-1), 0)
+        target = target.masked_fill(~atom_exists_mask.unsqueeze(-1), 0)
+    else:
+        atom_exists_mask = torch.ones(
+            batch_size, num_atoms, dtype=torch.bool, device=mobile.device
+        )
+    num_valid_atoms = atom_exists_mask.sum(dim=-1, keepdim=True)
+    # Compute centroids for each batch
+    centroid_mobile = mobile.sum(dim=-2, keepdim=True) / num_valid_atoms.unsqueeze(-1)
+    centroid_target = target.sum(dim=-2, keepdim=True) / num_valid_atoms.unsqueeze(-1)
+    # Handle potential division by zero if all atoms are invalid in a structure
+    centroid_mobile[num_valid_atoms == 0] = 0
+    centroid_target[num_valid_atoms == 0] = 0
+    # Center structures by subtracting centroids
+    centered_mobile = mobile - centroid_mobile
+    centered_target = target - centroid_target
+    centered_mobile = centered_mobile.masked_fill(~atom_exists_mask.unsqueeze(-1), 0)
+    centered_target = centered_target.masked_fill(~atom_exists_mask.unsqueeze(-1), 0)
+    # Compute covariance matrix for each batch
+    covariance_matrix = torch.matmul(centered_mobile.transpose(1, 2), centered_target)
+    # Singular Value Decomposition for each batch
+    u, _, v = torch.svd(covariance_matrix)
+    # Calculate rotation matrices for each batch
+    rotation_matrix = torch.matmul(u, v.transpose(1, 2))
+    return (
+        centered_mobile,
+        centroid_mobile,
+        centered_target,
+        centroid_target,
+        rotation_matrix,
+        num_valid_atoms,
+    )
+@torch.no_grad()
+@autocast("cuda", enabled=False)
+def compute_rmsd_no_alignment(
+    aligned: torch.Tensor,
+    target: torch.Tensor,
+    num_valid_atoms: torch.Tensor,
+    reduction: str = "batch",
+) -> torch.Tensor:
+    """
+    Compute RMSD between two batches of structures without alignment.
+    Args:
+    - aligned (torch.Tensor): Batch of coordinates of the already-superimposed structure in shape (B, N, 3)
+    - target (torch.Tensor): Batch of coordinates of structure that is fixed in shape (B, N, 3)
+    - num_valid_atoms (torch.Tensor): Batch of number of valid atoms for alignment (B,)
+    - reduction (str): One of "batch", "per_sample", "per_residue".
+    Returns:
+    If reduction == "batch":
+        (torch.Tensor): 0-dim, Average Root Mean Square Deviation between the structures for each batch
+    If reduction == "per_sample":
+        (torch.Tensor): (B,)-dim, Root Mean Square Deviation between the structures for each batch
+    If reduction == "per_residue":
+        (torch.Tensor): (B, N)-dim, Root Mean Square Deviation between the structures for each residue in the batch
+    """
+    if reduction not in ("per_residue", "per_sample", "batch"):
+        raise ValueError(f"Unrecognized reduction: '{reduction}'")
+    # Compute RMSD for each batch
+    diff = aligned - target
+    if reduction == "per_residue":
+        mean_squared_error = diff.square().view(diff.size(0), -1, 9).mean(dim=-1)
+    else:
+        mean_squared_error = diff.square().sum(dim=(1, 2)) / (
+            num_valid_atoms.squeeze(-1)
+        )
+    rmsd = torch.sqrt(mean_squared_error)
+    if reduction in ("per_sample", "per_residue"):
+        return rmsd
+    elif reduction == "batch":
+        avg_rmsd = rmsd.masked_fill(num_valid_atoms.squeeze(-1) == 0, 0).sum() / (
+            (num_valid_atoms > 0).sum() + 1e-8
+        )
+        return avg_rmsd
+    else:
+        raise ValueError(reduction)
+@torch.no_grad()
+@autocast("cuda", enabled=False)
+def compute_affine_and_rmsd(
+    mobile: torch.Tensor,
+    target: torch.Tensor,
+    atom_exists_mask: torch.Tensor | None = None,
+    sequence_id: torch.Tensor | None = None,
+) -> Tuple[Affine3D, torch.Tensor]:
+    """
+    Compute RMSD between two batches of structures with support for masking invalid atoms using PyTorch.
+    Args:
+    - mobile (torch.Tensor): Batch of coordinates of structure to be superimposed in shape (B, N, 3)
+    - target (torch.Tensor): Batch of coordinates of structure that is fixed in shape (B, N, 3)
+    - atom_exists_mask (torch.Tensor, optional): Mask for whether an atom exists of shape (B, N)
+    - sequence_id (torch.Tensor, optional): Sequence id tensor for binpacking.
+    Returns:
+    - affine (Affine3D): Transformation between mobile and target structure
+    - avg_rmsd (torch.Tensor): Average Root Mean Square Deviation between the structures for each batch
+    """
+    (
+        centered_mobile,
+        centroid_mobile,
+        centered_target,
+        centroid_target,
+        rotation_matrix,
+        num_valid_atoms,
+    ) = compute_alignment_tensors(
+        mobile=mobile,
+        target=target,
+        atom_exists_mask=atom_exists_mask,
+        sequence_id=sequence_id,
+    )
+    # Apply rotation to mobile centroid
+    translation = torch.matmul(-centroid_mobile, rotation_matrix) + centroid_target
+    affine = Affine3D.from_tensor_pair(
+        translation, rotation_matrix.unsqueeze(dim=-3).transpose(-2, -1)
+    )
+    # Apply transformation to centered structure to compute rmsd
+    rotated_mobile = torch.matmul(centered_mobile, rotation_matrix)
+    avg_rmsd = compute_rmsd_no_alignment(
+        rotated_mobile, centered_target, num_valid_atoms, reduction="batch"
+    )
+    return affine, avg_rmsd
+def compute_gdt_ts_no_alignment(
+    aligned: torch.Tensor,
+    target: torch.Tensor,
+    atom_exists_mask: torch.Tensor | None,
+    reduction: str = "batch",
+) -> torch.Tensor:
+    """
+    Compute GDT_TS between two batches of structures without alignment.
+    Args:
+    - aligned (torch.Tensor): Batch of coordinates of the already-superimposed structure in shape (B, N, 3)
+    - target (torch.Tensor): Batch of coordinates of structure that is fixed in shape (B, N, 3)
+    - atom_exists_mask (torch.Tensor): Mask for whether an atom exists, of shape (B, N). If None, atoms with finite target coordinates are treated as existing.
+    - reduction (str): One of "batch", "per_sample".
+    Returns:
+    If reduction == "batch":
+        (torch.Tensor): 0-dim, GDT_TS between the structures for each batch
+    If reduction == "per_sample":
+        (torch.Tensor): (B,)-dim, GDT_TS between the structures for each sample in the batch
+    """
+    if reduction not in ("per_sample", "batch"):
+        raise ValueError(f"Unrecognized reduction: '{reduction}'")
+    if atom_exists_mask is None:
+        atom_exists_mask = torch.isfinite(target).all(dim=-1)
+    deviation = torch.linalg.vector_norm(aligned - target, dim=-1)
+    num_valid_atoms = atom_exists_mask.sum(dim=-1)
+    # Compute GDT_TS
+    score = (
+        ((deviation < 1) * atom_exists_mask).sum(dim=-1) / num_valid_atoms
+        + ((deviation < 2) * atom_exists_mask).sum(dim=-1) / num_valid_atoms
+        + ((deviation < 4) * atom_exists_mask).sum(dim=-1) / num_valid_atoms
+        + ((deviation < 8) * atom_exists_mask).sum(dim=-1) / num_valid_atoms
+    ) * 0.25
+    if reduction == "batch":
+        return score.mean()
+    elif reduction == "per_sample":
+        return score
+    else:
+        raise ValueError(f"Unrecognized reduction: '{reduction}'")
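
Reviewer note: the GDT_TS scoring in `compute_gdt_ts_no_alignment` can be checked by hand on a toy case. The sketch below is a dependency-free illustration for a single, already-superimposed structure, not the batched PyTorch code from the diff. The function name `gdt_ts`, the `thresholds` parameter, and the choice to return 0.0 when no atom is valid are assumptions made for this example; the code in the diff divides by `num_valid_atoms` without such a guard.

```python
import math


def gdt_ts(aligned, target, mask=None, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS for one already-superimposed structure (hypothetical helper).

    aligned, target: equal-length sequences of (x, y, z) tuples.
    mask: optional sequence of booleans; False entries are ignored.
    """
    if mask is None:
        mask = [True] * len(target)
    # Per-atom Euclidean deviation, kept only where the atom exists.
    deviations = [
        math.dist(a, t) for a, t, keep in zip(aligned, target, mask) if keep
    ]
    if not deviations:
        # Assumption for this sketch: score an empty structure as 0.0.
        return 0.0
    # Fraction of atoms strictly within each cutoff, averaged over the cutoffs.
    fractions = [
        sum(d < cutoff for d in deviations) / len(deviations)
        for cutoff in thresholds
    ]
    return sum(fractions) / len(thresholds)


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
aligned = [(0.5, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0), (13.0, 0.0, 0.0)]

# Deviations are 0.5, 1.5, 3.0 and 10.0, so the per-cutoff fractions are
# 1/4, 2/4, 3/4 and 3/4, and their mean is 0.5625.
score = gdt_ts(aligned, target)

# Masking out the last atom leaves fractions 1/3, 2/3, 1 and 1 (mean 0.75).
masked_score = gdt_ts(aligned, target, mask=[True, True, True, False])
```

The strict `<` comparison mirrors the `deviation < 1`, `< 2`, `< 4`, `< 8` terms in the diff, so an atom sitting exactly on a cutoff does not count toward that cutoff.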

esmfold2_residue_constants.py ADDED

	@@ -0,0 +1,1224 @@

+# Copyright 2025 EvolutionaryScale
+# Copyright 2021 AlQuraishi Laboratory
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Constants used in AlphaFold."""
+import collections
+import functools
+from pathlib import Path
+from typing import List, Mapping, Tuple
+import numpy as np
+# import tree
+# Internal import (35fd).
+# Distance from one CA to next CA [trans configuration: omega = 180].
+ca_ca = 3.80209737096
+# Format: The list for each AA type contains chi1, chi2, chi3, chi4 in
+# this order (or a relevant subset from chi1 onwards). ALA and GLY don't have
+# chi angles so their chi angle lists are empty.
+chi_angles_atoms = {
+    "ALA": [],
+    # Chi5 in arginine is always 0 +- 5 degrees, so ignore it.
+    "ARG": [
+        ["N", "CA", "CB", "CG"],
+        ["CA", "CB", "CG", "CD"],
+        ["CB", "CG", "CD", "NE"],
+        ["CG", "CD", "NE", "CZ"],
+    ],
+    "ASN": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "OD1"]],
+    "ASP": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "OD1"]],
+    "CYS": [["N", "CA", "CB", "SG"]],
+    "GLN": [
+        ["N", "CA", "CB", "CG"],
+        ["CA", "CB", "CG", "CD"],
+        ["CB", "CG", "CD", "OE1"],
+    ],
+    "GLU": [
+        ["N", "CA", "CB", "CG"],
+        ["CA", "CB", "CG", "CD"],
+        ["CB", "CG", "CD", "OE1"],
+    ],
+    "GLY": [],
+    "HIS": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "ND1"]],
+    "ILE": [["N", "CA", "CB", "CG1"], ["CA", "CB", "CG1", "CD1"]],
+    "LEU": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "CD1"]],
+    "LYS": [
+        ["N", "CA", "CB", "CG"],
+        ["CA", "CB", "CG", "CD"],
+        ["CB", "CG", "CD", "CE"],
+        ["CG", "CD", "CE", "NZ"],
+    ],
+    "MET": [
+        ["N", "CA", "CB", "CG"],
+        ["CA", "CB", "CG", "SD"],
+        ["CB", "CG", "SD", "CE"],
+    ],
+    "PHE": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "CD1"]],
+    "PRO": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "CD"]],
+    "SER": [["N", "CA", "CB", "OG"]],
+    "THR": [["N", "CA", "CB", "OG1"]],
+    "TRP": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "CD1"]],
+    "TYR": [["N", "CA", "CB", "CG"], ["CA", "CB", "CG", "CD1"]],
+    "VAL": [["N", "CA", "CB", "CG1"]],
+    "UNK": [],
+}
+# If chi angles given in fixed-length array, this matrix determines how to mask
+# them for each AA type. The order is as per restype_order (see below).
+chi_angles_mask = [
+    [0.0, 0.0, 0.0, 0.0],  # ALA
+    [1.0, 1.0, 1.0, 1.0],  # ARG
+    [1.0, 1.0, 0.0, 0.0],  # ASN
+    [1.0, 1.0, 0.0, 0.0],  # ASP
+    [1.0, 0.0, 0.0, 0.0],  # CYS
+    [1.0, 1.0, 1.0, 0.0],  # GLN
+    [1.0, 1.0, 1.0, 0.0],  # GLU
+    [0.0, 0.0, 0.0, 0.0],  # GLY
+    [1.0, 1.0, 0.0, 0.0],  # HIS
+    [1.0, 1.0, 0.0, 0.0],  # ILE
+    [1.0, 1.0, 0.0, 0.0],  # LEU
+    [1.0, 1.0, 1.0, 1.0],  # LYS
+    [1.0, 1.0, 1.0, 0.0],  # MET
+    [1.0, 1.0, 0.0, 0.0],  # PHE
+    [1.0, 1.0, 0.0, 0.0],  # PRO
+    [1.0, 0.0, 0.0, 0.0],  # SER
+    [1.0, 0.0, 0.0, 0.0],  # THR
+    [1.0, 1.0, 0.0, 0.0],  # TRP
+    [1.0, 1.0, 0.0, 0.0],  # TYR
+    [1.0, 0.0, 0.0, 0.0],  # VAL
+    [0.0, 0.0, 0.0, 0.0],  # UNK
+]
+# The following chi angles are pi periodic: they can be rotated by a multiple
+# of pi without affecting the structure.
+chi_pi_periodic = [
+    [0.0, 0.0, 0.0, 0.0],  # ALA
+    [0.0, 0.0, 0.0, 0.0],  # ARG
+    [0.0, 0.0, 0.0, 0.0],  # ASN
+    [0.0, 1.0, 0.0, 0.0],  # ASP
+    [0.0, 0.0, 0.0, 0.0],  # CYS
+    [0.0, 0.0, 0.0, 0.0],  # GLN
+    [0.0, 0.0, 1.0, 0.0],  # GLU
+    [0.0, 0.0, 0.0, 0.0],  # GLY
+    [0.0, 0.0, 0.0, 0.0],  # HIS
+    [0.0, 0.0, 0.0, 0.0],  # ILE
+    [0.0, 0.0, 0.0, 0.0],  # LEU
+    [0.0, 0.0, 0.0, 0.0],  # LYS
+    [0.0, 0.0, 0.0, 0.0],  # MET
+    [0.0, 1.0, 0.0, 0.0],  # PHE
+    [0.0, 0.0, 0.0, 0.0],  # PRO
+    [0.0, 0.0, 0.0, 0.0],  # SER
+    [0.0, 0.0, 0.0, 0.0],  # THR
+    [0.0, 0.0, 0.0, 0.0],  # TRP
+    [0.0, 1.0, 0.0, 0.0],  # TYR
+    [0.0, 0.0, 0.0, 0.0],  # VAL
+    [0.0, 0.0, 0.0, 0.0],  # UNK
+]
+# Atoms positions relative to the 8 rigid groups, defined by the pre-omega, phi,
+# psi and chi angles:
+# 0: 'backbone group',
+# 1: 'pre-omega-group', (empty)
+# 2: 'phi-group', (currently empty, because it defines only hydrogens)
+# 3: 'psi-group',
+# 4,5,6,7: 'chi1,2,3,4-group'
+# The atom positions are relative to the axis-end-atom of the corresponding
+# rotation axis. The x-axis is in direction of the rotation axis, and the y-axis
+# is defined such that the dihedral-angle-defining atom (the last entry in
+# chi_angles_atoms above) is in the xy-plane (with a positive y-coordinate).
+# format: [atomname, group_idx, rel_position]
+rigid_group_atom_positions = {
+    "ALA": [
+        ["N", 0, (-0.525, 1.363, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, -0.000, -0.000)],
+        ["CB", 0, (-0.529, -0.774, -1.205)],
+        ["O", 3, (0.627, 1.062, 0.000)],
+    ],
+    "ARG": [
+        ["N", 0, (-0.524, 1.362, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.525, -0.000, -0.000)],
+        ["CB", 0, (-0.524, -0.778, -1.209)],
+        ["O", 3, (0.626, 1.062, 0.000)],
+        ["CG", 4, (0.616, 1.390, -0.000)],
+        ["CD", 5, (0.564, 1.414, 0.000)],
+        ["NE", 6, (0.539, 1.357, -0.000)],
+        ["NH1", 7, (0.206, 2.301, 0.000)],
+        ["NH2", 7, (2.078, 0.978, -0.000)],
+        ["CZ", 7, (0.758, 1.093, -0.000)],
+    ],
+    "ASN": [
+        ["N", 0, (-0.536, 1.357, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, -0.000, -0.000)],
+        ["CB", 0, (-0.531, -0.787, -1.200)],
+        ["O", 3, (0.625, 1.062, 0.000)],
+        ["CG", 4, (0.584, 1.399, 0.000)],
+        ["ND2", 5, (0.593, -1.188, 0.001)],
+        ["OD1", 5, (0.633, 1.059, 0.000)],
+    ],
+    "ASP": [
+        ["N", 0, (-0.525, 1.362, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.527, 0.000, -0.000)],
+        ["CB", 0, (-0.526, -0.778, -1.208)],
+        ["O", 3, (0.626, 1.062, -0.000)],
+        ["CG", 4, (0.593, 1.398, -0.000)],
+        ["OD1", 5, (0.610, 1.091, 0.000)],
+        ["OD2", 5, (0.592, -1.101, -0.003)],
+    ],
+    "CYS": [
+        ["N", 0, (-0.522, 1.362, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.524, 0.000, 0.000)],
+        ["CB", 0, (-0.519, -0.773, -1.212)],
+        ["O", 3, (0.625, 1.062, -0.000)],
+        ["SG", 4, (0.728, 1.653, 0.000)],
+    ],
+    "GLN": [
+        ["N", 0, (-0.526, 1.361, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, 0.000, 0.000)],
+        ["CB", 0, (-0.525, -0.779, -1.207)],
+        ["O", 3, (0.626, 1.062, -0.000)],
+        ["CG", 4, (0.615, 1.393, 0.000)],
+        ["CD", 5, (0.587, 1.399, -0.000)],
+        ["NE2", 6, (0.593, -1.189, -0.001)],
+        ["OE1", 6, (0.634, 1.060, 0.000)],
+    ],
+    "GLU": [
+        ["N", 0, (-0.528, 1.361, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, -0.000, -0.000)],
+        ["CB", 0, (-0.526, -0.781, -1.207)],
+        ["O", 3, (0.626, 1.062, 0.000)],
+        ["CG", 4, (0.615, 1.392, 0.000)],
+        ["CD", 5, (0.600, 1.397, 0.000)],
+        ["OE1", 6, (0.607, 1.095, -0.000)],
+        ["OE2", 6, (0.589, -1.104, -0.001)],
+    ],
+    "GLY": [
+        ["N", 0, (-0.572, 1.337, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.517, -0.000, -0.000)],
+        ["O", 3, (0.626, 1.062, -0.000)],
+    ],
+    "HIS": [
+        ["N", 0, (-0.527, 1.360, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.525, 0.000, 0.000)],
+        ["CB", 0, (-0.525, -0.778, -1.208)],
+        ["O", 3, (0.625, 1.063, 0.000)],
+        ["CG", 4, (0.600, 1.370, -0.000)],
+        ["CD2", 5, (0.889, -1.021, 0.003)],
+        ["ND1", 5, (0.744, 1.160, -0.000)],
+        ["CE1", 5, (2.030, 0.851, 0.002)],
+        ["NE2", 5, (2.145, -0.466, 0.004)],
+    ],
+    "ILE": [
+        ["N", 0, (-0.493, 1.373, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.527, -0.000, -0.000)],
+        ["CB", 0, (-0.536, -0.793, -1.213)],
+        ["O", 3, (0.627, 1.062, -0.000)],
+        ["CG1", 4, (0.534, 1.437, -0.000)],
+        ["CG2", 4, (0.540, -0.785, -1.199)],
+        ["CD1", 5, (0.619, 1.391, 0.000)],
+    ],
+    "LEU": [
+        ["N", 0, (-0.520, 1.363, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.525, -0.000, -0.000)],
+        ["CB", 0, (-0.522, -0.773, -1.214)],
+        ["O", 3, (0.625, 1.063, -0.000)],
+        ["CG", 4, (0.678, 1.371, 0.000)],
+        ["CD1", 5, (0.530, 1.430, -0.000)],
+        ["CD2", 5, (0.535, -0.774, 1.200)],
+    ],
+    "LYS": [
+        ["N", 0, (-0.526, 1.362, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, 0.000, 0.000)],
+        ["CB", 0, (-0.524, -0.778, -1.208)],
+        ["O", 3, (0.626, 1.062, -0.000)],
+        ["CG", 4, (0.619, 1.390, 0.000)],
+        ["CD", 5, (0.559, 1.417, 0.000)],
+        ["CE", 6, (0.560, 1.416, 0.000)],
+        ["NZ", 7, (0.554, 1.387, 0.000)],
+    ],
+    "MET": [
+        ["N", 0, (-0.521, 1.364, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.525, 0.000, 0.000)],
+        ["CB", 0, (-0.523, -0.776, -1.210)],
+        ["O", 3, (0.625, 1.062, -0.000)],
+        ["CG", 4, (0.613, 1.391, -0.000)],
+        ["SD", 5, (0.703, 1.695, 0.000)],
+        ["CE", 6, (0.320, 1.786, -0.000)],
+    ],
+    "PHE": [
+        ["N", 0, (-0.518, 1.363, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.524, 0.000, -0.000)],
+        ["CB", 0, (-0.525, -0.776, -1.212)],
+        ["O", 3, (0.626, 1.062, -0.000)],
+        ["CG", 4, (0.607, 1.377, 0.000)],
+        ["CD1", 5, (0.709, 1.195, -0.000)],
+        ["CD2", 5, (0.706, -1.196, 0.000)],
+        ["CE1", 5, (2.102, 1.198, -0.000)],
+        ["CE2", 5, (2.098, -1.201, -0.000)],
+        ["CZ", 5, (2.794, -0.003, -0.001)],
+    ],
+    "PRO": [
+        ["N", 0, (-0.566, 1.351, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.527, -0.000, 0.000)],
+        ["CB", 0, (-0.546, -0.611, -1.293)],
+        ["O", 3, (0.621, 1.066, 0.000)],
+        ["CG", 4, (0.382, 1.445, 0.0)],
+        # ['CD', 5, (0.427, 1.440, 0.0)],
+        ["CD", 5, (0.477, 1.424, 0.0)],  # manually made angle 2 degrees larger
+    ],
+    "SER": [
+        ["N", 0, (-0.529, 1.360, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.525, -0.000, -0.000)],
+        ["CB", 0, (-0.518, -0.777, -1.211)],
+        ["O", 3, (0.626, 1.062, -0.000)],
+        ["OG", 4, (0.503, 1.325, 0.000)],
+    ],
+    "THR": [
+        ["N", 0, (-0.517, 1.364, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, 0.000, -0.000)],
+        ["CB", 0, (-0.516, -0.793, -1.215)],
+        ["O", 3, (0.626, 1.062, 0.000)],
+        ["CG2", 4, (0.550, -0.718, -1.228)],
+        ["OG1", 4, (0.472, 1.353, 0.000)],
+    ],
+    "TRP": [
+        ["N", 0, (-0.521, 1.363, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.525, -0.000, 0.000)],
+        ["CB", 0, (-0.523, -0.776, -1.212)],
+        ["O", 3, (0.627, 1.062, 0.000)],
+        ["CG", 4, (0.609, 1.370, -0.000)],
+        ["CD1", 5, (0.824, 1.091, 0.000)],
+        ["CD2", 5, (0.854, -1.148, -0.005)],
+        ["CE2", 5, (2.186, -0.678, -0.007)],
+        ["CE3", 5, (0.622, -2.530, -0.007)],
+        ["NE1", 5, (2.140, 0.690, -0.004)],
+        ["CH2", 5, (3.028, -2.890, -0.013)],
+        ["CZ2", 5, (3.283, -1.543, -0.011)],
+        ["CZ3", 5, (1.715, -3.389, -0.011)],
+    ],
+    "TYR": [
+        ["N", 0, (-0.522, 1.362, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.524, -0.000, -0.000)],
+        ["CB", 0, (-0.522, -0.776, -1.213)],
+        ["O", 3, (0.627, 1.062, -0.000)],
+        ["CG", 4, (0.607, 1.382, -0.000)],
+        ["CD1", 5, (0.716, 1.195, -0.000)],
+        ["CD2", 5, (0.713, -1.194, -0.001)],
+        ["CE1", 5, (2.107, 1.200, -0.002)],
+        ["CE2", 5, (2.104, -1.201, -0.003)],
+        ["OH", 5, (4.168, -0.002, -0.005)],
+        ["CZ", 5, (2.791, -0.001, -0.003)],
+    ],
+    "VAL": [
+        ["N", 0, (-0.494, 1.373, -0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.527, -0.000, -0.000)],
+        ["CB", 0, (-0.533, -0.795, -1.213)],
+        ["O", 3, (0.627, 1.062, -0.000)],
+        ["CG1", 4, (0.540, 1.429, -0.000)],
+        ["CG2", 4, (0.533, -0.776, 1.203)],
+    ],
+    # Assume alanine positions for unknown AA
+    "UNK": [
+        ["N", 0, (-0.525, 1.363, 0.000)],
+        ["CA", 0, (0.000, 0.000, 0.000)],
+        ["C", 0, (1.526, -0.000, -0.000)],
+    ],
+}
+# A list of atoms (excluding hydrogen) for each AA type. PDB naming convention.
+residue_atoms = {
+    "ALA": ["C", "CA", "CB", "N", "O"],
+    "ARG": ["C", "CA", "CB", "CG", "CD", "CZ", "N", "NE", "O", "NH1", "NH2"],
+    "ASP": ["C", "CA", "CB", "CG", "N", "O", "OD1", "OD2"],
+    "ASN": ["C", "CA", "CB", "CG", "N", "ND2", "O", "OD1"],
+    "CYS": ["C", "CA", "CB", "N", "O", "SG"],
+    "GLU": ["C", "CA", "CB", "CG", "CD", "N", "O", "OE1", "OE2"],
+    "GLN": ["C", "CA", "CB", "CG", "CD", "N", "NE2", "O", "OE1"],
+    "GLY": ["C", "CA", "N", "O"],
+    "HIS": ["C", "CA", "CB", "CG", "CD2", "CE1", "N", "ND1", "NE2", "O"],
+    "ILE": ["C", "CA", "CB", "CG1", "CG2", "CD1", "N", "O"],
+    "LEU": ["C", "CA", "CB", "CG", "CD1", "CD2", "N", "O"],
+    "LYS": ["C", "CA", "CB", "CG", "CD", "CE", "N", "NZ", "O"],
+    "MET": ["C", "CA", "CB", "CG", "CE", "N", "O", "SD"],
+    "PHE": ["C", "CA", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "N", "O"],
+    "PRO": ["C", "CA", "CB", "CG", "CD", "N", "O"],
+    "SER": ["C", "CA", "CB", "N", "O", "OG"],
+    "THR": ["C", "CA", "CB", "CG2", "N", "O", "OG1"],
+    "TRP": [
+        "C",
+        "CA",
+        "CB",
+        "CG",
+        "CD1",
+        "CD2",
+        "CE2",
+        "CE3",
+        "CZ2",
+        "CZ3",
+        "CH2",
+        "N",
+        "NE1",
+        "O",
+    ],
+    "TYR": ["C", "CA", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "N", "O", "OH"],
+    "VAL": ["C", "CA", "CB", "CG1", "CG2", "N", "O"],
+    "UNK": ["C", "CA", "N"],
+}
+# Naming swaps for ambiguous atom names.
+# Due to symmetries in the amino acids the naming of atoms is ambiguous in
+# 4 of the 20 amino acids.
+# (The LDDT paper lists 7 amino acids as ambiguous, but the naming ambiguities
+# in LEU, VAL and ARG can be resolved by using the 3d constellations of
+# the 'ambiguous' atoms and their neighbours)
+# TODO: ^ interpret this
+residue_atom_renaming_swaps = {
+    "ASP": {"OD1": "OD2"},
+    "GLU": {"OE1": "OE2"},
+    "PHE": {"CD1": "CD2", "CE1": "CE2"},
+    "TYR": {"CD1": "CD2", "CE1": "CE2"},
+}
+# Van der Waals radii [Angstrom] of the atoms (from Wikipedia)
+van_der_waals_radius = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}
+Bond = collections.namedtuple("Bond", ["atom1_name", "atom2_name", "length", "stddev"])
+BondAngle = collections.namedtuple(
+    "BondAngle", ["atom1_name", "atom2_name", "atom3name", "angle_rad", "stddev"]
+)
+@functools.lru_cache(maxsize=None)
+def load_stereo_chemical_props() -> (
+    Tuple[
+        Mapping[str, List[Bond]],
+        Mapping[str, List[Bond]],
+        Mapping[str, List[BondAngle]],
+    ]
+):
+    """Load stereo_chemical_props.txt into a nice structure.
+    Load literature values for bond lengths and bond angles and translate
+    bond angles into the length of the opposite edge of the triangle
+    ("residue_virtual_bonds").
+    Returns:
+      residue_bonds:  dict that maps resname --> list of Bond tuples
+      residue_virtual_bonds: dict that maps resname --> list of Bond tuples
+      residue_bond_angles: dict that maps resname --> list of BondAngle tuples
+    """
+    stereo_chemical_props = Path(
+        "evolutionaryscale/structure/stereo_chemical_props.txt"
+    ).read_text()
+    lines_iter = iter(stereo_chemical_props.splitlines())
+    # Load bond lengths.
+    residue_bonds = {}
+    next(lines_iter)  # Skip header line.
+    for line in lines_iter:
+        if line.strip() == "-":
+            break
+        bond, resname, length, stddev = line.split()
+        atom1, atom2 = bond.split("-")
+        if resname not in residue_bonds:
+            residue_bonds[resname] = []
+        residue_bonds[resname].append(Bond(atom1, atom2, float(length), float(stddev)))
+    residue_bonds["UNK"] = []
+    # Load bond angles.
+    residue_bond_angles = {}
+    next(lines_iter)  # Skip empty line.
+    next(lines_iter)  # Skip header line.
+    for line in lines_iter:
+        if line.strip() == "-":
+            break
+        bond, resname, angle_degree, stddev_degree = line.split()
+        atom1, atom2, atom3 = bond.split("-")
+        if resname not in residue_bond_angles:
+            residue_bond_angles[resname] = []
+        residue_bond_angles[resname].append(
+            BondAngle(
+                atom1,
+                atom2,
+                atom3,
+                float(angle_degree) / 180.0 * np.pi,
+                float(stddev_degree) / 180.0 * np.pi,
+            )
+        )
+    residue_bond_angles["UNK"] = []
+    def make_bond_key(atom1_name, atom2_name):
+        """Unique key to lookup bonds."""
+        return "-".join(sorted([atom1_name, atom2_name]))
+    # Translate bond angles into distances ("virtual bonds").
+    residue_virtual_bonds = {}
+    for resname, bond_angles in residue_bond_angles.items():
+        # Create a fast lookup dict for bond lengths.
+        bond_cache = {}
+        for b in residue_bonds[resname]:
+            bond_cache[make_bond_key(b.atom1_name, b.atom2_name)] = b
+        residue_virtual_bonds[resname] = []
+        for ba in bond_angles:
+            bond1 = bond_cache[make_bond_key(ba.atom1_name, ba.atom2_name)]
+            bond2 = bond_cache[make_bond_key(ba.atom2_name, ba.atom3name)]
+            # Compute distance between atom1 and atom3 using the law of cosines
+            # c^2 = a^2 + b^2 - 2ab*cos(gamma).
+            gamma = ba.angle_rad
+            length = np.sqrt(
+                bond1.length**2
+                + bond2.length**2
+                - 2 * bond1.length * bond2.length * np.cos(gamma)
+            )
+            # Propagation of uncertainty assuming uncorrelated errors.
+            dl_outer = 0.5 / length
+            dl_dgamma = (2 * bond1.length * bond2.length * np.sin(gamma)) * dl_outer
+            dl_db1 = (2 * bond1.length - 2 * bond2.length * np.cos(gamma)) * dl_outer
+            dl_db2 = (2 * bond2.length - 2 * bond1.length * np.cos(gamma)) * dl_outer
+            stddev = np.sqrt(
+                (dl_dgamma * ba.stddev) ** 2
+                + (dl_db1 * bond1.stddev) ** 2
+                + (dl_db2 * bond2.stddev) ** 2
+            )
+            residue_virtual_bonds[resname].append(
+                Bond(ba.atom1_name, ba.atom3name, length, stddev)
+            )
+    return (residue_bonds, residue_virtual_bonds, residue_bond_angles)
+# Between-residue bond lengths for general bonds (first element) and for Proline
+# (second element).
+between_res_bond_length_c_n = [1.329, 1.341]
+between_res_bond_length_stddev_c_n = [0.014, 0.016]
+# Between-residue cos_angles.
+between_res_cos_angles_c_n_ca = [-0.5203, 0.0353]  # degrees: 121.352 +- 2.315
+between_res_cos_angles_ca_c_n = [-0.4473, 0.0311]  # degrees: 116.568 +- 1.995
+# This mapping is used when we need to store atom data in a format that requires
+# fixed atom data size for every residue (e.g. a numpy array).
+atom_types = [
+    "N",
+    "CA",
+    "C",
+    "CB",
+    "O",
+    "CG",
+    "CG1",
+    "CG2",
+    "OG",
+    "OG1",
+    "SG",
+    "CD",
+    "CD1",
+    "CD2",
+    "ND1",
+    "ND2",
+    "OD1",
+    "OD2",
+    "SD",
+    "CE",
+    "CE1",
+    "CE2",
+    "CE3",
+    "NE",
+    "NE1",
+    "NE2",
+    "OE1",
+    "OE2",
+    "CH2",
+    "NH1",
+    "NH2",
+    "OH",
+    "CZ",
+    "CZ2",
+    "CZ3",
+    "NZ",
+    "OXT",
+]
+atom_order = {atom_type: i for i, atom_type in enumerate(atom_types)}
+atom_type_num = len(atom_types)  # := 37.
+# A compact atom encoding with 14 columns
+# pylint: disable=line-too-long
+# pylint: disable=bad-whitespace
+restype_name_to_atom14_names = {
+    "ALA": ["N", "CA", "C", "O", "CB", "", "", "", "", "", "", "", "", ""],
+    "ARG": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "CD",
+        "NE",
+        "CZ",
+        "NH1",
+        "NH2",
+        "",
+        "",
+        "",
+    ],
+    "ASN": ["N", "CA", "C", "O", "CB", "CG", "OD1", "ND2", "", "", "", "", "", ""],
+    "ASP": ["N", "CA", "C", "O", "CB", "CG", "OD1", "OD2", "", "", "", "", "", ""],
+    "CYS": ["N", "CA", "C", "O", "CB", "SG", "", "", "", "", "", "", "", ""],
+    "GLN": ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "NE2", "", "", "", "", ""],
+    "GLU": ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "OE2", "", "", "", "", ""],
+    "GLY": ["N", "CA", "C", "O", "", "", "", "", "", "", "", "", "", ""],
+    "HIS": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "ND1",
+        "CD2",
+        "CE1",
+        "NE2",
+        "",
+        "",
+        "",
+        "",
+    ],
+    "ILE": ["N", "CA", "C", "O", "CB", "CG1", "CG2", "CD1", "", "", "", "", "", ""],
+    "LEU": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "", "", "", "", "", ""],
+    "LYS": ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ", "", "", "", "", ""],
+    "MET": ["N", "CA", "C", "O", "CB", "CG", "SD", "CE", "", "", "", "", "", ""],
+    "PHE": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "CD1",
+        "CD2",
+        "CE1",
+        "CE2",
+        "CZ",
+        "",
+        "",
+        "",
+    ],
+    "PRO": ["N", "CA", "C", "O", "CB", "CG", "CD", "", "", "", "", "", "", ""],
+    "SER": ["N", "CA", "C", "O", "CB", "OG", "", "", "", "", "", "", "", ""],
+    "THR": ["N", "CA", "C", "O", "CB", "OG1", "CG2", "", "", "", "", "", "", ""],
+    "TRP": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "CD1",
+        "CD2",
+        "NE1",
+        "CE2",
+        "CE3",
+        "CZ2",
+        "CZ3",
+        "CH2",
+    ],
+    "TYR": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "CD1",
+        "CD2",
+        "CE1",
+        "CE2",
+        "CZ",
+        "OH",
+        "",
+        "",
+    ],
+    "VAL": ["N", "CA", "C", "O", "CB", "CG1", "CG2", "", "", "", "", "", "", ""],
+    "UNK": ["N", "CA", "C", "", "", "", "", "", "", "", "", "", "", ""],
+}
+# pylint: enable=line-too-long
+# pylint: enable=bad-whitespace
+# This is the standard residue order when coding AA type as a number.
+# Reproduce it by taking 3-letter AA codes and sorting them alphabetically.
+restypes = [
+    "A",
+    "R",
+    "N",
+    "D",
+    "C",
+    "Q",
+    "E",
+    "G",
+    "H",
+    "I",
+    "L",
+    "K",
+    "M",
+    "F",
+    "P",
+    "S",
+    "T",
+    "W",
+    "Y",
+    "V",
+]
+restype_order = {restype: i for i, restype in enumerate(restypes)}
+restype_num = len(restypes)  # := 20.
+unk_restype_index = restype_num  # Catch-all index for unknown restypes.
+restypes_with_x = restypes + ["X"]
+restype_order_with_x = {restype: i for i, restype in enumerate(restypes_with_x)}
+bb_atoms = ["N", "CA", "C", "O"]
+# Hydrophobicity by residue (positive values are hydrophobic). Derived from Black & Mould (1991), normalized by subtracting 0.5.
+hydrophobicity = {
+    "ALA": 0.116,
+    "ARG": -0.5,
+    "ASN": -0.264,
+    "ASP": -0.472,
+    "CYS": 0.18,
+    "GLN": -0.249,
+    "GLU": -0.457,
+    "GLY": 0.001,
+    "HIS": -0.335,
+    "ILE": 0.443,
+    "LEU": 0.443,
+    "LYS": -0.217,
+    "MET": 0.238,
+    "PHE": 0.5,
+    "PRO": 0.211,
+    "SER": -0.141,
+    "THR": -0.05,
+    "TRP": 0.378,
+    "TYR": 0.38,
+    "VAL": 0.325,
+}
+# Side chain max accessible surface area in Ala-X-Ala tripeptide (from Chennamsetty et al. 2010).
+side_chain_asa = {
+    "ALA": 64.7809,
+    "ARG": 210.02,
+    "ASN": 113.187,
+    "ASP": 110.209,
+    "CYS": 95.2439,
+    "GLN": 147.855,
+    "GLU": 143.924,
+    "GLY": 23.1338,
+    "HIS": 146.449,
+    "ILE": 151.242,
+    "LEU": 139.524,
+    "LYS": 177.366,
+    "MET": 164.674,
+    "PHE": 186.7,
+    "PRO": 111.533,
+    "SER": 81.2159,
+    "THR": 111.597,
+    "TRP": 229.619,
+    "TYR": 200.306,
+    "VAL": 124.237,
+}
+# Approximate Volumes of amino acids in cubic angstroms.
+# https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/abbreviation.html
+amino_acid_volumes = {
+    "A": 88.6,  # Alanine
+    "R": 173.4,  # Arginine
+    "N": 114.1,  # Asparagine
+    "D": 111.1,  # Aspartic acid
+    "C": 108.5,  # Cysteine
+    "Q": 143.8,  # Glutamine
+    "E": 138.4,  # Glutamic acid
+    "G": 60.1,  # Glycine
+    "H": 153.2,  # Histidine
+    "I": 166.7,  # Isoleucine
+    "L": 166.7,  # Leucine
+    "K": 168.6,  # Lysine
+    "M": 162.9,  # Methionine
+    "F": 189.9,  # Phenylalanine
+    "P": 112.7,  # Proline
+    "S": 89.0,  # Serine
+    "T": 116.1,  # Threonine
+    "W": 227.8,  # Tryptophan
+    "Y": 193.6,  # Tyrosine
+    "V": 140.0,  # Valine
+    "X": 88.6,  # Unknown, use Alanine as approximation
+}
+def sequence_to_onehot(
+    sequence: str, mapping: Mapping[str, int], map_unknown_to_x: bool = False
+) -> np.ndarray:
+    """Maps the given sequence into a one-hot encoded matrix.
+    Args:
+      sequence: An amino acid sequence.
+      mapping: A dictionary mapping amino acids to integers.
+      map_unknown_to_x: If True, any amino acid that is not in the mapping will be
+        mapped to the unknown amino acid 'X'. If the mapping doesn't contain
+        amino acid 'X', an error will be thrown. If False, any amino acid not in
+        the mapping will throw an error.
+    Returns:
+      A numpy array of shape (seq_len, num_unique_aas) with one-hot encoding of
+      the sequence.
+    Raises:
+      ValueError: If the mapping doesn't contain values from 0 to
+        num_unique_aas - 1 without any gaps.
+    """
+    num_entries = max(mapping.values()) + 1
+    if sorted(set(mapping.values())) != list(range(num_entries)):
+        raise ValueError(
+            "The mapping must have values from 0 to num_unique_aas-1 "
+            "without any gaps. Got: %s" % sorted(mapping.values())
+        )
+    one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)
+    for aa_index, aa_type in enumerate(sequence):
+        if map_unknown_to_x:
+            if aa_type.isalpha() and aa_type.isupper():
+                aa_id = mapping.get(aa_type, mapping["X"])
+            else:
+                raise ValueError(f"Invalid character in the sequence: {aa_type}")
+        else:
+            aa_id = mapping[aa_type]
+        one_hot_arr[aa_index, aa_id] = 1
+    return one_hot_arr
+restype_1to3 = {
+    "A": "ALA",
+    "R": "ARG",
+    "N": "ASN",
+    "D": "ASP",
+    "C": "CYS",
+    "Q": "GLN",
+    "E": "GLU",
+    "G": "GLY",
+    "H": "HIS",
+    "I": "ILE",
+    "L": "LEU",
+    "K": "LYS",
+    "M": "MET",
+    "F": "PHE",
+    "P": "PRO",
+    "S": "SER",
+    "T": "THR",
+    "W": "TRP",
+    "Y": "TYR",
+    "V": "VAL",
+    "X": "UNK",
+}
+# NB: restype_3to1 differs from Bio.PDB.protein_letters_3to1 by being a simple
+# 1-to-1 mapping of 3 letter names to one letter names. The latter contains
+# many more, and less common, three letter names as keys and maps many of these
+# to the same one letter name (including 'X' and 'U' which we don't use here).
+restype_3to1 = {v: k for k, v in restype_1to3.items()}
+# Define a restype name for all unknown residues.
+unk_restype = "UNK"
+resnames = [restype_1to3[r] for r in restypes] + [unk_restype]
+resname_to_idx = {resname: i for i, resname in enumerate(resnames)}
+hydrophobic_resnames = {"VAL", "ILE", "LEU", "PHE", "MET", "TRP"}
+# The mapping here uses hhblits convention, so that B is mapped to D, J and O
+# are mapped to X, U is mapped to C, and Z is mapped to E. Other than that the
+# remaining 20 amino acids are kept in alphabetical order.
+# There are 2 non-amino acid codes, X (representing any amino acid) and
+# "-" representing a missing amino acid in an alignment.  The id for these
+# codes is put at the end (20 and 21) so that they can easily be ignored if
+# desired.
+HHBLITS_AA_TO_ID = {
+    "A": 0,
+    "B": 2,
+    "C": 1,
+    "D": 2,
+    "E": 3,
+    "F": 4,
+    "G": 5,
+    "H": 6,
+    "I": 7,
+    "J": 20,
+    "K": 8,
+    "L": 9,
+    "M": 10,
+    "N": 11,
+    "O": 20,
+    "P": 12,
+    "Q": 13,
+    "R": 14,
+    "S": 15,
+    "T": 16,
+    "U": 1,
+    "V": 17,
+    "W": 18,
+    "X": 20,
+    "Y": 19,
+    "Z": 3,
+    "-": 21,
+}
+# Partial inversion of HHBLITS_AA_TO_ID.
+ID_TO_HHBLITS_AA = {
+    0: "A",
+    1: "C",  # Also U.
+    2: "D",  # Also B.
+    3: "E",  # Also Z.
+    4: "F",
+    5: "G",
+    6: "H",
+    7: "I",
+    8: "K",
+    9: "L",
+    10: "M",
+    11: "N",
+    12: "P",
+    13: "Q",
+    14: "R",
+    15: "S",
+    16: "T",
+    17: "V",
+    18: "W",
+    19: "Y",
+    20: "X",  # Includes J and O.
+    21: "-",
+}
+restypes_with_x_and_gap = restypes + ["X", "-"]
+MAP_HHBLITS_AATYPE_TO_OUR_AATYPE = tuple(
+    restypes_with_x_and_gap.index(ID_TO_HHBLITS_AA[i])
+    for i in range(len(restypes_with_x_and_gap))
+)
+def _make_standard_atom_mask() -> np.ndarray:
+    """Returns [num_res_types, num_atom_types] mask array."""
+    # +1 to account for unknown (all 0s).
+    mask = np.zeros([restype_num + 1, atom_type_num], dtype=np.int32)
+    for restype, restype_letter in enumerate(restypes):
+        restype_name = restype_1to3[restype_letter]
+        atom_names = residue_atoms[restype_name]
+        for atom_name in atom_names:
+            atom_type = atom_order[atom_name]
+            mask[restype, atom_type] = 1
+    return mask
+STANDARD_ATOM_MASK = _make_standard_atom_mask()
+# A one hot representation for the first and second atoms defining the axis
+# of rotation for each chi-angle in each residue.
+def chi_angle_atom(atom_index: int) -> np.ndarray:
+    """Define chi-angle rigid groups via one-hot representations."""
+    chi_angles_index = {}
+    one_hots = []
+    for k, v in chi_angles_atoms.items():
+        indices = [atom_types.index(s[atom_index]) for s in v]
+        indices.extend([-1] * (4 - len(indices)))
+        chi_angles_index[k] = indices
+    for r in restypes:
+        res3 = restype_1to3[r]
+        one_hot = np.eye(atom_type_num)[chi_angles_index[res3]]
+        one_hots.append(one_hot)
+    one_hots.append(np.zeros([4, atom_type_num]))  # Add zeros for residue `X`.
+    one_hot = np.stack(one_hots, axis=0)
+    one_hot = np.transpose(one_hot, [0, 2, 1])
+    return one_hot
+chi_atom_1_one_hot = chi_angle_atom(1)
+chi_atom_2_one_hot = chi_angle_atom(2)
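+# An array like chi_angles_atoms, padded to 4 chi groups per residue type.
+# NOTE: the name -> index conversion below is commented out, so the entries
+# are still atom names (padding rows are zeros), not indices.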
+chi_angles_atom_indices = [chi_angles_atoms[restype_1to3[r]] for r in restypes]
+# chi_angles_atom_indices = tree.map_structure(
+#     lambda atom_name: atom_order[atom_name], chi_angles_atom_indices
+# )
+chi_angles_atom_indices = np.array(
+    [
+        chi_atoms + ([[0, 0, 0, 0]] * (4 - len(chi_atoms)))
+        for chi_atoms in chi_angles_atom_indices
+    ]
+)
+# Mapping from (res_name, atom_name) pairs to the atom's chi group index
+# and atom index within that group.
+chi_groups_for_atom = collections.defaultdict(list)
+for res_name, chi_angle_atoms_for_res in chi_angles_atoms.items():
+    for chi_group_i, chi_group in enumerate(chi_angle_atoms_for_res):
+        for atom_i, atom in enumerate(chi_group):
+            chi_groups_for_atom[(res_name, atom)].append((chi_group_i, atom_i))
+chi_groups_for_atom = dict(chi_groups_for_atom)
+def _make_rigid_transformation_4x4(ex, ey, translation):
+    """Create a rigid 4x4 transformation matrix from two axes and a translation."""
+    # Normalize ex.
+    ex_normalized = ex / np.linalg.norm(ex)
+    # make ey perpendicular to ex
+    ey_normalized = ey - np.dot(ey, ex_normalized) * ex_normalized
+    ey_normalized /= np.linalg.norm(ey_normalized)
+    # compute ez as cross product
+    eznorm = np.cross(ex_normalized, ey_normalized)
+    m = np.stack([ex_normalized, ey_normalized, eznorm, translation]).transpose()
+    m = np.concatenate([m, [[0.0, 0.0, 0.0, 1.0]]], axis=0)
+    return m
+# create an array with (restype, atomtype) --> rigid_group_idx
+# and an array with (restype, atomtype, coord) for the atom positions
+# and compute affine transformation matrices (4,4) from one rigid group to the
+# previous group
+restype_atom37_to_rigid_group = np.zeros([21, 37], dtype=int)
+restype_atom37_mask = np.zeros([21, 37], dtype=np.float32)
+restype_atom37_rigid_group_positions = np.zeros([21, 37, 3], dtype=np.float32)
+restype_atom14_to_rigid_group = np.zeros([21, 14], dtype=int)
+restype_atom14_mask = np.zeros([21, 14], dtype=np.float32)
+restype_atom14_rigid_group_positions = np.zeros([21, 14, 3], dtype=np.float32)
+restype_rigid_group_default_frame = np.zeros([21, 8, 4, 4], dtype=np.float32)
+def _make_rigid_group_constants():
+    """Fill the arrays above."""
+    for restype, restype_letter in enumerate(restypes_with_x):
+        resname = restype_1to3[restype_letter]
+        for atomname, group_idx, atom_position in rigid_group_atom_positions[resname]:
+            atomtype = atom_order[atomname]
+            restype_atom37_to_rigid_group[restype, atomtype] = group_idx
+            restype_atom37_mask[restype, atomtype] = 1
+            restype_atom37_rigid_group_positions[restype, atomtype, :] = atom_position
+            atom14idx = restype_name_to_atom14_names[resname].index(atomname)
+            restype_atom14_to_rigid_group[restype, atom14idx] = group_idx
+            restype_atom14_mask[restype, atom14idx] = 1
+            restype_atom14_rigid_group_positions[restype, atom14idx, :] = atom_position
+    for restype, restype_letter in enumerate(restypes_with_x):
+        resname = restype_1to3[restype_letter]
+        atom_positions = {
+            name: np.array(pos) for name, _, pos in rigid_group_atom_positions[resname]
+        }
+        # backbone to backbone is the identity transform
+        restype_rigid_group_default_frame[restype, 0, :, :] = np.eye(4)
+        # pre-omega-frame to backbone (currently dummy identity matrix)
+        restype_rigid_group_default_frame[restype, 1, :, :] = np.eye(4)
+        # phi-frame to backbone
+        mat = _make_rigid_transformation_4x4(
+            ex=atom_positions["N"] - atom_positions["CA"],
+            ey=np.array([1.0, 0.0, 0.0]),
+            translation=atom_positions["N"],
+        )
+        restype_rigid_group_default_frame[restype, 2, :, :] = mat
+        # psi-frame to backbone
+        mat = _make_rigid_transformation_4x4(
+            ex=atom_positions["C"] - atom_positions["CA"],
+            ey=atom_positions["CA"] - atom_positions["N"],
+            translation=atom_positions["C"],
+        )
+        restype_rigid_group_default_frame[restype, 3, :, :] = mat
+        # chi1-frame to backbone
+        if chi_angles_mask[restype][0]:
+            base_atom_names = chi_angles_atoms[resname][0]
+            base_atom_positions = [atom_positions[name] for name in base_atom_names]
+            mat = _make_rigid_transformation_4x4(
+                ex=base_atom_positions[2] - base_atom_positions[1],
+                ey=base_atom_positions[0] - base_atom_positions[1],
+                translation=base_atom_positions[2],
+            )
+            restype_rigid_group_default_frame[restype, 4, :, :] = mat
+        # chi2-frame to chi1-frame
+        # chi3-frame to chi2-frame
+        # chi4-frame to chi3-frame
+        # luckily all rotation axes for the next frame start at (0,0,0) of the
+        # previous frame
+        for chi_idx in range(1, 4):
+            if chi_angles_mask[restype][chi_idx]:
+                axis_end_atom_name = chi_angles_atoms[resname][chi_idx][2]
+                axis_end_atom_position = atom_positions[axis_end_atom_name]
+                mat = _make_rigid_transformation_4x4(
+                    ex=axis_end_atom_position,
+                    ey=np.array([-1.0, 0.0, 0.0]),
+                    translation=axis_end_atom_position,
+                )
+                restype_rigid_group_default_frame[restype, 4 + chi_idx, :, :] = mat
+_make_rigid_group_constants()
+def make_atom14_dists_bounds(overlap_tolerance=1.5, bond_length_tolerance_factor=15.0):
+    """Compute upper and lower bounds for bonds to assess violations."""
+    restype_atom14_bond_lower_bound = np.zeros([21, 14, 14], np.float32)
+    restype_atom14_bond_upper_bound = np.zeros([21, 14, 14], np.float32)
+    restype_atom14_bond_stddev = np.zeros([21, 14, 14], np.float32)
+    residue_bonds, residue_virtual_bonds, _ = load_stereo_chemical_props()
+    for restype, restype_letter in enumerate(restypes):
+        resname = restype_1to3[restype_letter]
+        atom_list = restype_name_to_atom14_names[resname]
+        # create lower and upper bounds for clashes
+        for atom1_idx, atom1_name in enumerate(atom_list):
+            if not atom1_name:
+                continue
+            atom1_radius = van_der_waals_radius[atom1_name[0]]
+            for atom2_idx, atom2_name in enumerate(atom_list):
+                if (not atom2_name) or atom1_idx == atom2_idx:
+                    continue
+                atom2_radius = van_der_waals_radius[atom2_name[0]]
+                lower = atom1_radius + atom2_radius - overlap_tolerance
+                upper = 1e10
+                restype_atom14_bond_lower_bound[restype, atom1_idx, atom2_idx] = lower
+                restype_atom14_bond_lower_bound[restype, atom2_idx, atom1_idx] = lower
+                restype_atom14_bond_upper_bound[restype, atom1_idx, atom2_idx] = upper
+                restype_atom14_bond_upper_bound[restype, atom2_idx, atom1_idx] = upper
+        # overwrite lower and upper bounds for bonds and angles
+        for b in residue_bonds[resname] + residue_virtual_bonds[resname]:
+            atom1_idx = atom_list.index(b.atom1_name)
+            atom2_idx = atom_list.index(b.atom2_name)
+            lower = b.length - bond_length_tolerance_factor * b.stddev
+            upper = b.length + bond_length_tolerance_factor * b.stddev
+            restype_atom14_bond_lower_bound[restype, atom1_idx, atom2_idx] = lower
+            restype_atom14_bond_lower_bound[restype, atom2_idx, atom1_idx] = lower
+            restype_atom14_bond_upper_bound[restype, atom1_idx, atom2_idx] = upper
+            restype_atom14_bond_upper_bound[restype, atom2_idx, atom1_idx] = upper
+            restype_atom14_bond_stddev[restype, atom1_idx, atom2_idx] = b.stddev
+            restype_atom14_bond_stddev[restype, atom2_idx, atom1_idx] = b.stddev
+    return {
+        "lower_bound": restype_atom14_bond_lower_bound,  # shape (21,14,14)
+        "upper_bound": restype_atom14_bond_upper_bound,  # shape (21,14,14)
+        "stddev": restype_atom14_bond_stddev,  # shape (21,14,14)
+    }
+restype_atom14_ambiguous_atoms = np.zeros((21, 14), dtype=np.float32)
+restype_atom14_ambiguous_atoms_swap_idx = np.tile(np.arange(14, dtype=int), (21, 1))
+def _make_atom14_ambiguity_feats():
+    for res, pairs in residue_atom_renaming_swaps.items():
+        res_idx = restype_order[restype_3to1[res]]
+        for atom1, atom2 in pairs.items():
+            atom1_idx = restype_name_to_atom14_names[res].index(atom1)
+            atom2_idx = restype_name_to_atom14_names[res].index(atom2)
+            restype_atom14_ambiguous_atoms[res_idx, atom1_idx] = 1
+            restype_atom14_ambiguous_atoms[res_idx, atom2_idx] = 1
+            restype_atom14_ambiguous_atoms_swap_idx[res_idx, atom1_idx] = atom2_idx
+            restype_atom14_ambiguous_atoms_swap_idx[res_idx, atom2_idx] = atom1_idx
+_make_atom14_ambiguity_feats()
+def aatype_to_str_sequence(aatype):
+    return "".join([restypes_with_x[aatype[i]] for i in range(len(aatype))])
+# NOTE(thayes): These are computed based on the average CA->C and CA->N norm from rigid_group_atom_positions
+CA_TO_N_NORM = 1.4591
+CA_TO_C_NORM = 1.5252
+def _make_restype_atom37_to_atom14():
+    """Map from atom37 to atom14 per residue type."""
+    restype_atom37_to_atom14 = []  # mapping (restype, atom37) --> atom14
+    for rt in restypes:
+        atom_names = restype_name_to_atom14_names[restype_1to3[rt]]
+        atom_name_to_idx14 = {name: i for i, name in enumerate(atom_names)}
+        restype_atom37_to_atom14.append(
+            [
+                (atom_name_to_idx14[name] if name in atom_name_to_idx14 else 0)
+                for name in atom_types
+            ]
+        )
+    restype_atom37_to_atom14.append([0] * 37)
+    restype_atom37_to_atom14 = np.array(restype_atom37_to_atom14, dtype=np.int32)
+    return restype_atom37_to_atom14
+def _make_restype_atom14_to_atom37():
+    """Map from atom14 to atom37 per residue type."""
+    restype_atom14_to_atom37 = []  # mapping (restype, atom14) --> atom37
+    for rt in restypes:
+        atom_names = restype_name_to_atom14_names[restype_1to3[rt]]
+        restype_atom14_to_atom37.append(
+            [(atom_order[name] if name else 0) for name in atom_names]
+        )
+    # Add dummy mapping for restype 'UNK'
+    restype_atom14_to_atom37.append([0] * 14)
+    restype_atom14_to_atom37 = np.array(restype_atom14_to_atom37, dtype=np.int32)
+    return restype_atom14_to_atom37
+RESTYPE_ATOM14_TO_ATOM37 = _make_restype_atom14_to_atom37()
+RESTYPE_ATOM37_TO_ATOM14 = _make_restype_atom37_to_atom14()
+CHAIN_BREAK_TOKEN = "|"
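The HHblits tables in esmfold2_residue_constants.py can be rebuilt from the rule stated in their comment. The sketch below is a standalone illustration, not the module's code. It assumes `restypes` follows the key order of `restype_1to3` shown in the diff (the real list is defined outside this excerpt), and the helper names `encode_hhblits` and `hhblits_ids_to_our_ids` are hypothetical.

```python
# Assumption: restypes matches the key order of restype_1to3 in the diff.
restypes = list("ARNDCQEGHILKMFPSTWYV")
restypes_with_x_and_gap = restypes + ["X", "-"]

# HHblits ids: the 20 standard residues in alphabetical order, then X and gap.
id_to_hhblits_aa = dict(enumerate(sorted(restypes) + ["X", "-"]))
hhblits_aa_to_id = {aa: i for i, aa in id_to_hhblits_aa.items()}

# Ambiguous or rare codes collapse onto an existing id (B->D, J/O->X, U->C, Z->E).
for alias, target in {"B": "D", "J": "X", "O": "X", "U": "C", "Z": "E"}.items():
    hhblits_aa_to_id[alias] = hhblits_aa_to_id[target]

# Permutation from an HHblits id to the position in restypes_with_x_and_gap.
hhblits_to_our = tuple(
    restypes_with_x_and_gap.index(id_to_hhblits_aa[i])
    for i in range(len(restypes_with_x_and_gap))
)


def encode_hhblits(seq: str) -> list[int]:
    """Encode a sequence (or alignment row) as HHblits ids."""
    return [hhblits_aa_to_id[aa] for aa in seq]


def hhblits_ids_to_our_ids(ids: list[int]) -> list[int]:
    """Re-index HHblits ids into the restypes_with_x_and_gap ordering."""
    return [hhblits_to_our[i] for i in ids]
```

Because the inverse table is only partial, decoding an id never recovers B, J, O, U, or Z; those codes are lost on encoding by design.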

esmfold2_sequential_dataclass.py ADDED

	@@ -0,0 +1,158 @@

+from abc import ABC, abstractmethod
+from dataclasses import dataclass, fields, replace
+from typing import TypeVar
+import numpy as np
+from .esmfold2_misc import concat_objects, slice_any_object
+T = TypeVar("T")
+@dataclass(frozen=True)
+class SequentialDataclass(ABC):
+    """
+    This is a builder on a dataclass that allows for automatic slicing and concatenation.
+    When representing multimodal data, we often have multiple datatypes which have sequence dimensions that are the same (e.g. the length of the protein).
+    When applying a transformation like a crop, we want to apply this to all tensors at the same time (e.g. crop the sequence, structure, and function).
+    We also have some fields that are not sequential (like an id, or data source), which we don't want to crop.
+    The SequentialDataclass abstracts this cropping away, allowing you to define dataclasses that implement `__len__`, `__getitem__` and `concat` automatically.
+    This is done through each field's `metadata` mapping, which accepts three keys:
+        `sequence` (bool): True or False, tells the dataclass whether this field is a sequential type. Default: False.
+        `sequence_dim` (int): Which dimension is the sequential dimension (e.g. for a list of inverse folded sequences, we want to index each sequence in the list, not the list itself). Default: 0.
+        `join_token` (Any): What token to use to join when concatenating elements. Default: None.
+    Example:
+        @dataclass(frozen=True)
+        class Foo(SequentialDataclass):
+            id: str
+            sequence: str = field(metadata={"sequence": True, "join_token": "|"})
+            tensor: torch.Tensor = field(metadata={"sequence": True, "join_token": torch.nan})
+            def __len__(self):
+                # Must implement the __len__ method
+                return len(self.sequence)
+        >>> foo = Foo(id="foo", sequence="ABCDE", tensor=torch.randn(5))
+        >>> foo
+        Foo(id='foo', sequence='ABCDE', tensor=tensor([ 0.0252, -0.3335, -0.5143,  0.0251, -1.0717]))
+        >>> foo[1:4]
+        Foo(id='foo', sequence='BCD', tensor=tensor([-0.3335, -0.5143,  0.0251]))
+        >>> foo[np.arange(5) < 3]
+        Foo(id='foo', sequence='ABC', tensor=tensor([ 0.0252, -0.3335, -0.5143]))
+        >>> Foo.concat([foo[:2], foo[3:]])
+        Foo(id='foo', sequence='AB|DE', tensor=tensor([ 0.0252, -0.3335,     nan,  0.0251, -1.0717]))
+        # Trying to create a type where the sequence lengths do not match raises an error
+        >>> foo = Foo(id="foo", sequence="ABCDE", tensor=torch.randn(6))
+        ValueError: Mismatch in sequence length for field: tensor. Expected 5, received 6
+    """
+    def __post_init__(self):
+        self._check_sequence_lengths_match()
+    @abstractmethod
+    def __len__(self):
+        raise NotImplementedError
+    def __getitem__(self, idx: int | list[int] | slice | np.ndarray):
+        updated_fields = {}
+        if isinstance(idx, int):
+            # make it so that things remain sequential
+            idx = [idx]
+        for fld in fields(self):
+            if fld.metadata.get("sequence", False):
+                # this is a sequence, should be the same length as all other sequences
+                sequence_dim = fld.metadata.get("sequence_dim", 0)
+                value = getattr(self, fld.name)
+                if value is None:
+                    continue
+                match sequence_dim:
+                    case 0:
+                        # sequence is first dimension
+                        value = getattr(self, fld.name)
+                        value = slice_any_object(value, idx)
+                        updated_fields[fld.name] = value
+                    case 1:
+                        new_value = [slice_any_object(item, idx) for item in value]
+                        updated_fields[fld.name] = value.__class__(new_value)
+                    case _:
+                        raise NotImplementedError(
+                            "Arbitrary slicing for different sequence length fields is not implemented"
+                        )
+        return replace(self, **updated_fields)
+    def _check_sequence_lengths_match(self):
+        """Checks if sequence lengths of all "sequence" fields match."""
+        for fld in fields(self):
+            if fld.metadata.get("sequence", False) and fld.name != "complex":
+                # this is a sequence, should be the same length as all other sequences
+                sequence_dim = fld.metadata.get("sequence_dim", 0)
+                value = getattr(self, fld.name)
+                if value is None:
+                    continue
+                match sequence_dim:
+                    case 0:
+                        # sequence is first dimension
+                        value = getattr(self, fld.name)
+                        if len(value) != len(self):
+                            raise ValueError(
+                                f"Mismatch in sequence length for field: {fld.name}. Expected {len(self)}, received {len(value)}"
+                            )
+                    case 1:
+                        for item in value:
+                            if len(item) != len(self):
+                                raise ValueError(
+                                    f"Mismatch in sequence length for field: {fld.name}. Expected {len(self)}, received {len(item)}"
+                                )
+                    case _:
+                        raise NotImplementedError(
+                            "Arbitrary matching for different sequence length fields is not implemented"
+                        )
+    @classmethod
+    def concat(cls, items: list[T], **kwargs) -> T:
+        updated_fields = {}
+        for fld in fields(cls):
+            if fld.metadata.get("sequence", False):
+                # this is a sequence, should be the same length as all other sequences
+                sequence_dim = fld.metadata.get("sequence_dim", 0)
+                join_value = fld.metadata.get("join_token", None)
+                if getattr(items[0], fld.name) is None:
+                    continue
+                values = [getattr(item, fld.name) for item in items]
+                match sequence_dim:
+                    case 0:
+                        # sequence is first dimension
+                        value = concat_objects(values, join_value)
+                        updated_fields[fld.name] = value
+                    case 1:
+                        new_value = [
+                            concat_objects(item, join_value) for item in zip(*values)
+                        ]
+                        updated_fields[fld.name] = getattr(
+                            items[0], fld.name
+                        ).__class__(new_value)
+                    case _:
+                        raise NotImplementedError(
+                            "Arbitrary joining for different sequence length fields is not implemented"
+                        )
+        updated_fields.update(kwargs)
+        return replace(
+            items[0],  # type: ignore
+            **updated_fields,
+        )
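The metadata-driven slicing and joining in `SequentialDataclass` can be illustrated without the repository's helpers. This is a simplified sketch under stated assumptions: `Record`, `slice_record`, `join_parts`, and `concat_records` are hypothetical names, only `str` and `list` fields with `sequence_dim` 0 are handled, and the real class delegates to `slice_any_object` and `concat_objects` from `esmfold2_misc`, whose behaviour is not shown in this excerpt.

```python
import math
from dataclasses import dataclass, field, fields, replace


@dataclass(frozen=True)
class Record:
    id: str  # non-sequential field: never sliced or joined
    sequence: str = field(metadata={"sequence": True, "join_token": "|"})
    scores: list = field(metadata={"sequence": True, "join_token": math.nan})

    def __len__(self) -> int:
        return len(self.sequence)


def slice_record(rec: Record, idx: slice) -> Record:
    """Apply the same slice to every field flagged as sequential."""
    updated = {
        f.name: getattr(rec, f.name)[idx]
        for f in fields(rec)
        if f.metadata.get("sequence", False)
    }
    return replace(rec, **updated)


def join_parts(parts, token):
    """Join strings with str.join; join lists by inserting the token as an element."""
    if isinstance(parts[0], str):
        return token.join(parts)
    joined = list(parts[0])
    for part in parts[1:]:
        joined.append(token)
        joined.extend(part)
    return joined


def concat_records(items: list) -> Record:
    """Concatenate sequential fields; other fields come from the first item."""
    updated = {
        f.name: join_parts(
            [getattr(item, f.name) for item in items], f.metadata.get("join_token")
        )
        for f in fields(items[0])
        if f.metadata.get("sequence", False)
    }
    return replace(items[0], **updated)
```

Inserting one join token per boundary in every sequential field keeps all fields the same length after concatenation, which is the invariant `_check_sequence_lengths_match` enforces in the real class.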

esmfold2_system.py ADDED

	@@ -0,0 +1,46 @@

+import io
+import subprocess
+import typing as T
+from pathlib import Path
+PathLike = T.Union[str, Path]
+PathOrBuffer = T.Union[PathLike, io.StringIO]
+def run_subprocess_with_errorcheck(
+    *popenargs,
+    capture_output: bool = False,
+    quiet: bool = False,
+    env: dict[str, str] | None = None,
+    shell: bool = False,
+    executable: str | None = None,
+    **kws,
+) -> subprocess.CompletedProcess:
+    """Like subprocess.run, but a failing command raises a RuntimeError whose
+    message includes the captured stderr. This makes it significantly easier
+    to diagnose issues.
+    """
+    try:
+        if capture_output:
+            stdout = subprocess.PIPE
+        elif quiet:
+            stdout = subprocess.DEVNULL
+        else:
+            stdout = None
+        p = subprocess.run(
+            *popenargs,
+            stderr=subprocess.PIPE,
+            stdout=stdout,
+            check=True,
+            env=env,
+            shell=shell,
+            executable=executable,
+            **kws,
+        )
+    except subprocess.CalledProcessError as e:
+        raise RuntimeError(
+            f"Command failed with error code {e.returncode}.\n\n{e.stderr.decode()}"
+        ) from e
+    return p
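The error-wrapping pattern used by `run_subprocess_with_errorcheck` can be exercised in isolation. The following is a reduced sketch, not the function from the diff: `run_checked` and `demo` are hypothetical names, only the stderr-capturing path is reproduced, and the failing child process is an inline Python one-liner chosen so the example does not depend on any external tool.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run cmd; on a non-zero exit, raise RuntimeError carrying the child's stderr."""
    try:
        return subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,  # captured as bytes so it can be reported
            check=True,              # non-zero exit raises CalledProcessError
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"Command failed with error code {e.returncode}.\n\n{e.stderr.decode()}"
        ) from e


def demo() -> str:
    """Run a child that writes to stderr and exits with status 3; return the error text."""
    failing = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('boom'); sys.exit(3)",
    ]
    try:
        run_checked(failing)
    except RuntimeError as err:
        return str(err)
    return ""
```

Note that `e.stderr.decode()` assumes stderr was captured as bytes; a caller who passes `text=True` through the keyword arguments would receive a `str` there instead.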

esmfold2_types.py ADDED

	@@ -0,0 +1,34 @@

+"""Re-exports of the canonical SPI dataclasses from input_builder.
+This module exists so the HF processor and downstream code can import the
+ESMFold2 input types from a single namespace without picking up internal-only
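+sibling utilities. In this repository the definitions are imported from
+``.esmfold2_input_builder``, ``.esmfold2_msa``, and ``.esmfold2_parsing``
+(upstream: ``esm.utils.structure.input_builder``).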
+"""
+from .esmfold2_msa import MSA
+from .esmfold2_parsing import FastaEntry
+from .esmfold2_input_builder import (
+    CovalentBond,
+    DistogramConditioning,
+    DNAInput,
+    LigandInput,
+    Modification,
+    ProteinInput,
+    RNAInput,
+    StructurePredictionInput,
+)
+__all__ = [
+    "FastaEntry",
+    "MSA",
+    "Modification",
+    "ProteinInput",
+    "RNAInput",
+    "DNAInput",
+    "LigandInput",
+    "DistogramConditioning",
+    "CovalentBond",
+    "StructurePredictionInput",
+]

esmfold2_utils_types.py ADDED

	@@ -0,0 +1,34 @@

+from __future__ import annotations
+import io
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Union
+from cloudpathlib import CloudPath
+PathLike = Union[str, Path, CloudPath]
+PathOrBuffer = Union[PathLike, io.StringIO]
+@dataclass
+class FunctionAnnotation:
+    """Represents an annotation of a protein's function over a range of residues.
+    Fields:
+        label (str): An entry in either the function_tokens or residue_annotations tokenizer vocabs
+        start (int): Start index of this annotation.  1-indexed, inclusive.
+        end (int): End index of this annotation.  1-indexed, inclusive.
+    """
+    label: str
+    start: int
+    end: int
+    def to_tuple(self) -> tuple[str, int, int]:
+        return self.label, self.start, self.end
+    def __len__(self) -> int:
+        """Length of the annotation."""
+        return self.end - self.start + 1
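`FunctionAnnotation` stores 1-indexed, inclusive bounds, while Python slices are 0-indexed and end-exclusive. The sketch below restates the dataclass from the diff and adds a hypothetical helper, `to_slice`, to show the conversion; the helper and the label value are illustrative assumptions and are not part of the commit.

```python
from dataclasses import dataclass


@dataclass
class FunctionAnnotation:
    label: str
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive

    def __len__(self) -> int:
        # Inclusive bounds: residues start..end contain end - start + 1 positions.
        return self.end - self.start + 1


def to_slice(ann: FunctionAnnotation) -> slice:
    """Convert 1-indexed inclusive bounds to a 0-indexed, end-exclusive slice."""
    return slice(ann.start - 1, ann.end)


# Hypothetical label; real labels come from the tokenizer vocabularies.
annotation = FunctionAnnotation(label="example_label", start=3, end=7)
covered = "ABCDEFGHIJ"[to_slice(annotation)]
```

Only the start shifts by one: the inclusive 1-indexed end already equals the exclusive 0-indexed end, so the sliced span has exactly `len(annotation)` residues.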

modeling_esmc.py ADDED

	@@ -0,0 +1,1667 @@

+# Copyright 2026 Biohub. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch ESMC model."""
+import importlib
+import math
+import re
+from dataclasses import dataclass
+from typing import Optional, cast
+import torch
+import torch.nn as nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from torch.nn import functional as F
+from transformers.modeling_outputs import (
+    MaskedLMOutput,
+    ModelOutput,
+    SequenceClassifierOutput,
+    TokenClassifierOutput,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import (
+    auto_docstring,
+    can_return_tuple,
+    is_flash_attn_2_available,
+    logging,
+)
+from .configuration_esmc import ESMCConfig
+from .modeling_esmc_sae import _ESMCSAELayer
+logger = logging.get_logger(__name__)
+_CONFIG_FOR_DOC = "ESMCConfig"
+# Optional accelerated kernels. Pure-PyTorch fallbacks below if absent.
+if is_flash_attn_2_available():
+    flash_attn_module = importlib.import_module("flash_attn")
+    flash_bert_padding = importlib.import_module("flash_attn.bert_padding")
+    flash_attn_varlen_qkvpacked_func = (
+        flash_attn_module.flash_attn_varlen_qkvpacked_func
+    )
+    pad_input = flash_bert_padding.pad_input
+    unpad_input = flash_bert_padding.unpad_input
+    _flash_attn_available = True
+else:
+    pad_input = unpad_input = flash_attn_varlen_qkvpacked_func = None
+    _flash_attn_available = False
+try:
+    flash_rotary = importlib.import_module("flash_attn.ops.triton.rotary")
+    apply_triton_rotary = flash_rotary.apply_rotary
+    _flash_attn_rotary_available = torch.cuda.is_available()
+except ImportError:
+    apply_triton_rotary = None  # type: ignore[assignment]
+    _flash_attn_rotary_available = False
+# Transformer Engine: fused LayerNorm+Linear / LayerNorm+MLP kernels with
+# fp32 reduction inside the LayerNorm. Recommended on GPU for accurate bf16
+# inference. Without it, the pure-PyTorch fallback differs by ~O(10) max-diff
+# in fp32 and ~O(100) in bf16 on the unnormalized residual stream; after the
+# final LayerNorm the difference shrinks to a few ULP and perplexity stays
+# within rounding noise.
+try:
+    te = importlib.import_module("transformer_engine.pytorch")
+    _te_available = True
+except ImportError:
+    te = None  # type: ignore[assignment]
+    _te_available = False
+# xformers: preferred SDPA implementation on GPU. Provides a fused
+# bf16 attention kernel with deterministic reduction order. Flash
+# Attention 2 and PyTorch's ``F.scaled_dot_product_attention`` are
+# progressively-less-preferred fallbacks.
+try:
+    xops = importlib.import_module("xformers.ops")
+    _xformers_available = True
+except ImportError:
+    xops = None  # type: ignore[assignment]
+    _xformers_available = False
+# Flash Attention 2: secondary SDPA fallback. Used when xformers is not
+# installed; fp16 / bf16 only.
+if _flash_attn_available:
+    flash_attn_func = flash_attn_module.flash_attn_func
+else:
+    flash_attn_func = None  # type: ignore[assignment]
+if not _te_available:
+    logger.warning(
+        "ESMC: transformer_engine is not installed; falling back to "
+        "pure-PyTorch LayerNorm+Linear / LayerNorm+MLP. Outputs will differ "
+        "numerically — measured on the unnormalized residual stream (before "
+        "the final LayerNorm), ~O(10) max-diff in fp32 and ~O(100) in bf16; "
+        "after the final LayerNorm these shrink to a few ULP and perplexity "
+        "stays within rounding noise. Install with "
+        "`pip install transformer-engine[pytorch]` to enable fused fp32-"
+        "reduction LayerNorm."
+    )
+if not _xformers_available and not _flash_attn_available:
+    logger.warning(
+        "ESMC: neither xformers nor flash-attn is installed; falling back "
+        "to PyTorch ``F.scaled_dot_product_attention``. The attention "
+        "reduction order in bf16 differs from a fused kernel by ~1 bf16 "
+        "ULP per attention block; compounded across the 80-block stack "
+        "this reaches ~O(100) max-diff on the unnormalized residual stream. "
+        "Install xformers (preferred) with `pip install xformers` for a "
+        "fused attention kernel."
+    )
+if torch.cuda.is_available() and not _flash_attn_rotary_available:
+    logger.warning(
+        "ESMC: flash-attn rotary kernel not installed; falling back to "
+        "pure-PyTorch RoPE. For faster GPU inference run `pip install flash-attn`."
+    )
+# ---------------------------------------------------------------------------
+# Output dataclasses
+# ---------------------------------------------------------------------------
+@dataclass
+class ESMCOutput(ModelOutput):
+    """
+    Args:
+        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, d_model)`):
+            Sequence of hidden states at the output of the last layer, after layer normalisation.
+        hidden_states (`torch.FloatTensor`, *optional*):
+            Stacked hidden states for all encoder layers.
+            Shape ``(n_layers, batch_size, sequence_length, d_model)``.
+            Returned when ``output_hidden_states=True``.
+        sae_outputs (`dict[str, torch.Tensor]`, *optional*):
+            SAE feature magnitudes keyed by SAE model name (sparse tensors).
+            Only populated when SAE models have been registered via
+            ``add_sae_models`` and ``compute_sae=True``.
+        attentions (`tuple(torch.FloatTensor)`, *optional*):
+            Per-layer attention weights of shape
+            ``(batch_size, num_heads, sequence_length, sequence_length)``.
+            Returned when ``output_attentions=True``.  Not available on the
+            ``flash_attention_2`` path.
+    """
+    last_hidden_state: torch.FloatTensor | None = None
+    hidden_states: torch.FloatTensor | None = None
+    sae_outputs: dict[str, torch.Tensor] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+@dataclass
+class ESMCMaskedLMOutput(MaskedLMOutput):
+    """
+    Args:
+        loss (`torch.FloatTensor` of shape `(1,)`, *optional*):
+            Masked language modelling loss. Returned when ``labels`` are provided.
+        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, vocab_size)`):
+            Prediction scores of the language modelling head.
+        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, d_model)`):
+            Final hidden states after layer normalisation.
+        hidden_states (`torch.FloatTensor`, *optional*):
+            Stacked hidden states. Shape ``(n_layers, batch_size, sequence_length, d_model)``.
+        sae_outputs (`dict[str, torch.Tensor]`, *optional*):
+            SAE feature magnitudes keyed by SAE model name (sparse tensors).
+        attentions (`tuple(torch.FloatTensor)`, *optional*):
+            Per-layer attention weights of shape
+            ``(batch_size, num_heads, sequence_length, sequence_length)``.
+            Returned when ``output_attentions=True``.
+    """
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    last_hidden_state: torch.FloatTensor | None = None
+    hidden_states: torch.FloatTensor | None = None
+    sae_outputs: dict[str, torch.Tensor] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+@dataclass
+class ESMCTokenClassifierOutput(TokenClassifierOutput):
+    """
+    Args:
+        loss (`torch.FloatTensor` of shape `(1,)`, *optional*):
+            Token classification loss. Returned when ``labels`` are provided.
+        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_labels)`):
+            Classification scores (before SoftMax).
+        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, d_model)`):
+            Final hidden states after layer normalisation.
+        hidden_states (`torch.FloatTensor`, *optional*):
+            Stacked hidden states. Shape ``(n_layers, batch_size, sequence_length, d_model)``.
+        sae_outputs (`dict[str, torch.Tensor]`, *optional*):
+            SAE feature magnitudes keyed by SAE model name (sparse tensors).
+        attentions (`tuple(torch.FloatTensor)`, *optional*):
+            Per-layer attention weights of shape
+            ``(batch_size, num_heads, sequence_length, sequence_length)``.
+            Returned when ``output_attentions=True``.
+    """
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    last_hidden_state: torch.FloatTensor | None = None
+    hidden_states: torch.FloatTensor | None = None
+    sae_outputs: dict[str, torch.Tensor] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+@dataclass
+class ESMCSequenceClassifierOutput(SequenceClassifierOutput):
+    """
+    Args:
+        loss (`torch.FloatTensor` of shape `(1,)`, *optional*):
+            Sequence classification loss. Returned when ``labels`` are provided.
+        logits (`torch.FloatTensor` of shape `(batch_size, num_labels)`):
+            Classification scores (before SoftMax).
+        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, d_model)`):
+            Final hidden states after layer normalisation.
+        hidden_states (`torch.FloatTensor`, *optional*):
+            Stacked hidden states. Shape ``(n_layers, batch_size, sequence_length, d_model)``.
+        sae_outputs (`dict[str, torch.Tensor]`, *optional*):
+            SAE feature magnitudes keyed by SAE model name (sparse tensors).
+        attentions (`tuple(torch.FloatTensor)`, *optional*):
+            Per-layer attention weights of shape
+            ``(batch_size, num_heads, sequence_length, sequence_length)``.
+            Returned when ``output_attentions=True``.
+    """
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    last_hidden_state: torch.FloatTensor | None = None
+    hidden_states: torch.FloatTensor | None = None
+    sae_outputs: dict[str, torch.Tensor] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+# ---------------------------------------------------------------------------
+# Rotary position embedding helpers
+# ---------------------------------------------------------------------------
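+def _rotate_half(x: torch.Tensor, interleaved: bool = False) -> torch.Tensor:
+    """Rotate pairs ``(a, b) -> (-b, a)`` along the last dim: the two halves
+    are paired by default, adjacent elements when ``interleaved`` is ``True``."""
+    if not interleaved: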
+        x1, x2 = x.chunk(2, dim=-1)
+        return torch.cat((-x2, x1), dim=-1)
+    x1, x2 = x[..., ::2], x[..., 1::2]
+    return torch.stack((-x2, x1), dim=-1).flatten(-2, -1)
+def _apply_rotary_emb_torch(
+    x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, interleaved: bool = False
+) -> torch.Tensor:
+    """Apply rotary position embeddings (pure PyTorch, no Triton dependency).
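+    Args:
+        x: ``(batch, seqlen, n_heads, head_dim)``
+        cos: ``(seqlen, rotary_dim / 2)``
+        sin: ``(seqlen, rotary_dim / 2)``
+        interleaved: If ``True`` rotate adjacent pairs (GPT-J style) instead of
+            the two halves of the rotary dimensions (GPT-NeoX style).
+    """
+    ro_dim = cos.shape[-1] * 2
+    seqlen = x.size(1)
+    cos = cos[:seqlen].unsqueeze(1)
+    sin = sin[:seqlen].unsqueeze(1)
+    if interleaved:
+        # Adjacent pairs share one frequency: (c0, c0, c1, c1, ...).
+        cos = cos.repeat_interleave(2, dim=-1)
+        sin = sin.repeat_interleave(2, dim=-1)
+    else:
+        # The two halves share frequencies: (c0, c1, ..., c0, c1, ...).
+        cos = cos.repeat(1, 1, 2)
+        sin = sin.repeat(1, 1, 2)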
+    return torch.cat(
+        [
+            x[..., :ro_dim] * cos + _rotate_half(x[..., :ro_dim], interleaved) * sin,
+            x[..., ro_dim:],
+        ],
+        dim=-1,
+    )
+class RotaryEmbedding(nn.Module):
+    """Rotary position embeddings (RoPE) as described in `RoFormer`_.
+    .. _RoFormer: https://arxiv.org/abs/2104.09864
+    Args:
+        dim: Size of a single attention head.
+        base: Frequency base for the sinusoidal positions.
+        interleaved: If ``True`` rotate adjacent pairs (GPT-J style) instead of
+            splitting the head dimension in half (GPT-NeoX style).
+        scale_base: If set, builds the XPos scale buffer. ``forward`` raises
+            ``NotImplementedError`` in that case, so leave it as ``None``.
+        scaling_factor: Linear scaling factor; position indices are divided
+            by it before the frequencies are computed.
+        pos_idx_in_fp32: Compute position indices in float32 to avoid bf16
+            rounding errors at large sequence lengths.
+    """
+    def __init__(
+        self,
+        dim: int,
+        base: float = 10000.0,
+        interleaved: bool = False,
+        scale_base: float | None = None,
+        scaling_factor: float = 1.0,
+        pos_idx_in_fp32: bool = True,
+        device=None,
+    ):
+        super().__init__()
+        self.dim = dim
+        self.base = base
+        self.interleaved = interleaved
+        self.scale_base = scale_base
+        self.scaling_factor = scaling_factor
+        self.pos_idx_in_fp32 = pos_idx_in_fp32
+        self._seq_len_cached = 0
+        self._cos_cached: torch.Tensor | None = None
+        self._sin_cached: torch.Tensor | None = None
+        self._cos_k_cached: torch.Tensor | None = None
+        self._sin_k_cached: torch.Tensor | None = None
+        self.reset_parameters(device=device)
+    def reset_parameters(self, device=None):
+        inv_freq = self._compute_inv_freq(device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        arange = torch.arange(0, self.dim, 2, device=device, dtype=torch.float32)
+        scale = (
+            (arange + 0.4 * self.dim) / (1.4 * self.dim)
+            if self.scale_base is not None
+            else None
+        )
+        self.register_buffer("scale", scale, persistent=False)
+    def _compute_inv_freq(self, device=None) -> torch.Tensor:
+        return 1.0 / (
+            self.base
+            ** (
+                torch.arange(0, self.dim, 2, device=device, dtype=torch.float32)
+                / self.dim
+            )
+        )
+    def _update_cos_sin_cache(self, seqlen: int, device=None, dtype=None):
+        if self.inv_freq.is_meta:
+            self.reset_parameters(device=device)
+        if (
+            seqlen > self._seq_len_cached
+            or self._cos_cached is None
+            or self._cos_cached.device != device
+            or self._cos_cached.dtype != dtype
+            or (self.training and self._cos_cached.is_inference())
+        ):
+            self._seq_len_cached = seqlen
+            if self.pos_idx_in_fp32:
+                t = (
+                    torch.arange(seqlen, device=device, dtype=torch.float32)
+                    / self.scaling_factor
+                )
+                inv_freq = (
+                    self.inv_freq.to(torch.float32)
+                    if self.inv_freq.dtype != torch.float32
+                    else self.inv_freq
+                )
+            else:
+                t = (
+                    torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)  # type: ignore[call-overload]
+                    / self.scaling_factor
+                )
+                inv_freq = self.inv_freq
+            freqs = torch.outer(t, inv_freq)  # type: ignore[arg-type]
+            if self.scale is None:
+                self._cos_cached = torch.cos(freqs).to(dtype)
+                self._sin_cached = torch.sin(freqs).to(dtype)
+            else:
+                _scale: torch.Tensor = self.scale  # type: ignore[assignment]
+                power = (
+                    torch.arange(seqlen, dtype=_scale.dtype, device=_scale.device)
+                    - seqlen // 2
+                ) / self.scale_base  # type: ignore[operator]
+                scale = _scale.to(device=power.device) ** power.unsqueeze(-1)
+                self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
+                self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
+                self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
+                self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
+    def _apply(self, fn, recurse=True):
+        if self.inv_freq.is_meta:
+            self.reset_parameters(device="cpu")
+        result = super()._apply(fn, recurse=recurse)
+        # Recompute inv_freq on the new device: CPU vs CUDA ``pow`` differ by
+        # ~1 fp32 ULP, which compounds across attention layers. Keep this
+        # buffer fp32 even when the module is cast to bf16/fp16; otherwise the
+        # rounded RoPE frequencies drift from the internal ESMC path.
+        new_inv_freq = self._compute_inv_freq(device=self.inv_freq.device)
+        self.register_buffer("inv_freq", new_inv_freq, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached = None
+        self._sin_cached = None
+        self._cos_k_cached = None
+        self._sin_k_cached = None
+        return result
+    def forward(
+        self, q: torch.Tensor, k: torch.Tensor, seqlen_offset: int = 0
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """Apply RoPE to query and key tensors.
+        Args:
+            q: ``(batch, seqlen, n_heads, head_dim)``
+            k: ``(batch, seqlen, n_heads, head_dim)``
+            seqlen_offset: Offset used in incremental decoding.
+        Returns:
+            Tuple of rotated ``(q, k)`` tensors with the same shape as the inputs.
+        """
+        self._update_cos_sin_cache(
+            q.shape[1] + seqlen_offset, device=q.device, dtype=q.dtype
+        )
+        assert self._cos_cached is not None and self._sin_cached is not None
+        if self.scale is not None:
+            raise NotImplementedError("XPos scaling is not supported in this path.")
+        cos = self._cos_cached[seqlen_offset:]
+        sin = self._sin_cached[seqlen_offset:]
+        if _flash_attn_rotary_available and q.device.type == "cuda":
+            q_rot = apply_triton_rotary(q, cos, sin, interleaved=self.interleaved)  # type: ignore[misc]
+            k_rot = apply_triton_rotary(k, cos, sin, interleaved=self.interleaved)  # type: ignore[misc]
+        else:
+            q_rot = _apply_rotary_emb_torch(q, cos, sin, self.interleaved)
+            k_rot = _apply_rotary_emb_torch(k, cos, sin, self.interleaved)
+        return q_rot, k_rot
+class _TritonRotaryEmbedding(RotaryEmbedding):
+    """RoPE variant that delegates to the Flash-Attention Triton kernel.
+    Only used inside :class:`_FlashMultiHeadAttention` when Flash Attention 2
+    is available.  The ``forward`` signature differs from :class:`RotaryEmbedding`
+    because Flash Attention packs Q, K, V together.
+    """
+    def forward(
+        self, qkv: torch.Tensor, cu_seqlens: torch.Tensor, max_seqlen: int
+    ) -> torch.Tensor:  # type: ignore[override]
+        """Apply RoPE in-place to a packed ``(N, 3, n_heads, head_dim)`` tensor."""
+        self._update_cos_sin_cache(max_seqlen, device=qkv.device, dtype=qkv.dtype)
+        assert self._cos_cached is not None and self._sin_cached is not None
+        assert apply_triton_rotary is not None
+        # Rotate Q and K in place; V (index 2) is left untouched.
+        apply_triton_rotary(
+            qkv[:, 0],
+            self._cos_cached,
+            self._sin_cached,
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen,
+            interleaved=self.interleaved,
+            inplace=True,
+        )
+        apply_triton_rotary(
+            qkv[:, 1],
+            self._cos_cached,
+            self._sin_cached,
+            cu_seqlens=cu_seqlens,
+            max_seqlen=max_seqlen,
+            interleaved=self.interleaved,
+            inplace=True,
+        )
+        return qkv
+# ---------------------------------------------------------------------------
+# Feed-forward network helpers
+# ---------------------------------------------------------------------------
+def _swiglu_hidden_dim(expansion_ratio: float, d_model: int) -> int:
+    """Round ``expansion_ratio * d_model`` up to the next multiple of 256."""
+    return int(((expansion_ratio * d_model) + 255) // 256 * 256)
+class _SwiGLU(nn.Module):
+    """SwiGLU activation: ``silu(x1) * x2`` where ``x`` is split along the last dim."""
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x1, x2 = x.chunk(2, dim=-1)
+        return F.silu(x1) * x2
+class _PyTorchLayerNormLinear(nn.Module):
+    """LayerNorm followed by a Linear projection, sharing the parameter
+    names ``layer_norm_weight``, ``layer_norm_bias`` and ``weight`` so the
+    state-dict layout matches the accelerated TE module loaded on GPU.
+    """
+    def __init__(self, d_in: int, d_out: int, eps: float = 1e-5) -> None:
+        super().__init__()
+        self.d_in = d_in
+        self.eps = eps
+        self.layer_norm_weight = nn.Parameter(torch.ones(d_in))
+        self.layer_norm_bias = nn.Parameter(torch.zeros(d_in))
+        self.weight = nn.Parameter(torch.empty(d_out, d_in))
+        nn.init.normal_(self.weight, std=0.02)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = F.layer_norm(
+            x, (self.d_in,), self.layer_norm_weight, self.layer_norm_bias, self.eps
+        )
+        return F.linear(x, self.weight)
+class _PyTorchLayerNormMLP(nn.Module):
+    """LayerNorm + SwiGLU MLP, sharing the parameter names
+    ``layer_norm_weight``, ``layer_norm_bias``, ``fc1_weight``,
+    ``fc2_weight`` so the state-dict layout matches the accelerated TE
+    module loaded on GPU.
+    """
+    def __init__(
+        self, hidden_size: int, ffn_hidden_size: int, eps: float = 1e-5
+    ) -> None:
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.ffn_hidden_size = ffn_hidden_size
+        self.eps = eps
+        self.layer_norm_weight = nn.Parameter(torch.ones(hidden_size))
+        self.layer_norm_bias = nn.Parameter(torch.zeros(hidden_size))
+        self.fc1_weight = nn.Parameter(torch.empty(2 * ffn_hidden_size, hidden_size))
+        self.fc2_weight = nn.Parameter(torch.empty(hidden_size, ffn_hidden_size))
+        nn.init.normal_(self.fc1_weight, std=0.02)
+        nn.init.normal_(self.fc2_weight, std=0.02)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = F.layer_norm(
+            x,
+            (self.hidden_size,),
+            self.layer_norm_weight,
+            self.layer_norm_bias,
+            self.eps,
+        )
+        x = F.linear(x, self.fc1_weight)
+        x1, x2 = x.chunk(2, dim=-1)
+        x = F.silu(x1) * x2
+        return F.linear(x, self.fc2_weight)
+def _swiglu_ln_ffn(d_model: int, expansion_ratio: float, bias: bool) -> nn.Module:
+    """LayerNorm + SwiGLU MLP. Uses Transformer Engine's fused LN+MLP when
+    available; otherwise returns the pure-PyTorch fallback with matching
+    state-dict layout."""
+    assert not bias, "ESMC was trained with bias=False; bias=True not supported"
+    hidden = _swiglu_hidden_dim(expansion_ratio, d_model)
+    if _te_available:
+        return te.LayerNormMLP(  # type: ignore[union-attr]
+            hidden_size=d_model,
+            ffn_hidden_size=hidden,
+            bias=bias,
+            activation="swiglu",
+            init_method=None,
+            output_layer_init_method=None,
+        )
+    return _PyTorchLayerNormMLP(hidden_size=d_model, ffn_hidden_size=hidden)
+def _make_attn_layernorm_qkv(d_model: int, bias: bool) -> nn.Module:
+    """LayerNorm + fused QKV projection. Uses Transformer Engine when
+    available; pure-PyTorch fallback otherwise."""
+    assert not bias, "ESMC was trained with bias=False; bias=True not supported"
+    if _te_available:
+        return te.LayerNormLinear(  # type: ignore[union-attr]
+            d_model, d_model * 3, bias=bias, init_method=None
+        )
+    return _PyTorchLayerNormLinear(d_model, d_model * 3)
+def _make_attn_out_proj(d_model: int, bias: bool) -> nn.Module:
+    """Attention output projection. Uses Transformer Engine when available;
+    pure-PyTorch ``nn.Linear`` otherwise."""
+    if _te_available:
+        return te.Linear(  # type: ignore[union-attr]
+            d_model, d_model, bias=bias, init_method=None
+        )
+    return nn.Linear(d_model, d_model, bias=bias)
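+def _gelu_ln_ffn(d_model: int, expansion_ratio: float, bias: bool) -> nn.Sequential:
+    """LayerNorm + GELU MLP (pure PyTorch) with hidden size
+    ``int(expansion_ratio * d_model)``; no rounding to a multiple of 256."""
+    hidden = int(expansion_ratio * d_model)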
+    return nn.Sequential(
+        nn.LayerNorm(d_model),
+        nn.Linear(d_model, hidden, bias=bias),
+        nn.GELU(),
+        nn.Linear(hidden, d_model, bias=bias),
+    )
+# ---------------------------------------------------------------------------
+# Attention
+# ---------------------------------------------------------------------------
+def _scaled_dot_product_attention(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    *,
+    n_heads: int,
+    d_head: int,
+    seq_id: torch.Tensor | None,
+) -> torch.Tensor:
+    """Scaled dot-product attention with optional chain-aware mask.
+    Dispatches in order of preference:
+      1. xformers ``memory_efficient_attention`` — preferred fused kernel,
+         requires ``xformers``, no chain mask.
+      2. Flash Attention 2 (``flash_attn.flash_attn_func``) — secondary
+         fused kernel, requires ``flash-attn``, no chain mask, fp16 /
+         bf16 only.
+      3. PyTorch's ``F.scaled_dot_product_attention`` — last-resort path;
+         also handles the chain-aware mask when ``seq_id`` is present
+         and the fp32 path that Flash Attention 2 does not support.
+    """
+    if seq_id is None and _xformers_available:
+        b, s, _ = q.shape
+        q4 = q.view(b, s, n_heads, d_head)
+        k4 = k.view(b, s, n_heads, d_head)
+        v4 = v.view(b, s, n_heads, d_head)
+        context = xops.memory_efficient_attention(  # type: ignore[union-attr]
+            q4, k4, v4, attn_bias=None, scale=d_head**-0.5
+        )
+        return context.reshape(b, s, n_heads * d_head)
+    if (
+        seq_id is None
+        and _flash_attn_available
+        and q.dtype in (torch.float16, torch.bfloat16)
+    ):
+        b, s, _ = q.shape
+        q4 = q.view(b, s, n_heads, d_head)
+        k4 = k.view(b, s, n_heads, d_head)
+        v4 = v.view(b, s, n_heads, d_head)
+        context = flash_attn_func(  # type: ignore[misc]
+            q4, k4, v4, dropout_p=0.0, softmax_scale=d_head**-0.5
+        )
+        return context.reshape(b, s, n_heads * d_head)  # type: ignore[union-attr]
+    b, s, _ = q.shape
+    q = q.view(b, s, n_heads, -1).transpose(1, 2)
+    k = k.view(b, s, n_heads, -1).transpose(1, 2)
+    v = v.view(b, s, n_heads, -1).transpose(1, 2)
+    if seq_id is not None:
+        mask = (seq_id.unsqueeze(-1) == seq_id.unsqueeze(-2)).unsqueeze(1)
+        context = F.scaled_dot_product_attention(q, k, v, mask)
+    else:
+        context = F.scaled_dot_product_attention(q, k, v)
+    _, h, _, d_out = context.shape
+    return context.transpose(1, 2).reshape(b, s, h * d_out)
+class MultiHeadAttention(nn.Module):
+    """Multi-head self-attention with QK LayerNorm and RoPE.
+    Args:
+        d_model: Model hidden dimension.
+        n_heads: Number of attention heads.
+        bias: Whether to use bias in linear layers.
+        qk_layernorm: Whether to apply LayerNorm to queries and keys before
+            computing attention scores.
+    """
+    def __init__(
+        self, d_model: int, n_heads: int, bias: bool = False, qk_layernorm: bool = True
+    ):
+        super().__init__()
+        self.d_model = d_model
+        self.n_heads = n_heads
+        self.d_head = d_model // n_heads
+        assert not bias, "ESMC was trained with bias=False; bias=True not supported"
+        self.layernorm_qkv = _make_attn_layernorm_qkv(d_model, bias)
+        self.out_proj = _make_attn_out_proj(d_model, bias)
+        if qk_layernorm:
+            self.q_ln = nn.LayerNorm(d_model, bias=bias)
+            self.k_ln = nn.LayerNorm(d_model, bias=bias)
+        else:
+            self.q_ln = nn.Identity()
+            self.k_ln = nn.Identity()
+        self.rotary = RotaryEmbedding(d_model // n_heads)
+    def _apply_rotary(
+        self, q: torch.Tensor, k: torch.Tensor
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        q = q.unflatten(-1, (self.n_heads, self.d_head))
+        k = k.unflatten(-1, (self.n_heads, self.d_head))
+        q, k = self.rotary(q, k)
+        q = q.flatten(-2, -1)
+        k = k.flatten(-2, -1)
+        return q, k
+    def forward(
+        self,
+        x: torch.Tensor,
+        seq_id: torch.Tensor | None,
+        output_attentions: bool = False,
+    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+        """Return ``(context, attn_weights)``.
+        ``attn_weights`` is ``None`` unless ``output_attentions=True``. The
+        fused SDPA backends (xformers, flash-attn 2, ``F.scaled_dot_product_attention``)
+        don't expose attention probabilities, so capturing them forces a manual
+        ``softmax(Q @ K.T / sqrt(d)) @ V`` path that materializes the full
+        ``(B, H, L, L)`` weight tensor.
+        """
+        qkv = self.layernorm_qkv(x)
+        q, k, v = torch.chunk(qkv, 3, dim=-1)
+        q = self.q_ln(q).to(q.dtype)
+        k = self.k_ln(k).to(q.dtype)
+        q, k = self._apply_rotary(q, k)
+        b, s, _ = q.shape
+        if output_attentions:
+            # Manual SDPA so attention probabilities are observable.
+            q4 = q.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
+            k4 = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
+            v4 = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
+            scale = self.d_head**-0.5
+            attn_scores = (q4 @ k4.transpose(-2, -1)) * scale
+            if seq_id is not None:
+                mask = (seq_id.unsqueeze(-1) == seq_id.unsqueeze(-2)).unsqueeze(1)
+                attn_scores = attn_scores.masked_fill(~mask, float("-inf"))
+            attn_weights = torch.softmax(attn_scores, dim=-1)
+            context = (attn_weights @ v4).transpose(1, 2).reshape(b, s, -1)
+            return self.out_proj(context), attn_weights
+        context = _scaled_dot_product_attention(
+            q, k, v, n_heads=self.n_heads, d_head=self.d_head, seq_id=seq_id
+        )
+        return self.out_proj(context), None
+class _FlashMultiHeadAttention(MultiHeadAttention):
+    """Flash-Attention 2 variant of :class:`MultiHeadAttention`."""
+    def __init__(
+        self, d_model: int, n_heads: int, bias: bool = False, qk_layernorm: bool = True
+    ):
+        super().__init__(
+            d_model=d_model, n_heads=n_heads, bias=bias, qk_layernorm=qk_layernorm
+        )
+        self.rotary = _TritonRotaryEmbedding(d_model // n_heads)
+    def forward(
+        self,
+        x: torch.Tensor,
+        seq_id: torch.Tensor | None,
+        output_attentions: bool = False,
+    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+        if output_attentions:
+            raise ValueError(
+                "output_attentions=True is not supported with "
+                "attn_implementation='flash_attention_2'. "
+                "Re-load the model with attn_implementation='sdpa' (or 'eager')."
+            )
+        assert seq_id is not None and seq_id.dtype == torch.bool
+        seqlens = seq_id.sum(dim=-1, dtype=torch.int32)
+        cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
+        max_seqlen = int(seqlens.max().item())
+        qkv = self.layernorm_qkv(x)
+        q, k, v = torch.chunk(qkv, 3, dim=-1)
+        q = self.q_ln(q).to(q.dtype)
+        k = self.k_ln(k).to(q.dtype)
+        # ``q``/``k``/``v`` are 2D ``(T, D)`` here: the parent ``ESMCModel.forward``
+        # calls ``unpad_input`` before the transformer stack to produce the
+        # varlen-flat layout that ``flash_attn_varlen_qkvpacked_func`` requires.
+        T = q.shape[0]
+        qkv_packed = torch.stack([q, k, v], dim=1).view(T, 3, self.n_heads, self.d_head)
+        qkv_packed = self.rotary(qkv_packed, cu_seqlens, max_seqlen)
+        context = flash_attn_varlen_qkvpacked_func(  # type: ignore[misc]
+            qkv_packed, cu_seqlens, max_seqlen, softmax_scale=self.d_head**-0.5
+        )
+        n_out, h_out, d_out = context.shape  # type: ignore[union-attr]
+        return (
+            self.out_proj(context.reshape(n_out, h_out * d_out)),  # type: ignore[union-attr]
+            None,
+        )
+# ---------------------------------------------------------------------------
+# Transformer blocks
+# ---------------------------------------------------------------------------
+class UnifiedTransformerBlock(nn.Module):
+    """Single transformer block: pre-norm attention + pre-norm FFN with residual scaling.
+    Args:
+        d_model: Hidden dimension.
+        n_heads: Number of attention heads.
+        use_flash_attn: Use Flash Attention 2 kernel if available.
+        bias: Whether linear layers include bias terms.
+        expansion_ratio: Hidden-dim expansion ratio for the FFN.
+        residue_scaling_factor: Divisor applied to the attention and FFN
+            outputs before they are added to the residual stream, to stabilise
+            deep networks (the ESM3 scheme uses ``sqrt(n_layers / 36)``).
+        qk_layernorm: Whether to apply QK LayerNorm in attention.
+        ffn_type: Feed-forward activation: ``"swiglu"`` or ``"gelu"``.
+    """
+    def __init__(
+        self,
+        d_model: int,
+        n_heads: int,
+        use_flash_attn: bool = False,
+        bias: bool = False,
+        expansion_ratio: float = 4.0,
+        residue_scaling_factor: float = 1.0,
+        qk_layernorm: bool = True,
+        ffn_type: str = "swiglu",
+    ):
+        super().__init__()
+        attn_cls = _FlashMultiHeadAttention if use_flash_attn else MultiHeadAttention
+        self.attn = attn_cls(d_model, n_heads, bias=bias, qk_layernorm=qk_layernorm)
+        if ffn_type == "swiglu":
+            self.ffn = _swiglu_ln_ffn(d_model, expansion_ratio, bias)
+        elif ffn_type == "gelu":
+            self.ffn = _gelu_ln_ffn(d_model, expansion_ratio, bias)
+        else:
+            raise ValueError(
+                f"Unknown ffn_type: {ffn_type!r}. Choose 'swiglu' or 'gelu'."
+            )
+        self.scaling_factor = residue_scaling_factor
+    def forward(
+        self,
+        x: torch.Tensor,
+        sequence_id: torch.Tensor | None,
+        output_attentions: bool = False,
+    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+        """
+        Args:
+            x: ``(batch, seq_len, d_model)``
+            sequence_id: ``(batch, seq_len)`` chain-ID tensor used to restrict
+                attention to tokens within the same chain. SDPA blocks accept
+                an integer tensor (``-1`` marks padding); the flash-attn block
+                takes a ``bool`` padding mask — the caller selects which.
+                ``None`` skips chain-aware masking entirely (fast path).
+            output_attentions: When ``True``, returns the per-head attention
+                weights for this block alongside the residual output.
+        Returns:
+            ``(output, attn_weights_or_None)``.  Shape of ``output`` is
+            ``(batch, seq_len, d_model)``; ``attn_weights`` shape is
+            ``(batch, num_heads, seq_len, seq_len)`` or ``None``.
+        """
+        attn_out, attn_weights = self.attn(
+            x, sequence_id, output_attentions=output_attentions
+        )
+        x = x + attn_out / self.scaling_factor
+        x = x + self.ffn(x) / self.scaling_factor
+        return x, attn_weights
+class TransformerStack(nn.Module):
+    """Stack of :class:`UnifiedTransformerBlock` layers with a final LayerNorm.
+    Args:
+        d_model: Hidden dimension.
+        n_heads: Number of attention heads.
+        n_layers: Number of transformer blocks.
+        scale_residue: When ``True`` divide each block's residual branches by
+            ``sqrt(n_layers / 36)`` (the ESM3 residue scaling).
+        bias: Bias flag forwarded to every sub-module.
+        qk_layernorm: QK LayerNorm flag forwarded to every block.
+        ffn_type: FFN activation type (``"swiglu"`` or ``"gelu"``).
+        expansion_ratio: FFN expansion ratio.
+        use_flash_attn: Use Flash Attention 2 kernel when available.
+    """
+    def __init__(
+        self,
+        d_model: int,
+        n_heads: int,
+        n_layers: int,
+        scale_residue: bool = True,
+        bias: bool = False,
+        qk_layernorm: bool = True,
+        ffn_type: str = "swiglu",
+        expansion_ratio: float = 8 / 3,
+        use_flash_attn: bool = False,
+    ):
+        super().__init__()
+        self.blocks = nn.ModuleList(
+            [
+                UnifiedTransformerBlock(
+                    d_model,
+                    n_heads,
+                    use_flash_attn=use_flash_attn,
+                    residue_scaling_factor=math.sqrt(n_layers / 36)
+                    if scale_residue
+                    else 1.0,
+                    expansion_ratio=expansion_ratio,
+                    bias=bias,
+                    qk_layernorm=qk_layernorm,
+                    ffn_type=ffn_type,
+                )
+                for _ in range(n_layers)
+            ]
+        )
+        self.norm = nn.LayerNorm(d_model, bias=False)
+    def forward(
+        self,
+        x: torch.Tensor,
+        sequence_id: torch.Tensor | None = None,
+        layers_to_collect: list[int] | None = None,
+        output_attentions: bool = False,
+    ) -> tuple[
+        torch.Tensor,
+        torch.Tensor,
+        tuple[torch.Tensor, ...],
+        tuple[torch.Tensor, ...] | None,
+    ]:
+        """Run the full transformer stack.
+        Args:
+            x: ``(batch, seq_len, d_model)``
+            sequence_id: Optional chain-id tensor forwarded to each block.
+            layers_to_collect: Layer indices (0-based pre-block inputs plus
+                ``n_layers`` for the post-norm output) whose hidden states
+                should be returned.
+            output_attentions: When ``True``, collects the per-block attention
+                weights and returns them as the fourth tuple element.
+        Returns:
+            ``(post_norm, pre_norm, hidden_states, attentions)`` where
+            ``hidden_states`` is a (possibly empty) tuple of tensors and
+            ``attentions`` is a tuple of per-block ``(B, H, L, L)`` tensors
+            or ``None`` when ``output_attentions`` is ``False``.
+        """
+        if layers_to_collect is None:
+            layers_to_collect = []
+        collected: list[torch.Tensor] = []
+        all_attentions: list[torch.Tensor] = []
+        for layer_idx, block in enumerate(self.blocks):
+            if layer_idx in layers_to_collect:
+                collected.append(x)
+            x, attn_weights = block(x, sequence_id, output_attentions=output_attentions)
+            if output_attentions and attn_weights is not None:
+                all_attentions.append(attn_weights)
+        norm_x = self.norm(x)
+        if len(self.blocks) in layers_to_collect:
+            collected.append(norm_x)
+        attentions = tuple(all_attentions) if output_attentions else None
+        return norm_x, x, tuple(collected), attentions
+# ---------------------------------------------------------------------------
+# Pre-trained model base class
+# ---------------------------------------------------------------------------
+@auto_docstring
+class ESMCPreTrainedModel(PreTrainedModel):
+    """Base class for ESMC models.
+    Handles weight initialisation and declares module-level capabilities.
+    """
+    config_class = ESMCConfig
+    base_model_prefix = "esmc"
+    supports_gradient_checkpointing = False
+    _supports_sdpa = True
+    _supports_flash_attn = True
+    _supports_attention_backend = True
+    _no_split_modules = ["UnifiedTransformerBlock"]
+    _keys_to_ignore_on_load_unexpected = [r"\._extra_state$"]
+    def _init_weights(self, module: nn.Module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, RotaryEmbedding):
+            module.reset_parameters(device=self.device)
+# ---------------------------------------------------------------------------
+# Base encoder model
+# ---------------------------------------------------------------------------
+@auto_docstring
+class ESMCModel(ESMCPreTrainedModel):
+    """The bare ESMC encoder outputting raw hidden states.
+    ESMC is a protein language model trained by EvolutionaryScale using a
+    masked-token objective over amino acid sequences.  The architecture is a
+    standard Transformer encoder with RoPE positional embeddings, QK LayerNorm,
+    and SwiGLU feed-forward networks.
+    Args:
+        config: An :class:`ESMCConfig` instance.
+    """
+    def __init__(self, config: ESMCConfig):
+        super().__init__(config)
+        self._use_flash_attn = (
+            _flash_attn_available and config._attn_implementation == "flash_attention_2"
+        )
+        self.embed = nn.Embedding(config.vocab_size, config.d_model)
+        self.transformer = TransformerStack(
+            config.d_model,
+            config.n_heads,
+            config.n_layers,
+            use_flash_attn=self._use_flash_attn,
+        )
+        self._sae_models: nn.ModuleDict = nn.ModuleDict()
+        self.post_init()
+    def get_input_embeddings(self) -> nn.Embedding:
+        return self.embed
+    def set_input_embeddings(self, value: nn.Embedding):
+        self.embed = value
+    def add_sae_models(self, sae_models: list[_ESMCSAELayer]) -> None:
+        """Register one or more SAEs obtained from an :class:`ESMCSAEModel`.
+        Each is keyed by ``f"layer{N}"`` (the backbone-layer index ``N`` the
+        SAE is trained against, set by
+        :meth:`ESMCSAEModel.initialize_layers`). Attaching two SAEs for the
+        same backbone layer raises — only one SAE per layer can be active.
+        Example::
+            sae = ESMCSAEModel.from_pretrained(
+                "biohub/esmc-600m-2024-12-sae-k64-codebook16384"
+            )
+            sae.initialize_layers([27, 33])
+            model.add_sae_models([sae.layers["27"], sae.layers["33"]])
+        """
+        for layer in sae_models:
+            assert isinstance(layer, _ESMCSAELayer), (
+                f"Expected an SAE layer (model.layers['<idx>']), got "
+                f"{type(layer).__name__}."
+            )
+            key = f"layer{int(layer.layer)}"
+            if key in self._sae_models:
+                raise ValueError(
+                    f"An SAE is already registered at {key!r}. Only one SAE "
+                    "per backbone layer can be active — pick a different "
+                    "layer on one of them, or attach in a fresh model."
+                )
+            self._sae_models[key] = layer
+    _SAE_KEY_RE = re.compile(r"layer(\d+)")
+    def _get_sae_layer_num_requested(self, model_name: str) -> int:
+        """Recover the backbone-layer index from a key written by
+        :meth:`add_sae_models` (``"layer{N}"`` → ``N``)."""
+        match = self._SAE_KEY_RE.fullmatch(model_name)
+        assert (
+            match is not None
+        ), f"Unexpected SAE key {model_name!r}; expected 'layer{{N}}'."
+        return int(match.group(1))
+    def _validate_sae_inputs(self, input_ids: torch.Tensor) -> None:
+        assert torch.all(input_ids != self.config.mask_token_id), (
+            "SAE inputs must not contain mask tokens. "
+            "SAEs were trained on unmasked sequences."
+        )
+    def _get_sae_outputs(
+        self,
+        hidden_states: torch.Tensor,
+        layers_to_collect: list[int],
+        token_mask: torch.Tensor,
+        normalize_sae: bool = False,
+    ) -> dict[str, torch.Tensor]:
+        """Run all registered SAEs and return their feature magnitudes.
+        Args:
+            hidden_states: Stacked tensor of shape
+                ``(len(layers_to_collect), batch, seq_len, d_model)``.
+            layers_to_collect: The ESMC layer indices that were collected,
+                in the same order as the first dim of ``hidden_states``.
+            token_mask: Boolean mask ``(batch, seq_len)`` — ``True`` for
+                real (non-padding) tokens.
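+            normalize_sae: When ``True``, scale features by ``idf / max``
+                using the per-feature stats trained alongside each SAE.
+        Returns:
+            Dict mapping each ``"layer{N}"`` key to a sparse tensor of shape
+            ``(num_non_padding_tokens, codebook_dim)``; padding positions are
+            dropped by ``token_mask`` before encoding.
+        """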
+        layer_to_idx = {layer: idx for idx, layer in enumerate(layers_to_collect)}
+        sae_outputs: dict[str, torch.Tensor] = {}
+        for model_name, sae_module in self._sae_models.items():
+            # `nn.ModuleDict` only stores `nn.Module`s at the type level;
+            # ``add_sae_models`` enforces that each entry is an ``_ESMCSAELayer``.
+            assert isinstance(sae_module, _ESMCSAELayer)
+            layer: _ESMCSAELayer = sae_module
+            requested_layer = self._get_sae_layer_num_requested(model_name)
+            layer_idx = layer_to_idx[requested_layer]
+            layer_states = hidden_states[layer_idx].clone().to(self.device)
+            sae_out = layer.get_sae_output(layer_states, token_mask)
+            features = sae_out.feature_magnitudes.detach()
+            if normalize_sae:
+                # ``register_buffer`` is typed as ``Tensor | Module`` on
+                # ``nn.Module``; narrow here since these are Tensors.
+                idf = cast(torch.Tensor, layer.idf)
+                max_val = cast(torch.Tensor, layer.max)
+                features = (features / max_val) * idf
+            sae_outputs[model_name] = features.to_sparse()
+        return sae_outputs
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        sequence_id: Optional[torch.Tensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        compute_sae: bool = True,
+        normalize_sae: bool = False,
+    ) -> tuple[torch.Tensor, ...] | ESMCOutput:
+        r"""
+        sequence_id (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Integer chain-ID tensor for chain-aware attention masking. Tokens with the same
+            non-negative integer value can attend to each other; tokens with different values
+            cannot (cross-chain masking). Padding positions should be set to ``-1``.
+            When provided, ``attention_mask`` is ignored. The ``flash_attention_2`` backend
+            only supports single-chain inputs (all non-padding values must be ``0``); pass
+            multi-chain ``sequence_id`` with ``attn_implementation='sdpa'`` (or ``'eager'``).
+        output_attentions (`bool`, *optional*):
+            Whether to return the per-block attention weights of shape
+            ``(batch_size, num_heads, sequence_length, sequence_length)``.
+            Forces a manual-SDPA path inside :class:`MultiHeadAttention` so the
+            attention probabilities are observable; raises on the
+            ``flash_attention_2`` path.
+        compute_sae (`bool`, *optional*, defaults to ``True``):
+            Whether to run any SAE models registered via :meth:`add_sae_models`.
+            Has no effect when no SAEs are registered.
+        normalize_sae (`bool`, *optional*, defaults to ``False``):
+            When ``True``, scale SAE feature magnitudes by ``idf / max``. This is a
+            no-op for SAEs whose normalization buffers keep their default value of ones.
+        Examples:
+        ```python
+        >>> from transformers import AutoTokenizer, ESMCModel
+        >>> model = ESMCModel.from_pretrained("Biohub/ESMC-600M-2024-12")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Biohub/ESMC-600M-2024-12")
+        >>> inputs = tokenizer(["MLKNVQVQLV"], return_tensors="pt")
+        >>> outputs = model(**inputs)
+        >>> outputs.last_hidden_state.shape
+        torch.Size([1, 12, 1152])
+        ```
+        """
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        output_attentions = (
+            output_attentions
+            if output_attentions is not None
+            else self.config.output_attentions
+        )
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+        output_sae = compute_sae and len(self._sae_models) > 0
+        # Determine which intermediate layers to collect.  When SAEs are
+        # registered we must collect at least the layers they target, even if
+        # the caller did not ask for all hidden states.
+        if output_hidden_states:
+            layers_to_collect: list[int] = list(range(self.config.n_layers + 1))
+        elif output_sae:
+            layers_to_collect = sorted(
+                {self._get_sae_layer_num_requested(name) for name in self._sae_models}
+            )
+        else:
+            layers_to_collect = []
+        user_supplied_sequence_id = sequence_id is not None
+        if sequence_id is not None:
+            bool_mask = sequence_id >= 0
+        else:
+            if attention_mask is None:
+                attention_mask = input_ids != self.config.pad_token_id
+            assert attention_mask is not None
+            bool_mask = attention_mask.bool()
+            sequence_id = bool_mask.to(torch.long) - 1
+        x = self.embed(input_ids)
+        b, l_ = x.shape[:2]
+        if self._use_flash_attn:
+            if user_supplied_sequence_id and (sequence_id > 0).any():
+                raise ValueError(
+                    "Multi-chain ``sequence_id`` (any value > 0) is not "
+                    "supported with attn_implementation='flash_attention_2'. "
+                    "Re-load the model with attn_implementation='sdpa' (or "
+                    "'eager') for chain-aware attention masking."
+                )
+            assert unpad_input is not None
+            x, indices, *_ = unpad_input(x, bool_mask)
+        else:
+            indices = None
+        if self._use_flash_attn:
+            trans_seq_id = bool_mask
+        elif user_supplied_sequence_id:
+            trans_seq_id = sequence_id
+        elif bool_mask.all() and not output_attentions:
+            # Fused SDPA fast path (xformers / flash) is correct only when the
+            # mask is uniform; output_attentions forces the manual branch.
+            trans_seq_id = None
+        else:
+            trans_seq_id = sequence_id
+        last_hidden_state, _, collected, attentions = self.transformer(
+            x,
+            sequence_id=trans_seq_id,
+            layers_to_collect=layers_to_collect,
+            output_attentions=output_attentions,
+        )
+        if self._use_flash_attn:
+            assert indices is not None and pad_input is not None
+            last_hidden_state = pad_input(last_hidden_state, indices, b, l_)
+            collected = [pad_input(h, indices, b, l_) for h in collected]
+        # Stack once; reused for both SAE and hidden-state output.
+        collected_tensor: torch.Tensor | None = (
+            torch.stack(collected, dim=0) if collected else None  # type: ignore[arg-type]
+        )
+        sae_outputs: dict[str, torch.Tensor] | None = None
+        if output_sae and collected_tensor is not None:
+            assert input_ids is not None
+            self._validate_sae_inputs(input_ids)
+            sae_outputs = self._get_sae_outputs(
+                collected_tensor, layers_to_collect, bool_mask, normalize_sae
+            )
+        hidden_states_tensor = collected_tensor if output_hidden_states else None
+        if not return_dict:
+            return tuple(
+                v
+                for v in [
+                    last_hidden_state,
+                    hidden_states_tensor,
+                    sae_outputs,
+                    attentions,
+                ]
+                if v is not None
+            )
+        return ESMCOutput(
+            last_hidden_state=last_hidden_state,
+            hidden_states=hidden_states_tensor,
+            sae_outputs=sae_outputs,
+            attentions=attentions,
+        )
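The mask handling in `ESMCModel.forward` can be sketched without tensors. The block below is a minimal illustration in plain Python lists, not the model's code. `sequence_id_from_mask` mirrors `bool_mask.to(torch.long) - 1`. `chain_attention_mask` is a hypothetical helper encoding the documented rule that tokens attend only within the same non-negative chain ID. How `TransformerStack` builds its mask is not shown in this diff, so the second helper is an assumption.

```python
def sequence_id_from_mask(attention_mask):
    """Map real tokens (1) to chain 0 and padding (0) to -1."""
    return [[int(bool(m)) - 1 for m in row] for row in attention_mask]


def chain_attention_mask(sequence_id):
    """Pairwise mask for one sequence: True where i may attend to j.

    Assumed semantics: both positions must share the same chain ID and
    that ID must be non-negative, so padding (-1) attends to nothing.
    """
    return [[(a == b) and a >= 0 for b in sequence_id] for a in sequence_id]


# Single chain with one padding position.
print(sequence_id_from_mask([[1, 1, 1, 0]]))
# Two chains (IDs 0 and 1) followed by padding.
for row in chain_attention_mask([0, 0, 1, -1]):
    print(row)
```

With a user-supplied multi-chain `sequence_id` such as `[0, 0, 1, -1]`, the flash-attention path in `forward` raises, so this kind of input needs `attn_implementation='sdpa'` or `'eager'`.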
+# ---------------------------------------------------------------------------
+# LM head
+# ---------------------------------------------------------------------------
+def _esmc_lm_head(
+    d_model: int, output_dim: int, hidden_dim: int | None = None
+) -> nn.Sequential:
+    """Linear → GELU → LayerNorm → Linear projection head for masked LM."""
+    hidden_dim = hidden_dim if hidden_dim is not None else d_model
+    return nn.Sequential(
+        nn.Linear(d_model, hidden_dim),
+        nn.GELU(),
+        nn.LayerNorm(hidden_dim),
+        nn.Linear(hidden_dim, output_dim),
+    )
+# ---------------------------------------------------------------------------
+# Masked language model
+# ---------------------------------------------------------------------------
+@auto_docstring
+class ESMCForMaskedLM(ESMCPreTrainedModel):
+    """ESMC with a masked language modelling head.
+    This is the primary pre-training objective of ESMC.  The LM head consists
+    of a single hidden layer with GELU activation followed by LayerNorm and a
+    linear projection to ``vocab_size``.
+    """
+    def __init__(self, config: ESMCConfig):
+        super().__init__(config)
+        self.esmc = ESMCModel(config)
+        self.lm_head = _esmc_lm_head(config.d_model, config.vocab_size)
+        self.post_init()
+    def get_output_embeddings(self) -> nn.Linear:
+        return self.lm_head[-1]  # type: ignore[return-value]
+    def set_output_embeddings(self, new_embeddings: nn.Linear):
+        self.lm_head[-1] = new_embeddings
+    def add_sae_models(self, sae_models: list[_ESMCSAELayer]) -> None:
+        """Proxy to :meth:`ESMCModel.add_sae_models`."""
+        self.esmc.add_sae_models(sae_models)
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        sequence_id: Optional[torch.Tensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        labels: Optional[torch.Tensor] = None,
+        compute_sae: bool = True,
+        normalize_sae: bool = False,
+    ) -> tuple[torch.Tensor, ...] | ESMCMaskedLMOutput:
+        r"""
+        sequence_id (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Integer chain-ID tensor forwarded to the encoder for chain-aware
+            attention masking. See :meth:`ESMCModel.forward` for the encoding.
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for masked language modelling loss.  Positions with label ``-100``
+            are ignored.  Other positions must be in ``[0, config.vocab_size)``.
+        output_attentions (`bool`, *optional*):
+            Whether to return per-block attention weights. Forwarded to the
+            backbone; raises on the ``flash_attention_2`` path.
+        compute_sae (`bool`, *optional*, defaults to ``True``):
+            Whether to run registered SAE models. Has no effect when none are registered.
+        normalize_sae (`bool`, *optional*, defaults to ``False``):
+            When ``True``, scale SAE features by ``idf / max`` normalization buffers.
+        Examples:
+        ```python
+        >>> from transformers import AutoTokenizer, ESMCForMaskedLM
+        >>> import torch
+        >>> model = ESMCForMaskedLM.from_pretrained("Biohub/ESMC-600M-2024-12")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Biohub/ESMC-600M-2024-12")
+        >>> inputs = tokenizer(["MLKNVQ<mask>LV"], return_tensors="pt")
+        >>> outputs = model(**inputs)
+        >>> outputs.logits.shape
+        torch.Size([1, 11, 64])
+        ```
+        """
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+        encoder_outputs = self.esmc(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            sequence_id=sequence_id,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            return_dict=True,
+            compute_sae=compute_sae,
+            normalize_sae=normalize_sae,
+        )
+        logits = self.lm_head(encoder_outputs.last_hidden_state)
+        loss: torch.Tensor | None = None
+        if labels is not None:
+            loss = CrossEntropyLoss(ignore_index=-100)(
+                logits.view(-1, self.config.vocab_size), labels.view(-1)
+            )
+        if not return_dict:
+            return tuple(
+                v
+                for v in [
+                    loss,
+                    logits,
+                    encoder_outputs.last_hidden_state,
+                    encoder_outputs.hidden_states,
+                    encoder_outputs.sae_outputs,
+                    encoder_outputs.attentions,
+                ]
+                if v is not None
+            )
+        return ESMCMaskedLMOutput(
+            loss=loss,
+            logits=logits,
+            last_hidden_state=encoder_outputs.last_hidden_state,
+            hidden_states=encoder_outputs.hidden_states,
+            sae_outputs=encoder_outputs.sae_outputs,
+            attentions=encoder_outputs.attentions,
+        )
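The loss in `ESMCForMaskedLM.forward` relies on `CrossEntropyLoss(ignore_index=-100)`. As a rough sketch of what that computes, here is a plain-Python version over flattened positions. `masked_lm_loss` is a hypothetical name and the code stands in for, rather than reproduces, the PyTorch implementation.

```python
import math


def masked_lm_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocab entry).
    labels: list of target indices, with ignore_index marking skipped positions.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue
        # Numerically stable log-sum-exp of the row.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]
        count += 1
    # With no scored positions the mean is undefined, so return NaN.
    return total / count if count else float("nan")


# Only the first position contributes; the second is ignored.
print(masked_lm_loss([[0.0, 0.0], [5.0, 0.0]], [0, -100]))
```

In practice only the masked positions carry real labels, with every other position set to `-100`, so unmasked tokens do not contribute to the loss.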
+# ---------------------------------------------------------------------------
+# Classification heads
+# ---------------------------------------------------------------------------
+class _ESMCClassificationHead(nn.Module):
+    """Dense classification head applied to the ``<cls>`` token representation."""
+    def __init__(self, config: ESMCConfig):
+        super().__init__()
+        self.dense = nn.Linear(config.d_model, config.d_model)
+        self.dropout = nn.Dropout(config.classifier_dropout)
+        self.out_proj = nn.Linear(config.d_model, config.num_labels)
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        x = hidden_states[:, 0, :]  # <cls> token
+        x = self.dropout(x)
+        x = torch.tanh(self.dense(x))
+        x = self.dropout(x)
+        return self.out_proj(x)
+# ---------------------------------------------------------------------------
+# Sequence classification
+# ---------------------------------------------------------------------------
+@auto_docstring
+class ESMCForSequenceClassification(ESMCPreTrainedModel):
+    """ESMC with a sequence-level classification head.
+    A dense head (dropout, linear, tanh, dropout, linear) is applied to the
+    ``<cls>`` token representation.
+    Supports regression (``num_labels == 1``), single-label classification,
+    and multi-label classification.
+    """
+    def __init__(self, config: ESMCConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.esmc = ESMCModel(config)
+        self.classifier = _ESMCClassificationHead(config)
+        self.post_init()
+    def add_sae_models(self, sae_models: list[_ESMCSAELayer]) -> None:
+        """Proxy to :meth:`ESMCModel.add_sae_models`."""
+        self.esmc.add_sae_models(sae_models)
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        labels: Optional[torch.Tensor] = None,
+        compute_sae: bool = True,
+        normalize_sae: bool = False,
+    ) -> tuple[torch.Tensor, ...] | ESMCSequenceClassifierOutput:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for sequence classification loss.  Indices must be in
+            ``[0, config.num_labels - 1]``.  For regression pass a float
+            tensor of shape ``(batch_size,)``.
+        output_attentions (`bool`, *optional*):
+            Whether to return per-block attention weights. Forwarded to the
+            backbone; raises on the ``flash_attention_2`` path.
+        compute_sae (`bool`, *optional*, defaults to ``True``):
+            Whether to run registered SAE models. Has no effect when none are registered.
+        normalize_sae (`bool`, *optional*, defaults to ``False``):
+            When ``True``, scale SAE features by ``idf / max`` normalization buffers.
+        """
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+        encoder_outputs = self.esmc(
+            input_ids,
+            attention_mask=attention_mask,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            return_dict=True,
+            compute_sae=compute_sae,
+            normalize_sae=normalize_sae,
+        )
+        logits = self.classifier(encoder_outputs.last_hidden_state)
+        loss: torch.Tensor | None = None
+        if labels is not None:
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and labels.dtype in (torch.long, torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                loss = loss_fct(
+                    logits.squeeze() if self.num_labels == 1 else logits,
+                    labels.squeeze() if self.num_labels == 1 else labels,
+                )
+            elif self.config.problem_type == "single_label_classification":
+                loss = CrossEntropyLoss()(
+                    logits.view(-1, self.num_labels), labels.view(-1)
+                )
+            elif self.config.problem_type == "multi_label_classification":
+                loss = BCEWithLogitsLoss()(logits, labels)
+        if not return_dict:
+            return tuple(
+                v
+                for v in [
+                    loss,
+                    logits,
+                    encoder_outputs.last_hidden_state,
+                    encoder_outputs.hidden_states,
+                    encoder_outputs.sae_outputs,
+                    encoder_outputs.attentions,
+                ]
+                if v is not None
+            )
+        return ESMCSequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            last_hidden_state=encoder_outputs.last_hidden_state,
+            hidden_states=encoder_outputs.hidden_states,
+            sae_outputs=encoder_outputs.sae_outputs,
+            attentions=encoder_outputs.attentions,
+        )
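The `problem_type` fallback in `ESMCForSequenceClassification.forward` reduces to a small decision rule. The sketch below restates it as a standalone function. `infer_problem_type` and its `labels_are_integer` flag are hypothetical names standing in for the `labels.dtype in (torch.long, torch.int)` check.

```python
def infer_problem_type(num_labels, labels_are_integer):
    """Pick the loss family the way the forward pass does when
    config.problem_type is unset."""
    if num_labels == 1:
        return "regression"                   # MSELoss
    if num_labels > 1 and labels_are_integer:
        return "single_label_classification"  # CrossEntropyLoss
    return "multi_label_classification"       # BCEWithLogitsLoss


print(infer_problem_type(1, labels_are_integer=False))
print(infer_problem_type(5, labels_are_integer=True))
print(infer_problem_type(5, labels_are_integer=False))
```

Because the forward pass writes the inferred value back to `self.config.problem_type`, the first labelled batch fixes the loss for later calls. Setting `problem_type` explicitly in the config avoids depending on the dtype of that first batch.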
+# ---------------------------------------------------------------------------
+# Token classification
+# ---------------------------------------------------------------------------
+@auto_docstring
+class ESMCForTokenClassification(ESMCPreTrainedModel):
+    """ESMC with a per-token classification head.
+    Useful for per-residue tasks such as secondary structure prediction or
+    other per-residue labelling.
+    """
+    def __init__(self, config: ESMCConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.esmc = ESMCModel(config)
+        self.dropout = nn.Dropout(config.classifier_dropout)
+        self.classifier = nn.Linear(config.d_model, config.num_labels)
+        self.post_init()
+    def add_sae_models(self, sae_models: list[_ESMCSAELayer]) -> None:
+        """Proxy to :meth:`ESMCModel.add_sae_models`."""
+        self.esmc.add_sae_models(sae_models)
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        labels: Optional[torch.Tensor] = None,
+        compute_sae: bool = True,
+        normalize_sae: bool = False,
+    ) -> tuple[torch.Tensor, ...] | ESMCTokenClassifierOutput:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Per-token labels.  Indices must be in ``[0, config.num_labels - 1]``.
+            Positions with index ``-100`` are ignored in the loss.
+        output_attentions (`bool`, *optional*):
+            Whether to return per-block attention weights. Forwarded to the
+            backbone; raises on the ``flash_attention_2`` path.
+        compute_sae (`bool`, *optional*, defaults to ``True``):
+            Whether to run registered SAE models. Has no effect when none are registered.
+        normalize_sae (`bool`, *optional*, defaults to ``False``):
+            When ``True``, scale SAE features by ``idf / max`` normalization buffers.
+        """
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+        encoder_outputs = self.esmc(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            return_dict=True,
+            compute_sae=compute_sae,
+            normalize_sae=normalize_sae,
+        )
+        sequence_output = self.dropout(encoder_outputs.last_hidden_state)
+        logits = self.classifier(sequence_output)
+        loss: torch.Tensor | None = None
+        if labels is not None:
+            loss = CrossEntropyLoss(ignore_index=-100)(
+                logits.view(-1, self.num_labels), labels.to(logits.device).view(-1)
+            )
+        if not return_dict:
+            return tuple(
+                v
+                for v in [
+                    loss,
+                    logits,
+                    encoder_outputs.last_hidden_state,
+                    encoder_outputs.hidden_states,
+                    encoder_outputs.sae_outputs,
+                    encoder_outputs.attentions,
+                ]
+                if v is not None
+            )
+        return ESMCTokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            last_hidden_state=encoder_outputs.last_hidden_state,
+            hidden_states=encoder_outputs.hidden_states,
+            sae_outputs=encoder_outputs.sae_outputs,
+            attentions=encoder_outputs.attentions,
+        )
+__all__ = [
+    "ESMCModel",
+    "ESMCForMaskedLM",
+    "ESMCForSequenceClassification",
+    "ESMCForTokenClassification",
+    "ESMCPreTrainedModel",
+]
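One consequence of the `return_dict=False` branches in this file is that `None` entries are dropped, so tuple positions shift with the arguments passed. A minimal sketch of that filtering, using a hypothetical `to_tuple` helper and placeholder strings in place of tensors:

```python
def to_tuple(**fields):
    """Keep only non-None fields, preserving declaration order."""
    return tuple(v for v in fields.values() if v is not None)


# Without labels there is no loss, so logits land at index 0.
print(to_tuple(loss=None, logits="logits", last_hidden_state="hidden"))
# With labels the loss comes first and logits move to index 1.
print(to_tuple(loss="loss", logits="logits", last_hidden_state="hidden"))
```

Code that indexes the tuple output positionally therefore has to account for which optional outputs were requested. Attribute access on the returned output object avoids this ambiguity.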

modeling_esmc_sae.py ADDED

	@@ -0,0 +1,363 @@

+# Copyright 2026 Biohub. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch ESMC SAE (Sparse Autoencoder) model.
+* :class:`ESMCSAEModel` — the published HF container, one repo per
+  ``(backbone, codebook_dim, k)`` group. Each backbone layer ships as a
+  ``layer_{i}.safetensors`` shard; ``from_pretrained`` downloads the whole
+  snapshot but loads no weights — callers materialize the layers they need
+  via :meth:`initialize_layers`. Single-layer repos auto-load so bare
+  ``forward(x)`` works.
+* :class:`_ESMCSAELayer` — internal ``nn.Module`` that holds the weights for
+  one ``(backbone, codebook_dim, k, layer)`` SAE. Not a published HF artifact;
+  obtained only via ``model.layers["<idx>"]``.
+"""
+from __future__ import annotations
+import os
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from safetensors.torch import load_file, save_file
+from transformers.modeling_outputs import ModelOutput
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import auto_docstring
+from .configuration_esmc_sae import ESMCSAEConfig, ESMCSAEParams
+@dataclass
+@auto_docstring(
+    custom_intro="""
+    Output type of [`ESMCSAEModel`].
+    """
+)
+class ESMCSAEOutput(ModelOutput):
+    feature_magnitudes: torch.Tensor
+    reconstruction_loss: Optional[torch.Tensor] = None
+    def to_sparse(self) -> None:
+        self.feature_magnitudes = self.feature_magnitudes.to_sparse()
+class _ESMCSAELayer(nn.Module):
+    """One backbone layer's SAE — internal building block of :class:`ESMCSAEModel`.
+    Not exposed via ``AutoModel`` and not loadable on its own. Obtain one
+    via ``model.layers["<layer_idx>"]`` after calling ``initialize_layers``.
+    """
+    def __init__(self, params: ESMCSAEParams):
+        super().__init__()
+        self.params = params
+        self.W_enc = nn.Parameter(torch.empty(params.d_model, params.codebook_dim))
+        self.W_dec = nn.Parameter(torch.empty(params.codebook_dim, params.d_model))
+        self.b_dec = nn.Parameter(torch.zeros(params.d_model))
+        # Per-feature normalization stats. Trained alongside the SAE for some
+        # variants; for variants that don't ship them, leaving these as ones
+        # makes ``_get_sae_outputs``'s ``features / max * idf`` a no-op.
+        self.register_buffer("idf", torch.ones(params.codebook_dim))
+        self.register_buffer("max", torch.ones(params.codebook_dim))
+    @property
+    def layer(self) -> int:
+        """Backbone-layer index this SAE is trained against."""
+        return self.params.layer
+    def forward(self, x: torch.Tensor, **_kwargs: object) -> ESMCSAEOutput:
+        del _kwargs
+        x = self._zscore_normalize_representation(x)
+        x_with_pre_encoder_bias = x - self.b_dec
+        preactivations = F.relu(x_with_pre_encoder_bias @ self.W_enc)
+        topk = torch.topk(preactivations, self.params.k, dim=-1)
+        feature_magnitudes = torch.zeros_like(preactivations).scatter(
+            -1, topk.indices, topk.values
+        )
+        reconstructed = feature_magnitudes @ self.W_dec + self.b_dec
+        reconstruction_loss = (reconstructed - x).pow(2).mean(dim=-1)
+        return ESMCSAEOutput(
+            feature_magnitudes=feature_magnitudes,
+            reconstruction_loss=reconstruction_loss,
+        )
+    def get_sae_output(
+        self, layer_states: torch.Tensor, token_mask: torch.Tensor
+    ) -> ESMCSAEOutput:
+        _, _, v_len = layer_states.shape
+        nonpad_states = layer_states[token_mask].view(-1, v_len)
+        return self(nonpad_states)
+    def _zscore_normalize_representation(self, x: torch.Tensor) -> torch.Tensor:
+        x_mean = x.mean(dim=-1, keepdim=True)
+        x = x - x_mean
+        x_std = x.std(dim=-1, keepdim=True)
+        return x / (x_std + 1e-5)
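The encoding path of `_ESMCSAELayer.forward` (z-score, subtract `b_dec`, ReLU projection, keep the top `k`) can be followed on a single vector. The sketch below uses plain Python lists instead of tensors so it runs without torch. The helper names `zscore` and `topk_sae_encode` are illustrative, and the tiny weights are made up for the example.

```python
import math


def zscore(x, eps=1e-5):
    """Center and scale one vector; torch's std() defaults to the
    unbiased (n - 1) estimator, which is mirrored here."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    std = math.sqrt(sum(v * v for v in centered) / (len(x) - 1))
    return [v / (std + eps) for v in centered]


def topk_sae_encode(x, w_enc, b_dec, k):
    """Return feature magnitudes with at most k non-zero entries."""
    x = zscore(x)
    shifted = [xi - bi for xi, bi in zip(x, b_dec)]
    n_features = len(w_enc[0])
    # ReLU(shifted @ W_enc), computed one feature (column) at a time.
    pre = [
        max(0.0, sum(shifted[i] * w_enc[i][j] for i in range(len(shifted))))
        for j in range(n_features)
    ]
    # Keep the k largest pre-activations and zero out the rest.
    keep = sorted(range(n_features), key=lambda j: pre[j], reverse=True)[:k]
    features = [0.0] * n_features
    for j in keep:
        features[j] = pre[j]
    return features


w_enc = [
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
]
print(topk_sae_encode([1.0, 2.0, 3.0], w_enc, b_dec=[0.0, 0.0, 0.0], k=2))
```

The sparsity comes entirely from the top-`k` selection, so every token yields at most `k` active features regardless of how many pre-activations are positive.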
+@auto_docstring
+class ESMCSAEPreTrainedModel(PreTrainedModel):
+    config_class = ESMCSAEConfig
+    base_model_prefix = "esmc_sae"
+@auto_docstring(
+    custom_intro="""
+    HF container holding one SAE per backbone layer, all sharing the same
+    ``(d_model, codebook_dim, k)``.
+    ``from_pretrained`` downloads the entire repo (every ``layer_{i}.safetensors``)
+    into the local HF cache but does **not** load any weights into memory.
+    Callers materialize the layers they actually need by calling
+    :meth:`initialize_layers`. The full set is available on disk after the
+    first call, so subsequent layer switches read from the local cache without
+    re-downloading.
+    Examples::
+        model = ESMCSAEModel.from_pretrained(
+            "biohub/esmc-6b-2024-12-sae-k64-codebook16384"
+        )
+        model.initialize_layers([60])                  # ~2.5 GB into memory
+        out = model(layer_states, layer=60)            # forward through layer 60
+        model.initialize_layers([45])                  # add layer 45 (cached locally)
+        model.release_layer(60)                        # free layer 60
+    """
+)
+class ESMCSAEModel(ESMCSAEPreTrainedModel):
+    def __init__(self, config: ESMCSAEConfig):
+        super().__init__(config)
+        # Layers are populated lazily by ``initialize_layers``; the container
+        # starts empty so ``from_pretrained`` doesn't materialize hundreds of
+        # GB of unused parameters.
+        self.layers = nn.ModuleDict()
+        # Zero-element buffer that rides along with ``.to(device/dtype)``.
+        # ``initialize_layers`` reads its current device/dtype so SAEs added
+        # after ``model.to("cuda")`` land on CUDA without re-passing ``device=``.
+        self.register_buffer("_device_marker", torch.empty(0), persistent=False)
+        self._snapshot_dir: Optional[str] = None
+        self.post_init()
+    @classmethod
+    def from_pretrained(  # type: ignore[override]
+        cls, pretrained_model_name_or_path: str | os.PathLike, *model_args, **kwargs
+    ) -> "ESMCSAEModel":
+        """Download (or reuse cached) the full repo and return the model.
+        By default no weights are read into memory and the caller must invoke
+        :meth:`initialize_layers` before running :meth:`forward`. The single
+        exception is when the repo ships exactly one layer: that layer is
+        auto-loaded (honoring ``torch_dtype`` / ``device`` if passed) so the
+        bare ``forward(x)`` call just works.
+        Honored kwargs: ``revision``, ``cache_dir``, ``token``,
+        ``allow_patterns``, ``local_files_only``, ``force_download`` (forwarded
+        to ``snapshot_download``); ``torch_dtype`` and ``device`` (used by the
+        single-layer auto-load path; otherwise pass them to
+        :meth:`initialize_layers`). Behavioral kwargs that imply work we do
+        not perform (``device_map``, ``low_cpu_mem_usage``,
+        ``quantization_config``, ``attn_implementation``, ``max_memory``,
+        ``offload_folder``, ``offload_state_dict``) raise ``TypeError`` so
+        the user isn't silently misled. Other HF housekeeping kwargs
+        (``config``, ``trust_remote_code``, ``adapter_kwargs``, …) are
+        accepted and ignored; they only matter for the standard loader,
+        which we bypass.
+        """
+        del model_args
+        # Reject unsupported kwargs up front, before any download is attempted.
+        unsupported = {
+            "device_map",
+            "low_cpu_mem_usage",
+            "quantization_config",
+            "attn_implementation",
+            "max_memory",
+            "offload_folder",
+            "offload_state_dict",
+        } & kwargs.keys()
+        if unsupported:
+            raise TypeError(
+                f"Unsupported kwargs to ESMCSAEModel.from_pretrained: "
+                f"{sorted(unsupported)}. The standard HF loader is bypassed;"
+                " call initialize_layers(..., device=, dtype=) instead."
+            )
+        torch_dtype = kwargs.pop("torch_dtype", None)
+        device = kwargs.pop("device", None)
+        local_dir = _resolve_snapshot_dir(pretrained_model_name_or_path, kwargs)
+        config = ESMCSAEConfig.from_pretrained(local_dir)
+        model = cls(config)
+        model._snapshot_dir = str(local_dir)
+        if device is not None:
+            model.to(device)
+        if torch_dtype is not None:
+            model.to(torch_dtype)
+        if len(config.available_layers) == 1:
+            model.initialize_layers(list(config.available_layers))
+        return model
+    def initialize_layers(
+        self,
+        layers: list[int],
+        *,
+        device: torch.device | str | None = None,
+        dtype: torch.dtype | None = None,
+    ) -> None:
+        """Load the requested layers from the local snapshot into memory.
+        Layers already present in :attr:`self.layers` are skipped — calling
+        ``initialize_layers([23])`` twice is idempotent. ``device`` / ``dtype``
+        default to wherever the model itself lives (via the ``_device_marker``
+        buffer that moves with ``.to(...)``), so the common pattern of
+        ``model.to("cuda"); model.initialize_layers([7])`` Just Works.
+        """
+        # Explicit check instead of ``assert`` so it survives ``python -O``.
+        if self._snapshot_dir is None:
+            raise RuntimeError(
+                "ESMCSAEModel has no snapshot directory; call "
+                "from_pretrained first, or set _snapshot_dir manually."
+            )
+        if device is None:
+            device = self._device_marker.device
+        if dtype is None:
+            dtype = self._device_marker.dtype
+        snapshot_dir = Path(self._snapshot_dir)
+        available = set(self.config.available_layers)
+        for layer_idx in layers:
+            key = str(layer_idx)
+            if key in self.layers:
+                continue
+            if layer_idx not in available:
+                raise KeyError(
+                    f"Layer {layer_idx} is not in this repo. "
+                    f"available_layers={sorted(available)}"
+                )
+            shard = snapshot_dir / f"layer_{layer_idx}.safetensors"
+            if not shard.exists():
+                raise FileNotFoundError(
+                    f"Missing layer file {shard} — config lists layer "
+                    f"{layer_idx} as available but the shard is not on disk."
+                )
+            params = ESMCSAEParams(
+                d_model=self.config.d_model,
+                codebook_dim=self.config.codebook_dim,
+                k=self.config.k,
+                layer=layer_idx,
+            )
+            # Build on the meta device so we don't allocate weights that
+            # ``load_state_dict`` would immediately overwrite.
+            with torch.device("meta"):
+                layer = _ESMCSAELayer(params)
+            layer.to_empty(device=device)
+            layer.load_state_dict(load_file(str(shard)))
+            layer.to(dtype=dtype)
+            self.layers[key] = layer
+    def release_layer(self, layer: int) -> None:
+        """Drop the named layer from memory. No-op if not loaded."""
+        key = str(layer)
+        if key in self.layers:
+            del self.layers[key]
+    def loaded_layers(self) -> list[int]:
+        """Sorted list of layer indices currently materialized in memory."""
+        return sorted(int(k) for k in self.layers.keys())
+    def forward(
+        self, x: torch.Tensor, layer: int | None = None, **kwargs: object
+    ) -> ESMCSAEOutput:
+        if layer is None:
+            if len(self.layers) == 1:
+                # Unambiguous: exactly one layer loaded → use it.
+                ((_only_key, only_layer),) = self.layers.items()
+                return only_layer(x, **kwargs)
+            if len(self.layers) == 0:
+                raise RuntimeError(
+                    "No layers loaded — call "
+                    f"initialize_layers([...]) first. "
+                    f"available_layers={self.config.available_layers}"
+                )
+            raise RuntimeError(
+                "Multiple layers are loaded — please select one via "
+                f"forward(x, layer=<idx>). Loaded layers: {self.loaded_layers()}"
+            )
+        key = str(layer)
+        if key not in self.layers:
+            raise KeyError(
+                f"Layer {layer} is not loaded. Call "
+                f"initialize_layers([{layer}]) first. Loaded layers: "
+                f"{self.loaded_layers()}"
+            )
+        return self.layers[key](x, **kwargs)
+    def save_pretrained(  # type: ignore[override]
+        self, save_directory: str | os.PathLike, *args, **kwargs
+    ) -> None:
+        """Write ``config.json`` plus one ``layer_{i}.safetensors`` per loaded layer.
+        Only layers currently in :attr:`self.layers` are written.
+        ``available_layers`` in the saved config is synced to what's actually
+        on disk so a ``release_layer`` + ``save_pretrained`` round-trip never
+        advertises a layer whose shard is missing.
+        """
+        del args, kwargs
+        save_directory = Path(save_directory)
+        save_directory.mkdir(parents=True, exist_ok=True)
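+        # Sync available_layers to what we're about to write (never advertise
+        # a layer that isn't on disk in this repo), then restore the live
+        # config so layers still in the snapshot remain loadable afterwards.
+        original_available = self.config.available_layers
+        self.config.available_layers = self.loaded_layers()
+        try:
+            self.config.save_pretrained(str(save_directory))
+        finally:
+            self.config.available_layers = original_available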
+        for key, layer in self.layers.items():
+            shard = save_directory / f"layer_{key}.safetensors"
+            save_file(
+                {
+                    k: v.detach().cpu().contiguous()
+                    for k, v in layer.state_dict().items()
+                },
+                str(shard),
+            )
+def _resolve_snapshot_dir(
+    pretrained_model_name_or_path: str | os.PathLike, kwargs: dict
+) -> str:
+    """Local dir → return as-is; hub id → ``snapshot_download`` it.
+    A directory only counts as "local" if it actually contains ``config.json``,
+    so a stale subdir named like a hub id (``./biohub/esmc-...``)
+    doesn't accidentally shadow the hub fetch.
+    Pops the standard ``snapshot_download`` keyword args from ``kwargs`` so
+    callers can forward them via ``from_pretrained``.
+    """
+    path = Path(pretrained_model_name_or_path)
+    if path.is_dir() and (path / "config.json").exists():
+        return str(path)
+    from huggingface_hub import snapshot_download
+    return snapshot_download(
+        repo_id=str(pretrained_model_name_or_path),
+        revision=kwargs.pop("revision", None),
+        cache_dir=kwargs.pop("cache_dir", None),
+        token=kwargs.pop("token", None),
+        allow_patterns=kwargs.pop("allow_patterns", None),
+        local_files_only=kwargs.pop("local_files_only", False),
+        force_download=kwargs.pop("force_download", False),
+    )
+__all__ = ["ESMCSAEModel", "ESMCSAEOutput", "ESMCSAEPreTrainedModel"]
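
The SAE layer's forward pass shown at the top of this file (subtract `b_dec`, ReLU-encode with `W_enc`, keep the top `k` activations, decode with `W_dec`, score with a mean squared error) can be traced by hand on a toy input. The sketch below is an illustration only, assuming plain Python lists in place of tensors. The helper names (`matvec`, `topk_sae_forward`) and the toy weights are hypothetical and not part of this repo, and tie-breaking among equal activations may differ from `torch.topk`.

```python
# Illustrative pure-Python sketch of one top-k sparse autoencoder step.
# Helper names and toy weights are hypothetical, chosen for readability.

def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m) -> list of length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def topk_sae_forward(x, w_enc, w_dec, b_dec, k):
    # Subtract the decoder bias before encoding, then apply ReLU.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [max(0.0, p) for p in matvec(centered, w_enc)]
    # Keep only the k largest activations; every other feature is zeroed.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    features = [p if i in keep else 0.0 for i, p in enumerate(pre)]
    # Decode the sparse features and add the bias back.
    recon = [r + bi for r, bi in zip(matvec(features, w_dec), b_dec)]
    # Mean squared error over the model dimension.
    loss = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
    return features, recon, loss


# Toy sizes: d_model=2, codebook_dim=3, k=1.
w_enc = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0]]
w_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [0.5, 0.5]]
b_dec = [0.0, 0.0]

features, recon, loss = topk_sae_forward([3.0, 1.0], w_enc, w_dec, b_dec, k=1)
print(features, recon, loss)  # [0.0, 0.0, 4.0] [2.0, 2.0] 1.0
```

With `k=1` only the third feature survives, so the reconstruction is that feature's decoder row scaled by its magnitude. Raising `k` to 3 keeps all three activations and changes both the reconstruction and the loss.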

modeling_esmfold2.py ADDED

	@@ -0,0 +1,1288 @@

+"""PyTorch ESMFold2 model — the standard released architecture.
+Quickstart::
+    from transformers import ESMFold2Model
+    model = ESMFold2Model.from_pretrained("biohub/ESMFold2").cuda().eval()
+    with open("ubq.pdb", "w") as f:
+        f.write(model.infer_protein_as_pdb("MQIFVKTLTGKT..."))
+For multi-chain / ligand / MSA inputs see ``ESMFold2InputBuilder`` in the
+companion ``esm`` package.
+"""
+import importlib
+import math
+import sys
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Any, cast
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch import Tensor
+try:
+    te = importlib.import_module("transformer_engine.pytorch")
+    te_recipe = importlib.import_module("transformer_engine.common.recipe")
+    DelayedScaling = te_recipe.DelayedScaling
+    Format = te_recipe.Format
+    TE_AVAILABLE = True
+except ImportError:
+    te = None  # type: ignore[assignment]
+    DelayedScaling = None  # type: ignore[assignment]
+    Format = None  # type: ignore[assignment]
+    TE_AVAILABLE = False
+from transformers.modeling_utils import PreTrainedModel
+from .configuration_esmc import ESMCConfig as _FastPLMSESMCConfig
+from .configuration_esmc_sae import ESMCSAEConfig as _FastPLMSESMCSAEConfig
+from .configuration_esmfold2 import ESMFold2Config
+from .modeling_esmc import ESMCModel as _FastPLMSESMCModel
+from .modeling_esmc_sae import _ESMCSAELayer as _FastPLMSESMCSAELayer
+from .modeling_esmfold2_common import (
+    CHAR_VOCAB_SIZE,
+    MAX_ATOMIC_NUMBER,
+    NUM_RES_TYPES,
+    DiffusionStructureHead,
+    FoldingTrunk,
+    InputsEmbedder,
+    LanguageModelShim,
+    MSAPairWeightedAveraging,
+    OuterProductMean,
+    ResIdxAsymIdSymIdEntityIdEncoding,
+    RowAttentionPooling,
+    SwiGLUMLP,
+    TriangleMultiplicativeUpdate,
+    _categorical_mean,
+    _compute_intra_token_idx,
+    compute_lm_hidden_states,
+    gather_rep_atom_coords,
+    gather_token_to_atom,
+)
+from .esmfold2_affine3d import Affine3D as _FastPLMSESMFold2Affine3D
+from .esmfold2_aligner import Aligner as _FastPLMSESMFold2Aligner
+from .esmfold2_atom_indexer import AtomIndexer as _FastPLMSESMFold2AtomIndexer
+from .esmfold2_conformers import load_ccd as _fastplms_esmfold2_load_ccd
+from .esmfold2_constants import ELEMENT_NUMBER_TO_SYMBOL as _FASTPLMS_ESMFOLD2_ELEMENT_NUMBER_TO_SYMBOL
+from .esmfold2_constants_esm3 import CHAIN_BREAK_STR as _FASTPLMS_ESMFOLD2_CHAIN_BREAK_STR
+from .esmfold2_input_builder import StructurePredictionInput as _FastPLMSESMFold2StructurePredictionInput
+from .esmfold2_metrics import compute_rmsd as _fastplms_esmfold2_compute_rmsd
+from .esmfold2_misc import slice_any_object as _fastplms_esmfold2_slice_any_object
+from .esmfold2_mmcif_parsing import MmcifWrapper as _FastPLMSESMFold2MmcifWrapper
+from .esmfold2_molecular_complex import MolecularComplex as _FastPLMSESMFold2MolecularComplex
+from .esmfold2_msa import MSA as _FastPLMSESMFold2MSA
+from .esmfold2_msa_filter_sequences import greedy_select_indices as _fastplms_esmfold2_greedy_select_indices
+from .esmfold2_normalize_coordinates import normalize_coordinates as _fastplms_esmfold2_normalize_coordinates
+from .esmfold2_output import build_molecular_complex_from_features as _fastplms_esmfold2_build_molecular_complex_from_features
+from .esmfold2_paired_msa import construct_paired_msa as _fastplms_esmfold2_construct_paired_msa
+from .esmfold2_parsing import FastaEntry as _FastPLMSESMFold2FastaEntry
+from .esmfold2_predicted_aligned_error import compute_tm as _fastplms_esmfold2_compute_tm
+from .esmfold2_prepare_input import prepare_esmfold2_input as _fastplms_esmfold2_prepare_esmfold2_input
+from .esmfold2_processor import ESMFold2InputBuilder as _FastPLMSESMFold2InputBuilder
+from .esmfold2_protein_chain import ProteinChain as _FastPLMSESMFold2ProteinChain
+from .esmfold2_protein_complex import ProteinComplex as _FastPLMSESMFold2ProteinComplex
+from .esmfold2_protein_structure import index_by_atom_name as _fastplms_esmfold2_index_by_atom_name
+from .esmfold2_residue_constants import restypes as _FASTPLMS_ESMFOLD2_RESTYPES
+from .esmfold2_sequential_dataclass import SequentialDataclass as _FastPLMSESMFold2SequentialDataclass
+from .esmfold2_system import run_subprocess_with_errorcheck as _fastplms_esmfold2_run_subprocess_with_errorcheck
+from .esmfold2_types import ProteinInput as _FastPLMSESMFold2ProteinInput
+from .esmfold2_utils_types import PathOrBuffer as _FastPLMSESMFold2PathOrBuffer
+_EPS = 1e-6
+_NONPOLYMER_ID = 4
+# Default for the triangle / OPM / pair-transition L² ops. Caps peak memory
+# so L≈2k folds on an 80 GB GPU (~76 GB peak at chunk=128 for L=1438;
+# chunk=64 leaves headroom for the largest foldbench targets). Override via
+# ``model.set_chunk_size(...)``; pass None to disable chunking (faster for
+# short L but OOM-prone past ~600).
+_DEFAULT_CHUNK_SIZE = 64
+def _ensure_vendored_esm_alias() -> None:
+    package = __package__
+    assert package is not None
+    vendored_esm = importlib.import_module(f"{package}.esm")
+    sys.modules["esm"] = vendored_esm
+class PairTransition(nn.Module):
+    """LayerNorm + SwiGLU feed-forward on the pair representation.
+    ``forward`` returns the update only; any residual add is left to the caller.
+    """
+    def __init__(self, d_model: int, expansion_ratio: int = 4) -> None:
+        super().__init__()
+        self.norm = nn.LayerNorm(d_model)
+        self.ffn = SwiGLUMLP(d_model, expansion_ratio=expansion_ratio, bias=False)
+        self._chunk_size: int | None = _DEFAULT_CHUNK_SIZE
+    def set_chunk_size(self, chunk_size: int | None) -> None:
+        self._chunk_size = chunk_size
+    def forward(self, x: Tensor) -> Tensor:
+        if self._chunk_size is None or x.shape[1] <= self._chunk_size:
+            return self.ffn(self.norm(x))
+        out: list[Tensor] = []
+        for s in range(0, x.shape[1], self._chunk_size):
+            e = min(s + self._chunk_size, x.shape[1])
+            sl = x[:, s:e]
+            out.append(self.ffn(self.norm(sl)))
+        return torch.cat(out, dim=1)
+class ConfidenceHead(nn.Module):
+    """Predicts pLDDT, PAE, PDE and resolved-atom logits, and derives pTM / ipTM.
+    Distogram bins of the predicted coordinates are an input feature, not an output.
+    """
+    boundaries: Tensor
+    def __init__(self, config: "ESMFold2Config") -> None:
+        super().__init__()
+        ch = config.confidence_head
+        d_single = config.d_single
+        d_pair = config.d_pair
+        d_inputs = config.inputs.d_inputs
+        boundaries = torch.linspace(ch.min_dist, ch.max_dist, ch.distogram_bins - 1)
+        self.register_buffer("boundaries", boundaries)
+        self.dist_bin_pairwise_embed = nn.Embedding(ch.distogram_bins, d_pair)
+        self.s_norm = nn.LayerNorm(d_single)
+        self.s_inputs_to_single = nn.Linear(d_inputs, d_single, bias=False)
+        self.s_to_z = nn.Linear(d_inputs, d_pair, bias=False)
+        self.s_to_z_transpose = nn.Linear(d_inputs, d_pair, bias=False)
+        self.s_to_z_prod_in1 = nn.Linear(d_inputs, d_pair, bias=False)
+        self.s_to_z_prod_in2 = nn.Linear(d_inputs, d_pair, bias=False)
+        self.s_to_z_prod_out = nn.Linear(d_pair, d_pair, bias=False)
+        self.s_input_to_s = nn.Linear(d_inputs, d_single, bias=False)
+        self.s_inputs_norm = nn.LayerNorm(d_inputs)
+        self.z_norm = nn.LayerNorm(d_pair)
+        self.row_attention_pooling = RowAttentionPooling(
+            d_pair=d_pair, d_single=d_single
+        )
+        pf = ch.folding_trunk
+        self.folding_trunk = FoldingTrunk(
+            n_layers=pf.n_layers, d_pair=d_pair, expansion_ratio=4
+        )
+        # Heads.
+        self.plddt_ln = nn.LayerNorm(d_single)
+        max_atoms_per_token = 23
+        self.plddt_weight = nn.Parameter(
+            torch.zeros(max_atoms_per_token, d_single, ch.num_plddt_bins)
+        )
+        self.pae_ln = nn.LayerNorm(d_pair)
+        self.pae_head = nn.Linear(d_pair, ch.num_pae_bins, bias=False)
+        self.pde_ln = nn.LayerNorm(d_pair)
+        self.pde_head = nn.Linear(d_pair, ch.num_pde_bins, bias=False)
+        self.resolved_ln = nn.LayerNorm(d_single)
+        # 2 = resolved logits ([unresolved, resolved]).
+        self.resolved_weight = nn.Parameter(
+            torch.zeros(max_atoms_per_token, d_single, 2)
+        )
+    def set_kernel_backend(self, backend: str | None) -> None:
+        self.folding_trunk.set_kernel_backend(backend)
+    def set_chunk_size(self, chunk_size: int | None) -> None:
+        self.folding_trunk.set_chunk_size(chunk_size)
+    @staticmethod
+    def _repeat_batch(x: Tensor, num_diffusion_samples: int) -> Tensor:
+        return (
+            x
+            if num_diffusion_samples == 1
+            else x.repeat_interleave(num_diffusion_samples, 0)
+        )
+    @staticmethod
+    def _flatten_sample_axis(x: Tensor) -> Tensor:
+        if x.ndim == 4:
+            b, mult, n, c = x.shape
+            return x.reshape(b * mult, n, c)
+        return x
+    def forward(
+        self,
+        s_inputs: Tensor,
+        z: Tensor,
+        x_pred: Tensor,
+        distogram_atom_idx: Tensor,
+        token_attention_mask: Tensor,
+        atom_to_token: Tensor,
+        atom_attention_mask: Tensor,
+        asym_id: Tensor,
+        mol_type: Tensor,
+        num_diffusion_samples: int = 1,
+        relative_position_encoding: Tensor | None = None,
+        token_bonds_encoding: Tensor | None = None,
+    ) -> dict[str, Tensor]:
+        s_inputs_normed = self.s_inputs_norm(s_inputs)
+        z_base = self.z_norm(z)
+        if relative_position_encoding is not None:
+            z_base = z_base + relative_position_encoding
+        if token_bonds_encoding is not None:
+            z_base = z_base + token_bonds_encoding
+        z_base = z_base + self.s_to_z(s_inputs_normed).unsqueeze(2)
+        z_base = z_base + self.s_to_z_transpose(s_inputs_normed).unsqueeze(1)
+        z_base = z_base + self.s_to_z_prod_out(
+            self.s_to_z_prod_in1(s_inputs_normed)[:, :, None, :]
+            * self.s_to_z_prod_in2(s_inputs_normed)[:, None, :, :]
+        )
+        pair = self._repeat_batch(z_base, num_diffusion_samples)
+        x_pred_flat = self._flatten_sample_axis(x_pred)
+        atom_to_token_m = self._repeat_batch(atom_to_token, num_diffusion_samples)
+        atom_mask_m = self._repeat_batch(atom_attention_mask, num_diffusion_samples)
+        rep_idx_m = self._repeat_batch(distogram_atom_idx, num_diffusion_samples).long()
+        mask = self._repeat_batch(token_attention_mask, num_diffusion_samples)
+        Bm = pair.shape[0]
+        rep_coords = gather_rep_atom_coords(x_pred_flat, rep_idx_m)
+        rep_distances = torch.cdist(
+            rep_coords, rep_coords, compute_mode="donot_use_mm_for_euclid_dist"
+        )
+        distogram_bins = (
+            (rep_distances.unsqueeze(-1) > self.boundaries).sum(dim=-1).long()
+        )
+        pair = pair + self.dist_bin_pairwise_embed(distogram_bins)
+        pair_mask = mask[:, :, None].float() * mask[:, None, :].float()
+        # FoldingTrunk handles the bf16 cast internally during inference so
+        # each block's fused trimul engages. In-place residual avoids an
+        # extra fp32 pair allocation.
+        with torch.amp.autocast("cuda", enabled=pair.is_cuda, dtype=torch.bfloat16):
+            pair_delta = self.folding_trunk(pair, pair_attention_mask=pair_mask)
+        pair.add_(pair_delta.float())
+        del pair_delta
+        single = self.row_attention_pooling(pair, mask)
+        atom_mask_f = atom_mask_m.float()
+        s_at_atoms = gather_token_to_atom(single, atom_to_token_m)
+        s_at_atoms_ln = self.plddt_ln(s_at_atoms)
+        intra_idx = _compute_intra_token_idx(atom_to_token_m)
+        intra_idx = intra_idx.clamp(max=self.plddt_weight.shape[0] - 1)
+        w_plddt = self.plddt_weight[intra_idx]
+        plddt_logits = torch.einsum("...c,...cb->...b", s_at_atoms_ln, w_plddt)
+        plddt_per_atom = _categorical_mean(plddt_logits, start=0.0, end=1.0)
+        L = single.shape[1]
+        plddt_sum = torch.zeros(Bm, L, device=single.device, dtype=plddt_per_atom.dtype)
+        atom_count = torch.zeros(
+            Bm, L, device=single.device, dtype=plddt_per_atom.dtype
+        )
+        atom_mask_t = atom_mask_f.to(plddt_per_atom.dtype)
+        plddt_sum.scatter_add_(1, atom_to_token_m, plddt_per_atom * atom_mask_t)
+        atom_count.scatter_add_(1, atom_to_token_m, atom_mask_t)
+        plddt = plddt_sum / atom_count.clamp(min=_EPS)
+        complex_plddt = (plddt_per_atom * atom_mask_f).sum(dim=-1) / (
+            atom_mask_f.sum(dim=-1) + _EPS
+        )
+        expanded_type = self._repeat_batch(mol_type, num_diffusion_samples)
+        expanded_asym = self._repeat_batch(asym_id, num_diffusion_samples)
+        is_ligand = (expanded_type == _NONPOLYMER_ID).float()
+        inter_chain = (
+            expanded_asym.unsqueeze(-1) != expanded_asym.unsqueeze(-2)
+        ).float()
+        near_contact = (rep_distances < 8).float()
+        interface_per_token = (
+            near_contact * inter_chain * (1.0 - is_ligand).unsqueeze(-1)
+        ).amax(dim=-1)
+        iplddt_weight = torch.where(
+            is_ligand.bool(),
+            torch.full_like(interface_per_token, 2.0),
+            interface_per_token,
+        )
+        iplddt_weight_atoms = gather_token_to_atom(
+            iplddt_weight.unsqueeze(-1), atom_to_token_m
+        ).squeeze(-1)
+        atom_iplddt_w = atom_mask_f * iplddt_weight_atoms
+        complex_iplddt = (plddt_per_atom * atom_iplddt_w).sum(dim=-1) / (
+            atom_iplddt_w.sum(dim=-1) + _EPS
+        )
+        plddt_ca = plddt_per_atom.gather(1, rep_idx_m)
+        # PAE
+        pae_logits = self.pae_head(self.pae_ln(pair))
+        pae = _categorical_mean(pae_logits, start=0.0, end=32.0).detach()
+        # PDE
+        pde_logits = self.pde_head(self.pde_ln(pair))
+        pde = _categorical_mean(pde_logits, start=0.0, end=32.0).detach()
+        # Resolved (per-atom binary).
+        s_at_atoms_res = self.resolved_ln(s_at_atoms)
+        w_res = self.resolved_weight[intra_idx]
+        resolved_logits = torch.einsum("...c,...cb->...b", s_at_atoms_res, w_res)
+        # pTM / ipTM from pae_logits.
+        n_bins = pae_logits.shape[-1]
+        bin_width = 32.0 / n_bins
+        bin_centers = torch.arange(
+            0.5 * bin_width, 32.0, bin_width, device=pae_logits.device
+        )
+        mask_f = mask.float()
+        N_res = mask_f.sum(dim=-1, keepdim=True)
+        d0 = 1.24 * (N_res.clamp(min=19) - 15) ** (1 / 3) - 1.8
+        tm_per_bin = 1 / (1 + (bin_centers / d0) ** 2)
+        pae_probs = F.softmax(pae_logits, dim=-1)
+        tm_expected = (pae_probs * tm_per_bin[:, None, None, :]).sum(dim=-1)
+        pair_mask_2d = mask_f.unsqueeze(-1) * mask_f.unsqueeze(-2)
+        ptm_per_row = (tm_expected * pair_mask_2d).sum(dim=-1) / (
+            pair_mask_2d.sum(dim=-1) + _EPS
+        )
+        ptm = ptm_per_row.max(dim=-1).values
+        # Reuse the inter-chain indicator computed above for the ipLDDT weights.
+        inter_chain_mask = inter_chain * pair_mask_2d
+        iptm_per_row = (tm_expected * inter_chain_mask).sum(dim=-1) / (
+            inter_chain_mask.sum(dim=-1) + _EPS
+        )
+        iptm = iptm_per_row.max(dim=-1).values
+        max_chain_id = int(expanded_asym.max().item()) if Bm > 0 else 0
+        n_chains = max_chain_id + 1
+        pair_chains_iptm = torch.zeros(
+            Bm, n_chains, n_chains, device=tm_expected.device, dtype=tm_expected.dtype
+        )
+        for c1 in range(n_chains):
+            chain_c1 = (expanded_asym == c1).float() * mask_f
+            if chain_c1.sum() == 0:
+                continue
+            for c2 in range(n_chains):
+                chain_c2 = (expanded_asym == c2).float() * mask_f
+                pair_m = chain_c1.unsqueeze(-1) * chain_c2.unsqueeze(-2)
+                denom = pair_m.sum(dim=(-1, -2)) + _EPS
+                pair_chains_iptm[:, c1, c2] = (tm_expected * pair_m).sum(
+                    dim=(-1, -2)
+                ) / denom
+        return {
+            "plddt_logits": plddt_logits,
+            "plddt": plddt.detach(),
+            "plddt_per_atom": plddt_per_atom.detach(),
+            "plddt_ca": plddt_ca.detach(),
+            "complex_plddt": complex_plddt.detach(),
+            "complex_iplddt": complex_iplddt.detach(),
+            "pae_logits": pae_logits,
+            "pae": pae,
+            "pde_logits": pde_logits,
+            "pde": pde,
+            "resolved_logits": resolved_logits,
+            "ptm": ptm.detach(),
+            "iptm": iptm.detach(),
+            "pair_chains_iptm": pair_chains_iptm.detach(),
+        }
+def _inverse_softplus(value: float) -> float:
+    return value + math.log(-math.expm1(-value))
+def _convert_te_modules_to_fp8_inplace(module: nn.Module) -> None:
+    """Re-init each TE module via quantized_model_init so weights live as fp8.
+    Must be called inside torch.no_grad(); covers nn.Linear, te.Linear,
+    te.LayerNormLinear, te.LayerNormMLP — the last two hold 99% of ESMC weight.
+    """
+    if not TE_AVAILABLE:
+        raise RuntimeError("transformer_engine is not available; cannot use fp8.")
+    quantized_model_init = importlib.import_module(
+        "transformer_engine.pytorch"
+    ).quantized_model_init
+    def _walk(mod: nn.Module) -> None:
+        for name, child in list(mod.named_children()):
+            replaced = False
+            if isinstance(child, nn.Linear):
+                in_f, out_f = child.in_features, child.out_features
+                has_bias = child.bias is not None
+                device = child.weight.device
+                dtype = child.weight.dtype
+                w = child.weight.data
+                b = child.bias.data if has_bias else None
+                setattr(mod, name, nn.Identity())
+                del child
+                torch.cuda.empty_cache()
+                with quantized_model_init(enabled=True):
+                    new_mod = te.Linear(  # type: ignore[union-attr]
+                        in_f, out_f, bias=has_bias, params_dtype=dtype
+                    ).to(device)
+                new_mod.weight.quantize_(w)  # type: ignore[attr-defined,operator]
+                if has_bias:
+                    assert b is not None
+                    new_mod.bias.data.copy_(b)  # type: ignore[union-attr]
+                del w, b
+                replaced = True
+            elif isinstance(child, te.Linear):  # type: ignore[union-attr]
+                # te.Linear with bf16 weight → re-init inside quantized_model_init for fp8.
+                in_f, out_f = child.in_features, child.out_features
+                has_bias = child.bias is not None
+                device = child.weight.device
+                dtype = (
+                    child.weight.dtype
+                    if not hasattr(child.weight, "_data")
+                    else torch.bfloat16
+                )
+                state = {k: v.detach().clone() for k, v in child.state_dict().items()}
+                setattr(mod, name, nn.Identity())
+                del child
+                torch.cuda.empty_cache()
+                with quantized_model_init(enabled=True):
+                    new_mod = te.Linear(  # type: ignore[union-attr]
+                        in_f,
+                        out_f,
+                        bias=has_bias,
+                        params_dtype=dtype,  # type: ignore[arg-type]
+                    ).to(device)  # type: ignore[arg-type]
+                new_mod.load_state_dict(state, strict=False)
+                replaced = True
+            elif (
+                hasattr(te, "LayerNormLinear") and isinstance(child, te.LayerNormLinear)  # type: ignore[union-attr]
+            ):
+                state = {k: v.detach().clone() for k, v in child.state_dict().items()}
+                hidden_size = child.in_features
+                out_features = child.out_features
+                has_bias = child.use_bias
+                device = next(child.parameters()).device
+                setattr(mod, name, nn.Identity())
+                del child
+                torch.cuda.empty_cache()
+                with quantized_model_init(enabled=True):
+                    new_mod = te.LayerNormLinear(  # type: ignore[union-attr]
+                        hidden_size,
+                        out_features,
+                        bias=has_bias,
+                        params_dtype=torch.bfloat16,
+                    ).to(device)
+                new_mod.load_state_dict(state, strict=False)
+                replaced = True
+            elif (
+                hasattr(te, "LayerNormMLP") and isinstance(child, te.LayerNormMLP)  # type: ignore[union-attr]
+            ):
+                state = {k: v.detach().clone() for k, v in child.state_dict().items()}
+                fc1_weight: Tensor = child.fc1_weight  # type: ignore[attr-defined]
+                hidden_size = int(fc1_weight.shape[1])
+                # fc1 packed as (2*ffn_hidden_size, hidden_size) for swiglu.
+                ffn_hidden_size = int(fc1_weight.shape[0]) // 2
+                has_bias = (
+                    getattr(child, "fc1_bias", None) is not None
+                    and child.fc1_bias is not None  # type: ignore[attr-defined]
+                )
+                device = fc1_weight.device
+                setattr(mod, name, nn.Identity())
+                del child
+                torch.cuda.empty_cache()
+                with quantized_model_init(enabled=True):
+                    new_mod = te.LayerNormMLP(  # type: ignore[union-attr]
+                        hidden_size=hidden_size,
+                        ffn_hidden_size=ffn_hidden_size,
+                        bias=has_bias,
+                        activation="swiglu",
+                        params_dtype=torch.bfloat16,
+                    ).to(device)  # type: ignore[arg-type]
+                new_mod.load_state_dict(state, strict=False)
+                replaced = True
+            if replaced:
+                # Freeze via .eval()+.requires_grad_(False); per-param ops would unwrap Float8Tensor.
+                new_mod.eval().requires_grad_(False)
+                setattr(mod, name, new_mod)
+                torch.cuda.empty_cache()
+            else:
+                _walk(child)
+    _walk(module)
+    torch.cuda.empty_cache()
+@contextmanager
+def _lm_precision_context(fp8: bool):
+    """bf16 autocast (+ optional TE fp8 autocast) around the LM forward.
+    te.autocast keeps te.Linear outputs bf16 instead of the fp32 default
+    (~425 MB at L=1024 in the hidden-state cache).
+    """
+    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+        if fp8 and TE_AVAILABLE:
+            fp8_recipe = DelayedScaling(  # type: ignore[misc]
+                fp8_format=Format.HYBRID,  # type: ignore[union-attr]
+                amax_history_len=1,
+                amax_compute_algo="most_recent",
+            )
+            with te.autocast(enabled=True, recipe=fp8_recipe):  # type: ignore[union-attr]
+                yield
+        else:
+            yield
+class ESMFold2Model(PreTrainedModel):
+    """ESMFold2 — all-atom structure prediction with an ESMC PLM backbone.
+    This is the standard released ESMFold2 architecture (uses a linear-
+    recurrent trunk, internally referred to as "parcae").
+    Forward kwargs that callers commonly override:
+    * ``num_loops`` (default ``config.num_loops``): trunk refinement
+      loops.
+    * ``num_diffusion_samples`` (default ``config.num_diffusion_samples``):
+      parallel structure samples; the confidence head re-runs once per
+      sample, so memory scales linearly. Pass ``1`` for cheap inference.
+    * ``num_sampling_steps`` (default ``config.structure_head.inference_num_steps``):
+      diffusion ODE solver steps. Lower for speed, higher for quality.
+    Memory / perf knobs:
+    * ``model.set_chunk_size(int|None)``: caps L² ops (triangle / OPM /
+      pair transition) at this token-axis chunk. Default 64 — fits
+      L≈2k on an 80 GB GPU. Pass ``None`` for faster inference at L<600.
+    * ``model.set_kernel_backend(None | "fused" | "cuequivariance")``:
+      select kernel backend (None = reference path).
+    """
+    config_class = ESMFold2Config
+    _keys_to_ignore_on_load_unexpected = [r"\._extra_state$"]
+    def __init__(self, config: ESMFold2Config) -> None:
+        super().__init__(config)
+        d_inputs = config.inputs.d_inputs
+        d_pair = config.d_pair
+        self.inputs_embedder = InputsEmbedder(config)
+        self.z_init_1 = nn.Linear(d_inputs, d_pair, bias=False)
+        self.z_init_2 = nn.Linear(d_inputs, d_pair, bias=False)
+        self.rel_pos = ResIdxAsymIdSymIdEntityIdEncoding(
+            n_relative_residx_bins=config.n_relative_residx_bins,
+            n_relative_chain_bins=config.n_relative_chain_bins,
+            d_pair=d_pair,
+        )
+        self.token_bonds = nn.Linear(1, d_pair, bias=False)
+        self.language_model = LanguageModelShim(
+            d_z=d_pair, d_model=config.lm_d_model, num_layers=config.lm_num_layers
+        )
+        self._esmc: nn.Module | None = None
+        self._esmc_fp8: bool = False  # set by load_esmc(fp8=True)
+        self._esmfold2_input_builder: Any | None = None
+        pf = config.folding_trunk
+        self.folding_trunk = FoldingTrunk(
+            n_layers=pf.n_layers, d_pair=d_pair, expansion_ratio=4
+        )
+        if config.lm_encoder.enabled:
+            self.lm_encoder: FoldingTrunk | None = FoldingTrunk(
+                n_layers=config.lm_encoder.n_layers, d_pair=d_pair, expansion_ratio=4
+            )
+        else:
+            self.lm_encoder = None
+        self.parcae_input_norm = nn.LayerNorm(d_pair)
+        self.parcae_log_a = nn.Parameter(torch.zeros(d_pair))
+        parcae_decay_init = math.sqrt(1.0 / 5.0)
+        parcae_delta_init = -math.log(parcae_decay_init)
+        self.parcae_log_delta = nn.Parameter(
+            torch.full(
+                (d_pair,), _inverse_softplus(parcae_delta_init), dtype=torch.float32
+            )
+        )
+        self.parcae_b_cont = nn.Parameter(torch.eye(d_pair))
+        self.parcae_readout = nn.Linear(d_pair, d_pair, bias=False)
+        nn.init.eye_(self.parcae_readout.weight)
+        self.parcae_coda = FoldingTrunk(
+            n_layers=config.parcae.coda_n_layers, d_pair=d_pair, expansion_ratio=4
+        )
+        # Heads --------------------------------------------------------------
+        self.structure_head = DiffusionStructureHead(config)
+        self.distogram_head = nn.Linear(
+            d_pair, config.structure_head.distogram_bins, bias=True
+        )
+        self.confidence_head = ConfidenceHead(config)
+        msa_cfg = config.msa_encoder
+        self.msa_encoder = None
+        if msa_cfg.enabled:
+            self.msa_encoder = MSAEncoder(
+                d_msa=msa_cfg.d_msa,
+                d_pair=d_pair,
+                d_inputs=d_inputs,
+                d_hidden=msa_cfg.d_hidden,
+                n_layers=msa_cfg.n_layers,
+                n_heads_msa=msa_cfg.n_heads_msa,
+                msa_head_width=msa_cfg.msa_head_width,
+            )
+        self.post_init()
+    def load_esmc(self, esmc_model_path: str, precision: str = "bf16") -> None:
+        """Load the ESMC LM.
+        ``precision``: ``"bf16"`` (default), ``"fp32"``, or ``"fp8"``.
+        ``"fp8"`` requires H100 + TransformerEngine ≥ 2.x and quantizes
+        every TE module's weights to fp8 storage.
+        """
+        from .modeling_esmc import ESMCModel
+        dtype_map = {
+            "bf16": torch.bfloat16,
+            "fp32": torch.float32,
+            "fp8": torch.bfloat16,  # underlying weights stay bf16, TE re-quantizes to fp8
+        }
+        if precision not in dtype_map:
+            raise ValueError(
+                f"precision must be one of {list(dtype_map)}, got {precision!r}"
+            )
+        dtype = dtype_map[precision]
+        esmc = (
+            ESMCModel.from_pretrained(esmc_model_path)
+            .to(device=self.device, dtype=dtype)
+            .eval()
+        )
+        for p in esmc.parameters():
+            p.requires_grad_(False)
+        if precision == "fp8":
+            if not TE_AVAILABLE:
+                raise RuntimeError(
+                    "transformer_engine is not available; cannot use fp8."
+                )
+            with torch.no_grad():
+                _convert_te_modules_to_fp8_inplace(esmc)
+            self._esmc_fp8 = True
+        else:
+            self._esmc_fp8 = False
+        self._esmc = esmc
+    @classmethod
+    def from_pretrained(
+        cls, pretrained_model_name_or_path, *args, load_esmc: bool = True, **kwargs
+    ):
+        if cls is ESMFold2Model and "config" not in kwargs:
+            config = ESMFold2Config.from_pretrained(
+                pretrained_model_name_or_path, **kwargs
+            )
+            if config.type == "experimental":
+                experimental_module = importlib.import_module(
+                    f"{__package__}.modeling_esmfold2_experimental"
+                )
+                return experimental_module.ESMFold2ExperimentalModel.from_pretrained(
+                    pretrained_model_name_or_path,
+                    *args,
+                    config=config,
+                    load_esmc=load_esmc,
+                    **kwargs,
+                )
+            kwargs["config"] = config
+        # Pop the precision knob before forwarding to the HF loader.
+        esmc_precision = kwargs.pop("esmc_precision", "bf16")
+        model = super().from_pretrained(pretrained_model_name_or_path, *args, **kwargs)
+        if load_esmc:
+            model.load_esmc(model.config.esmc_id, precision=esmc_precision)
+        return model
+    def set_kernel_backend(self, backend: str | None) -> None:
+        """Select kernel backend.
+        Args:
+            backend: ``None`` (reference path), ``"fused"`` (vendored Triton
+                kernels), or ``"cuequivariance"`` (cuequivariance kernels
+                where applicable; vanilla python fallback otherwise).
+        """
+        self.folding_trunk.set_kernel_backend(backend)
+        if self.lm_encoder is not None:
+            self.lm_encoder.set_kernel_backend(backend)
+        self.parcae_coda.set_kernel_backend(backend)
+        self.confidence_head.set_kernel_backend(backend)
+        self.structure_head.set_kernel_backend(backend)
+    def apply_torch_compile(
+        self, mode: str = "fixed_seqlen", dynamic: bool | None = None
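+    ) -> None:
+        """Compile the L²-heavy blocks with ``torch.compile``.
+        ``mode='fixed_seqlen'`` recompiles for each new sequence length L;
+        ``mode='dynamic_seqlen'`` compiles once with dynamic shapes. An explicit
+        ``dynamic`` argument overrides the choice implied by ``mode``.
+        Does NOT stack with our Triton kernels: call ``set_kernel_backend(None)``
+        before compiling.
+        """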
+        import torch._dynamo
+        torch._dynamo.config.cache_size_limit = 512  # type: ignore[attr-defined]
+        torch._dynamo.config.accumulated_cache_size_limit = 512  # type: ignore[attr-defined]
+        # capture_scalar_outputs avoids graph breaks at .item() in atom-attention path.
+        torch._dynamo.config.capture_scalar_outputs = True  # type: ignore[attr-defined]
+        if dynamic is None:
+            dynamic = mode == "dynamic_seqlen"
+        kwargs: dict = {"dynamic": dynamic}
+        from .modeling_esmfold2_common import (
+            DiffusionModule,
+            DiffusionTransformer,
+            PairUpdateBlock,
+        )
+        compile_targets = (
+            PairUpdateBlock,
+            DiffusionTransformer,
+            DiffusionModule,
+            MSAEncoderBlock,
+        )
+        def _maybe_compile(module: nn.Module) -> None:
+            if isinstance(module, compile_targets):
+                module.forward = torch.compile(module.forward, **kwargs)  # type: ignore[assignment]
+        self.apply(_maybe_compile)
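+    def set_chunk_size(self, chunk_size: int | None) -> None:
+        """Set the token-axis chunk size for L² ops; ``None`` disables chunking.
+        Applies to the trunk, LM encoder, coda, confidence head and MSA encoder;
+        the structure head is not chunked here.
+        """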
+        self.folding_trunk.set_chunk_size(chunk_size)
+        if self.lm_encoder is not None:
+            self.lm_encoder.set_chunk_size(chunk_size)
+        self.parcae_coda.set_chunk_size(chunk_size)
+        self.confidence_head.set_chunk_size(chunk_size)
+        if self.msa_encoder is not None:
+            self.msa_encoder.set_chunk_size(chunk_size)
+    def _compute_lm_hidden_states(
+        self,
+        input_ids: Tensor,
+        asym_id: Tensor,
+        residue_index: Tensor,
+        mol_type: Tensor,
+        tok_mask: Tensor,
+    ) -> Tensor:
+        assert self._esmc is not None
+        # fp8 TE kernels require prod(shape[:-1]) % 8 == 0.
+        pad_to = 8 if self._esmc_fp8 else None
+        with _lm_precision_context(self._esmc_fp8):
+            return compute_lm_hidden_states(
+                self._esmc,
+                input_ids,
+                asym_id,
+                residue_index,
+                mol_type,
+                tok_mask,
+                pad_to_multiple=pad_to,
+            )
+    def _discretized_dynamics(self) -> tuple[Tensor, Tensor]:
+        delta = F.softplus(self.parcae_log_delta)
+        a = torch.exp(-delta * torch.exp(self.parcae_log_a))
+        b = delta[:, None] * self.parcae_b_cont
+        return a, b
+    def _init_pair_state(self, ref: Tensor) -> Tensor:
+        std = math.sqrt(2.0 / (5.0 * ref.shape[-1]))
+        state = torch.empty_like(ref, dtype=torch.float32)
+        nn.init.trunc_normal_(state, mean=0.0, std=std, a=-3 * std, b=3 * std)
+        return state.to(dtype=ref.dtype)
+    def _run_one_loop(
+        self,
+        z: Tensor,
+        z_init: Tensor,
+        lm_z: Tensor | None,
+        _msa_kwargs: dict | None,
+        pair_mask: Tensor,
+        a: Tensor,
+        b_mat: Tensor,
+        total_steps: int,
+    ) -> Tensor:
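+        # Despite the name, this runs all ``total_steps`` trunk iterations.
+        # Kept as a helper method (not inlined in forward) so per-iteration
+        # locals are freed on return; otherwise ~2 GB of L²×c_z tensors leak
+        # into the distogram/sample scope.
+        # training=True forces dropout under eval(), matching the per-loop
+        # dropout strategy used at train time.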
+        lm_cfg = self.config.lm_encoder
+        _per_loop_lm_dropout = (
+            lm_z is not None
+            and getattr(lm_cfg, "per_loop_lm_dropout", False)
+            and getattr(lm_cfg, "lm_dropout", 0.0) > 0.0
+        )
+        _lm_dropout_p = getattr(lm_cfg, "lm_dropout", 0.0)
+        for _ in range(total_steps):
+            if _per_loop_lm_dropout:
+                assert lm_z is not None  # narrowed by _per_loop_lm_dropout
+                lm_z_i: Tensor | None = F.dropout(lm_z, p=_lm_dropout_p, training=True)
+            else:
+                lm_z_i = lm_z
+            refined_lm_z: Tensor | None = None
+            if lm_z_i is not None and self.lm_encoder is not None:
+                refined_lm_z = self.lm_encoder(
+                    lm_z_i.to(z_init.dtype), pair_attention_mask=pair_mask
+                )
+            z_inject_pair = z_init
+            if lm_z_i is not None and self.lm_encoder is None:
+                z_inject_pair = z_inject_pair + lm_z_i.to(z_inject_pair.dtype)
+            if self.msa_encoder is not None and _msa_kwargs is not None:
+                msa_pair = self.msa_encoder(x_pair=z_inject_pair, **_msa_kwargs).to(
+                    z_inject_pair.dtype
+                )
+                z_inject_pair = (
+                    msa_pair
+                    if self.config.msa_encoder_overwrite
+                    else (z_inject_pair + msa_pair)
+                )
+            if refined_lm_z is not None:
+                z_inject_pair = z_inject_pair + refined_lm_z.to(z_inject_pair.dtype)
+            injected_pair = self.parcae_input_norm(z_inject_pair)
+            z = a * z + F.linear(injected_pair.to(z.dtype), b_mat)
+            z = self.folding_trunk(z, pair_attention_mask=pair_mask)
+        return z
+    @torch.inference_mode()
+    def forward(
+        self,
+        token_index: Tensor,
+        residue_index: Tensor,
+        asym_id: Tensor,
+        sym_id: Tensor,
+        entity_id: Tensor,
+        mol_type: Tensor,
+        res_type: Tensor,
+        token_bonds: Tensor,
+        token_attention_mask: Tensor,
+        ref_pos: Tensor,
+        ref_element: Tensor,
+        ref_charge: Tensor,
+        ref_atom_name_chars: Tensor,
+        ref_space_uid: Tensor,
+        atom_attention_mask: Tensor,
+        atom_to_token: Tensor,
+        distogram_atom_idx: Tensor,
+        deletion_mean: Tensor | None = None,
+        msa: Tensor | None = None,
+        has_deletion: Tensor | None = None,
+        deletion_value: Tensor | None = None,
+        msa_attention_mask: Tensor | None = None,
+        input_ids: Tensor | None = None,
+        lm_hidden_states: Tensor | None = None,
+        num_loops: int | None = None,
+        num_diffusion_samples: int | None = None,
+        num_sampling_steps: int | None = None,
+        **kwargs,
+    ) -> dict[str, Tensor]:
+        tok_mask = token_attention_mask
+        atm_mask = atom_attention_mask
+        disto_idx = distogram_atom_idx
+        n_loops: int = num_loops if num_loops is not None else self.config.num_loops
+        n_samples: int = (
+            num_diffusion_samples
+            if num_diffusion_samples is not None
+            else self.config.num_diffusion_samples
+        )
+        # The trunk always runs at least once: ``n_loops`` refinement loops plus one.
+        total_steps = max(1, n_loops + 1)
+        if res_type.dim() == 2:
+            res_type_oh = F.one_hot(res_type.long(), num_classes=NUM_RES_TYPES).float()
+            res_type_oh = res_type_oh * tok_mask.unsqueeze(-1).float()
+        else:
+            res_type_oh = res_type.float()
+        if msa is not None:
+            msa_oh_profile = F.one_hot(msa.long(), num_classes=NUM_RES_TYPES).float()
+            if msa_attention_mask is not None:
+                mask_f = msa_attention_mask.float().unsqueeze(-1)
+                msa_oh_profile = msa_oh_profile * mask_f
+                valid_seq_count = msa_attention_mask.float().sum(dim=1).clamp(min=1)
+                profile = msa_oh_profile.sum(dim=1) / valid_seq_count.unsqueeze(-1)
+            else:
+                profile = msa_oh_profile.mean(dim=1)
+        else:
+            profile = res_type_oh
+        if deletion_mean is None:
+            deletion_mean = torch.zeros(
+                res_type.shape[0], res_type.shape[1], device=res_type.device
+            )
+        ref_element_oh = F.one_hot(
+            ref_element.long(), num_classes=MAX_ATOMIC_NUMBER
+        ).float()
+        ref_atom_name_chars_oh = F.one_hot(
+            ref_atom_name_chars.long(), num_classes=CHAR_VOCAB_SIZE
+        ).float()
+        # Bias-free downstream Linears require zeroed padding.
+        atm_mask_f = atm_mask.float()
+        ref_element_oh = ref_element_oh * atm_mask_f.unsqueeze(-1)
+        ref_atom_name_chars_oh = ref_atom_name_chars_oh * atm_mask_f.unsqueeze(
+            -1
+        ).unsqueeze(-1)
+        atom_to_token = atom_to_token * atm_mask.long()
+        use_amp = ref_pos.device.type == "cuda"
+        with torch.amp.autocast("cuda", enabled=use_amp, dtype=torch.bfloat16):
+            x_inputs = self.inputs_embedder(
+                aatype=res_type_oh,
+                profile=profile.float(),
+                deletion_mean=deletion_mean.float(),
+                ref_pos=ref_pos,
+                atom_attention_mask=atm_mask,
+                ref_space_uid=ref_space_uid,
+                ref_charge=ref_charge,
+                ref_element=ref_element_oh,
+                ref_atom_name_chars=ref_atom_name_chars_oh,
+                atom_to_token=atom_to_token,
+            )
+            z_init = self.z_init_1(x_inputs).unsqueeze(2) + self.z_init_2(
+                x_inputs
+            ).unsqueeze(1)
+            relative_position_encoding = self.rel_pos(
+                residue_index=residue_index,
+                asym_id=asym_id,
+                sym_id=sym_id,
+                entity_id=entity_id,
+                token_index=token_index,
+            )
+            token_bonds_encoding = self.token_bonds(token_bonds.float())
+            z_init = z_init + relative_position_encoding + token_bonds_encoding
+            if (
+                lm_hidden_states is None
+                and input_ids is not None
+                and self._esmc is not None
+            ):
+                lm_hidden_states = self._compute_lm_hidden_states(
+                    input_ids, asym_id, residue_index, mol_type, tok_mask
+                )
+            lm_z: Tensor | None = None
+            if lm_hidden_states is not None:
+                lm_z = self.language_model(lm_hidden_states.detach())
+            del lm_hidden_states
+            pair_mask = tok_mask[:, :, None].float() * tok_mask[:, None, :].float()
+            z = self._init_pair_state(z_init)
+            a, b = self._discretized_dynamics()
+            a = a.view(1, 1, 1, -1).to(device=z.device, dtype=z.dtype)
+            b_mat = b.to(device=z.device, dtype=z.dtype)
+            _msa_kwargs: dict | None = None
+            if self.msa_encoder is not None and msa is not None:
+                B_msa, M, L_msa = msa.shape
+                msa_oh = F.one_hot(
+                    msa.permute(0, 2, 1).long(), num_classes=NUM_RES_TYPES
+                ).float()
+                msa_attn = (
+                    msa_attention_mask.permute(0, 2, 1).float()
+                    if msa_attention_mask is not None
+                    else tok_mask[:, :, None].expand(-1, -1, M).float()
+                )
+                # Bias-free MSAEncoder.embed requires zeroed padding.
+                msa_oh = msa_oh * msa_attn.unsqueeze(-1)
+                hd = (
+                    has_deletion.permute(0, 2, 1).float()
+                    if has_deletion is not None
+                    else torch.zeros(B_msa, L_msa, M, device=msa.device)
+                )
+                dv = (
+                    deletion_value.permute(0, 2, 1).float()
+                    if deletion_value is not None
+                    else torch.zeros(B_msa, L_msa, M, device=msa.device)
+                )
+                _msa_kwargs = dict(
+                    x_inputs=x_inputs,
+                    msa_oh=msa_oh,
+                    has_deletion=hd,
+                    deletion_value=dv,
+                    msa_attention_mask=msa_attn,
+                )
+            # Method call (not inline loop) frees per-iter L²×c_z locals.
+            z = self._run_one_loop(
+                z=z,
+                z_init=z_init,
+                lm_z=lm_z,
+                _msa_kwargs=_msa_kwargs,
+                pair_mask=pair_mask,
+                a=a,
+                b_mat=b_mat,
+                total_steps=total_steps,
+            )
+            del z_init, lm_z, _msa_kwargs, a, b_mat
+            z = self.parcae_readout(z)
+            z = self.parcae_coda(z, pair_attention_mask=pair_mask)
+            z = z.float()
+        distogram_logits = self.distogram_head(z + z.transpose(-2, -3))
+        structure_output = self.structure_head.sample(
+            z_trunk=z,
+            s_inputs=x_inputs,
+            s_trunk=None,
+            relative_position_encoding=relative_position_encoding,
+            ref_pos=ref_pos,
+            ref_charge=ref_charge,
+            ref_mask=atm_mask,
+            ref_element=ref_element_oh,
+            ref_atom_name_chars=ref_atom_name_chars_oh,
+            ref_space_uid=ref_space_uid,
+            tok_idx=atom_to_token,
+            asym_id=asym_id,
+            residue_index=residue_index,
+            entity_id=entity_id,
+            token_index=token_index,
+            sym_id=sym_id,
+            token_attention_mask=tok_mask,
+            num_diffusion_samples=n_samples,
+            num_sampling_steps=num_sampling_steps,
+            return_atom_repr=False,
+            denoising_early_exit_rmsd=None,
+        )
+        sample_coords = structure_output["sample_atom_coords"]
+        assert sample_coords is not None
+        output: dict[str, Tensor] = {"distogram_logits": distogram_logits}
+        output["sample_atom_coords"] = sample_coords
+        confidence_output = self.confidence_head(
+            s_inputs=x_inputs.detach(),
+            z=z.detach().float(),
+            x_pred=sample_coords.detach(),
+            distogram_atom_idx=disto_idx,
+            token_attention_mask=tok_mask,
+            atom_to_token=atom_to_token,
+            atom_attention_mask=atm_mask,
+            asym_id=asym_id,
+            mol_type=mol_type,
+            num_diffusion_samples=n_samples,
+            relative_position_encoding=relative_position_encoding.detach(),
+            token_bonds_encoding=token_bonds_encoding.detach(),
+        )
+        output.update(confidence_output)
+        output["atom_pad_mask"] = (
+            atm_mask.unsqueeze(0) if atm_mask.dim() == 1 else atm_mask
+        )
+        output["residue_index"] = residue_index
+        output["entity_id"] = entity_id
+        return output
+    @torch.no_grad()
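+    def infer_protein(self, seq: str, **forward_kwargs) -> dict:
+        """Featurize one protein sequence and run ``forward``; returns the raw output dict.
+        ``forward_kwargs`` (e.g. ``num_loops``) are passed through to ``forward``.
+        """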
+        from .protein_utils import prepare_protein_features
+        features = prepare_protein_features(seq)
+        features = {k: v.to(self.device) for k, v in features.items()}
+        return self(**features, **forward_kwargs)
+    @property
+    def input_builder(self):
+        if self._esmfold2_input_builder is None:
+            from .esmfold2_processor import ESMFold2InputBuilder
+            self._esmfold2_input_builder = ESMFold2InputBuilder()
+        return self._esmfold2_input_builder
+    @property
+    def input_types(self):
+        from . import esmfold2_types
+        return esmfold2_types
+    def prepare_structure_input(self, input, seed: int | None = None):
+        return self.input_builder.prepare_input(input, seed=seed, device=self.device)
+    def fold(
+        self,
+        input,
+        *,
+        num_loops: int = 3,
+        num_sampling_steps: int = 50,
+        num_diffusion_samples: int = 1,
+        seed: int | None = None,
+        noise_scale: float | None = None,
+        step_scale: float | None = None,
+        max_inference_sigma: int | None = None,
+        early_exit: bool = False,
+        complex_id: str = "pred",
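+    ):
+        """Fold ``input`` (e.g. a ``StructurePredictionInput``) by delegating to ``self.input_builder.fold``."""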
+        return self.input_builder.fold(
+            self,
+            input,
+            num_loops=num_loops,
+            num_sampling_steps=num_sampling_steps,
+            num_diffusion_samples=num_diffusion_samples,
+            seed=seed,
+            noise_scale=noise_scale,
+            step_scale=step_scale,
+            max_inference_sigma=max_inference_sigma,
+            early_exit=early_exit,
+            complex_id=complex_id,
+        )
+    def fold_protein(
+        self,
+        sequence: str,
+        *,
+        chain_id: str = "A",
+        num_loops: int = 3,
+        num_sampling_steps: int = 50,
+        num_diffusion_samples: int = 1,
+        seed: int | None = None,
+        complex_id: str = "pred",
+    ):
+        from .esmfold2_types import ProteinInput, StructurePredictionInput
+        input = StructurePredictionInput(
+            sequences=[ProteinInput(id=chain_id, sequence=sequence)]
+        )
+        return self.fold(
+            input,
+            num_loops=num_loops,
+            num_sampling_steps=num_sampling_steps,
+            num_diffusion_samples=num_diffusion_samples,
+            seed=seed,
+            complex_id=complex_id,
+        )
+    @staticmethod
+    def result_to_cif(result) -> str:
+        assert not isinstance(result, list), "Pass one MolecularComplexResult at a time."
+        return result.complex.to_mmcif()
+    @staticmethod
+    def result_to_pdb(result) -> str:
+        assert not isinstance(result, list), "Pass one MolecularComplexResult at a time."
+        return result.complex.to_protein_complex().to_pdb_string()
+    def save_as_cif(self, result, output_path: str | Path) -> None:
+        Path(output_path).write_text(self.result_to_cif(result))
+    def save_as_pdb(self, result, output_path: str | Path) -> None:
+        Path(output_path).write_text(self.result_to_pdb(result))
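+    def infer_protein_as_cif(self, seq: str, **forward_kwargs) -> str:
+        """Fold ``seq`` and return mmCIF text. Despite the name, ``forward_kwargs``
+        are passed to ``fold_protein`` (e.g. ``num_loops``, ``seed``), not ``forward``.
+        """
+        return self.result_to_cif(self.fold_protein(seq, **forward_kwargs))
+    def infer_protein_as_pdb(self, seq: str, **forward_kwargs) -> str:
+        """Fold ``seq`` and return PDB text; kwargs are passed to ``fold_protein``."""
+        return self.result_to_pdb(self.fold_protein(seq, **forward_kwargs))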
+class MSAEncoderBlock(nn.Module):
+    """One MSA encoder block: OPM into pair, MSA pair-weighted averaging, triangle update."""
+    def __init__(
+        self,
+        d_msa: int,
+        d_pair: int,
+        d_hidden: int,
+        n_heads_msa: int,
+        msa_head_width: int,
+        is_final_block: bool = False,
+    ) -> None:
+        super().__init__()
+        self.is_final_block = is_final_block
+        self.outer_product_mean = OuterProductMean(d_msa, d_hidden, d_pair)
+        if not is_final_block:
+            self.msa_pair_weighted_averaging = MSAPairWeightedAveraging(
+                d_msa, d_pair, n_heads_msa, msa_head_width
+            )
+            self.msa_transition = PairTransition(d_msa, expansion_ratio=4)
+        self.tri_mul_out = TriangleMultiplicativeUpdate(dim=d_pair, _outgoing=True)
+        self.tri_mul_in = TriangleMultiplicativeUpdate(dim=d_pair, _outgoing=False)
+        self.pair_transition = PairTransition(d_pair, expansion_ratio=4)
+    def set_chunk_size(self, chunk_size: int | None) -> None:
+        self.outer_product_mean.set_chunk_size(chunk_size)
+        self.tri_mul_out.set_chunk_size(chunk_size)
+        self.tri_mul_in.set_chunk_size(chunk_size)
+        if not self.is_final_block:
+            self.msa_transition.set_chunk_size(chunk_size)
+        self.pair_transition.set_chunk_size(chunk_size)
+    def forward(
+        self,
+        m: Tensor,
+        pair: Tensor,
+        msa_attention_mask: Tensor,
+        pair_attention_mask: Tensor,
+    ) -> tuple[Tensor, Tensor]:
+        pair = pair + self.outer_product_mean(m, msa_attention_mask)
+        if not self.is_final_block:
+            m = m + self.msa_pair_weighted_averaging(m, pair, pair_attention_mask)
+            m = m + self.msa_transition(m)
+        pair = pair + self.tri_mul_out(pair, mask=pair_attention_mask)
+        pair = pair + self.tri_mul_in(pair, mask=pair_attention_mask)
+        pair = pair + self.pair_transition(pair)
+        return m, pair
+class MSAEncoder(nn.Module):
+    """Stack of [`MSAEncoderBlock`] layers that conditions the pair on an MSA."""
+    def __init__(
+        self,
+        d_msa: int,
+        d_pair: int,
+        d_inputs: int,
+        d_hidden: int = 32,
+        n_layers: int = 4,
+        n_heads_msa: int = 8,
+        msa_head_width: int = 16,
+    ) -> None:
+        super().__init__()
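+        # Input width 35 must match m_feat in forward(): the MSA one-hot classes
+        # plus the has_deletion and deletion_value channels.
+        self.embed = nn.Linear(35, d_msa, bias=False)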
+        self.project_inputs = nn.Linear(d_inputs, d_msa, bias=False)
+        self.blocks = nn.ModuleList(
+            [
+                MSAEncoderBlock(
+                    d_msa=d_msa,
+                    d_pair=d_pair,
+                    d_hidden=d_hidden,
+                    n_heads_msa=n_heads_msa,
+                    msa_head_width=msa_head_width,
+                    is_final_block=(i == n_layers - 1),
+                )
+                for i in range(n_layers)
+            ]
+        )
+    def set_chunk_size(self, chunk_size: int | None) -> None:
+        for block in self.blocks:
+            cast(MSAEncoderBlock, block).set_chunk_size(chunk_size)
+    def forward(
+        self,
+        x_pair: Tensor,
+        x_inputs: Tensor,
+        msa_oh: Tensor,
+        has_deletion: Tensor,
+        deletion_value: Tensor,
+        msa_attention_mask: Tensor,
+    ) -> Tensor:
+        # All inputs are pre-transposed to [B, L, M, ...] before calling.
+        m_feat = torch.cat(
+            [msa_oh, has_deletion.unsqueeze(-1), deletion_value.unsqueeze(-1)], dim=-1
+        )
+        m = self.embed(m_feat) + self.project_inputs(x_inputs).unsqueeze(2)
+        # Token mask is taken from MSA row 0 (assumed here to be the query sequence).
+        tok_mask = msa_attention_mask[:, :, 0].bool()
+        pair_attention_mask = tok_mask.unsqueeze(2) & tok_mask.unsqueeze(1)
+        for block in self.blocks:
+            m, x_pair = block(m, x_pair, msa_attention_mask, pair_attention_mask)
+        return x_pair
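
Note on the MSA profile computed in `ESMFold2Model.forward` above: when an MSA and its attention mask are supplied, the profile is a masked mean of one-hot residue classes over the MSA rows, with the row count clamped to at least 1. The following is a minimal stdlib-only sketch of that arithmetic for a single example, not the model's tensor code; `masked_profile` and its list-based inputs are hypothetical names introduced here for illustration.

```python
def masked_profile(msa, mask, num_classes):
    """Per-column class frequencies over the valid MSA rows.

    msa:  M rows x L columns of integer class ids (illustrative list-of-lists).
    mask: M rows x L columns of truthy/falsy validity flags.
    Returns an L x num_classes list of frequencies.
    """
    n_cols = len(msa[0])
    profile = []
    for j in range(n_cols):
        counts = [0.0] * num_classes
        valid = 0
        for row, row_mask in zip(msa, mask):
            if row_mask[j]:
                counts[row[j]] += 1.0
                valid += 1
        # Mirrors clamp(min=1): a fully padded column stays all-zero
        # instead of dividing by zero.
        denom = max(valid, 1)
        profile.append([c / denom for c in counts])
    return profile


example_msa = [[0, 1, 2], [0, 2, 2], [1, 2, 0]]
example_mask = [[1, 1, 0], [1, 0, 0], [1, 0, 0]]
example_profile = masked_profile(example_msa, example_mask, num_classes=3)
```

In this example the last column is fully masked, so its profile row is all zeros, which is the behaviour the clamp in the original code preserves.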

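Note on the recurrent update in `_discretized_dynamics` and `_run_one_loop` above: per channel, the pair state is updated as `z <- a*z + b*u`, where `a = exp(-softplus(log_delta) * exp(log_a))` and `b = softplus(log_delta) * b_cont`. The sketch below is a scalar, stdlib-only illustration of that formula at the initial parameter values set in `__init__`. It makes two simplifications that should be read as assumptions: `inverse_softplus` is a hypothetical stand-in for the `_inverse_softplus` helper, which is not shown in this diff, and the folding-trunk call that follows each update in the real loop is omitted.

```python
import math


def softplus(x: float) -> float:
    """log(1 + exp(x)), the scalar analogue of F.softplus."""
    return math.log1p(math.exp(x))


def inverse_softplus(y: float) -> float:
    """Hypothetical stand-in for _inverse_softplus: solves softplus(x) = y, y > 0."""
    return math.log(math.expm1(y))


def discretized_dynamics(log_delta: float, log_a: float, b_cont: float):
    """Scalar version of the (a, b) computation in _discretized_dynamics."""
    delta = softplus(log_delta)
    a = math.exp(-delta * math.exp(log_a))
    b = delta * b_cont
    return a, b


# Initial values as in __init__: log_a = 0 and b_cont = identity (1 for a scalar).
decay_init = math.sqrt(1.0 / 5.0)
delta_init = -math.log(decay_init)
a, b = discretized_dynamics(inverse_softplus(delta_init), log_a=0.0, b_cont=1.0)


def run_steps(z: float, u: float, total_steps: int) -> float:
    """Apply z <- a*z + b*u repeatedly (the trunk call is omitted in this sketch)."""
    for _ in range(total_steps):
        z = a * z + b * u
    return z
```

At these initial values the decay factor `a` equals `decay_init`, so with a constant input `u` the scalar recurrence converges toward `b * u / (1 - a)`.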
modeling_esmfold2_common.py ADDED

The diff for this file is too large to render. See raw diff

protein_utils.py ADDED

	@@ -0,0 +1,488 @@

+# coding=utf-8
+# Copyright 2026 Biohub. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Self-contained protein featurization for ESMFold2 inference.
+Lets ``ESMFold2ExperimentalModel.infer_protein_as_pdb`` fold a protein sequence
+ESMFold-style without the ``esm`` companion package. The featurization
+mirrors ``ESMFold2InputBuilder.prepare_input`` for the protein-only path —
+``test_prepare_protein_features.py`` enforces tensor-exact parity.
+"""
+from __future__ import annotations
+import math
+import torch
+from torch import Tensor
+MOL_TYPE_PROTEIN = 0
+PROTEIN_UNK_RES_TYPE = 22
+MSA_GAP_TOKEN_ID = 1
+PROTEIN_RESIDUE_TO_RES_TYPE: dict[str, int] = {
+    "ALA": 2,
+    "ARG": 3,
+    "ASN": 4,
+    "ASP": 5,
+    "CYS": 6,
+    "GLN": 7,
+    "GLU": 8,
+    "GLY": 9,
+    "HIS": 10,
+    "ILE": 11,
+    "LEU": 12,
+    "LYS": 13,
+    "MET": 14,
+    "PHE": 15,
+    "PRO": 16,
+    "SER": 17,
+    "THR": 18,
+    "TRP": 19,
+    "TYR": 20,
+    "VAL": 21,
+}
+PROTEIN_1TO3: dict[str, str] = {
+    "A": "ALA",
+    "R": "ARG",
+    "N": "ASN",
+    "D": "ASP",
+    "C": "CYS",
+    "Q": "GLN",
+    "E": "GLU",
+    "G": "GLY",
+    "H": "HIS",
+    "I": "ILE",
+    "L": "LEU",
+    "K": "LYS",
+    "M": "MET",
+    "F": "PHE",
+    "P": "PRO",
+    "S": "SER",
+    "T": "THR",
+    "W": "TRP",
+    "Y": "TYR",
+    "V": "VAL",
+    "X": "UNK",
+}
+ESM_PROTEIN_VOCAB: dict[str, int] = {
+    "L": 4,
+    "A": 5,
+    "G": 6,
+    "V": 7,
+    "S": 8,
+    "E": 9,
+    "R": 10,
+    "T": 11,
+    "I": 12,
+    "D": 13,
+    "P": 14,
+    "K": 15,
+    "Q": 16,
+    "N": 17,
+    "F": 18,
+    "Y": 19,
+    "M": 20,
+    "H": 21,
+    "W": 22,
+    "C": 23,
+    "X": 3,
+}
+# Heavy atoms per canonical residue, in training-time order.
+PROTEIN_HEAVY_ATOMS: dict[str, list[str]] = {
+    "ALA": ["N", "CA", "C", "O", "CB"],
+    "ARG": ["N", "CA", "C", "O", "CB", "CG", "CD", "NE", "CZ", "NH1", "NH2"],
+    "ASN": ["N", "CA", "C", "O", "CB", "CG", "OD1", "ND2"],
+    "ASP": ["N", "CA", "C", "O", "CB", "CG", "OD1", "OD2"],
+    "CYS": ["N", "CA", "C", "O", "CB", "SG"],
+    "GLN": ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "NE2"],
+    "GLU": ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "OE2"],
+    "GLY": ["N", "CA", "C", "O"],
+    "HIS": ["N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2"],
+    "ILE": ["N", "CA", "C", "O", "CB", "CG1", "CG2", "CD1"],
+    "LEU": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2"],
+    "LYS": ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"],
+    "MET": ["N", "CA", "C", "O", "CB", "CG", "SD", "CE"],
+    "PHE": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"],
+    "PRO": ["N", "CA", "C", "O", "CB", "CG", "CD"],
+    "SER": ["N", "CA", "C", "O", "CB", "OG"],
+    "THR": ["N", "CA", "C", "O", "CB", "OG1", "CG2"],
+    "TRP": [
+        "N",
+        "CA",
+        "C",
+        "O",
+        "CB",
+        "CG",
+        "CD1",
+        "CD2",
+        "NE1",
+        "CE2",
+        "CE3",
+        "CZ2",
+        "CZ3",
+        "CH2",
+    ],
+    "TYR": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "OH"],
+    "VAL": ["N", "CA", "C", "O", "CB", "CG1", "CG2"],
+    "UNK": ["N", "CA", "C", "O"],
+}
+PROTEIN_REF_POS: dict[str, dict[str, tuple[float, float, float]]] = {
+    "ALA": {
+        "N": (-0.01003183238208294, -1.2073018550872803, -1.0555061101913452),
+        "CA": (-0.04190138354897499, 0.17447763681411743, -0.5729365348815918),
+        "C": (1.2127548456192017, 0.4737588167190552, 0.19521640241146088),
+        "O": (1.9390329122543335, 1.4484562873840332, -0.13759790360927582),
+        "CB": (-1.276943325996399, 0.4288230538368225, 0.29937705397605896),
+    },
+    "ARG": {
+        "N": (-2.0170421600341797, 0.6717798113822937, -1.1794233322143555),
+        "CA": (-2.0503084659576416, -0.5735036730766296, -0.4097220301628113),
+        "C": (-3.469440460205078, -1.0612813234329224, -0.2755832374095917),
+        "O": (-3.8218462467193604, -2.1369943618774414, -0.8294969797134399),
+        "CB": (-1.4193516969680786, -0.3735991418361664, 0.9852858781814575),
+        "CG": (0.11878877878189087, -0.3112654983997345, 0.963895857334137),
+        "CD": (0.6643245816230774, 1.0068185329437256, 0.3963329493999481),
+        "NE": (2.1090238094329834, 1.0977025032043457, 0.6120952367782593),
+        "CZ": (3.098905324935913, 0.3215920031070709, -0.09047172218561172),
+        "NH1": (4.461230278015137, 0.3844667971134186, 0.34141138195991516),
+        "NH2": (2.7856509685516357, -0.4166366159915924, -1.1148239374160767),
+    },
+    "ASN": {
+        "N": (-0.7595629096031189, 0.7503494620323181, 1.1369825601577759),
+        "CA": (-0.76087886095047, 0.23876343667507172, -0.23573364317417145),
+        "C": (-1.9211044311523438, -0.6982439160346985, -0.42196929454803467),
+        "O": (-2.677666187286377, -0.5753439664840698, -1.4223182201385498),
+        "CB": (0.5504899024963379, -0.5078350305557251, -0.5390339493751526),
+        "CG": (1.7250099182128906, 0.4264017939567566, -0.5778228640556335),
+        "OD1": (1.9470350742340088, 1.1086392402648926, -1.613560438156128),
+        "ND2": (2.57365345954895, 0.5730618834495544, 0.5608599781990051),
+    },
+    "ASP": {
+        "N": (-1.8452696800231934, -1.2169504165649414, 0.19437327980995178),
+        "CA": (-0.6379959583282471, -0.41974392533302307, 0.41681644320487976),
+        "C": (-0.9431572556495667, 1.0356197357177734, 0.18555717170238495),
+        "O": (-1.5183608531951904, 1.4045922756195068, -0.8739855885505676),
+        "CB": (0.48594576120376587, -0.8970447778701782, -0.5209363698959351),
+        "CG": (1.780342936515808, -0.19918935000896454, -0.2310730367898941),
+        "OD1": (2.5202910900115967, -0.6044584512710571, 0.7049641013145447),
+        "OD2": (2.1454880237579346, 0.9208861589431763, -0.9712985157966614),
+    },
+    "CYS": {
+        "N": (0.0469963513314724, 1.190075159072876, -1.1607273817062378),
+        "CA": (0.11344368755817413, -0.09400428831577301, -0.45952197909355164),
+        "C": (-1.2652032375335693, -0.6832379698753357, -0.3594406247138977),
+        "O": (-1.4631439447402954, -1.8851220607757568, -0.6826791763305664),
+        "CB": (0.6919880509376526, 0.09034398198127747, 0.952482283115387),
+        "SG": (2.4619927406311035, 0.5235707759857178, 0.9020372629165649),
+    },
+    "GLN": {
+        "N": (-2.370004653930664, -0.9637529850006104, -0.7942749261856079),
+        "CA": (-1.370002269744873, -0.6000258922576904, 0.2103111445903778),
+        "C": (-1.7545503377914429, 0.7091967463493347, 0.8433493971824646),
+        "O": (-1.8520662784576416, 0.7999289631843567, 2.0964975357055664),
+        "CB": (0.02040259726345539, -0.5004461407661438, -0.44764479994773865),
+        "CG": (1.1377512216567993, -0.28680720925331116, 0.582992434501648),
+        "CD": (2.4745187759399414, -0.24800164997577667, -0.09364881366491318),
+        "OE1": (3.1685523986816406, -1.2966246604919434, -0.1717153936624527),
+        "NE2": (2.947425603866577, 0.9601329565048218, -0.6888364553451538),
+    },
+    "GLU": {
+        "N": (-1.5850872993469238, -1.337684154510498, 0.9490851163864136),
+        "CA": (-1.0560977458953857, 0.027459044009447098, 1.0306966304779053),
+        "C": (-1.7741456031799316, 0.9664392471313477, 0.09259600937366486),
+        "O": (-1.9012441635131836, 2.181349992752075, 0.402479350566864),
+        "CB": (0.4706551432609558, 0.048803869634866714, 0.8114414811134338),
+        "CG": (0.9133604764938354, -0.4219329059123993, -0.5830985307693481),
+        "CD": (2.398822069168091, -0.3097084164619446, -0.7210537791252136),
+        "OE1": (3.1389315128326416, -1.274524450302124, -0.39029765129089355),
+        "OE2": (2.9647817611694336, 0.8781346082687378, -1.1732689142227173),
+    },
+    "GLY": {
+        "N": (-1.3942985534667969, -0.39875128865242004, -0.3370324671268463),
+        "CA": (-0.39974430203437805, 0.5488945245742798, 0.15242962539196014),
+        "C": (0.9440054893493652, -0.10314033925533295, 0.19859643280506134),
+        "O": (1.3352899551391602, -0.669218122959137, 1.2541258335113525),
+    },
+    "HIS": {
+        "N": (-1.4532867670059204, -1.0689626932144165, 0.881072461605072),
+        "CA": (-1.3396095037460327, 0.24797579646110535, 0.24960045516490936),
+        "C": (-2.675257921218872, 0.6571555733680725, -0.30441102385520935),
+        "O": (-3.1311378479003906, 1.8079776763916016, -0.06785715371370316),
+        "CB": (-0.3041955828666687, 0.21721023321151733, -0.8885309100151062),
+        "CG": (1.0887513160705566, 0.028941065073013306, -0.36419469118118286),
+        "ND1": (1.840459942817688, 1.0411773920059204, 0.29804590344429016),
+        "CD2": (1.780855417251587, -1.1011489629745483, -0.3814258575439453),
+        "CE1": (2.9566943645477295, 0.4924798905849457, 0.6477115750312805),
+        "NE2": (3.0280203819274902, -0.8751969337463379, 0.26084381341934204),
+    },
+    "ILE": {
+        "N": (-0.7167549729347229, -1.5426139831542969, -0.9983330368995667),
+        "CA": (-1.0636085271835327, -0.35169270634651184, -0.21393552422523499),
+        "C": (-1.3896740674972534, 0.8142145276069641, -1.1164065599441528),
+        "O": (-1.2377792596817017, 0.7302915453910828, -2.3656840324401855),
+        "CB": (0.061667006462812424, 0.01599610224366188, 0.8057394623756409),
+        "CG1": (1.502519965171814, -0.08899776637554169, 0.24154816567897797),
+        "CG2": (-0.053174979984760284, -0.8521055579185486, 2.0702083110809326),
+        "CD1": (1.7929610013961792, 0.899773120880127, -0.8863027691841125),
+    },
+    "LEU": {
+        "N": (1.9657520055770874, -1.9763224124908447, -0.18391533195972443),
+        "CA": (1.3077669143676758, -0.6677430868148804, -0.19492436945438385),
+        "C": (1.9905058145523071, 0.24182087182998657, 0.7879968285560608),
+        "O": (2.06896710395813, -0.07880014181137085, 2.0048046112060547),
+        "CB": (-0.20306941866874695, -0.8093230128288269, 0.11243502795696259),
+        "CG": (-0.9916267395019531, 0.5234957337379456, 0.06723011285066605),
+        "CD1": (-2.4228057861328125, 0.29949337244033813, 0.573042094707489),
+        "CD2": (-1.0282856225967407, 1.1250264644622803, -1.346014380455017),
+    },
+    "LYS": {
+        "N": (2.4221372604370117, -0.6473312377929688, 0.6370573043823242),
+        "CA": (2.0314927101135254, 0.2786507308483124, -0.4298512041568756),
+        "C": (2.7168593406677246, 1.595757246017456, -0.20924785733222961),
+        "O": (3.397681713104248, 2.116427421569824, -1.1332510709762573),
+        "CB": (0.5018402934074402, 0.4873858690261841, -0.49062973260879517),
+        "CG": (-0.25062066316604614, -0.7894009947776794, -0.9055535793304443),
+        "CD": (-1.769762635231018, -0.5552700161933899, -1.040329933166504),
+        "CE": (-2.576533555984497, -1.0221366882324219, 0.18493641912937164),
+        "NZ": (-2.269151210784912, -0.24293844401836395, 1.3849012851715088),
+    },
+    "MET": {
+        "N": (1.8903918266296387, -1.5252995491027832, -0.42638593912124634),
+        "CA": (1.2630571126937866, -0.24417810142040253, -0.7626462578773499),
+        "C": (2.30391001701355, 0.8367712497711182, -0.7254616618156433),
+        "O": (2.465414524078369, 1.5928632020950317, -1.7207728624343872),
+        "CB": (0.10567972809076309, 0.10861825942993164, 0.19741646945476532),
+        "CG": (-1.0658042430877686, -0.8736631274223328, 0.08811883628368378),
+        "SD": (-2.4557132720947266, -0.3332225978374481, 1.1461700201034546),
+        "CE": (-3.265165090560913, 0.7033554911613464, -0.11588376015424728),
+    },
+    "PHE": {
+        "N": (-2.8484435081481934, -1.525790810585022, 0.01789816841483116),
+        "CA": (-1.591969609260559, -0.8545162677764893, 0.35214468836784363),
+        "C": (-1.8900631666183472, 0.45833414793014526, 1.0232222080230713),
+        "O": (-1.3424992561340332, 0.74432373046875, 2.121629476547241),
+        "CB": (-0.760358452796936, -0.6342853307723999, -0.9257160425186157),
+        "CG": (0.604112982749939, -0.07200468331575394, -0.6148118376731873),
+        "CD1": (0.8468314409255981, 1.2480632066726685, -0.7146694660186768),
+        "CD2": (1.6827683448791504, -0.9758077263832092, -0.1423054188489914),
+        "CE1": (2.1801748275756836, 1.7875733375549316, -0.3744623064994812),
+        "CE2": (2.888307809829712, -0.48277512192726135, 0.16804970800876617),
+        "CZ": (3.149812936782837, 0.9656873941421509, 0.04440271109342575),
+    },
+    "PRO": {
+        "N": (-0.836250364780426, -0.9899801015853882, 0.5561304688453674),
+        "CA": (0.32722190022468567, -0.6164458394050598, -0.25072571635246277),
+        "C": (1.6121541261672974, -1.1711241006851196, 0.31082412600517273),
+        "O": (1.6127740144729614, -2.2771971225738525, 0.9156193733215332),
+        "CB": (0.3248198926448822, 0.9028244018554688, -0.33368146419525146),
+        "CG": (-1.1425083875656128, 1.2730128765106201, -0.2590600252151489),
+        "CD": (-1.8495968580245972, 0.026575811207294464, 0.2681289613246918),
+    },
+    "SER": {
+        "N": (0.674650251865387, 1.5018702745437622, -0.5367295145988464),
+        "CA": (0.00013792862591799349, 0.4966467022895813, 0.28510504961013794),
+        "C": (0.9941009879112244, -0.5374617576599121, 0.73505038022995),
+        "O": (1.0545241832733154, -0.8683545589447021, 1.9495396614074707),
+        "CB": (-1.1279288530349731, -0.1659376323223114, -0.5160963535308838),
+        "OG": (-1.8135979175567627, -1.085249662399292, 0.28947514295578003),
+    },
+    "THR": {
+        "N": (-1.325830340385437, -1.3728225231170654, 0.6882233023643494),
+        "CA": (-0.5433306097984314, -0.16364754736423492, 0.41697052121162415),
+        "C": (-1.294381856918335, 0.7077372074127197, -0.5549946427345276),
+        "O": (-1.6939635276794434, 0.23654410243034363, -1.6540418863296509),
+        "CB": (0.853203296661377, -0.5363803505897522, -0.14109353721141815),
+        "OG1": (1.5220820903778076, -1.379003643989563, 0.7635167837142944),
+        "CG2": (1.7225933074951172, 0.7054727077484131, -0.3651331067085266),
+    },
+    "TRP": {
+        "N": (3.686030864715576, 0.7599999904632568, 0.496155709028244),
+        "CA": (2.384092092514038, 0.09079249948263168, 0.5325262546539307),
+        "C": (2.1113572120666504, -0.6121063232421875, -0.7733646035194397),
+        "O": (1.796526312828064, -1.8323148488998413, -0.7775964140892029),
+        "CB": (1.281521201133728, 1.1139036417007446, 0.8559791445732117),
+        "CG": (-0.04292375594377518, 0.44645074009895325, 1.0942792892456055),
+        "CD1": (-0.42329534888267517, -0.15470874309539795, 2.2227554321289062),
+        "CD2": (-1.1023900508880615, 0.2158389836549759, 0.11529432237148285),
+        "NE1": (-1.7030320167541504, -0.7665823101997375, 2.0595016479492188),
+        "CE2": (-2.045644998550415, -0.4881173074245453, 0.710669219493866),
+        "CE3": (-1.2173502445220947, 0.6102271676063538, -1.300106406211853),
+        "CZ2": (-3.256009340286255, -0.9164394736289978, -0.00984987337142229),
+        "CZ3": (-2.315925121307373, 0.2306906282901764, -1.9776310920715332),
+        "CH2": (-3.3817875385284424, -0.5677337646484375, -1.3032053709030151),
+    },
+    "TYR": {
+        "N": (-1.7900604009628296, -0.8409399390220642, 1.3180142641067505),
+        "CA": (-1.913882851600647, 0.23552845418453217, 0.330669641494751),
+        "C": (-3.347280740737915, 0.3588399887084961, -0.09830684959888458),
+        "O": (-3.967811346054077, -0.6449354290962219, -0.5423302054405212),
+        "CB": (-1.0093992948532104, 0.0004731413209810853, -0.8981552124023438),
+        "CG": (0.4520410895347595, 0.021162061020731926, -0.5305932760238647),
+        "CD1": (1.0992432832717896, 1.1877919435501099, -0.3579142987728119),
+        "CD2": (1.1803174018859863, -1.253401279449463, -0.31122180819511414),
+        "CE1": (2.5253450870513916, 1.1990256309509277, 0.029804613441228867),
+        "CE2": (2.471151113510132, -1.240687608718872, 0.043534230440855026),
+        "CZ": (3.180687665939331, 0.04672492295503616, 0.2214856892824173),
+        "OH": (4.523719787597656, 0.0671030730009079, 0.5877485871315002),
+    },
+    "VAL": {
+        "N": (0.5987519025802612, -1.569443702697754, -0.7379124760627747),
+        "CA": (0.6014357209205627, -0.10503966361284256, -0.6336286664009094),
+        "C": (1.8391697406768799, 0.4067850410938263, 0.06351757049560547),
+        "O": (2.3952062129974365, -0.2666190266609192, 0.9731166958808899),
+        "CB": (-0.694736897945404, 0.4259096384048462, 0.03581475466489792),
+        "CG1": (-1.9276031255722046, 0.09515828639268875, -0.8172357082366943),
+        "CG2": (-0.8938426971435547, -0.08640842139720917, 1.472349762916565),
+    },
+    "UNK": {
+        "N": (0.0, 0.0, 0.0),
+        "CA": (0.0, 0.0, 0.0),
+        "C": (0.0, 0.0, 0.0),
+        "O": (0.0, 0.0, 0.0),
+    },
+}
+# Protonated nitrogens at physiological pH (matches CHARGED_ATOMS in the
+# open-source constants for the protein subset).
+PROTEIN_CHARGED_ATOMS: dict[tuple[str, str], int] = {
+    ("LYS", "NZ"): 1,
+    ("ARG", "NH2"): 1,
+    ("HIS", "ND1"): 1,
+}
+# Only the elements that appear in canonical protein heavy atoms.
+_PROTEIN_ELEMENT_TO_ATOMIC_NUM: dict[str, int] = {"C": 6, "N": 7, "O": 8, "S": 16}
+def _encode_atom_name(name: str) -> list[int]:
+    padded = name.ljust(4)[:4]
+    return [ord(c) - 32 if c != " " else 0 for c in padded]
+def prepare_protein_features(sequence: str) -> dict[str, Tensor]:
+    """Featurize a single protein sequence for ESMFold2ExperimentalModel.forward.
+    Returns the same keys with the same dtypes/shapes as
+    ``ESMFold2InputBuilder.prepare_input(StructurePredictionInput(...))``
+    restricted to a single-chain protein with no MSA, modifications,
+    distogram conditioning, or covalent bonds. All tensors have a
+    leading batch dim of 1; the caller is responsible for moving them
+    to the model device.
+    """
+    if not sequence:
+        raise ValueError("sequence must be non-empty")
+    res_3letter = [PROTEIN_1TO3.get(c, "UNK") for c in sequence]
+    L = len(sequence)
+    token_atom_starts: list[int] = []
+    atom_records: list[tuple[int, str, str, int, tuple[float, float, float]]] = []
+    res_type_vals: list[int] = []
+    input_id_vals: list[int] = []
+    distogram_rep_atom_idx: list[int] = []
+    atom_cursor = 0
+    for t_idx, (letter, res_3) in enumerate(zip(sequence, res_3letter)):
+        atom_names = PROTEIN_HEAVY_ATOMS[res_3]
+        res_type = PROTEIN_RESIDUE_TO_RES_TYPE.get(res_3, PROTEIN_UNK_RES_TYPE)
+        input_id = ESM_PROTEIN_VOCAB.get(letter, ESM_PROTEIN_VOCAB["X"])
+        token_atom_starts.append(atom_cursor)
+        for name in atom_names:
+            charge = PROTEIN_CHARGED_ATOMS.get((res_3, name), 0)
+            element = name[0]  # protein heavy atoms are all single-letter C/N/O/S
+            ref_pos = PROTEIN_REF_POS[res_3][name]
+            atom_records.append((t_idx, name, element, charge, ref_pos))
+            atom_cursor += 1
+        rep_name = "CB" if "CB" in atom_names else "CA"
+        distogram_rep_atom_idx.append(
+            token_atom_starts[t_idx] + atom_names.index(rep_name)
+        )
+        res_type_vals.append(res_type)
+        input_id_vals.append(input_id)
+    n_real_atoms = len(atom_records)
+    n_atoms = math.ceil(n_real_atoms / 32) * 32 if n_real_atoms > 0 else 32
+    ref_pos = torch.zeros(n_atoms, 3, dtype=torch.float32)
+    ref_element = torch.zeros(n_atoms, dtype=torch.int64)
+    ref_charge = torch.zeros(n_atoms, dtype=torch.int8)
+    ref_atom_name_chars = torch.zeros(n_atoms, 4, dtype=torch.int64)
+    ref_space_uid = torch.zeros(n_atoms, dtype=torch.int64)
+    atom_attention_mask = torch.zeros(n_atoms, dtype=torch.bool)
+    atom_to_token = torch.zeros(n_atoms, dtype=torch.int64)
+    for i, (t_idx, name, element, charge, pos) in enumerate(atom_records):
+        ref_pos[i] = torch.tensor(pos, dtype=torch.float32)
+        ref_element[i] = _PROTEIN_ELEMENT_TO_ATOMIC_NUM[element]
+        ref_charge[i] = charge
+        ref_atom_name_chars[i] = torch.tensor(
+            _encode_atom_name(name), dtype=torch.int64
+        )
+        ref_space_uid[i] = t_idx
+        atom_attention_mask[i] = True
+        atom_to_token[i] = t_idx
+    token_index = torch.arange(L, dtype=torch.int64)
+    residue_index = torch.arange(L, dtype=torch.int64)
+    asym_id = torch.zeros(L, dtype=torch.int64)
+    sym_id = torch.zeros(L, dtype=torch.int64)
+    entity_id = torch.ones(L, dtype=torch.int64)
+    mol_type = torch.full((L,), MOL_TYPE_PROTEIN, dtype=torch.int64)
+    res_type = torch.tensor(res_type_vals, dtype=torch.int64)
+    input_ids = torch.tensor(input_id_vals, dtype=torch.int64)
+    token_bonds = torch.zeros(L, L, 1, dtype=torch.float32)
+    token_attention_mask = torch.ones(L, dtype=torch.bool)
+    distogram_atom_idx = torch.tensor(distogram_rep_atom_idx, dtype=torch.int64)
+    # Single-sequence MSA: depth 1, row 0 is the sequence itself.
+    msa = res_type.unsqueeze(0)
+    msa_attention_mask = torch.ones(1, L, dtype=torch.bool)
+    has_deletion = torch.zeros(1, L, dtype=torch.bool)
+    deletion_value = torch.zeros(1, L, dtype=torch.float32)
+    deletion_mean = torch.zeros(L, dtype=torch.float32)
+    features = {
+        "token_index": token_index,
+        "residue_index": residue_index,
+        "asym_id": asym_id,
+        "sym_id": sym_id,
+        "entity_id": entity_id,
+        "mol_type": mol_type,
+        "res_type": res_type,
+        "input_ids": input_ids,
+        "token_bonds": token_bonds,
+        "token_attention_mask": token_attention_mask,
+        "ref_pos": ref_pos,
+        "ref_element": ref_element,
+        "ref_charge": ref_charge,
+        "ref_atom_name_chars": ref_atom_name_chars,
+        "ref_space_uid": ref_space_uid,
+        "atom_attention_mask": atom_attention_mask,
+        "atom_to_token": atom_to_token,
+        "distogram_atom_idx": distogram_atom_idx,
+        "msa": msa,
+        "msa_attention_mask": msa_attention_mask,
+        "has_deletion": has_deletion,
+        "deletion_value": deletion_value,
+        "deletion_mean": deletion_mean,
+    }
+    return {k: v.unsqueeze(0) for k, v in features.items()}
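
The atom bookkeeping in `prepare_protein_features` (per-token atom offsets, the distogram representative atom, padding the atom axis to a multiple of 32, and the atom-name encoding) can be checked without torch. The following is a minimal sketch, not part of the commit. The names `HEAVY_ATOMS`, `ONE_TO_THREE`, `encode_atom_name`, `padded_atom_count`, and `layout` are illustrative, and the residue table is trimmed to two entries copied from `PROTEIN_HEAVY_ATOMS`.

```python
import math

# Two entries copied from PROTEIN_HEAVY_ATOMS, trimmed for illustration only.
HEAVY_ATOMS = {
    "GLY": ["N", "CA", "C", "O"],
    "ALA": ["N", "CA", "C", "O", "CB"],
}
ONE_TO_THREE = {"G": "GLY", "A": "ALA"}


def encode_atom_name(name: str) -> list[int]:
    """Pad or truncate to 4 characters, then map each to ord(c) - 32.

    A space has ord 32, so padding positions encode to 0.
    """
    padded = name.ljust(4)[:4]
    return [ord(c) - 32 for c in padded]


def padded_atom_count(n_real_atoms: int, multiple: int = 32) -> int:
    """Round the atom count up to the next multiple (at least one full window)."""
    if n_real_atoms <= 0:
        return multiple
    return math.ceil(n_real_atoms / multiple) * multiple


def layout(sequence: str) -> tuple[list[int], list[int], int]:
    """Return (token_atom_starts, distogram_rep_atom_idx, n_real_atoms)."""
    starts: list[int] = []
    rep_idx: list[int] = []
    cursor = 0
    for letter in sequence:
        names = HEAVY_ATOMS[ONE_TO_THREE[letter]]
        starts.append(cursor)
        # CB is the representative atom when present; glycine falls back to CA.
        rep = "CB" if "CB" in names else "CA"
        rep_idx.append(cursor + names.index(rep))
        cursor += len(names)
    return starts, rep_idx, cursor


starts, rep_idx, n_real = layout("GA")
print(starts, rep_idx, n_real, padded_atom_count(n_real))
print(encode_atom_name("CA"))
```

For the two-residue sequence `GA`, glycine contributes 4 atoms and alanine 5, so 9 real atoms are padded to 32. Padded slots keep `atom_attention_mask` set to `False` in the committed function, which is how downstream code can tell them apart from real atoms.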