---
library_name: pytorch
license: other
tags:
- glycans
- proteins
- protein-glycan
- affinose
- bertose
- esm-c
- pytorch
---

# AFFINose Interaction Model

This repository contains the AFFINose checkpoint for protein-glycan interaction inference. AFFINose combines BERTose glycan token representations with per-residue ESM-C protein embeddings and returns a scalar interaction score.

## Quick Start

The recommended user path is the companion notebook. For direct Python use, download the checkpoint and vocabulary with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="checkpoints/affinose_interaction_model.pt",
)
vocab = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="vocab/bpe_vocabulary.json",
)
```

No Hugging Face token is required for this AFFINose checkpoint now that the repository is public. ESM-C is separate and may require the user's own Hugging Face login depending on EvolutionaryScale access requirements.

## Files

- `checkpoints/affinose_interaction_model.pt` - AFFINose interaction checkpoint.
- `vocab/bpe_vocabulary.json` - WURCS BPE vocabulary for glycan tokenization.
- `src/affinose_model.py` - AFFINose architecture.
- `src/affinose_inference.py` - standalone inference helper.
- `src/affinose_dataset.py` - tokenizer and data utility helpers.
- `src/bertose_model.py` - BERTose model definition used for glycan encoding.
- `src/bertose_layers.py` - Transformer layers used by BERTose.
- `src/wurcs_bpe_tokenizer.py` - WURCS BPE tokenizer.

## Input

Provide one protein-glycan pair or a CSV batch. Glycans should be WURCS strings. Proteins can be provided as IDs linked to precomputed embeddings, or through the companion notebook as raw sequences that are embedded with ESM-C 300M. Batch CSVs use `sample_id,protein_id,protein_sequence,glycan_wurcs`.

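
The batch layout above can be sketched with Python's standard `csv` module. WURCS strings contain commas, so the `glycan_wurcs` field has to be quoted in the file; `csv.DictWriter` handles that automatically. The helper names `write_batch` and `read_batch`, the `WURCS=` prefix check, and the sample and protein IDs are illustrative assumptions rather than part of the AFFINose code, and the glycan is only an example WURCS string.

```python
import csv
import io

# Column order as listed in this README.
COLUMNS = ["sample_id", "protein_id", "protein_sequence", "glycan_wurcs"]


def write_batch(rows, handle):
    """Write row dicts as a batch CSV; fields containing commas are quoted."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_batch(handle):
    """Read a batch CSV and run basic checks (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Assumed sanity check: WURCS strings start with "WURCS=".
        if not row["glycan_wurcs"].startswith("WURCS="):
            raise ValueError(f"line {line_no}: glycan_wurcs is not a WURCS string")
        rows.append(row)
    return rows


# Placeholder IDs; the sequence is the one used elsewhere in this README.
example_rows = [
    {
        "sample_id": "sample_1",
        "protein_id": "protein_1",
        "protein_sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "glycan_wurcs": "WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/",
    }
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_batch(example_rows, buffer)
buffer.seek(0)
parsed = read_batch(buffer)
```

When building a batch file by hand or in a spreadsheet, check that the WURCS column is wrapped in double quotes; an unquoted WURCS string will be split across several columns.
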
Free-text glycan names, common names, SNFG drawings, and IUPAC-condensed strings are not parsed directly by AFFINose. Convert those inputs to WURCS first, then score the protein-glycan pair.

## Protein Embedding Requirement

AFFINose expects per-residue ESM-C 300M embeddings with shape `[L, 960]`. Do not mean-pool the protein before passing it into AFFINose.

ESM-C is a separate EvolutionaryScale protein model. The ESM-C weights are not included in this repository. Users should install the `esm` package and let it download ESM-C 300M into their own runtime cache.

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

esmc = ESMC.from_pretrained("esmc_300m").to("cuda")  # or "cpu"
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein_tensor = esmc.encode(protein)
output = esmc.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
protein_embeddings = output.embeddings  # per-residue ESM-C 300M embeddings
```

If Hugging Face requests authentication for ESM-C, users should authenticate with their own Hugging Face account/token and accept any required EvolutionaryScale terms. BERTose/AFFINose tokens are not required once these repositories are public.

## Output

A scalar protein-glycan interaction score from the trained AFFINose head.

## Scope

This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.

License metadata is currently `other`; update it when the final release license and citation text are chosen.

## References

- EvolutionaryScale ESM package: https://github.com/evolutionaryscale/esm
- ESM-C 300M Hugging Face model: https://huggingface.co/EvolutionaryScale/esmc-300m-2024-12