AFFINose Interaction Model
This repository contains the AFFINose checkpoint for protein-glycan interaction inference. AFFINose combines BERTose glycan token representations with per-residue ESM-C protein embeddings and returns a scalar interaction score.
Quick Start
The recommended user path is the companion notebook. For direct Python use, download the checkpoint and vocabulary with huggingface_hub:
from huggingface_hub import hf_hub_download
checkpoint = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="checkpoints/affinose_interaction_model.pt",
)
vocab = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="vocab/bpe_vocabulary.json",
)
No Hugging Face token is required to download this AFFINose checkpoint, because the repository is public. ESM-C is distributed separately and may require the user's own Hugging Face login, depending on EvolutionaryScale access requirements.
Files
- checkpoints/affinose_interaction_model.pt: AFFINose interaction checkpoint.
- vocab/bpe_vocabulary.json: WURCS BPE vocabulary for glycan tokenization.
- src/affinose_model.py: AFFINose architecture.
- src/affinose_inference.py: standalone inference helper.
- src/affinose_dataset.py: tokenizer and data utility helpers.
- src/bertose_model.py: BERTose model definition used for glycan encoding.
- src/bertose_layers.py: Transformer layers used by BERTose.
- src/wurcs_bpe_tokenizer.py: WURCS BPE tokenizer.
Input
Provide one protein-glycan pair or a CSV batch. Glycans should be WURCS strings. Proteins can be provided as IDs linked to precomputed embeddings, or through the companion notebook as raw sequences that are embedded with ESM-C 300M.
Batch CSVs use the header sample_id,protein_id,protein_sequence,glycan_wurcs. AFFINose does not directly parse free-text glycan names, common names, SNFG drawings, or IUPAC-condensed strings. Convert those inputs to WURCS first, then score the protein-glycan pair.
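A batch file can be sanity-checked before scoring. The sketch below is illustrative only and is not part of this repository: read_batch_rows is a hypothetical helper, the "WURCS=" prefix test is a cheap heuristic rather than a real WURCS parser, and the example glycan string is just sample data. It uses only the Python standard library.

```python
import csv
import io

# Column names come from this README; everything else is an assumption.
REQUIRED_COLUMNS = ("sample_id", "protein_id", "protein_sequence", "glycan_wurcs")


def read_batch_rows(handle):
    """Parse a batch CSV from an open text handle into a list of dicts.

    Raises ValueError if a required column is missing or a glycan
    field does not start with the "WURCS=" prefix.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for index, row in enumerate(reader, start=1):  # 1-based data-row index
        glycan = (row["glycan_wurcs"] or "").strip()
        if not glycan.startswith("WURCS="):
            raise ValueError(f"data row {index}: glycan is not a WURCS string")
        row["glycan_wurcs"] = glycan
        rows.append(row)
    return rows


# WURCS strings contain commas, so the glycan field must be quoted in the CSV.
example = (
    "sample_id,protein_id,protein_sequence,glycan_wurcs\n"
    's1,p1,MKTAYIAK,"WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"\n'
)
rows = read_batch_rows(io.StringIO(example))
print(rows[0]["sample_id"], rows[0]["glycan_wurcs"])
```

Whether protein_sequence may be left empty when protein_id points to a precomputed embedding is not stated here, so the sketch checks only that the column exists.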
Protein Embedding Requirement
AFFINose expects per-residue ESM-C 300M embeddings with shape [L, 960]. Do not mean-pool the protein before passing it into AFFINose.
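The shape requirement can be enforced with a small guard before calling the model. This is a minimal sketch under stated assumptions: check_protein_embeddings is a hypothetical helper, not a function from this repository, and it relies only on the input exposing a .shape attribute, as NumPy arrays and PyTorch tensors do.

```python
EMBED_DIM = 960  # per-residue width of ESM-C 300M embeddings, as stated above


def check_protein_embeddings(embeddings, expected_length=None):
    """Return the (L, 960) shape of `embeddings`, or raise ValueError.

    `embeddings` is any array-like object with a `.shape` attribute.
    `expected_length` is optional; pass it only if you know the L
    your pipeline should produce.
    """
    shape = tuple(int(dim) for dim in embeddings.shape)
    if len(shape) != 2:
        # A mean-pooled vector is 1-D and a batched tensor is 3-D.
        raise ValueError(f"expected a 2-D [L, {EMBED_DIM}] array, got shape {shape}")
    if shape[1] != EMBED_DIM:
        raise ValueError(f"expected embedding width {EMBED_DIM}, got {shape[1]}")
    if expected_length is not None and shape[0] != expected_length:
        raise ValueError(f"expected L={expected_length}, got L={shape[0]}")
    return shape
```

The guard rejects a mean-pooled vector of shape (960,) as well as any tensor that still carries a leading batch dimension, so such a dimension has to be dropped before the check. Whether L should count special tokens is not specified here, which is why expected_length is optional.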
ESM-C is a separate EvolutionaryScale protein model. The ESM-C weights are not included in this repository. Users should install the esm package and let it download ESM-C 300M into their own runtime cache.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
esmc = ESMC.from_pretrained("esmc_300m").to("cuda") # or "cpu"
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein_tensor = esmc.encode(protein)
output = esmc.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
protein_embeddings = output.embeddings # per-residue ESM-C 300M embeddings
If Hugging Face requests authentication for ESM-C, users should authenticate with their own Hugging Face account/token and accept any required EvolutionaryScale terms. BERTose/AFFINose tokens are not required once these repositories are public.
Output
A scalar protein-glycan interaction score from the trained AFFINose head.
Scope
This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.
License metadata is currently other; update it when the final release license and citation text are chosen.
References
- EvolutionaryScale ESM package: https://github.com/evolutionaryscale/esm
- ESM-C 300M Hugging Face model: https://huggingface.co/EvolutionaryScale/esmc-300m-2024-12