AFFINose Interaction Model

This repository contains the AFFINose checkpoint for protein-glycan interaction inference. AFFINose combines BERTose glycan token representations with per-residue ESM-C protein embeddings and returns a scalar interaction score.

Quick Start

The recommended way to run AFFINose is the companion notebook. For direct Python use, download the checkpoint and vocabulary with huggingface_hub:

from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="checkpoints/affinose_interaction_model.pt",
)
vocab = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="vocab/bpe_vocabulary.json",
)

The AFFINose repository is public, so no Hugging Face token is required to download this checkpoint. ESM-C is distributed separately and may require your own Hugging Face login, depending on EvolutionaryScale's access requirements.
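hf_hub_download returns the local path of the cached file, so the downloaded vocabulary can be checked with ordinary file tools before it is passed to the tokenizer. The sketch below only confirms that the file parses as JSON and reports its top-level type and size. The helper name describe_json_file is hypothetical, and nothing is assumed about the internal layout of bpe_vocabulary.json.

```python
import json
from pathlib import Path


def describe_json_file(path):
    """Return (top-level type name, number of entries) for a JSON file.

    Hypothetical helper: it checks only that the file parses as JSON,
    not that it is a valid AFFINose vocabulary.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Scalars have no length, so count them as a single entry.
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size


# Usage, with `vocab` being the path returned by hf_hub_download above:
# kind, size = describe_json_file(vocab)
# print(f"vocabulary file: top-level {kind} with {size} entries")
```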

Files

  • checkpoints/affinose_interaction_model.pt - AFFINose interaction checkpoint.
  • vocab/bpe_vocabulary.json - WURCS BPE vocabulary for glycan tokenization.
  • src/affinose_model.py - AFFINose architecture.
  • src/affinose_inference.py - standalone inference helper.
  • src/affinose_dataset.py - tokenizer and data utility helpers.
  • src/bertose_model.py - BERTose model definition used for glycan encoding.
  • src/bertose_layers.py - Transformer layers used by BERTose.
  • src/wurcs_bpe_tokenizer.py - WURCS BPE tokenizer.

Input

Provide one protein-glycan pair or a CSV batch. Glycans must be WURCS strings. Proteins can be provided as IDs linked to precomputed embeddings or, through the companion notebook, as raw sequences that are embedded with ESM-C 300M.

Batch CSVs use the columns sample_id, protein_id, protein_sequence, and glycan_wurcs. AFFINose does not parse free-text glycan names, common names, SNFG drawings, or IUPAC-condensed strings. Convert those inputs to WURCS first, then score the protein-glycan pair.
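As a minimal sketch of the batch format, the snippet below reads a CSV with the four columns named above and rejects rows whose glycan does not start with "WURCS=". The helper read_batch and the prefix check are illustrative assumptions, not part of the AFFINose code, and the prefix check is only a crude guard against names or IUPAC strings, not a WURCS validator. The example glycan is an illustrative single-residue WURCS string. WURCS strings contain commas, so the glycan field has to be quoted in the CSV.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_id", "protein_id", "protein_sequence", "glycan_wurcs"]


def read_batch(handle):
    """Parse a batch CSV from an open text handle into a list of dicts.

    Raises ValueError if a required column is missing or a glycan
    does not start with the "WURCS=" prefix.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for index, row in enumerate(reader, start=1):
        glycan = (row["glycan_wurcs"] or "").strip()
        if not glycan.startswith("WURCS="):
            raise ValueError(f"data row {index}: not a WURCS string: {glycan!r}")
        rows.append(row)
    return rows


# The glycan is quoted because the WURCS string itself contains commas.
example = io.StringIO(
    "sample_id,protein_id,protein_sequence,glycan_wurcs\n"
    's1,P1,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ,"WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"\n'
)
rows = read_batch(example)
print(rows[0]["sample_id"], rows[0]["glycan_wurcs"])
```

Writing the file with csv.writer or csv.DictWriter applies the required quoting automatically.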

Protein Embedding Requirement

AFFINose expects per-residue ESM-C 300M embeddings with shape [L, 960]. Do not mean-pool the protein before passing it into AFFINose.
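The shape requirement can be checked before scoring. The sketch below assumes only that the embedding object exposes a .shape attribute, as PyTorch tensors and NumPy arrays do. The helper residue_count is hypothetical and is not part of the AFFINose code. A mean-pooled vector of shape [960] fails the check because it is one-dimensional, and an array with an extra leading dimension is reported rather than silently reshaped.

```python
ESMC_300M_DIM = 960  # per-residue embedding width stated above


def residue_count(embeddings):
    """Return L for an [L, 960] array-like, or raise ValueError.

    Hypothetical helper: works with any object that has a `.shape`
    attribute (for example a torch.Tensor or a numpy.ndarray).
    """
    shape = tuple(embeddings.shape)
    if len(shape) != 2:
        raise ValueError(
            f"expected a 2-D [L, {ESMC_300M_DIM}] array, got shape {shape}"
        )
    length, width = shape
    if width != ESMC_300M_DIM:
        raise ValueError(
            f"expected embedding width {ESMC_300M_DIM}, got {width}"
        )
    return length
```

Calling residue_count on the protein embeddings before inference turns a silent shape mismatch into an explicit error message.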

ESM-C is a separate EvolutionaryScale protein model. The ESM-C weights are not included in this repository. Users should install the esm package and let it download ESM-C 300M into their own runtime cache.

from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

esmc = ESMC.from_pretrained("esmc_300m").to("cuda")  # or "cpu"
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein_tensor = esmc.encode(protein)
output = esmc.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
protein_embeddings = output.embeddings  # per-residue ESM-C 300M embeddings

If Hugging Face requests authentication for ESM-C, authenticate with your own Hugging Face account or access token and accept any required EvolutionaryScale terms. No Hugging Face access token is required for the BERTose and AFFINose repositories once they are public.

Output

AFFINose returns a scalar protein-glycan interaction score, produced by the trained AFFINose head.

Scope

This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.

The license metadata is currently set to "other"; update it when the final release license and citation text are chosen.

References
