AFFINose Interaction Model

This repository contains the AFFINose checkpoint for protein-glycan interaction inference. AFFINose combines BERTose glycan token representations with per-residue ESM-C protein embeddings and returns a scalar interaction score.

Quick Start

The recommended user path is the companion notebook. For direct Python use, download the checkpoint and vocabulary with huggingface_hub:

from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="checkpoints/affinose_interaction_model.pt",
)
vocab = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="vocab/bpe_vocabulary.json",
)

This AFFINose repository is public, so no Hugging Face token is required to download the checkpoint or vocabulary. ESM-C is distributed separately and may require the user's own Hugging Face login, depending on EvolutionaryScale access requirements.

Files

  • checkpoints/affinose_interaction_model.pt - AFFINose interaction checkpoint.
  • vocab/bpe_vocabulary.json - WURCS BPE vocabulary for glycan tokenization.
  • src/affinose_model.py - AFFINose architecture.
  • src/affinose_inference.py - standalone inference helper.
  • src/affinose_dataset.py - tokenizer and data utility helpers.
  • src/bertose_model.py - BERTose model definition used for glycan encoding.
  • src/bertose_layers.py - Transformer layers used by BERTose.
  • src/wurcs_bpe_tokenizer.py - WURCS BPE tokenizer.

Input

Provide one protein-glycan pair or a CSV batch. Glycans must be WURCS strings. Proteins can be provided as IDs linked to precomputed embeddings, or through the companion notebook as raw sequences that are embedded with ESM-C 300M.

Batch CSVs use the header sample_id,protein_id,protein_sequence,glycan_wurcs. AFFINose does not parse free-text glycan names, common names, SNFG drawings, or IUPAC-condensed strings. Convert those inputs to WURCS first, then score the protein-glycan pair.
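
The batch layout above can be checked before scoring with a small pre-flight script. This is only a sketch under stated assumptions: validate_batch_csv is a hypothetical helper that does not exist in this repository, the "WURCS=" prefix test is a cheap heuristic rather than real WURCS validation, and the example glycan is illustrative. WURCS strings contain commas, so the glycan field is quoted in the CSV.

```python
import csv
import io

# Column names taken from the batch CSV header described above.
REQUIRED_COLUMNS = ["sample_id", "protein_id", "protein_sequence", "glycan_wurcs"]


def validate_batch_csv(handle):
    """Hypothetical helper: split a batch CSV into usable rows and problems.

    `handle` is a text file object. Returns (rows, problems), where each
    problem is a (row_number, sample_id, message) tuple.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows, problems = [], []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        glycan = (row["glycan_wurcs"] or "").strip()
        # Heuristic only: WURCS strings begin with "WURCS=".
        if not glycan.startswith("WURCS="):
            problems.append((row_number, row["sample_id"], "glycan is not a WURCS string"))
            continue
        rows.append(row)
    return rows, problems


example = io.StringIO(
    "sample_id,protein_id,protein_sequence,glycan_wurcs\n"
    's1,P1,MKTAYIAKQR,"WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"\n'
    "s2,P1,MKTAYIAKQR,Gal(b1-4)GlcNAc\n"
)
rows, problems = validate_batch_csv(example)
for row_number, sample_id, message in problems:
    print(f"row {row_number} ({sample_id}): {message}")
```

Rows flagged here, such as the IUPAC-condensed glycan in the second sample, still need conversion to WURCS by an external tool before they can be scored.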

Protein Embedding Requirement

AFFINose expects per-residue ESM-C 300M embeddings with shape [L, 960]. Do not mean-pool the protein before passing it into AFFINose.
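
A quick shape check can catch an accidentally pooled embedding before it reaches the model. The sketch below is an assumption-laden illustration: check_per_residue_shape is a hypothetical helper, not part of this repository, and it works on a plain shape tuple (for example tuple(tensor.shape)) so that it runs without any tensor library installed.

```python
ESMC_300M_DIM = 960  # embedding width stated in this README


def check_per_residue_shape(shape, hidden_dim=ESMC_300M_DIM):
    """Hypothetical helper: return L if `shape` is [L, hidden_dim], else raise."""
    shape = tuple(shape)
    if shape == (hidden_dim,):
        # A single vector of width hidden_dim is what mean-pooling produces.
        raise ValueError("embedding looks mean-pooled; pass per-residue embeddings")
    if len(shape) != 2 or shape[1] != hidden_dim:
        raise ValueError(f"expected shape [L, {hidden_dim}], got {list(shape)}")
    return shape[0]


residue_axis_length = check_per_residue_shape((33, 960))
print(residue_axis_length)
```

If the embedding tensor carries a leading batch axis, drop that axis before applying a check like this one.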

ESM-C is a separate EvolutionaryScale protein model. The ESM-C weights are not included in this repository. Users should install the esm package and let it download ESM-C 300M into their own runtime cache.

from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

esmc = ESMC.from_pretrained("esmc_300m").to("cuda")  # or "cpu"
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein_tensor = esmc.encode(protein)
output = esmc.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
protein_embeddings = output.embeddings  # per-residue ESM-C 300M embeddings

If Hugging Face requests authentication for ESM-C, users should authenticate with their own Hugging Face account or token and accept any required EvolutionaryScale terms. The BERTose and AFFINose repositories do not require a token once they are public.

Output

AFFINose returns one scalar protein-glycan interaction score per pair, produced by the trained AFFINose head.

Scope

This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.

License

This repository is released under the Apache License 2.0. See LICENSE.
