---
library_name: pytorch
license: other
tags:
- glycans
- proteins
- protein-glycan
- affinose
- bertose
- esm-c
- pytorch
---

# AFFINose Interaction Model

This repository contains the AFFINose checkpoint for protein-glycan interaction inference. AFFINose combines BERTose glycan token representations with per-residue ESM-C protein embeddings and returns a scalar interaction score.

## Quick Start

The companion notebook is the recommended way to run AFFINose. For direct Python use, download the checkpoint and vocabulary with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="checkpoints/affinose_interaction_model.pt",
)
vocab = hf_hub_download(
    repo_id="supanthadey1/affinose-interaction-model",
    filename="vocab/bpe_vocabulary.json",
)
```

This repository is public, so no Hugging Face token is needed to download the AFFINose checkpoint. ESM-C is distributed separately and may require your own Hugging Face login, depending on EvolutionaryScale's access requirements.

## Files

- `checkpoints/affinose_interaction_model.pt` - AFFINose interaction checkpoint.
- `vocab/bpe_vocabulary.json` - WURCS BPE vocabulary for glycan tokenization.
- `src/affinose_model.py` - AFFINose architecture.
- `src/affinose_inference.py` - standalone inference helper.
- `src/affinose_dataset.py` - tokenizer and data utility helpers.
- `src/bertose_model.py` - BERTose model definition used for glycan encoding.
- `src/bertose_layers.py` - Transformer layers used by BERTose.
- `src/wurcs_bpe_tokenizer.py` - WURCS BPE tokenizer.

## Input

Provide one protein-glycan pair or a CSV batch. Glycans should be WURCS strings. Proteins can be provided as IDs linked to precomputed embeddings, or through the companion notebook as raw sequences that are embedded with ESM-C 300M.

Batch CSVs use the columns `sample_id,protein_id,protein_sequence,glycan_wurcs`. AFFINose does not parse free-text glycan names, common names, SNFG drawings, or IUPAC-condensed strings. Convert those inputs to WURCS first, then score the protein-glycan pair.
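A batch file can be sanity-checked before scoring. The sketch below is not part of this repository: `check_batch_csv` is a hypothetical helper, and the `WURCS=` prefix test is only a rough heuristic, not a WURCS parser. The example glycan string is illustrative. Because WURCS strings contain commas, the `glycan_wurcs` field needs to be quoted in the CSV.

```python
import csv
import io

# Column names taken from the batch CSV description above.
REQUIRED_COLUMNS = ["sample_id", "protein_id", "protein_sequence", "glycan_wurcs"]


def check_batch_csv(handle):
    """Return a list of (row_number, message) problems found in a batch CSV.

    Hypothetical helper: it only checks the header and that each glycan
    field starts with "WURCS=". It does not validate WURCS syntax.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]

    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        glycan = (row["glycan_wurcs"] or "").strip()
        if not glycan.startswith("WURCS="):
            problems.append(
                (row_number, "glycan_wurcs does not look like a WURCS string")
            )
    return problems


# Illustrative data: row 2 holds a quoted WURCS string, row 3 holds an
# IUPAC-style name that would need conversion to WURCS first.
example = io.StringIO(
    'sample_id,protein_id,protein_sequence,glycan_wurcs\n'
    's1,p1,MKTAYIAKQR,"WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"\n'
    's2,p1,MKTAYIAKQR,Glc(b1-4)Glc\n'
)
problems = check_batch_csv(example)
print(problems)
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to the same function.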

## Protein Embedding Requirement

AFFINose expects per-residue ESM-C 300M embeddings with shape `[L, 960]`. Do not mean-pool the protein before passing it into AFFINose.
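The shape requirement can be checked before an embedding is passed to the model. This is a minimal sketch with a hypothetical helper, `as_per_residue_shape`, that works on any shape tuple (for example `tensor.shape` from PyTorch or NumPy). Dropping a leading batch dimension of size 1 is an assumption made here for convenience; this README does not say whether AFFINose accepts batched input or how special tokens should be handled.

```python
EXPECTED_DIM = 960  # ESM-C 300M embedding width stated above


def as_per_residue_shape(shape):
    """Validate an embedding shape and return it as (L, 960).

    Hypothetical helper. A leading batch dimension of size 1 is dropped
    (an assumption); a mean-pooled vector of shape (960,) is rejected.
    """
    dims = tuple(int(d) for d in shape)
    if len(dims) == 3 and dims[0] == 1:
        dims = dims[1:]  # assumption: treat [1, L, 960] as a single protein
    if len(dims) != 2:
        raise ValueError(
            f"expected per-residue shape [L, {EXPECTED_DIM}], got {list(dims)}; "
            "was the embedding mean-pooled?"
        )
    length, width = dims
    if width != EXPECTED_DIM:
        raise ValueError(f"expected embedding width {EXPECTED_DIM}, got {width}")
    if length < 1:
        raise ValueError("embedding has no residues")
    return dims


print(as_per_residue_shape((1, 35, 960)))
print(as_per_residue_shape((35, 960)))
```

Called on `(960,)`, the helper raises `ValueError`, which catches an accidentally mean-pooled protein early.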

ESM-C is a separate EvolutionaryScale protein model. The ESM-C weights are not included in this repository. Users should install the `esm` package and let it download ESM-C 300M into their own runtime cache.

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

esmc = ESMC.from_pretrained("esmc_300m").to("cuda")  # or "cpu"
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein_tensor = esmc.encode(protein)
output = esmc.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
protein_embeddings = output.embeddings  # per-residue ESM-C 300M embeddings
```

If Hugging Face requests authentication for ESM-C, authenticate with your own Hugging Face account or token and accept any required EvolutionaryScale terms. The BERTose and AFFINose repositories do not need a token as long as they are public.

## Output

AFFINose returns one scalar protein-glycan interaction score per input pair, produced by the trained AFFINose head.

## Scope

This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.

License metadata is currently `other`; update it when the final release license and citation text are chosen.

## References

- EvolutionaryScale ESM package: https://github.com/evolutionaryscale/esm
- ESM-C 300M Hugging Face model: https://huggingface.co/EvolutionaryScale/esmc-300m-2024-12