BERTose and AFFINose Inference Notebook
This repository contains the cloud notebook for the public inference workflows:
- BERTose glycan embeddings.
- BERTose IAR token-level ambiguity resolution.
- AFFINose protein-glycan interaction scoring.
The notebook supports both single-example inputs and batch CSV uploads.
Quick Start
- Open `notebooks/bertose_affinose_cloud_inference.ipynb` in Google Colab, Jupyter, or a Google Cloud notebook.
- Run the setup cell. It installs `torch`, `huggingface_hub`, `pandas`, `h5py`, `tqdm`, and `esm`.
- Run the BERTose embedding cells for WURCS-to-embedding output.
- Run the BERTose IAR cells for token-level ambiguity-resolution summaries.
- Run the AFFINose cells for protein-glycan scoring.
BERTose and AFFINose are public Hugging Face repositories, so no access token is needed to download these checkpoints. The notebook fetches the required weights, vocabularies, and source files with `huggingface_hub`.
Input Contract
The current release expects glycans as WURCS strings.
- BERTose embeddings: one WURCS string, or a CSV with `sample_id,wurcs`.
- BERTose IAR: one WURCS string containing uncertainty markers, or a CSV with `sample_id,wurcs`.
- AFFINose: one WURCS glycan plus a protein sequence, protein embedding, or protein ID linked to uploaded ESM-C embeddings. Batch CSVs use `sample_id,protein_id,protein_sequence,glycan_wurcs`.
Free-text glycan names, common names, SNFG drawings, and IUPAC-condensed strings are not parsed directly by this notebook. Convert those inputs to WURCS first, then run BERTose/AFFINose. If the input glycan is ambiguous, that ambiguity should be represented in the WURCS string with WURCS-style uncertainty markers before using BERTose IAR.
ESM-C Protein Embeddings for AFFINose
AFFINose uses ESM-C 300M per-residue protein embeddings as input. ESM-C is separate from BERTose/AFFINose and is not redistributed in these repositories.
Users who want to run protein-glycan prediction from raw protein sequences should install esm in the notebook runtime. The notebook loads ESM-C 300M with:
```python
from esm.models.esmc import ESMC

esmc = ESMC.from_pretrained("esmc_300m").to("cuda")
```
The required AFFINose protein input is per-residue embeddings with shape [L, 960]. Do not mean-pool the protein first. If Hugging Face requires authentication for the EvolutionaryScale ESM-C model, users should log in with their own Hugging Face account/token and accept any required terms.
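The per-residue shape requirement can be enforced when preparing the embeddings HDF5 file for upload. A minimal sketch, assuming one dataset per protein ID in the HDF5 file (that layout, the `P12345` ID, and the random array standing in for real ESM-C output are all illustrative assumptions):

```python
import h5py
import numpy as np

def save_protein_embedding(h5_path: str, protein_id: str, emb: np.ndarray) -> None:
    """Store one per-residue embedding under its protein ID (assumed layout),
    rejecting anything that is not [L, 960]."""
    if emb.ndim != 2 or emb.shape[1] != 960:
        raise ValueError(f"expected per-residue [L, 960], got {emb.shape}")
    with h5py.File(h5_path, "a") as f:
        f.create_dataset(protein_id, data=emb.astype(np.float32))

# Stand-in for a real ESM-C per-residue embedding of a 120-residue protein
dummy = np.random.randn(120, 960).astype(np.float32)
save_protein_embedding("esmc_embeddings.h5", "P12345", dummy)
```

The check rejects mean-pooled `[960]` vectors up front, which is the most common mistake when preparing AFFINose inputs.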
Files
- `notebooks/bertose_affinose_cloud_inference.ipynb` - main cloud inference notebook.
- `scripts/run_release_smoke_test.py` - headless validation script for cluster/cloud smoke tests.
- `examples/glycan_embedding_batch.csv` - batch schema for glycan embeddings.
- `examples/glycan_resolution_batch.csv` - batch schema for ambiguity resolution.
- `examples/affinose_batch.csv` - batch schema for protein-glycan scoring.
Headless Validation
For a Nova/Google Cloud style smoke test outside Jupyter:
```shell
python scripts/run_release_smoke_test.py \
    --output-dir release_smoke_outputs \
    --protein-emb-h5 /path/to/esmc_embeddings.h5 \
    --affinose-csv /path/to/combined_binding_data.csv
```
The script downloads the public Hugging Face repositories without a token, runs BERTose single and 20-glycan batch embedding, BERTose single and 20-glycan IAR, and AFFINose single plus five-pair batch scoring. It writes the output CSVs and a `smoke_summary.json` with timing information.
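After a run, the timing summary can be inspected directly. A small sketch, assuming `smoke_summary.json` is a flat JSON object (the exact key names are defined by the script and not shown here):

```python
import json

def print_smoke_summary(path: str = "release_smoke_outputs/smoke_summary.json") -> dict:
    """Load the smoke-test summary and print each recorded entry."""
    with open(path) as f:
        summary = json.load(f)
    for key, value in summary.items():
        print(f"{key}: {value}")
    return summary
```

This is handy in CI logs, where the JSON file itself may not be surfaced.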
Scope
The notebook expects WURCS glycans. IUPAC-condensed/name-to-WURCS conversion is intentionally outside this release and can be added as a separate preprocessing step.
License metadata is currently set to `other`; update it once the final release license and citation text are chosen.
References
- EvolutionaryScale ESM package: https://github.com/evolutionaryscale/esm
- ESM-C 300M Hugging Face model: https://huggingface.co/EvolutionaryScale/esmc-300m-2024-12