BERTose and AFFINose Inference Notebook

This repository contains the cloud notebook for the public inference workflows:

  • BERTose glycan embeddings.
  • BERTose IAR token-level ambiguity resolution.
  • AFFINose protein-glycan interaction scoring.

The notebook supports both single-example inputs and batch CSV uploads.

Quick Start

  1. Open notebooks/bertose_affinose_cloud_inference.ipynb in Google Colab, Jupyter, or a Google Cloud notebook.
  2. Run the setup cell. It installs torch, huggingface_hub, pandas, h5py, tqdm, and esm.
  3. Run the BERTose embedding cells for WURCS-to-embedding output.
  4. Run the BERTose IAR cells for token-level ambiguity-resolution summaries.
  5. Run the AFFINose cells for protein-glycan scoring.

BERTose and AFFINose are public Hugging Face repositories, so no access token is needed to download these checkpoints. The notebook downloads the required weights, vocabularies, and source files with huggingface_hub.
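As a sketch, the same downloads can be reproduced outside the notebook with huggingface_hub. The repository and file names below are placeholders, not the actual repository IDs; take the real names from the notebook's setup cell.

```python
from huggingface_hub import hf_hub_download

def fetch_checkpoint(repo_id, filename):
    """Download one file from a public Hugging Face repo (no token needed)."""
    return hf_hub_download(repo_id=repo_id, filename=filename)

# Example call (placeholders -- substitute the real repo names from the setup cell):
# path = fetch_checkpoint("example-org/bertose", "pytorch_model.bin")
```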

Input Contract

The current release expects glycans as WURCS strings.

  • BERTose embeddings: one WURCS string, or a CSV with sample_id,wurcs.
  • BERTose IAR: one WURCS string containing uncertainty markers, or a CSV with sample_id,wurcs.
  • AFFINose: one WURCS glycan plus a protein sequence, protein embedding, or protein ID linked to uploaded ESM-C embeddings. Batch CSVs use sample_id,protein_id,protein_sequence,glycan_wurcs.

Free-text glycan names, common names, SNFG drawings, and IUPAC-condensed strings are not parsed directly by this notebook. Convert those inputs to WURCS first, then run BERTose/AFFINose. If the input glycan is ambiguous, that ambiguity should be represented in the WURCS string with WURCS-style uncertainty markers before using BERTose IAR.
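A minimal check of the batch CSV contract can catch schema mistakes before upload. The column names come from the schemas above; the validator itself is an illustrative helper, not part of the notebook.

```python
import csv
import io

# Required columns for each batch workflow (from the input contract above).
REQUIRED = {
    "embedding": ["sample_id", "wurcs"],
    "iar": ["sample_id", "wurcs"],
    "affinose": ["sample_id", "protein_id", "protein_sequence", "glycan_wurcs"],
}

def check_batch_csv(text, workflow):
    """Return True if the CSV header contains every required column."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    return all(col in header for col in REQUIRED[workflow])

example = "sample_id,wurcs\ns1,WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/\n"
print(check_batch_csv(example, "embedding"))  # True
```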

ESM-C Protein Embeddings for AFFINose

AFFINose uses ESM-C 300M per-residue protein embeddings as input. ESM-C is separate from BERTose/AFFINose and is not redistributed in these repositories.

Users who want to run protein-glycan prediction from raw protein sequences should install esm in the notebook runtime. The notebook loads ESM-C 300M with:

from esm.models.esmc import ESMC

esmc = ESMC.from_pretrained("esmc_300m").to("cuda")

The required AFFINose protein input is per-residue embeddings with shape [L, 960]. Do not mean-pool the protein first. If Hugging Face requires authentication for the EvolutionaryScale ESM-C model, users should log in with their own Hugging Face account/token and accept any required terms.
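The shape contract can be asserted before handing embeddings to AFFINose. This sketch uses a random array in place of real ESM-C output and assumes only the [L, 960] requirement stated above; if your extraction path includes special tokens (e.g. BOS/EOS positions), strip them first so the row count matches the sequence length.

```python
import numpy as np

EMB_DIM = 960  # ESM-C 300M hidden size expected by AFFINose

def validate_protein_embedding(emb, seq_len):
    """Check a per-residue embedding matrix against the [L, 960] contract."""
    emb = np.asarray(emb)
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D [L, {EMB_DIM}] array, got ndim={emb.ndim}")
    if emb.shape != (seq_len, EMB_DIM):
        raise ValueError(f"expected shape ({seq_len}, {EMB_DIM}), got {emb.shape}")
    return emb

# Stand-in for real ESM-C output: one embedding row per residue, no pooling.
fake = np.random.randn(128, EMB_DIM).astype(np.float32)
validate_protein_embedding(fake, 128)  # passes silently
```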

Files

  • notebooks/bertose_affinose_cloud_inference.ipynb - main cloud inference notebook.
  • scripts/run_release_smoke_test.py - headless validation script for cluster/cloud smoke tests.
  • examples/glycan_embedding_batch.csv - batch schema for glycan embeddings.
  • examples/glycan_resolution_batch.csv - batch schema for ambiguity resolution.
  • examples/affinose_batch.csv - batch schema for protein-glycan scoring.
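For illustration, an AFFINose batch CSV matching the examples/affinose_batch.csv column schema can be written with the standard library. The rows here are made-up placeholders, not real binding pairs.

```python
import csv

# Made-up placeholder rows following the sample_id,protein_id,protein_sequence,glycan_wurcs schema.
rows = [
    {"sample_id": "pair_001", "protein_id": "P_example",
     "protein_sequence": "MKTAYIAKQR",
     "glycan_wurcs": "WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"},
]

with open("affinose_batch_example.csv", "w", newline="") as fh:
    writer = csv.DictWriter(
        fh, fieldnames=["sample_id", "protein_id", "protein_sequence", "glycan_wurcs"]
    )
    writer.writeheader()
    writer.writerows(rows)
```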

Headless Validation

For a Nova/Google Cloud style smoke test outside Jupyter:

python scripts/run_release_smoke_test.py \
  --output-dir release_smoke_outputs \
  --protein-emb-h5 /path/to/esmc_embeddings.h5 \
  --affinose-csv /path/to/combined_binding_data.csv

The script downloads the public Hugging Face repositories without a token, then runs BERTose embedding on a single glycan and a 20-glycan batch, BERTose IAR on a single glycan and a 20-glycan batch, and AFFINose scoring on a single pair and a five-pair batch. It writes output CSVs and a smoke_summary.json file with timing information.
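The --protein-emb-h5 file is assumed here to map protein IDs to [L, 960] per-residue arrays, one dataset per protein. This layout is an assumption for illustration; check scripts/run_release_smoke_test.py for the exact keys it expects.

```python
import h5py
import numpy as np

# Assumed layout: one dataset per protein_id, each of shape [L, 960].
with h5py.File("esmc_embeddings_example.h5", "w") as fh:
    fh.create_dataset("P_example", data=np.random.randn(10, 960).astype(np.float32))

# Reading it back, keyed by protein_id.
with h5py.File("esmc_embeddings_example.h5", "r") as fh:
    emb = fh["P_example"][:]
print(emb.shape)  # (10, 960)
```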

Scope

The notebook expects WURCS glycans. IUPAC-condensed/name-to-WURCS conversion is intentionally outside this release and can be added as a separate preprocessing step.

License metadata is currently set to "other"; update it once the final release license and citation text are chosen.
