BERTose Glycan Encoder

This repository contains the BERTose checkpoint for WURCS glycan embedding inference. It is the release-facing glycan representation model used by the companion notebook.

Quick Start

The recommended user path is the companion notebook. To fetch the checkpoint and vocabulary directly:

from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="supanthadey1/bertose-glycan-encoder",
    filename="checkpoints/bertose_glycan_encoder.pt",
)
vocab = hf_hub_download(
    repo_id="supanthadey1/bertose-glycan-encoder",
    filename="vocab/bpe_vocabulary.json",
)

No Hugging Face token is required for this BERTose checkpoint now that the repository is public.
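Once downloaded, the weights can be loaded with PyTorch. A minimal sketch, assuming the checkpoint is a plain state dict (a stand-in state dict is saved first so the snippet runs on its own; with the real file, use the path returned by hf_hub_download and instantiate the encoder from src/bertose_model.py):

```python
import torch

# Stand-in for the downloaded file; in practice, use the path
# returned by hf_hub_download above.
checkpoint = "bertose_glycan_encoder.pt"
torch.save({"embedding.weight": torch.zeros(4, 8)}, checkpoint)

# Load weights on CPU; pass the result to the model defined in
# src/bertose_model.py via model.load_state_dict(...).
state_dict = torch.load(checkpoint, map_location="cpu")
print(sorted(state_dict.keys()))
```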

Files

  • checkpoints/bertose_glycan_encoder.pt - BERTose glycan encoder checkpoint.
  • vocab/bpe_vocabulary.json - WURCS BPE vocabulary.
  • src/bertose_model.py - BERTose model definition.
  • src/bertose_layers.py - Transformer layers used by BERTose.
  • src/wurcs_bpe_tokenizer.py - WURCS BPE tokenizer.

Input

Provide either a single WURCS glycan string or a CSV batch with the columns sample_id,wurcs.

Free-text glycan names, common names, SNFG drawings, and IUPAC-condensed strings are not parsed directly by this checkpoint. Convert those inputs to WURCS first, then run BERTose embedding inference.
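A minimal sketch of building a CSV batch in the expected sample_id,wurcs layout. Both the sample ID and the WURCS string below are illustrative; substitute real glycan encodings (e.g. from GlyTouCan or a converter of your choice):

```python
import csv
import io

# Illustrative rows: (sample_id, WURCS string).
rows = [
    ("G00001", "WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"),
]

# Write the batch with the expected header.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["sample_id", "wurcs"])
writer.writerows(rows)

csv_text = buf.getvalue()
print(csv_text)
```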

Output

Dense glycan embeddings. The companion notebook defaults to [CLS] pooling and also supports mean pooling over valid glycan tokens.
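The two pooling modes can be sketched on toy encoder outputs. Assumptions: hidden has shape (seq_len, dim), position 0 holds the [CLS] token, and mask flags valid glycan tokens (1) versus padding (0); the actual tensor shapes and masking convention come from the companion notebook:

```python
import numpy as np

# Hypothetical encoder output for one glycan: 4 positions, dim 2.
hidden = np.array([
    [1.0, 2.0],   # [CLS]
    [3.0, 4.0],   # valid glycan token
    [5.0, 6.0],   # valid glycan token
    [0.0, 0.0],   # padding
])
mask = np.array([1, 1, 1, 0])  # 1 = valid position, 0 = padding

# [CLS] pooling: take the first position's vector.
cls_embedding = hidden[0]

# Mean pooling: average over valid (non-padding) positions.
mean_embedding = (hidden * mask[:, None]).sum(axis=0) / mask.sum()

print(cls_embedding)   # [1. 2.]
print(mean_embedding)  # [3. 4.]
```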

Notes

This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.

License metadata is currently set to "other"; update it once the final release license and citation text are chosen.
