---
library_name: pytorch
license: other
tags:
- glycans
- wurcs
- bertose
- embeddings
- pytorch
---
# BERTose Glycan Encoder
This repository contains the BERTose checkpoint for computing glycan embeddings from WURCS strings. It is the release-facing glycan representation model used by the companion notebook.
## Quick Start
The recommended user path is the companion notebook. To fetch the checkpoint and vocabulary yourself, download them with `huggingface_hub`:
```python
from huggingface_hub import hf_hub_download

# Download the encoder checkpoint; the call returns the local cache path.
checkpoint = hf_hub_download(
    repo_id="supanthadey1/bertose-glycan-encoder",
    filename="checkpoints/bertose_glycan_encoder.pt",
)
# Download the WURCS BPE vocabulary used by the tokenizer.
vocab = hf_hub_download(
    repo_id="supanthadey1/bertose-glycan-encoder",
    filename="vocab/bpe_vocabulary.json",
)
```
The repository is public, so no Hugging Face token is required to download these files.
## Files
- `checkpoints/bertose_glycan_encoder.pt` - BERTose glycan encoder checkpoint.
- `vocab/bpe_vocabulary.json` - WURCS BPE vocabulary.
- `src/bertose_model.py` - BERTose model definition.
- `src/bertose_layers.py` - Transformer layers used by BERTose.
- `src/wurcs_bpe_tokenizer.py` - WURCS BPE tokenizer.
## Input
Provide either a single WURCS glycan string or a CSV batch file with the columns `sample_id` and `wurcs`.
Free-text glycan names, common names, SNFG drawings, and IUPAC-condensed strings are not parsed directly by this checkpoint. Convert those inputs to WURCS first, then run BERTose embedding inference.
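Because WURCS strings contain commas, the `wurcs` field in a CSV batch has to be quoted. The following is a minimal sketch of reading and sanity-checking such a batch with the standard library. The helper name `read_wurcs_batch` is hypothetical and not part of this repository, the prefix check assumes WURCS 2.0 input, and the glucose string is only an illustrative value. The check does not validate WURCS syntax; it only catches obviously non-WURCS rows and unquoted fields.

```python
import csv
import io

WURCS_PREFIX = "WURCS=2.0/"  # assumption: inputs are WURCS 2.0 strings


def read_wurcs_batch(handle):
    """Return (sample_id, wurcs) pairs from a CSV with a sample_id,wurcs header.

    Hypothetical helper for illustration; not part of this repository.
    """
    reader = csv.DictReader(handle)
    missing = {"sample_id", "wurcs"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing CSV columns: {sorted(missing)}")

    rows = []
    for row in reader:
        # Extra fields land under the None key; for WURCS this usually
        # means the string was not quoted and was split at its commas.
        if row.get(None):
            raise ValueError(
                f"line {reader.line_num}: extra fields found; quote the wurcs value"
            )
        wurcs = (row["wurcs"] or "").strip()
        if not wurcs.startswith(WURCS_PREFIX):
            raise ValueError(
                f"line {reader.line_num}: not a WURCS 2.0 string: {wurcs!r}"
            )
        rows.append((row["sample_id"].strip(), wurcs))
    return rows


# Illustrative one-row batch; note the quotes around the WURCS value.
example_csv = io.StringIO(
    "sample_id,wurcs\n"
    'glc,"WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"\n'
)
batch = read_wurcs_batch(example_csv)
```

Writing the batch with `csv.writer` adds the required quoting automatically, which avoids hand-editing mistakes.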
## Output
Dense glycan embeddings. The companion notebook defaults to `[CLS]` pooling and also supports mean pooling over valid glycan tokens.
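The two pooling modes can be sketched in PyTorch as follows. This is an illustration under stated assumptions rather than the notebook's actual code: the function names are hypothetical, `[CLS]` is assumed to sit at position 0, and the mask is assumed to mark valid tokens with 1 and padding with 0. Whether special tokens count as valid in the notebook is not specified here.

```python
import torch


def cls_pool(hidden: torch.Tensor) -> torch.Tensor:
    """Take the first token's vector; assumes [CLS] is at position 0.

    hidden: (batch, seq_len, dim) token embeddings from the encoder.
    """
    return hidden[:, 0]


def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over positions where mask == 1.

    mask: (batch, seq_len), 1 for valid tokens and 0 for padding (assumed).
    """
    weights = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * weights).sum(dim=1)          # padded positions add 0
    counts = weights.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts


# Toy input: one sequence, three positions, the last one is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

cls_vec = cls_pool(hidden)          # tensor([[1., 2.]])
mean_vec = mean_pool(hidden, mask)  # tensor([[2., 3.]])
```

Masking before averaging matters for batched inference: without it, padding vectors would shift the mean for shorter glycans.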
## Notes
This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.
License metadata is currently `other`; update it when the final release license and citation text are chosen.