	---
	library_name: pytorch
	license: other
	tags:
	- glycans
	- wurcs
	- bertose
	- embeddings
	- pytorch
	---

	# BERTose Glycan Encoder

	This repository contains the BERTose checkpoint for WURCS glycan embedding inference. It is the release-facing glycan representation model used by the companion notebook.

	## Quick Start

	The recommended user path is the companion notebook. The checkpoint and vocabulary can be downloaded from this repository with `huggingface_hub`:

	```python
	from huggingface_hub import hf_hub_download

	# Download the encoder checkpoint; returns the local cached file path.
	checkpoint = hf_hub_download(
	    repo_id="supanthadey1/bertose-glycan-encoder",
	    filename="checkpoints/bertose_glycan_encoder.pt",
	)
	# Download the WURCS BPE vocabulary used by the tokenizer.
	vocab = hf_hub_download(
	    repo_id="supanthadey1/bertose-glycan-encoder",
	    filename="vocab/bpe_vocabulary.json",
	)
	```

	The repository is public, so no Hugging Face token is required to download these files.

	## Files

	- `checkpoints/bertose_glycan_encoder.pt` - BERTose glycan encoder checkpoint.
	- `vocab/bpe_vocabulary.json` - WURCS BPE vocabulary.
	- `src/bertose_model.py` - BERTose model definition.
	- `src/bertose_layers.py` - Transformer layers used by BERTose.
	- `src/wurcs_bpe_tokenizer.py` - WURCS BPE tokenizer.
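
As a quick sanity check on a local copy of this repository, the file list above can be verified with a few lines of standard-library Python. This is a minimal sketch: `missing_files` and the `root` argument are hypothetical names introduced here, and the paths are copied from the list above.

```python
from pathlib import Path

# Relative paths taken from the "Files" list in this README.
EXPECTED_FILES = [
    "checkpoints/bertose_glycan_encoder.pt",
    "vocab/bpe_vocabulary.json",
    "src/bertose_model.py",
    "src/bertose_layers.py",
    "src/wurcs_bpe_tokenizer.py",
]


def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the current directory is a local copy of the repository.
    absent = missing_files(".")
    print("all files present" if not absent else f"missing: {absent}")
```

An empty list means every file named above is present under `root`.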

	## Input

	Provide a single WURCS glycan string, or a CSV batch file with the header `sample_id,wurcs`.
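
WURCS strings contain commas, so a standard CSV parser only reads the `wurcs` column correctly when that field is quoted. The sketch below shows one way to read such a batch with Python's `csv` module. `read_wurcs_batch` is a hypothetical helper, not part of this repository, and the single-residue WURCS string is only an illustration. How the companion notebook parses its CSV input is not documented here.

```python
import csv
import io


def read_wurcs_batch(handle):
    """Read (sample_id, wurcs) pairs from an open CSV file-like object."""
    reader = csv.DictReader(handle)
    required = {"sample_id", "wurcs"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing column(s): {sorted(missing)}")
    # Guard against short rows, where DictReader fills missing values with None.
    return [
        ((row["sample_id"] or "").strip(), (row["wurcs"] or "").strip())
        for row in reader
    ]


# Illustrative batch: the WURCS field is quoted because it contains commas.
example = io.StringIO(
    "sample_id,wurcs\n"
    'sample_1,"WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/"\n'
)
batch = read_wurcs_batch(example)
print(batch)
```

For a file on disk, pass `open(path, newline="")` as `handle`, as the `csv` module documentation recommends.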

	Free-text glycan names, common names, SNFG drawings, and IUPAC-condensed strings are not parsed directly by this checkpoint. Convert those inputs to WURCS first, then run BERTose embedding inference.
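
Because other notations are not parsed, it can help to reject obviously non-WURCS inputs before running inference. The sketch below checks only for the `WURCS=<version>/` prefix that WURCS strings begin with. It is not a full syntax validation, `looks_like_wurcs` is a hypothetical helper, and which WURCS versions this checkpoint accepts is not stated in this README.

```python
import re

# WURCS strings begin with "WURCS=<major>.<minor>/", for example "WURCS=2.0/".
WURCS_PREFIX = re.compile(r"^WURCS=\d+\.\d+/")


def looks_like_wurcs(text):
    """Return True if text starts with a WURCS version prefix (prefix check only)."""
    return bool(WURCS_PREFIX.match(text.strip()))


for candidate in [
    "WURCS=2.0/1,1,0/[a2122h-1b_1-5]/1/",  # illustrative WURCS string
    "Gal(b1-4)GlcNAc",                      # IUPAC-condensed, not accepted
    "lactose",                              # common name, not accepted
]:
    print(candidate, looks_like_wurcs(candidate))
```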

	## Output

	The output is a dense embedding for each input glycan. The companion notebook defaults to `[CLS]` pooling and also supports mean pooling over valid glycan tokens.
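
The two pooling strategies can be illustrated with NumPy on a toy array of token-level hidden states. This is a sketch under assumptions, not the notebook's implementation. It assumes the `[CLS]` token sits at position 0 and that a 0/1 mask marks the valid tokens. Whether the notebook's mean pooling includes special tokens is not stated here. `cls_pool` and `mean_pool` are hypothetical names.

```python
import numpy as np


def cls_pool(hidden, cls_index=0):
    """Take the hidden state at the [CLS] position: (batch, seq, dim) -> (batch, dim)."""
    return hidden[:, cls_index, :]


def mean_pool(hidden, mask):
    """Average hidden states over positions where mask == 1, ignoring padding."""
    weights = mask.astype(hidden.dtype)[:, :, None]   # (batch, seq, 1)
    summed = (hidden * weights).sum(axis=1)            # (batch, dim)
    counts = np.clip(weights.sum(axis=1), 1.0, None)   # avoid division by zero
    return summed / counts


# Toy input: 2 sequences, 3 token positions, embedding size 2.
hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 0],    # last position is padding
                 [1, 1, 1]])

print(cls_pool(hidden))         # first-token vectors
print(mean_pool(hidden, mask))  # masked averages
```

The same indexing and masked-average logic carries over to PyTorch tensors with the corresponding tensor operations.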

	## Notes

	This repository does not perform IUPAC-condensed/name-to-WURCS conversion. For now, provide WURCS directly.

	License metadata is currently `other`; update it when the final release license and citation text are chosen.