language: en
tags:
- pytorch
- causal-lm
- gpt2
- microbiome
- taxonomy
pipeline_tag: text-generation
license: apache-2.0
Waypoint-6m
Waypoint-6m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.
Model summary
See our preprint for details and our GitHub repo for pretraining, finetuning and benchmarking scripts.
Join our Slack community for support and discussion about microbiome foundation models.
Waypoint-6m is a causal language model trained on newline-separated taxonomic strings. Each line is treated as a token sequence derived from a vocabulary of taxonomic labels; see the Vocab and Remote code rows in the table below.
| Item | Details |
|---|---|
| Architecture | GPT-2 (model_type: gpt2 in config.json) |
| Vocab | Taxonomic tokenizer (vocab.json); size shown in the Hub file list / tokenizer_config.json |
| Remote code | Required — this repo includes tokenization_taxonomic.py for TaxonomicTokenizer |
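The Vocab row points to vocab.json for the vocabulary size. A minimal sketch for counting its entries locally is below. It assumes vocab.json is a flat JSON object mapping each token to an integer id, as in standard GPT-2 tokenizers; that layout is not confirmed for this repo. The helper name `vocab_size_from_file` is hypothetical.

```python
import json
from pathlib import Path


def vocab_size_from_file(vocab_path):
    """Return the number of entries in a vocab.json-style file.

    Assumption: the file is a flat JSON object mapping token -> integer id.
    """
    with Path(vocab_path).open(encoding="utf-8") as handle:
        vocab = json.load(handle)
    if not isinstance(vocab, dict):
        raise ValueError("expected a JSON object mapping tokens to ids")
    return len(vocab)
```

After downloading the repository files, `vocab_size_from_file("vocab.json")` gives the entry count. Special tokens defined elsewhere (for example in tokenizer_config.json) may not be included in that number.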
Intended use
- Research and prototyping for taxonomic sequence modeling (e.g. pretraining representations, generation experiments).
- Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.
Usage
This repository is gated. To use it you'll need to:
- Request access — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
- Authenticate — log in to Hugging Face from your environment so the download tooling can use your token:
huggingface-cli login
Or set the token directly:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
You can create a token at https://huggingface.co/settings/tokens.
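If you prefer to authenticate from Python instead of the CLI, the sketch below reads the HF_TOKEN environment variable and passes it to `huggingface_hub.login`. The helper names `get_hf_token` and `login_from_env` are hypothetical, and the code assumes the huggingface_hub package is installed.

```python
import os


def get_hf_token(env=None):
    """Return the token stored under HF_TOKEN, or None if unset or blank."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None


def login_from_env():
    """Log in with the HF_TOKEN value (requires the huggingface_hub package)."""
    # Imported lazily so get_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    token = get_hf_token()
    if token is None:
        raise RuntimeError(
            "HF_TOKEN is not set; export it or run `huggingface-cli login`."
        )
    login(token=token)
```

Call `login_from_env()` once at the start of a script or notebook, before any download from the gated repository.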
Once both steps are done, you can load the model as usual. This checkpoint uses custom tokenizer code, so you must pass trust_remote_code=True when loading the tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "outpost-bio/Waypoint-6m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)
# Example: newline-separated taxonomic lines (format must match training)
text = (
"k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
"k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
"k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
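Input text like the example above can be assembled from structured lineages. The sketch below mirrors only the format visible in that example (rank prefixes `k__` through `g__`, `"; "` between ranks, one lineage per line). Whether this matches the training format exactly is an assumption, and the names `RANK_PREFIXES`, `format_lineage`, and `build_input_text` are hypothetical.

```python
# Rank prefixes as they appear in the example: kingdom through genus.
RANK_PREFIXES = ("k", "p", "c", "o", "f", "g")


def format_lineage(names):
    """Join rank names into 'k__...; p__...; ...' form, highest rank first."""
    if len(names) > len(RANK_PREFIXES):
        raise ValueError("more names than known rank prefixes")
    return "; ".join(
        f"{prefix}__{name}" for prefix, name in zip(RANK_PREFIXES, names)
    )


def build_input_text(lineages):
    """Return newline-terminated lines, one formatted lineage per line."""
    return "".join(format_lineage(names) + "\n" for names in lineages)
```

The result of `build_input_text(...)` can be passed to the tokenizer in place of the hand-written `text` string.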
Training
- Data: Atlas dataset
- Objective: causal LM on taxonomic sequences.
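For reference, a causal LM objective is usually written as the next-token negative log-likelihood. The form below is the standard textbook one, with $x_1, \dots, x_T$ denoting the tokens of one taxonomic sequence and $\theta$ the model parameters. The exact reduction, batching, or weighting used for this checkpoint is not stated here.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```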
License
apache-2.0
Citation
Learning the Language of the Microbiome with Transformers
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381
Maintainer / contact: waypoint@outpost.bio