---
language: en
tags:
- pytorch
- causal-lm
- gpt2
- microbiome
- taxonomy
pipeline_tag: text-generation
license: apache-2.0
---

# Waypoint-45m
Waypoint-45m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.

## Model summary
See [our preprint](https://www.biorxiv.org/content/10.64898/2026.05.02.722381v1) for details.

Join [our Slack community](https://join.slack.com/t/outpostbio-waypoint/shared_invite/zt-3w6ivgtba-WJOCkdxiISxQpwVq9ZZxTA) for support and discussion about microbiome foundation models.

Waypoint-45m is a causal language model trained on **newline-separated taxonomic strings**. Each line is treated as a token sequence derived from a vocabulary of taxonomic labels; see the table below and the **Usage** section for tokenizer details.
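
The example input in the **Usage** section uses semicolon-separated, rank-prefixed labels such as `g__Escherichia`. As a rough illustration of that input format only, the sketch below splits newline-separated strings into `(rank, name)` pairs in plain Python. The helper names and the prefix-to-rank mapping are assumptions made for this example; this is not the repository's `TaxonomicTokenizer`, and it says nothing about how labels are mapped to token IDs.

```python
# Hypothetical helpers for illustration only; NOT the repo's TaxonomicTokenizer.
# Assumed mapping from one-letter rank prefixes to rank names.
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy_line(line):
    """Split one 'k__A; p__B; ...' string into a list of (rank, name) pairs."""
    pairs = []
    for label in line.split(";"):
        label = label.strip()
        if not label:
            continue  # tolerate trailing semicolons
        prefix, sep, name = label.partition("__")
        if not sep:
            raise ValueError(f"label without a rank prefix: {label!r}")
        # Unknown prefixes are kept as-is rather than rejected.
        pairs.append((RANK_PREFIXES.get(prefix, prefix), name))
    return pairs


def parse_taxonomy_text(text):
    """Parse newline-separated taxonomic strings, skipping blank lines."""
    return [parse_taxonomy_line(line) for line in text.splitlines() if line.strip()]


text = (
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
records = parse_taxonomy_text(text)
print(records[0][-1])  # ('genus', 'Blautia')
```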

| Item | Details |
|------|---------|
| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
| Remote code | **Required**: this repo includes `tokenization_taxonomic.py`, which provides `TaxonomicTokenizer` |

## Intended use

- Research and prototyping for **taxonomic sequence modeling** (e.g. pretraining representations, generation experiments).
- **Not** a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.

## Usage
This repository is **gated**. To use it you'll need to:

1. **Request access**: click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
2. **Authenticate**: log in to Hugging Face from your environment so the download tooling can use your token:

```bash
huggingface-cli login
```

Or set the token directly:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```

You can create a token at https://huggingface.co/settings/tokens.

Once both steps are done, you can load the model and tokenizer normally. This checkpoint uses **custom tokenizer code**, so you must pass **`trust_remote_code=True`** when loading the tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-45m"

# The custom TaxonomicTokenizer ships with the repo, so remote code must be trusted.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
inputs = tokenizer(text, return_tensors="pt")
# Greedy decoding of up to 32 new tokens
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
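
For decoder-only models such as GPT-2, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded output above includes the input text. If only the continuation is wanted, one option is to drop the prompt-length prefix before decoding. The sketch below shows the idea on plain Python lists with made-up token IDs; `split_continuation` is a hypothetical helper, not part of this repository or of `transformers`.

```python
def split_continuation(prompt_ids, output_ids):
    """Return only the tokens that follow the prompt in a generated sequence."""
    n = len(prompt_ids)
    if list(output_ids[:n]) != list(prompt_ids):
        raise ValueError("generated sequence does not start with the prompt")
    return list(output_ids[n:])


# Made-up token IDs standing in for inputs["input_ids"][0] and outputs[0].
prompt_ids = [5, 9, 2]
output_ids = [5, 9, 2, 7, 8]
continuation = split_continuation(prompt_ids, output_ids)
print(continuation)  # [7, 8]
```

With the tensors from the snippet above, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, which can then be passed to `tokenizer.decode`.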

## Training
- **Data:** [Atlas dataset](https://huggingface.co/datasets/outpost-bio/Atlas)
- **Objective:** causal language modeling (next-token prediction) on taxonomic sequences.
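
This card does not describe the training setup beyond the objective, so the following is only a generic sketch of what a causal LM objective computes, not the authors' training code. Each token is predicted from the tokens before it, and the loss is the mean negative log-likelihood of the true next token. The token IDs, the 4-token vocabulary, and the uniform predictions are all made up for the example.

```python
import math


def causal_lm_pairs(token_ids):
    """Pair each token with its successor: inputs are 0..n-2, targets are 1..n-1."""
    return list(zip(token_ids[:-1], token_ids[1:]))


def next_token_loss(probs, targets):
    """Mean negative log-likelihood of each target under its predicted distribution."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


token_ids = [3, 1, 2, 0]            # made-up IDs for one taxonomic sequence
pairs = causal_lm_pairs(token_ids)  # [(3, 1), (1, 2), (2, 0)]
targets = [target for _, target in pairs]

# Toy model output: a uniform distribution over a 4-token vocabulary at every step.
uniform = [[0.25] * 4 for _ in targets]
loss = next_token_loss(uniform, targets)
print(pairs, loss)
```

A model that predicts uniformly over a vocabulary of size V scores a loss of log(V), which gives a baseline for reading loss values.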

## License
Apache-2.0

## Citation
Learning the Language of the Microbiome with Transformers\
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang\
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381

## Model card contact

**Maintainer / contact:** neythen@outpost.bio