---
language: en
tags:
- pytorch
- causal-lm
- gpt2
- microbiome
- taxonomy
pipeline_tag: text-generation
license: apache-2.0
---
# Waypoint-45m
Waypoint-45m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.
## Model summary
See [our preprint](https://www.biorxiv.org/content/10.64898/2026.05.02.722381v1) for details.
Join [our Slack community](https://join.slack.com/t/outpostbio-waypoint/shared_invite/zt-3w6ivgtba-WJOCkdxiISxQpwVq9ZZxTA) for support and discussion about microbiome foundation models.
Causal language model trained on **newline-separated taxonomic strings**. Each line is treated as a token sequence derived from a vocabulary of taxonomic labels; see **Tokenizer** below.
| Item | Details |
|------|---------|
| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
| Remote code | **Required** — this repo includes `tokenization_taxonomic.py` for `TaxonomicTokenizer` |
## Intended use
- Research and prototyping for **taxonomic sequence modeling** (e.g. pretraining representations, generation experiments).
- **Not** a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.
## Usage
This repository is **gated**. To use it you'll need to:
1. **Request access** — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
2. **Authenticate** — log in to Hugging Face from your environment so the download tooling can use your token:
```bash
huggingface-cli login
```
Or set the token directly:
```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```
You can create a token at https://huggingface.co/settings/tokens.
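If the token has to be passed explicitly, for example in a script or CI job where an interactive login is not available, something like the following could work. This is a minimal sketch: `resolve_hf_token` and `load_waypoint` are hypothetical helper names that are not part of this repository, and the `token=` argument assumes a recent `transformers` release.

```python
import os


def resolve_hf_token(explicit_token=None):
    """Return an explicit token if given, else the HF_TOKEN environment variable, else None."""
    token = explicit_token or os.environ.get("HF_TOKEN")
    # Treat empty or whitespace-only values as "no token configured".
    if token is None or not token.strip():
        return None
    return token.strip()


def load_waypoint(model_id, token=None):
    """Hypothetical helper: load tokenizer and model, passing the token explicitly."""
    # Imported lazily so the token helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, token=token
    )
    model = AutoModelForCausalLM.from_pretrained(model_id, token=token)
    return tokenizer, model
```

Calling `load_waypoint("outpost-bio/Waypoint-45m", token=resolve_hf_token())` would then use whichever token is configured. If `resolve_hf_token()` returns `None`, the libraries fall back to any credentials stored by `huggingface-cli login`.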
Once both steps are done, you can load the model as usual.
This checkpoint uses **custom tokenizer code**, so you must pass **`trust_remote_code=True`** when loading the tokenizer:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-45m"

# The custom TaxonomicTokenizer ships with this repo, so remote code must be trusted.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)

inputs = tokenizer(text, return_tensors="pt")
# Greedy decoding (do_sample=False) of up to 32 new tokens.
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
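When building prompts from structured data, it may help to generate the text programmatically rather than by hand. The sketch below is inferred only from the example above: the `k__` to `g__` rank prefixes, the `"; "` separator, and the trailing newline per line are assumptions about the format, and the helper names (`format_lineage`, `build_prompt`, `parse_lines`) are hypothetical. Check the Atlas dataset for the exact training format before relying on it.

```python
# Rank prefixes as they appear in the example above (assumed, not taken from the tokenizer).
RANK_PREFIXES = ("k__", "p__", "c__", "o__", "f__", "g__")


def format_lineage(names):
    """Join rank names into one line, e.g. 'k__Bacteria; p__Firmicutes'."""
    if len(names) > len(RANK_PREFIXES):
        raise ValueError("more names than known rank prefixes")
    return "; ".join(prefix + name for prefix, name in zip(RANK_PREFIXES, names))


def build_prompt(lineages):
    """Emit one lineage per line, each terminated by a newline."""
    return "".join(format_lineage(names) + "\n" for names in lineages)


def parse_lines(text):
    """Split text back into lists of rank labels, skipping blank lines."""
    return [
        [label.strip() for label in line.split(";") if label.strip()]
        for line in text.splitlines()
        if line.strip()
    ]
```

`build_prompt` output can be passed to the tokenizer in place of `text`, and `parse_lines` can be applied to the decoded generation, assuming the decoded text uses the same separators as the input.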
## Training
- **Data:** [Atlas dataset](https://huggingface.co/datasets/outpost-bio/Atlas)
- **Objective:** causal LM on taxonomic sequences.
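For readers unfamiliar with the term, a causal LM objective is conventionally the mean negative log-likelihood of each token given the tokens before it. The snippet below is a generic illustration of that standard formulation in plain Python, not the training code for this model; the function names are hypothetical, and details such as masking or weighting are described in the preprint and may differ.

```python
import math


def causal_lm_loss(next_token_probs):
    """Mean negative log-likelihood of the probabilities assigned to each true next token."""
    if not next_token_probs:
        raise ValueError("need at least one prediction")
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)


def perplexity(next_token_probs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(causal_lm_loss(next_token_probs))
```

A model that assigns probability 0.5 to every correct next token has a perplexity of about 2 under this definition. Lower values mean the model finds the sequence less surprising.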
## License
Apache-2.0
## Citation
Learning the Language of the Microbiome with Transformers\
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang\
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381
## Model card contact
**Maintainer / contact:** neythen@outpost.bio