	---
	language: en
	tags:
	- pytorch
	- causal-lm
	- gpt2
	- microbiome
	- taxonomy
	pipeline_tag: text-generation
	license: apache-2.0
	---

	# Waypoint-45m
	Waypoint-45m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.

	## Model summary
	See [our preprint](https://www.biorxiv.org/content/10.64898/2026.05.02.722381v1) for details.

	Join [our Slack community](https://join.slack.com/t/outpostbio-waypoint/shared_invite/zt-3w6ivgtba-WJOCkdxiISxQpwVq9ZZxTA) for support and discussion about microbiome foundation models.

	The model is trained on newline-separated taxonomic strings. Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; the tokenizer is described in the table below.

	| Item | Details |
	|------|---------|
	| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
	| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
	| Remote code | Required: this repo includes `tokenization_taxonomic.py` for `TaxonomicTokenizer` |

	## Intended use

	- Research and prototyping for taxonomic sequence modeling (e.g. pretraining representations, generation experiments).
	- Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.

	## Usage
	This repository is gated. To use it you'll need to:

	1. Request access — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
	2. Authenticate — log in to Hugging Face from your environment so the download tooling can use your token:

	```bash
	huggingface-cli login
	```

	Or set the token directly:

	```bash
	export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
	```

	You can create a token at https://huggingface.co/settings/tokens.
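
Before downloading from a gated repository, it can help to confirm that a token is actually visible to the Python process. The following is a minimal sketch and not part of this repository: `resolve_hf_token` is a hypothetical helper, and the `hf_` prefix check is an assumption based on the token format shown above. It only covers the environment-variable route; a token saved by `huggingface-cli login` is stored on disk and is not checked here.

```python
import os


def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set in the environment.")
    # Assumption: access tokens start with "hf_", as in the example above.
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN does not look like a Hugging Face token.")
    return token
```

The Hugging Face libraries read `HF_TOKEN` from the environment on their own, so a check like this is mainly useful for failing early with a readable message in scripts and CI jobs.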

	Once both steps are done, you can load the model as shown below.

	This checkpoint uses custom tokenizer code, so you must pass `trust_remote_code=True` when loading the tokenizer.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "outpost-bio/Waypoint-45m"

	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(model_id)

	# Example: newline-separated taxonomic lines (format must match training)
	text = (
	    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
	    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
	    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
	)
	inputs = tokenizer(text, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
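
The prompt above is assembled by hand. When taxa are held as structured records, a small formatter can keep the line layout consistent. This is a sketch rather than code from this repository: `RANK_PREFIXES`, `format_lineage`, `parse_lineage`, and `build_prompt` are hypothetical names, and the `k__` to `g__` prefixes and the `; ` separator are inferred from the example strings, so check them against the training data format before relying on them.

```python
# Rank prefixes in the order used by the example strings above (an
# assumption inferred from those strings, not read from the tokenizer).
RANK_PREFIXES = ("k", "p", "c", "o", "f", "g")


def format_lineage(lineage):
    """Turn {"k": "Bacteria", "p": "Firmicutes", ...} into one taxonomic line."""
    parts = [f"{rank}__{lineage[rank]}" for rank in RANK_PREFIXES if rank in lineage]
    return "; ".join(parts)


def parse_lineage(line):
    """Inverse of format_lineage: split a line back into a rank -> name dict."""
    lineage = {}
    for label in line.strip().split(";"):
        label = label.strip()
        if not label:
            continue
        rank, _, name = label.partition("__")
        lineage[rank] = name
    return lineage


def build_prompt(lineages):
    """Join formatted lineages, one per line, each ending with a newline."""
    return "".join(format_lineage(lineage) + "\n" for lineage in lineages)


escherichia = {
    "k": "Bacteria",
    "p": "Proteobacteria",
    "c": "Gammaproteobacteria",
    "o": "Enterobacterales",
    "f": "Enterobacteriaceae",
    "g": "Escherichia",
}
prompt = build_prompt([escherichia])
print(prompt, end="")
```

The resulting `prompt` string can be passed to the tokenizer in place of the hand-written `text`, and `parse_lineage` can be applied to each line of decoded output to recover rank and name pairs.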

	## Training
	- Data: [Atlas dataset](https://huggingface.co/datasets/outpost-bio/Atlas)
	- Objective: causal LM on taxonomic sequences.
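
For readers new to the objective, causal language modeling trains the model to predict each token from the tokens that precede it. The sketch below illustrates that shift on a plain list of token IDs. It is a generic illustration, not the training code for this model; `next_token_pairs` is a hypothetical helper and the IDs are made up.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs for causal LM training on one sequence.

    For tokens [t0, t1, t2], the model sees [t0] and predicts t1, then
    sees [t0, t1] and predicts t2.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


# Made-up token IDs standing in for one tokenized taxonomic line.
example_ids = [5, 17, 42, 8]
for context, target in next_token_pairs(example_ids):
    print(context, "->", target)
```

In practice, frameworks compute all of these predictions in one forward pass by shifting the label sequence one position relative to the inputs.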

	## License
	Apache License 2.0 (`apache-2.0`)

	## Citation
	Learning the Language of the Microbiome with Transformers\
	Neythen J Treloar, Saif Ur-Rehman, Jenny Yang\
	bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381

	## Model card contact

	Maintainer / contact: neythen@outpost.bio