metadata

language: en
tags:
  - pytorch
  - causal-lm
  - gpt2
  - microbiome
  - taxonomy
pipeline_tag: text-generation
license: apache-2.0

Waypoint-6m

Waypoint-6m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.

Model summary

See our preprint for details and our GitHub repo for pretraining, finetuning and benchmarking scripts.

Join our Slack community for support and discussion about microbiome foundation models.

Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; the tokenizer files are listed in the table below.

| Item | Details |
| --- | --- |
| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
| Remote code | Required: this repo includes `tokenization_taxonomic.py` for `TaxonomicTokenizer` |
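
To confirm the architecture and vocabulary size from the repo files themselves, something like the following may help. It is a minimal sketch, assuming `config.json` is a standard Transformers config and `vocab.json` is a flat JSON object mapping token strings to integer ids (the latter is an assumption about this repo's custom tokenizer, not something this card documents). The helper name `describe_checkpoint` is hypothetical.

```python
import json
from pathlib import Path


def describe_checkpoint(repo_dir):
    """Return (model_type, vocab_size) read from a local copy of the repo files.

    Assumes vocab.json is a flat {token: id} JSON object; adjust this if the
    custom tokenizer stores its vocabulary differently.
    """
    repo_dir = Path(repo_dir)
    config = json.loads((repo_dir / "config.json").read_text(encoding="utf-8"))
    vocab = json.loads((repo_dir / "vocab.json").read_text(encoding="utf-8"))
    return config.get("model_type"), len(vocab)


# Usage (needs access to the gated repo; snapshot_download is from huggingface_hub):
# from huggingface_hub import snapshot_download
# local_dir = snapshot_download("outpost-bio/Waypoint-6m", allow_patterns=["*.json"])
# print(describe_checkpoint(local_dir))
```

Only the JSON files are fetched in the commented usage, so the check can run before downloading any weights.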

Intended use

Research and prototyping for taxonomic sequence modeling (e.g. pretraining representations, generation experiments).
Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.

Usage

This repository is gated. To use it, you'll need to:

1. Request access: click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
2. Authenticate: log in to Hugging Face from your environment so the download tooling can use your token:

   huggingface-cli login

Or set the token directly:

   export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

You can create a token at https://huggingface.co/settings/tokens.
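
If you would rather pass the token explicitly than rely on a cached login, a small helper can read it from the environment and fail early with a clear message. This is a sketch only: `read_hf_token` is a hypothetical name, and it checks just the `HF_TOKEN` variable shown above.

```python
import os


def read_hf_token(env=None):
    """Return the Hugging Face token from HF_TOKEN, or raise if it is unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; run `huggingface-cli login` or export HF_TOKEN first."
        )
    return token


# Usage: recent transformers releases accept the token via the `token` argument.
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(
#     "outpost-bio/Waypoint-6m", trust_remote_code=True, token=read_hf_token()
# )
```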

Once both steps are done, you can load the model normally.

This checkpoint uses custom tokenizer code, so you must pass `trust_remote_code=True` when loading the tokenizer:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-6m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
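
The input format used above can also be built and checked programmatically. The sketch below assumes each line is a `;`-separated lineage whose entries look like `<rank>__<name>`, as in the example; the helper names `parse_lineage` and `build_input` are hypothetical, and the authoritative format is whatever the training data used.

```python
def parse_lineage(line):
    """Split one lineage line into (rank, name) pairs, e.g. ("g", "Blautia")."""
    pairs = []
    for entry in line.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        rank, sep, name = entry.partition("__")
        if not sep:
            raise ValueError(f"expected '<rank>__<name>', got {entry!r}")
        pairs.append((rank, name))
    return pairs


def build_input(lineages):
    """Join lineage lines into newline-separated text with a trailing newline."""
    return "".join(line.strip() + "\n" for line in lineages)


lines = [
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia",
]
text = build_input(lines)
genus = parse_lineage(lines[0])[-1]  # last entry is the most specific rank
```

Validating lines this way before tokenizing may catch malformed entries early; the same `parse_lineage` helper can be applied to each line of decoded model output.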

Training

Data: Atlas dataset
Objective: causal LM on taxonomic sequences.
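
For reference, the standard causal language modeling objective minimizes the negative log-likelihood of each token given the tokens before it. The notation is an assumption made for illustration ($x_1, \dots, x_T$ for the tokens of one training sequence, $\theta$ for the model parameters); see the pretraining scripts for the exact setup.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```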

License

apache-2.0

Citation

Learning the Language of the Microbiome with Transformers
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381

Maintainer / contact: waypoint@outpost.bio