metadata
language: en
tags:
  - pytorch
  - causal-lm
  - gpt2
  - microbiome
  - taxonomy
pipeline_tag: text-generation
license: apache-2.0

Waypoint-45m

Waypoint-45m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.

Model summary

See our preprint (listed under Citation) for details.

Join our Slack community for support and discussion of microbiome foundation models.

Causal language model trained on newline-separated taxonomic strings. Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; the tokenizer files are listed in the table below.

| Item | Details |
| --- | --- |
| Architecture | GPT-2 (model_type: gpt2 in config.json) |
| Vocab | Taxonomic tokenizer (vocab.json); size shown in the Hub file list / tokenizer_config.json |
| Remote code | Required: this repo includes tokenization_taxonomic.py for TaxonomicTokenizer |
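If you have already downloaded vocab.json, the vocabulary size can also be checked locally. The sketch below is a minimal example that assumes vocab.json is a flat JSON object mapping token strings to integer ids (the usual layout for Hugging Face tokenizers, but not confirmed for this repo); `vocab_size` is a hypothetical helper name.

```python
import json


def vocab_size(vocab_json_text):
    """Count the entries in a token-to-id JSON mapping.

    Assumes the file is a flat JSON object such as {"k__Bacteria": 1, ...};
    this layout is an assumption, not something the model card states.
    """
    vocab = json.loads(vocab_json_text)
    if not isinstance(vocab, dict):
        raise ValueError("expected a JSON object mapping tokens to ids")
    if not all(isinstance(idx, int) for idx in vocab.values()):
        raise ValueError("expected every token id to be an integer")
    return len(vocab)
```

With a local copy of the file, pass its contents in directly, for example `vocab_size(open("vocab.json", encoding="utf-8").read())`.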

Intended use

  • Research and prototyping for taxonomic sequence modeling (e.g. pretraining representations, generation experiments).
  • Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.

Usage

This repository is gated. To use it you'll need to:

  1. Request access — click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
  2. Authenticate — log in to Hugging Face from your environment so the download tooling can use your token:
   huggingface-cli login

Or set the token directly:

   export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

You can create a token at https://huggingface.co/settings/tokens.
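In a script or notebook, the same authentication can be done from Python. The following is a minimal sketch, assuming the token has been exported as HF_TOKEN as shown above; `resolve_hf_token` and `login_from_env` are hypothetical helper names, and `huggingface_hub.login` is imported lazily so that the first helper works even without the package installed.

```python
import os


def resolve_hf_token(environ=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env(environ=None):
    """Log in to the Hub with HF_TOKEN; return False if no token is set."""
    token = resolve_hf_token(environ)
    if token is None:
        return False
    # Lazy import: only needed when a token is actually available.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` before loading the model is enough; if it returns False, fall back to `huggingface-cli login`.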

Once both steps are done, you can load the model as usual.

This checkpoint uses custom tokenizer code, so you must pass trust_remote_code=True when loading the tokenizer:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-45m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
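Because the input format must match training, it can help to build prompts programmatically rather than by hand. The sketch below assumes the format is exactly what the example above shows: rank prefixes k, p, c, o, f, g joined with `__`, ranks separated by "; ", and one newline-terminated lineage per line. The helper names are hypothetical, and the real training format may differ in details not visible in this card.

```python
# Rank prefixes in the order used by the example prompt (assumed complete).
RANK_PREFIXES = ("k", "p", "c", "o", "f", "g")


def format_lineage(names):
    """Join rank names into one 'k__...; p__...' line, in rank order."""
    if len(names) > len(RANK_PREFIXES):
        raise ValueError("more names than known ranks")
    return "; ".join(
        f"{prefix}__{name}" for prefix, name in zip(RANK_PREFIXES, names)
    )


def build_prompt(lineages):
    """Concatenate lineages, terminating each with a newline."""
    return "".join(format_lineage(names) + "\n" for names in lineages)


def parse_lineage(line):
    """Split one generated line back into a {rank_prefix: name} dict."""
    record = {}
    for part in line.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        prefix, sep, name = part.partition("__")
        if not sep:
            raise ValueError(f"not a rank label: {part!r}")
        record[prefix] = name
    return record
```

The string returned by `build_prompt` can be passed to the tokenizer in place of `text`, and `parse_lineage` can be applied to each line of the decoded output.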

Training

  • Data: Atlas dataset
  • Objective: causal LM on taxonomic sequences.
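For readers unfamiliar with the causal LM objective, the sketch below illustrates it in plain Python: the loss is the mean negative log-likelihood of each token given the tokens before it. This is a generic illustration, not the training code for this model; `causal_lm_loss` and the `next_token_probs` callback are hypothetical names, and details such as batching or padding are not covered by this card.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Mean negative log-likelihood of each token given its prefix.

    next_token_probs(prefix) stands in for the model: it must return a dict
    mapping every candidate token id to its predicted probability.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction")
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = next_token_probs(token_ids[:t])  # model sees only the prefix
        total -= math.log(probs[token_ids[t]])   # penalize low probability
    return total / (len(token_ids) - 1)
```

A model that predicts uniformly over V tokens scores log(V) under this loss, which gives a useful baseline when reading training curves.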

License

apache-2.0

Citation

Learning the Language of the Microbiome with Transformers
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381

Model card contact

Maintainer / contact: neythen@outpost.bio