language: en
tags:
  - pytorch
  - causal-lm
  - gpt2
  - microbiome
  - taxonomy
pipeline_tag: text-generation
license: apache-2.0

Waypoint-45m

Waypoint-45m is a GPT-2–style causal language model trained on newline-separated taxonomic strings from microbiome data, to learn representations of taxa co-occurrence and sequence structure.

Model summary

See our preprint for details.

Join our Slack community for support and discussion about microbiome foundation models.

Causal language model trained on newline-separated taxonomic strings. Each line is treated as a token sequence drawn from a vocabulary of taxonomic labels; see the Vocab and Remote code rows in the table below for the tokenizer files.

| Item | Details |
|---|---|
| Architecture | GPT-2 (`model_type: gpt2` in `config.json`) |
| Vocab | Taxonomic tokenizer (`vocab.json`); size shown in the Hub file list / `tokenizer_config.json` |
| Remote code | Required: this repo includes `tokenization_taxonomic.py` for `TaxonomicTokenizer` |
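
To check these values against a local copy of the repo, a minimal sketch along the following lines may help. It assumes `config.json` and `vocab.json` have already been downloaded into one directory, and that `vocab.json` is a flat JSON object mapping token strings to integer ids (a common layout, but not confirmed for this repo). The helper name `summarize_checkpoint` and its `local_dir` argument are illustrative and not part of the repository.

```python
import json
from pathlib import Path


def summarize_checkpoint(local_dir):
    """Read basic facts from a locally downloaded copy of the repo.

    Assumption: config.json and vocab.json sit directly inside local_dir,
    and vocab.json maps token strings to integer ids.
    """
    root = Path(local_dir)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    vocab = json.loads((root / "vocab.json").read_text(encoding="utf-8"))
    return {
        # Expected to be "gpt2" according to the table above.
        "model_type": config.get("model_type"),
        # Embedding size declared in the model config, if present.
        "config_vocab_size": config.get("vocab_size"),
        # Number of entries actually present in the vocabulary file.
        "vocab_file_entries": len(vocab),
    }
```

If `config_vocab_size` and `vocab_file_entries` differ, the difference may simply reflect special tokens defined outside `vocab.json`.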

Intended use

Research and prototyping for taxonomic sequence modeling (e.g. pretraining representations, generation experiments).
Not a diagnostic or clinical tool. Not validated for regulated or safety-critical decisions.

Usage

This repository is gated. To use it you'll need to:

1. Request access: click the "Request access" button at the top of this repo's page on Hugging Face. Requests are auto-approved.
2. Authenticate: log in to Hugging Face from your environment so the download tooling can use your token:

   huggingface-cli login

Or set the token directly:

   export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

You can create a token at https://huggingface.co/settings/tokens.
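
Before attempting a download, it can be useful to confirm that the token set in the shell is actually visible to Python. The following is a small sketch using only the standard library; `resolve_hf_token` is a hypothetical helper name, not part of any Hugging Face package.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the access token from the environment, or None if it is unset.

    Surrounding whitespace is stripped so that a stray newline from a
    copy-paste does not end up inside the token.
    """
    token = os.environ.get(env_var, "").strip()
    return token or None


if __name__ == "__main__":
    # Report whether a token is visible without printing the secret itself.
    print("HF_TOKEN is set" if resolve_hf_token() else "HF_TOKEN is not set")
```

A `None` result is not necessarily a problem: if you logged in with `huggingface-cli login`, the stored credentials are normally used instead.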

Once both steps are done, you can load the model as usual.

This checkpoint uses custom tokenizer code, so you must pass `trust_remote_code=True` when loading the tokenizer:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "outpost-bio/Waypoint-45m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example: newline-separated taxonomic lines (format must match training)
text = (
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia\n"
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\n"
    "k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides\n"
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
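
For building the input text and reading generated text back, helpers like the ones below may be convenient. This is only a sketch based on the line format shown in the example above (fields separated by `;`, each written as `rank__Name`). It does not reproduce the logic in `tokenization_taxonomic.py`, and it assumes decoded output keeps the same format, which is not guaranteed. The function names are illustrative.

```python
def build_prompt(lineages):
    """Join lineage strings into newline-separated text, one taxon per line."""
    cleaned = [line.strip() for line in lineages if line.strip()]
    return "".join(line + "\n" for line in cleaned)


def parse_lineage(line):
    """Split 'k__Bacteria; p__Firmicutes' into {'k': 'Bacteria', 'p': 'Firmicutes'}."""
    ranks = {}
    for field in line.split(";"):
        field = field.strip()
        if "__" not in field:
            continue  # skip empty or malformed fields
        rank, name = field.split("__", 1)
        ranks[rank] = name
    return ranks


def parse_output(decoded_text):
    """Parse every non-empty line of decoded text into a rank-to-name dict."""
    return [parse_lineage(line) for line in decoded_text.splitlines() if line.strip()]
```

For example, `build_prompt` can produce the `text` value passed to the tokenizer, and `parse_output` can be applied to the string returned by `tokenizer.decode`.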

Training

Data: Atlas dataset
Objective: causal LM on taxonomic sequences.
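
Assuming the standard next-token objective used for GPT-2-style models (the preprint remains the authority on the exact training setup), the loss for a tokenized sequence $x_1, \dots, x_T$ is the negative log-likelihood of each token given the tokens before it:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Here $\theta$ denotes the model parameters and $p_\theta$ the distribution over the taxonomic vocabulary predicted by the model.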

License

apache-2.0

Citation

Learning the Language of the Microbiome with Transformers
Neythen J Treloar, Saif Ur-Rehman, Jenny Yang
bioRxiv 2026.05.02.722381; doi: https://doi.org/10.64898/2026.05.02.722381

Model card contact

Maintainer / contact: neythen@outpost.bio