# GenomeOcean-500M-v1.2
GenomeOcean-500M-v1.2 is a 500-million-parameter causal language model for microbial genomic sequences. It is a continued-training checkpoint of GenomeOcean-500M (v1.0) trained on an expanded dataset that adds GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database to the original metagenomic training corpus.
## What's new in v1.2
| Change | Details |
|---|---|
| Expanded dataset | Added GTDB r226 (~85k representative genomes), INPHARED phage (~30k genomes), Zenodo RNA virus database |
| `[CLS]` domain tag | Viral/phage sequences are prefixed with `[CLS]` to signal sequence type to the model |
| `[SEP]` genome boundaries | `[SEP]` token inserted between genome records during training |
| IUPAC resolution | Ambiguity codes (R, Y, S, W, K, M, B, D, H, V) resolved by random sampling |
| Chunk overlap | 150 bp overlap between consecutive chunks of long sequences |
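The IUPAC resolution and chunk-overlap steps above can be sketched as follows (a minimal illustration only; the function names and parameter defaults are hypothetical, not the actual preprocessing code):

```python
import random

# IUPAC ambiguity codes listed above, mapped to the bases they can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT",
    "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def resolve_iupac(seq: str, rng: random.Random) -> str:
    """Replace each ambiguity code with one of its bases, sampled at random."""
    return "".join(rng.choice(IUPAC[c]) if c in IUPAC else c for c in seq)

def chunk_with_overlap(seq: str, chunk_len: int = 5000, overlap: int = 150) -> list[str]:
    """Split a long sequence into chunks sharing `overlap` bp with the previous chunk."""
    step = chunk_len - overlap
    return [seq[i:i + chunk_len] for i in range(0, max(len(seq) - overlap, 1), step)]
```

The 150 bp overlap means a motif falling on a chunk boundary is still seen intact in at least one chunk.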
## Model Details
- Architecture: Mistral (decoder-only transformer)
- Parameters: ~500M
- Tokenizer: BPE, vocabulary size 4096
- Context length: 1024 tokens (~5 kbp)
- Training precision: bfloat16
- Training framework: DeepSpeed ZeRO-1
- Effective batch size: 128 (8 per GPU × 16 gradient accumulation steps)
- Continued from: `DOEJGI/GenomeOcean-500M` (v1.0)
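The batch and context figures above imply the following back-of-envelope numbers (a sanity check, not additional configuration):

```python
# Effective batch size from the training configuration above.
per_gpu_batch = 8
grad_accum_steps = 16
effective_batch = per_gpu_batch * grad_accum_steps  # 128 sequences per optimizer step

# Approximate base pairs covered by one full context window,
# using the "~5 kbp" figure from the list above.
context_tokens = 1024
approx_bp = 5000
bp_per_token = approx_bp / context_tokens  # roughly 4.9 bp per BPE token
```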
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "DOEJGI/GenomeOcean-500M-v1.2",
    padding_side="left",
)
model = AutoModelForCausalLM.from_pretrained(
    "DOEJGI/GenomeOcean-500M-v1.2",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate a sequence continuation
sequence = "ATGCGATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Data
| Dataset | Type | Source |
|---|---|---|
| Antarctic metagenome | Metagenomic assembly | Internal |
| GRE metagenome | Metagenomic assembly | Internal |
| Harvard Forest metagenome | Metagenomic assembly | Internal |
| Lake Mendota metagenome | Metagenomic assembly | Internal |
| NEON metagenome | Metagenomic assembly | Internal |
| Oilcane metagenome | Metagenomic assembly | Internal |
| Tara Oceans metagenome | Metagenomic assembly | Internal |
| HMP2 | Human microbiome | Internal |
| GTDB r226 representative genomes | Bacterial/Archaeal genomic DNA | https://gtdb.ecogenomic.org/ |
| INPHARED phage genomes (Apr 2025) | Phage/bacteriophage DNA | https://github.com/RyanCook94/inphared |
| Zenodo RNA virus database | RNA virus genomes | https://zenodo.org/records/10989253 |
## Special Token Usage
The v3 tokenizer uses special tokens in a specific way during training:
- `[CLS]` — prepended to every chunk of viral/phage/RNA sequences to act as a domain tag. This allows the model to learn sequence-type-specific representations.
- `[SEP]` — inserted at genome boundaries (between the last chunk of one genome and the first chunk of the next) during the packing step. This teaches the model that sequences are discrete units.
- `[UNK]` — should never appear in clean data. IUPAC ambiguity codes are resolved before tokenization.
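The tagging and packing scheme described above can be sketched as follows (a simplified illustration with hypothetical helper names; the real pipeline operates on token IDs rather than strings):

```python
CLS, SEP = "[CLS]", "[SEP]"

def tag_chunks(chunks: list[str], is_viral: bool) -> list[str]:
    """Prefix every chunk of a viral/phage/RNA genome with the [CLS] domain tag."""
    return [CLS + c for c in chunks] if is_viral else list(chunks)

def pack_genomes(genomes: list[list[str]]) -> list[str]:
    """Concatenate per-genome chunk lists, inserting [SEP] at each genome boundary."""
    packed: list[str] = []
    for i, chunks in enumerate(genomes):
        if i > 0:
            packed.append(SEP)  # boundary between the previous genome and this one
        packed.extend(chunks)
    return packed
```

For example, two genomes of two and one chunks pack into a single stream with one `[SEP]` at the boundary between them.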
## Citation
If you use GenomeOcean in your research, please cite:
```bibtex
@article{genomeocean2025,
  title={GenomeOcean: A Pretrained Microbial Genome Foundational Model},
  author={...},
  journal={...},
  year={2025}
}
```
## License
Copyright (c) 2025, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and Northwestern University. All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.