# GenomeOcean-500M-v1.2
GenomeOcean-500M-v1.2 is a 500-million-parameter causal language model for microbial genomic sequences. It is a continued-training checkpoint of GenomeOcean-500M (v1.0) trained on an expanded dataset that adds GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database to the original metagenomic training corpus.
## What's new in v1.2
| Change | Details |
|---|---|
| Expanded dataset | Added GTDB r226 (~85k representative genomes), INPHARED phage (~30k genomes), Zenodo RNA virus database |
| `[CLS]` domain tag | Viral/phage sequences are prefixed with `[CLS]` to signal sequence type to the model |
| `[SEP]` genome boundaries | `[SEP]` token inserted between genome records during training |
| IUPAC resolution | Ambiguity codes (R, Y, S, W, K, M, B, D, H, V) resolved by random sampling |
| Chunk overlap | 150 bp overlap between consecutive chunks of long sequences |
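The IUPAC resolution and chunk-overlap steps above can be sketched as follows (a minimal illustration only; the function names and parameter defaults are hypothetical, not the actual preprocessing code):

```python
import random

# IUPAC ambiguity codes listed above, mapped to the bases they can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT",
    "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def resolve_iupac(seq: str, rng: random.Random) -> str:
    """Replace each ambiguity code with one of its bases, sampled at random."""
    return "".join(rng.choice(IUPAC[c]) if c in IUPAC else c for c in seq)

def chunk_with_overlap(seq: str, chunk_len: int = 5000, overlap: int = 150) -> list[str]:
    """Split a long sequence into chunks sharing `overlap` bp with the previous chunk."""
    step = chunk_len - overlap
    return [seq[i:i + chunk_len] for i in range(0, max(len(seq) - overlap, 1), step)]
```

The 150 bp overlap means a motif falling on a chunk boundary is still seen intact in at least one chunk.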
## Model Details
- Architecture: Mistral (decoder-only transformer)
- Parameters: ~500M
- Tokenizer: BPE, vocabulary size 4096
- Context length: 1024 tokens (~5 kbp)
- Training precision: bfloat16
- Training framework: DeepSpeed ZeRO-1
- Effective batch size: 128 (8 per GPU × 16 gradient accumulation steps)
- Continued from: `DOEJGI/GenomeOcean-500M` (v1.0)
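The batch and context figures above imply the following back-of-envelope numbers (a sanity check, not additional configuration):

```python
# Effective batch size from the training configuration above.
per_gpu_batch = 8
grad_accum_steps = 16
effective_batch = per_gpu_batch * grad_accum_steps  # 128 sequences per optimizer step

# Approximate base pairs covered by one full context window,
# using the "~5 kbp" figure from the list above.
context_tokens = 1024
approx_bp = 5000
bp_per_token = approx_bp / context_tokens  # roughly 4.9 bp per BPE token
```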
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "DOEJGI/GenomeOcean-500M-v1.2",
    padding_side="left",
)
model = AutoModelForCausalLM.from_pretrained(
    "DOEJGI/GenomeOcean-500M-v1.2",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate a sequence continuation
sequence = "ATGCGATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Data
| Dataset | Type | Source |
|---|---|---|
| Antarctic metagenome | Metagenomic assembly | Internal |
| GRE metagenome | Metagenomic assembly | Internal |
| Harvard Forest metagenome | Metagenomic assembly | Internal |
| Lake Mendota metagenome | Metagenomic assembly | Internal |
| NEON metagenome | Metagenomic assembly | Internal |
| Oilcane metagenome | Metagenomic assembly | Internal |
| Tara Oceans metagenome | Metagenomic assembly | Internal |
| HMP2 | Human microbiome | Internal |
| GTDB r226 representative genomes | Bacterial/Archaeal genomic DNA | https://gtdb.ecogenomic.org/ |
| INPHARED phage genomes (Apr 2025) | Phage/bacteriophage DNA | https://github.com/RyanCook94/inphared |
| Zenodo RNA virus database | RNA virus genomes | https://zenodo.org/records/10989253 |
## Special Token Usage
The v3 tokenizer uses special tokens in a specific way during training:
- `[CLS]` — prepended to every chunk of viral/phage/RNA sequences to act as a domain tag. This allows the model to learn sequence-type-specific representations.
- `[SEP]` — inserted at genome boundaries (between the last chunk of one genome and the first chunk of the next) during the packing step. This teaches the model that sequences are discrete units.
- `[UNK]` — should never appear in clean data. IUPAC ambiguity codes are resolved before tokenization.
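The tagging and packing scheme described above can be sketched as follows (a simplified illustration with hypothetical helper names; the real pipeline operates on token IDs rather than strings):

```python
CLS, SEP = "[CLS]", "[SEP]"

def tag_chunks(chunks: list[str], is_viral: bool) -> list[str]:
    """Prefix every chunk of a viral/phage/RNA genome with the [CLS] domain tag."""
    return [CLS + c for c in chunks] if is_viral else list(chunks)

def pack_genomes(genomes: list[list[str]]) -> list[str]:
    """Concatenate per-genome chunk lists, inserting [SEP] at each genome boundary."""
    packed: list[str] = []
    for i, chunks in enumerate(genomes):
        if i > 0:
            packed.append(SEP)  # boundary between the previous genome and this one
        packed.extend(chunks)
    return packed
```

For example, two genomes of two and one chunks pack into a single stream with one `[SEP]` at the boundary between them.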
## Citation
If you use GenomeOcean in your research, please cite:
```bibtex
@article{genomeocean2025,
  title={GenomeOcean: A Pretrained Microbial Genome Foundational Model},
  author={...},
  journal={...},
  year={2025}
}
```
## License
Copyright (c) 2025, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and Northwestern University. All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.