NatureCode Ocean Life
A 9.91-billion-parameter foundation model specialized for marine biodiversity, ocean ecosystems, and aquatic life sciences.
Model Description
NatureCode Ocean Life is a decoder-only transformer trained on curated marine science data drawn from biodiversity databases such as OBIS, WoRMS, FishBase, and GBIF. It is designed to handle text about species taxonomy, marine habitats, ecological relationships, and oceanographic concepts.
Model Details
- Model Type: Decoder-only Transformer (Causal LM)
- Parameters: 9.91B
- Context Length: 4,096 tokens
- Training Data: 871,304 examples
- Language: English
- License: Apache 2.0
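The parameter count above gives a quick lower bound on the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, not a measured figure: it assumes every parameter is stored in a single dtype and ignores activations, the KV cache, and framework overhead. The helper name `weight_memory_gb` is illustrative.

```python
# Rough weight-memory estimate from the parameter count listed in Model Details.
PARAMS = 9.91e9  # 9.91B parameters

# Bytes per parameter for two common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(params: float, dtype: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights alone take about 39.64 GB in float32 and about 19.82 GB in bfloat16.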
Training Data
Included Sources
| Source | Description | Size |
|---|---|---|
| OBIS | Ocean Biodiversity Information System - species occurrence records | 62 GB |
| WoRMS | World Register of Marine Species - taxonomic database | 4K records |
| FishBase | Comprehensive fish species database | 4K records |
| GBIF | Global Biodiversity Information Facility (marine subset) | 1.8K records |
| arXiv | Marine science research papers | ~1,000 papers |
| OceanInstruct | Marine Q&A instruction data | 4.9 MB |
| Text Corpus | Curated marine biology documents | 3.4 GB |
Data Gaps (Not Yet Included)
The following data types are planned for future versions:
- Soil & Terrestrial Ecology - Land ecosystems, soil microbiomes, forest biodiversity
- Freshwater Systems - Rivers, lakes, wetlands, freshwater fish
- Climate Data - Ocean-atmosphere interactions, temperature records, climate projections
- Satellite Imagery - Ocean color, chlorophyll, sea surface temperature
- Acoustic Data - Marine mammal vocalizations, underwater soundscapes, bioacoustics
- Chemical Oceanography - Nutrient cycles, pH measurements, dissolved oxygen
- Genetic/eDNA - Environmental DNA, genomic sequences, phylogenetic data
- Historical Records - Historical fisheries data, expedition logs, museum collections
- Indigenous Knowledge - Traditional ecological knowledge, local species names
Intended Uses
Primary Use Cases
- Species identification and taxonomy queries
- Marine habitat and distribution research
- Biodiversity assessment support
- Ocean ecosystem understanding
- Marine conservation planning
- Fisheries science applications
- Educational content about marine life
Out of Scope
- Medical or dietary advice about seafood
- Real-time species identification from images (text-only model)
- Legal or regulatory compliance decisions
- Replacement for authoritative taxonomic databases
Training Procedure
Hardware
- 8x NVIDIA H100 80GB GPUs
- Distributed Data Parallel (DDP) training
Hyperparameters
- Batch Size: 16 per device
- Gradient Accumulation: 4 steps
- Effective Batch Size: 512
- Learning Rate: 3e-4 (with warmup)
- Warmup Steps: 1,000
- Total Steps: 50,000
- Precision: BF16 mixed precision
- Optimizer: AdamW
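The hyperparameters above fit together as follows. This is a minimal sketch with two labeled assumptions: each optimizer step consumes one effective batch of whole examples (no sequence packing), and the learning rate warms up linearly and then stays constant. The card does not specify the warmup shape or any decay schedule after warmup, so `warmup_lr` is illustrative only.

```python
# Values taken from the Hardware and Hyperparameters lists.
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 8
PEAK_LR = 3e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 50_000
NUM_EXAMPLES = 871_304


def effective_batch_size(per_device: int, accum: int, gpus: int) -> int:
    """Examples contributing to one optimizer step across all GPUs."""
    return per_device * accum * gpus


def warmup_lr(step: int, peak_lr: float = PEAK_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then constant (the post-warmup shape is an assumption)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_GPUS)
print(batch)  # 512

# Approximate passes over the data, assuming one example per sequence.
epochs = batch * TOTAL_STEPS / NUM_EXAMPLES
print(f"{epochs:.1f} epochs")

print(warmup_lr(500), warmup_lr(1_000))
```

The product 16 x 4 x 8 reproduces the listed effective batch size of 512. Under the no-packing assumption, 50,000 steps correspond to roughly 29 passes over the 871,304 training examples.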
Evaluation
Evaluation on marine biology benchmarks is ongoing. Performance metrics will be added here as they become available.
Limitations and Biases
Known Limitations
- Marine Focus: Limited knowledge of terrestrial and freshwater ecosystems
- Language: Training data primarily in English
- Temporal: May not include species described after the training data cutoff
- Geographic: Data coverage may be uneven across ocean regions
- Depth: Deep-sea species may be underrepresented due to limited research data
Ethical Considerations
- Should not be used as the sole basis for conservation decisions
- Predictions should be validated against authoritative databases (WoRMS, OBIS, FishBase)
- May perpetuate biases present in historical biodiversity records
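One way to act on the validation point above is to check any species name the model produces against WoRMS before relying on it. The sketch below is a hedged illustration, not an official client: the base URL, the `AphiaRecordsByName` path, the `like` and `marine_only` query parameters, the 204 "no content" response, and the `scientificname` and `status` record fields reflect our understanding of the public WoRMS REST service and should be confirmed against its documentation. The function names are hypothetical, and the sample records are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public WoRMS REST service.
WORMS_REST = "https://www.marinespecies.org/rest"


def worms_name_url(scientific_name: str) -> str:
    """Build an exact-match, marine-only lookup URL for a scientific name."""
    name = urllib.parse.quote(scientific_name)
    return f"{WORMS_REST}/AphiaRecordsByName/{name}?like=false&marine_only=true"


def fetch_worms_records(scientific_name: str, timeout: float = 10.0) -> list:
    """Fetch matching records over the network; returns [] if there is no match."""
    url = worms_name_url(scientific_name)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        if resp.status == 204:  # assumed "no content" response for unknown names
            return []
        return json.loads(resp.read().decode("utf-8"))


def is_accepted_name(scientific_name: str, records: list) -> bool:
    """True if any record matches the name exactly and is marked accepted."""
    return any(
        r.get("scientificname") == scientific_name and r.get("status") == "accepted"
        for r in records
    )


# Offline illustration with made-up records (not real WoRMS data).
sample_records = [
    {"scientificname": "Examplus marinus", "status": "accepted"},
    {"scientificname": "Examplus obsoletus", "status": "unaccepted"},
]
print(is_accepted_name("Examplus marinus", sample_records))
print(is_accepted_name("Examplus obsoletus", sample_records))
```

In practice, `fetch_worms_records(name)` would supply the records, and any name that is missing or not accepted should be reviewed by hand rather than trusted from model output.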
How to Use
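from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model weights and the matching tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("naturecode/ocean-life")
tokenizer = AutoTokenizer.from_pretrained("naturecode/ocean-life")
# Tokenize a text prompt into PyTorch tensors.
prompt = "What are the main threats to coral reef ecosystems?"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate up to 256 new tokens, then decode the full sequence (prompt plus continuation).
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))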
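Because the context length is 4,096 tokens, the prompt and the generated continuation have to share that window. The helper below is a small sketch of how to budget prompt tokens. The name `prompt_token_budget` is hypothetical, and the commented tokenizer call relies on the standard `truncation` and `max_length` arguments of Hugging Face tokenizers.

```python
CONTEXT_LENGTH = 4096  # context length from Model Details


def prompt_token_budget(max_new_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many prompt tokens fit alongside max_new_tokens generated tokens."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    return context_length - max_new_tokens


budget = prompt_token_budget(max_new_tokens=256)
print(budget)  # 3840

# With the tokenizer loaded as in the snippet above, long prompts can be truncated to fit:
# inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=budget)
```

Truncation drops tokens from one end of the prompt (the right end by default), so for long inputs it may be preferable to shorten or summarize the prompt deliberately.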
Citation
@misc{naturecode-ocean-life-2026,
title={NatureCode Ocean Life: A Foundation Model for Marine Biodiversity},
author={NatureCode Team},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/naturecode/ocean-life}
}
Acknowledgments
Training data sourced from:
- Ocean Biodiversity Information System (OBIS)
- World Register of Marine Species (WoRMS)
- FishBase
- Global Biodiversity Information Facility (GBIF)
Contact
For questions or feedback, please open an issue on the model repository.