NatureCode Ocean Life

A 9.91-billion-parameter foundation model specialized for marine biodiversity, ocean ecosystems, and aquatic life sciences.

Model Description

NatureCode Ocean Life is a decoder-only transformer trained on curated marine science data from authoritative biodiversity databases. It is trained to answer questions about species taxonomy, marine habitats, ecological relationships, and oceanographic concepts.

Model Details

  • Model Type: Decoder-only Transformer (Causal LM)
  • Parameters: 9.91B
  • Context Length: 4,096 tokens
  • Training Data: 871,304 examples
  • Language: English
  • License: Apache 2.0
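
For capacity planning, the parameter count above gives a rough lower bound on the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, not a measured figure: the function name is illustrative, and it ignores activations, the KV cache, and optimizer state.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 9.91e9  # parameter count from the model details above

# BF16 and FP16 store 2 bytes per parameter; FP32 stores 4.
print(f"bf16 weights: ~{weight_memory_gb(NUM_PARAMS, 2):.1f} GB")
print(f"fp32 weights: ~{weight_memory_gb(NUM_PARAMS, 4):.1f} GB")
```

By this estimate the weights alone take roughly 20 GB in BF16 and roughly 40 GB in FP32, so actual inference needs more headroom than either figure.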

Training Data

Included Sources

| Source | Description | Size |
|---|---|---|
| OBIS | Ocean Biodiversity Information System - species occurrence records | 62 GB |
| WoRMS | World Register of Marine Species - taxonomic database | 4K records |
| FishBase | Comprehensive fish species database | 4K records |
| GBIF | Global Biodiversity Information Facility (marine subset) | 1.8K records |
| arXiv | Marine science research papers | ~1,000 papers |
| OceanInstruct | Marine Q&A instruction data | 4.9 MB |
| Text Corpus | Curated marine biology documents | 3.4 GB |

Data Gaps (Not Yet Included)

The following data types are planned for future versions:

  • Soil & Terrestrial Ecology - Land ecosystems, soil microbiomes, forest biodiversity
  • Freshwater Systems - Rivers, lakes, wetlands, freshwater fish
  • Climate Data - Ocean-atmosphere interactions, temperature records, climate projections
  • Satellite Imagery - Ocean color, chlorophyll, sea surface temperature
  • Acoustic Data - Marine mammal vocalizations, underwater soundscapes, bioacoustics
  • Chemical Oceanography - Nutrient cycles, pH measurements, dissolved oxygen
  • Genetic/eDNA - Environmental DNA, genomic sequences, phylogenetic data
  • Historical Records - Historical fisheries data, expedition logs, museum collections
  • Indigenous Knowledge - Traditional ecological knowledge, local species names

Intended Uses

Primary Use Cases

  • Species identification and taxonomy queries
  • Marine habitat and distribution research
  • Biodiversity assessment support
  • Ocean ecosystem understanding
  • Marine conservation planning
  • Fisheries science applications
  • Educational content about marine life

Out of Scope

  • Medical or dietary advice about seafood
  • Real-time species identification from images (text-only model)
  • Legal or regulatory compliance decisions
  • Replacement for authoritative taxonomic databases

Training Procedure

Hardware

  • 8x NVIDIA H100 80GB GPUs
  • Distributed Data Parallel (DDP) training

Hyperparameters

  • Batch Size: 16 per device
  • Gradient Accumulation: 4 steps
  • Effective Batch Size: 512
  • Learning Rate: 3e-4 (with warmup)
  • Warmup Steps: 1,000
  • Total Steps: 50,000
  • Precision: BF16 mixed precision
  • Optimizer: AdamW
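
The effective batch size follows from the per-device batch size, the 8 GPUs listed under Hardware, and the gradient accumulation steps. The sketch below reproduces that arithmetic and adds a hypothetical warmup schedule. The card does not state the warmup shape or any decay after warmup, so the linear ramp followed by a constant rate is an assumption made purely for illustration, and both function names are invented here.

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum: int) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * num_devices * grad_accum


def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 1_000) -> float:
    """Learning rate at a 0-indexed step.

    ASSUMPTION: linear warmup to peak_lr, then constant. The model card
    only says "with warmup" and does not describe a decay schedule.
    """
    if step < warmup_steps:
        return peak_lr * ((step + 1) / warmup_steps)
    return peak_lr


batch = effective_batch_size(per_device=16, num_devices=8, grad_accum=4)
total_sequences = batch * 50_000  # sequences processed over all optimizer steps

print(batch)            # 512
print(total_sequences)  # 25600000
```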

Evaluation

Evaluation on marine biology benchmarks is ongoing. Performance metrics will be added here as they become available.

Limitations and Biases

Known Limitations

  • Marine Focus: Limited knowledge of terrestrial and freshwater ecosystems
  • Language: Training data primarily in English
  • Temporal: May not include species described after the training data cutoff
  • Geographic: Data coverage may be uneven across ocean regions
  • Depth: Deep-sea species may be underrepresented due to limited research data

Ethical Considerations

  • Should not be used as the sole basis for conservation decisions
  • Predictions should be validated against authoritative databases (WoRMS, OBIS, FishBase)
  • May perpetuate biases present in historical biodiversity records
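
One way to act on the validation advice above is to check each scientific name the model produces against WoRMS before relying on it. The following is a minimal sketch using only the Python standard library. The base URL, the `AphiaRecordsByName` endpoint, its query parameters, and the `status` and `valid_name` response fields are assumptions about the public WoRMS REST service and should be confirmed against its documentation; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: base URL of the public WoRMS REST service.
WORMS_REST = "https://www.marinespecies.org/rest"


def fetch_worms_records(scientific_name: str, timeout: float = 10.0) -> list:
    """Query WoRMS for exact matches of a scientific name (network call)."""
    url = (
        f"{WORMS_REST}/AphiaRecordsByName/{urllib.parse.quote(scientific_name)}"
        "?like=false&marine_only=true"
    )
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        if resp.status == 204:  # assumed: "no content" means no match
            return []
        return json.load(resp)


def summarize_match(records: list) -> dict:
    """Reduce a list of WoRMS records to a verdict on a model-produced name."""
    if not records:
        return {"found": False, "accepted": False, "valid_name": None}
    rec = records[0]
    return {
        "found": True,
        "accepted": rec.get("status") == "accepted",
        "valid_name": rec.get("valid_name"),
    }


# Example (requires network access):
# verdict = summarize_match(fetch_worms_records("Amphiprion ocellaris"))
```

Keeping the network call separate from the verdict logic lets the check be tested offline and swapped for an OBIS or FishBase lookup.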

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model weights and the matching tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("naturecode/ocean-life")
tokenizer = AutoTokenizer.from_pretrained("naturecode/ocean-life")

# Tokenize a prompt and generate up to 256 new tokens
prompt = "What are the main threats to coral reef ecosystems?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode the generated token IDs back into text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
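
Because the context length is 4,096 tokens, the prompt and the generated tokens share one budget. The sketch below computes how many prompt tokens remain for a given `max_new_tokens`. The helper name is hypothetical, and it assumes the 4,096-token limit covers prompt and output together, as is usual for causal language models.

```python
CONTEXT_LENGTH = 4096  # context length from the model details


def max_prompt_tokens(max_new_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    return context_length - max_new_tokens


budget = max_prompt_tokens(256)
print(budget)  # 3840

# With the tokenizer loaded as in the snippet above, the budget could be applied as:
# inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=budget)
```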

Citation

@misc{naturecode-ocean-life-2026,
  title={NatureCode Ocean Life: A Foundation Model for Marine Biodiversity},
  author={NatureCode Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/naturecode/ocean-life}
}

Acknowledgments

Training data sourced from OBIS, WoRMS, FishBase, GBIF, and arXiv.

Contact

For questions or feedback, please open an issue on the model repository.
