metadata
license: apache-2.0
language:
  - en
tags:
  - marine-biology
  - biodiversity
  - oceanography
  - ecology
  - species-identification
  - foundation-model
datasets:
  - OBIS
  - WoRMS
  - FishBase
  - GBIF
pipeline_tag: text-generation

NatureCode Ocean Life

A 9.91-billion-parameter foundation model specialized for marine biodiversity, ocean ecosystems, and aquatic life sciences.

Model Status: In Training (Checkpoint 11000/50000)

This is an intermediate checkpoint at step 11,000 of 50,000 total training steps (22% complete).

Current Training Progress

  • Current Step: 11,000
  • Target Steps: 50,000
  • Progress: 22%

What's Needed to Complete the Model

Remaining Training

  • Continue training from step 11,000 to step 50,000 (39,000 more steps)
  • Estimated compute: ~30 hours on 8x H100 GPUs
  • Training data: 871,304 marine biology examples
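The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes the ~30 hour estimate covers the remaining steps alone (the section heading suggests so, but the README does not state it outright), and the variable names are illustrative.

```python
# Figures quoted in this README
current_step = 11_000
target_steps = 50_000
est_hours = 30      # "~30 hours on 8x H100 GPUs"
num_gpus = 8

remaining_steps = target_steps - current_step
progress_pct = 100 * current_step / target_steps

# Assumption: the ~30 hour estimate applies to the remaining steps only
sec_per_step = est_hours * 3600 / remaining_steps
gpu_hours = est_hours * num_gpus

print(f"remaining steps: {remaining_steps}")         # 39000
print(f"progress: {progress_pct:.0f}%")              # 22%
print(f"implied pace: {sec_per_step:.2f} s/step")    # 2.77 s/step
print(f"budget: {gpu_hours} GPU-hours")              # 240 GPU-hours
```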

Infrastructure Requirements

  • 8x NVIDIA H100 80GB GPUs (for example, a Google Cloud a3-highgpu-8g instance)
  • Training config: batch_size=8, grad_accum=4, lr=3e-4
  • Mixed precision: BF16 with gradient checkpointing
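The training config does not say whether batch_size is per GPU or global, so the effective batch size is ambiguous. The sketch below works out both readings. It assumes plain data parallelism across the 8 GPUs and sequences packed to the full 4096-token context; neither is confirmed by this README.

```python
batch_size = 8          # from the training config
grad_accum = 4          # gradient accumulation steps
num_gpus = 8
context_length = 4096   # from the architecture table

# Reading 1 (assumption): batch_size is per GPU, data-parallel across all GPUs
per_gpu_sequences = batch_size * grad_accum * num_gpus
per_gpu_tokens = per_gpu_sequences * context_length

# Reading 2 (assumption): batch_size is already the global batch
global_sequences = batch_size * grad_accum
global_tokens = global_sequences * context_length

print(f"per-GPU reading: {per_gpu_sequences} sequences, {per_gpu_tokens:,} tokens per optimizer step")
print(f"global reading:  {global_sequences} sequences, {global_tokens:,} tokens per optimizer step")
```

Under the per-GPU reading each optimizer step sees 256 sequences (about 1.05M tokens); under the global reading, 32 sequences (about 131K tokens). Check train.py to see which applies before resuming training.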

Data Sources Already Included

  • OBIS (Ocean Biodiversity Information System): 62GB
  • WoRMS (World Register of Marine Species): 4K records
  • FishBase: 4K records
  • GBIF (marine subset): 1.8K records
  • arXiv marine papers: ~1,000 papers
  • OceanInstruct Q&A: 4.9MB
  • Curated marine text: 3.4GB

Data Gaps (Future Improvements)

  • Freshwater ecosystems
  • Climate/ocean interaction data
  • Satellite imagery integration
  • Acoustic/bioacoustic data
  • eDNA sequences

Model Architecture

| Specification | Value |
|---|---|
| Parameters | 9.91B |
| Hidden Size | 4096 |
| Layers | 48 |
| Attention Heads | 32 |
| FFN Size | 16384 |
| Context Length | 4096 tokens |
| Vocab Size | 50257 (GPT-2) |
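As a rough sanity check, the parameter count can be estimated from the other specifications. The sketch below assumes a standard GPT-2-style decoder (biased linear layers, two LayerNorms per block, learned position embeddings, output head tied to the token embeddings). The actual layout is defined in train.py and may differ, and `estimate_gpt2_params` is a hypothetical helper, not part of this repository.

```python
def estimate_gpt2_params(hidden, layers, ffn, vocab, context, tied_embeddings=True):
    """Estimate parameters for a GPT-2-style decoder (assumed layout)."""
    attn = 4 * hidden * hidden + 4 * hidden         # Q, K, V, output projections + biases
    mlp = 2 * hidden * ffn + ffn + hidden           # two linear layers + biases
    norms = 2 * 2 * hidden                          # two LayerNorms (scale and shift)
    per_layer = attn + mlp + norms
    embeddings = vocab * hidden + context * hidden  # token + learned position embeddings
    final_norm = 2 * hidden
    head = 0 if tied_embeddings else vocab * hidden
    return layers * per_layer + embeddings + final_norm + head

total = estimate_gpt2_params(hidden=4096, layers=48, ffn=16384, vocab=50257, context=4096)
print(f"{total / 1e9:.2f}B")  # 9.89B
```

With tied embeddings the estimate is about 9.89B, and with an untied output head about 10.09B. Both are close to the stated 9.91B; the remaining difference would come from architectural details this sketch does not model.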

Usage

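import torch
from transformers import GPT2Tokenizer

# Assumption: the stock GPT-2 tokenizer, which matches the 50257-token vocab
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load the raw checkpoint weights (tensors only, no pickled code)
state_dict = torch.load('pytorch_model.bin', map_location='cpu', weights_only=True)

# The model uses a custom architecture: instantiate the model class from train.py
# and load state_dict into it, or wait for the final release with
# Hugging Face transformers integration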
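Until the model class is available, the checkpoint can still be inspected on its own. The following sketch assumes pytorch_model.bin holds a flat mapping of parameter names to tensors, which this README does not confirm; `summarize_state_dict` is a hypothetical helper.

```python
import os

def summarize_state_dict(state_dict, top=5):
    """Return (total element count, the `top` largest tensors by element count)."""
    sizes = {name: tensor.numel() for name, tensor in state_dict.items()}
    largest = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return sum(sizes.values()), largest

# Only runs when the checkpoint has been downloaded next to this script
if os.path.exists("pytorch_model.bin"):
    import torch

    state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
    total, largest = summarize_state_dict(state_dict)
    print(f"{total / 1e9:.2f}B parameters")
    for name, count in largest:
        print(f"{name}: {count:,}")
```

The printed parameter names also show how the layers are organized, which helps when matching the weights to the model class in train.py.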

License

Apache 2.0

Citation

@misc{naturecode-ocean-life-2026,
  title={NatureCode Ocean Life: A Foundation Model for Marine Biodiversity},
  author={NatureCode Team},
  year={2026},
  url={https://huggingface.co/naturecodeproject/ocean-life}
}