---
license: apache-2.0
language:
- en
tags:
- marine-biology
- biodiversity
- oceanography
- ecology
- species-identification
- foundation-model
datasets:
- OBIS
- WoRMS
- FishBase
- GBIF
pipeline_tag: text-generation
---

# NatureCode Ocean Life

A 9.91-billion-parameter foundation model specialized for marine biodiversity, ocean ecosystems, and aquatic life sciences.

## Model Status: In Training (Checkpoint 11000/50000)

This is an intermediate checkpoint at step 11,000 of 50,000 total training steps (22% complete).

### Current Training Progress

- **Current Step:** 11,000
- **Target Steps:** 50,000
- **Progress:** 22%

## What's Needed to Complete the Model

### Remaining Training

- Continue training from step 11,000 to step 50,000 (39,000 more steps); a hedged resume sketch follows this list
- Estimated compute: ~30 hours on 8x H100 GPUs
- Training data: 871,304 marine biology examples
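
A minimal resume sketch for the first bullet. The checkpoint filename, the key names, and the tiny stand-in model below are all assumptions for illustration; the actual resume logic lives in `train.py`, which ships with the weights:

```python
import torch
from torch import nn

# Hypothetical resume sketch: filename, checkpoint keys, and the stand-in
# model are assumptions; the real logic lives in train.py.
model = nn.Linear(8, 8)  # placeholder for the custom 9.91B model class
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

ckpt = torch.load("checkpoint_step_11000.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])          # restore weights
optimizer.load_state_dict(ckpt["optimizer"])  # restore optimizer moments
start_step = ckpt.get("step", 11_000)

for step in range(start_step, 50_000):        # 39,000 steps remain
    ...  # forward / backward / optimizer.step(), as in train.py
```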

### Infrastructure Requirements

- 8x NVIDIA H100 80GB GPUs (a3-highgpu-8g)
- Training config: batch_size=8, grad_accum=4, lr=3e-4 (see the loop sketch after this list)
- Mixed precision: BF16 with gradient checkpointing
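
A minimal sketch of how those settings typically combine in PyTorch: BF16 autocast for the forward pass and a gradient-accumulation loop giving an effective per-GPU batch of 8 x 4 = 32. Every name below is a placeholder rather than the project's actual training code, and gradient checkpointing is only noted in a comment since it applies to the real transformer blocks:

```python
import torch

GRAD_ACCUM = 4  # effective per-GPU batch = batch_size 8 * grad_accum 4 = 32

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device)  # stand-in; the real model would
                                            # also enable gradient checkpointing
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
batches = [(torch.randn(8, 16, device=device),
            torch.randn(8, 16, device=device)) for _ in range(8)]

for i, (x, y) in enumerate(batches):
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        pred = model(x)                              # forward pass in BF16
    loss = torch.nn.functional.mse_loss(pred.float(), y) / GRAD_ACCUM
    loss.backward()                 # gradients accumulate across micro-batches
    if (i + 1) % GRAD_ACCUM == 0:
        optimizer.step()
        optimizer.zero_grad()
```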

### Data Sources Already Included

- OBIS (Ocean Biodiversity Information System): 62 GB
- WoRMS (World Register of Marine Species): 4K records
- FishBase: 4K records
- GBIF (marine subset): 1.8K records
- arXiv marine papers: ~1,000 papers
- OceanInstruct Q&A: 4.9 MB
- Curated marine text: 3.4 GB

### Data Gaps (Future Improvements)

- Freshwater ecosystems
- Climate/ocean interaction data
- Satellite imagery integration
- Acoustic/bioacoustic data
- eDNA sequences

## Model Architecture

| Specification | Value |
|---------------|-------|
| Parameters | 9.91B |
| Hidden Size | 4096 |
| Layers | 48 |
| Attention Heads | 32 |
| FFN Size | 16384 |
| Context Length | 4096 tokens |
| Vocab Size | 50257 (GPT-2) |
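
For orientation, the table's dimensions map onto a GPT-2-style configuration. The snippet below is illustrative only, since the released checkpoint uses a custom model class (see `train.py`); as a rough check, 12 x layers x hidden^2 gives about 9.7B weights, consistent with the 9.91B total once embeddings are included:

```python
from transformers import GPT2Config

# Illustrative mapping of the spec table onto a GPT-2-style config; the
# actual model class is custom and ships with train.py, not transformers.
config = GPT2Config(
    vocab_size=50257,  # GPT-2 tokenizer vocabulary
    n_positions=4096,  # context length
    n_embd=4096,       # hidden size
    n_layer=48,        # transformer layers
    n_head=32,         # attention heads
    n_inner=16384,     # FFN size
)
```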

## Usage

```python
import torch
from transformers import GPT2Tokenizer

# Load the raw model weights from this checkpoint
state_dict = torch.load('pytorch_model.bin', map_location='cpu')

# The model uses the GPT-2 vocabulary, so the stock GPT-2 tokenizer applies
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# The model uses a custom architecture - see train.py for the model class,
# or wait for the final release with Hugging Face transformers integration.
```
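
As a quick sanity check on the 9.91B figure, parameters can be counted straight from the checkpoint, assuming `pytorch_model.bin` is a flat state dict of tensors:

```python
import torch

# Count tensor elements in the checkpoint to verify the advertised size.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
total = sum(t.numel() for t in state_dict.values())
print(f"Total parameters: {total / 1e9:.2f}B")  # expect ~9.91B
```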

## License

Apache 2.0

## Citation

```bibtex
@misc{naturecode-ocean-life-2026,
  title={NatureCode Ocean Life: A Foundation Model for Marine Biodiversity},
  author={NatureCode Team},
  year={2026},
  url={https://huggingface.co/naturecodeproject/ocean-life}
}
```