---
language: en
license: apache-2.0
tags:
- marine-biology
- metagenomics
- environmental-modeling
- protein-domains
- tara-oceans
- vicreg
- joint-embedding
- self-supervised-learning
- pytorch
library_name: pytorch
pipeline_tag: tabular-regression
---

# TARA-WorldModel-VICReg

Joint environment-genome embedding model for marine ecosystem productivity prediction, trained with Variance-Invariance-Covariance Regularization (VICReg) on TARA Oceans data.

## Model Description

The World Model learns a shared latent space that aligns environmental context (satellite-derived variables) with microalgal protein domain composition (Pfam module abundances), then predicts marine productivity (chlorophyll-a, POC, NFLH) from the joint embedding.

### Architecture

```
Training:
env (24 dims)  → EncoderE(128 → 32)       → z_env  ──┐
                                                     ├── VICReg loss
pfam (20 dims) → EncoderP(256 → 128 → 32) → z_pfam ──┘
                                                     └── Predictor(64 → 3) → productivity

Inference (environment-only):
env → EncoderE → z_env → Predictor → [chl-a, POC, NFLH]
```

- **EncoderE**: Linear(24→128) + BN + ReLU + Dropout(0.3) → Linear(128→32) + BN + ReLU + Dropout(0.3)
- **EncoderP**: Linear(20→256) + BN + ReLU + Dropout(0.3) → Linear(256→128) + BN + ReLU + Dropout(0.3) → Linear(128→32) + BN + ReLU + Dropout(0.3)
- **Predictor**: Linear(32→64) + ReLU → Linear(64→3)
- **Total parameters**: 53,187
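
The modules above can be sketched in PyTorch as follows. The class and attribute names (`WorldModel`, `encoder_env`, `encoder_pfam`, `predictor`) are illustrative assumptions, not necessarily the identifiers used in the training codebase:

```python
import torch
import torch.nn as nn

def mlp_block(d_in, d_out, dropout=0.3):
    # Linear + BatchNorm + ReLU + Dropout, as in the layer descriptions above
    return nn.Sequential(
        nn.Linear(d_in, d_out),
        nn.BatchNorm1d(d_out),
        nn.ReLU(),
        nn.Dropout(dropout),
    )

class WorldModel(nn.Module):
    # Hypothetical reconstruction; module names are assumptions
    def __init__(self, env_dim=24, pfam_dim=20, latent_dim=32, dropout=0.3):
        super().__init__()
        self.encoder_env = nn.Sequential(
            mlp_block(env_dim, 128, dropout),
            mlp_block(128, latent_dim, dropout),
        )
        self.encoder_pfam = nn.Sequential(
            mlp_block(pfam_dim, 256, dropout),
            mlp_block(256, 128, dropout),
            mlp_block(128, latent_dim, dropout),
        )
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 3),  # [chl-a, POC, NFLH]
        )

    def forward(self, env, pfam=None):
        z_env = self.encoder_env(env)
        z_pfam = self.encoder_pfam(pfam) if pfam is not None else None
        return z_env, z_pfam, self.predictor(z_env)

model = WorldModel()
print(sum(p.numel() for p in model.parameters()))  # → 53187
```

Counting parameters on this sketch reproduces the stated 53,187 total, which suggests the layer sizes listed above are complete.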

### VICReg Loss

Non-contrastive self-supervised alignment (Bardes et al., ICLR 2022):
- **Invariance**: MSE between co-located env/pfam embeddings (λ=25)
- **Variance**: Hinge loss preventing embedding collapse (λ=25)
- **Covariance**: Off-diagonal penalty decorrelating dimensions (λ=1)
- **Prediction**: MSE on productivity targets (α=1)
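
A minimal sketch of the three VICReg terms, following the published formulation (hinge target of 1 on the per-dimension standard deviation, covariance penalty normalized by embedding dimension). The function name and exact reductions are assumptions, not the training codebase's API:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, lam_inv=25.0, lam_var=25.0, lam_cov=1.0, eps=1e-4):
    # Invariance: MSE between the two embeddings of the same sample
    inv = F.mse_loss(z_a, z_b)

    # Variance: hinge keeping each dimension's std above 1 (prevents collapse)
    def variance_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.mean(F.relu(1.0 - std))
    var = variance_term(z_a) + variance_term(z_b)

    # Covariance: penalize off-diagonal entries of each embedding's covariance
    def covariance_term(z):
        n, d = z.shape
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d
    cov = covariance_term(z_a) + covariance_term(z_b)

    return lam_inv * inv + lam_var * var + lam_cov * cov
```

In training, the prediction MSE on the productivity targets would be added to this quantity with weight α.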

## Performance

The joint embedding improves POC prediction (R² 0.422 → 0.532, a 26% relative improvement) over the environment-only baseline. Chlorophyll-a and NFLH are better predicted by environment alone, since both are measured directly by satellite.

## Files

### Fold Checkpoints (leave-one-basin-out spatial CV)

Two training runs are provided:
- `world_model_fold_*_20260127_110243.pt` – initial configuration (latent_dim=16)
- `world_model_fold_*_20260127_111754.pt` – best configuration from the hyperparameter sweep (latent_dim=32)

Six folds per run: Arctic, Atlantic, Indian, Mediterranean, Pacific, Southern.

### Configuration

- `phase2_best_config.json` – hyperparameter sweep results (54 configurations, 3 seeds each)

## Hyperparameters (Best Config)

| Parameter | Value |
|-----------|-------|
| latent_dim | 32 |
| dropout | 0.3 |
| λ_invariance | 25.0 |
| λ_variance | 25.0 |
| λ_covariance | 1.0 |
| pred_alpha | 1.0 |
| learning_rate | 0.001 |
| weight_decay | 1e-4 |
| batch_size | 128 |
| max_epochs | 300 |
| patience | 30 |
| grad_clip | 1.0 |
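
The optimization settings in the table fit together roughly as below (Adam with weight decay, gradient-norm clipping, patience-based early stopping). This is a hedged sketch: the stand-in model and synthetic data are placeholders, not the actual training pipeline, and the real code validates on a held-out basin rather than the training loss:

```python
import torch

cfg = {"learning_rate": 1e-3, "weight_decay": 1e-4,
       "max_epochs": 300, "patience": 30, "grad_clip": 1.0}

model = torch.nn.Linear(24, 3)  # stand-in for the WorldModel
opt = torch.optim.Adam(model.parameters(), lr=cfg["learning_rate"],
                       weight_decay=cfg["weight_decay"])

x, y = torch.randn(128, 24), torch.randn(128, 3)  # synthetic placeholder batch
best, stale = float("inf"), 0
for epoch in range(cfg["max_epochs"]):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # clip the global gradient norm before each optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), cfg["grad_clip"])
    opt.step()

    val = loss.item()  # the real pipeline uses held-out-basin validation loss
    if val < best:
        best, stale = val, 0
    else:
        stale += 1
        if stale >= cfg["patience"]:  # early stopping
            break
```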

## Usage

```python
import torch

# Load fold checkpoint
ckpt = torch.load("world_model_fold_Atlantic_20260127_111754.pt", map_location="cpu")

# ckpt contains model_state_dict for the full WorldModel
# Requires the WorldModel class from the training codebase
```
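
Once the weights are restored, environment-only inference follows the EncoderE → Predictor path from the architecture diagram. A self-contained sketch with randomly initialized stand-in modules (layer shapes taken from the architecture section; the real model also applies BatchNorm and dropout inside each block):

```python
import torch
import torch.nn as nn

# Stand-ins for the checkpoint's EncoderE and Predictor (hypothetical names)
encoder_env = nn.Sequential(
    nn.Linear(24, 128), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Linear(128, 32),
)
predictor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
encoder_env.eval()
predictor.eval()

env = torch.randn(8, 24)  # one batch of the 24 environmental variables
with torch.no_grad():
    out = predictor(encoder_env(env))  # one [chl-a, POC, NFLH] row per sample
print(out.shape)  # torch.Size([8, 3])
```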

## Dataset

- **1,810 ocean samples** with co-located environment and Pfam profiles
- **24 environmental variables** (GEE oceanographic/atmospheric)
- **20 Pfam module features** (aggregated from 9,466 domains via co-occurrence clustering)
- **3 productivity targets** (chlorophyll-a, POC, NFLH)
- **Spatial cross-validation**: leave-one-basin-out (6 ocean basins)

## Related Models

- [GreenGenomicsLab/algaGPT](https://huggingface.co/GreenGenomicsLab/algaGPT) – AlgaGPT protein classification
- [GreenGenomicsLab/LA4SR-Pythia70m-b-ckpt-55000](https://huggingface.co/GreenGenomicsLab/LA4SR-Pythia70m-b-ckpt-55000) – LA4SR-Pythia classification
- [GreenGenomicsLab/TARA-ELF-NET](https://huggingface.co/GreenGenomicsLab/TARA-ELF-NET) – deep bidirectional env↔pfam models
- [GreenGenomicsLab/TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional) – XGBoost bidirectional models

## Citation

LA4SR classification models:

> Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. *Patterns*. 2024;6(11).

## License

Apache 2.0