---
language: en
license: apache-2.0
tags:
- marine-biology
- metagenomics
- environmental-modeling
- protein-domains
- tara-oceans
- vicreg
- joint-embedding
- self-supervised-learning
- pytorch
library_name: pytorch
pipeline_tag: tabular-regression
---

# TARA-WorldModel-VICReg

Joint environment–genome embedding model for marine ecosystem productivity prediction, trained with Variance-Invariance-Covariance Regularization (VICReg) on TARA Oceans data.

## Model Description

The World Model learns a shared latent space that aligns environmental context (satellite-derived variables) with microalgal protein domain composition (Pfam module abundances), then predicts marine productivity (chlorophyll-a, POC, NFLH) from the joint embedding.

### Architecture

```
Training:
  env  (24 dims) → EncoderE(128 → 32)       → z_env  ─┐
                                                      ├─→ VICReg loss
  pfam (20 dims) → EncoderP(256 → 128 → 32) → z_pfam ─┤
                                                      └─→ Predictor(64 → 3) → productivity

Inference (environment-only):
  env → EncoderE → z_env → Predictor → [chl-a, POC, NFLH]
```

- **EncoderE**: Linear(24→128) + BN + ReLU + Dropout(0.3) → Linear(128→32) + BN + ReLU + Dropout(0.3)
- **EncoderP**: Linear(20→256) + BN + ReLU + Dropout(0.3) → Linear(256→128) + BN + ReLU + Dropout(0.3) → Linear(128→32) + BN + ReLU + Dropout(0.3)
- **Predictor**: Linear(32→64) + ReLU → Linear(64→3)
- **Total parameters**: 53,187

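The layer list above can be reproduced as a minimal PyTorch sketch. `WorldModelSketch` and its helper are illustrative names, not the actual class from the training codebase, but the sketch matches the stated total of 53,187 parameters:

```python
import torch
import torch.nn as nn

def mlp(dims, dropout=0.3):
    """Stack of Linear + BatchNorm + ReLU + Dropout blocks."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                   nn.ReLU(), nn.Dropout(dropout)]
    return nn.Sequential(*layers)

class WorldModelSketch(nn.Module):
    """Illustrative reconstruction of the architecture described above."""
    def __init__(self, env_dim=24, pfam_dim=20, latent_dim=32):
        super().__init__()
        self.encoder_e = mlp([env_dim, 128, latent_dim])        # EncoderE
        self.encoder_p = mlp([pfam_dim, 256, 128, latent_dim])  # EncoderP
        self.predictor = nn.Sequential(                         # Predictor
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, env, pfam):
        z_env, z_pfam = self.encoder_e(env), self.encoder_p(pfam)
        # Productivity is predicted from z_env, matching environment-only inference
        return z_env, z_pfam, self.predictor(z_env)
```

Predicting from `z_env` rather than `z_pfam` is what allows environment-only inference once the two embeddings have been aligned.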
### VICReg Loss

Non-contrastive self-supervised alignment (Bardes et al., ICLR 2022):
- **Invariance**: MSE between co-located env/pfam embeddings (λ=25)
- **Variance**: Hinge loss preventing embedding collapse (λ=25)
- **Covariance**: Off-diagonal penalty decorrelating dimensions (λ=1)
- **Prediction**: MSE on productivity targets (α=1)

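A sketch of the first three terms, following the standard VICReg formulation; the training codebase's exact implementation may differ in normalization details, and the prediction MSE (α=1) is added separately on the productivity targets:

```python
import torch

def vicreg_loss(z_env, z_pfam, lambda_inv=25.0, lambda_var=25.0, lambda_cov=1.0):
    """VICReg terms (Bardes et al., 2022) between co-located embeddings."""
    n, d = z_env.shape

    # Invariance: MSE between paired env/pfam embeddings
    inv = ((z_env - z_pfam) ** 2).mean()

    # Variance: hinge keeps each embedding dimension's std above 1
    def var_term(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return torch.relu(1.0 - std).mean()

    # Covariance: penalize off-diagonal entries of the covariance matrix
    def cov_term(z):
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d

    var = var_term(z_env) + var_term(z_pfam)
    cov = cov_term(z_env) + cov_term(z_pfam)
    return lambda_inv * inv + lambda_var * var + lambda_cov * cov
```

With identical inputs the invariance term vanishes, while the variance and covariance terms still act on each embedding independently, which is what prevents the trivial collapsed solution.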
## Performance

The joint embedding improves POC prediction over the environment-only baseline (R² 0.422 → 0.532, a 26% relative improvement). Chlorophyll-a and NFLH are predicted better by environment alone, since both are measured directly by satellite.

## Files

### Fold Checkpoints (leave-one-basin-out spatial CV)

Two training runs are provided:
- `world_model_fold_*_20260127_110243.pt` — Initial configuration (latent_dim=16)
- `world_model_fold_*_20260127_111754.pt` — Best configuration from the hyperparameter sweep (latent_dim=32)

Six folds per run: Arctic, Atlantic, Indian, Mediterranean, Pacific, Southern.

### Configuration

- `phase2_best_config.json` — Hyperparameter sweep results (54 configurations, 3 seeds each)

## Hyperparameters (Best Config)

| Parameter | Value |
|-----------|-------|
| latent_dim | 32 |
| dropout | 0.3 |
| λ_invariance | 25.0 |
| λ_variance | 25.0 |
| λ_covariance | 1.0 |
| pred_alpha | 1.0 |
| learning_rate | 0.001 |
| weight_decay | 1e-4 |
| batch_size | 128 |
| max_epochs | 300 |
| patience | 30 |
| grad_clip | 1.0 |

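Assuming a standard Adam setup (the optimizer itself is not stated in this card), the table's settings map onto PyTorch roughly as follows, with a stand-in module in place of the WorldModel:

```python
import torch
import torch.nn as nn

# Illustrative optimization step mirroring the table above; the actual
# training script is not included here, and nn.Linear is NOT the WorldModel.
model = nn.Linear(24, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x, y = torch.randn(128, 24), torch.randn(128, 3)  # dummy batch (batch_size=128)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad_clip=1.0
optimizer.step()
# max_epochs=300 with early stopping at patience=30 would wrap this step.
```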
## Usage

```python
import torch

# Load a fold checkpoint
ckpt = torch.load("world_model_fold_Atlantic_20260127_111754.pt", map_location="cpu")

# ckpt contains `model_state_dict` for the full WorldModel; the WorldModel
# class itself must be imported from the training codebase before calling
# `model.load_state_dict(ckpt["model_state_dict"])`.
```

## Dataset

- **1,810 ocean samples** with co-located environment and Pfam profiles
- **24 environmental variables** (GEE oceanographic/atmospheric)
- **20 Pfam module features** (aggregated from 9,466 domains via co-occurrence clustering)
- **3 productivity targets** (chlorophyll-a, POC, NFLH)
- **Spatial cross-validation**: leave-one-basin-out (6 ocean basins)

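The spatial cross-validation scheme is simple to express directly; a sketch of the fold construction, using the basin names from the checkpoint files:

```python
basins = ["Arctic", "Atlantic", "Indian", "Mediterranean", "Pacific", "Southern"]

# Leave-one-basin-out: each fold trains on five basins and evaluates on the
# sixth, so reported performance reflects transfer to an unseen ocean region.
folds = [(test, [b for b in basins if b != test]) for test in basins]

for test_basin, train_basins in folds:
    pass  # fit on samples from train_basins, evaluate on test_basin
```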
## Related Models

- [GreenGenomicsLab/algaGPT](https://huggingface.co/GreenGenomicsLab/algaGPT) — AlgaGPT protein classification
- [GreenGenomicsLab/LA4SR-Pythia70m-b-ckpt-55000](https://huggingface.co/GreenGenomicsLab/LA4SR-Pythia70m-b-ckpt-55000) — LA4SR-Pythia classification
- [GreenGenomicsLab/TARA-ELF-NET](https://huggingface.co/GreenGenomicsLab/TARA-ELF-NET) — Deep bidirectional env↔pfam models
- [GreenGenomicsLab/TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional) — XGBoost bidirectional models

## Citation

LA4SR classification models:

> Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. *Patterns*. 2024;6(11).

## License

Apache 2.0