@@ -16,80 +16,82 @@ pipeline_tag: tabular-regression
 # TARA-WorldModel-VICReg
-Joint environment–proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared latent space.
-## Model Description
-### Architecture
-Two-layer MLP encoder with VICReg alignment loss:
 ```
-Environment branch: Input(env_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
-PFAM branch:        Input(pfam_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
 ```
-- **Latent dimension:** 32
-- **Parameters:** ~53K–64K depending on PFAM input dimensionality
-- **VICReg loss weights:** variance = 25.0, invariance = 25.0, covariance = 1.0
-- **Prediction head alpha:** 1.0
-### Training Data
-- **Source:** 1,151 samples with complete productivity data (chlorophyll-a, POC, NFLH) from 1,810 total TARA Oceans samples
-- **Environmental features:** Google Earth Engine oceanographic variables
-- **PFAM features:** CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions
-## Performance
-This model represents an exploratory methodological approach that did not outperform the primary XGBoost bidirectional framework used in the ELF-NET study. Results are reported for transparency and reproducibility.
 ### 6-Fold Leave-One-Basin-Out (LOBO) CV
 | Target | Joint Model R² | Env-Only Baseline R² | Cohen's d | p-value |
 |--------|----------------|----------------------|-----------|---------|
 | POC | 0.532 | 0.422 | 0.026 | 0.38 |
-| Chl-a | 0.516 | 0.561 | — | — |
-| NFLH | 0.560 | 0.700 | — | — |
-The POC improvement (+0.110 R²) was not statistically significant.
 ### 9-Fold Spatial Block CV (matching primary XGBoost design)
-| PFAM dim | XGB Baseline R² | VICReg R² | ΔR² |
-|----------|-----------------|-----------|-----|
-| pfam20 | 0.417 | −2.045 | −2.462 |
-| pfam32 | 0.417 | −4.217 | −4.634 |
-| pfam64 | 0.417 | −1.262 | −1.679 |
-The MLP architecture produced catastrophically negative R² on spatially distinctive held-out basins (Mediterranean, mid-Pacific), where distribution shift defeats shallow neural networks. XGBoost's tree-based partitioning handles this regime far more effectively at N ≈ 1,100.
-### Interpretation
-The poor performance under spatial CV is driven by the **architecture confound** (MLP vs. XGBoost for small tabular data), not necessarily by absence of the PFAM alignment signal. A fair comparison would require XGBoost-with-VICReg-embeddings, which was not evaluated. The XGBoost bidirectional framework was retained as the primary modeling approach.
 ## Repository Contents
-- Model checkpoints (`.pt` files) for each fold and PFAM dimensionality
-- Hyperparameter sweep results
-- Per-fold training curves and metrics
 ## Usage
 ```python
 import torch
-# Load a VICReg checkpoint
 checkpoint = torch.load(
-    "path/to/vicreg_checkpoint.pt",
     map_location="cpu",
     weights_only=False
 )
 state_dict = checkpoint["model_state_dict"]
 ```
-## Related Models
-- [TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional) — Primary bidirectional models used in the ELF-NET study
-- [algaGPT](https://huggingface.co/GreenGenomicsLab/algaGPT) — Protein classification model used for proteome extraction
 ## References
@@ -104,14 +106,14 @@ Green Genomics Lab, New York University Abu Dhabi
 ## Citation
 ```bibtex
-@article{elfnet2026,
-  title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
-  author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
-  journal={Forthcoming},
-  year={2026}
 }
 ```
 ## Contact
-Kourosh Salehi-Ashtiani — ksa3@nyu.edu

 # TARA-WorldModel-VICReg
+Joint environment-proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared 32-dimensional latent space.
+This model represents an exploratory methodological approach deposited for transparency and reproducibility. The XGBoost bidirectional framework ([TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional)) was retained as the primary modeling approach in the ELF-NET study.
+## Architecture
 ```
+Environment branch: Input(env_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
+Pfam branch:        Input(pfam_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
 ```
+| Property | Value |
+|----------|-------|
+| Latent dimension | 32 |
+| Parameters | ~53K--64K (varies with Pfam input dimensionality) |
+| VICReg loss weights | variance = 25.0, invariance = 25.0, covariance = 1.0 |
+| Prediction head alpha | 1.0 |
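The branch diagram and loss weights above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's `world_model.py` or `vicreg_loss.py`: the hidden width (256), the input widths (40 and 20), the `eps` value, and the names `BranchEncoder` and `vicreg_loss` are assumptions, and the three terms follow the general form of the published VICReg objective. The prediction head is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchEncoder(nn.Module):
    """Input(in_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(latent)."""

    def __init__(self, in_dim: int, hidden: int = 256, latent: int = 32):
        # hidden=256 is an assumed width; the model card does not state it.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, latent),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def vicreg_loss(z_a, z_b, var_w=25.0, inv_w=25.0, cov_w=1.0, eps=1e-4):
    """Weighted sum of variance, invariance, and covariance terms."""
    n, d = z_a.shape

    # Invariance: paired embeddings of the same sample should coincide.
    inv = F.mse_loss(z_a, z_b)

    def variance_term(z):
        # Hinge that pushes each latent dimension's std towards at least 1.
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(1.0 - std).mean()

    def covariance_term(z):
        # Penalise squared off-diagonal covariance to decorrelate dimensions.
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    var = variance_term(z_a) + variance_term(z_b)
    cov = covariance_term(z_a) + covariance_term(z_b)
    return var_w * var + inv_w * inv + cov_w * cov


# Random stand-in batch: 16 samples, 40 environmental and 20 Pfam features
# (both widths are hypothetical).
torch.manual_seed(0)
env_encoder = BranchEncoder(in_dim=40)
pfam_encoder = BranchEncoder(in_dim=20)
z_env = env_encoder(torch.randn(16, 40))
z_pfam = pfam_encoder(torch.randn(16, 20))
loss = vicreg_loss(z_env, z_pfam)
print(tuple(z_env.shape), float(loss))
```

Both branches map into the same 32-dimensional space, so the invariance term can compare them directly, while the variance and covariance terms are applied to each branch separately to discourage collapsed or redundant dimensions.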
+## Training Data
+| Property | Value |
+|----------|-------|
+| Source | 1,151 samples with complete productivity data (Chl-a, POC, NFLH) from 1,810 total TARA Oceans samples |
+| Environmental features | Google Earth Engine oceanographic variables |
+| Pfam features | CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions |
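The Pfam preprocessing named in the table (CLR transform followed by PCA) could look roughly like the sketch below. It assumes a samples-by-domains abundance matrix and a pseudocount of 1.0 to avoid `log(0)`; the pseudocount, the random stand-in data, and the helper names `clr_transform` and `pca_reduce` are illustrative assumptions, not taken from the ELF-NET pipeline.

```python
import numpy as np


def clr_transform(counts: np.ndarray, pseudocount: float = 1.0) -> np.ndarray:
    """Centered log-ratio: log abundance minus the per-sample mean log abundance."""
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def pca_reduce(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project column-centered data onto its leading principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical stand-in data: 100 samples x 300 Pfam domains.
rng = np.random.default_rng(0)
abundances = rng.poisson(lam=5.0, size=(100, 300)).astype(float)

clr = clr_transform(abundances)
pfam20 = pca_reduce(clr, 20)
print(clr.shape, pfam20.shape)
```

In a cross-validated setting, the PCA projection would normally be fitted on training folds only and then applied to the held-out fold, so that held-out samples do not influence the components.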
+## Performance
 ### 6-Fold Leave-One-Basin-Out (LOBO) CV
 | Target | Joint Model R² | Env-Only Baseline R² | Cohen's d | p-value |
 |--------|----------------|----------------------|-----------|---------|
 | POC | 0.532 | 0.422 | 0.026 | 0.38 |
+| Chl-a | 0.516 | 0.561 | -- | -- |
+| NFLH | 0.560 | 0.700 | -- | -- |
 ### 9-Fold Spatial Block CV (matching primary XGBoost design)
+| Pfam dim | XGB Baseline R² | VICReg R² | Delta R² |
+|----------|-----------------|-----------|----------|
+| pfam20 | 0.417 | -2.045 | -2.462 |
+| pfam32 | 0.417 | -4.217 | -4.634 |
+| pfam64 | 0.417 | -1.262 | -1.679 |
+The negative R² under spatial CV reflects the MLP architecture's sensitivity to distribution shift on spatially distinctive held-out basins (Mediterranean, mid-Pacific), a known limitation of shallow neural networks on small tabular datasets (N ~ 1,100). This is an architecture confound, not evidence against the Pfam alignment signal itself.
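For readers unfamiliar with negative R²: with R² = 1 - SS_res / SS_tot computed on a held-out fold, any model whose squared error exceeds that of simply predicting the fold's own mean scores below zero. The toy numbers below are invented purely to illustrate the arithmetic and are unrelated to the TARA results.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


held_out = [1.0, 2.0, 3.0]

# Predictions with the right shape but a constant offset of +3,
# as might happen under distribution shift.
shifted_pred = [4.0, 5.0, 6.0]

# Predicting the held-out mean for every sample.
mean_pred = [2.0, 2.0, 2.0]

r2_shifted = r_squared(held_out, shifted_pred)  # SS_res = 27, SS_tot = 2
r2_mean = r_squared(held_out, mean_pred)        # SS_res = SS_tot = 2
print(r2_shifted, r2_mean)  # -12.5 0.0
```

A systematic offset alone is enough to push R² far below zero when the held-out fold has little internal variance, which is why strongly negative values are possible even when predictions track the correct trend.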
 ## Repository Contents
+| Directory | Contents |
+|-----------|----------|
+| `checkpoints/` | 24 model checkpoints (4 hyperparameter configurations x 6 ocean basin folds) |
+| `scripts/` | Core training code (`train_world_model.py`, `vicreg_loss.py`, `world_model.py`) |
+| `results/` | Per-fold metrics, training curves, hyperparameter sweep results, permutation tests |
+| `config/` | Best hyperparameter configuration |
 ## Usage
 ```python
 import torch
 checkpoint = torch.load(
+    "checkpoints/20260127_111754/world_model_fold_Arctic_20260127_111754.pt",
     map_location="cpu",
     weights_only=False
 )
 state_dict = checkpoint["model_state_dict"]
 ```
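Once a checkpoint is loaded, listing the tensor names and shapes in `state_dict` is a quick way to confirm the layer sizes before rebuilding the model. The sketch below is self-contained and therefore saves and reloads a hypothetical stand-in network rather than a deposited checkpoint; the layer widths and the file name are assumptions, and the key names in the real checkpoints may differ.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for one encoder branch (widths are assumptions).
stand_in = nn.Sequential(
    nn.Linear(20, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 32),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stand_in_checkpoint.pt")
    # Mirror the "model_state_dict" key used in the loading snippet.
    torch.save({"model_state_dict": stand_in.state_dict()}, path)
    # This stand-in holds only tensors, so the restricted loader suffices.
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)

state_dict = checkpoint["model_state_dict"]

# Print each parameter tensor with its shape, then the total count.
for name, tensor in state_dict.items():
    print(f"{name:10s} {tuple(tensor.shape)}")

n_params = sum(t.numel() for t in state_dict.values())
print("total parameters:", n_params)
```

Note that `weights_only=False`, as used in the loading snippet, lets `torch.load` unpickle arbitrary Python objects, so it should only be used with checkpoints from a trusted source. Whether the deposited checkpoints also load with `weights_only=True` is not stated here.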
+## Related Resources
+| Resource | Link |
+|----------|------|
+| ELF-NET analysis pipeline (371 scripts, 15 modules) | [github.com/olympus-terminal/ELF-NET](https://github.com/olympus-terminal/ELF-NET) |
+| Bidirectional XGBoost models (primary approach) | [TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional) |
+| algaGPT protein classifier | [GreenGenomicsLab/algaGPT](https://huggingface.co/GreenGenomicsLab/algaGPT) |
+| Dark-whiteGPLM checkpoints | [SarahDaakour/dark-whiteGPLM](https://huggingface.co/SarahDaakour/dark-whiteGPLM) |
 ## References
 ## Citation
 ```bibtex
+@article{nelson2026elfnet,
+  title   = {Coupling of oceanographic state to the dark proteome: a foundation for genome-informed marine productivity modeling},
+  author  = {Nelson, David Roy and Plouviez, Maxence and Daakour, Sarah and Jaiswal, Ashish and Fu, Weiqi and Amin, Shady A. and Salehi-Ashtiani, Kourosh},
+  journal = {Forthcoming},
+  year    = {2026}
 }
 ```
 ## Contact
+Kourosh Salehi-Ashtiani -- ksa3@nyu.edu