GreenGenomicsLab committed
Commit d388ef7 · verified · 1 Parent(s): 3e651cb

Upload README.md with huggingface_hub

Files changed (1): README.md (+46 -44)
@@ -16,80 +16,82 @@ pipeline_tag: tabular-regression
  # TARA-WorldModel-VICReg

- Joint environment–proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared latent space.

- ## Model Description

- ### Architecture
- Two-layer MLP encoder with VICReg alignment loss:

  ```
- Environment branch: Input(env_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
- PFAM branch: Input(pfam_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
  ```

- - **Latent dimension:** 32
- - **Parameters:** ~53K–64K depending on PFAM input dimensionality
- - **VICReg loss weights:** variance = 25.0, invariance = 25.0, covariance = 1.0
- - **Prediction head alpha:** 1.0
 
- ### Training Data
- - **Source:** 1,151 samples with complete productivity data (chlorophyll-a, POC, NFLH) from 1,810 total TARA Oceans samples
- - **Environmental features:** Google Earth Engine oceanographic variables
- - **PFAM features:** CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions

- ## Performance

- This model represents an exploratory methodological approach that did not outperform the primary XGBoost bidirectional framework used in the ELF-NET study. Results are reported for transparency and reproducibility.
 
  ### 6-Fold Leave-One-Basin-Out (LOBO) CV

  | Target | Joint Model R² | Env-Only Baseline R² | Cohen's d | p-value |
  |--------|----------------|----------------------|-----------|---------|
  | POC | 0.532 | 0.422 | 0.026 | 0.38 |
- | Chl-a | 0.516 | 0.561 | | |
- | NFLH | 0.560 | 0.700 | | |
-
- The POC improvement (+0.110 R²) was not statistically significant.
 
  ### 9-Fold Spatial Block CV (matching primary XGBoost design)

- | PFAM dim | XGB Baseline R² | VICReg R² | ΔR² |
- |----------|-----------------|-----------|-----|
- | pfam20 | 0.417 | −2.045 | −2.462 |
- | pfam32 | 0.417 | −4.217 | −4.634 |
- | pfam64 | 0.417 | −1.262 | −1.679 |
-
- The MLP architecture produced catastrophically negative R² on spatially distinctive held-out basins (Mediterranean, mid-Pacific), where distribution shift defeats shallow neural networks. XGBoost's tree-based partitioning handles this regime far more effectively at N ≈ 1,100.
-
- ### Interpretation

- The poor performance under spatial CV is driven by the **architecture confound** (MLP vs. XGBoost for small tabular data), not necessarily by absence of the PFAM alignment signal. A fair comparison would require XGBoost-with-VICReg-embeddings, which was not evaluated. The XGBoost bidirectional framework was retained as the primary modeling approach.
 
  ## Repository Contents

- - Model checkpoints (`.pt` files) for each fold and PFAM dimensionality
- - Hyperparameter sweep results
- - Per-fold training curves and metrics

  ## Usage

  ```python
  import torch

- # Load a VICReg checkpoint
  checkpoint = torch.load(
- "path/to/vicreg_checkpoint.pt",
  map_location="cpu",
  weights_only=False
  )
  state_dict = checkpoint["model_state_dict"]
  ```
 
- ## Related Models

- - [TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional) Primary bidirectional models used in the ELF-NET study
- - [algaGPT](https://huggingface.co/GreenGenomicsLab/algaGPT) — Protein classification model used for proteome extraction

  ## References
 
@@ -104,14 +106,14 @@ Green Genomics Lab, New York University Abu Dhabi
  ## Citation

  ```bibtex
- @article{elfnet2026,
- title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
- author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
- journal={Forthcoming},
- year={2026}
  }
  ```

  ## Contact

- Kourosh Salehi-Ashtiani ksa3@nyu.edu
 
  # TARA-WorldModel-VICReg

+ Joint environment-proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared 32-dimensional latent space.

+ This model represents an exploratory methodological approach deposited for transparency and reproducibility. The XGBoost bidirectional framework ([TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional)) was retained as the primary modeling approach in the ELF-NET study.

+ ## Architecture

  ```
+ Environment branch: Input(env_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
+ Pfam branch: Input(pfam_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)
  ```

+ | Property | Value |
+ |----------|-------|
+ | Latent dimension | 32 |
+ | Parameters | ~53K--64K (varies with Pfam input dimensionality) |
+ | VICReg loss weights | variance = 25.0, invariance = 25.0, covariance = 1.0 |
+ | Prediction head alpha | 1.0 |
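The three loss weights listed for the model map onto the three VICReg terms. As a rough illustration, assuming the commonly used formulation (invariance as mean squared error between paired embeddings, variance as a hinge on each dimension's standard deviation, covariance as the squared off-diagonal covariance), the weighted sum can be sketched in plain Python. The function name, argument names, and `eps` value below are illustrative assumptions and are not taken from the repository's `vicreg_loss.py`; a practical implementation would operate on tensors rather than lists.

```python
import math


def vicreg_loss(za, zb, w_var=25.0, w_inv=25.0, w_cov=1.0, eps=1e-4):
    """VICReg-style loss for two batches of paired embeddings (lists of rows)."""
    n, d = len(za), len(za[0])

    # Invariance: mean squared error between the paired embeddings.
    inv = sum((a - b) ** 2 for ra, rb in zip(za, zb) for a, b in zip(ra, rb)) / (n * d)

    def covariance(z):
        # Center each dimension, then compute the d x d sample covariance.
        means = [sum(row[j] for row in z) / n for j in range(d)]
        zc = [[row[j] - means[j] for j in range(d)] for row in z]
        return [[sum(r[i] * r[j] for r in zc) / (n - 1) for j in range(d)]
                for i in range(d)]

    def var_term(cov):
        # Hinge that penalizes dimensions whose standard deviation is below 1.
        return sum(max(0.0, 1.0 - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d

    def cov_term(cov):
        # Penalize correlation between different embedding dimensions.
        return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d

    cov_a, cov_b = covariance(za), covariance(zb)
    var = (var_term(cov_a) + var_term(cov_b)) / 2
    cov = cov_term(cov_a) + cov_term(cov_b)
    return w_var * var + w_inv * inv + w_cov * cov
```

With the default weights, two identical batches whose dimensions are uncorrelated and have standard deviation above 1 incur no penalty, while fully collapsed embeddings are penalized almost entirely through the variance term.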
 
+ ## Training Data

+ | Property | Value |
+ |----------|-------|
+ | Source | 1,151 samples with complete productivity data (Chl-a, POC, NFLH) from 1,810 total TARA Oceans samples |
+ | Environmental features | Google Earth Engine oceanographic variables |
+ | Pfam features | CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions |
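For readers unfamiliar with the CLR (centered log-ratio) step applied to the Pfam abundances, a minimal sketch follows. The pseudocount used to handle zero counts is an assumption made here for illustration; the README does not say how zeros were treated, and the subsequent PCA reduction is not shown.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's abundance vector.

    The pseudocount is an illustrative assumption for handling zeros.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is equivalent to dividing by the geometric mean.
    return [v - mean_log for v in logs]


# Toy abundance vector (made-up numbers, not TARA data).
transformed = clr([10, 0, 5, 85])
```

A CLR-transformed vector always sums to zero (up to floating-point error), which is a quick sanity check on any implementation.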
 
+ ## Performance

  ### 6-Fold Leave-One-Basin-Out (LOBO) CV

  | Target | Joint Model R² | Env-Only Baseline R² | Cohen's d | p-value |
  |--------|----------------|----------------------|-----------|---------|
  | POC | 0.532 | 0.422 | 0.026 | 0.38 |
+ | Chl-a | 0.516 | 0.561 | -- | -- |
+ | NFLH | 0.560 | 0.700 | -- | -- |
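Leave-one-basin-out CV holds out every sample from one ocean basin per fold and trains on the rest. The generator below is a small sketch of that splitting logic; the function name and the basin labels in the example are hypothetical, and the repository's own fold assignment may differ.

```python
from collections import defaultdict


def leave_one_basin_out(basins):
    """Yield (held_out_basin, train_indices, test_indices) per distinct basin."""
    by_basin = defaultdict(list)
    for i, basin in enumerate(basins):
        by_basin[basin].append(i)
    for held_out, test_idx in by_basin.items():
        # Train on every sample that is not from the held-out basin.
        train_idx = [i for i, basin in enumerate(basins) if basin != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical per-sample basin labels.
labels = ["Arctic", "Arctic", "Mediterranean", "Pacific", "Mediterranean"]
folds = list(leave_one_basin_out(labels))
```

Each basin appears as the test set exactly once, so six basins give the six folds reported in the table.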
 
  ### 9-Fold Spatial Block CV (matching primary XGBoost design)

+ | Pfam dim | XGB Baseline R² | VICReg R² | Delta R² |
+ |----------|-----------------|-----------|----------|
+ | pfam20 | 0.417 | -2.045 | -2.462 |
+ | pfam32 | 0.417 | -4.217 | -4.634 |
+ | pfam64 | 0.417 | -1.262 | -1.679 |

+ The negative R² under spatial CV reflects the MLP architecture's sensitivity to distribution shift on spatially distinctive held-out basins (Mediterranean, mid-Pacific), a known limitation of shallow neural networks on small tabular datasets (N ~ 1,100). This is an architecture confound, not evidence against the Pfam alignment signal itself.
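R² values below zero, as in the spatial CV table, mean the model's predictions on the held-out fold were worse than simply predicting the fold mean. The sketch below uses the standard definition R² = 1 - SS_res / SS_tot with made-up numbers (not TARA data) to show how a systematic offset produces a strongly negative score, and recomputes the Delta R² column as VICReg R² minus the XGB baseline.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy held-out fold: predictions follow the trend but are offset by +3,
# as can happen when a model is calibrated to a different distribution.
y_true = [1.0, 2.0, 3.0]
y_pred = [4.0, 5.0, 6.0]
print(r2_score(y_true, y_pred))  # -12.5

# Delta R² for the pfam20 row of the table.
print(round(-2.045 - 0.417, 3))  # -2.462
```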
 
  ## Repository Contents

+ | Directory | Contents |
+ |-----------|----------|
+ | `checkpoints/` | 24 model checkpoints (4 hyperparameter configurations x 6 ocean basin folds) |
+ | `scripts/` | Core training code (`train_world_model.py`, `vicreg_loss.py`, `world_model.py`) |
+ | `results/` | Per-fold metrics, training curves, hyperparameter sweep results, permutation tests |
+ | `config/` | Best hyperparameter configuration |
 
  ## Usage

  ```python
  import torch

  checkpoint = torch.load(
+ "checkpoints/20260127_111754/world_model_fold_Arctic_20260127_111754.pt",
  map_location="cpu",
  weights_only=False
  )
  state_dict = checkpoint["model_state_dict"]
  ```
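When working with several of the checkpoints, it can help to recover the held-out fold and run timestamp from the file name. The helper below is a sketch that assumes every checkpoint follows the pattern `world_model_fold_<fold>_<YYYYMMDD>_<HHMMSS>.pt`; that pattern is inferred from the single example path in the usage snippet and has not been verified against the other files, and `parse_checkpoint_name` is a hypothetical helper, not part of the repository.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed naming pattern, inferred from one example path only.
_CKPT_RE = re.compile(r"world_model_fold_(?P<fold>.+)_(?P<stamp>\d{8}_\d{6})\.pt$")


def parse_checkpoint_name(path):
    """Return (fold_name, run_timestamp) parsed from a checkpoint path."""
    match = _CKPT_RE.search(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {path}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%d_%H%M%S")
    return match["fold"], stamp


fold, stamp = parse_checkpoint_name(
    "checkpoints/20260127_111754/world_model_fold_Arctic_20260127_111754.pt"
)
```

Raising on an unrecognized name, rather than returning a default, keeps a differently named file from being silently assigned to the wrong fold.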
 
+ ## Related Resources

+ | Resource | Link |
+ |----------|------|
+ | ELF-NET analysis pipeline (371 scripts, 15 modules) | [github.com/olympus-terminal/ELF-NET](https://github.com/olympus-terminal/ELF-NET) |
+ | Bidirectional XGBoost models (primary approach) | [TARA-XGBoost-Bidirectional](https://huggingface.co/GreenGenomicsLab/TARA-XGBoost-Bidirectional) |
+ | algaGPT protein classifier | [GreenGenomicsLab/algaGPT](https://huggingface.co/GreenGenomicsLab/algaGPT) |
+ | Dark-whiteGPLM checkpoints | [SarahDaakour/dark-whiteGPLM](https://huggingface.co/SarahDaakour/dark-whiteGPLM) |

  ## References
 
 
  ## Citation

  ```bibtex
+ @article{nelson2026elfnet,
+ title = {Coupling of oceanographic state to the dark proteome: a foundation for genome-informed marine productivity modeling},
+ author = {Nelson, David Roy and Plouviez, Maxence and Daakour, Sarah and Jaiswal, Ashish and Fu, Weiqi and Amin, Shady A. and Salehi-Ashtiani, Kourosh},
+ journal = {Forthcoming},
+ year = {2026}
  }
  ```

  ## Contact

+ Kourosh Salehi-Ashtiani -- ksa3@nyu.edu