Upload results/phase3_conclusion.txt with huggingface_hub
results/phase3_conclusion.txt (ADDED, +198 -0)
@@ -0,0 +1,198 @@
# Phase 3 Statistical Comparison: Three-Model Ablation
# ============================================================
#
# Provenance:
#   Script: /media/drn2/External/TARA-Oceans/03_analyses/WorldModelApp/scripts/t19_statistical_comparison_20260127_120453.py
#   Input:  /media/drn2/External/TARA-Oceans/03_analyses/WorldModelApp/results/phase3_baseline_performance.tsv
#           /media/drn2/External/TARA-Oceans/03_analyses/WorldModelApp/results/phase3_envembed_performance.tsv
#           /media/drn2/External/TARA-Oceans/03_analyses/WorldModelApp/results/phase3_joint_performance.tsv
#   Date: 2026-01-27 12:07:06
#   Integrity Check: PASSED
#   N_samples: 1810 (bio_valid=1151)
#   CV: Leave-one-basin-out (6 folds, Red_Sea merged into Indian)

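The leave-one-basin-out scheme named in the header can be sketched as follows. This is a minimal illustration, not the code from the provenance script: the function name, the `MERGE` mapping, and the toy sample list are assumptions, and only the Red_Sea to Indian merge and the count of 6 folds come from the header.

```python
from collections import defaultdict

# Assumption: basins are plain string labels. Only this merge is from the header.
MERGE = {"Red_Sea": "Indian"}


def leave_one_basin_out(basins):
    """Yield (held_out_basin, train_idx, test_idx) for each merged basin."""
    groups = defaultdict(list)
    for i, basin in enumerate(basins):
        groups[MERGE.get(basin, basin)].append(i)
    for held_out in sorted(groups):
        test_idx = groups[held_out]
        train_idx = sorted(
            i for basin, idx in groups.items() if basin != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Toy sample list (made up) that covers the seven raw basin labels.
basins = ["Arctic", "Atlantic", "Indian", "Red_Sea", "Mediterranean",
          "Pacific", "Southern", "Atlantic", "Pacific"]
folds = list(leave_one_basin_out(basins))
fold_names = [name for name, _, _ in folds]
```

Each fold holds out every sample from one basin, so the held-out R2 measures generalization to an unseen region rather than interpolation within a region.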
## 1. Overall Pooled R2 (n=1,151 bio_valid samples)

Target                (a) raw env    (b) z_env    (c) joint    (b)-(a)    (c)-(a)    (c)-(b)
-------------------- ------------ ------------ ------------ ---------- ---------- ----------
chl-a                       0.561        0.474        0.516     -0.087     -0.045     +0.042
POC                         0.422        0.513        0.532     +0.091     +0.110     +0.019
NFLH                        0.700        0.411        0.560     -0.290     -0.140     +0.149

## 2. Per-Fold R2 Values

### chl-a:
Fold                  (a) raw env    (b) z_env    (c) joint
-------------------- ------------ ------------ ------------
Arctic                     -0.147       -3.817       -7.303
Atlantic                    0.539        0.490        0.572
Indian                     -0.631       -0.110       -0.859
Mediterranean               0.963       -7.602       -4.620
Pacific                     0.529        0.414        0.317
Southern                   -0.867       -9.125       -3.796
Mean                        0.064       -3.292       -2.615
Std                         0.727        4.267        3.138

### POC:
Fold                  (a) raw env    (b) z_env    (c) joint
-------------------- ------------ ------------ ------------
Arctic                      0.463       -2.035       -4.793
Atlantic                    0.074        0.664        0.659
Indian                      0.731        0.142        0.154
Mediterranean               0.951       -5.461       -2.401
Pacific                     0.716        0.392        0.411
Southern                    0.255      -45.493      -23.638
Mean                        0.532       -8.632       -4.935
Std                         0.329       18.205        9.402

### NFLH:
Fold                  (a) raw env    (b) z_env    (c) joint
-------------------- ------------ ------------ ------------
Arctic                     -5.250       -3.333       -0.084
Atlantic                    0.783        0.652        0.628
Indian                      0.650        0.256        0.283
Mediterranean               0.300       -2.343       -0.917
Pacific                     0.718        0.397        0.641
Southern                    0.463        0.051        0.007
Mean                       -0.389       -0.720        0.093
Std                         2.388        1.681        0.580

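The pooled R2 values in Section 1 and the per-fold means in Section 2 differ sharply, so a short sketch may help. It assumes R2 is the usual coefficient of determination, 1 - SS_res/SS_tot, computed separately on each held-out fold (the report does not state this). Part 1 reproduces the Mean and Std rows for chl-a model (a), which match the sample standard deviation (ddof=1). Part 2 uses made-up numbers to show how a fold with a narrow target range can produce a strongly negative R2 while the pooled R2 stays high.

```python
import statistics


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Part 1: chl-a per-fold R2 for model (a), copied from the table above.
chl_a_raw = [-0.147, 0.539, -0.631, 0.963, 0.529, -0.867]
fold_mean = statistics.mean(chl_a_raw)
fold_std = statistics.stdev(chl_a_raw)  # sample std, ddof=1

# Part 2: synthetic data (hypothetical, not from the study).
# A fold with a wide target range and small errors ...
wide_y, wide_pred = [0.0, 10.0, 20.0, 30.0], [1.0, 9.0, 21.0, 29.0]
# ... and a fold with a narrow target range and a constant offset error.
narrow_y, narrow_pred = [4.9, 5.0, 5.1], [6.0, 6.0, 6.0]

r2_wide = r2(wide_y, wide_pred)
r2_narrow = r2(narrow_y, narrow_pred)
r2_pooled = r2(wide_y + narrow_y, wide_pred + narrow_pred)
```

In the synthetic case the narrow fold has almost no target variance, so an error of about one unit is large relative to SS_tot and R2 collapses, even though the same errors barely affect the pooled score.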
## 3. Paired Statistical Tests (6 folds)

Comparison                Target    mean_diff        t      p_t      p_W        d     sign
------------------------- -------- ---------- -------- -------- -------- -------- --------
(b) vs (a)                chl-a        -3.356    -1.96    0.107    0.156    -0.80    1/5/0
(b) vs (a)                POC          -9.164    -1.24    0.270    0.094    -0.51    1/5/0
(b) vs (a)                NFLH         -0.330    -0.56    0.600    0.312    -0.23    1/5/0
(c) vs (a)                chl-a        -2.679    -2.12    0.088    0.062    -0.86    1/5/0
(c) vs (a)                POC          -5.467    -1.44    0.209    0.156    -0.59    1/5/0
(c) vs (a)                NFLH         +0.482    +0.51    0.634    0.438    +0.21    1/5/0
(c) vs (b)                chl-a        +0.677    +0.54    0.613    1.000    +0.22    3/3/0
(c) vs (b)                POC          +3.697    +1.00    0.365    0.312    +0.41    4/2/0
(c) vs (b)                NFLH         +0.813    +1.51    0.191    0.219    +0.62    4/2/0

* = significant at alpha=0.05
sign = folds favoring new/reference/tied
d = Cohen's d (positive = new model better)

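As a check on how the columns above relate, the following sketch recomputes mean_diff, t, d, and sign for the (b) vs (a) chl-a row from the per-fold R2 values in Section 2. It assumes d is the paired effect size (mean difference divided by the sample standard deviation of the differences), which is consistent with the tabulated values but is not stated in the report. The p-values are omitted because they need a t-distribution CDF.

```python
import math
import statistics

# Per-fold chl-a R2 from Section 2, in the order
# Arctic, Atlantic, Indian, Mediterranean, Pacific, Southern.
a = [-0.147, 0.539, -0.631, 0.963, 0.529, -0.867]   # (a) raw env
b = [-3.817, 0.490, -0.110, -7.602, 0.414, -9.125]  # (b) z_env

diffs = [bi - ai for ai, bi in zip(a, b)]  # new minus reference
n = len(diffs)

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)             # sample std, ddof=1
t_stat = mean_diff / (sd_diff / math.sqrt(n)) # paired t statistic, df = n - 1
cohen_d = mean_diff / sd_diff                 # assumed definition of d

# sign = folds favoring new / reference / tied
sign = (
    sum(d > 0 for d in diffs),
    sum(d < 0 for d in diffs),
    sum(d == 0 for d in diffs),
)
```

Under this definition t equals d times sqrt(n), so with 6 folds even a large |d| of 0.8 gives |t| of roughly 1.96.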
## 4. Mean Difference 95% Confidence Intervals

Comparison                Target    mean_diff     CI_low    CI_high Contains 0?
------------------------- -------- ---------- ---------- ---------- ------------
(b) vs (a)                chl-a        -3.356     -7.752     +1.040 Yes
(b) vs (a)                POC          -9.164    -28.154     +9.826 Yes
(b) vs (a)                NFLH         -0.330     -1.848     +1.187 Yes
(c) vs (a)                chl-a        -2.679     -5.930     +0.572 Yes
(c) vs (a)                POC          -5.467    -15.212     +4.279 Yes
(c) vs (a)                NFLH         +0.482     -1.963     +2.928 Yes
(c) vs (b)                chl-a        +0.677     -2.550     +3.904 Yes
(c) vs (b)                POC          +3.697     -5.836    +13.230 Yes
(c) vs (b)                NFLH         +0.813     -0.570     +2.196 Yes

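The intervals above are consistent with a standard t-based interval on the per-fold differences. The sketch below recomputes the (c) vs (b) NFLH row under that assumption. The critical value 2.5706 is the two-sided 95% t quantile for 5 degrees of freedom, hard-coded here to keep the example free of third-party dependencies.

```python
import math
import statistics

# Per-fold NFLH R2 from Section 2, in the order
# Arctic, Atlantic, Indian, Mediterranean, Pacific, Southern.
b = [-3.333, 0.652, 0.256, -2.343, 0.397, 0.051]  # (b) z_env
c = [-0.084, 0.628, 0.283, -0.917, 0.641, 0.007]  # (c) joint

T_CRIT_DF5 = 2.5706  # t quantile at 0.975 with df = 5

diffs = [ci - bi for bi, ci in zip(b, c)]
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))

ci_low = mean_diff - T_CRIT_DF5 * std_err
ci_high = mean_diff + T_CRIT_DF5 * std_err
contains_zero = ci_low <= 0.0 <= ci_high
```

With only 5 degrees of freedom the multiplier is about 2.57 rather than 1.96, which is one reason every interval in the table spans zero.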
## 5. Effect Size Interpretation

Cohen's d conventions: |d| < 0.2 = negligible, 0.2-0.5 = small, 0.5-0.8 = medium, >= 0.8 = large

(b) vs (a) chl-a : d=-0.80 (large, favors baseline)
(b) vs (a) POC   : d=-0.51 (medium, favors baseline)
(b) vs (a) NFLH  : d=-0.23 (small, favors baseline)
(c) vs (a) chl-a : d=-0.86 (large, favors baseline)
(c) vs (a) POC   : d=-0.59 (medium, favors baseline)
(c) vs (a) NFLH  : d=+0.21 (small, favors joint)
(c) vs (b) chl-a : d=+0.22 (small, favors joint)
(c) vs (b) POC   : d=+0.41 (small, favors joint)
(c) vs (b) NFLH  : d=+0.62 (medium, favors joint)

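The labelling rule can be written as a small function. This is a sketch under two assumptions that the report does not confirm: the bins are half-open with the lower bound included, and labels are assigned from the unrounded d (recomputing the (b) vs (a) chl-a row from the per-fold table gives about -0.801, which explains why a value printed as -0.80 is labelled large).

```python
def effect_label(d):
    """Map Cohen's d to a size label using the conventions listed above."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Values taken from the list above; -0.8012 is the recomputed, unrounded d.
labels = {d: effect_label(d) for d in (-0.8012, -0.51, -0.23, 0.21, 0.62)}
```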
## 6. Sensitivity Analysis (excluding Arctic, 5 folds)

Comparison                          Target    mean_diff        t      p_t        d
----------------------------------- -------- ---------- -------- -------- --------
(b) vs (a) excl_Arctic              chl-a        -3.293    -1.57    0.191    -0.70
(b) vs (a) excl_Arctic              POC         -10.497    -1.18    0.304    -0.53
(b) vs (a) excl_Arctic              NFLH         -0.780    -1.67    0.171    -0.74
(c) vs (a) excl_Arctic              chl-a        -1.784    -1.63    0.178    -0.73
(c) vs (a) excl_Arctic              POC          -5.509    -1.19    0.301    -0.53
(c) vs (a) excl_Arctic              NFLH         -0.454    -2.24    0.089    -1.00
(c) vs (b) excl_Arctic              chl-a        +1.509    +1.31    0.260    +0.59
(c) vs (b) excl_Arctic              POC          +4.988    +1.17    0.306    +0.52
(c) vs (b) excl_Arctic              NFLH         +0.326    +1.16    0.309    +0.52

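The effect of dropping one fold can be sketched for the (c) vs (a) NFLH comparison, where it matters most: the 6-fold mean difference is +0.482 (Section 3) but becomes -0.454 once Arctic is excluded. The helper name below is hypothetical, and the data are the per-fold R2 values from Section 2.

```python
import statistics

FOLDS = ["Arctic", "Atlantic", "Indian", "Mediterranean", "Pacific", "Southern"]
a = [-5.250, 0.783, 0.650, 0.300, 0.718, 0.463]   # NFLH, (a) raw env
c = [-0.084, 0.628, 0.283, -0.917, 0.641, 0.007]  # NFLH, (c) joint


def mean_diff_excluding(excluded):
    """Mean of (c) - (a) over all folds except `excluded`."""
    diffs = [ci - ai for f, ai, ci in zip(FOLDS, a, c) if f != excluded]
    return statistics.mean(diffs)


full = statistics.mean([ci - ai for ai, ci in zip(a, c)])
by_fold = {f: mean_diff_excluding(f) for f in FOLDS}
```

The sign flip comes from a single fold: on Arctic the baseline scores -5.250 against -0.084 for the joint model, a gap of about 5.2 R2 units that outweighs the five folds where the baseline is ahead.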
## 7. Sign Consistency Analysis

(b) vs (a) fold-level: 3/18 comparisons favor (b) (16.7%)
(c) vs (a) fold-level: 3/18 comparisons favor (c) (16.7%)

Detailed per-fold sign for (b) vs (a):
  chl-a : Arctic=a>b, Atlantic=a>b, Indian=b>a, Mediterranean=a>b, Pacific=a>b, Southern=a>b
  POC   : Arctic=a>b, Atlantic=b>a, Indian=a>b, Mediterranean=a>b, Pacific=a>b, Southern=a>b
  NFLH  : Arctic=b>a, Atlantic=a>b, Indian=a>b, Mediterranean=a>b, Pacific=a>b, Southern=a>b

Detailed per-fold sign for (c) vs (a):
  chl-a : Arctic=a>c, Atlantic=c>a, Indian=a>c, Mediterranean=a>c, Pacific=a>c, Southern=a>c
  POC   : Arctic=a>c, Atlantic=c>a, Indian=a>c, Mediterranean=a>c, Pacific=a>c, Southern=a>c
  NFLH  : Arctic=c>a, Atlantic=a>c, Indian=a>c, Mediterranean=a>c, Pacific=a>c, Southern=a>c

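The 3/18 counts can be reproduced directly from the per-fold tables in Section 2. The sketch below is illustrative; the dictionary layout and the `wins` helper are assumptions, not the structure used by the original script.

```python
FOLDS = ["Arctic", "Atlantic", "Indian", "Mediterranean", "Pacific", "Southern"]

# Per-fold R2 from Section 2, keyed by target and model.
R2 = {
    "chl-a": {"a": [-0.147, 0.539, -0.631, 0.963, 0.529, -0.867],
              "b": [-3.817, 0.490, -0.110, -7.602, 0.414, -9.125],
              "c": [-7.303, 0.572, -0.859, -4.620, 0.317, -3.796]},
    "POC":   {"a": [0.463, 0.074, 0.731, 0.951, 0.716, 0.255],
              "b": [-2.035, 0.664, 0.142, -5.461, 0.392, -45.493],
              "c": [-4.793, 0.659, 0.154, -2.401, 0.411, -23.638]},
    "NFLH":  {"a": [-5.250, 0.783, 0.650, 0.300, 0.718, 0.463],
              "b": [-3.333, 0.652, 0.256, -2.343, 0.397, 0.051],
              "c": [-0.084, 0.628, 0.283, -0.917, 0.641, 0.007]},
}


def wins(new, ref):
    """Count fold-target comparisons where `new` has strictly higher R2."""
    return sum(
        n > r
        for target in R2.values()
        for n, r in zip(target[new], target[ref])
    )


total = len(R2) * len(FOLDS)
b_wins = wins("b", "a")
c_wins = wins("c", "a")
b_share = round(100 * b_wins / total, 1)
```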
## 8. OUTCOME DETERMINATION

============================================================
OUTCOME: MODERATE CASE (PRD Section 9.2)
============================================================

Joint embedding captures meaningful structure; genomic layer adds marginal or target-specific signal over environment.

### Evidence Summary:

1. POOLED R2 (n=1,151):
   - chl-a: Best = baseline (R2=0.561)
   - POC: Best = joint (R2=0.532)
   - NFLH: Best = baseline (R2=0.700)

2. KEY TEST -- (b) vs (a) [does VICReg co-training improve env encoder?], pooled R2 deltas:
   - chl-a: -0.087 R2 (degradation)
   - POC: +0.091 R2 (improvement)
   - NFLH: -0.290 R2 (degradation)

3. (c) vs (b) [does z_pfam add to z_env?], pooled R2 deltas:
   - chl-a: +0.042 R2 (yes)
   - POC: +0.019 R2 (yes)
   - NFLH: +0.149 R2 (yes)

4. STATISTICAL SIGNIFICANCE:
   - No comparisons reach significance at alpha=0.05 (with only 6 folds, power is very limited)

5. SIGN CONSISTENCY:
   - (b) beats (a): 16.7% of fold-target comparisons (3/18)
   - (c) beats (a): 16.7% of fold-target comparisons (3/18)

## 9. Nuanced Interpretation

While the overall outcome is MODERATE CASE, several nuances deserve attention:

a) POC IMPROVEMENT: Model (b) envembed OUTPERFORMS baseline for POC on pooled R2
   (0.513 vs 0.422, delta=+0.091).
   On this pooled metric, VICReg co-training with PFAM modules improves the
   environment encoder for particulate organic carbon prediction. Per fold, only
   Atlantic favors (b) for POC (Section 7), so this is a target-specific success
   on the pooled metric only.

b) z_pfam ADDS COMPLEMENTARY INFORMATION: Model (c) beats (b) on pooled R2 for all 3 targets.
   Pooled deltas: chl-a=+0.042, POC=+0.019, NFLH=+0.149
   Per fold, (c) beats (b) in 3/6, 4/6, and 4/6 folds respectively (Section 3).
   The PFAM encoder appears to capture information not in the environment encoder.

c) XGBoost vs MLP CONFOUND: The baseline uses XGBoost (300 trees, max_depth=6)
   while models (b) and (c) use 2-layer MLPs. XGBoost is typically a stronger learner
   for tabular data at this sample size (N=1,151 bio_valid). An apples-to-apples
   comparison would use the same architecture for all models.

d) FOLD INSTABILITY: Mediterranean, Southern, and Arctic folds show catastrophically
   negative R2 for models (b) and (c) on chl-a and POC (as low as -45.493 for POC on
   the Southern fold). For NFLH the Southern fold stays near zero (0.051 and 0.007).
   These enclosed/polar basins appear too distinct for cross-basin generalization via
   MLP. XGBoost handles this distribution shift better on most folds, plausibly through
   tree-based partitioning, although the baseline itself drops to R2=-5.250 on the
   Arctic fold for NFLH.

e) LIMITED STATISTICAL POWER: With only 6 CV folds, the minimum achievable two-sided
   Wilcoxon p-value is 0.031 (2/2^6, reached only when all 6 folds agree). The paired
   t-test has just 5 degrees of freedom, so it needs |t| > 2.571 to reach alpha=0.05.
   Several comparisons show medium-to-large effect sizes but fail to reach
   significance due to the fold-count limitation.

f) VICReg DOMINANCE: Val VICReg loss (~38-44) >> val pred loss (~0.2-3.2). The encoder
   is primarily optimized for alignment, not productivity prediction. A two-stage
   approach (train VICReg first, then fine-tune for prediction) might yield better results.

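The Wilcoxon floor mentioned in item (e) can be illustrated with an exact signed-rank test that enumerates all sign patterns. This is a sketch that assumes no zero differences and no tied absolute differences, which holds for the per-fold values used here. It is not the routine used by the original script.

```python
from itertools import product


def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided signed-rank p-value; assumes no zero or tied |diffs|."""
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w_obs = min(w_plus, w_minus)
    # Under H0 each of the 2^n sign patterns is equally likely.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) <= w_obs
    )
    return min(1.0, 2.0 * count / 2 ** n)


# (b) vs (a) chl-a per-fold R2 from Section 2.
a = [-0.147, 0.539, -0.631, 0.963, 0.529, -0.867]
b = [-3.817, 0.490, -0.110, -7.602, 0.414, -9.125]
p_chl_a = wilcoxon_exact_two_sided([bi - ai for ai, bi in zip(a, b)])

# Best case with 6 folds: every difference has the same sign.
p_floor = wilcoxon_exact_two_sided([-1.0, -2.0, -3.0, -4.0, -5.0, -6.0])
```

With 6 folds there are only 64 sign patterns, so the two-sided p-value moves in steps of 1/32 and cannot fall below 2/64.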