TARA-XGBoost-Bidirectional
Bidirectional XGBoost ensemble models linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.
Model Description
Forward Models (Environment → Pfam)
- Input: 32 Google Earth Engine oceanographic variables
- Output: CLR-transformed abundance of 100 top-variance Pfam domains
- Architecture: 100 independent XGBoost regressors (one per target domain)
- Performance: Median test R² = 0.20 (IQR: 0.16–0.29; max R² = 0.54)
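The centered log-ratio (CLR) transform applied to the forward-model targets maps each abundance to its log value minus the mean log value across all domains in the same sample. A minimal sketch in plain Python is shown here; the `pseudocount` used to avoid taking the log of zero is an assumption, since this card does not state how zero abundances were handled, and the sample values are hypothetical.

```python
import math

def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's abundance vector.

    pseudocount is an assumed value for handling zeros; the model card
    does not specify one.
    """
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the vector so its values sum to zero
    return [v - mean_log for v in logs]

# Hypothetical abundances for four domains in one sample
sample = [10.0, 20.0, 30.0, 40.0]
transformed = clr(sample)
```

By construction the transformed values of a sample sum to zero, so each forward model predicts a domain's abundance relative to the sample's geometric mean rather than an absolute count.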
Reverse Models (Pfam → Environment)
- Input: 9,611 Pfam domain abundances, reduced to 100 PCA components (72.2% of variance retained)
- Output: 31 environmental variables
- Architecture: 31 independent XGBoost regressors (one per target variable)
- Key results under 10-fold spatial block CV:
  - SST: R² = 0.38 (spatial block CV; R² = 0.61 on 80/20 split)
  - Bathymetry: R² = 0.42 (spatial block CV; R² = 0.53 on 80/20 split)
- Cross-basin validation (bathymetry): R² = 0.25
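Spatial block CV scores lower than a random 80/20 split because nearby samples, which tend to share environmental conditions, are never divided between training and test sets. The sketch below illustrates one way to build such folds: samples are binned into latitude/longitude blocks and whole blocks are assigned to folds. The block size (`block_deg`) and the greedy balancing rule are assumptions for illustration, not the procedure used to produce the numbers above, and the coordinates are hypothetical.

```python
import math
from collections import defaultdict

def spatial_block_folds(coords, block_deg=10.0, n_folds=10):
    """Assign each sample to a fold so that samples in the same
    latitude/longitude block always share a fold.

    coords: list of (lat, lon) tuples in degrees.
    block_deg and the greedy balancing are assumed, illustrative choices.
    """
    blocks = defaultdict(list)
    for idx, (lat, lon) in enumerate(coords):
        key = (math.floor(lat / block_deg), math.floor(lon / block_deg))
        blocks[key].append(idx)

    folds = [None] * len(coords)
    sizes = [0] * n_folds
    # Place the largest blocks first, each into the currently smallest fold
    ordered = sorted(blocks.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, members in ordered:
        target = sizes.index(min(sizes))
        for idx in members:
            folds[idx] = target
        sizes[target] += len(members)
    return folds

# Hypothetical sampling coordinates; the first two fall in the same block
coords = [(24.5, 54.4), (25.1, 55.0), (-33.9, 18.4), (40.7, -74.0)]
folds = spatial_block_folds(coords, block_deg=10.0, n_folds=3)
```

Each fold index can then serve as the held-out group in turn, with all other folds used for training.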
XGBoost Hyperparameters
| Parameter | Value |
|---|---|
| n_estimators | 200 |
| max_depth | 6 |
| learning_rate | 0.1 |
| subsample | 0.8 |
| colsample_bytree | 0.8 |
| min_child_weight | 3 |
| reg_alpha | 0.1 |
| reg_lambda | 1.0 |
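The table above can be expressed as keyword arguments to `xgboost.XGBRegressor`. The following is a sketch of how one regressor per target might be configured and fitted with these values; the helper names `make_regressor` and `fit_per_target` are hypothetical, and any settings not listed in the table (for example a random seed) are left at library defaults because this card does not state them.

```python
# Values copied from the hyperparameter table
XGB_PARAMS = {
    "n_estimators": 200,
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
}

def make_regressor(params=XGB_PARAMS):
    """Build one regressor with the tabulated settings (requires xgboost)."""
    # Imported lazily so the parameter dict is usable without xgboost installed
    from xgboost import XGBRegressor
    return XGBRegressor(**params)

def fit_per_target(X, targets, params=XGB_PARAMS):
    """Fit one independent regressor per target.

    targets: dict mapping a target name to its 1-D array of values.
    """
    models = {}
    for name, y in targets.items():
        model = make_regressor(params)
        model.fit(X, y)
        models[name] = model
    return models
```

Fitting targets independently mirrors the "one regressor per target" architecture described for both model directions.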
Files
| File | Size | Description |
|---|---|---|
| xgboost_forward_models_20260124_104452.joblib | 49 MB | 100 forward models |
| xgboost_reverse_models_20260124_104452.joblib | 15 MB | 31 reverse models |
| model_manifest_20260124_104452.json | - | Feature lists and hyperparameters |
Usage
import joblib
# Load reverse models (Pfam → Environment)
reverse_bundle = joblib.load("xgboost_reverse_models_20260124_104452.joblib")
# Load forward models (Environment → Pfam)
forward_bundle = joblib.load("xgboost_forward_models_20260124_104452.joblib")
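This card does not document the internal layout of the loaded bundles, so it can help to inspect them before calling `predict`. The sketch below summarizes a loaded object without assuming its structure and reads the JSON manifest for feature lists; `describe_bundle` and `load_manifest` are hypothetical helper names, and the stand-in dictionary only illustrates the output shape.

```python
import json

def describe_bundle(bundle, max_keys=5):
    """Summarize a loaded object without assuming its layout."""
    summary = {"type": type(bundle).__name__}
    if isinstance(bundle, dict):
        summary["n_items"] = len(bundle)
        summary["sample_keys"] = [str(k) for k in list(bundle)[:max_keys]]
    elif isinstance(bundle, (list, tuple)):
        summary["n_items"] = len(bundle)
    return summary

def load_manifest(path):
    """Read the JSON manifest that lists features and hyperparameters."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)

# Stand-in object for illustration; a real bundle comes from joblib.load(...)
example = {"SST": object(), "bathymetry": object()}
info = describe_bundle(example)
```

Feature order matters for tree models, so inputs should be arranged to match the feature lists in `model_manifest_20260124_104452.json` before prediction.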
Dataset
- Source: AlgaGPT-extracted proteomes from 2,044 TARA Oceans metagenomic assemblies
- Samples with complete metadata: 1,279 (SMART-filtered) to 1,810 (GPS-recovered with WOA23)
- Environmental data: Google Earth Engine (GEE) satellite products + WOA23 nutrients
Related Models
- algaGPT: protein classification model used for proteome extraction
- TARA-WorldModel-VICReg: alternative approach based on a VICReg joint embedding
Authors
David R. Nelson, Kourosh Salehi-Ashtiani
Green Genomics Lab, New York University Abu Dhabi
Citation
@article{elfnet2026,
title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
journal={Forthcoming},
year={2026}
}
Contact
Kourosh Salehi-Ashtiani (ksa3@nyu.edu)