TARA-XGBoost-Bidirectional

Bidirectional XGBoost ensemble models linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.

Model Description

Forward Models (Environment → Pfam)

  • Input: 32 Google Earth Engine oceanographic variables
  • Output: CLR-transformed abundance of 100 top-variance Pfam domains
  • Architecture: 100 independent XGBoost regressors (one per target domain)
  • Performance: Median test R² = 0.20 (IQR: 0.16–0.29; max R² = 0.54)
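The forward targets are CLR-transformed abundances. As a minimal sketch of what a centered log-ratio transform does to one sample, the following uses only the standard library. The pseudocount and the example counts are assumptions for illustration; the preprocessing actually used for these models is not described here.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's domain counts.

    `pseudocount` is an assumption that keeps log() defined for zero counts.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the values, so they sum to zero.
    return [v - mean_log for v in logs]

sample = [120, 30, 0, 850]  # hypothetical Pfam domain counts for one sample
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

Because each CLR value is relative to the sample's geometric mean, a regressor trained on one domain predicts its abundance relative to the rest of the profile rather than an absolute count.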

Reverse Models (Pfam → Environment)

  • Input: abundances of 9,611 Pfam domains, reduced to 100 PCA components (72.2% of variance retained)
  • Output: 31 environmental variables
  • Architecture: 31 independent XGBoost regressors (one per target variable)
  • Key results (10-fold spatial block CV unless noted):
    • SST: R² = 0.38 (R² = 0.61 on an 80/20 split)
    • Bathymetry: R² = 0.42 (R² = 0.53 on an 80/20 split)
    • Cross-basin validation (bathymetry): R² = 0.25
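Spatial block CV keeps geographically close samples in the same fold, which is one plausible reason the scores above are lower than on an 80/20 split. The sketch below illustrates the general idea only: the 10-degree grid, the greedy fold balancing, and the coordinates are all assumptions, not the blocking scheme used for these models.

```python
import math
from collections import defaultdict

def spatial_block(lat, lon, block_deg=10.0):
    """Map a coordinate to a grid-cell id (block size is an assumption)."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))

def spatial_block_folds(coords, n_folds=10, block_deg=10.0):
    """Assign sample indices to folds so that no block spans two folds."""
    blocks = defaultdict(list)
    for idx, (lat, lon) in enumerate(coords):
        blocks[spatial_block(lat, lon, block_deg)].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Place the largest blocks first, each into the currently smallest fold.
    for key in sorted(blocks, key=lambda k: (-len(blocks[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(blocks[key])
    return folds

# Hypothetical (latitude, longitude) pairs for six samples.
coords = [(24.5, 54.4), (24.9, 54.1), (-33.9, 18.4),
          (36.1, -5.3), (36.4, -5.9), (-12.0, -77.1)]
folds = spatial_block_folds(coords, n_folds=3)
print(folds)
```

Each fold then serves once as the held-out set, so a model is never evaluated on samples from a grid cell it was trained on.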

XGBoost Hyperparameters

Parameter           Value
n_estimators        200
max_depth           6
learning_rate       0.1
subsample           0.8
colsample_bytree    0.8
min_child_weight    3
reg_alpha           0.1
reg_lambda          1.0
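For reference, the values above can be written as keyword arguments for the scikit-learn style `xgboost.XGBRegressor`. This is a sketch under assumptions: `build_regressors` is a hypothetical helper, and the original training code may have set further options (such as a random seed) that are not listed here.

```python
# Hyperparameters copied from the table above.
XGB_PARAMS = {
    "n_estimators": 200,
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
}

def build_regressors(target_names, params=XGB_PARAMS):
    """Create one untrained regressor per target (hypothetical helper).

    xgboost is imported lazily, so the dictionary above can be used
    without the package installed.
    """
    from xgboost import XGBRegressor
    return {name: XGBRegressor(**params) for name in target_names}
```

Calling `build_regressors(["sst", "bathymetry"])` (placeholder target names) would return two independent, unfitted models, mirroring the one-regressor-per-target architecture.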

Files

File                                             Size     Description
xgboost_forward_models_20260124_104452.joblib    49 MB    100 forward models
xgboost_reverse_models_20260124_104452.joblib    15 MB    31 reverse models
model_manifest_20260124_104452.json              -        Feature lists and hyperparameters

Usage

import joblib

# Load reverse models (Pfam → Environment)
reverse_bundle = joblib.load("xgboost_reverse_models_20260124_104452.joblib")

# Load forward models (Environment → Pfam)
forward_bundle = joblib.load("xgboost_forward_models_20260124_104452.joblib")
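The internal layout of the loaded bundles is not documented here, so it may help to inspect an object before indexing into it. The helper below is a hypothetical, layout-agnostic sketch; the stand-in dictionary and its keys are placeholders, not the real bundle contents.

```python
def describe(obj, max_items=5):
    """Return a short summary of a loaded object without assuming its layout."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_items]
        return f"dict with {len(obj)} entries, first keys: {keys}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} with {len(obj)} items"
    return type(obj).__name__

# Stand-in object; with the real files, pass reverse_bundle or forward_bundle.
example = {"sst": object(), "bathymetry": object()}
summary = describe(example)
print(summary)
```

The feature lists in the manifest JSON are the likely place to confirm the input column order each model expects before calling `predict`.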

Dataset

  • Source: AlgaGPT-extracted proteomes from 2,044 TARA Oceans metagenomic assemblies
  • Samples with complete metadata: 1,279 (SMART-filtered) to 1,810 (GPS-recovered with WOA23)
  • Environmental data: Google Earth Engine (GEE) satellite products + WOA23 nutrients

Related Models

  • algaGPT: Protein classification model used for proteome extraction
  • TARA-WorldModel-VICReg: Alternative approach based on VICReg joint embeddings

Authors

David R. Nelson, Kourosh Salehi-Ashtiani

Green Genomics Lab, New York University Abu Dhabi

Citation

@article{elfnet2026,
  title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
  author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
  journal={Forthcoming},
  year={2026}
}

Contact

Kourosh Salehi-Ashtiani (ksa3@nyu.edu)
