TARA-XGBoost-Bidirectional
Bidirectional XGBoost ensemble models linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.
Model Description
Forward Models (Environment → Pfam)
- Input: 32 Google Earth Engine oceanographic variables
- Output: CLR-transformed abundance of 100 top-variance Pfam domains
- 100 independent XGBoost regressors (one per target domain)
- Median test R² = 0.20 (IQR: 0.16–0.29; max R² = 0.54)
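The forward targets are CLR-transformed (centered log-ratio) abundances. As a minimal sketch of that transform for a single sample, assuming natural logarithms and an optional pseudocount for zero abundances (the pseudocount and the function name `clr_transform` are illustrative assumptions, not details taken from this repository):

```python
import math


def clr_transform(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's abundance vector.

    The pseudocount is an assumption made here so that log() stays
    defined for zero abundances; the value used for these models is
    not documented in this card.
    """
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Toy sample with three hypothetical domain abundances.
sample = [10.0, 20.0, 40.0]
clr_values = clr_transform(sample, pseudocount=0.0)
```

CLR values for a sample always sum to zero, so each value expresses a domain's abundance relative to the geometric mean of that sample rather than as an absolute count.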
Reverse Models (Pfam → Environment)
- Input: abundances of 9,611 Pfam domains, reduced to 100 PCA components (72.2% of variance retained)
- Output: 31 environmental variables
- 31 independent XGBoost regressors (one per target variable)
- Best targets: MODIS SST (R² = 0.61), mean SST (R² = 0.61), bathymetry (R² = 0.53)
- Independent-subset validation: bathymetry R² = 0.25 across disjoint ocean basins
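The scores above are reported as R², one per target, and summarized by their median for the forward models. This card does not state how R² was computed; the sketch below assumes the standard coefficient of determination, 1 − SS_res / SS_tot, and the helper names are illustrative.

```python
from statistics import median


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        # R² is undefined when the target is constant.
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot


def median_r_squared(per_target_pairs):
    """Median R² over (y_true, y_pred) pairs, one pair per target."""
    return median([r_squared(t, p) for t, p in per_target_pairs])
```

With this definition, a model that always predicts the mean of the test targets scores 0, and a model that does worse than that scores below 0, which is useful context for values such as 0.20 or 0.25.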
XGBoost Hyperparameters
| Parameter | Value |
|---|---|
| n_estimators | 200 |
| max_depth | 6 |
| learning_rate | 0.1 |
| subsample | 0.8 |
| colsample_bytree | 0.8 |
| min_child_weight | 3 |
| reg_alpha | 0.1 |
| reg_lambda | 1.0 |
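As a rough sketch of how the table maps onto code, the values can be held in a dictionary and passed to one regressor per target. This assumes the scikit-learn style `xgboost.XGBRegressor` interface; the names `XGB_PARAMS` and `build_regressors` are hypothetical, and any settings not listed in the table (random seed, objective, early stopping) are left at library defaults here because the card does not specify them.

```python
# Hyperparameters copied from the table above.
XGB_PARAMS = {
    "n_estimators": 200,
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
}


def build_regressors(target_names, params=XGB_PARAMS):
    """Create one untrained XGBRegressor per target name."""
    # Imported lazily so the parameter table can be inspected
    # without xgboost being installed.
    from xgboost import XGBRegressor

    return {name: XGBRegressor(**params) for name in target_names}
```

Calling `build_regressors` with 100 Pfam identifiers or 31 environmental variable names would give the one-model-per-target layout described in the model description; each regressor would still need to be fitted separately.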
Files
- xgboost_forward_models_20260124_104452.joblib: 100 forward models (49 MB)
- xgboost_reverse_models_20260124_104452.joblib: 31 reverse models (15 MB)
- model_manifest_20260124_104452.json: Feature lists and hyperparameters
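The manifest's exact layout is not documented in this card, so a cautious first step is to list its top-level keys before relying on any of them. A minimal sketch, in which `summarize_manifest` is a hypothetical helper and no key names are assumed:

```python
import json


def summarize_manifest(path):
    """Return {top-level key: short description} for a JSON manifest."""
    with open(path, encoding="utf-8") as fh:
        manifest = json.load(fh)
    if not isinstance(manifest, dict):
        # Layout is undocumented, so fall back to reporting the root type.
        return {"<root>": type(manifest).__name__}
    summary = {}
    for key, value in manifest.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary
```

For example, `summarize_manifest("model_manifest_20260124_104452.json")` would show which entries hold the feature lists and which hold the hyperparameters.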
Usage
import joblib
# Load reverse models (Pfam → Environment)
reverse_bundle = joblib.load("xgboost_reverse_models_20260124_104452.joblib")
# Load forward models (Environment → Pfam)
forward_bundle = joblib.load("xgboost_forward_models_20260124_104452.joblib")
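The internal structure of the loaded bundles is not described here, so it is worth inspecting them before calling `predict`. The helper below is a sketch that makes no assumption about the bundle beyond it being an ordinary Python object; `describe_bundle` is a hypothetical name.

```python
def describe_bundle(bundle, max_items=5):
    """Return human-readable lines describing a loaded bundle."""
    lines = [f"type: {type(bundle).__name__}"]
    if isinstance(bundle, dict):
        lines.append(f"entries: {len(bundle)}")
        # Show the first few keys and the type stored under each.
        for key in list(bundle)[:max_items]:
            lines.append(f"  {key!r}: {type(bundle[key]).__name__}")
    elif isinstance(bundle, (list, tuple)):
        lines.append(f"entries: {len(bundle)}")
        for index, item in enumerate(bundle[:max_items]):
            lines.append(f"  [{index}]: {type(item).__name__}")
    return lines
```

Running `print("\n".join(describe_bundle(reverse_bundle)))` after loading would reveal, for instance, whether the models are keyed by target name and whether preprocessing objects such as a PCA transformer are stored alongside them.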
Dataset
Trained on AlgaGPT-extracted proteomes from 2,044 TARA Oceans metagenomic assemblies. Environmental variables were retrieved from Google Earth Engine (GEE) for the 1,279 samples with complete metadata.
Related Models
- GreenGenomicsLab/algaGPT: AlgaGPT protein classification model
- GreenGenomicsLab/LA4SR-Pythia70m-b-ckpt-55000: LA4SR-Pythia classification model
- GreenGenomicsLab/TARA-ELF-NET: Deep neural network bidirectional models
Citation
LA4SR classification models:
Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Patterns. 2024;6(11).
License
Apache 2.0