TARA-ELF-NET / README.md

GreenGenomicsLab

Upload ELF-NET bidirectional neural network checkpoints (env2pfam + pfam2env)

df6ede9 verified 2 days ago

metadata

language: en
license: apache-2.0
tags:
  - marine-biology
  - metagenomics
  - environmental-modeling
  - protein-domains
  - tara-oceans
  - pfam
  - pytorch
library_name: pytorch
pipeline_tag: tabular-regression

ELF-NET: Environment-Linked Functional Network

Bidirectional neural network checkpoints linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.

Model Description

ELF-NET consists of two complementary prediction directions:

env2pfam (Environment → Pfam Abundance)

Predicts the abundance of every Pfam protein domain in a dataset's profile (17,245 domains for LA4SR-Pythia, 20,318 for AlgaGPT) at a marine sampling site from 94 environmental features (30 oceanographic/atmospheric variables + 64 AlphaEarth spectral eigenvectors).

pfam2env (Pfam Abundance → Environmental Features)

Predicts 64 environmental features from observed Pfam domain abundance profiles (9,611 input domains).

Repository Structure

├── env2pfam/
│   ├── algagpt_full/       # AlgaGPT-extracted proteomes, full architecture
│   ├── algagpt_light/      # AlgaGPT-extracted proteomes, light architecture
│   ├── pythia_full/        # LA4SR-Pythia-extracted proteomes, full architecture
│   └── pythia_light/       # LA4SR-Pythia-extracted proteomes, light architecture
├── pfam2env/
│   ├── full/               # Full architecture
│   └── light/              # Light architecture
└── README.md

Each subdirectory contains:

best_model.pt — PyTorch checkpoint (model_state_dict, optimizer_state_dict, best_val_loss)
config.json — Hyperparameters and feature lists
final_metrics.json — Train/val/test metrics
training_history.json — Per-epoch training curves
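As a minimal sketch of working with this layout, the snippet below walks the six variant directories and loads whichever JSON files are present. The helper name `load_variant` and its `root` argument are illustrative, not part of the repository, and nothing is assumed about the keys inside each JSON file.

```python
import json
from pathlib import Path

# Variant directories as listed in the repository structure above.
VARIANTS = [
    "env2pfam/algagpt_full",
    "env2pfam/algagpt_light",
    "env2pfam/pythia_full",
    "env2pfam/pythia_light",
    "pfam2env/full",
    "pfam2env/light",
]

# JSON files that accompany each best_model.pt.
JSON_FILES = ["config.json", "final_metrics.json", "training_history.json"]


def load_variant(root, variant):
    """Return {filename: parsed JSON} for one variant directory (hypothetical helper)."""
    folder = Path(root) / variant
    loaded = {}
    for name in JSON_FILES:
        path = folder / name
        if path.is_file():  # skip files that are missing locally
            with path.open() as f:
                loaded[name] = json.load(f)
    return loaded
```

For example, `load_variant(".", "env2pfam/pythia_full")["config.json"]` would return the parsed hyperparameters for that variant when run from the repository root.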

Architectures

env2pfam Full

Input(94) → Linear(512) + BN + ReLU + Dropout(0.2)
         → Linear(1024) + BN + ReLU + Dropout(0.2)
         → Linear(2048) + BN + ReLU + Dropout(0.2)
         → Linear(4096) + BN + ReLU + Dropout(0.2)
         → Linear(output_dim)

env2pfam Light

Input(94) → Linear(256) + BN + ReLU + Dropout(0.2)
         → Linear(512) + BN + ReLU + Dropout(0.2)
         → Linear(1024) + BN + ReLU + Dropout(0.2)
         → Linear(2048) + BN + ReLU + Dropout(0.2)
         → Linear(output_dim)
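The two env2pfam layouts above differ only in their hidden widths, so both can be sketched with one builder. This is an illustrative reconstruction, not the ELF-NET model class: the function name `build_env2pfam_mlp` is hypothetical, and the Linear, BatchNorm, ReLU, Dropout ordering is assumed from the order in which the layers are listed.

```python
import torch
from torch import nn


def build_env2pfam_mlp(output_dim, hidden=(512, 1024, 2048, 4096), input_dim=94, dropout=0.2):
    """Stack Linear + BatchNorm1d + ReLU + Dropout blocks, then a final Linear layer."""
    layers = []
    prev = input_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    layers.append(nn.Linear(prev, output_dim))
    return nn.Sequential(*layers)


# Light layout for the LA4SR-Pythia dataset (17,245 output domains).
# The default `hidden` widths give the full layout instead.
light = build_env2pfam_mlp(output_dim=17245, hidden=(256, 512, 1024, 2048))

light.eval()  # use running BatchNorm statistics and disable dropout
with torch.no_grad():
    out = light(torch.randn(4, 94))  # random stand-in for 4 samples of 94 features
print(out.shape)  # torch.Size([4, 17245])
```

Because `nn.Sequential` names its parameters by position (`0.weight`, `1.running_mean`, and so on), the keys in a released `model_state_dict` may not match this sketch; loading the checkpoints reliably still requires the original model class or a key remapping.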

pfam2env Full

InputBatchNorm(9611) → Linear(2048) + ReLU + Dropout
                     → Linear(512) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)

pfam2env Light

InputBatchNorm(9611) → Linear(512) + ReLU + Dropout
                     → Linear(256) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
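To get a rough sense of model size, the layer widths above can be turned into trainable-parameter counts. The sketch below assumes every Linear layer has a bias and every BatchNorm layer has a learnable scale and shift (the PyTorch defaults); `count_params` is an illustrative helper, and the env2pfam counts use the LA4SR-Pythia output dimension of 17,245.

```python
def count_params(input_dim, hidden, output_dim, bn_after_linear, bn_on_input):
    """Count trainable parameters of an MLP described by its layer widths."""
    total = 2 * input_dim if bn_on_input else 0  # input BatchNorm scale and shift
    prev = input_dim
    for width in hidden:
        total += prev * width + width  # Linear weights and biases
        if bn_after_linear:
            total += 2 * width  # BatchNorm scale and shift
        prev = width
    total += prev * output_dim + output_dim  # final Linear layer
    return total


counts = {
    "env2pfam full (pythia)": count_params(94, (512, 1024, 2048, 4096), 17245, True, False),
    "env2pfam light (pythia)": count_params(94, (256, 512, 1024, 2048), 17245, True, False),
    "pfam2env full": count_params(9611, (2048, 512, 128), 64, False, True),
    "pfam2env light": count_params(9611, (512, 256, 128), 64, False, True),
}
for name, n in counts.items():
    print(f"{name}: {n:,} parameters")
```

In the env2pfam models the final Linear layer dominates the count because of the wide output; in the pfam2env models the first Linear layer dominates because of the 9,611-dimensional input.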

Performance

env2pfam (Environment → Pfam)

| Variant | Dataset | Output Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|---|
| pythia_full | LA4SR-Pythia | 17,245 | 1e-3 | 0.1487 | 14.597 | 2.411 |
| pythia_light | LA4SR-Pythia | 17,245 | 1e-4 | 0.1432 | 14.561 | 2.454 |
| algagpt_full | AlgaGPT | 20,318 | 1e-3 | 0.1189 | 14.006 | 2.381 |
| algagpt_light | AlgaGPT | 20,318 | 1e-4 | 0.1070 | 14.136 | 2.415 |

Test R² is the mean of the per-domain R² values across all output Pfam dimensions. The modest values (0.11 to 0.15) likely reflect the high dimensionality of the output space (17K–20K Pfam domains) and the inherent stochasticity of metagenomic sampling.
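The averaging described above can be sketched as follows. The README does not state how the mean was implemented, so this assumes an unweighted average of one R² per output column, which is what scikit-learn's `r2_score` computes with `multioutput="uniform_average"`. It is written in plain Python on toy data so that it runs without dependencies.

```python
def r2_per_output(y_true, y_pred):
    """Return one R² value per output column; inputs are lists of equal-length rows."""
    n_outputs = len(y_true[0])
    scores = []
    for j in range(n_outputs):
        truth = [row[j] for row in y_true]
        pred = [row[j] for row in y_pred]
        mean = sum(truth) / len(truth)
        ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
        ss_tot = sum((t - mean) ** 2 for t in truth)
        scores.append(1.0 - ss_res / ss_tot)  # undefined if a column is constant
    return scores


def mean_r2(y_true, y_pred):
    """Unweighted mean of the per-column R² values."""
    scores = r2_per_output(y_true, y_pred)
    return sum(scores) / len(scores)


# Toy data: 4 samples, 2 output domains.
y_true = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
y_pred = [[1.0, 0.5], [2.0, 0.5], [3.0, 0.5], [4.0, 0.5]]
print(r2_per_output(y_true, y_pred))  # [1.0, 0.0]
print(mean_r2(y_true, y_pred))        # 0.5
```

The toy case shows how a mean can hide spread: one domain is predicted perfectly and the other no better than its mean, yet the summary value is 0.5.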

pfam2env (Pfam → Environment)

| Variant | Input Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|
| full | 9,611 | 1e-3 | -0.0057 | 0.00931 | 0.0724 |
| light | 9,611 | 1e-3 | -0.0055 | 0.00931 | 0.0724 |

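An R² of 0 corresponds to always predicting the mean of the test targets, so the slightly negative values here mean that both pfam2env models perform marginally worse than that constant baseline and have no demonstrated predictive skill. These checkpoints document the pfam→env direction of the bidirectional framework and are included for completeness and reproducibility.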
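A short worked example of why a near-constant predictor can land below zero: R² is measured against the mean of the test targets, so even predicting the training mean scores negative whenever the two means differ. The numbers below are toy values chosen for illustration, not ELF-NET data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for a single output."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


train_targets = [0.0, 1.0, 2.0, 3.0]  # toy training values for one feature
test_targets = [1.0, 2.0, 3.0, 4.0]   # toy test values with a shifted mean

# Baseline 1: predict the test mean itself, which defines R² = 0.
test_mean = sum(test_targets) / len(test_targets)
r2_test_mean = r2(test_targets, [test_mean] * len(test_targets))

# Baseline 2: predict the training mean, the best constant a model can learn.
train_mean = sum(train_targets) / len(train_targets)
r2_train_mean = r2(test_targets, [train_mean] * len(test_targets))

print(r2_test_mean)   # 0.0
print(r2_train_mean)  # negative, because the train and test means differ
```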

Input Features (env2pfam)

30 environmental variables:

Air temperature (mean, max, min, range °C)
Precipitation (mean mm)
Solar radiation (MJ/m²)
Elevation (m), bathymetry (m), distance to coast (km)
Land cover class
Sea surface temperature (SST mean, max, min, range °C; MODIS SST mean)
Chlorophyll-a (mean, max, min mg/m³)
Normalized fluorescence line height (NFLH mean)
Particulate organic carbon (POC mean mg/m³)
Remote sensing reflectance (Rrs at 412, 443, 469, 488, 531, 547, 555, 645, 667, 678 nm)

64 AlphaEarth spectral eigenvectors (A00–A63)
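As a quick check that the list above adds up to the 94 model inputs, the feature names can be enumerated in code. The environmental column names below are hypothetical placeholders; the actual names and their order are stored in each variant's `config.json`, and inputs should follow that order. Only the `A00`–`A63` naming comes from this README.

```python
# Hypothetical column names for the 30 environmental variables listed above.
ENV_FEATURES = (
    [f"air_temp_{s}" for s in ("mean", "max", "min", "range")]
    + ["precip_mean", "solar_radiation"]
    + ["elevation", "bathymetry", "dist_to_coast"]
    + ["land_cover_class"]
    + [f"sst_{s}" for s in ("mean", "max", "min", "range")]
    + ["modis_sst_mean"]
    + [f"chl_a_{s}" for s in ("mean", "max", "min")]
    + ["nflh_mean", "poc_mean"]
    + [f"rrs_{wl}" for wl in (412, 443, 469, 488, 531, 547, 555, 645, 667, 678)]
)

# AlphaEarth eigenvector names A00 through A63.
ALPHAEARTH_FEATURES = [f"A{i:02d}" for i in range(64)]

ALL_FEATURES = ENV_FEATURES + ALPHAEARTH_FEATURES
print(len(ENV_FEATURES), len(ALPHAEARTH_FEATURES), len(ALL_FEATURES))  # 30 64 94
```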

Datasets

Two LLM-mediated proteome extraction strategies were applied to TARA Oceans metagenomic assemblies:

LA4SR-Pythia: 2,049 samples → 17,245 Pfam domains
AlgaGPT: 2,044 samples → 20,318 Pfam domains

Both used SNAP gene prediction followed by hmmsearch against the Pfam database. The different extraction strategies yield different protein sets and domain profiles from the same underlying metagenomes.

Training Details

Framework: PyTorch
Loss: MSE
Optimizer: Adam (weight_decay=1e-4 for pfam2env)
Scheduler: Cosine annealing (pfam2env)
Early stopping: Patience 20 (env2pfam) / 30 (pfam2env)
Batch size: 32
Max epochs: 200
Seed: 42
Hardware: CUDA GPU
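The early-stopping and cosine-annealing settings above can be sketched in plain Python. This is an assumption-laden illustration rather than the training code: the class `EarlyStopping` and function `cosine_lr` are hypothetical names, the stopping rule assumes any strict improvement in validation loss resets the counter, and the schedule assumes a single cosine cycle over the 200 maximum epochs with a minimum learning rate of 0.

```python
import math


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def cosine_lr(epoch, base_lr=1e-3, max_epochs=200, min_lr=0.0):
    """Cosine-annealed learning rate (assumes one cycle over max_epochs)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / max_epochs))


stopper = EarlyStopping(patience=20)   # env2pfam setting; pfam2env uses 30
val_losses = [1.0, 0.9] + [0.95] * 25  # toy curve that plateaus after epoch 1
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        print(f"stopped at epoch {epoch}, best val loss {stopper.best}")
        break
```

With this toy curve the best loss is reached at epoch 1 and training stops 20 epochs later, at epoch 21. In a real loop, `best_model.pt` would be saved each time `stopper.best` improves.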

Usage

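import json

import torch

# Load the hyperparameters and feature lists saved with this variant
with open("env2pfam/pythia_full/config.json") as f:
    config = json.load(f)

# Load the checkpoint on CPU. weights_only=False unpickles arbitrary Python
# objects, so only use it with checkpoint files you trust.
checkpoint = torch.load(
    "env2pfam/pythia_full/best_model.pt",
    map_location="cpu",
    weights_only=False,
)
state_dict = checkpoint["model_state_dict"]
best_val_loss = checkpoint["best_val_loss"]  # stored alongside the weights

# Reconstruct the model (requires the ELF-NET model class), then:
# model.load_state_dict(state_dict)
# model.eval()  # disable dropout and use running BatchNorm statistics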

Citation

If you use these checkpoints, please cite the associated manuscript (citation forthcoming).

License

Apache 2.0