---
language: en
license: apache-2.0
tags:
- marine-biology
- metagenomics
- environmental-modeling
- protein-domains
- tara-oceans
- pfam
- pytorch
library_name: pytorch
pipeline_tag: tabular-regression
---

# ELF-NET: Environment-Linked Functional Network

Bidirectional neural network checkpoints linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.

## Model Description

ELF-NET consists of two complementary prediction directions:

### env2pfam (Environment → Pfam Abundance)
Predicts the abundance of Pfam protein domains at a marine sampling site (17,245 or 20,318 outputs, depending on the extraction dataset) given 94 environmental features (30 oceanographic/atmospheric variables + 64 AlphaEarth spectral eigenvectors).

### pfam2env (Pfam Abundance → Environmental Features)
Predicts 64 environmental features from observed Pfam domain abundance profiles (9,611 input domains).

## Repository Structure

```
├── env2pfam/
│   ├── algagpt_full/     # AlgaGPT-extracted proteomes, full architecture
│   ├── algagpt_light/    # AlgaGPT-extracted proteomes, light architecture
│   ├── pythia_full/      # LA4SR-Pythia-extracted proteomes, full architecture
│   └── pythia_light/     # LA4SR-Pythia-extracted proteomes, light architecture
├── pfam2env/
│   ├── full/             # Full architecture
│   └── light/            # Light architecture
└── README.md
```

Each subdirectory contains:
- `best_model.pt`: PyTorch checkpoint (`model_state_dict`, `optimizer_state_dict`, `best_val_loss`)
- `config.json`: Hyperparameters and feature lists
- `final_metrics.json`: Train/val/test metrics
- `training_history.json`: Per-epoch training curves

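The JSON files that accompany each checkpoint can be read with the standard library alone. The sketch below is a minimal helper, assuming the per-variant layout listed above. The name `load_variant_metadata` is illustrative and not part of the repository, and nothing is assumed about the keys inside each file.

```python
import json
from pathlib import Path

# Hypothetical helper (not shipped with the checkpoints): gathers the JSON
# files found in one variant directory into a single dict.
JSON_FILES = ("config.json", "final_metrics.json", "training_history.json")


def load_variant_metadata(variant_dir):
    """Return {file stem: parsed JSON} for each metadata file that exists."""
    variant_dir = Path(variant_dir)
    metadata = {}
    for name in JSON_FILES:
        path = variant_dir / name
        if path.is_file():  # skip missing files instead of raising
            with path.open() as f:
                metadata[path.stem] = json.load(f)
    return metadata
```

For example, `load_variant_metadata("env2pfam/pythia_full")` returns a dict with up to three entries keyed `config`, `final_metrics`, and `training_history`.
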
## Architectures

### env2pfam Full
```
Input(94) → Linear(512) + BN + ReLU + Dropout(0.2)
          → Linear(1024) + BN + ReLU + Dropout(0.2)
          → Linear(2048) + BN + ReLU + Dropout(0.2)
          → Linear(4096) + BN + ReLU + Dropout(0.2)
          → Linear(output_dim)
```

### env2pfam Light
```
Input(94) → Linear(256) + BN + ReLU + Dropout(0.2)
          → Linear(512) + BN + ReLU + Dropout(0.2)
          → Linear(1024) + BN + ReLU + Dropout(0.2)
          → Linear(2048) + BN + ReLU + Dropout(0.2)
          → Linear(output_dim)
```

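The two env2pfam layer stacks can be written as a small PyTorch builder. This is a sketch under stated assumptions rather than the original model class: the function name `build_env2pfam` and the use of `nn.Sequential` are choices made here, so the parameter names it produces may not match the keys stored in `best_model.pt`.

```python
from torch import nn

# Hidden widths taken from the two diagrams above.
FULL_HIDDEN_DIMS = (512, 1024, 2048, 4096)
LIGHT_HIDDEN_DIMS = (256, 512, 1024, 2048)


def build_env2pfam(output_dim, hidden_dims=FULL_HIDDEN_DIMS,
                   input_dim=94, dropout=0.2):
    """Stack Linear + BatchNorm1d + ReLU + Dropout blocks, then a linear head.

    Illustrative reconstruction only; layer naming is an assumption.
    """
    layers = []
    prev = input_dim
    for width in hidden_dims:
        layers += [
            nn.Linear(prev, width),
            nn.BatchNorm1d(width),
            nn.ReLU(),
            nn.Dropout(dropout),
        ]
        prev = width
    layers.append(nn.Linear(prev, output_dim))  # no activation on the output
    return nn.Sequential(*layers)
```

Calling `build_env2pfam(17245)` would give the full layout for the LA4SR-Pythia output size, and `build_env2pfam(20318, hidden_dims=LIGHT_HIDDEN_DIMS)` the light layout for AlgaGPT. Put the model in `eval()` mode before inference so that batch normalization and dropout behave deterministically.
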
### pfam2env Full
```
InputBatchNorm(9611) → Linear(2048) + ReLU + Dropout
                     → Linear(512) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
```

### pfam2env Light
```
InputBatchNorm(9611) → Linear(512) + ReLU + Dropout
                     → Linear(256) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
```

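The pfam2env direction follows the same pattern, with batch normalization applied to the raw input instead of after each hidden layer. The sketch below is illustrative: `build_pfam2env` is a name chosen here, the resulting parameter names may differ from those in the checkpoints, and the dropout rate is a required argument because the diagrams above do not state it.

```python
from torch import nn

# Hidden widths taken from the two diagrams above.
PFAM2ENV_FULL_DIMS = (2048, 512, 128)
PFAM2ENV_LIGHT_DIMS = (512, 256, 128)


def build_pfam2env(dropout, hidden_dims=PFAM2ENV_FULL_DIMS,
                   input_dim=9611, output_dim=64):
    """BatchNorm1d on the input, then Linear + ReLU + Dropout blocks.

    `dropout` has no default: the rate is not given in the diagrams.
    """
    layers = [nn.BatchNorm1d(input_dim)]  # "InputBatchNorm" in the diagram
    prev = input_dim
    for width in hidden_dims:
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    layers.append(nn.Linear(prev, output_dim))  # linear regression head
    return nn.Sequential(*layers)
```
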
## Performance

### env2pfam (Environment → Pfam)

| Variant | Dataset | Output Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|---|
| **pythia_full** | LA4SR-Pythia | 17,245 | 1e-3 | **0.1487** | 14.597 | 2.411 |
| pythia_light | LA4SR-Pythia | 17,245 | 1e-4 | 0.1432 | 14.561 | 2.454 |
| algagpt_full | AlgaGPT | 20,318 | 1e-3 | 0.1189 | 14.006 | 2.381 |
| algagpt_light | AlgaGPT | 20,318 | 1e-4 | 0.1070 | 14.136 | 2.415 |

R² is the mean across all output Pfam dimensions. The modest R² values reflect the high dimensionality of the output space (17K–20K Pfam domains) and the inherent stochasticity of metagenomic sampling.

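To make the averaged metric concrete, the following pure-Python sketch computes R² per output column and then takes an unweighted mean. It assumes a uniform average and non-constant target columns; the evaluation code behind the table may differ in such details, and the function names are illustrative.

```python
def r2_score_1d(y_true, y_pred):
    """Coefficient of determination for a single output dimension.

    Assumes y_true is not constant (otherwise the denominator is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_r2(Y_true, Y_pred):
    """Unweighted mean of per-column R²; inputs are lists of sample rows."""
    cols_true = list(zip(*Y_true))  # transpose rows into columns
    cols_pred = list(zip(*Y_pred))
    scores = [r2_score_1d(t, p) for t, p in zip(cols_true, cols_pred)]
    return sum(scores) / len(scores)
```

Under this definition, a predictor that always outputs each column's mean on the evaluated samples scores exactly 0, so a value below zero means the model does worse than that constant baseline.
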
### pfam2env (Pfam → Environment)

| Variant | Input Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|
| full | 9,611 | 1e-3 | -0.0057 | 0.00931 | 0.0724 |
| light | 9,611 | 1e-3 | -0.0055 | 0.00931 | 0.0724 |

A slightly negative R² means test performance is at, or marginally below, the mean-prediction baseline. These checkpoints document the pfam→env direction of the bidirectional framework and are included for completeness and reproducibility.

## Input Features (env2pfam)

**30 environmental variables:**
- Air temperature (mean, max, min, range °C)
- Precipitation (mean mm)
- Solar radiation (MJ/m²)
- Elevation (m), bathymetry (m), distance to coast (km)
- Land cover class
- Sea surface temperature (SST mean, max, min, range °C; MODIS SST mean)
- Chlorophyll-a (mean, max, min mg/m³)
- Normalized fluorescence line height (NFLH mean)
- Particulate organic carbon (POC mean mg/m³)
- Remote sensing reflectance (Rrs at 412, 443, 469, 488, 531, 547, 555, 645, 667, 678 nm)

**64 AlphaEarth spectral eigenvectors** (A00–A63)

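The feature counts above can be checked with a short tally. In this sketch the group keys are illustrative labels, not the actual column names; the authoritative names and their order are whatever the feature lists in `config.json` specify.

```python
# Number of variables contributed by each bullet in the list above.
# Keys are illustrative labels, not the real column names.
ENV_GROUP_COUNTS = {
    "air_temperature": 4,             # mean, max, min, range
    "precipitation": 1,
    "solar_radiation": 1,
    "elevation_bathymetry_coast": 3,
    "land_cover": 1,
    "sea_surface_temperature": 5,     # mean, max, min, range, MODIS mean
    "chlorophyll_a": 3,               # mean, max, min
    "nflh": 1,
    "poc": 1,
    "rrs": 10,                        # ten wavelengths
}

# AlphaEarth eigenvector labels A00, A01, ..., A63.
ALPHAEARTH_COLUMNS = [f"A{i:02d}" for i in range(64)]

n_env = sum(ENV_GROUP_COUNTS.values())
n_total = n_env + len(ALPHAEARTH_COLUMNS)
```

Here `n_env` comes to 30 and `n_total` to 94, matching the env2pfam input width.
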
## Datasets

Two LLM-mediated proteome extraction strategies were applied to TARA Oceans metagenomic assemblies:

- **LA4SR-Pythia**: 2,049 samples, 17,245 Pfam domains
- **AlgaGPT**: 2,044 samples, 20,318 Pfam domains

Both pipelines used SNAP gene prediction followed by `hmmsearch` against the Pfam database. The different extraction strategies yield different protein sets and domain profiles from the same underlying metagenomes.

## Training Details

- **Framework**: PyTorch
- **Loss**: MSE
- **Optimizer**: Adam (weight_decay=1e-4 for pfam2env)
- **Scheduler**: Cosine annealing (pfam2env)
- **Early stopping**: Patience 20 (env2pfam) / 30 (pfam2env)
- **Batch size**: 32
- **Max epochs**: 200
- **Seed**: 42
- **Hardware**: CUDA GPU

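The early stopping rule can be illustrated with a small framework-independent helper. This is one common formulation, assuming strict improvement in validation loss resets the counter; the exact criterion used in the original training script (for example, any minimum-delta threshold) is not stated here, and the class name is illustrative.

```python
class EarlyStopping:
    """Signal a stop once validation loss stalls for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0  # new best: reset the counter
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop capped at 200 epochs, one would call `stopper.step(val_loss)` after each validation pass and break when it returns `True`, using `EarlyStopping(patience=20)` for env2pfam and `EarlyStopping(patience=30)` for pfam2env.
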
## Usage

```python
import json

import torch

# Load model config (hyperparameters and feature lists)
with open("env2pfam/pythia_full/config.json") as f:
    config = json.load(f)

# Load checkpoint.
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoint files from a source you trust.
checkpoint = torch.load(
    "env2pfam/pythia_full/best_model.pt",
    map_location="cpu",
    weights_only=False,
)
state_dict = checkpoint["model_state_dict"]

# Reconstruct model (requires the ELF-NET model class)
# model.load_state_dict(state_dict)
```

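When the ELF-NET model class is not at hand, listing the parameter names and shapes stored in the checkpoint is a useful first step toward rebuilding a compatible module. The helper below is a sketch with illustrative names; it only relies on each entry exposing a `.shape`, as tensors in a PyTorch `state_dict` do.

```python
def summarize_state_dict(state_dict):
    """Return (parameter name, shape) pairs in checkpoint order."""
    return [(name, tuple(value.shape)) for name, value in state_dict.items()]


def print_summary(state_dict):
    """Print one line per parameter: left-aligned name, then its shape."""
    for name, shape in summarize_state_dict(state_dict):
        print(f"{name:<40} {shape}")
```

Calling `print_summary(state_dict)` on the `model_state_dict` loaded above shows the input and output width of every linear layer, which can be compared against the layouts in the Architectures section.
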
## Citation

If you use these checkpoints, please cite the associated manuscript (citation forthcoming).

## License

Apache 2.0