---
language: en
license: apache-2.0
tags:
- marine-biology
- metagenomics
- environmental-modeling
- protein-domains
- tara-oceans
- pfam
- pytorch
library_name: pytorch
pipeline_tag: tabular-regression
---
# ELF-NET: Environment-Linked Functional Network
Bidirectional neural network checkpoints linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.
## Model Description
ELF-NET consists of two complementary prediction directions:
### env2pfam (Environment → Pfam Abundance)
Predicts the abundance of 17,245–20,318 Pfam protein domains (depending on the extraction dataset) at a marine sampling site, given 94 environmental features (30 oceanographic/atmospheric variables + 64 AlphaEarth spectral eigenvectors).
### pfam2env (Pfam Abundance → Environmental Features)
Predicts 64 environmental features from observed Pfam domain abundance profiles (9,611 input domains).
## Repository Structure
```
├── env2pfam/
│   ├── algagpt_full/    # AlgaGPT-extracted proteomes, full architecture
│   ├── algagpt_light/   # AlgaGPT-extracted proteomes, light architecture
│   ├── pythia_full/     # LA4SR-Pythia-extracted proteomes, full architecture
│   └── pythia_light/    # LA4SR-Pythia-extracted proteomes, light architecture
├── pfam2env/
│   ├── full/            # Full architecture
│   └── light/           # Light architecture
└── README.md
```
Each subdirectory contains:
- `best_model.pt` – PyTorch checkpoint (`model_state_dict`, `optimizer_state_dict`, `best_val_loss`)
- `config.json` – Hyperparameters and feature lists
- `final_metrics.json` – Train/val/test metrics
- `training_history.json` – Per-epoch training curves
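For example, a variant's metadata can be inspected directly (a minimal sketch; the exact keys inside each JSON file are not documented here):

```python
import json

# Read one variant's hyperparameters/feature lists and held-out metrics.
with open("env2pfam/pythia_full/config.json") as f:
    config = json.load(f)
with open("env2pfam/pythia_full/final_metrics.json") as f:
    metrics = json.load(f)

print(json.dumps(config, indent=2))   # hyperparameters and feature lists
print(json.dumps(metrics, indent=2))  # train/val/test metrics
```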
## Architectures
### env2pfam Full
```
Input(94) → Linear(512) + BN + ReLU + Dropout(0.2)
          → Linear(1024) + BN + ReLU + Dropout(0.2)
          → Linear(2048) + BN + ReLU + Dropout(0.2)
          → Linear(4096) + BN + ReLU + Dropout(0.2)
          → Linear(output_dim)
```
### env2pfam Light
```
Input(94) → Linear(256) + BN + ReLU + Dropout(0.2)
          → Linear(512) + BN + ReLU + Dropout(0.2)
          → Linear(1024) + BN + ReLU + Dropout(0.2)
          → Linear(2048) + BN + ReLU + Dropout(0.2)
          → Linear(output_dim)
```
### pfam2env Full
```
InputBatchNorm(9611) → Linear(2048) + ReLU + Dropout
                     → Linear(512) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
```
### pfam2env Light
```
InputBatchNorm(9611) → Linear(512) + ReLU + Dropout
                     → Linear(256) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
```
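For reference, a minimal PyTorch sketch that reproduces these four layer stacks (illustrative only: the repository's own model class, state_dict key names, and the pfam2env dropout rate, assumed 0.2 here, may differ):

```python
import torch.nn as nn

def make_mlp(input_dim, hidden_dims, output_dim, dropout=0.2,
             input_bn=False, hidden_bn=True):
    """MLP matching the diagrams above.

    env2pfam: BatchNorm after every hidden Linear (hidden_bn=True).
    pfam2env: BatchNorm on the raw input only (input_bn=True, hidden_bn=False).
    """
    layers = [nn.BatchNorm1d(input_dim)] if input_bn else []
    prev = input_dim
    for width in hidden_dims:
        layers.append(nn.Linear(prev, width))
        if hidden_bn:
            layers.append(nn.BatchNorm1d(width))
        layers += [nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    layers.append(nn.Linear(prev, output_dim))
    return nn.Sequential(*layers)

# env2pfam full, pythia variant: 94 -> 512 -> 1024 -> 2048 -> 4096 -> 17,245
env2pfam_full = make_mlp(94, (512, 1024, 2048, 4096), 17245)
# pfam2env light: 9,611 -> 512 -> 256 -> 128 -> 64
pfam2env_light = make_mlp(9611, (512, 256, 128), 64,
                          input_bn=True, hidden_bn=False)
```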
## Performance
### env2pfam (Environment → Pfam)
| Variant | Dataset | Output Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|---|
| **pythia_full** | LA4SR-Pythia | 17,245 | 1e-3 | **0.1487** | 14.597 | 2.411 |
| pythia_light | LA4SR-Pythia | 17,245 | 1e-4 | 0.1432 | 14.561 | 2.454 |
| algagpt_full | AlgaGPT | 20,318 | 1e-3 | 0.1189 | 14.006 | 2.381 |
| algagpt_light | AlgaGPT | 20,318 | 1e-4 | 0.1070 | 14.136 | 2.415 |
R² is the mean across all output Pfam dimensions. The modest R² values reflect the high dimensionality of the output space (17K–20K Pfam domains) and the inherent stochasticity of metagenomic sampling.
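For clarity, this per-dimension averaging can be reproduced with scikit-learn's multioutput R² (a sketch with random placeholder arrays, not the repository's evaluation code):

```python
import numpy as np
from sklearn.metrics import r2_score

# Placeholder predictions: shape (n_sites, n_pfam_domains).
rng = np.random.default_rng(42)
y_true = rng.random((100, 17245))
y_pred = rng.random((100, 17245))

# "uniform_average" computes R² per output dimension and averages,
# matching the Test R² definition used in the tables here.
mean_r2 = r2_score(y_true, y_pred, multioutput="uniform_average")
print(mean_r2)  # strongly negative for unrelated random data
```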
### pfam2env (Pfam → Environment)
| Variant | Input Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|
| full | 9,611 | 1e-3 | -0.0057 | 0.00931 | 0.0724 |
| light | 9,611 | 1e-3 | -0.0055 | 0.00931 | 0.0724 |
Negative R² indicates performance near the mean-prediction baseline. These checkpoints document the pfam→env direction of the bidirectional framework and are included for completeness and reproducibility.
## Input Features (env2pfam)
**30 environmental variables:**
- Air temperature (mean, max, min, range °C)
- Precipitation (mean mm)
- Solar radiation (MJ/m²)
- Elevation (m), bathymetry (m), distance to coast (km)
- Land cover class
- Sea surface temperature (SST mean, max, min, range °C; MODIS SST mean)
- Chlorophyll-a (mean, max, min mg/m³)
- Normalized fluorescence line height (NFLH mean)
- Particulate organic carbon (POC mean mg/m³)
- Remote sensing reflectance (Rrs at 412, 443, 469, 488, 531, 547, 555, 645, 667, 678 nm)
**64 AlphaEarth spectral eigenvectors** (A00–A63)
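A sketch of assembling one 94-dimensional input row; the `input_features` key is hypothetical, standing in for wherever config.json records the training-time feature order:

```python
import json
import numpy as np

with open("env2pfam/pythia_full/config.json") as f:
    config = json.load(f)

# Hypothetical key: assumes config.json lists the 94 input features
# in training order (30 environmental variables, then A00-A63).
feature_names = config["input_features"]

site_measurements = {name: 0.0 for name in feature_names}  # replace with real values
x = np.array([site_measurements[name] for name in feature_names],
             dtype=np.float32)
assert x.shape == (94,)
```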
## Datasets
Two LLM-mediated proteome extraction strategies were applied to TARA Oceans metagenomic assemblies:
- **LA4SR-Pythia**: 2,049 samples × 17,245 Pfam domains
- **AlgaGPT**: 2,044 samples × 20,318 Pfam domains
Both used SNAP gene prediction followed by hmmsearch against the Pfam database. The different extraction strategies yield different protein sets and domain profiles from the same underlying metagenomes.
## Training Details
- **Framework**: PyTorch
- **Loss**: MSE
- **Optimizer**: Adam (weight_decay=1e-4 for pfam2env)
- **Scheduler**: Cosine annealing (pfam2env)
- **Early stopping**: Patience 20 (env2pfam) / 30 (pfam2env)
- **Batch size**: 32
- **Max epochs**: 200
- **Seed**: 42
- **Hardware**: CUDA GPU
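Put together, this recipe corresponds to a loop along the following lines (a sketch for the pfam2env direction, reusing `make_mlp` from the Architectures section and synthetic data in place of the real loaders; not the repository's training script):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic stand-ins for the real TARA-derived datasets.
train_loader = DataLoader(TensorDataset(torch.randn(256, 9611), torch.randn(256, 64)),
                          batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.randn(64, 9611), torch.randn(64, 64)),
                        batch_size=32)

model = make_mlp(9611, (512, 256, 128), 64, input_bn=True, hidden_bn=False)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

best_val, patience, bad_epochs = float("inf"), 30, 0
for epoch in range(200):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item()
                       for x, y in val_loader) / len(val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save({"model_state_dict": model.state_dict(),
                    "optimizer_state_dict": optimizer.state_dict(),
                    "best_val_loss": best_val}, "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```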
## Usage
```python
import json
import torch

# Load the variant's hyperparameters and feature lists
with open("env2pfam/pythia_full/config.json") as f:
    config = json.load(f)

# Load the checkpoint; weights_only=False because it stores the
# optimizer state and best validation loss alongside the weights
checkpoint = torch.load(
    "env2pfam/pythia_full/best_model.pt",
    map_location="cpu",
    weights_only=False,
)
state_dict = checkpoint["model_state_dict"]

# Reconstruct the model (requires the ELF-NET model class), then:
# model.load_state_dict(state_dict)
```
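Once the architecture is rebuilt (for example with `make_mlp` from the Architectures sketch above, assuming its parameter names line up with the checkpoint's state_dict), inference is a standard forward pass:

```python
# Illustrative inference; verify that the rebuilt module's parameter
# names match the checkpoint before relying on load_state_dict.
model = make_mlp(94, (512, 1024, 2048, 4096), 17245)
model.load_state_dict(state_dict)
model.eval()

with torch.no_grad():
    x = torch.randn(1, 94)   # placeholder: one site's 94 input features
    pfam_pred = model(x)     # shape (1, 17245): predicted domain abundances
```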
## Citation
If you use these checkpoints, please cite the associated manuscript (citation forthcoming).
## License
Apache 2.0