Update README.md
README.md
@@ -1,3 +1,74 @@
# Microbiome Transformer (Set-Based OTU Stability Model)

This repository provides Transformer checkpoints for microbiome set modeling. Inputs are SSU rRNA OTU embeddings (ProkBERT-derived vectors), optionally combined with text metadata embeddings.

The code referenced in this README (for example `model.py`) is available at https://github.com/the-puzzler/microbiome-model.

## Model summary

- **Architecture:** `MicrobiomeTransformer` (see `model.py`)
- **Input type 1 (DNA/OTU):** 384-d embeddings
- **Input type 2 (text metadata):** 1536-d embeddings
- **Core behavior:** permutation-invariant set encoding via a Transformer encoder with no positional encodings, so the score assigned to a token does not depend on where it appears in the input
- **Output:** per-token scalar logits (used as stability scores)
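
The order-independence described above can be checked on a toy model. The sketch below is not the repository's `MicrobiomeTransformer`; it assumes a plain `torch.nn.TransformerEncoder` with a hypothetical linear scoring head, sized like the small variant, and verifies that shuffling the input tokens only shuffles the per-token logits in the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the encoder (assumption), NOT the repository's model class.
d_model, nhead = 20, 5
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=nhead, dim_feedforward=80, dropout=0.1, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=3, enable_nested_tensor=False)
head = nn.Linear(d_model, 1)  # hypothetical head: one scalar logit per token
encoder.eval()  # dropout off, otherwise outputs are stochastic

tokens = torch.randn(1, 7, d_model)  # (B, N, d_model); no positional encoding added
perm = torch.randperm(7)

with torch.no_grad():
    logits = head(encoder(tokens)).squeeze(-1)                # (1, 7)
    logits_perm = head(encoder(tokens[:, perm])).squeeze(-1)  # same set, shuffled

# Each token keeps its logit; only the positions move with the shuffle.
is_equivariant = torch.allclose(logits[:, perm], logits_perm, atol=1e-5)
print(is_equivariant)
```

Adding positional encodings to `tokens` before the encoder would generally break this property, which is why their absence matters for set inputs.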

## Available checkpoints

| Filename | Size variant | Metadata variant | `d_model` | `num_layers` | `dim_feedforward` | `nhead` |
|---|---|---|---:|---:|---:|---:|
| `small-notext.pt` | small | DNA-only | 20 | 3 | 80 | 5 |
| `small-text.pt` | small | DNA + text | 20 | 3 | 80 | 5 |
| `large-notext.pt` | large | DNA-only | 100 | 5 | 400 | 5 |
| `large-text.pt` | large | DNA + text | 100 | 5 | 400 | 5 |

Settings shared by all checkpoints:
- `OTU_EMB = 384`
- `TXT_EMB = 1536`
- `DROPOUT = 0.1`
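
For scripts that switch between checkpoints, the table can be turned into a small lookup. This is only a sketch: `CheckpointConfig` and `config_for` are hypothetical helper names, not part of the repository, and the parsing assumes the `<size>-<text|notext>.pt` naming shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointConfig:
    """Hyperparameters copied from the checkpoint table (hypothetical helper)."""
    d_model: int
    num_layers: int
    dim_feedforward: int
    nhead: int = 5
    otu_emb: int = 384
    txt_emb: int = 1536
    dropout: float = 0.1


_SIZES = {
    "small": CheckpointConfig(d_model=20, num_layers=3, dim_feedforward=80),
    "large": CheckpointConfig(d_model=100, num_layers=5, dim_feedforward=400),
}


def config_for(path):
    """Return (config, uses_text) for a name such as 'small-text.pt'."""
    stem, ext = os.path.splitext(os.path.basename(path))
    size, _, variant = stem.partition("-")
    if ext != ".pt" or size not in _SIZES or variant not in ("text", "notext"):
        raise ValueError(f"unrecognised checkpoint name: {path!r}")
    return _SIZES[size], variant == "text"


cfg, uses_text = config_for("large-text.pt")
print(cfg.d_model, cfg.num_layers, cfg.dim_feedforward, uses_text)
```

Matching on the parsed size prefix rather than a substring avoids misclassifying a large checkpoint stored under a directory whose name happens to contain "small".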

## Input expectations

1. Build a set of OTU embeddings (ProkBERT vectors) per sample.
2. For text-enabled variants, optionally build a set of text metadata embeddings per sample.
3. Pass both sets to the model as:
   - `embeddings_type1`: shape `(B, N_otu, 384)`
   - `embeddings_type2`: shape `(B, N_txt, 1536)`
   - `mask`: shape `(B, N_otu + N_txt)`, with valid positions set to `True`
   - `type_indicators`: shape `(B, N_otu + N_txt)` (0 for OTU tokens, 1 for text tokens)
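
The shapes listed above can be assembled from variable-sized per-sample sets roughly as follows. This is a sketch under stated assumptions: `build_batch` is a hypothetical helper, padding uses zeros, OTU tokens are assumed to come before text tokens along the sequence axis, and `type_indicators` is assumed to be an integer tensor. Check `model.py` for the conventions the model actually expects.

```python
import torch

OTU_EMB, TXT_EMB = 384, 1536


def build_batch(otu_sets, txt_sets):
    """Pad per-sample sets and build the mask and type indicators.

    otu_sets: list of (n_i, 384) tensors; txt_sets: list of (m_i, 1536) tensors.
    Assumption: OTU tokens first, then text tokens, along the sequence axis.
    """
    batch = len(otu_sets)
    n_otu = max(s.shape[0] for s in otu_sets)
    n_txt = max(s.shape[0] for s in txt_sets)

    emb1 = torch.zeros(batch, n_otu, OTU_EMB)
    emb2 = torch.zeros(batch, n_txt, TXT_EMB)
    mask = torch.zeros(batch, n_otu + n_txt, dtype=torch.bool)
    for i, (otu, txt) in enumerate(zip(otu_sets, txt_sets)):
        emb1[i, : otu.shape[0]] = otu
        emb2[i, : txt.shape[0]] = txt
        mask[i, : otu.shape[0]] = True                 # valid OTU positions
        mask[i, n_otu : n_otu + txt.shape[0]] = True   # valid text positions

    type_indicators = torch.cat(
        [
            torch.zeros(batch, n_otu, dtype=torch.long),  # 0 = OTU token
            torch.ones(batch, n_txt, dtype=torch.long),   # 1 = text token
        ],
        dim=1,
    )
    return emb1, emb2, mask, type_indicators


# Two samples with different set sizes (random stand-ins for real embeddings).
otu_sets = [torch.randn(5, OTU_EMB), torch.randn(3, OTU_EMB)]
txt_sets = [torch.randn(2, TXT_EMB), torch.randn(1, TXT_EMB)]
emb1, emb2, mask, type_indicators = build_batch(otu_sets, txt_sets)
print(emb1.shape, emb2.shape, mask.shape, type_indicators.shape)
```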

## Minimal loading example

```python
import torch
from model import MicrobiomeTransformer  # model.py from the repository linked above

ckpt_path = "large-notext.pt"  # or small-notext.pt / small-text.pt / large-text.pt
checkpoint = torch.load(ckpt_path, map_location="cpu")
# Accept either a wrapped checkpoint or a bare state dict.
state_dict = checkpoint.get("model_state_dict", checkpoint)

# The filename encodes the size variant; values match the checkpoint table.
is_small = "small" in ckpt_path
model = MicrobiomeTransformer(
    input_dim_type1=384,
    input_dim_type2=1536,
    d_model=20 if is_small else 100,
    nhead=5,
    num_layers=3 if is_small else 5,
    dim_feedforward=80 if is_small else 400,
    dropout=0.1,
)
# strict=False tolerates missing or unexpected keys instead of raising.
model.load_state_dict(state_dict, strict=False)
model.eval()  # disable dropout for inference
```
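
Because `strict=False` silently skips keys that do not line up, it can be worth inspecting what was actually loaded. `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`; the sketch below wraps that in a hypothetical `load_and_report` helper and demonstrates it on a toy `nn.Linear` rather than on `MicrobiomeTransformer`.

```python
import torch.nn as nn


def load_and_report(model, state_dict):
    """Load non-strictly and return (missing_keys, unexpected_keys)."""
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


# Toy demonstration with a stand-in module (assumption, not the real model).
toy = nn.Linear(4, 2)
partial = {
    "weight": toy.weight.detach().clone(),     # matches a real parameter
    "extra.bias": toy.bias.detach().clone(),   # key the module does not have
}
missing, unexpected = load_and_report(toy, partial)
print("missing:", missing)
print("unexpected:", unexpected)
```

If both lists are empty for a given checkpoint, the constructor arguments and the checkpoint agree on parameter names. Note that tensors whose names match but whose shapes differ still raise an error even with `strict=False`.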

## Intended use

- Microbiome representation learning from OTU sets
- Stability-style scoring of community members
- Downstream analyses such as dropout/colonization prediction and rollout trajectory experiments
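
As one illustration of stability-style scoring, per-token logits can be turned into a ranking of the OTUs in each sample. This is a sketch, not the repository's evaluation code: `rank_members` is a hypothetical helper, it assumes a higher logit means a more stable member, and it reuses the `mask` and `type_indicators` conventions from the input section to ignore padding and text tokens.

```python
import torch


def rank_members(logits, mask, type_indicators):
    """Per sample, return OTU token indices sorted from highest to lowest logit.

    logits, mask, type_indicators: tensors of shape (B, N).
    Assumption: higher logit = more stable community member.
    """
    is_otu = mask & (type_indicators == 0)               # valid OTU tokens only
    scores = logits.masked_fill(~is_otu, float("-inf"))  # push the rest to the end
    order = scores.argsort(dim=1, descending=True)
    counts = is_otu.sum(dim=1)
    return [order[i, : int(counts[i])].tolist() for i in range(logits.shape[0])]


# Made-up logits for one sample: three OTU tokens and one text token.
logits = torch.tensor([[0.1, 2.0, 5.0, -1.0]])
mask = torch.tensor([[True, True, True, True]])
type_indicators = torch.tensor([[0, 0, 1, 0]])
ranking = rank_members(logits, mask, type_indicators)
print(ranking)
```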

## Limitations

- This is a research model, not a clinical diagnostic tool.
- Outputs depend strongly on upstream OTU mapping, embedding resolution, and cohort preprocessing.
- Text-enabled checkpoints expect compatible metadata embedding pipelines.