# Microbiome Transformer (Set-Based OTU Stability Model)

This repository provides Transformer checkpoints for microbiome set modeling using SSU rRNA OTU embeddings (ProkBERT-derived vectors) and optional text metadata embeddings.

The accompanying code is available at https://github.com/the-puzzler/microbiome-model.

## Model summary

- **Architecture:** `MicrobiomeTransformer` (see `model.py`)
- **Input type 1 (DNA/OTU):** 384-d embeddings
- **Input type 2 (text metadata):** 1536-d embeddings
- **Core behavior:** permutation-invariant set encoding via a Transformer encoder (no positional encodings)
- **Output:** per-token scalar logits (used as stability scores)

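Because no positional encodings are used, token order within a set should not affect the scores. A minimal sketch of that property, using a plain `torch.nn.TransformerEncoder` as a stand-in (the layers here are illustrative, not the actual `model.py` architecture):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in set encoder: no positional encodings, so it is permutation-equivariant.
layer = nn.TransformerEncoderLayer(
    d_model=20, nhead=5, dim_feedforward=80, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=3)
encoder.eval()  # disable dropout for a deterministic comparison

x = torch.randn(1, 7, 20)  # (batch, set size, d_model)
perm = torch.randperm(7)   # shuffle the set

with torch.no_grad():
    out = encoder(x)
    out_perm = encoder(x[:, perm, :])

# Per-token outputs simply follow the permutation.
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))
```

Since the encoder has no notion of position, the check should print `True` up to floating-point noise.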
## Available checkpoints

| Filename | Size variant | Metadata variant | `d_model` | `num_layers` | `dim_feedforward` | `nhead` |
|---|---|---|---:|---:|---:|---:|
| `small-notext.pt` | small | DNA-only | 20 | 3 | 80 | 5 |
| `small-text.pt` | small | DNA + text | 20 | 3 | 80 | 5 |
| `large-notext.pt` | large | DNA-only | 100 | 5 | 400 | 5 |
| `large-text.pt` | large | DNA + text | 100 | 5 | 400 | 5 |

Shared dimensions:
- `OTU_EMB = 384`
- `TXT_EMB = 1536`
- `DROPOUT = 0.1`

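When checkpoint selection is scripted, the table above can be encoded in a small lookup. This helper is a convenience sketch, not part of the repository:

```python
# Hypothetical helper mapping checkpoint filenames to constructor kwargs.
# Size-variant values come from the table above; the shared dimensions
# (OTU_EMB, TXT_EMB, DROPOUT) apply to every variant.
SIZE_CONFIGS = {
    "small": dict(d_model=20, num_layers=3, dim_feedforward=80, nhead=5),
    "large": dict(d_model=100, num_layers=5, dim_feedforward=400, nhead=5),
}

def config_for(ckpt_name: str) -> dict:
    size = "small" if "small" in ckpt_name else "large"
    return dict(
        input_dim_type1=384,   # OTU_EMB
        input_dim_type2=1536,  # TXT_EMB
        dropout=0.1,           # DROPOUT
        **SIZE_CONFIGS[size],
    )

print(config_for("large-notext.pt")["d_model"])  # 100
```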
## Input expectations

1. Build a set of OTU embeddings (ProkBERT vectors) per sample.
2. Optionally build a set of text embeddings (metadata) per sample for text-enabled variants.
3. Feed both sets as:
   - `embeddings_type1`: shape `(B, N_otu, 384)`
   - `embeddings_type2`: shape `(B, N_txt, 1536)`
   - `mask`: shape `(B, N_otu + N_txt)` with valid positions as `True`
   - `type_indicators`: shape `(B, N_otu + N_txt)` (0 for OTU tokens, 1 for text tokens)

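The shapes above can be sketched with random tensors (dummy data only; real inputs come from a ProkBERT pipeline and, for text variants, a metadata embedding pipeline):

```python
import torch

B, N_otu, N_txt = 2, 50, 4  # batch size and set sizes, arbitrary for illustration

embeddings_type1 = torch.randn(B, N_otu, 384)   # OTU (ProkBERT) vectors
embeddings_type2 = torch.randn(B, N_txt, 1536)  # text metadata vectors

# All positions valid here; set padding positions to False in real batches.
mask = torch.ones(B, N_otu + N_txt, dtype=torch.bool)

# 0 marks OTU tokens, 1 marks text tokens, in concatenation order.
type_indicators = torch.cat(
    [torch.zeros(B, N_otu, dtype=torch.long),
     torch.ones(B, N_txt, dtype=torch.long)],
    dim=1,
)

print(mask.shape, type_indicators.shape)  # torch.Size([2, 54]) torch.Size([2, 54])
```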
## Minimal loading example

```python
import torch
from model import MicrobiomeTransformer

ckpt_path = "large-notext.pt"  # or small-notext.pt / small-text.pt / large-text.pt
checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("model_state_dict", checkpoint)

is_small = "small" in ckpt_path
model = MicrobiomeTransformer(
    input_dim_type1=384,
    input_dim_type2=1536,
    d_model=20 if is_small else 100,
    nhead=5,
    num_layers=3 if is_small else 5,
    dim_feedforward=80 if is_small else 400,
    dropout=0.1,
)
model.load_state_dict(state_dict, strict=False)
model.eval()
```

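With a model loaded as above, inference reduces to one forward call over the documented inputs. The following sketch uses a hypothetical stand-in module so it runs self-contained; the keyword signature (`embeddings_type1`, `embeddings_type2`, `mask`, `type_indicators`) is an assumption based on the input names above, and the real `forward()` in the linked repository's `model.py` may differ:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in with the assumed interface (it ignores type_indicators);
# substitute the real MicrobiomeTransformer from model.py in practice.
class StandInModel(nn.Module):
    def __init__(self, d_model: int = 20):
        super().__init__()
        self.proj1 = nn.Linear(384, d_model)   # project OTU tokens
        self.proj2 = nn.Linear(1536, d_model)  # project text tokens
        self.head = nn.Linear(d_model, 1)      # per-token scalar logit

    def forward(self, embeddings_type1, embeddings_type2, mask, type_indicators):
        tokens = torch.cat([self.proj1(embeddings_type1),
                            self.proj2(embeddings_type2)], dim=1)
        return self.head(tokens).squeeze(-1)   # (B, N_otu + N_txt)

model = StandInModel()
emb1 = torch.randn(1, 30, 384)
emb2 = torch.randn(1, 2, 1536)
mask = torch.ones(1, 32, dtype=torch.bool)
types = torch.cat([torch.zeros(1, 30, dtype=torch.long),
                   torch.ones(1, 2, dtype=torch.long)], dim=1)

with torch.no_grad():
    logits = model(embeddings_type1=emb1, embeddings_type2=emb2,
                   mask=mask, type_indicators=types)

print(logits.shape)  # torch.Size([1, 32]) — one stability logit per token
```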
## Intended use

- Microbiome representation learning from OTU sets
- Stability-style scoring of community members
- Downstream analyses such as dropout/colonization prediction and rollout trajectory experiments

## Limitations

- This is a research model and not a clinical diagnostic tool.
- Outputs depend strongly on upstream OTU mapping, embedding resolution, and cohort preprocessing.
- Text-enabled checkpoints expect compatible metadata embedding pipelines.