# Gemeo training pipelines (Phase 2)
These scaffolds turn the bootstrap `gemeo/` runtime into a state-of-the-art
learned digital twin. Each script is **self-contained** and produces one
checkpoint that the runtime auto-discovers.
## Prerequisites
```bash
pip install torch torch_geometric tqdm
# optional, for TxGNN starter:
pip install pyhealth
```
A GPU is strongly recommended (A100 or RTX 4090). Training fits in 24 GB of
VRAM with the default batch sizes.
## Pipeline
```
primekg.py → data/primekg.pt (~5 GB once)
hgt.py → gemeo/artifacts/hgt_patient_encoder.pt
txgnn.py → gemeo/artifacts/txgnn.pt
tgnn.py → gemeo/artifacts/tgnn_trajectory.pt
neuralsurv.py → gemeo/artifacts/neuralsurv.pt
```
The runtime checks each artifact path at call time; if a checkpoint is
missing, it falls back to the bootstrap path, so nothing breaks.
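
The check-then-fall-back behaviour can be pictured with a minimal sketch like the one below. It is an illustration, not the runtime's actual code: `ARTIFACTS`, `resolve_backend`, and `artifact_status` are hypothetical names, and only the four artifact paths are taken from the pipeline listing.

```python
from pathlib import Path

# Artifact paths copied from the pipeline listing; the dict keys are
# hypothetical labels chosen for this sketch.
ARTIFACTS = {
    "hgt": Path("gemeo/artifacts/hgt_patient_encoder.pt"),
    "txgnn": Path("gemeo/artifacts/txgnn.pt"),
    "tgnn": Path("gemeo/artifacts/tgnn_trajectory.pt"),
    "neuralsurv": Path("gemeo/artifacts/neuralsurv.pt"),
}


def resolve_backend(name: str, root: Path = Path(".")) -> str:
    """Return "learned" if the checkpoint file exists under root, else "bootstrap"."""
    return "learned" if (root / ARTIFACTS[name]).is_file() else "bootstrap"


def artifact_status(root: Path = Path(".")) -> dict:
    """Map every known artifact to the backend that would be used right now."""
    return {name: resolve_backend(name, root) for name in ARTIFACTS}


if __name__ == "__main__":
    # Run from the repository root to see which models are already trained.
    for name, backend in artifact_status().items():
        print(f"{name:12s} {backend}")
```

Because the check happens on each call rather than at import time, a checkpoint written by a training script would be picked up without restarting the process, assuming the real runtime behaves the same way.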
## Datasets
| Source | Use | License |
|---|---|---|
| PrimeKG (Harvard) | KG backbone for HGT/TxGNN | MIT |
| HPO + HPO Annotation | phenotype hierarchy + disease annotations | CC-BY |
| Orphanet (XML) | rare disease ontology | CC-BY |
| ClinicalTrials.gov | trial features | public domain |
| `gemeo/feedback.jsonl` | active-learning labels from production | private |
| RareBench / RareBench-BR | held-out evaluation | varies |
## Citation
If you use any of these checkpoints, cite:
> Timmers D, Kawassaki A. *Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS.* Raras, 2026.