# Gemeo training pipelines (Phase 2)

These scaffolds turn the bootstrap `gemeo/` runtime into a state-of-the-art
learned digital twin. Each script is **self-contained** and produces one
checkpoint that the runtime auto-discovers.

## Prerequisites

```bash
pip install torch torch_geometric tqdm
# optional, for the TxGNN starter:
pip install pyhealth
```

A GPU is strongly recommended (A100 or RTX 4090). Training with the default
batch sizes fits in 24 GB of VRAM.
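
Before launching a training script, it can help to confirm that the packages above are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED`/`OPTIONAL` lists are illustrative and not part of the Gemeo codebase.

```python
from importlib.util import find_spec

# Package names taken from the pip commands above.
REQUIRED = ["torch", "torch_geometric", "tqdm"]
OPTIONAL = ["pyhealth"]  # only needed for the TxGNN starter


def missing_packages(names):
    """Return the names that cannot be found in the current environment.

    Hypothetical helper: find_spec() returns None for a top-level module
    that is not installed, without importing the module itself.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```

An empty list for `REQUIRED` means the core dependencies are in place; this does not check GPU availability or VRAM.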

## Pipeline

```
primekg.py    → data/primekg.pt (~5 GB once)
hgt.py        → gemeo/artifacts/hgt_patient_encoder.pt
txgnn.py      → gemeo/artifacts/txgnn.pt
tgnn.py       → gemeo/artifacts/tgnn_trajectory.pt
neuralsurv.py → gemeo/artifacts/neuralsurv.pt
```

The runtime checks each artifact path at call time; if a checkpoint is missing,
it falls back to the bootstrap path, so nothing breaks.
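
The check-then-fall-back behaviour can be pictured roughly as follows. This is a sketch under assumptions, not the runtime's actual code: the `resolve` helper, its parameters, and the `ARTIFACTS` mapping are hypothetical names, and only the artifact paths come from the listing above.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

# Artifact paths as listed in the pipeline overview.
ARTIFACTS = {
    "hgt": "gemeo/artifacts/hgt_patient_encoder.pt",
    "txgnn": "gemeo/artifacts/txgnn.pt",
    "tgnn": "gemeo/artifacts/tgnn_trajectory.pt",
    "neuralsurv": "gemeo/artifacts/neuralsurv.pt",
}


def resolve(
    name: str,
    load_learned: Callable[[Path], T],
    bootstrap: Callable[[], T],
    root: Path = Path("."),
) -> T:
    """Hypothetical helper: use the learned checkpoint if present, else bootstrap."""
    path = root / ARTIFACTS[name]
    # The path is checked on every call, so a checkpoint written after
    # startup is used the next time this function runs.
    if path.is_file():
        return load_learned(path)
    # No checkpoint on disk: use the bootstrap implementation instead.
    return bootstrap()
```

In practice `load_learned` would wrap whatever deserialization the runtime uses (for example `torch.load`), and `bootstrap` would return the existing non-learned implementation.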

## Datasets

| Source | Use | License |
|---|---|---|
| PrimeKG (Harvard) | KG backbone for HGT/TxGNN | MIT |
| HPO + HPO Annotation | phenotype hierarchy + disease annotations | CC-BY |
| Orphanet (XML) | rare disease ontology | CC-BY |
| ClinicalTrials.gov | trial features | public domain |
| `gemeo/feedback.jsonl` | active-learning labels from production | private |
| RareBench / RareBench-BR | held-out evaluation | varies |

## Citation

If you use any of these checkpoints, cite:

> Timmers D, Kawassaki A. *Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS.* Raras, 2026.