# Gemeo training pipelines (Phase 2)

These scaffolds turn the bootstrap `gemeo/` runtime into a SOTA learned digital twin. Each script is **self-contained** and produces one checkpoint that the runtime auto-discovers.

## Prerequisites

```bash
pip install torch torch_geometric tqdm
# optional, for TxGNN starter:
pip install pyhealth
```

GPU strongly recommended (A100 or RTX 4090). Fits in 24 GB VRAM with the default batch sizes.

## Pipeline

```
primekg.py     → data/primekg.pt   (~5 GB once)
hgt.py         → gemeo/artifacts/hgt_patient_encoder.pt
txgnn.py       → gemeo/artifacts/txgnn.pt
tgnn.py        → gemeo/artifacts/tgnn_trajectory.pt
neuralsurv.py  → gemeo/artifacts/neuralsurv.pt
```

The runtime checks each artifact path on call; if missing, falls back to the bootstrap path (no breakage).

## Datasets

| Source | Use | License |
|---|---|---|
| PrimeKG (Harvard) | KG backbone for HGT/TxGNN | MIT |
| HPO + HPO Annotation | phenotype hierarchy + disease annotations | CC-BY |
| Orphanet (XML) | rare disease ontology | CC-BY |
| ClinicalTrials.gov | trial features | public domain |
| `gemeo/feedback.jsonl` | active-learning labels from production | private |
| RareBench / RareBench-BR | held-out evaluation | varies |

## Citation

If you use any of these checkpoints, cite:

> Timmers D, Kawassaki A. *Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS.* Raras, 2026.
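## Example: artifact fallback (sketch)

The Pipeline section says the runtime checks each artifact path on call and falls back to the bootstrap path when the checkpoint is missing. The snippet below is a minimal sketch of that pattern, not the actual `gemeo/` implementation: `resolve_artifact`, `run_with_fallback`, and the `learned` / `bootstrap` callables are hypothetical names introduced here for illustration. Only the `gemeo/artifacts/` directory and the checkpoint file names come from the pipeline listing.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Directory taken from the pipeline listing; everything else here is hypothetical.
ARTIFACT_DIR = Path("gemeo/artifacts")


def resolve_artifact(name: str, root: Path = ARTIFACT_DIR) -> Optional[Path]:
    """Return the checkpoint path if the file exists, otherwise None."""
    path = root / name
    return path if path.is_file() else None


def run_with_fallback(
    name: str,
    learned: Callable[[Path], T],
    bootstrap: Callable[[], T],
    root: Path = ARTIFACT_DIR,
) -> T:
    """Use the learned path when the checkpoint is present, else the bootstrap path.

    The check happens on every call, so a checkpoint dropped into the
    artifact directory is picked up without restarting the process.
    """
    path = resolve_artifact(name, root)
    if path is None:
        return bootstrap()
    return learned(path)


if __name__ == "__main__":
    result = run_with_fallback(
        "neuralsurv.pt",
        learned=lambda p: f"learned model from {p}",
        bootstrap=lambda: "bootstrap heuristic",
    )
    print(result)
```

In a real runtime the `learned` callable would presumably load the checkpoint (for example with `torch.load`) and run inference; it is left abstract here so the sketch runs without PyTorch installed.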