Gemeo training pipelines (Phase 2)

These scaffolds turn the bootstrap `gemeo/` runtime into a state-of-the-art learned digital twin. Each script is self-contained and produces one checkpoint, which the runtime discovers automatically.

Prerequisites

pip install torch torch_geometric tqdm
# optional, for TxGNN starter:
pip install pyhealth

A GPU is strongly recommended (A100 or RTX 4090). Training fits in 24 GB of VRAM with the default batch sizes.
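Before launching a training script, it can help to confirm that the packages listed above are importable and to see how much GPU memory is visible. The following is a minimal sketch, not part of the Gemeo codebase: the helper names `missing_packages` and `gpu_memory_gb` are hypothetical, and the only assumption about the environment is that the pip packages above expose the import names `torch`, `torch_geometric`, `tqdm`, and `pyhealth`.

```python
import importlib.util
from typing import Iterable, List, Optional

# Import names for the packages listed under Prerequisites.
REQUIRED = ["torch", "torch_geometric", "tqdm"]
OPTIONAL = ["pyhealth"]  # only needed for the TxGNN starter


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def gpu_memory_gb() -> Optional[float]:
    """Total memory of CUDA device 0 in GiB, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
    print("GPU memory (GiB):", gpu_memory_gb())
```

Note that a card marketed as 24 GB typically reports slightly less than 24 GiB, so the value is printed for inspection rather than compared against a hard threshold.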

Pipeline

primekg.py    → data/primekg.pt        (~5 GB once)
hgt.py        → gemeo/artifacts/hgt_patient_encoder.pt
txgnn.py      → gemeo/artifacts/txgnn.pt
tgnn.py       → gemeo/artifacts/tgnn_trajectory.pt
neuralsurv.py → gemeo/artifacts/neuralsurv.pt

The runtime checks each artifact path at call time; if an artifact is missing, it falls back to the bootstrap path, so nothing breaks.
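The check-then-fall-back behavior can be sketched as follows. This is an illustration under stated assumptions, not the runtime's actual code: `find_artifact`, `run_with_fallback`, and the short keys in `ARTIFACTS` are hypothetical names, and only the checkpoint paths come from the pipeline listing above.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Checkpoint paths as listed in the pipeline; the dict keys are hypothetical.
ARTIFACTS = {
    "hgt": "gemeo/artifacts/hgt_patient_encoder.pt",
    "txgnn": "gemeo/artifacts/txgnn.pt",
    "tgnn": "gemeo/artifacts/tgnn_trajectory.pt",
    "neuralsurv": "gemeo/artifacts/neuralsurv.pt",
}


def find_artifact(name: str, root: Path = Path(".")) -> Optional[Path]:
    """Return the checkpoint path if the file exists under root, else None."""
    path = root / ARTIFACTS[name]
    return path if path.is_file() else None


def run_with_fallback(
    name: str,
    learned: Callable[[Path], T],
    bootstrap: Callable[[], T],
    root: Path = Path("."),
) -> T:
    """Call learned(path) when the checkpoint exists, otherwise bootstrap()."""
    path = find_artifact(name, root)
    if path is None:
        return bootstrap()
    return learned(path)
```

Because the lookup happens on every call, a checkpoint written by a training script is picked up the next time the function runs, without restarting the process. In a real runtime, `learned` would typically load the checkpoint (for example with `torch.load`) and cache the model rather than reloading it each time.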

Datasets

| Source | Use | License |
|---|---|---|
| PrimeKG (Harvard) | KG backbone for HGT/TxGNN | MIT |
| HPO + HPO Annotation | phenotype hierarchy + disease annotations | CC-BY |
| Orphanet (XML) | rare disease ontology | CC-BY |
| ClinicalTrials.gov | trial features | public domain |
| `gemeo/feedback.jsonl` | active-learning labels from production | private |
| RareBench / RareBench-BR | held-out evaluation | varies |

Citation

If you use any of these checkpoints, cite:

Timmers D, Kawassaki A. Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS. Raras, 2026.