
Gemeo

SOTA digital twin module for rare disease patients, grounded in Brazil's public health system, the SUS (Sistema Único de Saúde). It is a learned, graph-native, continuously evolving twin that fuses a Heterogeneous Graph Transformer over PrimeKG with the country's public-health constraints.

gemeo = patient embedding
      + cohort retrieval (patients-like-mine)
      + reasoning subgraph (KG sparsification)
      + trajectory (TGNN over snapshot chains)
      + risk + survival (NeuralSurv)
      + drug repurposing (TxGNN fine-tuned)
      + active learning (info-gain on KG)
      + counterfactual (what-if engine)
      + SUS grounding (PCDT/CEAF/UF)
      + feedback loop
      + viz payload

Installation

Gemeo is already part of rarasnet-swarm-py and is auto-mounted in main.py at /api/gemeo/*.

Optional Phase-2 training:

pip install torch_geometric tqdm
python -m gemeo.train.primekg
python -m gemeo.train.hgt

Quickstart

from gemeo import build_gemeo, what_if

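# build_gemeo is awaited, so run this inside an async function or an asyncio-aware REPL.
twin = await build_gemeo(
    # Portuguese case text: "Boy, 5 years old, progressive ataxia, telangiectasia, elevated AFP."
    case_text="Menino, 5 anos, ataxia progressiva, telangiectasia, AFP elevado.",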
    patient_info={"age": 5, "sex": "M"},
    context={"sus_region": "SP"},
)

twin.diagnoses[:3]              # top hypotheses (ranked)
twin.cohort.members[:5]          # patients-like-mine
twin.subgraph                    # reasoning subgraph
twin.trajectory.horizons         # 6/12/24m predictions
twin.risk.survival_curve         # months → P(alive)
twin.drugs.candidates[:3]        # repurposing
twin.next_questions[:3]          # active learning
twin.sus_check.pcdt_url          # PCDT compliance
twin.viz_data                    # ready for react-force-graph

API endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | /api/gemeo/build | create twin from case |
| GET | /api/gemeo/{id} | full twin |
| POST | /api/gemeo/{id}/evolve | add new clinical data |
| POST | /api/gemeo/{id}/whatif | counterfactual |
| POST | /api/gemeo/{id}/feedback | record correction |
| GET | /api/gemeo/{id}/{cohort,subgraph,trajectory,risk,drugs,trials,next-questions,sus,viz} | per-capability getters |
| GET | /api/gemeo/health | bridge + feedback stats |
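
As a quick orientation to the GET routes listed above, here is a minimal sketch of a URL builder. The base URL, the example twin id, and the helper name `gemeo_url` are assumptions made for illustration. Request and response bodies are not specified in this README, so the sketch stops at building paths.

```python
from typing import Optional
from urllib.parse import quote

# Capability names copied from the per-capability getter row above.
CAPABILITIES = (
    "cohort", "subgraph", "trajectory", "risk", "drugs",
    "trials", "next-questions", "sus", "viz",
)


def gemeo_url(base_url: str, twin_id: str, capability: Optional[str] = None) -> str:
    """Build the GET URL for a full twin, or for one capability view of it.

    Hypothetical helper: it only assembles the documented path shapes.
    """
    root = f"{base_url.rstrip('/')}/api/gemeo/{quote(twin_id, safe='')}"
    if capability is None:
        return root
    if capability not in CAPABILITIES:
        raise ValueError(f"unknown capability: {capability!r}")
    return f"{root}/{capability}"


# Assumed host/port and twin id, purely for illustration.
print(gemeo_url("http://localhost:8000", "twin-123", "viz"))
```

Validating the capability name on the client side keeps a typo from turning into a confusing 404.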

Architecture

Two-tier:

  • Bootstrap (today): wraps existing swarm-py modules and raras-app artifacts. Everything works on day 1, with no training needed.
  • Phase-2 SOTA (training): gemeo/train/ scaffolds for HGT, TxGNN, TGNN, NeuralSurv, and CF-GNN. When checkpoints land in gemeo/artifacts/, the runtime auto-detects them and overrides the bootstrap paths.
                  ┌──────────────────────────────┐
                  │  raras-app                   │
                  │  data/graph-ml/*.npz         │ ← read-only via gemeo.bridge
                  │  Patient.embedding (Neo4j)   │
                  │  /grafo (force-graph)        │ ← consumes /api/gemeo/{id}/viz
                  └──────────────┬───────────────┘
                                 │
                  ┌──────────────▼───────────────┐
                  │  gemeo (this module)         │
                  │                              │
                  │  bridge.py   ── load .npz    │
                  │  encoder.py  ── HGT or boot  │
                  │  cohort.py   ── kNN+graph    │
                  │  subgraph.py ── KG sparsify  │
                  │  trajectory  ── TGNN or LLM  │
                  │  risk.py     ── NeuralSurv   │
                  │  repurpose   ── TxGNN+SUS    │
                  │  whatif.py   ── CF-GNN       │
                  │  ask.py      ── info-gain    │
                  │  ground_sus  ── PCDT/UF      │
                  │  feedback    ── jsonl ledger │
                  │  viz.py      ── force-graph  │
                  │  core.py     ── orchestrator │
                  │  api.py      ── FastAPI      │
                  └──────────────┬───────────────┘
                                 │
                  ┌──────────────▼───────────────┐
                  │  swarm-py existing infra     │
                  │  digital_twin_workflow       │
                  │  patient_space (KG)          │
                  │  trajectory_engine, risk_qua │
                  │  drug_repurposer,            │
                  │  trial_matcher,              │
                  │  brazilian_context           │
                  └──────────────────────────────┘
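
The two-tier switch described above (checkpoints in gemeo/artifacts/ override the bootstrap paths) can be pictured roughly as follows. This is a sketch of the idea only, not the module's actual loader: the checkpoint file names and the `resolve_backends` helper are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical checkpoint file names; the real artifact names may differ.
CHECKPOINTS = {"encoder": "hgt.pt", "risk": "neuralsurv.pt"}


def resolve_backends(artifacts_dir: Path) -> dict:
    """Pick "learned" for each capability whose checkpoint file exists, else "bootstrap"."""
    return {
        capability: "learned" if (artifacts_dir / filename).is_file() else "bootstrap"
        for capability, filename in CHECKPOINTS.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp)
    print(resolve_backends(artifacts))   # every capability falls back to bootstrap
    (artifacts / "hgt.pt").touch()       # simulate a trained checkpoint landing
    print(resolve_backends(artifacts))   # encoder switches to learned, risk stays bootstrap
```

Resolving per capability rather than globally would let a single trained model go live while the rest keep their bootstrap behaviour.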

What's bootstrap vs. learned

| Capability | Bootstrap (works today) | Phase-2 SOTA |
|---|---|---|
| Patient embedding | Weighted mean of fused-768/3072-dim disease+HPO+gene embeddings (matches raras-app) | HGT trained on PrimeKG with disease link-pred + patient contrastive losses |
| Cohort | Neo4j vector kNN + Cypher overlap | same retrieval, learned embedding |
| Subgraph | Cypher 1-hop sparsification | KG sparsification trained on diagnostic outcomes |
| Trajectory | LLM over disease natural history | TRANS-style TGNN over snapshot chains |
| Risk / survival | Rule-based severity → exponential survival | NeuralSurv Bayesian survival on KG-walk features |
| Drug repurposing | KG walks Disease→Gene→Drug | TxGNN fine-tuned on PrimeKG + SUS auxiliary head |
| What-if | Heuristic: mutate snapshot, re-run | CF-GNNExplainer + do-calculus |
| Active learning | Info-gain over KG annotation frequencies | Bayesian acquisition over learned dx posterior |
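
Two of the bootstrap rows are simple enough to sketch directly: the weighted-mean patient embedding and the rule-based exponential survival curve. The code below is an illustration under stated assumptions, not Gemeo's implementation. The weights, the severity labels, and the hazard values are hypothetical placeholders with no clinical meaning, and the 6/12/24-month horizons are borrowed from the Quickstart comment.

```python
import math


def weighted_mean(vectors, weights):
    """Weighted mean of equal-length embedding vectors (plain lists of floats)."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


# Hypothetical severity -> monthly hazard table; illustrative numbers only.
HAZARD_PER_MONTH = {"mild": 0.002, "moderate": 0.01, "severe": 0.03}


def survival_curve(severity, horizons=(6, 12, 24)):
    """Exponential survival S(t) = exp(-hazard * t), returned as {month: P(alive)}."""
    hazard = HAZARD_PER_MONTH[severity]
    return {t: math.exp(-hazard * t) for t in horizons}


# Toy 2-dim vectors standing in for the fused disease/HPO/gene embeddings.
print(weighted_mean([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]))  # [0.75, 0.25]
print(survival_curve("severe"))
```

A constant hazard gives a curve that only ever decays at one rate, which is presumably why the Phase-2 column replaces it with a learned survival model.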

Citation

Timmers D, Kawassaki A. Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS. Raras, 2026.