# Gemeo
> **SOTA digital twin module for rare disease patients, grounded in Brazilian SUS.**
> A learned, graph-native, continuously-evolving twin that fuses a Heterogeneous
> Graph Transformer over PrimeKG with the country's public-health constraints.
```
gemeo = patient embedding
      + cohort retrieval       (patients-like-mine)
      + reasoning subgraph     (KG sparsification)
      + trajectory             (TGNN over snapshot chains)
      + risk + survival        (NeuralSurv)
      + drug repurposing       (TxGNN fine-tuned)
      + active learning        (info-gain on KG)
      + counterfactual         (what-if engine)
      + SUS grounding          (PCDT/CEAF/UF)
      + feedback loop
      + viz payload
```
## Installation
Gemeo ships as part of `rarasnet-swarm-py` and is auto-mounted in `main.py` at `/api/gemeo/*`.
Optional Phase-2 training:
```bash
pip install torch_geometric tqdm
python -m gemeo.train.primekg
python -m gemeo.train.hgt
```
## Quickstart
```python
import asyncio

from gemeo import build_gemeo, what_if


async def main():
    # build_gemeo is awaited, so it needs a running event loop.
    twin = await build_gemeo(
        # Case text in Portuguese: "Boy, 5 years old, progressive ataxia,
        # telangiectasia, elevated AFP."
        case_text="Menino, 5 anos, ataxia progressiva, telangiectasia, AFP elevado.",
        patient_info={"age": 5, "sex": "M"},
        context={"sus_region": "SP"},
    )

    twin.diagnoses[:3]           # top hypotheses (ranked)
    twin.cohort.members[:5]      # patients-like-mine
    twin.subgraph                # reasoning subgraph
    twin.trajectory.horizons     # 6/12/24m predictions
    twin.risk.survival_curve     # months → P(alive)
    twin.drugs.candidates[:3]    # repurposing
    twin.next_questions[:3]      # active learning
    twin.sus_check.pcdt_url      # PCDT compliance
    twin.viz_data                # ready for react-force-graph


asyncio.run(main())
```
## API endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | `/api/gemeo/build` | create twin from case |
| GET | `/api/gemeo/{id}` | full twin |
| POST | `/api/gemeo/{id}/evolve` | add new clinical data |
| POST | `/api/gemeo/{id}/whatif` | counterfactual |
| POST | `/api/gemeo/{id}/feedback` | record correction |
| GET | `/api/gemeo/{id}/{cohort,subgraph,trajectory,risk,drugs,trials,next-questions,sus,viz}` | per-capability getters |
| GET | `/api/gemeo/health` | bridge + feedback stats |
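
As a rough illustration of calling the build endpoint over HTTP, the sketch below prepares a POST request with the Python standard library. It assumes the JSON body uses the same keys as the `build_gemeo` keyword arguments from the Quickstart, which this README does not confirm. The base URL `http://localhost:8000` and the helper name `make_build_request` are hypothetical.

```python
import json
import urllib.request

# Assumption: the service runs locally on port 8000; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def make_build_request(case_text, patient_info, context, base_url=BASE_URL):
    """Prepare (but do not send) a POST to /api/gemeo/build.

    Assumption: the JSON body mirrors the build_gemeo() keyword arguments
    shown in the Quickstart; the real request schema may differ.
    """
    payload = {
        "case_text": case_text,
        "patient_info": patient_info,
        "context": context,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/gemeo/build",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = make_build_request(
    case_text="Menino, 5 anos, ataxia progressiva, telangiectasia, AFP elevado.",
    patient_info={"age": 5, "sex": "M"},
    context={"sus_region": "SP"},
)
# To send it: urllib.request.urlopen(req), then JSON-decode the response body.
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the server.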
## Architecture
Two-tier:
- **Bootstrap (today):** wraps existing swarm-py modules + raras-app artifacts.
  Everything works on day-1, no training needed.
- **Phase-2 SOTA (training):** `gemeo/train/` scaffolds for HGT, TxGNN,
  TGNN, NeuralSurv, CF-GNN. When checkpoints land in `gemeo/artifacts/`,
  the runtime auto-detects and overrides bootstrap paths.
```
┌────────────────────────────────┐
│ raras-app                      │
│   data/graph-ml/*.npz          │ ← read-only via gemeo.bridge
│   Patient.embedding (Neo4j)    │
│   /grafo (force-graph)         │ ← consumes /api/gemeo/{id}/viz
└───────────────┬────────────────┘
                │
┌───────────────▼────────────────┐
│ gemeo (this module)            │
│                                │
│   bridge.py   ── load .npz     │
│   encoder.py  ── HGT or boot   │
│   cohort.py   ── kNN+graph     │
│   subgraph.py ── KG sparsify   │
│   trajectory  ── TGNN or LLM   │
│   risk.py     ── NeuralSurv    │
│   repurpose   ── TxGNN+SUS     │
│   whatif.py   ── CF-GNN        │
│   ask.py      ── info-gain     │
│   ground_sus  ── PCDT/UF       │
│   feedback    ── jsonl ledger  │
│   viz.py      ── force-graph   │
│   core.py     ── orchestrator  │
│   api.py      ── FastAPI       │
└───────────────┬────────────────┘
                │
┌───────────────▼────────────────┐
│ swarm-py existing infra        │
│   digital_twin_workflow        │
│   patient_space (KG)           │
│   trajectory_engine, risk_qua  │
│   drug_repurposer, trial_      │
│   matcher, brazilian_context   │
└────────────────────────────────┘
```
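
The Phase-2 override described above, where the runtime switches from bootstrap to learned paths once checkpoints appear in `gemeo/artifacts/`, could look roughly like the sketch below. This is not the module's actual loader: the checkpoint file names and the `select_backends` helper are hypothetical, and only the directory-probing idea comes from the description.

```python
from pathlib import Path
import tempfile

# Assumption: one checkpoint file per capability; these file names are hypothetical.
CHECKPOINTS = {
    "encoder": "hgt.pt",
    "trajectory": "tgnn.pt",
    "risk": "neuralsurv.pt",
}


def select_backends(artifacts_dir):
    """Return 'learned' for capabilities whose checkpoint file exists, else 'bootstrap'."""
    root = Path(artifacts_dir)
    return {
        name: "learned" if (root / filename).is_file() else "bootstrap"
        for name, filename in CHECKPOINTS.items()
    }


# Demo in a temporary directory standing in for gemeo/artifacts/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hgt.pt").write_bytes(b"")  # pretend the HGT checkpoint landed
    backends = select_backends(tmp)
```

Probing per capability means a partially trained stack still works: anything without a checkpoint stays on its bootstrap path.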
## What's bootstrap vs. learned
| Capability | Bootstrap (works today) | Phase-2 SOTA |
|---|---|---|
| **Patient embedding** | Weighted mean of fused-768/3072-dim disease+HPO+gene embeddings (matches raras-app) | HGT trained on PrimeKG with disease link-pred + patient contrastive losses |
| **Cohort** | Neo4j vector kNN + Cypher overlap | same retrieval, learned embedding |
| **Subgraph** | Cypher 1-hop sparsification | KG sparsification trained on diagnostic outcomes |
| **Trajectory** | LLM over disease natural history | TRANS-style TGNN over snapshot chains |
| **Risk / survival** | Rule-based severity → exponential survival | NeuralSurv Bayesian survival on KG-walk features |
| **Drug repurposing** | KG walks Disease→Gene→Drug | TxGNN fine-tuned on PrimeKG + SUS auxiliary head |
| **What-if** | Heuristic: mutate snapshot, re-run | CF-GNNExplainer + do-calculus |
| **Active learning** | Info-gain over KG annotation frequencies | Bayesian acquisition over learned dx posterior |
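
To make two of the bootstrap rows concrete, here is a minimal sketch of a weighted-mean patient embedding and a rule-based exponential survival curve, $S(t) = e^{-\lambda t}$. The function names, the severity classes, and the hazard rates are illustrative assumptions, not values taken from the module or from clinical data.

```python
import math


def weighted_mean_embedding(vectors, weights):
    """Weighted mean of equal-length vectors (e.g. disease, HPO, gene embeddings)."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Assumption: illustrative monthly hazard rates per severity class, not clinical values.
HAZARD_PER_MONTH = {"mild": 0.002, "moderate": 0.01, "severe": 0.03}


def exponential_survival(severity, horizons_months=(6, 12, 24)):
    """S(t) = exp(-lambda * t), with a constant hazard picked by a severity rule."""
    lam = HAZARD_PER_MONTH[severity]
    return {t: math.exp(-lam * t) for t in horizons_months}


embedding = weighted_mean_embedding([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
curve = exponential_survival("severe")  # maps months to P(alive)
```

A constant hazard is the simplest survival model available, which fits a bootstrap tier that needs no training. The Phase-2 column replaces it with a learned model.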
## Citation
Timmers D, Kawassaki A. *Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS.* Raras, 2026.