# Gemeo

> **SOTA digital twin module for rare disease patients, grounded in Brazilian SUS.**
> A learned, graph-native, continuously-evolving twin that fuses a Heterogeneous
> Graph Transformer over PrimeKG with the country's public-health constraints.

```
gemeo = patient embedding
      + cohort retrieval (patients-like-mine)
      + reasoning subgraph (KG sparsification)
      + trajectory (TGNN over snapshot chains)
      + risk + survival (NeuralSurv)
      + drug repurposing (TxGNN fine-tuned)
      + active learning (info-gain on KG)
      + counterfactual (what-if engine)
      + SUS grounding (PCDT/CEAF/UF)
      + feedback loop
      + viz payload
```

## Installation

Gemeo is already part of `rarasnet-swarm-py` and is auto-mounted in `main.py` at `/api/gemeo/*`, so no separate install is needed.

Optional Phase-2 training:
```bash
pip install torch_geometric tqdm
python -m gemeo.train.primekg
python -m gemeo.train.hgt
```

## Quickstart

```python
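# Top-level `await` needs an async context, e.g. a notebook or `python -m asyncio`.
from gemeo import build_gemeo, what_if

twin = await build_gemeo(
    # "Boy, 5 years old, progressive ataxia, telangiectasia, elevated AFP."
    case_text="Menino, 5 anos, ataxia progressiva, telangiectasia, AFP elevado.",
    patient_info={"age": 5, "sex": "M"},
    context={"sus_region": "SP"},
)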

twin.diagnoses[:3]              # top hypotheses (ranked)
twin.cohort.members[:5]          # patients-like-mine
twin.subgraph                    # reasoning subgraph
twin.trajectory.horizons         # 6/12/24m predictions
twin.risk.survival_curve         # months → P(alive)
twin.drugs.candidates[:3]        # repurposing
twin.next_questions[:3]          # active learning
twin.sus_check.pcdt_url          # PCDT compliance
twin.viz_data                    # ready for react-force-graph
```

## API endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | `/api/gemeo/build` | create twin from case |
| GET  | `/api/gemeo/{id}` | full twin |
| POST | `/api/gemeo/{id}/evolve` | add new clinical data |
| POST | `/api/gemeo/{id}/whatif` | counterfactual |
| POST | `/api/gemeo/{id}/feedback` | record correction |
| GET  | `/api/gemeo/{id}/{cohort,subgraph,trajectory,risk,drugs,trials,next-questions,sus,viz}` | per-capability getters |
| GET  | `/api/gemeo/health` | bridge + feedback stats |
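
As a rough illustration of the table above, the build endpoint could be called from Python with only the standard library. This is a minimal sketch: the base URL, the helper name `make_build_request`, and the JSON field names (assumed here to mirror the `build_gemeo` keyword arguments from the Quickstart) are assumptions, not a documented request schema.

```python
import json
import urllib.request

# Assumption: the service listens here; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def make_build_request(case_text, patient_info, context, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/gemeo/build.

    Assumption: the request body mirrors the build_gemeo keyword arguments.
    """
    payload = {
        "case_text": case_text,
        "patient_info": patient_info,
        "context": context,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/gemeo/build",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = make_build_request(
    case_text="Boy, 5 years old, progressive ataxia, telangiectasia, elevated AFP.",
    patient_info={"age": 5, "sex": "M"},
    context={"sus_region": "SP"},
)
# Sending is left commented out so the sketch runs without a live server:
# with urllib.request.urlopen(req) as resp:
#     twin = json.loads(resp.read())
```

The response shape is not specified in this README, so the sketch stops at building the request.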

## Architecture

Two-tier:

- **Bootstrap (today)**: wraps existing swarm-py modules and raras-app artifacts.
  Everything works on day one; no training is needed.
- **Phase-2 SOTA (training)**: `gemeo/train/` scaffolds for HGT, TxGNN,
  TGNN, NeuralSurv, CF-GNN. When checkpoints land in `gemeo/artifacts/`,
  the runtime auto-detects them and overrides the bootstrap paths.
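
The two-tier fallback can be pictured as a per-capability check for a checkpoint file. The following is a minimal sketch of that idea, not the module's actual loader: the function `resolve_backend`, the `CHECKPOINTS` mapping, and the checkpoint file names are all hypothetical.

```python
from pathlib import Path
import tempfile

# Assumption: one checkpoint file per capability; these names are hypothetical.
CHECKPOINTS = {
    "encoder": "hgt.pt",
    "risk": "neuralsurv.pt",
}


def resolve_backend(capability, artifacts_dir):
    """Return ("learned", path) if a checkpoint exists, else ("bootstrap", None)."""
    filename = CHECKPOINTS.get(capability)
    if filename is not None:
        path = Path(artifacts_dir) / filename
        if path.is_file():
            return "learned", path
    return "bootstrap", None


# Demo against a temporary directory standing in for gemeo/artifacts/.
with tempfile.TemporaryDirectory() as tmp:
    before = resolve_backend("encoder", tmp)
    (Path(tmp) / "hgt.pt").write_bytes(b"")  # simulate a checkpoint landing
    after = resolve_backend("encoder", tmp)
    after_kind = after[0]
```

Resolving per capability means a single trained model can replace its bootstrap path while the remaining capabilities keep working untrained.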

```
                  ┌─────────────────────────────┐
                  │    raras-app                │
                  │  data/graph-ml/*.npz        │ ← read-only via gemeo.bridge
                  │  Patient.embedding (Neo4j)  │
                  │  /grafo (force-graph)       │ ← consumes /api/gemeo/{id}/viz
                  └──────────────┬──────────────┘
                                 │
                  ┌──────────────▼──────────────┐
                  │     gemeo (this module)     │
                  │                             │
                  │  bridge.py   ── load .npz   │
                  │  encoder.py  ── HGT or boot │
                  │  cohort.py   ── kNN+graph   │
                  │  subgraph.py ── KG sparsify │
                  │  trajectory  ── TGNN or LLM │
                  │  risk.py     ── NeuralSurv  │
                  │  repurpose   ── TxGNN+SUS   │
                  │  whatif.py   ── CF-GNN      │
                  │  ask.py      ── info-gain   │
                  │  ground_sus  ── PCDT/UF     │
                  │  feedback    ── jsonl ledger│
                  │  viz.py      ── force-graph │
                  │  core.py     ── orchestrator│
                  │  api.py      ── FastAPI     │
                  └──────────────┬──────────────┘
                                 │
                  ┌──────────────▼──────────────┐
                  │   swarm-py existing infra   │
                  │  digital_twin_workflow      │
                  │  patient_space (KG)         │
                  │  trajectory_engine, risk_qua│
                  │  drug_repurposer, trial_    │
                  │  matcher, brazilian_context │
                  └─────────────────────────────┘
```

## What's bootstrap vs. learned

| Capability | Bootstrap (works today) | Phase-2 SOTA |
|---|---|---|
| **Patient embedding** | Weighted mean of fused-768/3072-dim disease+HPO+gene embeddings (matches raras-app) | HGT trained on PrimeKG with disease link-pred + patient contrastive losses |
| **Cohort** | Neo4j vector kNN + Cypher overlap | same retrieval, learned embedding |
| **Subgraph** | Cypher 1-hop sparsification | KG sparsification trained on diagnostic outcomes |
| **Trajectory** | LLM over disease natural history | TRANS-style TGNN over snapshot chains |
| **Risk / survival** | Rule-based severity → exponential survival | NeuralSurv Bayesian survival on KG-walk features |
| **Drug repurposing** | KG walks Disease→Gene→Drug | TxGNN fine-tuned on PrimeKG + SUS auxiliary head |
| **What-if** | Heuristic: mutate snapshot, re-run | CF-GNNExplainer + do-calculus |
| **Active learning** | Info-gain over KG annotation frequencies | Bayesian acquisition over learned dx posterior |
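
Two of the bootstrap rows lend themselves to a short illustration: the weighted-mean patient embedding and the rule-based exponential survival curve, S(t) = exp(-λt) with a constant hazard λ chosen by severity. The sketch below is written under stated assumptions: the function names, the severity labels, and the hazard values are hypothetical placeholders (not clinical values and not the module's real parameters), and plain Python lists stand in for the 768/3072-dim vectors.

```python
import math


def weighted_mean_embedding(vectors, weights):
    """Weighted mean of equal-length vectors (e.g. disease, HPO, gene embeddings)."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Assumption: illustrative monthly hazards per severity level, not clinical values.
HAZARD_PER_MONTH = {"mild": 0.002, "moderate": 0.01, "severe": 0.03}


def exponential_survival(severity, months):
    """Return (t, S(t)) pairs with S(t) = exp(-lambda * t) and constant hazard lambda."""
    lam = HAZARD_PER_MONTH[severity]
    return [(t, math.exp(-lam * t)) for t in months]


# Toy 2-dim example: the first vector carries three times the weight of the second.
patient_vec = weighted_mean_embedding(
    vectors=[[1.0, 0.0], [0.0, 1.0]],
    weights=[3.0, 1.0],
)
curve = exponential_survival("severe", months=[0, 12, 24])
```

A constant hazard gives a curve that starts at 1 and decays monotonically, which is the shape a learned survival model would later replace.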

## Citation

Timmers D, Kawassaki A. *Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS.* Raras, 2026.