# Gemeo
> **SOTA digital twin module for rare disease patients, grounded in the Brazilian SUS.**
> A learned, graph-native, continuously evolving twin that fuses a Heterogeneous
> Graph Transformer over PrimeKG with the country's public-health constraints.
```
gemeo = patient embedding
      + cohort retrieval    (patients-like-mine)
      + reasoning subgraph  (KG sparsification)
      + trajectory          (TGNN over snapshot chains)
      + risk + survival     (NeuralSurv)
      + drug repurposing    (TxGNN fine-tuned)
      + active learning     (info-gain on KG)
      + counterfactual      (what-if engine)
      + SUS grounding       (PCDT/CEAF/UF)
      + feedback loop
      + viz payload
```
## Installation
Gemeo ships as part of `rarasnet-swarm-py` and is auto-mounted in `main.py` at `/api/gemeo/*`, so the bootstrap tier needs no separate install.
Optional Phase-2 training:
```bash
pip install torch_geometric tqdm
python -m gemeo.train.primekg
python -m gemeo.train.hgt
```
## Quickstart
```python
import asyncio

from gemeo import build_gemeo, what_if


async def main():
    # build_gemeo is awaited, so it has to run inside a coroutine
    # (a bare top-level `await` only works in a notebook or async REPL).
    twin = await build_gemeo(
        # Portuguese: "Boy, 5 years old, progressive ataxia, telangiectasia, elevated AFP."
        case_text="Menino, 5 anos, ataxia progressiva, telangiectasia, AFP elevado.",
        patient_info={"age": 5, "sex": "M"},
        context={"sus_region": "SP"},
    )

    twin.diagnoses[:3]          # top hypotheses (ranked)
    twin.cohort.members[:5]     # patients-like-mine
    twin.subgraph               # reasoning subgraph
    twin.trajectory.horizons    # 6/12/24m predictions
    twin.risk.survival_curve    # months → P(alive)
    twin.drugs.candidates[:3]   # repurposing
    twin.next_questions[:3]     # active learning
    twin.sus_check.pcdt_url     # PCDT compliance
    twin.viz_data               # ready for react-force-graph
    return twin


asyncio.run(main())
```
## API endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | `/api/gemeo/build` | create twin from case |
| GET | `/api/gemeo/{id}` | full twin |
| POST | `/api/gemeo/{id}/evolve` | add new clinical data |
| POST | `/api/gemeo/{id}/whatif` | counterfactual |
| POST | `/api/gemeo/{id}/feedback` | record correction |
| GET | `/api/gemeo/{id}/{cohort,subgraph,trajectory,risk,drugs,trials,next-questions,sus,viz}` | per-capability getters |
| GET | `/api/gemeo/health` | bridge + feedback stats |
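
As a rough illustration of how a client might address these routes, the sketch below builds the URLs and a `build` request with the Python standard library only. The paths come from the table above; the base URL, the `capability_url` and `build_request` helper names, and the JSON field names (which simply mirror the `build_gemeo` keyword arguments from the Quickstart) are assumptions, since the request schema is not documented here.

```python
import json
from urllib.request import Request

API_PREFIX = "/api/gemeo"

# Per-capability getters listed in the endpoint table.
CAPABILITIES = (
    "cohort", "subgraph", "trajectory", "risk", "drugs",
    "trials", "next-questions", "sus", "viz",
)


def capability_url(base_url: str, twin_id: str, capability: str) -> str:
    """Return the GET URL for one per-capability getter."""
    if capability not in CAPABILITIES:
        raise ValueError(f"unknown capability: {capability!r}")
    return f"{base_url.rstrip('/')}{API_PREFIX}/{twin_id}/{capability}"


def build_request(base_url: str, case_text: str, patient_info: dict, context: dict) -> Request:
    """Prepare (but do not send) a POST to /api/gemeo/build.

    ASSUMPTION: the JSON body mirrors the build_gemeo keyword arguments;
    the real request schema may differ.
    """
    body = json.dumps(
        {"case_text": case_text, "patient_info": patient_info, "context": context}
    ).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}{API_PREFIX}/build",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running server, the prepared request could then be sent with `urllib.request.urlopen(req)`; nothing in the sketch touches the network on its own.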
## Architecture
Two-tier:
- **Bootstrap (today):** wraps existing swarm-py modules and raras-app artifacts.
  Everything works on day one; no training is needed.
- **Phase-2 SOTA (training):** `gemeo/train/` holds scaffolds for HGT, TxGNN,
  TGNN, NeuralSurv, and CF-GNN. When checkpoints land in `gemeo/artifacts/`,
  the runtime auto-detects them and overrides the bootstrap paths.
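
The auto-detect behaviour described above could look roughly like the following. This is a minimal sketch, not the module's actual loader: the checkpoint file names, the `resolve_backend` and `resolve_all` helpers, and the `"learned"` / `"bootstrap"` labels are all hypothetical. Only the idea of checking an artifacts directory such as `gemeo/artifacts/` is taken from the text.

```python
from pathlib import Path

# HYPOTHETICAL checkpoint file names, one per capability; the real names
# under gemeo/artifacts/ are not specified in this README.
CHECKPOINTS = {
    "encoder": "hgt.pt",
    "trajectory": "tgnn.pt",
    "risk": "neuralsurv.pt",
    "repurpose": "txgnn.pt",
    "whatif": "cfgnn.pt",
}


def resolve_backend(capability: str, artifacts_dir: Path) -> str:
    """Return "learned" if a checkpoint file exists for the capability, else "bootstrap"."""
    filename = CHECKPOINTS.get(capability)
    if filename is not None and (artifacts_dir / filename).is_file():
        return "learned"
    return "bootstrap"


def resolve_all(artifacts_dir: Path) -> dict:
    """Map every capability in CHECKPOINTS to the backend that would be used."""
    return {name: resolve_backend(name, artifacts_dir) for name in CHECKPOINTS}
```

Calling `resolve_all(Path("gemeo/artifacts"))` at startup would give one place to log which tier each capability is running on. Capabilities without an entry in the table always fall back to bootstrap.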
```
┌──────────────────────────────┐
│ raras-app                    │
│ data/graph-ml/*.npz          │ ← read-only via gemeo.bridge
│ Patient.embedding (Neo4j)    │
│ /grafo (force-graph)         │ ← consumes /api/gemeo/{id}/viz
└──────────────┬───────────────┘
               │
┌──────────────▼───────────────┐
│ gemeo (this module)          │
│                              │
│ bridge.py   ── load .npz     │
│ encoder.py  ── HGT or boot   │
│ cohort.py   ── kNN+graph     │
│ subgraph.py ── KG sparsify   │
│ trajectory  ── TGNN or LLM   │
│ risk.py     ── NeuralSurv    │
│ repurpose   ── TxGNN+SUS     │
│ whatif.py   ── CF-GNN        │
│ ask.py      ── info-gain     │
│ ground_sus  ── PCDT/UF       │
│ feedback    ── jsonl ledger  │
│ viz.py      ── force-graph   │
│ core.py     ── orchestrator  │
│ api.py      ── FastAPI       │
└──────────────┬───────────────┘
               │
┌──────────────▼───────────────┐
│ swarm-py existing infra      │
│ digital_twin_workflow        │
│ patient_space (KG)           │
│ trajectory_engine, risk_qua  │
│ drug_repurposer, trial_      │
│ matcher, brazilian_context   │
└──────────────────────────────┘
```
## What's bootstrap vs. learned
| Capability | Bootstrap (works today) | Phase-2 SOTA |
|---|---|---|
| **Patient embedding** | Weighted mean of fused-768/3072-dim disease+HPO+gene embeddings (matches raras-app) | HGT trained on PrimeKG with disease link-pred + patient contrastive losses |
| **Cohort** | Neo4j vector kNN + Cypher overlap | same retrieval, learned embedding |
| **Subgraph** | Cypher 1-hop sparsification | KG sparsification trained on diagnostic outcomes |
| **Trajectory** | LLM over disease natural history | TRANS-style TGNN over snapshot chains |
| **Risk / survival** | Rule-based severity → exponential survival | NeuralSurv Bayesian survival on KG-walk features |
| **Drug repurposing** | KG walks Disease→Gene→Drug | TxGNN fine-tuned on PrimeKG + SUS auxiliary head |
| **What-if** | Heuristic: mutate snapshot, re-run | CF-GNNExplainer + do-calculus |
| **Active learning** | Info-gain over KG annotation frequencies | Bayesian acquisition over learned dx posterior |
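
Two of the bootstrap rows are simple enough to sketch in a few lines. The code below is one illustrative reading of "weighted mean of embeddings" and "rule-based severity → exponential survival", not the module's implementation: the function names, the example weights, and the severity-to-hazard table are assumptions made up for the example, and the hazard values are placeholders with no clinical meaning.

```python
import math


def weighted_mean_embedding(vectors, weights):
    """Weighted mean of equal-length vectors (e.g. disease, HPO, gene embeddings)."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# ASSUMPTION: illustrative monthly hazard rates per severity label;
# placeholders only, not clinical values.
MONTHLY_HAZARD = {"mild": 0.001, "moderate": 0.005, "severe": 0.02}


def exponential_survival_curve(severity, months):
    """Return {month: P(alive)} under a constant hazard, S(t) = exp(-hazard * t)."""
    hazard = MONTHLY_HAZARD[severity]
    return {t: math.exp(-hazard * t) for t in months}
```

For instance, `exponential_survival_curve("severe", [6, 12, 24])` yields a months → P(alive) mapping of the same shape as the `twin.risk.survival_curve` comment in the Quickstart suggests. A constant hazard is the simplest rule-based choice, and it is what the Phase-2 NeuralSurv model is meant to replace.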
## Citation
Timmers D, Kawassaki A. *Gemeo: Heterogeneous graph foundation model for rare disease digital twins grounded in Brazilian SUS.* Raras, 2026.