p2, p3
- docs/p2_p3/00-OVERVIEW.md +324 -0
- docs/p2_p3/00-OVERVIEW_p3.md +288 -0
- docs/p2_p3/CAPABILITY_CONTRACT_v2.md +899 -0
- docs/p2_p3/CAPABILITY_CONTRACT_v3.md +651 -0
- docs/p2_p3/IMPLEMENTATION_REFERENCE.md +288 -0
- docs/p2_p3/IMPLEMENTATION_REFERENCE_p3.md +493 -0
- docs/p2_p3/M14-federation.md +434 -0
- docs/p2_p3/M15-relay-tier.md +389 -0
- docs/p2_p3/M16-tokens.md +391 -0
- docs/p2_p3/M17-ocr.md +305 -0
- docs/p2_p3/M18-translation.md +225 -0
- docs/p2_p3/M19-stt-tts.md +331 -0
- docs/p2_p3/M20-vision.md +369 -0
- docs/p2_p3/M21-tool-calls.md +317 -0
- docs/p2_p3/M22-mobile-native.md +325 -0
- docs/p2_p3/M23-e2e-encryption.md +474 -0
- docs/p2_p3/M24-rerank.md +227 -0
- docs/p2_p3/M25-group-chat.md +293 -0
- docs/p2_p3/M26-distributed-inference.md +325 -0
- docs/p2_p3/M27-moe-routing.md +378 -0
- docs/p2_p3/M28-fedlearn.md +348 -0
- docs/p2_p3/M29-lora-beacons.md +322 -0
- docs/p2_p3/M30-evidence-ebkh.md +384 -0
- docs/p2_p3/M31-civil-defense.md +410 -0
- docs/p2_p3/M32-protocol-standard.md +323 -0
- docs/p2_p3/X05-dht.md +374 -0
- docs/p2_p3/X06-websocket.md +316 -0
- docs/p2_p3/X07-federated-metrics.md +330 -0
- docs/p2_p3/X08-tensor-transport.md +302 -0
- docs/p2_p3/X09-conformance-suite.md +316 -0
docs/p2_p3/00-OVERVIEW.md
ADDED
|
@@ -0,0 +1,324 @@
# HearthNet Phase 2: Spec Set Overview

**Phase 2 scope:** post-hackathon, 1-3 months of work. Hardens the MVP into something other communities can adopt.

**Stance toward Phase 1:** strictly additive. Phase 1 specs are immutable from Phase 2's view. New modules plug into the bus; nothing in Phase 1 needs rewriting. Where a Phase 1 module is "extended", it is *extended*: same public API, new capabilities or backends added behind the existing facade.

---

## 0. What changes vs Phase 1

| Concern | Phase 1 (MVP) | Phase 2 |
|---------|---------------|---------|
| Discovery | mDNS + UDP (LAN only) | + DHT (cross-LAN), + relay-assisted NAT traversal |
| Transport | HTTP/1.1 + SSE long-poll pubsub | + WebSocket upgrade for bidirectional + pubsub |
| Trust | Per-request signing | + Capability tokens (delegation, federation) |
| Cross-community | Out of scope | Federation: signed peering, scoped capability access |
| Encryption | TLS-in-transit + signed-at-rest within community | + E2E (X25519 + ChaCha20-Poly1305) for chat & optionally files |
| Chat | 1:1 only | + Group chat (`chat.thread.*`), + store-and-forward via anchors |
| LLM | Text only | + Vision (`llm.chat` with image content), + Tool calls (`tool_call_delta`) |
| RAG | Digital PDFs | + OCR for scanned PDFs and images, + reranking |
| Services | LLM, embed, RAG, file, market, chat | + OCR, Translation, STT, TTS, Image generation, Rerank |
| Mobile | Web view served by anchor | + Native client (Flutter/RN) with push via relay tier |
| Files | Direct fetch on demand | + Resumable PUT, + Background replication, + At-rest encryption |
| Relay | None | + Hosted relay tier for NAT traversal, federation discovery, push |
| Observability | Local Prometheus + ring buffer + optional Trackio | + OTLP export, + Federated metrics aggregation |

---

## 1. Module map (Phase 2 additions)

### New numbered modules

| ID  | Module              | Spec file                       | Concern |
|-----|---------------------|---------------------------------|---------|
| M14 | Federation          | `modules/M14-federation.md`     | Cross-community trust, federation manifests, scoped access |
| M15 | Relay Tier          | `modules/M15-relay-tier.md`     | Hosted HTTPS relay (NAT traversal, federation discovery, mobile push) |
| M16 | Capability Tokens   | `modules/M16-tokens.md`         | Short-lived delegation tokens (OAuth-flavoured) |
| M17 | OCR Service         | `modules/M17-ocr.md`            | `ocr.image`, `ocr.pdf`: Tesseract / TrOCR / multilingual |
| M18 | Translation Service | `modules/M18-translation.md`    | `trans.text`: NLLB-backed, between DE, EN, and Plattdeutsch |
| M19 | Speech I/O          | `modules/M19-stt-tts.md`        | `stt.transcribe` (Whisper), `tts.synthesize` (XTTS/Edge) |
| M20 | Vision Services     | `modules/M20-vision.md`         | `img.describe`, `img.generate`, multimodal LLM input |
| M21 | Tool Calls          | `modules/M21-tool-calls.md`     | `tool_call_delta` frames, OpenAI/Anthropic-compatible |
| M22 | Mobile Native       | `modules/M22-mobile-native.md`  | Flutter/RN client with push |
| M23 | E2E Encryption      | `modules/M23-e2e-encryption.md` | X25519 + ChaCha20-Poly1305 for chat (and optional files) |
| M24 | Reranking           | `modules/M24-rerank.md`         | `rerank.text`: BGE-reranker, used by RAG and search |
| M25 | Group Chat          | `modules/M25-group-chat.md`     | `chat.thread.*`: multi-party conversations |

### New cross-cutting modules

| ID  | Module            | Spec file                                | Concern |
|-----|-------------------|------------------------------------------|---------|
| X05 | DHT               | `cross-cutting/X05-dht.md`               | Kademlia-style cross-LAN peer + content discovery |
| X06 | WebSocket         | `cross-cutting/X06-websocket.md`         | Bidirectional upgrade for `/bus/v1/call` and `/pubsub` |
| X07 | Federated Metrics | `cross-cutting/X07-federated-metrics.md` | Optional OTLP export + per-community aggregation |

### Modifications to Phase 1 modules

These do not get new spec files; the Phase 1 specs are *extended* in place at the next major version. The Phase 1 IMPLEMENTATION_REFERENCE will gain entries, which are flagged in [`IMPLEMENTATION_REFERENCE.md`](IMPLEMENTATION_REFERENCE.md) §0.

| Phase 1 module | Extension |
|----------------|-----------|
| M04 LLM | New backends gain multimodal + tools support; descriptors carry `modalities`, `tools_supported` flags |
| M05 RAG | Auto-reindex on embedding model change; hybrid (keyword+dense) search; `rerank.text` integration |
| M07 File/Blobs | Resumable PUT (server-side partial-transfer index); background replication; at-rest encryption envelope |
| M10 Chat | Calls into M23 for encryption; calls into M25 for group threads |
| M02 Discovery | Calls into X05 DHT when peers not found via mDNS/UDP |
| X01 Transport | Calls into X06 WebSocket on `Upgrade: websocket` header |
| M09 Emergency | Phase-2 captive-portal probe |

---

## 2. Dependency graph (Phase 2 additions on top of Phase 1)

```
┌─────────────────────────────────────────┐
│  Phase 1 (unchanged)                    │
│  X04  X03  X02  X01  M01..M13           │
└────┬─────────┬─────────┬───────────┬────┘
     │         │         │           │
     ▼         ▼         ▼           ▼
┌─────────┐ ┌─────┐ ┌─────────┐ ┌─────────┐
│   X05   │ │ X06 │ │   X07   │ │   M16   │
│   DHT   │ │ WS  │ │  Fed-M  │ │ Tokens  │
└────┬────┘ └──┬──┘ └─────────┘ └────┬────┘
     │         │                     │
     └────┬────┘                     │
          ▼                          ▼
      ┌────────┐                 ┌────────┐
      │  M14   │◄────────────────┤  M15   │
      │Federat.│                 │ Relay  │
      └───┬────┘                 └────────┘
          │
          ▼
      ┌────────┐
      │  M22   │
      │ Mobile │
      └────────┘

┌─── Independent services (each plug into the bus) ───┐
│  M17 OCR    M18 Trans    M19 STT/TTS    M20 Vision  │
│  M21 Tools (extends M04)    M24 Rerank              │
└─────────────────────────────────────────────────────┘

┌─── Chat extensions ───┐
│  M23 E2E              │
│  M25 Group            │
└───────────────────────┘
```

Hard rules carried over from Phase 1:
- No service imports another service. Talk via the bus.
- No layer below the bus imports anything above it.

---

## 3. File tree additions

```
hearthnet/
├── federation/                  # M14
│   ├── __init__.py
│   ├── manifest.py
│   ├── peering.py
│   └── relay_client.py
│
├── relay/                       # M15 (separate deployable, lives in same repo)
│   ├── __init__.py
│   ├── server.py
│   ├── nat_traversal.py
│   ├── push.py
│   └── tier.py
│
├── identity/
│   └── tokens.py                # M16 (now real, not stub)
│
├── dht/                         # X05
│   ├── __init__.py
│   ├── kademlia.py
│   ├── routing.py
│   └── storage.py
│
├── transport/
│   └── websocket.py             # X06
│
├── observability/
│   └── federated.py             # X07
│
├── crypto/                      # M23 (new top-level)
│   ├── __init__.py
│   ├── ratchet.py
│   ├── kem.py
│   └── envelope.py
│
└── services/
    ├── ocr/                     # M17
    │   ├── __init__.py
    │   ├── service.py
    │   └── backends/
    │       ├── tesseract.py
    │       ├── trocr.py
    │       └── multilingual.py
    ├── translation/             # M18
    │   ├── __init__.py
    │   ├── service.py
    │   └── backends/
    │       ├── nllb.py
    │       └── plattdeutsch.py
    ├── speech/                  # M19
    │   ├── __init__.py
    │   ├── stt_service.py
    │   ├── tts_service.py
    │   └── backends/
    │       ├── whisper.py
    │       ├── xtts.py
    │       └── edge_tts.py
    ├── image/                   # M20
    │   ├── __init__.py
    │   ├── describe_service.py
    │   ├── generate_service.py
    │   └── backends/
    │       ├── florence2.py
    │       ├── minicpm_v.py
    │       └── flux.py
    ├── rerank/                  # M24
    │   ├── __init__.py
    │   ├── service.py
    │   └── backends/
    │       └── bge_reranker.py
    ├── llm/
    │   └── tools.py             # M21 (extends M04)
    ├── chat/
    │   ├── encryption.py        # M23 hook
    │   ├── thread_service.py    # M25
    │   └── thread_views.py      # M25
    └── file/
        ├── resume.py            # P7 extension
        └── replication.py       # P7 extension

mobile-native/                   # M22: separate codebase, Flutter project
└── (lives in /mobile-native, not in the Python package)
```

---

## 4. Canonical conventions (delta from Phase 1)

### 4.1 New type aliases

```python
# additions to hearthnet/types.py
from typing import Literal  # only needed if types.py does not already import it

TokenID = str            # ULID
ThreadID = str           # ULID
FederationID = str       # composite: "<community_a>:<community_b>"
TensorChunkID = str      # blake3:<hex>, used in M26/X08 (Phase 3) only
PushDeviceID = str       # opaque, assigned by relay tier
RatchetEpoch = int       # per-thread monotonic
EncryptedPayload = bytes # ciphertext, base64 in JSON

# Extended Literal types:
TrustLevel = Literal["unknown", "member", "trusted", "anchor", "federated"]
Stability = Literal["experimental", "beta", "stable", "deprecated"]
```

### 4.2 New constants

```python
# additions to hearthnet/constants.py

TOKEN_DEFAULT_TTL_SECONDS = 3600
TOKEN_MAX_TTL_SECONDS = 86400
FEDERATION_MANIFEST_TTL_SECONDS = 86400
FEDERATION_HEARTBEAT_SECONDS = 300
DHT_REPLICATION_K = 8                    # bucket size
DHT_ALPHA = 3                            # concurrent lookups
DHT_REFRESH_SECONDS = 3600
DHT_REPUBLISH_SECONDS = 86400
WEBSOCKET_PING_SECONDS = 30
WEBSOCKET_IDLE_CLOSE_SECONDS = 120
RELAY_REGISTRATION_TTL_SECONDS = 7200
RELAY_PUSH_RETRY_MAX = 5
E2E_RATCHET_MAX_OUT_OF_ORDER = 32
E2E_RATCHET_REKEY_AFTER_MESSAGES = 100
E2E_PREKEY_BUNDLE_SIZE = 20
OCR_DEFAULT_DPI = 300
OCR_MAX_PAGES_PER_REQUEST = 50
TRANSLATION_MAX_CHARS = 4000
STT_MAX_AUDIO_SECONDS = 300
TTS_MAX_TEXT_CHARS = 5000
RERANK_MAX_DOCS = 100
FILE_REPLICATION_DESIRED_COPIES = 3
FILE_RESUME_PARTIAL_TTL_SECONDS = 3600
```
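
To make the DHT constants concrete, here is a minimal sketch of how a Kademlia-style lookup could use them: `DHT_REPLICATION_K` bounds how many closest peers are kept, and `DHT_ALPHA` bounds how many are queried per round. The helper names (`node_id`, `xor_distance`, `k_closest`, `next_queries`) and the SHA-256-derived integer IDs are assumptions made for illustration; X05 defines the actual ID scheme and lookup procedure.

```python
import hashlib

DHT_REPLICATION_K = 8  # bucket size, from hearthnet/constants.py
DHT_ALPHA = 3          # concurrent lookups, from hearthnet/constants.py


def node_id(name: str) -> int:
    """Hypothetical ID derivation: SHA-256 of a name, read as a big integer."""
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance metric: bitwise XOR of two IDs."""
    return a ^ b


def k_closest(target: int, known: list, k: int = DHT_REPLICATION_K) -> list:
    """Return up to k known IDs, nearest to the target first."""
    return sorted(known, key=lambda peer: xor_distance(peer, target))[:k]


def next_queries(target: int, candidates: list, alpha: int = DHT_ALPHA) -> list:
    """Pick the alpha nearest candidates to query in the next lookup round."""
    return k_closest(target, candidates, k=alpha)
```

A lookup would repeat `next_queries` on the merged candidate set until the `k_closest` result stops changing.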

### 4.3 Capability namespace allocations (Phase 2 promotes from reserved)

| Prefix | Status in Phase 1 | Status in Phase 2 |
|--------|-------------------|-------------------|
| `federation.*` | beta (reserved) | stable |
| `ocr.*` | reserved | stable |
| `trans.*` | reserved | stable |
| `stt.*` `tts.*` | reserved | stable |
| `img.*` | reserved | stable |
| `rerank.*` | (new) | stable |
| `chat.thread.*` | reserved | stable |
| `chat.forward.*` | reserved | stable |
| `file.put.resume@1.0` | (new) | stable |
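
As a rough illustration of how the allocation table could be consulted, the sketch below maps a capability identifier to its Phase 2 status. It assumes identifiers have the form `<name>@<version>`, that a trailing `.*` means "any name under this prefix", and that an entry without `.*` must match the name exactly. The function names and the matching semantics are hypothetical; the Capability Contract is the authority on both.

```python
from typing import Optional, Tuple

# Phase 2 status per allocation, copied from the table above.
PHASE2_NAMESPACES = {
    "federation.*": "stable",
    "ocr.*": "stable",
    "trans.*": "stable",
    "stt.*": "stable",
    "tts.*": "stable",
    "img.*": "stable",
    "rerank.*": "stable",
    "chat.thread.*": "stable",
    "chat.forward.*": "stable",
    "file.put.resume@1.0": "stable",
}


def parse_capability(cap: str) -> Tuple[str, Optional[str]]:
    """Split "<name>@<version>" into (name, version); version may be absent."""
    name, sep, version = cap.partition("@")
    return name, (version if sep else None)


def namespace_status(cap: str) -> Optional[str]:
    """Return the Phase 2 status for a capability, or None if not allocated here."""
    name, _ = parse_capability(cap)
    for pattern, status in PHASE2_NAMESPACES.items():
        pattern_name, _ = parse_capability(pattern)
        if pattern_name.endswith(".*"):
            # "ocr.*" matches any name that starts with "ocr."
            if name.startswith(pattern_name[:-1]):
                return status
        elif name == pattern_name:
            return status
    return None
```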

---

## 5. Build order (Phase 2)

| Step  | Modules / extensions | What you can demo |
|-------|----------------------|-------------------|
| P2-1  | M16 Tokens | Delegate one capability call via a token |
| P2-2  | X05 DHT (basic) | Two LANs find each other through a public DHT |
| P2-3  | M14 Federation | Two communities cross-sign, query each other |
| P2-4  | M15 Relay Tier | NAT'd peers reach each other via your relay |
| P2-5  | X06 WebSocket | Lower-latency pubsub |
| P2-6  | M24 Rerank | RAG queries get better answers |
| P2-7  | M17 OCR + M05 RAG hook | Scanned PDFs become searchable |
| P2-8  | M18 Translation | DE/EN translation of a marketplace post |
| P2-9  | M19 STT/TTS | "Sprich mit HearthNet" voice query |
| P2-10 | M21 Tool calls + M04 ext | LLM can call `rag.query` as a tool |
| P2-11 | M20 Vision | "Was siehst du auf diesem Bild?" |
| P2-12 | M23 E2E + M10 ext | Chat is now end-to-end encrypted |
| P2-13 | M25 Group chat | Three-way conversation |
| P2-14 | M07 ext (resume, replication, encrypt) | Bigger files, more resilient |
| P2-15 | M22 Mobile native | iOS / Android app on a real phone |
| P2-16 | X07 Federated metrics + observability polish | Real dashboards for operators |

Each step is independently demoable. None requires changes to Phase 1; they all attach via the bus.

---

## 6. Spec versioning

- Capability Contract bumps to **v2.0** (additive within Phase 2; major bump only on breaking changes).
- Contract version in node manifests becomes `"2.0"`; peers running Phase 1 see `contract_version=2.0` and reject the manifest with `schema_mismatch` unless they have a compatibility shim.
- **Compatibility shim:** a Phase 2 node may negotiate down by serving `/manifest?contract_version=1.0` to Phase 1 peers. Optional. Phase 2 SHOULD include the shim for one minor release window.
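
The negotiation described above can be sketched as follows. This is a toy model under stated assumptions: the shim is served by the Phase 2 node, a peer accepts a manifest only if its `contract_version` is in the set of versions that peer understands, and the function names (`serve_manifest`, `check_manifest`) and the error payload shape are hypothetical rather than taken from the contract.

```python
from typing import Optional

NODE_CONTRACT_VERSION = "2.0"  # what a Phase 2 node speaks natively
SHIM_VERSION = "1.0"           # what the optional compatibility shim can serve


def serve_manifest(requested: Optional[str] = None, shim_enabled: bool = True) -> dict:
    """Model GET /manifest[?contract_version=<requested>] on a Phase 2 node."""
    if requested in (None, NODE_CONTRACT_VERSION):
        return {"contract_version": NODE_CONTRACT_VERSION}
    if requested == SHIM_VERSION and shim_enabled:
        # Negotiate down: answer in the older contract version.
        return {"contract_version": SHIM_VERSION}
    return {"error": "schema_mismatch"}


def check_manifest(manifest: dict, accepted_versions: set) -> str:
    """Model a peer deciding whether it can use a received manifest."""
    if manifest.get("contract_version") in accepted_versions:
        return "ok"
    return "schema_mismatch"
```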

---

## 7. What is intentionally NOT in Phase 2

Pushed to Phase 3 (see [`../phase-3/00-OVERVIEW.md`](../phase-3/00-OVERVIEW.md)):

- Distributed-tensor inference (Petals-style)
- MoE expert routing
- Federated learning on LoRA layers
- LoRa long-distance beacons
- EBKH evidence layer integration
- Civil-defence pilot
- Protocol-standardisation work
- Conformance test suite for multi-implementation interop

---

## 8. Out-of-band documents (Phase 2)

- **THREAT_MODEL_v2.md**: formal security write-up for federation + E2E + tokens
- **RELAY_OPERATIONS.md**: for whoever runs `relay.hearthnet.de` (likely Christof on Hetzner)
- **MOBILE_BUILD.md**: Flutter build, code-signing, store-submission notes
- **MIGRATION_v1_to_v2.md**: for existing Phase-1 communities upgrading

docs/p2_p3/00-OVERVIEW_p3.md
ADDED
|
@@ -0,0 +1,288 @@
# HearthNet Phase 3: Spec Set Overview

**Phase 3 scope:** research-shaped, 6-12 months. This is where HearthNet stops being a product and starts being a protocol. Each module here is an investment in a long-term capability where the engineering is the easy part; the hard part is establishing trust, governance, and standards.

**Stance:** Phase 3 specs are **roadmaps**, not contracts. Where a Phase-1/2 spec answers "what does this *do*?", a Phase-3 spec answers "what would we *build* if we were ready to commit?". Concrete enough to start, loose enough to be wrong about details without invalidating the direction.

---

## 0. Reading these specs

Phase 3 specs deviate from the Phase 1 / 2 template in three respects:

1. **Stability tag is `experimental` for new capabilities** unless explicitly promoted later. Mesh nodes ignore experimental capabilities unless the operator opts in via `policy.research.enable = true`.
2. **Each module carries an "Open research questions" section** that is longer than the spec itself, by design. Phase 3 modules answer *some* of their open questions before shipping; the rest stay open.
3. **Acceptance criteria are described, not enumerated**. The point isn't to grade an implementation against a checklist; it's to say "we'll know this is working when..."

If you read a Phase 3 spec and feel uncertain about how something works, that uncertainty is faithful to the state of the work. The spec is doing its job by being honest about that.

---

## 1. Module map (Phase 3)

### New numbered modules

| ID  | Module                     | Spec file                              | Concern |
|-----|----------------------------|----------------------------------------|---------|
| M26 | Distributed Inference      | `modules/M26-distributed-inference.md` | Layer-sharded LLMs across nodes (Petals-style), small models only |
| M27 | MoE Expert Routing         | `modules/M27-moe-routing.md`           | Route queries to the right expert (machine or human) via learned scorer |
| M28 | Federated Learning         | `modules/M28-fedlearn.md`              | FedAvg on LoRA layers; per-community fine-tuning without sharing data |
| M29 | LoRa Long-Distance Beacons | `modules/M29-lora-beacons.md`          | 868 MHz "community alive" beacons; no AI traffic; emergency-only |
| M30 | Evidence / EBKH            | `modules/M30-evidence-ebkh.md`         | Claim graph alongside the event log; provenance + verifiability |
| M31 | Civil Defence Pilot        | `modules/M31-civil-defense.md`         | THW / DRK / KatS bridge; compliance profile; audit trail |
| M32 | Protocol Standardisation   | `modules/M32-protocol-standard.md`     | Reference implementation, conformance suite, governance for the spec |

### New cross-cutting modules

| ID  | Module            | Spec file                                | Concern |
|-----|-------------------|------------------------------------------|---------|
| X08 | Tensor Transport  | `cross-cutting/X08-tensor-transport.md`  | High-throughput chunked tensor passing for M26 |
| X09 | Conformance Suite | `cross-cutting/X09-conformance-suite.md` | Black-box tests defining what "HearthNet-compliant" means |

### Modifications to earlier modules

| Phase 1/2 module | Phase 3 extension |
|------------------|-------------------|
| M03 Bus | Optional MoE routing layer between dispatcher and handler (M27) |
| M04 LLM | Optional `experimental.distributed_llm.chat@1.0` backend (M26) |
| X02 Event log | Optional `evidence.*` claim records side-by-side with events (M30) |
| M14 Federation | Federated learning rounds use federation as the trust substrate (M28) |
| X03 Observability | Per-call expert-routing trace; per-shard tensor-transport metrics (M27, X08) |

---

## 2. Dependency graph (Phase 3 additions on top of Phases 1-2)

```
┌────────────────────────────────────────┐
│  Phase 1 + Phase 2 (unchanged)         │
└─────┬──────────────┬──────────────┬────┘
      │              │              │
      ▼              ▼              ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│   X08    │   │   M27    │   │   M30    │
│  Tensor  │   │   MoE    │   │   EBKH   │
│ Transp.  │   │ Routing  │   │ Evidence │
└─────┬────┘   └─────┬────┘   └─────┬────┘
      ▼              │              │
┌──────────┐         │              │
│   M26    │         │              │
│ Distrib. │         │              │
│  Infer.  │         │              │
└──────────┘         │              │
                     ▼              ▼
               ┌──────────┐   ┌──────────┐
               │   M28    │   │   M31    │
               │ FedLearn │   │ CivDef.  │
               └──────────┘   └──────────┘

Standalone (no software deps, governance / hardware):
┌──────────┐
│   M29    │  (hardware)
│   LoRa   │
│ Beacons  │
└──────────┘
┌──────────┐
│   X09    │  (process)
│ Conform. │
└──────────┘
┌──────────┐
│   M32    │  (governance)
│ Standard │
└──────────┘
```

Most Phase 3 modules are independent of each other. The exceptions:
- M26 depends on X08
- M27 informs M26 (MoE routing picks which expert/shard)
- M28 reuses M14 federation for cross-community rounds
- M31 reuses M30 for evidence-grade emergency claims

---

## 3. File tree additions

```
hearthnet/
├── distributed_inference/       # M26
│   ├── __init__.py
│   ├── shard.py
│   ├── pipeline.py
│   ├── routing.py
│   └── backends/
│       ├── petals_like.py
│       └── small_model_layered.py
│
├── moe/                         # M27
│   ├── __init__.py
│   ├── router.py
│   ├── scorer.py
│   └── human_in_the_loop.py
│
├── fedlearn/                    # M28
│   ├── __init__.py
│   ├── coordinator.py
│   ├── round.py
│   ├── lora_diff.py
│   └── aggregation.py
│
├── lora_beacons/                # M29: hardware integration; tiny Python surface
│   ├── __init__.py
│   ├── beacon_bridge.py         # serial protocol to a LoRa USB stick
│   └── policy.py
│
├── evidence/                    # M30
│   ├── __init__.py
│   ├── claim.py
│   ├── claim_graph.py
│   ├── provenance.py
│   └── ebkh_bridge.py           # bridge to Christof's EBKH v3+
│
├── civil_defense/               # M31
│   ├── __init__.py
│   ├── profile.py               # THW / DRK / KatS member types
│   ├── audit.py
│   └── nrw_katastrophenschutz.py
│
├── transport/
│   └── tensor.py                # X08
│
└── conformance/                 # X09
    ├── __init__.py
    ├── runner.py
    ├── suites/
    │   ├── identity.py
    │   ├── transport.py
    │   ├── bus.py
    │   ├── services.py
    │   └── federation.py
    └── report.py

protocol/                        # M32: separate top-level dir at repo root
├── README.md
├── spec/                        # the protocol spec, decoupled from the impl
│   ├── 00-overview.md           # mirror of CAPABILITY_CONTRACT but
│   ├── 01-identity.md           # implementation-agnostic
│   └── ...
└── governance/
    ├── CHANGELOG.md
    ├── CONTRIBUTING.md
    └── ROADMAP.md
```

---

## 4. Conventions delta from Phase 2

### 4.1 New `experimental` namespace

A Phase-3 capability MAY be advertised as `experimental.<name>@<ver>`. Mesh nodes default to **not registering** experimental capabilities; the operator must opt in via:

```toml
[policy.research]
enable = true
enabled_capabilities = ["experimental.distributed_llm.chat@1.0", "experimental.fedlearn.round.*"]
```

Once a capability is sufficiently proven, it is promoted out of the `experimental.` prefix in a contract bump.
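
The opt-in rule can be sketched as a small registration check. This is an illustration under assumptions, not the node's actual logic: it treats `enabled_capabilities` entries as shell-style globs (so `experimental.fedlearn.round.*` matches anything under that prefix), requires both `enable = true` and a matching entry, and takes the `[policy.research]` table as an already-parsed dict. The function name `should_register` and the capability names in the usage lines that are not in the config above are hypothetical.

```python
from fnmatch import fnmatchcase


def should_register(capability: str, research_policy: dict) -> bool:
    """Decide whether a node registers a capability under [policy.research]."""
    if not capability.startswith("experimental."):
        # Non-experimental capabilities are not gated by this policy.
        return True
    if not research_policy.get("enable", False):
        # Default: experimental capabilities are not registered.
        return False
    patterns = research_policy.get("enabled_capabilities", [])
    return any(fnmatchcase(capability, pattern) for pattern in patterns)


policy = {
    "enable": True,
    "enabled_capabilities": [
        "experimental.distributed_llm.chat@1.0",
        "experimental.fedlearn.round.*",
    ],
}
print(should_register("experimental.distributed_llm.chat@1.0", policy))
print(should_register("experimental.moe.route@1.0", policy))
```

With the policy shown, the first call is allowed by an exact entry and the second is rejected because no entry matches it.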

### 4.2 New type aliases

```python
# additions to hearthnet/types.py

ShardID = str       # "<model_id>:<layer_range>"
ExpertID = str      # opaque, refers to a routable subsystem
ClaimID = str       # ULID
RoundID = str       # fedlearn round identifier (ULID)
LoraBeaconID = str  # 8-byte hex, hardware-issued
EvidenceLevel = Literal["unverified", "cited", "cross_referenced", "attested", "disputed"]
ExpertKind = Literal["model", "human", "service", "external"]
```

### 4.3 New constants

```python
# additions to hearthnet/constants.py (Phase 3)

# Distributed inference (M26)
DISTRIBUTED_MAX_SHARDS_PER_REQUEST = 16
DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S = 30
DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES = 2

# MoE routing (M27)
MOE_ROUTER_TOP_K = 3
MOE_ROUTER_TRAIN_MIN_EXAMPLES = 200
MOE_ROUTER_RETRAIN_EVERY_HOURS = 24

# Federated learning (M28)
FEDLEARN_MAX_ROUND_MINUTES = 120
FEDLEARN_MIN_PARTICIPANTS = 3
FEDLEARN_MAX_LORA_RANK = 64
FEDLEARN_GRAD_CLIP = 1.0
FEDLEARN_DP_NOISE_SCALE_DEFAULT = 0.0       # 0.0 = differential-privacy noise off (the default)

# Evidence (M30)
EVIDENCE_CLAIM_TTL_DAYS_DEFAULT = 365
EVIDENCE_MAX_PROVENANCE_DEPTH = 16

# Civil defence (M31)
CIVDEF_AUDIT_RETENTION_YEARS = 10
CIVDEF_HEARTBEAT_SECONDS = 60

# Tensor transport (X08)
TENSOR_CHUNK_BYTES = 1_048_576              # 1 MiB
TENSOR_FLOW_CONTROL_WINDOW = 16             # chunks
TENSOR_COMPRESSION_THRESHOLD_BYTES = 65_536

# LoRa beacons (M29)
LORA_BEACON_PERIOD_SECONDS_DEFAULT = 600    # 10 minutes
LORA_BEACON_MAX_PAYLOAD_BYTES = 32
```
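
As a worked example of the tensor-transport constants, the sketch below computes how a tensor of a given size would be split into chunks and sender windows. It assumes that the flow-control window caps the number of unacknowledged chunks in flight and that compression applies to payloads at or above the threshold; both readings, and the `plan_transfer` helper itself, are illustrative assumptions, since X08 defines the real behaviour.

```python
TENSOR_CHUNK_BYTES = 1_048_576              # 1 MiB, from hearthnet/constants.py
TENSOR_FLOW_CONTROL_WINDOW = 16             # chunks
TENSOR_COMPRESSION_THRESHOLD_BYTES = 65_536


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def plan_transfer(tensor_bytes: int) -> dict:
    """Estimate chunk count, window count and in-flight bytes for one tensor."""
    chunks = ceil_div(tensor_bytes, TENSOR_CHUNK_BYTES)
    window_bytes = TENSOR_FLOW_CONTROL_WINDOW * TENSOR_CHUNK_BYTES
    return {
        "chunks": chunks,
        # Number of full-or-partial windows the sender works through.
        "windows": ceil_div(chunks, TENSOR_FLOW_CONTROL_WINDOW),
        # Upper bound on unacknowledged payload bytes at any moment.
        "max_in_flight_bytes": min(tensor_bytes, window_bytes),
        # Assumption: compress when the payload reaches the threshold.
        "compress": tensor_bytes >= TENSOR_COMPRESSION_THRESHOLD_BYTES,
    }


# A float16 activation tensor of shape (1, 2048, 4096) is 2048 * 4096 * 2 bytes,
# i.e. exactly 16 MiB, so it fills one window of 16 chunks.
print(plan_transfer(2048 * 4096 * 2))
```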
|
| 242 |
+
|
| 243 |
+
---
|
| 244 |
+
|
| 245 |
+
## 5. Build order (Phase 3)
|
| 246 |
+
|
| 247 |
+
Phase 3 is not a release; it is a set of long-running tracks. Suggested ordering by independence + value:
|
| 248 |
+
|
| 249 |
+
| Track | Modules | Outcome |
|-------|----------------------------------|-------------------------------------------------------------------------------|
| A | X09 Conformance + M32 Standard | Other people can build HearthNet-compliant nodes |
| B | M30 Evidence / EBKH | Marketplace claims and emergency posts carry provenance |
| C | M27 MoE Routing (machines only) | Better answers for free; routes RAG queries to best-suited backend |
| D | M27 + M28 (human routing) | Neighbour gets pinged when their expertise matches |
| E | M28 FedLearn | Communities co-train a small LoRA without sharing source data |
| F | X08 + M26 Distributed Inference | Two anchors jointly serve a 7B model; large models become feasible LAN-wide |
| G | M29 LoRa Beacons | Resilient "I am alive" pings during regional internet outages |
| H | M31 Civil Defence Pilot | A real Niederrhein THW Ortsverband uses HearthNet for an exercise |
|
| 259 |
+
|
| 260 |
+
Tracks can run in parallel. None of them block the existing Phase-2 system.
|
| 261 |
+
|
| 262 |
+
---
|
| 263 |
+
|
| 264 |
+
## 6. Spec versioning
|
| 265 |
+
|
| 266 |
+
- Capability Contract bumps to **v3.0** but the bump is *additive*. v2 nodes coexist with v3 nodes; experimental capabilities simply aren't seen by v2 nodes.
|
| 267 |
+
- The first concrete deliverable of Track A (M32) is to **decouple** the protocol spec from the implementation. After that, the contract has its own version track separate from the Python implementation's version.
|
| 268 |
+
|
| 269 |
+
---
|
| 270 |
+
|
| 271 |
+
## 7. Out-of-band documents (Phase 3)
|
| 272 |
+
|
| 273 |
+
- **RESEARCH_AGENDA.md**: the deeper "why" for each module; intended audience: PhD students and grant reviewers
- **GOVERNANCE.md**: how spec changes are proposed, reviewed, and accepted; ties into M32
- **ETHICS_REVIEW.md**: the framework for evaluating MoE-driven routing-to-humans (M27) and fedlearn-on-personal-data (M28)
- **CIVDEF_AGREEMENT_TEMPLATE.md**: the MoU template for a civil-defence pilot
|
| 277 |
+
|
| 278 |
+
---
|
| 279 |
+
|
| 280 |
+
## 8. What is NOT in Phase 3
|
| 281 |
+
|
| 282 |
+
Even with all of Phase 3 done, the following remain explicit non-goals:
|
| 283 |
+
|
| 284 |
+
- A central directory of communities. There is no "HearthNet.com" listing all communities. Discovery is via word of mouth + DHT + federation. Deferred indefinitely.
|
| 285 |
+
- An app store for capabilities. Capabilities are code in the source tree, reviewed by maintainers. Not pluggable at runtime by untrusted code.
|
| 286 |
+
- A consensus protocol (Paxos, Raft). Communities do not vote on shared state beyond event-log gossip. Federation does not imply consensus.
|
| 287 |
+
- A cryptocurrency / token economy. Not even for fedlearn incentives. Reputational signals only.
|
| 288 |
+
- AGI. Even the distributed inference module targets at-most-mid-sized models (7B-class). The thesis is "small models close to people are more useful than large models far away", and Phase 3 doesn't change that.
|
docs/p2_p3/CAPABILITY_CONTRACT_v2.md
ADDED
|
@@ -0,0 +1,899 @@
| 1 |
+
# HearthNet Capability Contract: Phase 2 additions (v2.0)
|
| 2 |
+
|
| 3 |
+
**Spec version:** v2.0
|
| 4 |
+
**Last touched:** 2026-06-09
|
| 5 |
+
**Builds on:** [`../CAPABILITY_CONTRACT.md`](../CAPABILITY_CONTRACT.md) (v1.0)
|
| 6 |
+
|
| 7 |
+
This document is **additive** to v1.0. Everything in v1.0 still holds unless explicitly overridden here. Bumping a node's `contract_version` to `"2.0"` means: "I implement all of v1 plus the additions below."
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 1. Conventions delta
|
| 12 |
+
|
| 13 |
+
### 1.1 New encoded forms
|
| 14 |
+
|
| 15 |
+
- **Token format:** `hntoken://v1/<base64-url-nopad of canonical-JSON of token body + signature>`. See [M16 §3](modules/M16-tokens.md).
- **Federation peering blob:** `hnfed://v1/<base64>`, analogous to the invite blob and signed by both community roots (cross-signature).
- **Encrypted payload header:** when a chat body is E2E-encrypted, the event's `data.body` becomes a `{"e2e": true, "header": {...}, "ciphertext": "<base64>"}` object. See [M23 §4](modules/M23-e2e-encryption.md).
|
| 18 |
+
|
| 19 |
+
### 1.2 Sign-over-method choice
|
| 20 |
+
|
| 21 |
+
Phase 2 capability tokens use the **JWS-flavoured** envelope, not the canonical-JSON envelope. Rationale: tokens are short-lived and frequently passed through HTTP intermediaries; JWS is the lingua franca.
|
| 22 |
+
|
| 23 |
+
```
|
| 24 |
+
hntoken_envelope = base64url(header) + "." + base64url(payload) + "." + base64url(signature)
|
| 25 |
+
```
|
| 26 |
+
|
| 27 |
+
Both forms continue to use Ed25519.
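The envelope formula above can be sketched as follows. This is a minimal illustration, not the M16 implementation: `sign` is a placeholder for an Ed25519 signer (the standard library has none), the signature is assumed to cover `header.payload` as in ordinary JWS compact serialization, and the compact, key-sorted JSON encoding is likewise an assumption.

```python
# Sketch under assumptions: `sign` stands in for a real Ed25519 signer.
import base64
import json
from typing import Callable


def b64url(data: bytes) -> str:
    """URL-safe base64 without '=' padding, as used by JWS."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url: restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def build_envelope(header: dict, payload: dict, sign: Callable[[bytes], bytes]) -> str:
    """Assemble header.payload.signature; the signature covers 'header.payload'."""
    head = b64url(json.dumps(header, separators=(",", ":"), sort_keys=True).encode())
    body = b64url(json.dumps(payload, separators=(",", ":"), sort_keys=True).encode())
    signing_input = f"{head}.{body}".encode("ascii")
    return f"{head}.{body}.{b64url(sign(signing_input))}"
```

A caller would pass a signer bound to its node key; splitting the result on `.` and decoding the first part with `b64url_decode` recovers the header.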
|
| 28 |
+
|
| 29 |
+
### 1.3 New error codes (additive)
|
| 30 |
+
|
| 31 |
+
| Code | Meaning |
|------|---------|
| `federation_forbidden` | The caller's community is not federated with ours for this capability |
| `token_invalid` | Token signature failed |
| `token_expired` | Token past `exp` |
| `token_scope_insufficient` | Token does not grant this capability |
| `relay_unreachable` | Configured relay tier is down |
| `e2e_session_missing` | Caller did not establish an X3DH session before sending encrypted message |
| `e2e_decrypt_failed` | Ciphertext could not be decrypted (key mismatch, ratchet drift) |
| `dht_lookup_failed` | DHT lookup timed out before finding sources |
| `not_federated` | Federation manifest does not exist between these communities |
|
| 42 |
+
|
| 43 |
+
---
|
| 44 |
+
|
| 45 |
+
## 2. Capability namespace: Phase 2 stable set
|
| 46 |
+
|
| 47 |
+
Promoted from "reserved" in v1.0:
|
| 48 |
+
|
| 49 |
+
| Prefix | Now | Defined |
|--------|-----|---------|
| `federation.*` | stable | [M14](modules/M14-federation.md) |
| `ocr.*` | stable | [M17](modules/M17-ocr.md) |
| `trans.*` | stable | [M18](modules/M18-translation.md) |
| `stt.*` `tts.*` | stable | [M19](modules/M19-stt-tts.md) |
| `img.*` | stable | [M20](modules/M20-vision.md) |
| `rerank.*` | stable | [M24](modules/M24-rerank.md) |
| `chat.thread.*` | stable | [M25](modules/M25-group-chat.md) |
| `chat.forward.*` | stable | [M14](modules/M14-federation.md) (via relay) |
| `auth.*` | stable (new) | [M16](modules/M16-tokens.md) |
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
## 3. Complete new capabilities list
|
| 64 |
+
|
| 65 |
+
| Name | Stability | Stream? | Trust required | Section |
|------|-----------|---------|----------------|---------|
| `federation.peer.add@1.0` | stable | no | anchor (with co-sig) | §4.1 |
| `federation.peer.remove@1.0` | stable | no | anchor (with co-sig) | §4.2 |
| `federation.peer.list@1.0` | stable | no | member | §4.3 |
| `federation.proxy@1.0` | stable | yes | federated | §4.4 |
| `auth.token.issue@1.0` | stable | no | member | §4.5 |
| `auth.token.revoke@1.0` | stable | no | issuer or trusted | §4.6 |
| `auth.token.introspect@1.0` | stable | no | self | §4.7 |
| `ocr.image@1.0` | stable | no | member | §4.8 |
| `ocr.pdf@1.0` | stable | yes (progress) | trusted | §4.9 |
| `trans.text@1.0` | stable | no | member | §4.10 |
| `stt.transcribe@1.0` | stable | yes (segments) | member | §4.11 |
| `tts.synthesize@1.0` | stable | yes (audio chunks) | member | §4.12 |
| `img.describe@1.0` | stable | no | member | §4.13 |
| `img.generate@1.0` | stable | yes (progress) | trusted | §4.14 |
| `rerank.text@1.0` | stable | no | member | §4.15 |
| `chat.thread.create@1.0` | stable | no | member | §4.16 |
| `chat.thread.send@1.0` | stable | no | thread member | §4.17 |
| `chat.thread.history@1.0` | stable | no | thread member | §4.18 |
| `chat.thread.leave@1.0` | stable | no | thread member | §4.19 |
| `chat.forward.put@1.0` | stable | yes | anchor with forward | §4.20 |
| `chat.forward.fetch@1.0` | stable | yes | self | §4.21 |
| `file.put.resume@1.0` | stable | yes | trusted | §4.22 |
| `llm.chat@2.0` (UPDATE) | stable | yes | member | §4.23 |
| `llm.tools.call@1.0` (NEW, used by `llm.chat` tool flow) | stable | no | member | §4.24 |
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
+
## 4. Per-capability specifications
|
| 95 |
+
|
| 96 |
+
### 4.1 `federation.peer.add@1.0`
|
| 97 |
+
|
| 98 |
+
Establish a federation link with another community.
|
| 99 |
+
|
| 100 |
+
**Request:**
|
| 101 |
+
```json
|
| 102 |
+
{
|
| 103 |
+
"params": {},
|
| 104 |
+
"input": {
|
| 105 |
+
"client_id": "01HXR...",
|
| 106 |
+
"peer_community_id": "ed25519:<other community root pubkey>",
|
| 107 |
+
"peer_endpoints": [{"transport":"https","host":"...","port":7080}],
|
| 108 |
+
"co_signers": [{"node_id":"...","signature":"..."}, "...", "..."],
|
| 109 |
+
"scope": {
|
| 110 |
+
"capabilities": ["rag.query","market.list"],
|
| 111 |
+
"data_visibility":"public_corpora_only"
|
| 112 |
+
},
|
| 113 |
+
"expires_at": "2027-06-09T00:00:00Z"
|
| 114 |
+
}
|
| 115 |
+
}
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
`co_signers` must contain at least `policy.min_signatures_to_federate` signatures (new policy field; default 3; see [M14 §5](modules/M14-federation.md)). The remote community must also have us in their federation manifest before federated calls work.
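A minimal sketch of the threshold check, assuming each co-signer entry is an object with `node_id` and `signature` as in the request example. The function name is hypothetical, counting each `node_id` only once is an assumption, and cryptographic verification of each signature is deliberately left out.

```python
# Hypothetical helper; it only counts entries and does NOT verify signatures.
from typing import Iterable, Mapping

MIN_SIGNATURES_TO_FEDERATE_DEFAULT = 3  # default of policy.min_signatures_to_federate


def enough_co_signers(co_signers: Iterable[Mapping[str, str]],
                      threshold: int = MIN_SIGNATURES_TO_FEDERATE_DEFAULT) -> bool:
    """Count distinct node_ids that supplied a signature; duplicates count once."""
    distinct = {
        c["node_id"] for c in co_signers
        if c.get("node_id") and c.get("signature")
    }
    return len(distinct) >= threshold
```

A real anchor would verify every signature against the signer's public key before counting it.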
|
| 119 |
+
|
| 120 |
+
**Response:**
|
| 121 |
+
```json
|
| 122 |
+
{"output": {"event_id": "01HXS...", "federation_id": "ed25519:A:ed25519:B"}, "meta": {"ms": 14}}
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
Emits `federation.peer.added` event.
|
| 126 |
+
|
| 127 |
+
**Errors:** `unauthorized`, `bad_request`, `not_found` (peer endpoints unreachable).
|
| 128 |
+
|
| 129 |
+
### 4.2 `federation.peer.remove@1.0`
|
| 130 |
+
|
| 131 |
+
Terminate a federation link.
|
| 132 |
+
|
| 133 |
+
**Request:**
|
| 134 |
+
```json
|
| 135 |
+
{
|
| 136 |
+
"params": {},
|
| 137 |
+
"input": {
|
| 138 |
+
"client_id": "01HXR...",
|
| 139 |
+
"peer_community_id": "ed25519:...",
|
| 140 |
+
"reason": "policy_violation|unused|mutual",
|
| 141 |
+
"co_signers": [...]
|
| 142 |
+
}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
Emits `federation.peer.removed`.
|
| 147 |
+
|
| 148 |
+
### 4.3 `federation.peer.list@1.0`
|
| 149 |
+
|
| 150 |
+
List active federations.
|
| 151 |
+
|
| 152 |
+
**Response:**
|
| 153 |
+
```json
|
| 154 |
+
{
|
| 155 |
+
"output": {
|
| 156 |
+
"peers": [
|
| 157 |
+
{
|
| 158 |
+
"community_id": "ed25519:...",
|
| 159 |
+
"name": "Geldern Demo",
|
| 160 |
+
"scope": {"capabilities":["rag.query"]},
|
| 161 |
+
"established_at": "...",
|
| 162 |
+
"expires_at": "...",
|
| 163 |
+
"last_heartbeat": "..."
|
| 164 |
+
}
|
| 165 |
+
]
|
| 166 |
+
},
|
| 167 |
+
"meta": {"ms": 2}
|
| 168 |
+
}
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
### 4.4 `federation.proxy@1.0`
|
| 172 |
+
|
| 173 |
+
A federated peer asks *our* community to forward a capability call to one of *our* members. This is how a cross-community RAG query works: the peer's anchor calls `federation.proxy` on our anchor, which then internally routes to `rag.query` on whichever local node has the corpus.
|
| 174 |
+
|
| 175 |
+
**Request:**
|
| 176 |
+
```json
|
| 177 |
+
{
|
| 178 |
+
"params": {"target_capability": "rag.query@1.0"},
|
| 179 |
+
"input": {
|
| 180 |
+
"client_id": "01HXR...",
|
| 181 |
+
"token": "hntoken://v1/...",
|
| 182 |
+
"body": { /* the body of the underlying capability */ }
|
| 183 |
+
}
|
| 184 |
+
}
|
| 185 |
+
```
|
| 186 |
+
|
| 187 |
+
**Response:** Whatever the target capability returns. Streams pass through transparently.
|
| 188 |
+
|
| 189 |
+
The proxy verifies that the token's scope includes `target_capability` and returns `federation_forbidden` otherwise.
|
| 190 |
+
|
| 191 |
+
### 4.5 `auth.token.issue@1.0`
|
| 192 |
+
|
| 193 |
+
Issue a capability token.
|
| 194 |
+
|
| 195 |
+
**Request:**
|
| 196 |
+
```json
|
| 197 |
+
{
|
| 198 |
+
"params": {},
|
| 199 |
+
"input": {
|
| 200 |
+
"client_id": "01HXR...",
|
| 201 |
+
"subject": "ed25519:<recipient NodeID>",
|
| 202 |
+
"scope": {
|
| 203 |
+
"capabilities": ["rag.query@1.0", "embed.text@1.0"],
|
| 204 |
+
"corpora": ["niederrhein-emergency"],
|
| 205 |
+
"rate_limit_per_minute": 60
|
| 206 |
+
},
|
| 207 |
+
"ttl_seconds": 3600,
|
| 208 |
+
"audience": "ed25519:<community_id where token is presented, optional>"
|
| 209 |
+
}
|
| 210 |
+
}
|
| 211 |
+
```
|
| 212 |
+
|
| 213 |
+
**Response:**
|
| 214 |
+
```json
|
| 215 |
+
{
|
| 216 |
+
"output": {"token": "hntoken://v1/eyJhbGc...", "token_id": "01HXS..."},
|
| 217 |
+
"meta": {"ms": 4}
|
| 218 |
+
}
|
| 219 |
+
```
|
| 220 |
+
|
| 221 |
+
See [M16](modules/M16-tokens.md) for token body schema.
|
| 222 |
+
|
| 223 |
+
### 4.6 `auth.token.revoke@1.0`
|
| 224 |
+
|
| 225 |
+
Revoke a previously-issued token.
|
| 226 |
+
|
| 227 |
+
**Request:**
|
| 228 |
+
```json
|
| 229 |
+
{"params": {}, "input": {"client_id":"01HXR...","token_id":"01HXR..."}}
|
| 230 |
+
```
|
| 231 |
+
|
| 232 |
+
Emits `auth.token.revoked` event.
|
| 233 |
+
|
| 234 |
+
### 4.7 `auth.token.introspect@1.0`
|
| 235 |
+
|
| 236 |
+
Self-only: check whether a token is still valid.
|
| 237 |
+
|
| 238 |
+
**Request:** `{"params":{},"input":{"token":"hntoken://v1/..."}}`
|
| 239 |
+
|
| 240 |
+
**Response:** `{"output":{"active":bool,"scope":{...},"expires_at":"..."},"meta":{...}}`
|
| 241 |
+
|
| 242 |
+
### 4.8 `ocr.image@1.0`
|
| 243 |
+
|
| 244 |
+
Extract text from a single image.
|
| 245 |
+
|
| 246 |
+
**Request:**
|
| 247 |
+
```json
|
| 248 |
+
{
|
| 249 |
+
"params": {"backend": "tesseract", "languages": ["deu","eng"]},
|
| 250 |
+
"input": {
|
| 251 |
+
"image_cid": "blake3:...",
|
| 252 |
+
"preprocess": {"deskew": true, "denoise": false}
|
| 253 |
+
}
|
| 254 |
+
}
|
| 255 |
+
```
|
| 256 |
+
|
| 257 |
+
**Response:**
|
| 258 |
+
```json
|
| 259 |
+
{
|
| 260 |
+
"output": {
|
| 261 |
+
"text": "Trinkwasser ohne Strom ...",
|
| 262 |
+
"blocks": [
|
| 263 |
+
{"text":"Trinkwasser ohne Strom","bbox":[10,20,300,40],"confidence":0.94}
|
| 264 |
+
],
|
| 265 |
+
"language": "de"
|
| 266 |
+
},
|
| 267 |
+
"meta": {"backend":"tesseract","ms":820}
|
| 268 |
+
}
|
| 269 |
+
```
|
| 270 |
+
|
| 271 |
+
### 4.9 `ocr.pdf@1.0`
|
| 272 |
+
|
| 273 |
+
Extract text from a (scanned) PDF. Streams per-page progress.
|
| 274 |
+
|
| 275 |
+
**Request:**
|
| 276 |
+
```json
|
| 277 |
+
{
|
| 278 |
+
"params": {"backend":"multilingual","languages":["deu","lat"]},
|
| 279 |
+
"input": {
|
| 280 |
+
"doc_cid": "blake3:...",
|
| 281 |
+
"page_range": [1, 50],
|
| 282 |
+
"preprocess": {"deskew": true},
|
| 283 |
+
"store_text": true
|
| 284 |
+
}
|
| 285 |
+
}
|
| 286 |
+
```
|
| 287 |
+
|
| 288 |
+
**Stream frames:**
|
| 289 |
+
```
|
| 290 |
+
event: progress
|
| 291 |
+
data: {"current": 3, "total": 12, "stage": "OCRing page 3"}
|
| 292 |
+
|
| 293 |
+
event: page
|
| 294 |
+
data: {"page": 3, "text": "...", "confidence_mean": 0.91}
|
| 295 |
+
|
| 296 |
+
event: done
|
| 297 |
+
data: {"pages": 12, "stored_cid": "blake3:...", "ms": 18342}
|
| 298 |
+
```
|
| 299 |
+
|
| 300 |
+
If `store_text:true`, the extracted text is stored as a new blob and its CID returned. Useful for piping into `rag.ingest`.
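On the client side, the stream frames shown above can be consumed with a small parser. This is a sketch that assumes frames arrive as `event:`/`data:` line pairs separated by blank lines, with a JSON object in each `data:` line; `parse_frames` is a hypothetical name, not part of the bus client.

```python
# Sketch: parse 'event:' / 'data:' framed text into (event, data) pairs.
import json
from typing import Iterator, Tuple


def parse_frames(stream_text: str) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, data) for each complete frame in the text."""
    event, data_lines = None, []
    for line in stream_text.splitlines() + [""]:  # trailing "" flushes the last frame
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif not line.strip():
            # A blank line ends the current frame.
            if event is not None and data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = None, []
```

A caller that wants to feed the result into `rag.ingest` would wait for the `done` frame and read `stored_cid` from its data.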
|
| 301 |
+
|
| 302 |
+
### 4.10 `trans.text@1.0`
|
| 303 |
+
|
| 304 |
+
Translate between languages.
|
| 305 |
+
|
| 306 |
+
**Request:**
|
| 307 |
+
```json
|
| 308 |
+
{
|
| 309 |
+
"params": {"backend":"nllb"},
|
| 310 |
+
"input": {
|
| 311 |
+
"text": "Brauche Wasserkanister",
|
| 312 |
+
"from": "de",
|
| 313 |
+
"to": "en",
|
| 314 |
+
"domain": "everyday"
|
| 315 |
+
}
|
| 316 |
+
}
|
| 317 |
+
```
|
| 318 |
+
|
| 319 |
+
**Response:**
|
| 320 |
+
```json
|
| 321 |
+
{
|
| 322 |
+
"output": {"text":"Need water canister", "confidence": 0.97},
|
| 323 |
+
"meta": {"backend":"nllb","model":"nllb-200-distilled-600M","ms":312}
|
| 324 |
+
}
|
| 325 |
+
```
|
| 326 |
+
|
| 327 |
+
Plattdeutsch (Low German) is supported as `nds`. The marketplace UI offers one-click translation of a foreign-language post.
|
| 328 |
+
|
| 329 |
+
### 4.11 `stt.transcribe@1.0`
|
| 330 |
+
|
| 331 |
+
Transcribe an audio blob.
|
| 332 |
+
|
| 333 |
+
**Request:**
|
| 334 |
+
```json
|
| 335 |
+
{
|
| 336 |
+
"params": {"backend":"whisper","model":"large-v3"},
|
| 337 |
+
"input": {
|
| 338 |
+
"audio_cid": "blake3:...",
|
| 339 |
+
"language": "auto",
|
| 340 |
+
"diarize": false,
|
| 341 |
+
"translate_to_en": false
|
| 342 |
+
}
|
| 343 |
+
}
|
| 344 |
+
```
|
| 345 |
+
|
| 346 |
+
**Stream frames:**
|
| 347 |
+
```
|
| 348 |
+
event: segment
|
| 349 |
+
data: {"start": 0.0, "end": 4.2, "text": "Hallo, ich brauche...", "language":"de"}
|
| 350 |
+
|
| 351 |
+
event: segment
|
| 352 |
+
data: {"start": 4.2, "end": 8.1, "text": "Hilfe mit dem Generator."}
|
| 353 |
+
|
| 354 |
+
event: done
|
| 355 |
+
data: {"language":"de","ms":2100,"duration_seconds":18.4}
|
| 356 |
+
```
|
| 357 |
+
|
| 358 |
+
### 4.12 `tts.synthesize@1.0`
|
| 359 |
+
|
| 360 |
+
Synthesize speech from text.
|
| 361 |
+
|
| 362 |
+
**Request:**
|
| 363 |
+
```json
|
| 364 |
+
{
|
| 365 |
+
"params": {"backend":"xtts","voice":"hannes_v1","language":"de"},
|
| 366 |
+
"input": {
|
| 367 |
+
"text": "Das Regenwasser muss zuerst gefiltert werden.",
|
| 368 |
+
"speed": 1.0,
|
| 369 |
+
"format": "ogg_vorbis"
|
| 370 |
+
}
|
| 371 |
+
}
|
| 372 |
+
```
|
| 373 |
+
|
| 374 |
+
**Stream frames:**
|
| 375 |
+
```
|
| 376 |
+
event: chunk
|
| 377 |
+
data: {"i":0,"size_bytes":16384,"data_b64":"..."}
|
| 378 |
+
|
| 379 |
+
event: done
|
| 380 |
+
data: {"total_bytes":91247,"duration_seconds":4.2,"format":"ogg_vorbis","ms":1832}
|
| 381 |
+
```
|
| 382 |
+
|
| 383 |
+
### 4.13 `img.describe@1.0`
|
| 384 |
+
|
| 385 |
+
Describe what's in an image.
|
| 386 |
+
|
| 387 |
+
**Request:**
|
| 388 |
+
```json
|
| 389 |
+
{
|
| 390 |
+
"params": {"backend":"florence2"},
|
| 391 |
+
"input": {
|
| 392 |
+
"image_cid": "blake3:...",
|
| 393 |
+
"task": "detailed_caption",
|
| 394 |
+
"language": "de"
|
| 395 |
+
}
|
| 396 |
+
}
|
| 397 |
+
```
|
| 398 |
+
|
| 399 |
+
`task` ∈ `{"caption","detailed_caption","ocr","objects","tags"}`.
|
| 400 |
+
|
| 401 |
+
**Response:**
|
| 402 |
+
```json
|
| 403 |
+
{
|
| 404 |
+
"output": {
|
| 405 |
+
"caption": "Ein Schaltplan einer einfachen Wasserfilteranlage mit ...",
|
| 406 |
+
"tags": ["schaltplan","wasserfilter","skizze"],
|
| 407 |
+
"objects": [{"label":"pipe","bbox":[10,20,80,90]}]
|
| 408 |
+
},
|
| 409 |
+
"meta": {"backend":"florence2","ms":640}
|
| 410 |
+
}
|
| 411 |
+
```
|
| 412 |
+
|
| 413 |
+
### 4.14 `img.generate@1.0`
|
| 414 |
+
|
| 415 |
+
Generate an image from a text prompt.
|
| 416 |
+
|
| 417 |
+
**Request:**
|
| 418 |
+
```json
|
| 419 |
+
{
|
| 420 |
+
"params": {"backend":"flux","model":"flux.1-dev","lora":"local-style-v1"},
|
| 421 |
+
"input": {
|
| 422 |
+
"prompt": "ein einfacher schaltplan einer wasserfilteranlage, schwarz auf weiss",
|
| 423 |
+
"negative_prompt": "color, photorealistic",
|
| 424 |
+
"width": 1024,
|
| 425 |
+
"height": 1024,
|
| 426 |
+
"steps": 20,
|
| 427 |
+
"seed": 12345
|
| 428 |
+
}
|
| 429 |
+
}
|
| 430 |
+
```
|
| 431 |
+
|
| 432 |
+
**Stream frames:**
|
| 433 |
+
```
|
| 434 |
+
event: progress
|
| 435 |
+
data: {"step":5,"total":20}
|
| 436 |
+
|
| 437 |
+
event: done
|
| 438 |
+
data: {"image_cid":"blake3:...","width":1024,"height":1024,"ms":12800}
|
| 439 |
+
```
|
| 440 |
+
|
| 441 |
+
### 4.15 `rerank.text@1.0`
|
| 442 |
+
|
| 443 |
+
Rerank a list of documents against a query.
|
| 444 |
+
|
| 445 |
+
**Request:**
|
| 446 |
+
```json
|
| 447 |
+
{
|
| 448 |
+
"params": {"model":"BAAI/bge-reranker-v2-m3"},
|
| 449 |
+
"input": {
|
| 450 |
+
"query": "Wie reinige ich Regenwasser ohne Strom?",
|
| 451 |
+
"documents": [
|
| 452 |
+
{"id":"doc1","text":"..."},
|
| 453 |
+
{"id":"doc2","text":"..."}
|
| 454 |
+
],
|
| 455 |
+
"top_k": 10
|
| 456 |
+
}
|
| 457 |
+
}
|
| 458 |
+
```
|
| 459 |
+
|
| 460 |
+
**Response:**
|
| 461 |
+
```json
|
| 462 |
+
{
|
| 463 |
+
"output": {
|
| 464 |
+
"ranked": [
|
| 465 |
+
{"id":"doc2","score":0.91},
|
| 466 |
+
{"id":"doc1","score":0.42}
|
| 467 |
+
]
|
| 468 |
+
},
|
| 469 |
+
"meta": {"model":"BAAI/bge-reranker-v2-m3","ms":42}
|
| 470 |
+
}
|
| 471 |
+
```
|
| 472 |
+
|
| 473 |
+
### 4.16 `chat.thread.create@1.0`
|
| 474 |
+
|
| 475 |
+
Create a multi-party thread.
|
| 476 |
+
|
| 477 |
+
**Request:**
|
| 478 |
+
```json
|
| 479 |
+
{
|
| 480 |
+
"params": {},
|
| 481 |
+
"input": {
|
| 482 |
+
"client_id": "01HXR...",
|
| 483 |
+
"name": "Nachbarschaftshilfe Mai",
|
| 484 |
+
"members": ["ed25519:...","ed25519:...","ed25519:..."],
|
| 485 |
+
"e2e_enabled": true
|
| 486 |
+
}
|
| 487 |
+
}
|
| 488 |
+
```
|
| 489 |
+
|
| 490 |
+
**Response:** `{"output":{"thread_id":"01HXR...","event_id":"01HXR..."},"meta":{...}}`
|
| 491 |
+
|
| 492 |
+
### 4.17 `chat.thread.send@1.0`
|
| 493 |
+
|
| 494 |
+
Send to a thread. Body is E2E-encrypted when `e2e_enabled`.
|
| 495 |
+
|
| 496 |
+
**Request:**
|
| 497 |
+
```json
|
| 498 |
+
{
|
| 499 |
+
"params": {"thread_id":"01HXR..."},
|
| 500 |
+
"input": {
|
| 501 |
+
"client_id": "01HXR...",
|
| 502 |
+
"body": "...", // cleartext or {"e2e":true,...} envelope
|
| 503 |
+
"attachments": [{"cid":"blake3:...","name":"..."}]
|
| 504 |
+
}
|
| 505 |
+
}
|
| 506 |
+
```
|
| 507 |
+
|
| 508 |
+
### 4.18 `chat.thread.history@1.0`
|
| 509 |
+
|
| 510 |
+
Self-only history retrieval for a thread.
|
| 511 |
+
|
| 512 |
+
**Request:** `{"params":{"thread_id":"01HXR..."},"input":{"since_lamport":4000,"limit":200}}`
|
| 513 |
+
|
| 514 |
+
### 4.19 `chat.thread.leave@1.0`
|
| 515 |
+
|
| 516 |
+
Leave a thread.
|
| 517 |
+
|
| 518 |
+
### 4.20 `chat.forward.put@1.0`
|
| 519 |
+
|
| 520 |
+
Store-and-forward: leave a chat message with an anchor for later delivery.
|
| 521 |
+
|
| 522 |
+
The **stream initiator pattern** is identical to that of `file.put`. Anchors that opt into the role register this capability.
|
| 523 |
+
|
| 524 |
+
### 4.21 `chat.forward.fetch@1.0`
|
| 525 |
+
|
| 526 |
+
Self-only: collect queued messages from an anchor.
|
| 527 |
+
|
| 528 |
+
### 4.22 `file.put.resume@1.0`
|
| 529 |
+
|
| 530 |
+
Resume a partial PUT.
|
| 531 |
+
|
| 532 |
+
**Request:**
|
| 533 |
+
```json
|
| 534 |
+
{"params":{},"input":{"manifest_cid":"blake3:...","client_id":"01HXR..."}}
|
| 535 |
+
```
|
| 536 |
+
|
| 537 |
+
**Response (server tells client which chunks are missing):**
|
| 538 |
+
```
|
| 539 |
+
event: ready
|
| 540 |
+
data: {"missing":[3,4,5,8]}
|
| 541 |
+
|
| 542 |
+
(client sends only those chunks)
|
| 543 |
+
|
| 544 |
+
event: done
|
| 545 |
+
data: {"received":4}
|
| 546 |
+
```
|
| 547 |
+
|
| 548 |
+
The server keeps partial transfer state for `FILE_RESUME_PARTIAL_TTL_SECONDS` (1 hour). After that, partial transfers are discarded and the client must restart.
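The server-side bookkeeping for a resume might look like the sketch below. Both function names are hypothetical, chunk indices are assumed to be zero-based, and measuring the TTL from the last received chunk (rather than from the start of the transfer) is an assumption.

```python
# Sketch of resume bookkeeping; names and TTL semantics are assumptions.
from typing import Iterable, List

FILE_RESUME_PARTIAL_TTL_SECONDS = 3600  # 1 hour


def missing_chunks(total_chunks: int, received: Iterable[int]) -> List[int]:
    """Return the sorted chunk indices the server has not stored yet."""
    have = set(received)
    return [i for i in range(total_chunks) if i not in have]


def partial_expired(last_activity_s: float, now_s: float) -> bool:
    """True once the partial transfer has been idle longer than the TTL."""
    return now_s - last_activity_s > FILE_RESUME_PARTIAL_TTL_SECONDS
```

For a nine-chunk manifest of which chunks 0, 1, 2, 6 and 7 have arrived, `missing_chunks(9, [0, 1, 2, 6, 7])` yields the `[3, 4, 5, 8]` shown in the `ready` frame.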
|
| 549 |
+
|
| 550 |
+
### 4.23 `llm.chat@2.0` (update)
|
| 551 |
+
|
| 552 |
+
Backward-compatible **minor bump** (still `name="llm.chat"`, callers can still ask for `@>=1.0` and be matched). New optional fields:
|
| 553 |
+
|
| 554 |
+
```json
|
| 555 |
+
{
|
| 556 |
+
"params": {"model":"...","modalities":["text","vision"]},
|
| 557 |
+
"input": {
|
| 558 |
+
"messages": [
|
| 559 |
+
{
|
| 560 |
+
"role": "user",
|
| 561 |
+
"content": [
|
| 562 |
+
{"type": "text", "text": "Was siehst du?"},
|
| 563 |
+
{"type": "image", "image_cid": "blake3:..."}
|
| 564 |
+
]
|
| 565 |
+
}
|
| 566 |
+
],
|
| 567 |
+
"tools": [
|
| 568 |
+
{
|
| 569 |
+
"name": "rag.query",
|
| 570 |
+
"description": "Search the niederrhein-emergency corpus",
|
| 571 |
+
"parameters_schema": { /* JSON Schema for tool args */ }
|
| 572 |
+
}
|
| 573 |
+
],
|
| 574 |
+
"tool_choice": "auto"
|
| 575 |
+
}
|
| 576 |
+
}
|
| 577 |
+
```
|
| 578 |
+
|
| 579 |
+
### 4.24 `llm.tools.call@1.0`
|
| 580 |
+
|
| 581 |
+
When an LLM emits a `tool_call_delta` stream frame followed by `tool_call` end, the **caller** is responsible for executing the tool. To make this composable, the LLM service offers `llm.tools.call` as a convenience that wraps "execute one bus call, return its output as a tool message". Callers MAY use it; the more general flow is to have the orchestrator (UI / agent) handle it.
|
| 582 |
+
|
| 583 |
+
**Request:**
|
| 584 |
+
```json
|
| 585 |
+
{
|
| 586 |
+
"params": {},
|
| 587 |
+
"input": {
|
| 588 |
+
"tool_call_id": "tc_01HXR...",
|
| 589 |
+
"target_capability":"rag.query@1.0",
|
| 590 |
+
"target_body": { /* the tool's args, validated against the tool's parameters_schema */ }
|
| 591 |
+
}
|
| 592 |
+
}
|
| 593 |
+
```
|
| 594 |
+
|
| 595 |
+
**Response:** mirrors target capability's response.
|
| 596 |
+
|
| 597 |
+
---
|
| 598 |
+
|
| 599 |
+
## 5. Wire format changes
|
| 600 |
+
|
| 601 |
+
### 5.1 WebSocket upgrade
|
| 602 |
+
|
| 603 |
+
For `/bus/v1/call`, clients MAY include:
|
| 604 |
+
|
| 605 |
+
```
|
| 606 |
+
Connection: Upgrade
|
| 607 |
+
Upgrade: websocket
|
| 608 |
+
Sec-WebSocket-Protocol: hearthnet-bus.v2
|
| 609 |
+
```
|
| 610 |
+
|
| 611 |
+
The server responds with `101 Switching Protocols` if it supports WebSocket (Phase 2 nodes do). Once upgraded, the connection is bidirectional and persistent for the life of the request, which is useful for tool-call loops and streaming RAG.
|
| 612 |
+
|
| 613 |
+
Frames over WebSocket are the same JSON event-name + data envelope as SSE, just delivered as binary or text WebSocket frames instead of `data:` lines.
|
| 614 |
+
|
| 615 |
+
See [X06](cross-cutting/X06-websocket.md).
|
| 616 |
+
|
| 617 |
+
### 5.2 Token-bearer requests
|
| 618 |
+
|
| 619 |
+
When a caller carries a capability token instead of (or in addition to) a per-request signature:
|
| 620 |
+
|
| 621 |
+
```
|
| 622 |
+
X-HearthNet-Token: hntoken://v1/<base64>
|
| 623 |
+
```
|
| 624 |
+
|
| 625 |
+
The server validates the token (signature, expiry, scope) and uses the token's `subject` as the effective caller for the trust check. The token's `issuer` must be a member of a federated community.
|
| 626 |
+
|
| 627 |
+
If both `X-HearthNet-Signature` and `X-HearthNet-Token` are present, signature is checked first; token is used to widen scope (e.g. "the caller is a federated peer, but for this single call they presented a token granting access").
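A sketch of the expiry and scope checks, mapped to the error codes from §1.3. It assumes the signature has already been verified (a failure there would be `token_invalid`), that `exp` is expressed in Unix seconds, and that the payload carries the same `scope.capabilities` list as the `auth.token.issue` request; `check_token` is a hypothetical name.

```python
# Sketch: expiry and scope checks on an already signature-verified payload.
import time
from typing import Optional


def check_token(payload: dict, capability: str, now: Optional[float] = None) -> Optional[str]:
    """Return an error code, or None when the token may be used for `capability`."""
    now = time.time() if now is None else now
    if payload.get("exp", 0) <= now:  # a missing exp is treated as expired
        return "token_expired"
    granted = payload.get("scope", {}).get("capabilities", [])
    if capability not in granted:
        return "token_scope_insufficient"
    return None
```

Treating a token without `exp` as expired is a conservative choice in this sketch, not a rule taken from the spec.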
|
| 628 |
+
|
| 629 |
+
### 5.3 Federation routing
|
| 630 |
+
|
| 631 |
+
When a node receives a call where `X-HearthNet-Community` ≠ our community ID:
|
| 632 |
+
|
| 633 |
+
1. Look up federation manifest for the calling community.
|
| 634 |
+
2. If absent → `not_federated` (404).
|
| 635 |
+
3. If present but scope does not include the requested capability → `federation_forbidden` (403).
|
| 636 |
+
4. Else, dispatch normally; record federation usage in metrics.
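
The four steps above can be sketched as a small decision function. This is a minimal illustration, not the reference implementation: `FederationManifest`, `route_federated_call`, and the flattened `granted_capabilities` set are hypothetical names, and stripping the `@version` suffix before the scope check is an assumption based on the unversioned names in the manifest example.

```python
from dataclasses import dataclass, field


@dataclass
class FederationManifest:
    # Hypothetical flattening of the manifest scope: capabilities the
    # calling community has been granted on this node's community.
    granted_capabilities: set = field(default_factory=set)


def route_federated_call(caller_community, our_community, capability, manifests):
    """Return (http_status, error_code) for an incoming call."""
    if caller_community == our_community:
        return 200, None  # local call, federation checks do not apply
    manifest = manifests.get(caller_community)  # step 1: look up the manifest
    if manifest is None:
        return 404, "not_federated"  # step 2
    # Assumption: scope lists unversioned names such as "rag.query".
    name = capability.split("@", 1)[0]
    if name not in manifest.granted_capabilities:
        return 403, "federation_forbidden"  # step 3
    return 200, None  # step 4: dispatch normally (metrics recording omitted)
```

A caller would pass the value of `X-HearthNet-Community` as `caller_community` and map a non-`None` error code onto the error envelope.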
|
| 637 |
+
|
| 638 |
+
---
|
| 639 |
+
|
| 640 |
+
## 6. Manifests
|
| 641 |
+
|
| 642 |
+
### 6.1 Federation manifest (new)
|
| 643 |
+
|
| 644 |
+
```json
|
| 645 |
+
{
|
| 646 |
+
"schema_version": 1,
|
| 647 |
+
"federation_id": "<community_a>:<community_b>",
|
| 648 |
+
"community_a": "ed25519:...",
|
| 649 |
+
"community_b": "ed25519:...",
|
| 650 |
+
"established_at": "2026-06-09T10:00:00Z",
|
| 651 |
+
"expires_at": "2027-06-09T10:00:00Z",
|
| 652 |
+
"scope": {
|
| 653 |
+
"a_grants_b": {"capabilities":["rag.query"], "corpora":["public-emergency"]},
|
| 654 |
+
"b_grants_a": {"capabilities":["rag.query"]}
|
| 655 |
+
},
|
| 656 |
+
"bootstrap_endpoints_a": [{"transport":"https","host":"...","port":7080}],
|
| 657 |
+
"bootstrap_endpoints_b": [{"transport":"https","host":"...","port":7080}],
|
| 658 |
+
"signatures": {
|
| 659 |
+
"a": {"signed_by":"ed25519:<anchor of A>","signature":"...","co_signers":[{...},{...}]},
|
| 660 |
+
"b": {"signed_by":"ed25519:<anchor of B>","signature":"...","co_signers":[{...},{...}]}
|
| 661 |
+
}
|
| 662 |
+
}
|
| 663 |
+
```
|
| 664 |
+
|
| 665 |
+
Both sides must sign with their `min_signatures_to_federate` threshold. The federation manifest lives in **both** communities' event logs.
|
| 666 |
+
|
| 667 |
+
### 6.2 Token body (new)
|
| 668 |
+
|
| 669 |
+
JWS-style. Header:
|
| 670 |
+
|
| 671 |
+
```json
|
| 672 |
+
{"alg":"EdDSA","typ":"hntoken","v":1}
|
| 673 |
+
```
|
| 674 |
+
|
| 675 |
+
Payload:
|
| 676 |
+
|
| 677 |
+
```json
|
| 678 |
+
{
|
| 679 |
+
"iss": "ed25519:<issuer NodeID>",
|
| 680 |
+
"sub": "ed25519:<subject NodeID>",
|
| 681 |
+
"aud": "ed25519:<audience community, optional>",
|
| 682 |
+
"iat": 1717939200,
|
| 683 |
+
"exp": 1717942800,
|
| 684 |
+
"jti": "01HXR...",
|
| 685 |
+
"scope": {
|
| 686 |
+
"capabilities": ["rag.query@1.0"],
|
| 687 |
+
"params_constraints": {"corpus":["niederrhein-emergency"]},
|
| 688 |
+
"rate_limit_per_minute": 60
|
| 689 |
+
}
|
| 690 |
+
}
|
| 691 |
+
```
|
| 692 |
+
|
| 693 |
+
Signature: Ed25519 over `base64url(header) + "." + base64url(payload)`.
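
As a rough sketch, the signing input and the claim checks could look like the following. The JSON serialization (compact, sorted keys) and the unpadded base64url encoding are assumptions borrowed from JWS practice, and `signing_input` and `check_claims` are hypothetical helper names. Ed25519 verification itself needs a cryptography library and is deliberately left out.

```python
import base64
import json


def b64url(data: bytes) -> str:
    # Unpadded base64url, as used by JWS compact serialization (assumption).
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def signing_input(header: dict, payload: dict) -> bytes:
    # Assumption: compact JSON with sorted keys; the spec excerpt does not
    # fix a canonical form.
    def enc(obj):
        raw = json.dumps(obj, separators=(",", ":"), sort_keys=True)
        return b64url(raw.encode("utf-8"))

    return (enc(header) + "." + enc(payload)).encode("ascii")


def check_claims(payload, capability, now, revoked_ids=frozenset()):
    """Return an error code from the errors table, or None if the call is allowed."""
    if payload["jti"] in revoked_ids:
        return "token_revoked"
    if now >= payload["exp"]:  # treated as expired at or after exp
        return "token_expired"
    if capability not in payload["scope"]["capabilities"]:
        return "token_scope_insufficient"
    return None
```

The bytes returned by `signing_input` are what the issuer would sign and what the server would verify against the `iss` key before running `check_claims`.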
|
| 694 |
+
|
| 695 |
+
### 6.3 Node manifest delta
|
| 696 |
+
|
| 697 |
+
Phase 2 nodes set `contract_version: "2.0"`. Additional fields in `capabilities[].params`:
|
| 698 |
+
|
| 699 |
+
```json
|
| 700 |
+
{
|
| 701 |
+
"name": "llm.chat",
|
| 702 |
+
"version": "2.0",
|
| 703 |
+
"params": {
|
| 704 |
+
"model": "...",
|
| 705 |
+
"modalities": ["text","vision"],
|
| 706 |
+
"tools_supported": true,
|
| 707 |
+
"max_tools_per_call": 16,
|
| 708 |
+
"requires_internet": false
|
| 709 |
+
}
|
| 710 |
+
}
|
| 711 |
+
```
|
| 712 |
+
|
| 713 |
+
---
|
| 714 |
+
|
| 715 |
+
## 7. Events (additive to v1.0 §7.2)
|
| 716 |
+
|
| 717 |
+
### 7.1 New event types
|
| 718 |
+
|
| 719 |
+
```
|
| 720 |
+
federation.peer.added
|
| 721 |
+
federation.peer.removed
|
| 722 |
+
federation.heartbeat
|
| 723 |
+
auth.token.issued
|
| 724 |
+
auth.token.revoked
|
| 725 |
+
chat.thread.created
|
| 726 |
+
chat.thread.member.added
|
| 727 |
+
chat.thread.member.removed
|
| 728 |
+
chat.thread.message.sent
|
| 729 |
+
chat.thread.message.delivered
|
| 730 |
+
chat.thread.archived
|
| 731 |
+
e2e.prekeys.published
|
| 732 |
+
e2e.session.established
|
| 733 |
+
e2e.session.broken
|
| 734 |
+
file.replication.scheduled
|
| 735 |
+
file.replication.completed
|
| 736 |
+
ocr.document.indexed
|
| 737 |
+
```
|
| 738 |
+
|
| 739 |
+
### 7.2 Selected schemas
|
| 740 |
+
|
| 741 |
+
#### `federation.peer.added`
|
| 742 |
+
|
| 743 |
+
```json
|
| 744 |
+
{
|
| 745 |
+
"peer_community_id": "ed25519:...",
|
| 746 |
+
"federation_id": "...",
|
| 747 |
+
"scope": {...},
|
| 748 |
+
"co_signers": [{...},{...},{...}]
|
| 749 |
+
}
|
| 750 |
+
```
|
| 751 |
+
|
| 752 |
+
#### `auth.token.issued`
|
| 753 |
+
|
| 754 |
+
Stored without the signature payload (just metadata for audit):
|
| 755 |
+
|
| 756 |
+
```json
|
| 757 |
+
{
|
| 758 |
+
"token_id": "01HXR...",
|
| 759 |
+
"subject": "ed25519:...",
|
| 760 |
+
"scope": {...},
|
| 761 |
+
"expires_at":"...",
|
| 762 |
+
"audience": "ed25519:..."
|
| 763 |
+
}
|
| 764 |
+
```
|
| 765 |
+
|
| 766 |
+
#### `auth.token.revoked`
|
| 767 |
+
|
| 768 |
+
```json
|
| 769 |
+
{"token_id":"01HXR...","reason":"manual|policy|compromise"}
|
| 770 |
+
```
|
| 771 |
+
|
| 772 |
+
#### `chat.thread.created`
|
| 773 |
+
|
| 774 |
+
```json
|
| 775 |
+
{
|
| 776 |
+
"thread_id": "01HXR...",
|
| 777 |
+
"client_id": "01HXR...",
|
| 778 |
+
"name": "Nachbarschaftshilfe Mai",
|
| 779 |
+
"members": ["ed25519:...","ed25519:..."],
|
| 780 |
+
"e2e_enabled": true,
|
| 781 |
+
"ratchet_root_pubkey": "x25519:..."
|
| 782 |
+
}
|
| 783 |
+
```
|
| 784 |
+
|
| 785 |
+
#### `chat.thread.message.sent`
|
| 786 |
+
|
| 787 |
+
```json
|
| 788 |
+
{
|
| 789 |
+
"thread_id": "01HXR...",
|
| 790 |
+
"client_id": "01HXR...",
|
| 791 |
+
"body": {"e2e":true,"header":{...},"ciphertext":"..."} | "<cleartext>",
|
| 792 |
+
"attachments": [...]
|
| 793 |
+
}
|
| 794 |
+
```
|
| 795 |
+
|
| 796 |
+
#### `e2e.prekeys.published`
|
| 797 |
+
|
| 798 |
+
```json
|
| 799 |
+
{
|
| 800 |
+
"node_id": "ed25519:...",
|
| 801 |
+
"identity_pubkey": "x25519:...",
|
| 802 |
+
"signed_prekey": {"pubkey":"x25519:...","signature":"ed25519:..."},
|
| 803 |
+
"one_time_prekeys": ["x25519:...","x25519:...","..."]
|
| 804 |
+
}
|
| 805 |
+
```
|
| 806 |
+
|
| 807 |
+
#### `file.replication.scheduled`
|
| 808 |
+
|
| 809 |
+
```json
|
| 810 |
+
{
|
| 811 |
+
"cid": "blake3:...",
|
| 812 |
+
"desired_copies": 3,
|
| 813 |
+
"current_copies": 1,
|
| 814 |
+
"candidate_holders": ["ed25519:...","ed25519:..."]
|
| 815 |
+
}
|
| 816 |
+
```
|
| 817 |
+
|
| 818 |
+
#### `ocr.document.indexed`
|
| 819 |
+
|
| 820 |
+
```json
|
| 821 |
+
{
|
| 822 |
+
"doc_cid": "blake3:...",
|
| 823 |
+
"text_cid": "blake3:...",
|
| 824 |
+
"pages": 12,
|
| 825 |
+
"languages": ["de","la"],
|
| 826 |
+
"ocr_backend": "multilingual"
|
| 827 |
+
}
|
| 828 |
+
```
|
| 829 |
+
|
| 830 |
+
### 7.3 Federation events propagate cross-community
|
| 831 |
+
|
| 832 |
+
Events with `event_type ∈ {federation.*, auth.token.issued, auth.token.revoked}` MAY be cross-published into a federated community's event log. The community receiving such an event records the originating community in `data._source_community`. This is the only case where an event's `community_id` does not equal the log it lives in.
|
| 833 |
+
|
| 834 |
+
---
|
| 835 |
+
|
| 836 |
+
## 8. Pub-sub topics (additive)
|
| 837 |
+
|
| 838 |
+
| Topic | Producer | Subscriber |
|
| 839 |
+
|-------|----------|------------|
|
| 840 |
+
| `federation.peer.added` | member adding | all members |
|
| 841 |
+
| `federation.peer.heartbeat.<peer_community>` | federation client loop | UI |
|
| 842 |
+
| `auth.token.issued` | issuer | issuer + subject |
|
| 843 |
+
| `chat.thread.message.<thread_id>` | sender | thread members |
|
| 844 |
+
| `e2e.prekey.request.<our_short_id>` | sender wanting session | recipient |
|
| 845 |
+
| `e2e.session.handshake.<our_short_id>` | initiator | responder |
|
| 846 |
+
| `file.replication.request.<cid_prefix>` | replication scheduler | all anchors |
|
| 847 |
+
| `mobile.push.<device_id>` | sender | push relay tier (M15) |
|
| 848 |
+
|
| 849 |
+
---
|
| 850 |
+
|
| 851 |
+
## 9. Errors: complete delta (additive to v1.0 §9)
|
| 852 |
+
|
| 853 |
+
| Code | When | Retry? |
|
| 854 |
+
|------|------|--------|
|
| 855 |
+
| `federation_forbidden` | Caller's community not federated for this capability | no |
|
| 856 |
+
| `not_federated` | No federation manifest with caller's community | no |
|
| 857 |
+
| `token_invalid` | Token signature bad | no |
|
| 858 |
+
| `token_expired` | Token past `exp` | no, request a new token |
|
| 859 |
+
| `token_scope_insufficient` | Token does not include this capability | no |
|
| 860 |
+
| `token_revoked` | Token id in revoked list | no |
|
| 861 |
+
| `relay_unreachable` | Configured relay tier down | yes, exp backoff |
|
| 862 |
+
| `e2e_session_missing` | No active X3DH session | yes, after key exchange |
|
| 863 |
+
| `e2e_decrypt_failed` | Ciphertext can't be decrypted | no, request rekey |
|
| 864 |
+
| `dht_lookup_failed` | DHT did not find sources in time | yes |
|
| 865 |
+
| `ratchet_out_of_order` | Message too far out of order; sender must rewind | maybe |
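
A client might encode the "Retry?" column roughly as below. This is only a sketch: `should_retry` and `backoff_delays` are hypothetical helpers, the base delay and cap are illustrative values, and `ratchet_out_of_order` ("maybe") is conservatively treated as non-retryable.

```python
import random

# Codes marked "yes" in the table above. Note that e2e_session_missing is
# only worth retrying after a key exchange has completed.
RETRYABLE = {"relay_unreachable", "e2e_session_missing", "dht_lookup_failed"}


def should_retry(code: str) -> bool:
    return code in RETRYABLE


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield exponential backoff delays in seconds, with full jitter."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)
```

Passing `rng=lambda: 1.0` removes the jitter, which makes the schedule deterministic for tests.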
|
| 866 |
+
|
| 867 |
+
---
|
| 868 |
+
|
| 869 |
+
## 10. Versioning and migration
|
| 870 |
+
|
| 871 |
+
### 10.1 Mixed-version mesh
|
| 872 |
+
|
| 873 |
+
A v1.0 node and a v2.0 node may coexist on the same LAN, but:
|
| 874 |
+
|
| 875 |
+
- A v2.0 node calling a v1.0 node for a Phase 2 capability gets `not_found` (v1 didn't register it).
|
| 876 |
+
- A v1.0 node calling a v2.0 node for a v1 capability works fine (additive contract).
|
| 877 |
+
- v2.0 routes around v1.0 nodes for any capability that requires v2 features.
|
| 878 |
+
|
| 879 |
+
### 10.2 Migration of an existing community
|
| 880 |
+
|
| 881 |
+
When the founder upgrades to v2.0:
|
| 882 |
+
|
| 883 |
+
1. New `policy.min_signatures_to_federate` field added with default 3
|
| 884 |
+
2. New event types unlock; old log still replays cleanly
|
| 885 |
+
3. Existing nodes prompted to upgrade via `community.policy.updated` event
|
| 886 |
+
4. After 30 days, federation capabilities won't dispatch to non-upgraded nodes
|
| 887 |
+
|
| 888 |
+
See `MIGRATION_v1_to_v2.md` (out of band).
|
| 889 |
+
|
| 890 |
+
---
|
| 891 |
+
|
| 892 |
+
## 11. Still out of scope (deferred to Phase 3)
|
| 893 |
+
|
| 894 |
+
- Distributed-tensor inference capabilities (`experimental.distributed_llm.chat`)
|
| 895 |
+
- MoE-style expert routing (lives inside the bus as a learned scorer)
|
| 896 |
+
- Federated learning capabilities (`fedlearn.*`)
|
| 897 |
+
- LoRA long-distance beacons (no capability, hardware-only)
|
| 898 |
+
- Evidence-layer integration (`evidence.*` namespace reserved here, defined in Phase 3)
|
| 899 |
+
- Conformance test suite as a protocol surface
|
docs/p2_p3/CAPABILITY_CONTRACT_v3.md
ADDED
|
@@ -0,0 +1,651 @@
|
| 1 |
+
# HearthNet Capability Contract: Phase 3 additions (v3.0)
|
| 2 |
+
|
| 3 |
+
**Spec version:** v3.0
|
| 4 |
+
**Last touched:** 2026-06-09
|
| 5 |
+
**Builds on:** [`../phase-2/CAPABILITY_CONTRACT_v2.md`](../phase-2/CAPABILITY_CONTRACT_v2.md) (v2.0)
|
| 6 |
+
|
| 7 |
+
This document is **additive** to v2.0. Phase 3 capabilities are mostly under the `experimental.` namespace; nodes default to ignoring them unless `policy.research.enable = true`.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 1. Conventions delta
|
| 12 |
+
|
| 13 |
+
### 1.1 The `experimental.` namespace
|
| 14 |
+
|
| 15 |
+
A capability of the form `experimental.<name>@<ver>` is treated specially:
|
| 16 |
+
|
| 17 |
+
- Nodes do not register experimental capabilities by default.
|
| 18 |
+
- The bus discovery layer (M02 / X05) excludes experimental capabilities from outbound advertisements unless the local policy opts in.
|
| 19 |
+
- A capability in `experimental.` MAY change its contract between minor versions without a contract bump. Callers MUST tolerate breakage.
|
| 20 |
+
- Once a capability is sufficiently proven and stable, it is **promoted** out of `experimental.` in a future minor contract bump. The prior `experimental.` form is then deprecated; both forms work for one minor-release window.
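
The opt-in rules above could be enforced with a gate along these lines. This is a sketch under assumptions: `research_enabled` stands in for the `policy.research.enable` setting, and the helper names are hypothetical.

```python
EXPERIMENTAL_PREFIX = "experimental."


def is_experimental(capability: str) -> bool:
    return capability.startswith(EXPERIMENTAL_PREFIX)


def advertised(capabilities, research_enabled: bool):
    """Capabilities to include in outbound advertisements."""
    return [c for c in capabilities if research_enabled or not is_experimental(c)]


def admit(capability: str, research_enabled: bool):
    """Return an error code, or None if the call may proceed."""
    if is_experimental(capability) and not research_enabled:
        return "experimental_disabled"
    return None
```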
|
| 21 |
+
|
| 22 |
+
Promotion criteria (a soft checklist; the protocol working group, see M32, decides):
|
| 23 |
+
- ≥ 2 independent implementations.
|
| 24 |
+
- Used in production by ≥ 3 communities for ≥ 90 days.
|
| 25 |
+
- Open security review with no unresolved high-severity findings.
|
| 26 |
+
- Per-call cost (compute, latency, bytes) within a 2× factor of the budget in the capability spec.
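
Because the checklist is soft and the working group decides, any automation can only report which criteria are unmet. A minimal sketch follows, in which `PromotionEvidence` and its field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class PromotionEvidence:
    independent_implementations: int
    production_communities: int
    production_days: int
    unresolved_high_severity_findings: int
    cost_ratio_to_budget: float  # measured per-call cost / budgeted cost


def unmet_criteria(e: PromotionEvidence):
    """List the checklist items that are not yet satisfied."""
    unmet = []
    if e.independent_implementations < 2:
        unmet.append("implementations")
    if e.production_communities < 3 or e.production_days < 90:
        unmet.append("production_use")
    if e.unresolved_high_severity_findings > 0:
        unmet.append("security_review")
    if e.cost_ratio_to_budget > 2.0:
        unmet.append("cost")
    return unmet
```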
|
| 27 |
+
|
| 28 |
+
### 1.2 Claim records (new top-level concept)
|
| 29 |
+
|
| 30 |
+
Phase 3 introduces a second persistent surface alongside the event log: the **claim graph**. Where the event log records "X did Y at T", the claim graph records "X asserts P, citing E".
|
| 31 |
+
|
| 32 |
+
```json
|
| 33 |
+
{
|
| 34 |
+
"schema_version": 1,
|
| 35 |
+
"claim_id": "01HXR...",
|
| 36 |
+
"claim_type": "factual|preference|policy|sighting|...",
|
| 37 |
+
"predicate": {"subject":"...","verb":"...","object":"...","modifiers":{...}},
|
| 38 |
+
"evidence": [{"kind":"event_ref","value":"01HXR..."}, {"kind":"document_cid","value":"blake3:..."}],
|
| 39 |
+
"asserted_by": "ed25519:...",
|
| 40 |
+
"asserted_at": "...",
|
| 41 |
+
"evidence_level": "unverified|cited|cross_referenced|attested|disputed",
|
| 42 |
+
"supersedes": ["01HXR..."],
|
| 43 |
+
"signature": "..."
|
| 44 |
+
}
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
Claims live in a separate Merkle-DAG store ([M30 §4](modules/M30-evidence-ebkh.md)). They are not events. Events describe what *happened*; claims describe what is *believed*.
|
| 48 |
+
|
| 49 |
+
### 1.3 New error codes (additive)
|
| 50 |
+
|
| 51 |
+
| Code | Meaning |
|
| 52 |
+
|------|---------|
|
| 53 |
+
| `experimental_disabled` | Caller asked for an experimental capability that the node has not opted into |
|
| 54 |
+
| `shard_unavailable` | Distributed inference: one shard host failed mid-stream |
|
| 55 |
+
| `pipeline_stalled` | Distributed inference: no progress within timeout |
|
| 56 |
+
| `fedlearn_round_quorum` | Federated learning: too few participants for this round |
|
| 57 |
+
| `fedlearn_diff_invalid` | Submitted LoRA diff failed schema or bounds check |
|
| 58 |
+
| `evidence_contradiction` | A new claim directly contradicts a previously-attested claim |
|
| 59 |
+
| `civdef_audit_required` | Operation rejected because civil-defence audit policy is active and the call is unsigned by an authorised role |
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
## 2. Capability namespace allocations
|
| 64 |
+
|
| 65 |
+
Promoted from "reserved" in v2.0 or introduced new:
|
| 66 |
+
|
| 67 |
+
| Prefix | Status | Defined |
|
| 68 |
+
|--------|--------|---------|
|
| 69 |
+
| `experimental.distributed_llm.*` | experimental | [M26](modules/M26-distributed-inference.md) |
|
| 70 |
+
| `experimental.moe.*` | experimental | [M27](modules/M27-moe-routing.md) |
|
| 71 |
+
| `experimental.fedlearn.*` | experimental | [M28](modules/M28-fedlearn.md) |
|
| 72 |
+
| `evidence.*` | stable | [M30](modules/M30-evidence-ebkh.md) |
|
| 73 |
+
| `civdef.*` | stable (when civdef profile active) | [M31](modules/M31-civil-defense.md) |
|
| 74 |
+
| `protocol.*` | stable | [M32](modules/M32-protocol-standard.md) (conformance reporting) |
|
| 75 |
+
|
| 76 |
+
---
|
| 77 |
+
|
| 78 |
+
## 3. Phase 3 capabilities
|
| 79 |
+
|
| 80 |
+
| Name | Stability | Stream? | Trust required | Section |
|
| 81 |
+
|------|-----------|---------|----------------|---------|
|
| 82 |
+
| `experimental.distributed_llm.chat@1.0` | experimental | yes | member + research opt-in | §4.1 |
|
| 83 |
+
| `experimental.distributed_llm.shard.advertise@1.0` | experimental | no | trusted + research opt-in | §4.2 |
|
| 84 |
+
| `experimental.distributed_llm.shard.serve@1.0` | experimental | yes | trusted + research opt-in | §4.3 |
|
| 85 |
+
| `experimental.moe.route@1.0` | experimental | no | member + research opt-in | §4.4 |
|
| 86 |
+
| `experimental.moe.expert.register@1.0` | experimental | no | self + research opt-in | §4.5 |
|
| 87 |
+
| `experimental.moe.expert.handoff@1.0` | experimental | yes | as configured | §4.6 |
|
| 88 |
+
| `experimental.fedlearn.round.start@1.0` | experimental | no | anchor + research opt-in | §4.7 |
|
| 89 |
+
| `experimental.fedlearn.round.participate@1.0` | experimental | yes | member + research opt-in | §4.8 |
|
| 90 |
+
| `experimental.fedlearn.round.aggregate@1.0` | experimental | no | round coordinator | §4.9 |
|
| 91 |
+
| `experimental.fedlearn.lora.publish@1.0` | experimental | no | anchor | §4.10 |
|
| 92 |
+
| `evidence.claim.assert@1.0` | stable | no | member | §4.11 |
|
| 93 |
+
| `evidence.claim.dispute@1.0` | stable | no | member | §4.12 |
|
| 94 |
+
| `evidence.claim.attest@1.0` | stable | no | trusted | §4.13 |
|
| 95 |
+
| `evidence.claim.query@1.0` | stable | no | member | §4.14 |
|
| 96 |
+
| `evidence.provenance.trace@1.0` | stable | no | member | §4.15 |
|
| 97 |
+
| `civdef.alert.publish@1.0` | stable | no | authorised KatS role | §4.16 |
|
| 98 |
+
| `civdef.role.register@1.0` | stable | no | anchor (with role-cert) | §4.17 |
|
| 99 |
+
| `civdef.audit.export@1.0` | stable | yes | authorised auditor | §4.18 |
|
| 100 |
+
| `protocol.conformance.report@1.0` | stable | no | self | §4.19 |
|
| 101 |
+
| `protocol.version.list@1.0` | stable | no | unknown | §4.20 |
|
| 102 |
+
|
| 103 |
+
---
|
| 104 |
+
|
| 105 |
+
## 4. Per-capability specifications
|
| 106 |
+
|
| 107 |
+
### 4.1 `experimental.distributed_llm.chat@1.0`
|
| 108 |
+
|
| 109 |
+
Like `llm.chat@2.0` but the inference is sharded across multiple shard-server nodes. The caller's node acts as the orchestrator and streams tokens back to the user.
|
| 110 |
+
|
| 111 |
+
**Request:**
|
| 112 |
+
```json
|
| 113 |
+
{
|
| 114 |
+
"params": {
|
| 115 |
+
"model": "Qwen2.5-7B-Instruct",
|
| 116 |
+
"sharding": "auto",
|
| 117 |
+
"fallback_to_local": true
|
| 118 |
+
},
|
| 119 |
+
"input": {
|
| 120 |
+
"messages": [...]
|
| 121 |
+
}
|
| 122 |
+
}
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
**Stream frames:** same as `llm.chat@2.0` (`token_delta`, `done`), plus diagnostic frames:
|
| 126 |
+
|
| 127 |
+
```
|
| 128 |
+
event: shard_status
|
| 129 |
+
data: {"shards":[
|
| 130 |
+
{"shard_id":"Qwen2.5-7B:0-7","host":"ed25519:...","status":"online","latency_ms":4},
|
| 131 |
+
{"shard_id":"Qwen2.5-7B:8-15","host":"ed25519:...","status":"online","latency_ms":7}
|
| 132 |
+
]}
|
| 133 |
+
|
| 134 |
+
event: shard_failover
|
| 135 |
+
data: {"failed_shard":"Qwen2.5-7B:8-15","replacement":"Qwen2.5-7B:8-15@other_host"}
|
| 136 |
+
```
|
| 137 |
+
|
| 138 |
+
**Errors:** `shard_unavailable`, `pipeline_stalled`, `experimental_disabled`.
|
| 139 |
+
|
| 140 |
+
### 4.2 `experimental.distributed_llm.shard.advertise@1.0`
|
| 141 |
+
|
| 142 |
+
A node informs the bus that it is willing to serve a specific shard range.
|
| 143 |
+
|
| 144 |
+
**Request:**
|
| 145 |
+
```json
|
| 146 |
+
{
|
| 147 |
+
"params": {},
|
| 148 |
+
"input": {
|
| 149 |
+
"shard_id": "Qwen2.5-7B:0-7",
|
| 150 |
+
"model_id": "Qwen2.5-7B-Instruct",
|
| 151 |
+
"layer_range": [0, 7],
|
| 152 |
+
"max_concurrent_streams": 2,
|
| 153 |
+
"vram_required_mb": 6800
|
| 154 |
+
}
|
| 155 |
+
}
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
Emits `experimental.shard.advertised` into the event log. Other nodes can then call `experimental.distributed_llm.shard.serve` on the advertising node to use the shard.
|
| 159 |
+
|
| 160 |
+
### 4.3 `experimental.distributed_llm.shard.serve@1.0`
|
| 161 |
+
|
| 162 |
+
Tensor-passing inner call. Not normally invoked by user code; it is used by orchestrators only. See [X08](cross-cutting/X08-tensor-transport.md) for the wire format.
|
| 163 |
+
|
| 164 |
+
### 4.4 `experimental.moe.route@1.0`
|
| 165 |
+
|
| 166 |
+
Decide which expert (model, human, service) to route a request to.
|
| 167 |
+
|
| 168 |
+
**Request:**
|
| 169 |
+
```json
|
| 170 |
+
{
|
| 171 |
+
"params": {},
|
| 172 |
+
"input": {
|
| 173 |
+
"request_summary": "User asks about Sankt Martins parade route in Issum, 2026.",
|
| 174 |
+
"tags": ["local_knowledge","event_planning"],
|
| 175 |
+
"top_k": 3
|
| 176 |
+
}
|
| 177 |
+
}
|
| 178 |
+
```
|
| 179 |
+
|
| 180 |
+
**Response:**
|
| 181 |
+
```json
|
| 182 |
+
{
|
| 183 |
+
"output": {
|
| 184 |
+
"routes": [
|
| 185 |
+
{"expert_id":"human:ed25519:...", "kind":"human", "score":0.91, "name":"Maria K."},
|
| 186 |
+
{"expert_id":"corpus:niederrhein-events", "kind":"service", "score":0.74, "endpoint":"rag.query@1.0"},
|
| 187 |
+
{"expert_id":"model:llama3-70b-instruct", "kind":"model", "score":0.41}
|
| 188 |
+
],
|
| 189 |
+
"rationale": "Sankt Martins is a local cultural event; humans with annotated knowledge of Issum specifically score highest."
|
| 190 |
+
},
|
| 191 |
+
"meta": {"ms": 28}
|
| 192 |
+
}
|
| 193 |
+
```
|
| 194 |
+
|
| 195 |
+
### 4.5 `experimental.moe.expert.register@1.0`
|
| 196 |
+
|
| 197 |
+
A node (or a human via their node) declares itself an expert on some topics.
|
| 198 |
+
|
| 199 |
+
```json
|
| 200 |
+
{
|
| 201 |
+
"params": {},
|
| 202 |
+
"input": {
|
| 203 |
+
"expert_kind": "human",
|
| 204 |
+
"topics": ["sankt_martins","niederrhein_local_history"],
|
| 205 |
+
"availability": {"weekdays_19_21_local": true},
|
| 206 |
+
"consent_to_route": true
|
| 207 |
+
}
|
| 208 |
+
}
|
| 209 |
+
```
|
| 210 |
+
|
| 211 |
+
### 4.6 `experimental.moe.expert.handoff@1.0`
|
| 212 |
+
|
| 213 |
+
When a route to a human expert is chosen, this capability hands the conversation off to the expert's UI (typically: chat thread invite, optional E2E).
|
| 214 |
+
|
| 215 |
+
```json
|
| 216 |
+
{
|
| 217 |
+
"params": {},
|
| 218 |
+
"input": {
|
| 219 |
+
"expert_id": "human:ed25519:...",
|
| 220 |
+
"context_summary": "...",
|
| 221 |
+
"permitted_replies": ["text","attachment"],
|
| 222 |
+
"deadline_minutes": 60
|
| 223 |
+
}
|
| 224 |
+
}
|
| 225 |
+
```
|
| 226 |
+
|
| 227 |
+
### 4.7 `experimental.fedlearn.round.start@1.0`
|
| 228 |
+
|
| 229 |
+
Anchor opens a federated learning round.
|
| 230 |
+
|
| 231 |
+
```json
|
| 232 |
+
{
|
| 233 |
+
"params": {},
|
| 234 |
+
"input": {
|
| 235 |
+
"round_id": "01HXR...",
|
| 236 |
+
"base_model": "Qwen2.5-3B-Instruct",
|
| 237 |
+
"lora_config": {"r":16,"alpha":32,"target_modules":["q_proj","v_proj"]},
|
| 238 |
+
"training_corpus": "niederrhein-emergency",
|
| 239 |
+
"min_participants": 3,
|
| 240 |
+
"max_minutes": 120,
|
| 241 |
+
"objective": "next_token_loss",
|
| 242 |
+
"dp_noise_scale": 0.0
|
| 243 |
+
}
|
| 244 |
+
}
|
| 245 |
+
```
|
| 246 |
+
|
| 247 |
+
### 4.8 `experimental.fedlearn.round.participate@1.0`
|
| 248 |
+
|
| 249 |
+
A node opts into a round and streams its computed LoRA diff back.
|
| 250 |
+
|
| 251 |
+
**Stream frames:**
|
| 252 |
+
```
|
| 253 |
+
event: phase
|
| 254 |
+
data: {"phase":"training","step":0,"total":200}
|
| 255 |
+
|
| 256 |
+
event: phase
|
| 257 |
+
data: {"phase":"training","step":200,"total":200}
|
| 258 |
+
|
| 259 |
+
event: diff
|
| 260 |
+
data: {"lora_diff_cid":"blake3:...","examples_seen":4321,"loss_end":0.84}
|
| 261 |
+
```
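
On the SSE transport, a client could read these frames with a parser like the sketch below. It assumes standard `event:`/`data:` framing with a blank line between frames and JSON in the data field; `parse_sse` is a hypothetical helper, not part of any HearthNet library.

```python
import json


def parse_sse(text: str):
    """Yield (event, data) pairs; consecutive data: lines are joined with newlines."""
    event, data_lines = None, []
    # A trailing empty string flushes the last frame if the stream
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        if line == "":
            if data_lines:
                yield event or "message", json.loads("\n".join(data_lines))
            event, data_lines = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip(" "))
```

A participant's client would typically watch for the `diff` frame and hand its `lora_diff_cid` to the round coordinator.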
|
| 262 |
+
|
| 263 |
+
### 4.9 `experimental.fedlearn.round.aggregate@1.0`
|
| 264 |
+
|
| 265 |
+
Coordinator aggregates submitted diffs (FedAvg, weighted by `examples_seen`).
|
| 266 |
+
|
| 267 |
+
```json
|
| 268 |
+
{
|
| 269 |
+
"params": {},
|
| 270 |
+
"input": {
|
| 271 |
+
"round_id": "01HXR...",
|
| 272 |
+
"diff_cids": ["blake3:...","blake3:...","blake3:..."]
|
| 273 |
+
}
|
| 274 |
+
}
|
| 275 |
+
```
|
| 276 |
+
|
| 277 |
+
**Response:** `{"output":{"aggregated_lora_cid":"blake3:...","participants_used":3,"dropped":[]},"meta":{...}}`
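
The FedAvg weighting mentioned above amounts to a weighted mean of the submitted diffs. A minimal pure-Python sketch follows, assuming each diff has already been fetched by CID and decoded into a mapping from tensor name to a flat list of floats; that in-memory representation and the `fedavg` name are illustrative only.

```python
def fedavg(diffs):
    """Average LoRA diffs weighted by examples_seen.

    diffs: list of (weights, examples_seen), where weights maps a tensor
    name to a flat list of floats with identical shapes across participants.
    """
    total = sum(n for _, n in diffs)
    if total <= 0:
        raise ValueError("no examples seen")
    out = {}
    for name, first in diffs[0][0].items():
        out[name] = [
            sum(w[name][i] * n for w, n in diffs) / total
            for i in range(len(first))
        ]
    return out
```

A participant that saw three times as many examples therefore contributes three times the weight to every coordinate.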
|
| 278 |
+
|
| 279 |
+
### 4.10 `experimental.fedlearn.lora.publish@1.0`
|
| 280 |
+
|
| 281 |
+
After aggregation, the anchor publishes the new LoRA to the community.
|
| 282 |
+
|
| 283 |
+
```json
|
| 284 |
+
{
|
| 285 |
+
"params": {},
|
| 286 |
+
"input": {
|
| 287 |
+
"round_id": "01HXR...",
|
| 288 |
+
"aggregated_lora_cid": "blake3:...",
|
| 289 |
+
"base_model": "Qwen2.5-3B-Instruct",
|
| 290 |
+
"version": "niederrhein-emergency-v3"
|
| 291 |
+
}
|
| 292 |
+
}
|
| 293 |
+
```
|
| 294 |
+
|
| 295 |
+
Emits `experimental.fedlearn.lora.published` event. Nodes that have opted into the corpus' LoRA can pull the new version.
|
| 296 |
+
|
| 297 |
+
### 4.11 `evidence.claim.assert@1.0`
|
| 298 |
+
|
| 299 |
+
Assert a claim into the claim graph.
|
| 300 |
+
|
| 301 |
+
**Request:**
|
| 302 |
+
```json
|
| 303 |
+
{
|
| 304 |
+
"params": {},
|
| 305 |
+
"input": {
|
| 306 |
+
"claim_type": "factual",
|
| 307 |
+
"predicate": {"subject":"<Brunnen 12 Issum>","verb":"yields","object":"<200L/h drinkable water>"},
|
| 308 |
+
"evidence": [
|
| 309 |
+
{"kind":"event_ref","value":"01HXR..."},
|
| 310 |
+
{"kind":"document_cid","value":"blake3:..."}
|
| 311 |
+
],
|
| 312 |
+
"ttl_days": 365
|
| 313 |
+
}
|
| 314 |
+
}
|
| 315 |
+
```
|
| 316 |
+
|
| 317 |
+
**Response:** `{"output":{"claim_id":"01HXR...","evidence_level":"cited"},"meta":{...}}`
|
| 318 |
+
|
| 319 |
+
### 4.12 `evidence.claim.dispute@1.0`
|
| 320 |
+
|
| 321 |
+
```json
|
| 322 |
+
{"params":{},"input":{"claim_id":"01HXR...","reason":"...","counter_evidence":[...]}}
|
| 323 |
+
```
|
| 324 |
+
|
| 325 |
+
### 4.13 `evidence.claim.attest@1.0`
|
| 326 |
+
|
| 327 |
+
Trusted member adds an attestation, raising the claim's evidence level.
|
| 328 |
+
|
| 329 |
+
```json
|
| 330 |
+
{"params":{},"input":{"claim_id":"01HXR...","attestation":"I confirmed personally on 2026-06-08"}}
|
| 331 |
+
```
|
| 332 |
+
|
| 333 |
+
A claim becomes `attested` after `policy.evidence.attestations_required_for_attested` distinct trusted attestations (default 3).
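
The threshold rule can be sketched as follows. `evidence_level_after` is a hypothetical helper; how attestations interact with a `disputed` claim is not covered by this excerpt and is not modelled here.

```python
def evidence_level_after(current_level, attesters, trusted, required=3):
    """Return the claim's evidence level after counting distinct trusted attesters.

    required mirrors policy.evidence.attestations_required_for_attested.
    """
    distinct_trusted = set(attesters) & set(trusted)
    if len(distinct_trusted) >= required:
        return "attested"
    return current_level
```

Counting a set rather than a list means repeated attestations from the same member do not move the claim closer to `attested`.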
|
| 334 |
+
|
| 335 |
+
### 4.14 `evidence.claim.query@1.0`
|
| 336 |
+
|
| 337 |
+
```json
|
| 338 |
+
{
|
| 339 |
+
"params": {},
|
| 340 |
+
"input": {
|
| 341 |
+
"predicate_pattern": {"subject":"<Brunnen 12 Issum>","verb":"*"},
|
| 342 |
+
"min_evidence_level":"cited",
|
| 343 |
+
"limit": 20
|
| 344 |
+
}
|
| 345 |
+
}
|
| 346 |
+
```
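
One possible reading of the request above, sketched in memory: `"*"` matches any value, keys absent from the pattern are unconstrained, and evidence levels are ordered as listed in the claim record schema. All three points are assumptions, as is leaving `disputed` claims out of the results; `query_claims` is a hypothetical name.

```python
# Assumed ordering, taken from the order of values in the claim record schema.
LEVELS = ["unverified", "cited", "cross_referenced", "attested"]


def matches(predicate: dict, pattern: dict) -> bool:
    return all(v == "*" or predicate.get(k) == v for k, v in pattern.items())


def query_claims(claims, pattern, min_evidence_level="unverified", limit=20):
    floor = LEVELS.index(min_evidence_level)
    hits = [
        c for c in claims
        if c["evidence_level"] in LEVELS  # "disputed" is excluded (assumption)
        and LEVELS.index(c["evidence_level"]) >= floor
        and matches(c["predicate"], pattern)
    ]
    return hits[:limit]
```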
|
| 347 |
+
|
| 348 |
+
### 4.15 `evidence.provenance.trace@1.0`
|
| 349 |
+
|
| 350 |
+
Walk the evidence chain backwards.
|
| 351 |
+
|
| 352 |
+
```json
|
| 353 |
+
{
|
| 354 |
+
"params": {},
|
| 355 |
+
"input": {"claim_id":"01HXR...","max_depth":8}
|
| 356 |
+
}
|
| 357 |
+
```
|
| 358 |
+
|
| 359 |
+
**Response:** a DAG of `{claim_id, predicate, evidence_summary, asserted_by, asserted_at, evidence_level}` nodes.
|
| 360 |
+
|
| 361 |
+
### 4.16 `civdef.alert.publish@1.0`
|
| 362 |
+
|
| 363 |
+
Civil-defence-grade alert. Differs from `emergency.publish@1.0` (Phase 1) in that the caller must hold a `civdef.role` credential and the alert is signed for legal-evidence retention.
|
| 364 |
+
|
| 365 |
+
```json
|
| 366 |
+
{
|
| 367 |
+
"params": {},
|
| 368 |
+
"input": {
|
| 369 |
+
"client_id": "01HXR...",
|
| 370 |
+
"severity": "warning|alert|emergency|extreme",
|
| 371 |
+
"category": "weather|fire|chemical|flood|infrastructure|other",
|
| 372 |
+
"title": "Stromausfall Issum Mitte",
|
| 373 |
+
"body": "...",
|
| 374 |
+
"areas": [{"polygon":"<geojson>"}],
|
| 375 |
+
"issued_by_role": "thw_ortsverband_geldern",
|
| 376 |
+
"audit_evidence": [{"kind":"role_certificate","value":"..."}]
|
| 377 |
+
}
|
| 378 |
+
}
|
| 379 |
+
```
|
| 380 |
+
|
| 381 |
+
### 4.17 `civdef.role.register@1.0`
|
| 382 |
+
|
| 383 |
+
Anchor registers a community member as holding an authorised KatS role.
|
| 384 |
+
|
| 385 |
+
```json
|
| 386 |
+
{
|
| 387 |
+
"params": {},
|
| 388 |
+
"input": {
|
| 389 |
+
"subject": "ed25519:...",
|
| 390 |
+
"role": "thw_helfer|drk_sanitaeter|feuerwehr|katastrophenschutzbeauftragter",
|
| 391 |
+
"role_certificate_cid": "blake3:...",
|
| 392 |
+
"expires_at": "2027-01-01T00:00:00Z"
|
| 393 |
+
}
|
| 394 |
+
}
|
| 395 |
+
```
|
| 396 |
+
|
| 397 |
+
### 4.18 `civdef.audit.export@1.0`
|
| 398 |
+
|
| 399 |
+
Stream the audit trail for a time range. Used by KatS auditors.
|
| 400 |
+
|
| 401 |
+
```json
|
| 402 |
+
{
|
| 403 |
+
"params": {},
|
| 404 |
+
"input": {"from":"2026-04-01T00:00:00Z","to":"2026-06-01T00:00:00Z"}
|
| 405 |
+
}
|
| 406 |
+
```
|
| 407 |
+
|
| 408 |
+
**Stream frames:** one frame per audit record; signed batches every 1000 records for tamper evidence.

### 4.19 `protocol.conformance.report@1.0`

Generate a conformance report against the [X09](cross-cutting/X09-conformance-suite.md) suite.

```json
{
  "params": {},
  "input": {"suite_version":"3.0"}
}
```

**Response:** `{"output":{"report_cid":"blake3:...","passed":214,"failed":3,"skipped":17},"meta":{...}}`

### 4.20 `protocol.version.list@1.0`

Returns the contract versions this node supports and the conformance suite versions it has passed.

```json
{
  "output": {
    "contract_versions": ["1.0","2.0","3.0"],
    "conformance_passed": [{"suite":"3.0","report_cid":"blake3:..."}],
    "implementation": {"name":"hearthnet-py","version":"0.7.2","commit":"abc123"}
  }
}
```

---

## 5. Wire format additions

### 5.1 Tensor transport (binary)

For `experimental.distributed_llm.shard.serve@1.0` only. WebSocket frames carry **binary payloads** with a 16-byte header:

```
+---------------+---------------+---------------+
| 4B chunk_id   | 4B chunk_seq  | 4B total_seq  |
| 2B flags      | 2B reserved   |
+---------------+---------------+---------------+
| tensor chunk (≤ 1 MB)                         |
+-----------------------------------------------+
```

`flags`:
- `0x0001` LAST (this is the final chunk of the message)
- `0x0002` COMPRESSED (zstd-compressed; only when payload ≥ `TENSOR_COMPRESSION_THRESHOLD_BYTES`)
- `0x0004` FP16 (else FP32)

See [X08](cross-cutting/X08-tensor-transport.md) for the protocol.
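
A minimal sketch of how the 16-byte header above could be packed and parsed with the standard `struct` module. The byte order is not stated in this section, so network (big-endian) order is an assumption here, as are the names `ChunkHeader`, `encode_frame`, and `decode_frame`; X08 remains the authority on the actual layout.

```python
import struct
from typing import NamedTuple, Tuple

# Assumption: big-endian ("!") field order chunk_id, chunk_seq, total_seq, flags, reserved.
HEADER = struct.Struct("!IIIHH")  # 4 + 4 + 4 + 2 + 2 = 16 bytes
MAX_CHUNK_BYTES = 1_048_576       # "≤ 1 MB" payload limit from the diagram

FLAG_LAST = 0x0001
FLAG_COMPRESSED = 0x0002
FLAG_FP16 = 0x0004


class ChunkHeader(NamedTuple):
    chunk_id: int
    chunk_seq: int
    total_seq: int
    flags: int
    reserved: int = 0


def encode_frame(header: ChunkHeader, payload: bytes) -> bytes:
    """Prefix a tensor chunk with its 16-byte header."""
    if len(payload) > MAX_CHUNK_BYTES:
        raise ValueError("tensor chunk exceeds 1 MB")
    return HEADER.pack(*header) + payload


def decode_frame(frame: bytes) -> Tuple[ChunkHeader, bytes]:
    """Split a binary frame into its parsed header and the raw chunk."""
    header = ChunkHeader(*HEADER.unpack_from(frame))
    return header, frame[HEADER.size:]
```

A receiver would test `header.flags & FLAG_LAST` to detect the end of a message and `header.flags & FLAG_FP16` to choose the element width when rebuilding the tensor.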

### 5.2 Claim records

Claims are stored in their own Merkle-DAG; they reference events but are not events. A claim record header:

```
X-HearthNet-Claim: 01HXR...
X-HearthNet-Claim-Asserted-By: ed25519:...
```

Claim records flow through the same transport but with `Content-Type: application/vnd.hearthnet.claim+json`.

### 5.3 Civil-defence audit signatures

`civdef.*` capability calls always require both a per-call signature (per X01) **and** a `civdef.role` credential reference in headers:

```
X-HearthNet-CivDef-Role: thw_helfer
X-HearthNet-CivDef-Role-Cert: blake3:...
```

Without these, the call returns `civdef_audit_required`.
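
The header precondition can be illustrated with a small pre-dispatch check. This sketch only tests that both headers are present and non-empty; verifying the X01 signature and the credential itself (expiry, revocation, which would map to `civdef_role_invalid`) is out of scope. The function name `civdef_precheck` and the case-insensitive header lookup are assumptions.

```python
from typing import Mapping, Optional

REQUIRED_CIVDEF_HEADERS = (
    "X-HearthNet-CivDef-Role",
    "X-HearthNet-CivDef-Role-Cert",
)


def civdef_precheck(capability: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return an error code, or None when the header precondition holds."""
    if not capability.startswith("civdef."):
        return None  # the rule applies to civdef.* calls only
    # Assumption: header names are matched case-insensitively, as in HTTP.
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in REQUIRED_CIVDEF_HEADERS:
        if not lowered.get(name.lower(), "").strip():
            return "civdef_audit_required"
    return None
```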

---

## 6. Manifests

### 6.1 Node manifest delta

```json
{
  "contract_version": "3.0",
  "experimental_capabilities_enabled": false,
  "civdef_profile": {
    "active": false,
    "authority": "",
    "audit_endpoint": ""
  },
  "research_opt_in": {
    "fedlearn": false,
    "distributed_inference": false,
    "moe_human_routing": false
  }
}
```

A node with `experimental_capabilities_enabled=false` does not advertise any `experimental.*` capabilities.
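
As a rough illustration of that rule, the advertised set could be derived from the manifest like this. The helper name `advertised_capabilities` and the idea of a flat list of registered capability names are assumptions made for the example.

```python
from typing import Dict, List


def advertised_capabilities(manifest: Dict, registered: List[str]) -> List[str]:
    """Drop experimental.* capabilities unless the manifest enables them."""
    if manifest.get("experimental_capabilities_enabled", False):
        return list(registered)
    return [name for name in registered if not name.startswith("experimental.")]
```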

### 6.2 Community policy delta

```yaml
research:
  enable: false
  enabled_capabilities: []

fedlearn:
  participate: false
  share_compute_with_federated: false
  dp_noise_scale_min: 0.0

civdef:
  active: false
  authority: ""               # e.g. "Kreis Kleve Bevölkerungsschutz"
  audit_export_to: ""

evidence:
  attestations_required_for_attested: 3
  default_claim_ttl_days: 365
  retain_disputed_claims_days: 1825
```

---

## 7. Events (additive to v2.0 §7.1)

```
experimental.shard.advertised
experimental.shard.retired
experimental.fedlearn.round.opened
experimental.fedlearn.round.closed
experimental.fedlearn.lora.published
experimental.moe.expert.registered
experimental.moe.expert.unregistered
evidence.claim.asserted
evidence.claim.disputed
evidence.claim.attested
evidence.claim.superseded
civdef.alert.published
civdef.role.registered
civdef.role.revoked
civdef.audit.exported
protocol.conformance.reported
```

### Selected schemas

#### `evidence.claim.asserted`

```json
{
  "claim_id": "01HXR...",
  "claim_type": "factual",
  "predicate": {...},
  "asserted_by": "ed25519:...",
  "evidence_level": "cited",
  "claim_payload_cid": "blake3:..."
}
```

The event references the full claim record by CID; the claim itself lives in the claim store (M30 §4).

#### `civdef.alert.published`

```json
{
  "client_id": "01HXR...",
  "severity": "alert",
  "category": "infrastructure",
  "title": "Stromausfall Issum Mitte",
  "issued_by": "ed25519:...",
  "issued_by_role": "thw_ortsverband_geldern",
  "areas": [{...}]
}
```

#### `experimental.fedlearn.lora.published`

```json
{
  "round_id": "01HXR...",
  "base_model": "Qwen2.5-3B-Instruct",
  "aggregated_lora_cid": "blake3:...",
  "version": "niederrhein-emergency-v3",
  "participants": 5,
  "objective": "next_token_loss",
  "dp_noise_scale": 0.0
}
```

---

## 8. Pub-sub topics (additive)

| Topic | Producer | Subscriber |
|-------|----------|------------|
| `experimental.shard.advertised` | shard host | orchestrators |
| `experimental.fedlearn.round.opened.<round_id>` | coordinator | members |
| `experimental.moe.expert.registered` | expert | router |
| `evidence.claim.<claim_id>.changed` | asserter / disputer | watchers |
| `civdef.alert.<area_hash>` | civdef caller | members in area |

---

## 9. Errors: complete Phase 3 set

(additive to v2.0 §9)

| Code | When |
|------|------|
| `experimental_disabled` | Caller asked for an experimental capability the node has not opted into |
| `shard_unavailable` | A required shard host failed mid-pipeline |
| `pipeline_stalled` | No progress within `DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S` |
| `fedlearn_round_quorum` | Round closed for lack of participants |
| `fedlearn_diff_invalid` | Submitted diff failed schema or norm bounds |
| `evidence_contradiction` | A new claim directly contradicts an attested claim; explicit override needed |
| `civdef_audit_required` | Operation rejected because civdef profile is active and call is not properly signed |
| `civdef_role_invalid` | Civdef role credential is missing, expired, or revoked |
| `conformance_failed` | A `protocol.*` operation depends on a passed conformance suite that this node has not passed |

---

## 10. Compatibility

- v3 contract nodes are backward-compatible with v2 nodes: v2 nodes simply do not see `experimental.*` capabilities and do not understand claim records (the relevant event types are skipped on replay).
- A v3 community must include at least one anchor that has passed the `protocol.conformance.report@1.0` suite at level 3.0, otherwise capabilities under `civdef.*` and the claim graph features remain inert (the spec calls this **degraded v3**; everything else still works).
- Promotion of an `experimental.X` to `X` is a normal v3 minor contract bump; the experimental form remains registered for one minor cycle.
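
The degraded-v3 rule in the list above can be sketched as a pure function over the anchors' `protocol.version.list@1.0` outputs. This is an illustration under assumptions: the function names are hypothetical, and treating "claim graph features" as everything in an `evidence.` namespace is a guess based on the event names in §7.

```python
from typing import Dict, List

REQUIRED_SUITE = "3.0"
GATED_PREFIXES = ("civdef.", "evidence.")  # assumption: claim graph == evidence.*


def community_mode(anchor_outputs: List[Dict]) -> str:
    """Return "v3" if any anchor has passed suite 3.0, else "degraded-v3"."""
    for output in anchor_outputs:
        for entry in output.get("conformance_passed", []):
            if entry.get("suite") == REQUIRED_SUITE:
                return "v3"
    return "degraded-v3"


def capability_active(capability: str, mode: str) -> bool:
    """Gated capabilities stay inert in degraded v3; everything else works."""
    return mode == "v3" or not capability.startswith(GATED_PREFIXES)
```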

---

## 11. Glossary additions

| Term | Meaning |
|------|---------|
| Shard | One contiguous range of transformer layers, served by one node |
| Pipeline | The chain of shards that produces a single LLM forward-pass |
| Expert | A routable subsystem (model, service, or human) that can answer a class of requests |
| Round | A federated-learning training session bounded in time |
| Diff | A LoRA delta tensor produced by one round participant |
| Claim | A signed assertion of a predicate, with evidence, in the claim graph |
| Attestation | A signed endorsement by a trusted member that a claim is correct |
| KatS | Katastrophenschutz (German civil protection) |
| Conformance | The property of an implementation passing the X09 suite at a stated level |

docs/p2_p3/IMPLEMENTATION_REFERENCE.md

# HearthNet Phase 3: Spec Set Overview

**Phase 3 scope:** research-shaped, 6–12 months. This is where HearthNet stops being a product and starts being a protocol. Each module here is an investment in a long-term capability where the engineering is the easy part; the hard part is establishing trust, governance, and standards.

**Stance:** Phase 3 specs are **roadmaps**, not contracts. Where a Phase-1/2 spec answers "what does this *do*?", a Phase-3 spec answers "what would we *build* if we were ready to commit?". They are concrete enough to start from and loose enough to be wrong about details without invalidating the direction.

---

## 0. Reading these specs

Phase 3 specs deviate from the Phase 1 / 2 template in three respects:

1. **Stability tag is `experimental` for new capabilities** unless explicitly promoted later. Mesh nodes ignore experimental capabilities unless the operator opts in via `policy.research.enable = true`.
2. **Each module carries an "Open research questions" section** that is longer than the spec itself, by design. Phase 3 modules answer *some* of their open questions before shipping; the rest stay open.
3. **Acceptance criteria are described, not enumerated**. The point isn't to grade an implementation against a checklist; it's to say "we'll know this is working when…"

If you read a Phase 3 spec and feel uncertain about how something works, that uncertainty is faithful to the state of the work. The spec is doing its job by being honest about that.

---

## 1. Module map (Phase 3)

### New numbered modules

| ID  | Module | Spec file | Concern |
|-----|--------|-----------|---------|
| M26 | Distributed Inference | `modules/M26-distributed-inference.md` | Layer-sharded LLMs across nodes (Petals-style), small models only |
| M27 | MoE Expert Routing | `modules/M27-moe-routing.md` | Route queries to the right expert (machine or human) via learned scorer |
| M28 | Federated Learning | `modules/M28-fedlearn.md` | FedAvg on LoRA layers; per-community fine-tuning without sharing data |
| M29 | LoRa Long-Distance Beacons | `modules/M29-lora-beacons.md` | 868 MHz "community alive" beacons; no AI traffic; emergency-only |
| M30 | Evidence / EBKH | `modules/M30-evidence-ebkh.md` | Claim graph alongside the event log; provenance + verifiability |
| M31 | Civil Defence Pilot | `modules/M31-civil-defense.md` | THW / DRK / KatS bridge; compliance profile; audit trail |
| M32 | Protocol Standardisation | `modules/M32-protocol-standard.md` | Reference implementation, conformance suite, governance for the spec |

### New cross-cutting modules

| ID  | Module | Spec file | Concern |
|-----|--------|-----------|---------|
| X08 | Tensor Transport | `cross-cutting/X08-tensor-transport.md` | High-throughput chunked tensor passing for M26 |
| X09 | Conformance Suite | `cross-cutting/X09-conformance-suite.md` | Black-box tests defining what "HearthNet-compliant" means |

### Modifications to earlier modules

| Phase 1/2 module | Phase 3 extension |
|------------------|-------------------|
| M03 Bus | Optional MoE routing layer between dispatcher and handler (M27) |
| M04 LLM | Optional `experimental.distributed_llm.chat@1.0` backend (M26) |
| X02 Event log | Optional `evidence.*` claim records side-by-side with events (M30) |
| M14 Federation | Federated learning rounds use federation as the trust substrate (M28) |
| X03 Observability | Per-call expert-routing trace; per-shard tensor-transport metrics (M27, X08) |

---

## 2. Dependency graph (Phase 3 additions on top of Phases 1–2)

```
   ┌─────────────────────────────────────────────┐
   │        Phase 1 + Phase 2 (unchanged)        │
   └────┬────────────────┬──────────────┬────────┘
        │                │              │
        ▼                ▼              ▼
   ┌──────────┐     ┌──────────┐   ┌──────────┐
   │   X08    │     │   M27    │   │   M30    │
   │  Tensor  │     │   MoE    │   │   EBKH   │
   │ Transp.  │     │ Routing  │   │ Evidence │
   └────┬─────┘     └────┬─────┘   └────┬─────┘
        ▼                │              │
   ┌──────────┐          │              │
   │   M26    │          │              │
   │ Distrib. │          │              │
   │  Infer.  │          │              │
   └──────────┘          │              │
                         ▼              ▼
                    ┌──────────┐   ┌──────────┐
                    │   M28    │   │   M31    │
                    │ FedLearn │   │ CivDef.  │
                    └──────────┘   └──────────┘

   Standalone (no software deps, governance / hardware):
   ┌──────────┐
   │   M29    │   (hardware)
   │   LoRa   │
   │ Beacons  │
   └──────────┘
   ┌──────────┐
   │   X09    │   (process)
   │ Conform. │
   └──────────┘
   ┌──────────┐
   │   M32    │   (governance)
   │ Standard │
   └──────────┘
```

Most Phase 3 modules are independent of each other. The exceptions:
- M26 depends on X08
- M27 informs M26 (MoE routing picks which expert/shard)
- M28 reuses M14 federation for cross-community rounds
- M31 reuses M30 for evidence-grade emergency claims
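
One way to turn the exceptions listed above into a workable build sequence is a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+). Treating "depends on" and "reuses" as hard edges, and leaving "M27 informs M26" out as a soft relationship, is an interpretation made for this example rather than something the spec states.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of modules that must exist before it.
HARD_DEPS = {
    "M26": {"X08"},   # distributed inference needs tensor transport
    "M28": {"M14"},   # fedlearn reuses federation (a Phase 2 module)
    "M31": {"M30"},   # civil defence reuses the evidence graph
    "M27": set(),
    "M29": set(),
    "M32": set(),
    "X09": set(),
}

# static_order() yields every node after all of its prerequisites.
build_order = list(TopologicalSorter(HARD_DEPS).static_order())
```

Any order produced this way places X08 before M26, M14 before M28, and M30 before M31; the relative order of unrelated modules is left to the sorter.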

---

## 3. File tree additions

```
hearthnet/
├── distributed_inference/         # M26
│   ├── __init__.py
│   ├── shard.py
│   ├── pipeline.py
│   ├── routing.py
│   └── backends/
│       ├── petals_like.py
│       └── small_model_layered.py
│
├── moe/                           # M27
│   ├── __init__.py
│   ├── router.py
│   ├── scorer.py
│   └── human_in_the_loop.py
│
├── fedlearn/                      # M28
│   ├── __init__.py
│   ├── coordinator.py
│   ├── round.py
│   ├── lora_diff.py
│   └── aggregation.py
│
├── lora_beacons/                  # M29: hardware integration; tiny Python surface
│   ├── __init__.py
│   ├── beacon_bridge.py           # serial protocol to a LoRa USB stick
│   └── policy.py
│
├── evidence/                      # M30
│   ├── __init__.py
│   ├── claim.py
│   ├── claim_graph.py
│   ├── provenance.py
│   └── ebkh_bridge.py             # bridge to Christof's EBKH v3+
│
├── civil_defense/                 # M31
│   ├── __init__.py
│   ├── profile.py                 # THW / DRK / KatS member types
│   ├── audit.py
│   └── nrw_katastrophenschutz.py
│
├── transport/
│   └── tensor.py                  # X08
│
└── conformance/                   # X09
    ├── __init__.py
    ├── runner.py
    ├── suites/
    │   ├── identity.py
    │   ├── transport.py
    │   ├── bus.py
    │   ├── services.py
    │   └── federation.py
    └── report.py

protocol/                          # M32: separate top-level dir at repo root
├── README.md
├── spec/                          # the protocol spec, decoupled from the impl
│   ├── 00-overview.md             # mirror of CAPABILITY_CONTRACT but
│   ├── 01-identity.md             # implementation-agnostic
│   └── ...
└── governance/
    ├── CHANGELOG.md
    ├── CONTRIBUTING.md
    └── ROADMAP.md
```

---

## 4. Conventions delta from Phase 2

### 4.1 New `experimental` namespace

A Phase-3 capability MAY be advertised as `experimental.<name>@<ver>`. Mesh nodes default to **not registering** experimental capabilities; the operator must opt in via:

```toml
[policy.research]
enable = true
enabled_capabilities = ["experimental.distributed_llm.chat@1.0", "experimental.fedlearn.round.*"]
```

Once a capability is sufficiently proven, it is promoted out of the `experimental.` prefix in a contract bump.
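
A minimal sketch of how a node might evaluate this opt-in at registration time. The matching semantics of `enabled_capabilities` are not defined in this section; shell-style globbing via the standard `fnmatch` module is an assumption suggested by the `experimental.fedlearn.round.*` entry, and the function name `is_enabled` is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict


def is_enabled(capability: str, policy: Dict) -> bool:
    """Decide whether a capability may be registered under the given policy."""
    if not capability.startswith("experimental."):
        return True  # only the experimental namespace is gated
    research = policy.get("research", {})
    if not research.get("enable", False):
        return False  # default: experimental capabilities are not registered
    patterns = research.get("enabled_capabilities", [])
    # Assumption: entries are case-sensitive glob patterns.
    return any(fnmatchcase(capability, pattern) for pattern in patterns)
```

With the policy shown above, `experimental.fedlearn.round.join@1.0` would be accepted through the wildcard entry, while an unlisted capability such as `experimental.moe.route.query@1.0` would be refused.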

### 4.2 New type aliases

```python
# additions to hearthnet/types.py
from typing import Literal  # needed unless types.py already imports it

ShardID       = str       # "<model_id>:<layer_range>"
ExpertID      = str       # opaque, refers to a routable subsystem
ClaimID       = str       # ULID
RoundID       = str       # fedlearn round identifier (ULID)
LoraBeaconID  = str       # 8-byte hex, hardware-issued
EvidenceLevel = Literal["unverified","cited","cross_referenced","attested","disputed"]
ExpertKind    = Literal["model","human","service","external"]
```

### 4.3 New constants

```python
# additions to hearthnet/constants.py (Phase 3)

# Distributed inference (M26)
DISTRIBUTED_MAX_SHARDS_PER_REQUEST            = 16
DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S            = 30
DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES  = 2

# MoE routing (M27)
MOE_ROUTER_TOP_K                 = 3
MOE_ROUTER_TRAIN_MIN_EXAMPLES    = 200
MOE_ROUTER_RETRAIN_EVERY_HOURS   = 24

# Federated learning (M28)
FEDLEARN_MAX_ROUND_MINUTES       = 120
FEDLEARN_MIN_PARTICIPANTS        = 3
FEDLEARN_MAX_LORA_RANK           = 64
FEDLEARN_GRAD_CLIP               = 1.0
FEDLEARN_DP_NOISE_SCALE_DEFAULT  = 0.0     # differential privacy is off by default

# Evidence (M30)
EVIDENCE_CLAIM_TTL_DAYS_DEFAULT  = 365
EVIDENCE_MAX_PROVENANCE_DEPTH    = 16

# Civil defence (M31)
CIVDEF_AUDIT_RETENTION_YEARS     = 10
CIVDEF_HEARTBEAT_SECONDS         = 60

# Tensor transport (X08)
TENSOR_CHUNK_BYTES                  = 1_048_576   # 1 MB
TENSOR_FLOW_CONTROL_WINDOW          = 16          # chunks
TENSOR_COMPRESSION_THRESHOLD_BYTES  = 65_536

# LoRa beacons (M29)
LORA_BEACON_PERIOD_SECONDS_DEFAULT  = 600   # 10 minutes
LORA_BEACON_MAX_PAYLOAD_BYTES       = 32
```
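
To give a feel for what the tensor-transport constants imply, the following sketch computes how many chunks a tensor needs and how many flow-control windows that fills. Reading `TENSOR_FLOW_CONTROL_WINDOW` as "at most 16 chunks in flight per window" is an assumption, as is the helper name `chunk_plan`; X08 defines the real flow-control behaviour.

```python
from typing import Tuple

TENSOR_CHUNK_BYTES = 1_048_576        # from the constants above
TENSOR_FLOW_CONTROL_WINDOW = 16       # chunks


def chunk_plan(num_elements: int, bytes_per_element: int) -> Tuple[int, int]:
    """Return (chunks, windows) needed to ship one uncompressed tensor."""
    total_bytes = num_elements * bytes_per_element
    # Integer ceiling division avoids float rounding on large tensors.
    chunks = (total_bytes + TENSOR_CHUNK_BYTES - 1) // TENSOR_CHUNK_BYTES
    windows = (chunks + TENSOR_FLOW_CONTROL_WINDOW - 1) // TENSOR_FLOW_CONTROL_WINDOW
    return chunks, windows


# Example: hidden states for 512 tokens at width 4096 in FP16 (2 bytes each)
# occupy 512 * 4096 * 2 = 4_194_304 bytes, i.e. exactly 4 chunks in 1 window.
fp16_plan = chunk_plan(512 * 4096, 2)
```

The same activations in FP32 double to 8 chunks, which is one reason the FP16 flag in the frame header matters for LAN-scale pipelines.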

---

## 5. Build order (Phase 3)

Phase 3 is not a release; it is a set of long-running tracks. Suggested ordering by independence + value:

| Track | Modules | Outcome |
|-------|---------|---------|
| A | X09 Conformance + M32 Standard | Other people can build HearthNet-compliant nodes |
| B | M30 Evidence / EBKH | Marketplace claims and emergency posts carry provenance |
| C | M27 MoE Routing (machines only) | Better answers for free; routes RAG queries to best-suited backend |
| D | M27 + M28 (human routing) | Neighbour gets pinged when their expertise matches |
| E | M28 FedLearn | Communities co-train a small LoRA without sharing source data |
| F | X08 + M26 Distributed Inference | Two anchors jointly serve a 7B model; large models become feasible LAN-wide |
| G | M29 LoRa Beacons | Resilient "I am alive" pings during regional internet outages |
| H | M31 Civil Defence Pilot | A real Niederrhein THW Ortsverband uses HearthNet for an exercise |

Tracks can run in parallel. None of them block the existing Phase-2 system.

---

## 6. Spec versioning

- Capability Contract bumps to **v3.0** but the bump is *additive*. v2 nodes coexist with v3 nodes; experimental capabilities simply aren't seen by v2 nodes.
- The first concrete deliverable of Track A (M32) is to **decouple** the protocol spec from the implementation. After that, the contract has its own version track separate from the Python implementation's version.

---

## 7. Out-of-band documents (Phase 3)

- **RESEARCH_AGENDA.md**: the deeper "why" for each module; intended audience: PhD students and grant reviewers
- **GOVERNANCE.md**: how spec changes are proposed, reviewed, and accepted; ties into M32
- **ETHICS_REVIEW.md**: the framework for evaluating MoE-driven routing-to-humans (M27) and fedlearn-on-personal-data (M28)
- **CIVDEF_AGREEMENT_TEMPLATE.md**: the MoU template for a civil-defence pilot

---

## 8. What is NOT in Phase 3

Even with all of Phase 3 done, the following remain explicit non-goals:

- A central directory of communities. There is no "HearthNet.com" listing all communities. Discovery is via word of mouth + DHT + federation. Pushed indefinitely.
- An app store for capabilities. Capabilities are code in the source tree, reviewed by maintainers. Not pluggable at runtime by untrusted code.
- A consensus protocol (Paxos, Raft). Communities do not vote on shared state beyond event-log gossip. Federation does not imply consensus.
- A cryptocurrency / token economy. Not even for fedlearn incentives. Reputational signals only.
- AGI. Even the distributed inference module targets at-most-mid-sized models (7B-class). The thesis is "small models close to people are more useful than large models far away", and Phase 3 doesn't change that.

docs/p2_p3/IMPLEMENTATION_REFERENCE_p3.md

# HearthNet Phase 3: Implementation Reference

**Spec set:** v3.0 (*experimental*)
**Status:** Research-shaped. Names and contracts may shift before promotion to stable.

This document is the symbol index and quick-reference for everything introduced in Phase 3. It mirrors the structure of Phase 1's and Phase 2's `IMPLEMENTATION_REFERENCE.md`, so a maintainer can find the relevant spec section, file, class, capability, event, or constant by name.

Read this with the Phase 1 + Phase 2 references already in hand. Phase 3 is purely additive; it does not redefine anything from earlier phases.

---

## 1. Module index

| ID | Name | Status | Spec file |
|----|------|--------|-----------|
| M26 | Distributed Inference (layer sharding) | experimental | [phase-3/modules/M26-distributed-inference.md](modules/M26-distributed-inference.md) |
| M27 | MoE Expert Routing | experimental | [phase-3/modules/M27-moe-routing.md](modules/M27-moe-routing.md) |
| M28 | Federated Learning (LoRA aggregation) | experimental | [phase-3/modules/M28-fedlearn.md](modules/M28-fedlearn.md) |
| M29 | LoRa Hardware Beacons | experimental | [phase-3/modules/M29-lora-beacons.md](modules/M29-lora-beacons.md) |
| M30 | Evidence Graph & EBKH Integration | experimental | [phase-3/modules/M30-evidence-ebkh.md](modules/M30-evidence-ebkh.md) |
| M31 | Civil Defense (NRW Bevölkerungsschutz) | experimental | [phase-3/modules/M31-civil-defense.md](modules/M31-civil-defense.md) |
| M32 | Protocol Standardisation & Conformance | provisional | [phase-3/modules/M32-protocol-standard.md](modules/M32-protocol-standard.md) |
| X08 | Tensor Transport | experimental | [phase-3/cross-cutting/X08-tensor-transport.md](cross-cutting/X08-tensor-transport.md) |
| X09 | Conformance Suite | provisional | [phase-3/cross-cutting/X09-conformance-suite.md](cross-cutting/X09-conformance-suite.md) |

All `experimental` modules are gated by per-module feature flags in `hearthnet/config.py` and default to `False`.

---
|
| 29 |
+
|
| 30 |
+
## 2. Capability index
|
| 31 |
+
|
| 32 |
+
All Phase 3 capabilities (except `protocol.*`) live in the `experimental.*` namespace. The bus refuses to register them unless the corresponding feature flag is on.

### 2.1 Distributed Inference (M26)

| Capability | Module | Notes |
|------------|--------|-------|
| `experimental.distributed_llm.shard.list` | M26 | Local advertisement of shards we host |
| `experimental.distributed_llm.shard.connect` | M26 | Negotiate an X08 tensor session |
| `experimental.distributed_llm.shard.forward` | M26 | Run a forward pass through a hosted shard |
| `experimental.distributed_llm.pipeline.plan` | M26 | Construct a layer pipeline for a model |
| `experimental.distributed_llm.pipeline.run` | M26 | Execute a planned pipeline |
| `experimental.distributed_llm.pipeline.status` | M26 | Pipeline state and stats |

### 2.2 MoE Routing (M27)

| Capability | Module | Notes |
|------------|--------|-------|
| `experimental.moe.expert.register` | M27 | Register self as an expert for some topics |
| `experimental.moe.expert.list` | M27 | List registered experts |
| `experimental.moe.expert.unregister` | M27 | Withdraw expert registration |
| `experimental.moe.route.query` | M27 | Get top-K experts for a query |
| `experimental.moe.route.handoff` | M27 | Initiate human-in-the-loop handoff |
| `experimental.moe.feedback.record` | M27 | Record outcome for scorer training |

### 2.3 Federated Learning (M28)

| Capability | Module | Notes |
|------------|--------|-------|
| `experimental.fedlearn.round.announce` | M28 | Coordinator announces a round |
| `experimental.fedlearn.round.list` | M28 | List open rounds |
| `experimental.fedlearn.round.join` | M28 | Participant joins a round |
| `experimental.fedlearn.round.submit` | M28 | Submit gradient delta |
| `experimental.fedlearn.round.status` | M28 | Get round state |
| `experimental.fedlearn.round.finalize` | M28 | Coordinator finalises and aggregates |
| `experimental.fedlearn.adapter.fetch` | M28 | Fetch an aggregated adapter by SHA |
| `experimental.fedlearn.adapter.apply` | M28 | Apply adapter to session or node |

### 2.4 LoRa Beacons (M29)

| Capability | Module | Notes |
|------------|--------|-------|
| `experimental.lora.status` | M29 | Hardware and link status |
| `experimental.lora.beacon.send` | M29 | Send a normal beacon |
| `experimental.lora.panic.send` | M29 | Send a panic burst |
| `experimental.lora.peer.list` | M29 | Known LoRa peers |
| `experimental.lora.peer.verify` | M29 | TOFU-confirm a peer's NodeID binding |
| `experimental.lora.recent_beacons` | M29 | Recently received (RX) beacons |
| `experimental.lora.duty_cycle` | M29 | Current duty-cycle budget status |

### 2.5 Evidence Graph (M30)

| Capability | Module | Notes |
|------------|--------|-------|
| `experimental.evidence.claim.assert` | M30 | Assert a new claim |
| `experimental.evidence.claim.dispute` | M30 | Dispute an existing claim |
| `experimental.evidence.claim.attest` | M30 | Attest to an existing claim |
| `experimental.evidence.claim.get` | M30 | Fetch a single claim by ID |
| `experimental.evidence.claim.query` | M30 | Query claims by triple |
| `experimental.evidence.provenance.trace` | M30 | Walk the derivation graph |
| `experimental.evidence.subject.summary` | M30 | Multi-claim summary for a subject |
| `experimental.evidence.ebkh.sync` | M30 | Sync with external EBKH endpoint |

### 2.6 Civil Defense (M31)

| Capability | Module | Notes |
|------------|--------|-------|
| `experimental.civdef.alert.publish` | M31 | Role-cert-gated alert publication |
| `experimental.civdef.alert.cancel` | M31 | Cancel an active alert |
| `experimental.civdef.alert.list` | M31 | List active alerts (filtered) |
| `experimental.civdef.alert.get` | M31 | Fetch an alert envelope |
| `experimental.civdef.alert.subscribe` | M31 | Subscribe to alerts matching a filter |
| `experimental.civdef.alert.ack` | M31 | Acknowledge an alert |
| `experimental.civdef.alert.acks` | M31 | List acks for an alert |
| `experimental.civdef.role.register` | M31 | Register a role certificate |
| `experimental.civdef.role.list` | M31 | List registered certificates |
| `experimental.civdef.role.revoke` | M31 | Revoke a certificate |
| `experimental.civdef.audit.export` | M31 | Export tamper-evident audit chain |

### 2.7 Protocol (M32): stable

| Capability | Module | Notes |
|------------|--------|-------|
| `protocol.version.list` | M32 | Versions supported |
| `protocol.self.describe` | M32 | Implementation descriptor |
| `protocol.conformance.report` | M32 | Run / fetch conformance report |
| `protocol.registry.list` | M32 | Known implementations |
| `protocol.registry.announce` | M32 | Announce self to registry |

---

## 3. Event types

All Phase 3 event types follow the convention `<area>.<entity>.<verb>` and are recorded in the X02 event log.
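
A small parser makes the naming convention concrete. This is a sketch only: `EventName` and `parse_event_type` are hypothetical names, and because a few names in the lists for this section have four segments (for example `civdef.alert.dropped.revoked`), the sketch assumes that everything after the entity belongs to the verb.

```python
from typing import NamedTuple


class EventName(NamedTuple):
    area: str
    entity: str
    verb: str


def parse_event_type(name: str) -> EventName:
    """Split '<area>.<entity>.<verb>' into its parts.

    Assumption: any segments after the entity are kept together as the verb.
    """
    parts = name.split(".")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"not an <area>.<entity>.<verb> event type: {name!r}")
    area, entity, *rest = parts
    return EventName(area, entity, ".".join(rest))


simple = parse_event_type("moe.route.computed")
qualified = parse_event_type("civdef.alert.dropped.revoked")
```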

### 3.1 Distributed inference

```
distributed_llm.shard.advertised
distributed_llm.shard.withdrawn
distributed_llm.pipeline.planned
distributed_llm.pipeline.started
distributed_llm.pipeline.shard_failed
distributed_llm.pipeline.failover
distributed_llm.pipeline.completed
distributed_llm.pipeline.cancelled
```

### 3.2 MoE

```
moe.expert.registered
moe.expert.unregistered
moe.route.computed
moe.handoff.initiated
moe.handoff.accepted
moe.handoff.declined
moe.handoff.timed_out
moe.feedback.recorded
```

### 3.3 Federated learning

```
fedlearn.round.announced
fedlearn.round.joined
fedlearn.round.consent.granted
fedlearn.round.submitted
fedlearn.round.aggregated
fedlearn.round.completed
fedlearn.round.aborted
fedlearn.round.takeover
fedlearn.adapter.published
fedlearn.adapter.applied
```

### 3.4 LoRa

```
lora.beacon.sent
lora.beacon.received
lora.panic.sent
lora.panic.received
lora.peer.unknown
lora.peer.verified
lora.peer.conflict
lora.duty_cycle.exhausted
lora.duty_cycle.overridden
lora.rx.dropped
```

### 3.5 Evidence

```
evidence.claim.asserted
evidence.claim.attested
evidence.dispute.opened
evidence.dispute.retracted
evidence.ebkh.synced
evidence.ebkh.sync_partial
```

### 3.6 Civil defense

```
civdef.alert.published
civdef.alert.forwarded
civdef.alert.acked
civdef.alert.cancelled
civdef.alert.dropped.revoked
civdef.alert.foreign_role
civdef.role.registered
civdef.role.revoked
civdef.audit.checkpointed
civdef.audit.broken
```

### 3.7 Protocol

```
protocol.descriptor.announced
protocol.registry.updated
protocol.conformance.ran
```

---

## 4. Type aliases (added to `hearthnet/types.py`)

```python
from dataclasses import dataclass
from typing import Literal, NewType

ShardID       = NewType("ShardID", str)        # "model:layer_range[:tier]"
ExpertID      = NewType("ExpertID", str)       # "human:..." | "model:..." | "service:..." | "external:..."
ExpertKind    = Literal["human", "model", "service", "external"]
ClaimID       = NewType("ClaimID", str)        # base32 of SHA-256 canonical claim
SourceID      = NewType("SourceID", str)
EvidenceLevel = Literal["unverified", "cited", "cross_referenced", "attested", "disputed"]
RoundID       = NewType("RoundID", str)        # ULID
LoraBeaconID  = NewType("LoraBeaconID", str)
LoraDeviceID  = NewType("LoraDeviceID", str)
AlertID       = NewType("AlertID", str)        # ULID
AlertSeverity = Literal["info", "advisory", "warning", "emergency", "extreme"]
AckStatus     = Literal["received", "acting", "need_help", "standing_down", "mistaken"]

@dataclass(frozen=True)
class ProtocolVersion:
    major: int
    minor: int
    patch: int
    suffix: str = ""

# Reused from earlier phases but referenced here for completeness:
#   NodeID     (M01)
#   EventID    (X02)
#   AuthToken  (M16)
#   Bbox       (M07 spatial extensions)
#   Tensor     (local LLM tensor type, dtype-tagged)
```
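
As an illustration of the `ClaimID` comment ("base32 of SHA-256 canonical claim"), the derivation might look like the sketch below. The canonical form is not defined in this section, so the sketch assumes sorted-key compact JSON; the unpadded lowercase base32 rendering and the `claim_id` helper are also assumptions.

```python
import base64
import hashlib
import json
from typing import NewType

ClaimID = NewType("ClaimID", str)


def claim_id(claim: dict) -> ClaimID:
    """Derive a content-addressed ID for a claim.

    Assumption: "canonical" means JSON with sorted keys and no extra whitespace.
    """
    canonical = json.dumps(
        claim, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    # Assumption: unpadded, lowercase base32 (52 characters for a 32-byte digest).
    return ClaimID(base64.b32encode(digest).decode("ascii").rstrip("=").lower())


a = claim_id({"subject": "bridge-7", "predicate": "status", "object": "closed"})
b = claim_id({"object": "closed", "predicate": "status", "subject": "bridge-7"})
```

Because the keys are sorted before hashing, `a` and `b` are the same ID even though the dictionaries were written in a different order.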

---

## 5. Centralised constants (`hearthnet/constants.py`, Phase 3 additions)

```python
# --- Distributed inference (M26) ---
DISTRIBUTED_LLM_MAX_SHARDS_PER_PIPELINE = 16
DISTRIBUTED_LLM_SHARD_HEARTBEAT_SECONDS = 5
DISTRIBUTED_LLM_FAILOVER_TIMEOUT_SECONDS = 10
DISTRIBUTED_LLM_MAX_PIPELINE_LATENCY_TOKENS_PER_S = 2.0   # advisory floor
DISTRIBUTED_LLM_DEFAULT_DTYPE = "fp16"

# --- MoE routing (M27) ---
MOE_TOP_K_DEFAULT = 3
MOE_LEARNED_SCORER_MIN_FEEDBACK_SAMPLES = 200
MOE_HUMAN_HANDOFF_DEFAULT_TIMEOUT_HOURS = 24
MOE_HUMAN_HANDOFF_COOLDOWN_HOURS = 2
MOE_HUMAN_RATE_LIMIT_PER_DAY = 5

# --- Federated learning (M28) ---
FEDLEARN_MAX_LORA_RANK = 64
FEDLEARN_MAX_LORA_TARGET_MODULES = 8
FEDLEARN_MAX_TRAIN_STEPS = 1000
FEDLEARN_MAX_PARTICIPANTS = 32
FEDLEARN_MIN_PARTICIPANTS = 3
FEDLEARN_DP_NOISE_SCALE_DEFAULT = 0.0   # off
FEDLEARN_CLIP_NORM_DEFAULT = 1.0
FEDLEARN_SUBMISSION_MAX_BYTES = 64 * 1024 * 1024

# --- LoRa beacons (M29) ---
LORA_BEACON_PERIOD_SECONDS_DEFAULT = 600   # 10 min
LORA_BEACON_MAX_PAYLOAD_BYTES = 32
LORA_RX_QUEUE_MAX = 256
LORA_PEER_RX_MAX_PER_MINUTE = 20
LORA_PANIC_BURST_COUNT = 3
LORA_PANIC_BURST_GAP_MS = 800

# --- Evidence (M30) ---
EVIDENCE_CLAIM_TTL_DAYS_DEFAULT = 365
EVIDENCE_DISPUTE_MIN_TRUST = 0.3
EVIDENCE_MAX_PROVENANCE_DEPTH = 8

# --- Civil defense (M31) ---
CIVDEF_AUDIT_RETENTION_YEARS = 10   # operator must validate against current law
CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE = 5
CIVDEF_ALERT_TITLE_MAX_CHARS = 80
CIVDEF_ALERT_BODY_MAX_CHARS = 1000

# --- Tensor transport (X08) ---
TENSOR_CHUNK_BYTES = 1 * 1024 * 1024   # 1 MiB
TENSOR_FLOW_CONTROL_WINDOW = 16
TENSOR_COMPRESSION_THRESHOLD_BYTES = 64 * 1024
TENSOR_KEEPALIVE_SECONDS = 30
TENSOR_MAX_SESSION_LIFETIME_SECONDS = 3600

# --- Conformance suite (X09) ---
CONFORMANCE_DEFAULT_SEED = 0xC0FFEE
CONFORMANCE_DEFAULT_OUTPUT_DIR = "./conformance-report"
```
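
To show how the X08 constants might interact, here is a rough sketch of payload chunking. The helper names `chunk_payload` and `should_compress` are hypothetical, and two readings are assumed rather than taken from the X08 spec: that the flow-control window counts chunks, and that payloads at or above the compression threshold are the ones compressed.

```python
from collections.abc import Iterator

TENSOR_CHUNK_BYTES = 1 * 1024 * 1024            # 1 MiB, as in constants.py
TENSOR_FLOW_CONTROL_WINDOW = 16
TENSOR_COMPRESSION_THRESHOLD_BYTES = 64 * 1024

# Assumption: the window counts unacknowledged chunks, so at most this many
# bytes are in flight per session.
MAX_IN_FLIGHT_BYTES = TENSOR_FLOW_CONTROL_WINDOW * TENSOR_CHUNK_BYTES


def chunk_payload(payload: bytes, chunk_bytes: int = TENSOR_CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of at most `chunk_bytes` bytes."""
    for offset in range(0, len(payload), chunk_bytes):
        yield payload[offset:offset + chunk_bytes]


def should_compress(payload: bytes) -> bool:
    """Assumed rule: compress only payloads at or above the threshold."""
    return len(payload) >= TENSOR_COMPRESSION_THRESHOLD_BYTES


payload = b"\x00" * (2 * TENSOR_CHUNK_BYTES + 10)
chunk_sizes = [len(chunk) for chunk in chunk_payload(payload)]
```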

---

## 6. Error codes (Phase 3 additions)

| Code | Module | When |
|------|--------|------|
| `experimental_disabled` | shared | Capability called with the feature flag off |
| `shard_unavailable` | M26 | No replica for the required layer range |
| `shard_unreachable` | M26 | Connectivity to all replicas failed |
| `pipeline_failed` | M26 | Aggregate failure of an in-flight pipeline |
| `pipeline_cancelled` | M26 | Pipeline cancelled by caller |
| `tensor_too_large` | X08 | Tensor exceeds `rx_buffer_bytes_max` |
| `unknown_frame_type` | X08 | Frame type outside the defined set |
| `expert_unknown` | M27 | Referenced expert is not registered |
| `expert_unavailable` | M27 | Expert known but currently outside availability window |
| `human_handoff_declined` | M27 | Human expert explicitly declined |
| `human_handoff_timed_out` | M27 | Handoff exceeded timeout without ack |
| `consent_required` | M28 | `join()` called without explicit operator consent |
| `base_model_mismatch` | M28 | Local base model SHA differs from manifest |
| `insufficient_resources` | M28 | Estimated VRAM/disk exceeds budget |
| `delta_invalid` | M28 | Submitted state-dict fails structural validation |
| `fedlearn_aggregation_failed` | M28 | Aggregation produced NaN/Inf |
| `fedlearn_min_participants_unmet` | M28 | Round closed below quorum |
| `fedlearn_aggregator_unreachable` | M28 | Finalize attempted while coordinator offline |
| `adapter_not_found` | M28 | Fetch for unknown adapter SHA |
| `lora_hardware_unavailable` | M29 | No stick present |
| `lora_hardware_unsupported` | M29 | Adapter init failed |
| `lora_duty_cycle_exhausted` | M29 | Non-panic send with empty budget |
| `lora_peer_unknown` | M29 | Verify against unseen `sender_hash` |
| `lora_peer_conflict` | M29 | Verify would create conflicting binding |
| `lora_frame_malformed` | M29 | RX frame structurally invalid |
| `claim_not_found` | M30 | Reference to unknown ClaimID |
| `claim_signature_invalid` | M30 | Signature doesn't verify |
| `evidence_cycle_detected` | M30 | Derivation chain forms a cycle |
| `evidence_contradiction` | M30 | (advisory) conflicting claims on same triple |
| `ebkh_unavailable` | M30 | EBKH endpoint unreachable |
| `civdef_cert_not_owned` | M31 | Cert holder ≠ caller identity |
| `civdef_cert_invalid` | M31 | Cert expired, revoked, or signature broken |
| `civdef_cert_unrecognised` | M31 | Issuer chain doesn't reach a trust root |
| `civdef_cert_out_of_scope` | M31 | Cert lacks the requested role/region |
| `civdef_alert_not_found` | M31 | Operation on unknown AlertID |
| `civdef_alert_target_invalid` | M31 | Malformed target or outside scope |
| `civdef_audit_chain_broken` | M31 | Audit chain hash/signature mismatch |
| `civdef_role_revoked` | M31 | Operation attempted with a revoked cert |
| `civdef_region_unsupported` | M31 | No region adapter loaded |
| `civdef_ack_rate_limited` | M31 | Ack rate exceeded |
| `protocol_version_unknown` | M32 | Reference to unknown protocol version |
| `protocol_suite_not_installed` | M32 | Conformance report requested without X09 |
| `protocol_descriptor_invalid` | M32 | Malformed descriptor announcement |
| `protocol_unsupported_capability` | M32 | Federation negotiation finds no compatible major version |

---

## 7. File map (top-level)

```
hearthnet/
├── distributed_inference/          # M26
│   ├── shard.py
│   ├── orchestrator.py
│   ├── router.py
│   ├── plan.py
│   └── failover.py
├── moe/                            # M27
│   ├── router.py
│   ├── scorer.py
│   ├── expert_registry.py
│   ├── human_in_the_loop.py
│   └── feedback.py
├── fedlearn/                       # M28
│   ├── coordinator.py
│   ├── participant.py
│   ├── trainer.py
│   ├── aggregator.py
│   ├── delta.py
│   ├── privacy.py
│   └── manifest.py
├── lora/                           # M29
│   ├── service.py
│   ├── serial_bridge.py
│   ├── frame.py
│   ├── duty_cycle.py
│   ├── peer_map.py
│   └── adapters/{meshtastic,rfm95w,sx126x}.py
├── evidence/                       # M30
│   ├── service.py
│   ├── claim.py
│   ├── store.py
│   ├── query.py
│   ├── extractor.py
│   ├── ebkh_adapter.py
│   └── trust.py
├── civdef/                         # M31
│   ├── service.py
│   ├── alert.py
│   ├── role.py
│   ├── audit.py
│   ├── target.py
│   ├── ack.py
│   └── regions/nrw.py
├── protocol/                       # M32 runtime
│   ├── service.py
│   ├── registry.py
│   └── report.py
└── transport/
    └── tensor/                     # X08
        ├── session.py
        ├── frame.py
        ├── flow.py
        └── compress.py

protocol/                           # M32 spec artefacts (repo root)
├── VERSION
├── README.md
├── CHANGELOG.md
├── governance.md
├── versioning.md
├── reference-implementations.md
├── core/...
└── experimental/...

conformance/                        # X09 suite (repo root)
├── VERSION
├── runner.py
├── report.py
├── harness/...
├── suites/...
└── vectors/...
```

---

## 8. Configuration index

Each module defines its own `*Config` dataclass; all are surfaced through the global `HearthnetConfig` and read from `~/.config/hearthnet/config.toml`. Phase 3 additions:

```python
@dataclass(frozen=True)
class HearthnetConfig:
    # ... (Phase 1, Phase 2 fields) ...
    distributed_llm: DistributedLlmConfig
    moe: MoeConfig
    fedlearn: FedLearnConfig
    lora: LoraConfig
    evidence: EvidenceConfig
    civdef: CivDefConfig
    tensor_transport: TensorTransportConfig
    protocol: ProtocolConfig
```

Every Phase 3 config has `enabled: bool = False` except `protocol` (default `True`). The bus dispatcher refuses to register Phase 3 capabilities when their module's enabled flag is False.

---

## 9. Build order (recap from `00-OVERVIEW.md`)

Phase 3 has eight tracks, A-H, that can largely be parallelised; the two cross-track dependencies are noted inline:

```
Track A: X09 conformance suite scaffolding → M32 protocol service
Track B: M30 evidence + EBKH adapter
Track C: M27 MoE machine experts (router, registry, scorer)
Track D: M27 human-in-the-loop coordinator (depends on Track C base)
Track E: M28 federated LoRA aggregation
Track F: X08 tensor transport → M26 distributed inference
Track G: M29 LoRa beacons (hardware-gated)
Track H: M31 civil defense (depends on M30 evidence)
```

Tracks A and F unlock the most downstream work (M32 needs X09; M26 needs X08). Tracks G and H are most easily deferred if Phase 3 needs to ship a minimal cut.
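
The track dependencies can be turned into parallel build "waves" with a topological sort. The sketch below uses the standard-library `graphlib`; `TRACK_DEPS` and `build_waves` are illustrative names, and mapping Track H's dependency on M30 to Track B is an inference from the track list rather than something the recap states. The ordering inside a track (X09 before M32, X08 before M26) is not modelled.

```python
from graphlib import TopologicalSorter

# Track -> set of tracks that must land first.
TRACK_DEPS: dict[str, set[str]] = {
    "A": set(),
    "B": set(),
    "C": set(),
    "D": {"C"},   # human-in-the-loop builds on the MoE base
    "E": set(),
    "F": set(),
    "G": set(),
    "H": {"B"},   # assumption: "depends on M30 evidence" means Track B
}


def build_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tracks into waves; every track in a wave can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = build_waves(TRACK_DEPS)
# waves == [["A", "B", "C", "E", "F", "G"], ["D", "H"]]
```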

---

## 10. Open-question summary

Each module spec has its own §10 with detailed open questions. The recurring themes across Phase 3:

1. **Real-world identity binding** (M28 sybil defence, M31 institutional keys, M30 EBKH trust roots, M27 human verification): the cryptographic story is solid; the social/institutional story is where the work remains.
2. **Adversarial robustness** (M26 byzantine shards, M28 poisoning, M30 disputed claims, M31 forged certs): all have stub defences and known harder problems.
3. **Second implementation** (M32): until a non-reference implementation exists, conformance is performative. This is the single most important next step.
4. **Cross-Land / cross-border generalisation** (M31 regional adapter, M30 EBKH OSINT scope, M29 regulatory regions): designed for NRW first; the structures admit other regions, but those are unbuilt.
5. **Resource tiers** (M26 phone-class participants, M28 hardware fairness): heterogeneous hardware aggregation is largely unsolved.
6. **Privacy / DP calibration** (M28 noise scale, M30 sensitive claims, M29 sender hash): defaults are conservative; tuning is operator-by-operator.

Each module also lists module-specific items. Read them.

---

*Last updated: spec set v3.0. Phase 3 specs were authored with the intent that any of M26–M31 could be cut from a shipping release without affecting Phase 1 + Phase 2 functionality. M32 + X09 are the long-term durability investment and should ship even when other Phase 3 modules don't.*

docs/p2_p3/M14-federation.md
ADDED
# M14: Federation

**Spec version:** v1.0 (Phase 2)
**Depends on:** M01 (identity), M16 (tokens), X02 (events), X05 (DHT, for cross-LAN bootstrap), X01 (transport), M03 (bus), M15 (relay, for NAT traversal)
**Depended on by:** M22 (mobile uses federation for cross-community discovery), all services indirectly (federated calls route through M14)

---

## 1. Responsibility

Let two **communities** establish a trust link and grant each other scoped access to specific capabilities. Federation is the bridge from "your house" to "the neighbourhood".

- Establish a federation between two communities (mutual cross-signature)
- Enforce scope on cross-community capability calls
- Route federated calls through a designated **Bridge** node profile
- Heartbeat between federated communities so liveness is visible
- Issue federation tokens (via M16) for individual calls

Out of scope:
- Many-community mesh federation: Phase 3 (this module handles bilateral only)
- Anonymous federation: never

---

## 2. File layout

```
hearthnet/federation/
├── __init__.py
├── manifest.py         # FederationManifest builder + verifier
├── peering.py          # the cross-sig handshake
├── relay_client.py     # outbound calls into peer community
└── service.py          # FederationService: registers federation.* capabilities
```

---

## 3. Federation manifest

Format defined in [CAP2 §6.1](../CAPABILITY_CONTRACT_v2.md). Key properties:

- Lives in **both** communities' event logs
- Signed by anchors of both, with per-community `min_signatures_to_federate` co-signatures
- Carries the **scope** each side grants the other
- Has an expiry (`expires_at`), typically 1 year, with auto-renewal via heartbeats

A node may belong to **one** community (Phase 1) and that community may be federated with **N** other communities. Federation is a community-level relationship, not per-node.

---

## 4. Public API

### 4.1 `manifest.py`

```python
# hearthnet/federation/manifest.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime

# Endpoint, CommunityManifest and KeyPair are defined elsewhere in hearthnet;
# their import paths are not given in this spec.

@dataclass(frozen=True)
class FederationScope:
    """What one community grants the other."""
    capabilities: list[str]                     # ["rag.query@1.0"]
    params_constraints: dict[str, list[str]]    # {"corpus":["public-emergency"]}
    rate_limit_per_minute: int                  # cross-community budget
    data_visibility: str                        # "public_corpora_only"|"members_only"|"open"

@dataclass(frozen=True)
class FederationManifest:
    schema_version: int
    federation_id: str
    community_a: str
    community_b: str
    established_at: str
    expires_at: str
    scope_a_to_b: FederationScope               # what A grants B
    scope_b_to_a: FederationScope               # what B grants A
    bootstrap_endpoints_a: list[Endpoint]
    bootstrap_endpoints_b: list[Endpoint]
    signature_a: dict                           # {signed_by, signature, co_signers}
    signature_b: dict

    def grants_to(self, calling_community_id: str) -> FederationScope | None:
        """Returns the scope grant *to* the calling community (if federated, else None)."""

    def is_expired(self, now: datetime | None = None) -> bool: ...

def build_federation_proposal(
    our_community_manifest: CommunityManifest,
    peer_community_manifest_url: str,
    proposed_scope_to_grant: FederationScope,
    proposed_scope_to_receive: FederationScope,
    bootstrap_endpoints: list[Endpoint],
) -> 'FederationProposal':
    """Step 1: prepare a proposal. Not yet a manifest, just a draft for the other side."""

@dataclass(frozen=True)
class FederationProposal:
    community_a: str
    community_b: str
    scope_a_to_b: FederationScope
    scope_b_to_a: FederationScope
    bootstrap_endpoints_a: list[Endpoint]
    bootstrap_endpoints_b: list[Endpoint]
    proposer_signature: str

def co_sign_federation(
    proposal: FederationProposal,
    signing_kp: KeyPair,
    role: str,                                  # "a" or "b"
) -> dict:
    """Returns {signed_by, signature, co_signers[]} payload."""

def finalize_federation_manifest(
    proposal: FederationProposal,
    sig_a: dict,
    sig_b: dict,
) -> FederationManifest:
    """Assemble fully-signed manifest after both sides have signed."""

def parse_federation_manifest(blob: bytes | dict) -> FederationManifest: ...

def verify_federation_manifest(
    m: FederationManifest,
    community_a_manifest: CommunityManifest,
    community_b_manifest: CommunityManifest,
) -> None:
    """Verify both sides signed, anchors are valid in their communities,
    co-signer counts meet policy, expiry is in the future."""
```
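
The two `FederationManifest` methods are left as stubs above. One plausible reading is sketched below with trimmed-down stand-in classes (`ScopeSketch`, `ManifestSketch`); it assumes `expires_at` is an ISO 8601 string with an explicit UTC offset and that `now`, when supplied, is timezone-aware.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScopeSketch:
    capabilities: tuple[str, ...]


@dataclass(frozen=True)
class ManifestSketch:
    """Stand-in carrying only the fields the two methods need."""
    community_a: str
    community_b: str
    expires_at: str            # assumed ISO 8601 with offset, e.g. "...+00:00"
    scope_a_to_b: ScopeSketch  # what A grants B
    scope_b_to_a: ScopeSketch  # what B grants A

    def grants_to(self, calling_community_id: str) -> ScopeSketch | None:
        # The caller receives the scope that the *other* side granted.
        if calling_community_id == self.community_b:
            return self.scope_a_to_b
        if calling_community_id == self.community_a:
            return self.scope_b_to_a
        return None

    def is_expired(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now >= datetime.fromisoformat(self.expires_at)


manifest = ManifestSketch(
    community_a="comm-a",
    community_b="comm-b",
    expires_at="2030-01-01T00:00:00+00:00",
    scope_a_to_b=ScopeSketch(("rag.query@1.0",)),
    scope_b_to_a=ScopeSketch(()),
)
```

Here `manifest.grants_to("comm-b")` returns the scope A grants B, and a community that is not party to the manifest gets `None`.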

### 4.2 `peering.py`

```python
# hearthnet/federation/peering.py
class FederationHandshake:
    """Conducts the multi-step cross-signing handshake.
    Stateful; one instance per active proposal."""

    def __init__(
        self,
        our_community_manifest: CommunityManifest,
        our_kp: KeyPair,
        transport_client: HttpClient,
        event_log: EventLog,
    ):
        ...

    async def initiate(
        self,
        peer_endpoints: list[Endpoint],
        scope_to_grant: FederationScope,
        scope_to_receive: FederationScope,
    ) -> FederationProposal:
        """1. Fetch peer's community manifest.
        2. Build proposal.
        3. Sign as community A's anchor.
        4. POST to peer.
        5. Receive peer's signed proposal back.
        6. Verify both signatures and gather more local co-signers if policy requires.
        Returns the fully-signed proposal ready to finalize."""

    async def accept(self, proposal: FederationProposal) -> FederationManifest:
        """The other side accepting an incoming proposal.
        Returns the finalized manifest (publishable to event log)."""

    async def publish(self, manifest: FederationManifest) -> None:
        """Append federation.peer.added event to local log.
        Push the manifest to peer so they can do the same."""
```

### 4.3 `relay_client.py`

```python
# hearthnet/federation/relay_client.py
from collections.abc import AsyncIterator, Callable

class FederationCaller:
    """Outbound side: makes calls into federated communities.
    Used by services when their request triggers a federated lookup
    (e.g. rag.query across federated corpora)."""

    def __init__(
        self,
        bus: CapabilityBus,
        our_kp: KeyPair,
        our_community_id: str,
        federation_manifests_provider: Callable[[], list[FederationManifest]],
    ):
        ...

    async def call_in_peer(
        self,
        peer_community_id: str,
        capability: str,
        version: Version,
        body: dict,
        *,
        timeout_seconds: float | None = None,
    ) -> dict:
        """1. Look up federation manifest for peer_community_id.
        2. Verify scope includes (capability, params).
        3. Issue an auth.token via local M16 with capability scope.
        4. Pick a peer Bridge endpoint (from the peer's bootstrap endpoints in
           the manifest, i.e. manifest.bootstrap_endpoints_b when we are community A).
        5. POST /bus/v1/call to peer's federation.proxy@1.0 with token + body.
        Returns the result. Raises FederationError on scope/auth issues."""

    async def stream_in_peer(...) -> AsyncIterator[Frame]:
        """Streaming variant."""
```

### 4.4 `service.py`

```python
# hearthnet/federation/service.py
class FederationService:
    name = "federation"
    version = "1.0"

    def __init__(
        self,
        bus: CapabilityBus,
        event_log: EventLog,
        replay_engine: ReplayEngine,
        author_kp: KeyPair,
        community_manifest_provider: Callable[[], CommunityManifest],
        revocation_cache: RevocationCache,
    ):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """Registers: federation.peer.add, federation.peer.remove,
        federation.peer.list, federation.proxy (all @1.0)."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_peer_add(self, req: RouteRequest) -> dict:
        """CAP2 §4.1. Run handshake; check co-signatures; emit event."""

    async def handle_peer_remove(self, req: RouteRequest) -> dict:
        """CAP2 §4.2."""

    async def handle_peer_list(self, req: RouteRequest) -> dict:
        """CAP2 §4.3."""

    async def handle_proxy(self, req: RouteRequest) -> dict | AsyncIterator[dict]:
        """CAP2 §4.4. Forward a federated call to the local bus.
        1. Verify token attached to request.
        2. Look up federation manifest, check scope.
        3. Call bus.call(target_capability, version, body) internally.
        4. Return result (or stream frames)."""

    # --- maintenance ---

    async def heartbeat_loop(self) -> None:
        """Per FEDERATION_HEARTBEAT_SECONDS, ping each federated peer's
        federation.peer.list. Update last_heartbeat. Emit
        federation.heartbeat event."""
```

---

## 5. Behaviour

### 5.1 Two-phase handshake

```
Community A's anchor decides to federate with B.
    ↓
A: peering.initiate(B_endpoints, scope_a_to_b, scope_b_to_a)
    → fetch B's community manifest
    → build proposal; sign as A
    → POST /federation/proposal to B's bridge
    ↓
B receives proposal, presents to a trusted member.
    human decision (or auto-policy)
    ↓
B: peering.accept(proposal)
    → sign as B
    → return signed proposal
    ↓
A: gather more local co-signers if our policy.min_signatures_to_federate > 1
    ↓
A: finalize_federation_manifest(proposal, sig_a, sig_b)
    ↓
A: publish federation.peer.added event locally
    ↓
A: POST manifest to B so they publish too
    ↓
both communities heartbeat each other periodically
```

### 5.2 Bridge node profile

A community designates one or more nodes with `profile: "bridge"`. Bridge nodes:

- Are always-on (best-effort)
- Have a publicly reachable endpoint or a relay-tier (M15) registration
- Run `FederationService` and act as the proxy for inbound federated calls
- Hold the bandwidth budget for cross-community traffic

Non-bridge nodes can still **call into** federated communities (via M14 `FederationCaller`); they just don't *serve* cross-community calls.

### 5.3 Scope enforcement (inbound)

When `federation.proxy` is invoked:

1. Caller signature verified (Phase 1 §1.3)
2. Caller's community is parsed from token's `iss`
3. Federation manifest lookup; absent → `not_federated`
4. Scope check: `(capability, version)` ∈ scope and params allowed, else `federation_forbidden`
5. Token's signature verified against issuer's community anchors
6. Token's `aud` must match our community
7. Token's `scope` ⊆ federation manifest's scope (caller's community can't grant themselves more than they were granted)
8. Dispatch internally via bus
9. Record metrics: `hearthnet_federation_calls_total{peer_community, capability, result}`
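
The non-cryptographic part of the checks above (steps 2 to 4, 6 and 7) can be sketched as follows. This is a minimal illustration, not the module's implementation: the `Scope` class, the `check_inbound` function, the token-as-dict shape and the reuse of `federation_forbidden` for `aud` and scope-subset failures are all assumptions made for the example. Signature verification (steps 1 and 5) and the params predicate are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical scope: a set of (capability, version) pairs."""
    grants: frozenset

    def allows(self, capability: str, version: str) -> bool:
        return (capability, version) in self.grants

    def is_subset_of(self, other: "Scope") -> bool:
        return self.grants <= other.grants


def check_inbound(token: dict, our_community: str, manifests: dict,
                  capability: str, version: str) -> str:
    """Return "ok" or an error code; signatures are assumed already verified."""
    granted = manifests.get(token["iss"])           # steps 2-3: manifest lookup
    if granted is None:
        return "not_federated"
    if not granted.allows(capability, version):     # step 4: scope check
        return "federation_forbidden"
    if token["aud"] != our_community:               # step 6 (error code assumed)
        return "federation_forbidden"
    token_scope = Scope(frozenset(token["scope"]))
    if not token_scope.is_subset_of(granted):       # step 7 (error code assumed)
        return "federation_forbidden"
    return "ok"
```

Modelling the scope as a set makes the subset rule in step 7 a single comparison; a real scope would also carry parameter constraints and rate limits.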

### 5.4 Heartbeats and expiry

- Every `FEDERATION_HEARTBEAT_SECONDS` (300 seconds), each bridge calls `federation.peer.list` on each federated peer
- If a heartbeat fails 3 times in a row, the peer is marked `degraded` in the local view
- Federation manifests have `expires_at`. 30 days before expiry, a renewal handshake is auto-initiated. If renewal fails by expiry, the federation lapses; calls return `not_federated`.

### 5.5 Revocation

To break a federation:

- Either side may call `federation.peer.remove`
- Co-signature requirements: same as creation (`policy.min_signatures_to_federate`)
- Event `federation.peer.removed` is published locally; peer is notified and publishes their own
- All outstanding tokens issued under this federation are implicitly revoked (M16 verifies federation still exists)

### 5.6 Identity import (Phase 2.5 hook)

A federated user with NodeID in community A wishing to access community B's services *as themselves* (not via their community A anchor's token) can use `federation.identity.attest` (reserved capability, Phase 2.5). Out of scope for the first cut.

### 5.7 Trust transitivity

**Not transitive.** A↔B and B↔C do not imply A↔C. Each pair establishes its own manifest. This is intentional: federation requires explicit consent.

### 5.8 Conflict: federation with revoked member's community

If community A has federated with B, and later A's anchor (the signer) is revoked from A:

- The federation manifest's signature *was* valid at sign time
- Going forward, A's community may renew with a new anchor signature
- B verifies federation against A's current anchor set on every call; if no current anchor co-signs, the federation is invalid

---

## 6. Discovery integration

A community wishing to find a federated peer they haven't talked to in a while:

1. Look up `bootstrap_endpoints` in their stored federation manifest
2. Try each; if all fail, fall back to [X05 DHT](../cross-cutting/X05-dht.md): `find_value(blake3(peer_community_id))`
3. If DHT also returns nothing, try [M15 relay](M15-relay-tier.md): `relay.lookup_community(peer_community_id)`
4. Only after all three fail, mark federation as `unreachable`; UI shows offline indicator
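
The fallback order above amounts to trying an ordered list of lookup sources and stopping at the first one that yields endpoints. A minimal sketch, assuming each source is wrapped as an async callable; `discover_peer`, the `Source` alias and the string `Endpoint` stand-in are hypothetical names, not part of the spec:

```python
import asyncio
from typing import Awaitable, Callable

Endpoint = str  # stand-in for the spec's Endpoint type
Source = Callable[[str], Awaitable[list]]


async def discover_peer(peer_community_id: str, sources: list) -> tuple:
    """Try (name, lookup) pairs in order; return the first non-empty result.

    `sources` would be, in order: stored bootstrap endpoints, the X05 DHT,
    and the M15 relay. A source that raises counts as "returned nothing".
    """
    for name, lookup in sources:
        try:
            endpoints = await lookup(peer_community_id)
        except (OSError, asyncio.TimeoutError):
            endpoints = []
        if endpoints:
            return name, endpoints
    # Only after every source failed is the federation marked unreachable.
    return "unreachable", []
```

Keeping the sources in a list rather than hard-coding three calls makes the order explicit and easy to test with stubs.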

---

## 7. Errors

`FederationError`:

- `not_federated`: no manifest for this peer
- `federation_expired`: manifest past `expires_at`
- `scope_violation`: request outside granted scope
- `bridge_unreachable`: couldn't reach any of peer's bridges
- `co_signer_insufficient`: proposal lacks required signature count
- `peer_community_invalid`: peer's manifest failed verification

Wire mapping per [CAP2 §9](../CAPABILITY_CONTRACT_v2.md).

---

## 8. Configuration

```python
config.federation.enabled = False              # opt-in
config.federation.bridge_node = False          # we serve cross-community calls
config.federation.relay_url = None             # M15 hosted relay for NAT
config.federation.auto_renew_days_before = 30
config.federation.max_peer_communities = 16
config.federation.heartbeat_seconds = FEDERATION_HEARTBEAT_SECONDS
config.federation.scope_default_rate_limit_per_minute = 60
```

Constants: `FEDERATION_MANIFEST_TTL_SECONDS`, `FEDERATION_HEARTBEAT_SECONDS`.

---

## 9. Tests

### Unit
- `test_federation_proposal_builds_correctly`
- `test_co_sign_signature_verifies`
- `test_finalize_requires_min_signers`
- `test_grants_to_returns_scope_for_correct_direction`
- `test_expired_federation_rejects_calls`

### Integration
- `test_two_community_federation_round_trip`: A and B in different processes federate, then A queries B's RAG via proxy
- `test_scope_violation_returns_403`
- `test_heartbeat_marks_degraded_after_3_failures`
- `test_revocation_breaks_existing_tokens`
- `test_renewal_30_days_before_expiry`

### Chaos
- `test_partition_during_federation_handshake_resumable`

---

## 10. Cross-references

| What | Where |
|------|-------|
| Federation manifest schema | [CAP2 §6.1](../CAPABILITY_CONTRACT_v2.md) |
| `federation.*` capabilities | [CAP2 §4.1–4.4](../CAPABILITY_CONTRACT_v2.md) |
| Token issuance for cross-community | [M16 §5.1](M16-tokens.md) |
| DHT bootstrap | [X05 §4.3](../cross-cutting/X05-dht.md) |
| Relay tier NAT traversal | [M15](M15-relay-tier.md) |
| Bridge node profile | [Phase 1 PRD §5.4 + this module §5.2](../../HEARTHNET_PRD_v2.md) |
| Phase 3 transitive federation | TBD |

---

## 11. Open questions

1. **Multi-party federation (mesh of N>2 communities)**: currently bilateral only. Phase 3 candidate.
2. **Federated marketplace**: should `market.list` cross federations? Reserved scope param; default off.
3. **Federated identity**: single-sign-on across federated communities. Phase 2.5; design depends on token-on-token.
4. **Federation revocation event propagation**: if A↔B and A↔C, and B unilaterally revokes A, should C see this? MVP: no, each pair is independent.
5. **Audit log for federation activity**: should there be a separate "federation_audit" log so cross-community activity is easy to surface to operators? Yes, Phase 2.5.
|
docs/p2_p3/M15-relay-tier.md
ADDED
|
@@ -0,0 +1,389 @@
# M15: Relay Tier

**Spec version:** v1.0 (Phase 2)
**Depends on:** M01 (identity), M16 (tokens, for relay registration auth), X01 (transport), X05 (DHT), X04 (config)
**Depended on by:** M14 (federation, when bridges are NAT'd), M22 (mobile push delivery), X05 (DHT bootstrap)

---

## 1. Responsibility

A **relay node** is a public-internet-reachable service that helps HearthNet nodes that cannot directly reach each other (NAT, mobile networks, dynamic IPs). It provides:

- **NAT traversal**: registered nodes receive forwarded traffic from peers
- **Federation discovery**: a public lookup of "which IP currently runs community X's bridge"
- **Mobile push**: delivers chat/marketplace notifications to mobile devices via APNs/FCM
- **DHT bootstrap**: serves as a stable initial DHT endpoint

A relay is **infrastructure**: typically one or a few well-known servers per region. Christof's planned `relay.hearthnet.de` (Hetzner) is the reference deployment.

Relays do **not**:
- Store any community state long-term
- See cleartext of E2E-encrypted chat
- Make any trust decisions; they are credential-free transport for already-authenticated traffic
- Replace community anchors

---

## 2. File layout

```
hearthnet/relay/                  # client-side helpers, ships with normal HearthNet
├── __init__.py
├── client.py                     # RelayClient: registration, forwarding fetch
└── push_subscriber.py            # iOS APNs / Android FCM token registration

relay-server/                     # separate deployable, lives in /relay-server in the repo
├── pyproject.toml
├── README.md
├── relay_server/
│   ├── __init__.py
│   ├── app.py                    # FastAPI app
│   ├── registration.py           # /relay/v1/register, /heartbeat
│   ├── forward.py                # /relay/v1/forward
│   ├── lookup.py                 # /relay/v1/community/<id>
│   ├── push.py                   # outbound APNs/FCM
│   ├── billing.py                # optional, for paid tier
│   ├── storage.py                # SQLite for registrations + push device tokens
│   └── observability.py          # logging, metrics
├── deploy/
│   ├── docker-compose.yml
│   ├── caddy.Caddyfile           # TLS termination
│   └── systemd/relay.service
└── tests/
```

The relay server is a thin FastAPI app deployable to any VPS. ~1500 LOC.

---

## 3. Wire surface (relay server endpoints)

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/relay/v1/register` | POST | A node registers itself for forwarding |
| `/relay/v1/heartbeat` | POST | Keep registration alive |
| `/relay/v1/deregister` | POST | Cleanly remove a registration |
| `/relay/v1/forward/<node_short>` | POST | A peer wanting to reach a registered node sends the encapsulated request here |
| `/relay/v1/community/<community_id>` | GET | Look up current bridge endpoints for a community |
| `/relay/v1/push/register` | POST | Register an APNs/FCM device token for push delivery |
| `/relay/v1/push/<device_id>` | POST | Send a push notification (typically from another node) |
| `/relay/v1/dht/bootstrap` | GET | Return a list of known-good DHT contacts |
| `/relay/v1/health` | GET | Operator endpoint |

Auth: every endpoint except `/health` requires a HearthNet signature OR a relay-issued token (registered nodes get a session token for performance).

---

## 4. Public API (client-side)

### 4.1 `client.py`

```python
# hearthnet/relay/client.py
@dataclass(frozen=True)
class RelayRegistration:
    relay_url: str
    node_id_full: str
    expires_at: int           # unix seconds
    session_token: str        # short-lived bearer for subsequent calls

class RelayClient:
    """Used by federation bridges and mobile clients to be reachable through a relay."""

    def __init__(
        self,
        relay_url: str,
        kp: KeyPair,
        community_id: str,
    ):
        ...

    async def register(
        self,
        *,
        capabilities_offered: list[str] | None = None,
        external_endpoint_hint: Endpoint | None = None,
    ) -> RelayRegistration:
        """Register us with this relay. Relay will forward inbound /relay/v1/forward/<our_short>
        calls to our actual endpoint via reverse-WebSocket (we hold a persistent WS to the relay).
        Returns registration; client should hold an open WS until deregister."""

    async def heartbeat(self) -> None:
        """Refresh registration. Should be called every RELAY_REGISTRATION_TTL_SECONDS / 2."""

    async def deregister(self) -> None: ...

    async def maintain(self) -> None:
        """Long-running task: keeps registration alive, reconnects on failure."""

    # --- lookups (no registration required) ---

    async def lookup_community(self, community_id: str) -> list[Endpoint]:
        """Find current bridge endpoints for a community."""

    async def dht_bootstrap_endpoints(self) -> list[Endpoint]: ...

    async def send_push(
        self,
        device_id: str,
        payload: dict,
        *,
        push_token: str,      # token from M16 with relay.push scope
    ) -> None:
        """Send a push notification via this relay."""
```

### 4.2 `push_subscriber.py`

```python
# hearthnet/relay/push_subscriber.py
class PushSubscriber:
    """Mobile-side: registers an APNs / FCM device token with the relay
    so the relay can deliver push notifications for chat / marketplace events."""

    def __init__(
        self,
        relay_url: str,
        kp: KeyPair,
        community_id: str,
        platform: str,        # "ios" | "android" | "web"
    ):
        ...

    async def register(self, device_token: str) -> str:
        """Returns our PushDeviceID. Stored locally for later push send authorization."""

    async def unregister(self) -> None: ...
```

### 4.3 Reverse-WebSocket pattern for forwarding

NAT'd nodes can't accept inbound connections. So:

1. Node POSTs `/relay/v1/register`
2. Server returns 101 Switching Protocols, upgrading to WebSocket
3. Node holds the WS open; relay sends forwarded calls down it
4. Node processes and responds back through the same WS

The relay server's `forward.py` proxies between the inbound HTTP caller and the registered node's WS. The inbound caller sees a normal HTTP/SSE response.

---

## 5. Server-side internals (sketch)

### 5.1 Registration table

```sql
CREATE TABLE registrations (
    node_id_full         TEXT PRIMARY KEY,
    community_id         TEXT NOT NULL,
    external_ip          TEXT,
    ws_session_id        TEXT,             -- in-memory WS connection id
    capabilities_offered TEXT,             -- JSON
    registered_at        INTEGER,
    expires_at           INTEGER,
    last_heartbeat       INTEGER
);
CREATE INDEX idx_reg_community ON registrations(community_id);
```

### 5.2 Push table

```sql
CREATE TABLE push_devices (
    device_id     TEXT PRIMARY KEY,        -- ULID, assigned by relay
    node_id_full  TEXT NOT NULL,
    community_id  TEXT NOT NULL,
    platform      TEXT NOT NULL,
    device_token  TEXT NOT NULL,           -- APNs / FCM token (kept secret on relay)
    registered_at INTEGER,
    last_active   INTEGER
);
CREATE INDEX idx_push_node ON push_devices(node_id_full);
```

### 5.3 Forwarding flow

```
peer P wants to call NAT'd node N (only knows N's NodeID and relay URL)
    ↓
P → POST https://relay.hearthnet.de/relay/v1/forward/<N_short>
    headers: standard X-HearthNet-* (signed by P)
    body: original capability call
    ↓
relay looks up N's WS session
    if absent  → 503 relay_unreachable
    if present → wraps request as a WS message and sends to N's WS
    ↓
N processes → sends response frames back through WS
    ↓
relay streams those frames as HTTP/SSE response to P
    ↓
relay never inspects body (E2E content); relay does check signatures
are valid (against peer manifests it caches) to prevent abuse
```
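
The forwarding flow can be simulated in memory to make the control flow concrete. The sketch below is illustrative only: `asyncio.Queue` objects stand in for the persistent WebSocket sessions, `RelaySketch` and `node_loop` are hypothetical names, the node simply echoes the body, and signature checks and SSE streaming are left out.

```python
import asyncio


class RelaySketch:
    """In-memory stand-in for forward.py; queues play the role of WebSockets."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def register(self, node_short: str) -> asyncio.Queue:
        inbox: asyncio.Queue = asyncio.Queue()
        self._sessions[node_short] = inbox
        return inbox

    def deregister(self, node_short: str) -> None:
        self._sessions.pop(node_short, None)

    async def forward(self, node_short: str, body: bytes) -> tuple:
        inbox = self._sessions.get(node_short)
        if inbox is None:
            return 503, b"relay_unreachable"     # no session for the target
        reply = asyncio.get_running_loop().create_future()
        await inbox.put((body, reply))           # body is opaque, never inspected
        return 200, await reply


async def node_loop(inbox: asyncio.Queue) -> None:
    """The NAT'd node: read forwarded calls and answer on the same channel."""
    while True:
        body, reply = await inbox.get()
        reply.set_result(b"echo:" + body)


async def demo() -> list:
    relay = RelaySketch()
    node = asyncio.create_task(node_loop(relay.register("7H4G")))
    results = [
        await relay.forward("7H4G", b"ping"),    # registered target
        await relay.forward("ZZZZ", b"ping"),    # unknown target
    ]
    node.cancel()
    return results
```

Running `asyncio.run(demo())` exercises both branches of the lookup: one call reaches the registered node and one hits the absent-session path.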

### 5.4 Push flow

```
sender wants to push to mobile user U
    ↓
sender → bus.call("auth.token.issue", scope={"capabilities":["relay.push"],"audience":"<device_id>"})
    ↓
sender → POST relay/v1/push/<device_id> with token + payload
    ↓
relay verifies token; resolves device_id → APNs/FCM token
    ↓
relay sends via APNs/FCM
    ↓
Apple/Google delivers to the device
    ↓
mobile app opens, calls bus to fetch new chat / event
```

The payload itself is opaque to the relay (`{"event_type":"chat.message.sent","sender_short":"7H4G-..."}`). The mobile app fetches the actual content via the bus when it opens.

### 5.5 Federation lookup flow

A community publishes its bridge endpoints to the relay via `/relay/v1/community/<id>` (POST, signed by an anchor). The relay caches `{community_id → [endpoints, last_updated]}` for 24 hours. GETs are free of charge; they must be signed (anti-spam) but are otherwise lightweight.
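
The 24-hour cache can be pictured as a small TTL map. This is a sketch under assumptions: `CommunityCache` and its method names are invented for illustration, endpoints are plain strings, and the anchor signature is taken to have been verified before `publish` is called.

```python
import time

CACHE_TTL_SECONDS = 24 * 3600  # "for 24 hours"


class CommunityCache:
    """Hypothetical in-memory map: community_id -> (endpoints, last_updated)."""

    def __init__(self, clock=time.time) -> None:
        self._clock = clock                  # injectable so tests can fake time
        self._entries: dict = {}

    def publish(self, community_id: str, endpoints: list) -> None:
        # Assumes the caller already verified the anchor's signature.
        self._entries[community_id] = (list(endpoints), self._clock())

    def lookup(self, community_id: str) -> list:
        entry = self._entries.get(community_id)
        if entry is None:
            return []
        endpoints, last_updated = entry
        if self._clock() - last_updated > CACHE_TTL_SECONDS:
            del self._entries[community_id]  # stale entries are dropped lazily
            return []
        return endpoints
```

Expiring entries lazily on lookup avoids a background sweeper; a community that stops publishing simply disappears from lookups after a day.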

---

## 6. Behaviour

### 6.1 Trust model

The relay is **untrusted-but-honest**. It:

- Sees who is talking to whom (NodeID-level)
- Sees signature envelopes (but not E2E cleartext)
- Can deny service (DoS), refuse to forward, or rate-limit
- Can NOT impersonate anyone (no private keys)
- Can NOT decrypt E2E content (no DH secrets)
- Can NOT modify forwarded bytes without breaking signatures

Operators of relays are accountable through public reputation: the relay's URL is in plain sight in community configs. A misbehaving relay gets blackballed by communities.

### 6.2 Rate limiting

| Endpoint | Limit |
|----------|-------|
| `/register` | 10 per hour per node |
| `/forward` | 10 RPS per (peer, target_node) |
| `/community/<id>` GET | 100 RPS total |
| `/push/<device>` | 60 per hour per (sender, device) |

Exceeded → 429 + `retry_after_ms`.
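
One way to enforce per-key limits like those in the table is a sliding-window counter. The sketch below is one possible approach, not the relay's actual limiter; `SlidingWindowLimiter` is a hypothetical name, and the spec does not say which algorithm is used.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)      # key -> timestamps of accepted hits

    def check(self, key: tuple, now: float) -> tuple:
        """Return (http_status, retry_after_ms) for a request arriving at `now`."""
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window:
            hits.popleft()                   # forget hits that left the window
        if len(hits) < self.limit:
            hits.append(now)
            return 200, 0
        # The oldest hit leaves the window at hits[0] + window.
        retry_after_ms = int(round((hits[0] + self.window - now) * 1000))
        return 429, retry_after_ms
```

For the `/forward` row, this would be instantiated as `SlidingWindowLimiter(limit=10, window_seconds=1.0)` and keyed by `(peer, target_node)`; `retry_after_ms` then tells the client when the oldest counted request ages out.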

### 6.3 Tier policy (Christof's hosted instance)

| Tier | Community size | Push notifications | Cost |
|------|----------------|--------------------|------|
| Free | ≤ 5 nodes per community | 100/day | €0 |
| Hearth | ≤ 50 nodes | 5000/day | €5/month |
| Anchor | unlimited | 50000/day | €25/month |
| Self-hosted | unlimited | unlimited | infrastructure |

The relay is open-source; any community can run their own. The hosted tier is a convenience layer.

### 6.4 Privacy guarantees

- Per-call signature verification: the relay checks signatures, but they contain only the public NodeID, not user identity in any deeper sense
- The sender addresses the destination only as `forward/<short>`; the relay still sees both parties
- No protection for traffic-pattern privacy (who talks to whom); this is out of scope
- Logs retain registration + forwarding metadata for 30 days for abuse handling and are then purged

### 6.5 Failure modes

- Relay down → mobile push delivery delayed; federated lookups fall back to DHT or stored endpoints; direct LAN calls unaffected
- Relay overloaded → 429s; clients back off exponentially
- Relay key rotation → relay publishes new pubkey signed by previous key; clients update via standard manifest refresh

---

## 7. Configuration (client side)

```python
config.relay.enabled = False
config.relay.urls = ["https://relay.hearthnet.de"]
config.relay.tier = "free"                   # informational
config.relay.register_as_bridge = False      # if True, holds persistent WS to relay
config.relay.push_enabled = False
config.relay.push_platform = "web"
```

Constants: `RELAY_REGISTRATION_TTL_SECONDS=7200`, `RELAY_PUSH_RETRY_MAX=5`.

### Relay server config (`relay-server/relay_server/config.py`)

```python
config.bind = "0.0.0.0:443"
config.tls_cert_file = "/etc/relay/cert.pem"
config.tls_key_file = "/etc/relay/key.pem"
config.database = "/var/lib/relay/relay.db"
config.apns_cert = "/etc/relay/apns.pem"
config.fcm_key_file = "/etc/relay/fcm.json"
config.tier = "free|hearth|anchor"
config.stripe_secret = None                  # for paid tiers
config.admin_token = "<random>"              # for operator endpoints
```

---

## 8. Errors

`RelayError` (client domain):

- `relay_unreachable`: TCP fails or 5xx
- `registration_expired`: call requires re-register
- `forward_target_offline`: target node not currently registered with this relay
- `push_token_invalid`: APNs/FCM rejected the device token
- `tier_limit_exceeded`: quota for this tier reached

Wire mapping: `relay_unreachable` is its own code in [CAP2 §9](../CAPABILITY_CONTRACT_v2.md).

---

## 9. Tests

### Client-side unit
- `test_register_includes_signature`
- `test_heartbeat_refreshes_expires_at`
- `test_lookup_returns_endpoints`

### Server-side unit
- `test_forward_requires_target_registered`
- `test_signature_required_on_register`
- `test_rate_limit_per_peer_target`
- `test_push_dispatch_apns_mock`

### Integration
- `test_two_nat_peers_communicate_through_relay`
- `test_federation_bridge_via_relay`
- `test_push_delivered_to_real_test_device` (manual, with APNs sandbox)

### Operational
- Smoke tests on the deployed `relay.hearthnet.de` instance run hourly

---

## 10. Cross-references

| What | Where |
|------|-------|
| Token use for push auth | [M16 §5.5](M16-tokens.md) |
| Federation routes through relay | [M14 §6](M14-federation.md) |
| DHT bootstrap endpoint | [X05 §4.4](../cross-cutting/X05-dht.md) |
| Mobile push subscriber | [M22 §6](M22-mobile-native.md) |
| Wire `relay_unreachable` | [CAP2 §9](../CAPABILITY_CONTRACT_v2.md) |

---

## 11. Open questions

1. **TURN-style relay vs message relay**: current spec is message-level (peer sends entire capability call). Could also do session-level TCP relay (more efficient for streams). Phase 2.5 candidate.
2. **STUN integration**: clients could try direct connection via STUN before falling back to relay. Phase 3.
3. **Multi-relay redundancy**: a node could register with two relays for HA. MVP picks one; multi is Phase 2.5.
4. **Payment integration**: Stripe webhooks → tier upgrade. Implementation detail, not specced here.
5. **Self-hosting documentation quality**: for the "appliance" go-to-market path, the relay needs a one-command install. Defer to `RELAY_OPERATIONS.md` doc.
|
docs/p2_p3/M16-tokens.md
ADDED
|
@@ -0,0 +1,391 @@
# M16: Capability Tokens

**Spec version:** v1.0 (Phase 2)
**Depends on:** M01 (identity), X02 (events, for `auth.token.*`), X04 (config), X03 (observability)
**Depended on by:** M14 (federation), M15 (relay), M22 (mobile), M23 (optionally, for session credentials)

---

## 1. Responsibility

Issue, verify, and revoke short-lived **capability tokens** for delegation. A token says: "the holder of this token may invoke capability X (with these constraints) on behalf of issuer Y, until time Z."

Tokens are the mechanism Phase 2 uses for:
- Federation calls (a federated peer presents a token issued by an anchor of the peer community)
- Mobile clients (the mobile app presents a token issued during onboarding)
- Limited-scope sharing (e.g. "let this neighbour query our emergency corpus for the next hour")

Per-request Ed25519 signatures (Phase 1 §1.3) remain the default authentication; tokens are an *additional* mechanism.

---

## 2. File layout

```
hearthnet/identity/
└── tokens.py              # CapabilityToken, encode/decode, verify, revocation cache

hearthnet/services/auth/
├── __init__.py
└── service.py             # AuthService: registers auth.token.* capabilities
```

The token primitives live under `identity/` (low-level crypto). The capability handlers live as a normal service so they go through the bus.

---

## 3. Token envelope

Compact JWS-style envelope. Once the `hntoken://v1/` prefix is stripped, it is compatible with off-the-shelf JWS decoders that accept `EdDSA`. Length budget: ≤ 800 bytes (fits in a QR code at error-correction level M).

### 3.1 Header

```json
{"alg": "EdDSA", "typ": "hntoken", "v": 1}
```

### 3.2 Payload

```json
{
  "iss": "ed25519:<issuer NodeID full form>",
  "sub": "ed25519:<subject NodeID full form>",
  "aud": "ed25519:<audience community_id, optional>",
  "iat": 1717939200,
  "exp": 1717942800,
  "nbf": 1717939200,
  "jti": "01HXR...",
  "scope": {
    "capabilities": ["rag.query@1.0", "embed.text@1.0"],
    "params_constraints": {
      "corpus": ["niederrhein-emergency"],
      "model": ["bge-small-en-v1.5"]
    },
    "rate_limit_per_minute": 60,
    "max_calls_total": null
  },
  "issued_via": "federation|onboarding|manual|relay"
}
```

`sub` MAY be `"*"` for a bearer-style token (anyone with the token may use it). Used sparingly: only for federation proxies where the actual subject is unknown at issuance time.

### 3.3 Signature

The signature is `Ed25519(base64url(header) + "." + base64url(payload))`. Final form:

```
hntoken://v1/<base64url(header)>.<base64url(payload)>.<base64url(signature)>
```

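The envelope above can be sketched with the standard library alone. This is a minimal illustration, not the `tokens.py` implementation: the helper names (`b64url`, `signing_input`, `render`, `split`) are hypothetical, unpadded base64url and compact JSON are assumptions borrowed from JWS compact serialisation, and the Ed25519 signature is supplied by the caller rather than computed here.

```python
import base64
import json

PREFIX = "hntoken://v1/"


def b64url(data: bytes) -> str:
    """base64url without '=' padding (assumed, as in JWS compact form)."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url: restore the padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def signing_input(header: dict, payload: dict) -> bytes:
    """Bytes the issuer signs: base64url(header) + "." + base64url(payload)."""
    h = b64url(json.dumps(header, separators=(",", ":")).encode("utf-8"))
    p = b64url(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
    return f"{h}.{p}".encode("ascii")


def render(header: dict, payload: dict, signature: bytes) -> str:
    """Assemble the final 'hntoken://v1/<h>.<p>.<sig>' string."""
    signed = signing_input(header, payload).decode("ascii")
    return f"{PREFIX}{signed}.{b64url(signature)}"


def split(text: str) -> tuple[dict, dict, bytes, bytes]:
    """Structural parse only: returns (header, payload, signature, signed_bytes).

    The signature is NOT verified here; signed_bytes is what a verifier
    would check the signature against.
    """
    if not text.startswith(PREFIX):
        raise ValueError("token_malformed")
    parts = text[len(PREFIX):].split(".")
    if len(parts) != 3:
        raise ValueError("token_malformed")
    h, p, s = parts
    return (
        json.loads(b64url_decode(h)),
        json.loads(b64url_decode(p)),
        b64url_decode(s),
        f"{h}.{p}".encode("ascii"),
    )
```

Returning the exact signed bytes from `split` means a verifier never has to re-serialise the JSON, which would risk a byte-for-byte mismatch with what the issuer signed.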
Total length: ~600–800 bytes typical.

---

## 4. Public API

### 4.1 `hearthnet/identity/tokens.py`

```python
# hearthnet/identity/tokens.py
from __future__ import annotations  # lets the stubs reference project types lazily

from dataclasses import dataclass
from pathlib import Path

# KeyPair, CommunityManifest and EventLog come from other HearthNet modules
# (imports omitted in this spec).


@dataclass(frozen=True)
class TokenScope:
    capabilities: list[str]                    # e.g. ["rag.query@1.0"]
    params_constraints: dict[str, list[str]]   # e.g. {"corpus": ["..."]}
    rate_limit_per_minute: int
    max_calls_total: int | None


@dataclass(frozen=True)
class CapabilityToken:
    """The fully decoded token, ready for verification."""
    issuer: str
    subject: str            # "*" for bearer
    audience: str | None
    issued_at: int          # unix seconds
    expires_at: int
    not_before: int
    jti: str                # ULID
    scope: TokenScope
    issued_via: str         # "federation"|"onboarding"|...
    signature: bytes        # raw 64 bytes

    @property
    def is_bearer(self) -> bool: ...

    def is_active(self, now: int | None = None) -> bool: ...

    def covers(self, capability_name: str, version: tuple[int, int],
               params: dict | None = None) -> bool:
        """True iff scope includes the capability and (if params_constraints set) every requested param value is in the allow-list."""


def issue_token(
    issuer_kp: KeyPair,
    subject: str,
    scope: TokenScope,
    *,
    ttl_seconds: int = TOKEN_DEFAULT_TTL_SECONDS,
    audience: str | None = None,
    issued_via: str = "manual",
    not_before_offset: int = 0,
) -> tuple[CapabilityToken, str]:
    """Build, sign, encode. Returns (token, encoded_str)."""


def encode_token(tok: CapabilityToken, header_signature: bytes) -> str:
    """Render to 'hntoken://v1/...'."""


def decode_token(text: str) -> CapabilityToken:
    """Parse + structural validation only. Does NOT verify the signature.
    Raises TokenError on malformed input."""


def verify_token(
    tok: CapabilityToken,
    *,
    expected_audience: str | None = None,
    revocation_cache: 'RevocationCache | None' = None,
    now: int | None = None,
    community_manifest: CommunityManifest,
) -> None:
    """Verify signature against issuer's pubkey, expiry, nbf, audience,
    revocation, and that the issuer is currently a community member
    (not revoked at the issuer's community level).
    Raises TokenError with specific code."""


class RevocationCache:
    """In-memory + persisted (SQLite) cache of revoked JTIs.
    Authoritative source is the event log."""

    def __init__(self, db_path: Path):
        ...

    def add(self, jti: str, revoked_at: int) -> None: ...
    def is_revoked(self, jti: str) -> bool: ...
    def hydrate_from_log(self, event_log: EventLog) -> int:
        """Read all auth.token.revoked events; bring cache up to date.
        Returns rows added."""


class TokenError(Exception):
    """code in {
    'token_invalid','token_expired','token_not_yet_valid',
    'token_signature_bad','token_audience_mismatch',
    'token_revoked','token_scope_insufficient',
    'token_issuer_revoked','token_malformed'}"""
    code: str
```

### 4.2 `hearthnet/services/auth/service.py`

```python
# hearthnet/services/auth/service.py
from __future__ import annotations  # project types are referenced lazily

from collections.abc import Callable


class AuthService:
    """Registers auth.token.issue / revoke / introspect capabilities."""

    name = "auth"
    version = "1.0"

    def __init__(
        self,
        author_kp: KeyPair,
        event_log: EventLog,
        community_manifest_provider: Callable[[], CommunityManifest],
        revocation_cache: RevocationCache,
    ):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """Registers: auth.token.issue@1.0, auth.token.revoke@1.0, auth.token.introspect@1.0."""

    async def start(self) -> None:
        """Hydrate the revocation cache from event log."""

    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_issue(self, req: RouteRequest) -> dict:
        """CAP2 §4.5. Build a CapabilityToken, sign with author_kp, emit auth.token.issued event."""

    async def handle_revoke(self, req: RouteRequest) -> dict:
        """CAP2 §4.6. Verify caller is issuer (or 'trusted'). Append auth.token.revoked event."""

    async def handle_introspect(self, req: RouteRequest) -> dict:
        """CAP2 §4.7. Self-only. Returns active status and scope."""
```

---

## 5. Behaviour

### 5.1 Token-bearer call lifecycle

```
caller hits any capability endpoint with:
    X-HearthNet-Token: hntoken://v1/...
    (and optionally X-HearthNet-Signature)
        ↓
X01 transport extracts and decodes
        ↓
verify_token(...): signature, expiry, audience, revocation
        ↓
on success:
    caller_effective_identity = token.subject (or token.issuer if subject == "*")
    scope_check (does token cover this capability?)
        ↓
bus.handle_call() with the effective caller
        ↓
record token usage in metrics: hearthnet_token_calls_total{issuer, scope_match}
```

### 5.2 Co-existence with per-request signing

A request MAY carry both `X-HearthNet-Signature` and `X-HearthNet-Token`:

- Signature: proves *who* is making this exact call right now
- Token: proves they're *allowed* to (via delegation)

The token's `sub` MUST equal the signature's `From` NodeID, unless `sub == "*"`. Mismatch → `invalid_signature`.

This combination is the normal mode for federation: a federated peer's anchor signs the request with its own key (signature) AND carries a token, issued by its community's anchor, that delegates "rag.query is OK".

### 5.3 Issuance authority

A node may issue a token iff all of the following hold:

- The capabilities in scope are ones the issuer's community offers (or grants via federation)
- TTL ≤ `policy.capability_token_ttl_seconds` (community-wide policy bound)
- The issuer is a `member` (level ≥ member) of the community

The handler enforces these before signing.

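The three issuance conditions above could be checked along these lines. This is a sketch under stated assumptions: `issuance_violations` and `LEVEL_RANK` are hypothetical names, and the ranking of `trusted` above `member` is inferred from §5.4 rather than defined in this spec.

```python
# Hypothetical ranking; any level not listed is treated as below "member".
LEVEL_RANK = {"member": 1, "trusted": 2}


def issuance_violations(
    requested: list[str],
    offered: list[str],
    ttl_seconds: int,
    policy_ttl_seconds: int,
    issuer_level: str,
) -> list[str]:
    """Return the rules a token request breaks; an empty list means allowed."""
    problems = []
    missing = sorted(set(requested) - set(offered))
    if missing:
        problems.append(f"capabilities not offered: {missing}")
    if ttl_seconds > policy_ttl_seconds:
        problems.append("ttl exceeds policy bound")
    if LEVEL_RANK.get(issuer_level, 0) < LEVEL_RANK["member"]:
        problems.append("issuer below member level")
    return problems
```

Collecting all violations instead of failing on the first one makes it easier for the handler to return a single, complete error to the caller.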
### 5.4 Revocation

A token is revoked by appending `auth.token.revoked` to the event log:

- Issuer may revoke their own tokens
- A `trusted` member may revoke any token (operator override)
- The community root can revoke any token

Once the revoke event is in the log, all gossip-receiving nodes update their `RevocationCache`. Until that propagates, a revoked token may still be honoured briefly; the design accepts up to 60 seconds of lag.

### 5.5 Bearer tokens (`sub == "*"`)

Used sparingly:

- Federation proxy tokens: peer community gets one bearer token to make federated calls; rotation every 24h
- Mobile push tokens (M22): one bearer token tied to a `PushDeviceID`, longer TTL

Bearer tokens trade convenience for coarser-grained revocation. The `jti` is still unique, so a specific bearer token can still be killed.

### 5.6 Replay protection

Tokens are not single-use. Replay is mitigated by:
- Short TTL (default 1h)
- Audience binding (`aud` field): server rejects if `aud` ≠ ours
- Rate-limit budget (`scope.rate_limit_per_minute`)
- Revocation if abuse detected

For one-shot tokens (e.g. password-reset-style flows), set `max_calls_total: 1` and the server tracks usage via a per-jti counter.

### 5.7 Token-on-token (delegation chains)

Phase 2: **forbidden**. A token holder cannot issue new tokens. This avoids a delegation tree we cannot audit.

Phase 3 may add bounded delegation with a `delegates: int` counter.

---

## 6. Storage

### 6.1 Revocation cache table

```sql
CREATE TABLE IF NOT EXISTS token_revocations (
    jti          TEXT PRIMARY KEY,
    revoked_at   INTEGER NOT NULL,
    reason       TEXT,
    via_event_id TEXT
);
CREATE INDEX IF NOT EXISTS idx_revocations_time ON token_revocations(revoked_at);
```

### 6.2 Rate-limit counters

Counters are kept in memory as a sliding window per (jti, minute). They are persisted only when capacity-exceeded events fire (for audit).

---

## 7. Errors

`TokenError` → wire mapping:

| TokenError code | Wire code | HTTP |
|-----------------|-----------|------|
| `token_malformed` | `bad_request` | 400 |
| `token_invalid` | `token_invalid` | 401 |
| `token_signature_bad` | `token_invalid` | 401 |
| `token_expired` | `token_expired` | 410 |
| `token_not_yet_valid` | `token_expired` | 410 |
| `token_audience_mismatch` | `unauthorized` | 401 |
| `token_revoked` | `token_revoked` | 401 |
| `token_scope_insufficient` | `token_scope_insufficient` | 403 |
| `token_issuer_revoked` | `revoked` | 403 |

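The table above translates directly into a lookup. This sketch is illustrative: `WIRE_MAPPING` and `to_wire` are hypothetical names, and the `internal_error` / 500 fallback for unknown codes is an assumption that this section does not specify.

```python
# TokenError code -> (wire code, HTTP status), copied from the table above.
WIRE_MAPPING: dict[str, tuple[str, int]] = {
    "token_malformed": ("bad_request", 400),
    "token_invalid": ("token_invalid", 401),
    "token_signature_bad": ("token_invalid", 401),
    "token_expired": ("token_expired", 410),
    "token_not_yet_valid": ("token_expired", 410),
    "token_audience_mismatch": ("unauthorized", 401),
    "token_revoked": ("token_revoked", 401),
    "token_scope_insufficient": ("token_scope_insufficient", 403),
    "token_issuer_revoked": ("revoked", 403),
}


def to_wire(code: str) -> tuple[str, int]:
    """Map a TokenError code to its wire code and HTTP status."""
    # Fallback for unknown codes is an assumption, not part of the spec.
    return WIRE_MAPPING.get(code, ("internal_error", 500))
```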

---

## 8. Configuration

From [X04](../../cross-cutting/X04-config.md) (extension):

```python
config.auth.enabled = True
config.auth.token_default_ttl_seconds = TOKEN_DEFAULT_TTL_SECONDS
config.auth.token_max_ttl_seconds = TOKEN_MAX_TTL_SECONDS
config.auth.allow_bearer_tokens = True
config.auth.federated_only_bearer = True    # bearer tokens only issued for federation context
```

---

## 9. Tests

### Unit
- `test_token_encode_decode_roundtrip`
- `test_token_under_800_bytes`
- `test_token_signature_verified`
- `test_token_expired_rejected`
- `test_token_audience_mismatch_rejected`
- `test_token_scope_covers_exact_match`
- `test_token_scope_params_constraint_filtered`
- `test_revocation_event_updates_cache`
- `test_bearer_token_with_star_subject`

### Integration
- `test_federated_call_with_token_succeeds`
- `test_revoked_token_rejected_within_60_seconds`
- `test_rate_limit_per_token_enforced`
- `test_mobile_client_token_authenticates`

---

## 10. Cross-references

| What | Where |
|------|-------|
| Token wire format | [CAP2 §6.2](../CAPABILITY_CONTRACT_v2.md) |
| Token-bearer requests | [CAP2 §5.2](../CAPABILITY_CONTRACT_v2.md) |
| `auth.token.*` capabilities | [CAP2 §4.5–4.7](../CAPABILITY_CONTRACT_v2.md) |
| Used by federation | [M14 §5](M14-federation.md) |
| Used by relay tier | [M15 §4](M15-relay-tier.md) |
| Used by mobile client | [M22 §4](M22-mobile-native.md) |
| Phase 1 identity primitives | [M01](../../modules/M01-identity.md) |

---

## 11. Open questions

1. **Audience as community vs node**: Phase 2 uses the community as audience. Should a single-node audience be supported (one-call-to-one-node tokens)? Probably yes; it would add `aud_kind: "community"|"node"`. Deferred.
2. **JWE for confidential scope**: the scope is currently in cleartext, and some scope values are sensitive (corpus names). Wrap the payload in JWE? Deferred; out of scope for the token MVP.
3. **Hardware-bound tokens**: Phase 3 idea: a token bound to a TPM-attested device.
4. **Token-on-token (delegation)**: explicitly Phase 3.

docs/p2_p3/M17-ocr.md
ADDED
@@ -0,0 +1,305 @@
# M17: OCR Service

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs, for reading image/PDF inputs and storing extracted text), M11 (embedding, when integrating with M05 RAG), X04 (config), X03 (observability)
**Depended on by:** M05 RAG (ingest of scanned PDFs), M20 vision (img.describe can fall back to OCR for text-heavy images)

---

## 1. Responsibility

Provide `ocr.image@1.0` and `ocr.pdf@1.0`. Wrap several OCR backends so the bus can route between them by document type and language. The service is specifically engineered to handle:

- Modern German printed text (Tesseract)
- Handwriting (TrOCR / Microsoft Florence-OCR)
- Historical scripts: Sütterlin, Kurrent, Latin, Arabic, Cyrillic (Christof's multilingual harness)
- Mixed-language documents (auto-detection)

---

## 2. File layout

```
hearthnet/services/ocr/
├── __init__.py
├── service.py                 # OcrService
└── backends/
    ├── __init__.py
    ├── base.py                # OcrBackend Protocol
    ├── tesseract.py           # Tesseract via pytesseract
    ├── trocr.py               # Microsoft TrOCR via transformers
    ├── multilingual.py        # Christof's self-improving harness (CHURRO, olmOCR-2)
    └── florence_ocr.py        # Florence-2 OCR mode (overlap with M20)
```

---

## 3. Public API

### 3.1 `backends/base.py`

```python
# hearthnet/services/ocr/backends/base.py
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrBlock:
    text: str
    bbox: tuple[int, int, int, int]    # (x, y, w, h) in pixel coords
    confidence: float                  # 0..1
    language: str | None


@dataclass(frozen=True)
class OcrPageResult:
    page: int                          # 1-indexed
    text: str                          # concatenated, reading order
    blocks: list[OcrBlock]
    languages: list[str]               # detected, ordered by prevalence
    confidence_mean: float
    ms: int


class OcrBackend(Protocol):
    name: str                          # "tesseract" | "trocr" | "multilingual" | "florence_ocr"
    languages_supported: list[str]     # ISO 639-2 codes: "deu","eng","lat","ara","rus", ...
    supports_handwriting: bool
    max_image_pixels: int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def ocr_image(
        self,
        image_bytes: bytes,
        *,
        languages: list[str] | None,       # None → auto-detect
        preprocess: dict | None = None,    # {deskew, denoise, dpi}
    ) -> OcrPageResult: ...

    async def ocr_pdf_page(
        self,
        pdf_bytes: bytes,
        *,
        page: int,
        languages: list[str] | None,
        preprocess: dict | None = None,
    ) -> OcrPageResult: ...

    def health(self) -> dict: ...
```

### 3.2 Concrete backends

| File | Class | Notes |
|------|-------|-------|
| `backends/tesseract.py` | `TesseractBackend(min_confidence: float = 0.5)` | Languages: any installed traineddata. Subprocess via pytesseract. |
| `backends/trocr.py` | `TrocrBackend(model: str = "microsoft/trocr-large-handwritten", device: str = "auto")` | Handwriting; CUDA preferred. |
| `backends/multilingual.py` | `MultilingualHarnessBackend(model: str = "self-improving-ocr-v1", device: str = "auto", harness_dir: Path)` | Christof's harness (CHURRO, olmOCR-2, retrieval-augmented correction, Kurrent/Sütterlin/Latin/Arabic/Cyrillic). Configured via `harness_dir`. |
| `backends/florence_ocr.py` | `FlorenceOcrBackend(model: str = "microsoft/Florence-2-large")` | Reuses M20 vision backend in OCR mode. |

`MultilingualHarnessBackend` is the headline integration for Christof's existing work. It exposes the same `OcrBackend` interface and lets the harness's internal page-level VLMs do the heavy lifting.

### 3.3 `service.py`

```python
# hearthnet/services/ocr/service.py
from __future__ import annotations  # project types are referenced lazily

from collections.abc import AsyncIterator, Callable


class OcrService:
    name = "ocr"
    version = "1.0"

    def __init__(self, config: OcrConfig, blob_store: BlobStore, event_log: EventLog):
        self._backends: dict[str, OcrBackend] = self._build_backends(config)

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One ocr.image entry per backend; one ocr.pdf entry per backend.
        params include backend name and supported languages."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    # --- handlers ---

    async def handle_image(self, req: RouteRequest) -> dict:
        """CAP2 §4.8.
        1. Resolve image_cid via blob_store
        2. Pick backend from params.backend
        3. Run; build response"""

    async def handle_pdf(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.9.
        1. Resolve doc_cid
        2. For each page in page_range:
             emit 'progress' frame
             emit 'page' frame
        3. If store_text:true, write concatenated text as new blob; emit done with text_cid"""
```

### 3.4 `params_compatible` predicate

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    # backend must match if specified
    if "backend" in requested and requested["backend"] != offered.get("backend"):
        return False
    # all requested languages must be supported by this backend;
    # a missing list or the "auto" marker (see §4.1) constrains nothing
    requested_langs = set(requested.get("languages") or []) - {"auto"}
    offered_langs = set(offered.get("languages_supported", []))
    return requested_langs.issubset(offered_langs)
```

---

## 4. Behaviour

### 4.1 Auto-language detection

If `languages` is omitted or set to `["auto"]`:

1. Sample 3 random pages
2. Run lightweight script detection (Tesseract `osd`)
3. Choose the top 2 scripts
4. Re-run OCR with the language set derived from those scripts

Backends that don't support `osd` fall back to a fixed default (configured per backend).

### 4.2 Preprocessing pipeline

`preprocess` dict supports:
- `deskew: bool`: straighten image
- `denoise: bool`: bilateral filter
- `binarize: bool`: Otsu threshold
- `dpi: int`: target resolution; upscale if lower
- `contrast_normalise: bool`

Default: `{"deskew": true, "denoise": false}`. Heavy preprocessing slows ingest noticeably; enable it only per document.

### 4.3 Quality estimation

Each page result reports `confidence_mean`. Below 0.6, the service emits a `low_quality` warning frame and recommends:
- Trying a different backend (e.g. switch from Tesseract to multilingual harness for historic text)
- Raising DPI
- Re-scanning

### 4.4 Integration with RAG

[M05 §10 open question 4](../../modules/M05-rag.md) is now answered:

```
RagService.handle_ingest receives a scanned PDF (mime_type=image/scanned-pdf or detected)
  → bus.call("ocr.pdf", (1,0), {input:{doc_cid:..., store_text:true}})
  → receive text_cid
  → ingest the text_cid blob (which is now extracted plaintext) as normal
  → emit rag.document.ingested event (with metadata noting ocr_backend used)
```

The OCR text is stored as a separate blob, content-addressed. Re-ingestion is idempotent.

### 4.5 Page-range and parallelism

`page_range: [1, 50]` lets callers process partial documents. Pages are OCR'd serially within one call. For very large PDFs, callers should split the document into ranges and call concurrently; the bus enforces per-capability concurrency.

`OCR_MAX_PAGES_PER_REQUEST = 50` is the hard ceiling per call.

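Splitting a large document into request-sized ranges, as suggested above, is a small calculation. In this sketch `split_page_range` is a hypothetical caller-side helper, and ranges are assumed to be inclusive `[first, last]` pairs in the same form as `page_range`.

```python
OCR_MAX_PAGES_PER_REQUEST = 50  # hard ceiling per call, from the text above


def split_page_range(first: int, last: int,
                     size: int = OCR_MAX_PAGES_PER_REQUEST) -> list[list[int]]:
    """Split the inclusive range [first, last] into chunks of at most `size` pages."""
    if first < 1 or last < first or size < 1:
        raise ValueError("invalid page range")
    return [
        [start, min(start + size - 1, last)]
        for start in range(first, last + 1, size)
    ]


# A 120-page PDF becomes three ocr.pdf calls.
ranges = split_page_range(1, 120)
```

Each resulting pair can be passed as `page_range` in its own `ocr.pdf` call, and the calls can then run concurrently up to whatever limit the bus enforces.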
### 4.6 PDF text-layer detection

Before OCR'ing, the service checks whether the PDF has an extractable text layer (via `pypdf`). If it does and a heuristic judges the extracted text to be of decent quality, the service returns the text-layer content directly, which is much cheaper than OCR. The caller can force OCR with `force_ocr: true`.

### 4.7 Christof's multilingual harness integration

The `MultilingualHarnessBackend` wraps Christof's existing self-improving OCR pipeline:

- Internal models: CHURRO (page-level VLM), olmOCR-2 (page-level VLM)
- Retrieval-augmented correction over a script-specific corpus
- Kurrent + Sütterlin support for German historical documents
- Latin / Arabic / Cyrillic script recognition

Configuration:

```python
config.ocr.multilingual_harness_dir = Path("/srv/ocr-harness")
config.ocr.multilingual_max_pages_concurrent = 2
```

The harness is GPU-intensive. On CPU-only nodes, it deregisters itself at startup.

---

## 5. Storage and lifecycle

- Input image/PDF: fetched from blob store via CID
- Output text: optionally stored as a new blob (`store_text: true`)
- Side-effect: `ocr.document.indexed` event in the community log (carries text_cid for downstream replication)

OCR backends do NOT cache results inside themselves. Reuse comes from caching at the RAG/blob layer (same `doc_cid` → already-extracted-text blob).

---

## 6. Errors

| Condition | Wire code |
|-----------|-----------|
| Unknown backend | `not_found` |
| Languages not supported by any backend | `bad_request` |
| Image too large (> `max_image_pixels`) | `bad_request` |
| Page-range exceeds document | `bad_request` |
| More than `OCR_MAX_PAGES_PER_REQUEST` pages requested | `bad_request` |
| Backend crash | `internal_error` |
| GPU OOM on multilingual | `capacity_exceeded` (with retry_after) |

---

## 7. Configuration

```python
config.ocr.enabled = True
config.ocr.backends = [
    OcrBackendConfig(name="tesseract", languages=["deu","eng","fra","lat"]),
    OcrBackendConfig(name="trocr", model="microsoft/trocr-large-handwritten"),
    OcrBackendConfig(name="multilingual", harness_dir=Path("/srv/ocr-harness")),
]
config.ocr.default_dpi = OCR_DEFAULT_DPI    # 300
config.ocr.max_pages_per_request = OCR_MAX_PAGES_PER_REQUEST
config.ocr.text_layer_first = True
```

Constants: `OCR_DEFAULT_DPI`, `OCR_MAX_PAGES_PER_REQUEST`.

---

## 8. Tests

### Unit
- `test_descriptor_schema_validates_meta_schema`
- `test_params_compatible_language_subset`
- `test_text_layer_short_circuits_when_present`
- `test_force_ocr_bypasses_text_layer`
- `test_low_quality_emits_warning_frame`

### Integration
- `test_tesseract_german_print` (with a known sample)
- `test_trocr_handwriting_sample`
- `test_multilingual_kurrent_sample` (if harness installed)
- `test_rag_ingest_scanned_pdf_end_to_end`
- `test_ocr_pdf_progress_frames`

---

## 9. Cross-references

| What | Where |
|------|-------|
| `ocr.*` wire | [CAP2 §4.8–4.9](../CAPABILITY_CONTRACT_v2.md) |
| Blob store dependency | [M07 §3](../../modules/M07-file-blobs.md) |
| RAG integration | [M05 §10 q4](../../modules/M05-rag.md) (now resolved) |
| Vision overlap (Florence-2 OCR mode) | [M20 §4.3](M20-vision.md) |
| `ocr.document.indexed` event | [CAP2 §7.1](../CAPABILITY_CONTRACT_v2.md) |
| Christof's harness | external project; this module is the integration surface |

---

## 10. Open questions

1. **Multilingual harness auto-update.** The harness self-improves; should the model versions be event-logged so we can replay deterministically? Yes: record the harness version hash in each `ocr.document.indexed` event.
2. **Manuscript-quality preprocessing.** Some historic documents need bespoke preprocessing (e.g. ink-bleed removal). Phase 2.5 might add a `preprocess_profile` enum.
3. **Reading order from layout.** Currently we trust the backend's reading order. For multi-column documents, an explicit layout model (LayoutLMv3) might help. Phase 3.
4. **Streaming OCR for very large images.** Large images are currently processed atomically. Tiling the image and streaming partial results is possible, but deferred.
docs/p2_p3/M18-translation.md
ADDED
@@ -0,0 +1,225 @@
# M18 - Translation Service

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), X04 (config), X03 (observability), `transformers`, `torch`
**Depended on by:** UI marketplace + chat (one-click translate), M19 STT (with `translate_to_en=true`)

---

## 1. Responsibility

Provide `trans.text@1.0`. Translate between languages, with strong emphasis on:
- German ↔ English (default)
- German ↔ Plattdeutsch (Niederrhein-specific, Christof's domain)
- Major European languages
- Optionally Arabic, Turkish, Russian, Ukrainian (useful in refugee-context emergencies)

---

## 2. File layout

```
hearthnet/services/translation/
├── __init__.py
├── service.py
└── backends/
    ├── __init__.py
    ├── base.py
    ├── nllb.py             # facebook/nllb-200-distilled-600M
    └── plattdeutsch.py     # specialised fine-tune, optional
```

---

## 3. Public API

### 3.1 `backends/base.py`

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TranslationResult:
    text: str
    from_lang: str      # ISO 639-1, or ISO 639-3 where no two-letter code exists (e.g. "nds")
    to_lang: str
    confidence: float   # 0..1 if backend supports; else 1.0 placeholder
    ms: int             # elapsed milliseconds


class TranslationBackend(Protocol):
    name: str
    languages_pairs: list[tuple[str, str]]   # supported (from, to) pairs
    max_chars: int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def translate(
        self,
        text: str,
        *,
        from_lang: str,        # "auto" supported
        to_lang: str,
        domain: str | None,
    ) -> TranslationResult: ...

    def detect_language(self, text: str) -> str | None: ...

    def health(self) -> dict: ...
```

### 3.2 Concrete backends

```python
class NllbBackend(TranslationBackend):
    """facebook/nllb-200-distilled-600M (or larger variants).
    Covers roughly 200 languages, any-to-any, out of the box."""

    def __init__(
        self,
        model: str = "facebook/nllb-200-distilled-600M",
        device: str = "auto",
        max_chars: int = TRANSLATION_MAX_CHARS,
    ):
        ...

class PlattdeutschBackend(TranslationBackend):
    """Optional specialised fine-tune.
    If a Plattdeutsch fine-tune is present in models_dir, registers the de↔nds pairs.
    Otherwise no-op (the backend reports zero language pairs and is filtered out)."""

    def __init__(
        self,
        models_dir: Path,
        device: str = "auto",
    ):
        ...
```

### 3.3 `service.py`

```python
class TranslationService:
    name = "translation"
    version = "1.0"

    def __init__(self, config: TranslationConfig):
        self._backends: list[TranslationBackend] = self._build_backends(config)

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One trans.text entry per backend. params declare languages_pairs."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_translate(self, req: RouteRequest) -> dict:
        """CAP2 §4.10."""
```

### 3.4 `params_compatible` predicate

```python
def params_compatible(offered: dict, requested: dict) -> bool:
    if "backend" in requested and requested["backend"] != offered.get("backend"):
        return False
    # Normalise to tuples: if the descriptor was decoded from JSON, pairs arrive
    # as lists, and a tuple never compares equal to a list.
    offered_pairs = [tuple(p) for p in offered.get("languages_pairs", [])]
    pair = (requested.get("from"), requested.get("to"))
    if pair[0] == "auto":
        # auto-detect; backend must support at least one source → target pair
        return any(t == pair[1] for (_, t) in offered_pairs)
    return pair in offered_pairs
```

---

## 4. Behaviour

### 4.1 Auto-detection

`from: "auto"`:
1. Call `detect_language(text)`. NLLB itself expects an explicit source-language code, so the backend has to run a separate language-identification step here.
2. Substitute the detected language for `"auto"`; if detection fails, the request is rejected with `bad_request` (see the errors table)
3. Translate

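
The steps above can be sketched as a small helper. This is an illustration only: `resolve_source_lang`, `DetectionFailed`, and the toy keyword detector are hypothetical names, and a real backend would plug in an actual language-identification model.

```python
from typing import Callable, Optional


class DetectionFailed(ValueError):
    """Hypothetical error; the service would map it to the `bad_request` wire code."""


def resolve_source_lang(
    text: str,
    from_lang: str,
    detect_language: Callable[[str], Optional[str]],
) -> str:
    """Steps 1-2: replace "auto" with the detected language, or fail loudly."""
    if from_lang != "auto":
        return from_lang
    detected = detect_language(text)
    if detected is None:
        raise DetectionFailed("could not detect source language")
    return detected


def toy_detector(text: str) -> Optional[str]:
    # Toy stand-in for a real language-identification model.
    padded = f" {text.lower()} "
    if " und " in padded or " der " in padded:
        return "de"
    if " the " in padded:
        return "en"
    return None


detected = resolve_source_lang("Brot und Butter", "auto", toy_detector)
explicit = resolve_source_lang("anything", "fr", toy_detector)
```

An explicit `from_lang` bypasses detection entirely, so callers that already know the source language pay no extra cost.
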

### 4.2 Domain hints

`domain: "everyday" | "medical" | "legal" | "emergency"` is advisory. NLLB ignores it; specialised fine-tunes may use it.

### 4.3 Niederrhein focus

`PlattdeutschBackend` is Christof's local interest. When installed:

- Registers pairs `("de", "nds")` and `("nds", "de")`
- Optionally `("en", "nds")` if the fine-tune extends to English
- Used by the marketplace UI's "auf Platt" button in the [M08 settings](../../modules/M08-ui.md) extension

### 4.4 Length limits

- Single request: ≤ `TRANSLATION_MAX_CHARS` (4000)
- For longer texts, callers chunk by paragraph and recombine

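
One way a caller might do that chunking is sketched below. It assumes paragraphs are separated by blank lines and that no single paragraph exceeds the limit; `chunk_by_paragraph` and `recombine` are hypothetical helper names, not part of the service API.

```python
TRANSLATION_MAX_CHARS = 4000  # value quoted in the length limit above


def chunk_by_paragraph(text: str, max_chars: int = TRANSLATION_MAX_CHARS) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if len(para) > max_chars:
            raise ValueError("single paragraph exceeds max_chars; split it further")
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)  # current is full; start a new chunk
            current = para
    if current:
        chunks.append(current)
    return chunks


def recombine(translated_chunks: list[str]) -> str:
    """Join translated chunks back with the same paragraph separator."""
    return "\n\n".join(translated_chunks)


sample = "aaaa\n\nbbbb\n\ncc"
chunks = chunk_by_paragraph(sample, max_chars=10)
```

Each chunk is then sent as its own `trans.text` request and the results are joined with `recombine`.
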

### 4.5 Batching

Internally, requests that arrive within a 100 ms window are batched, up to 8 strings per forward pass, to improve GPU utilisation. Results are demultiplexed back to the individual requests, so batching is transparent to callers.

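
A minimal sketch of such a micro-batcher using `asyncio`, under the assumption that the backend exposes some batch translation coroutine. `MicroBatcher`, `submit`, `run`, and `fake_translate_batch` are hypothetical names; error handling and shutdown are left out.

```python
import asyncio
from typing import Awaitable, Callable

BatchFn = Callable[[list[str]], Awaitable[list[str]]]


class MicroBatcher:
    """Collects requests for up to window_s seconds or max_batch items."""

    def __init__(self, translate_batch: BatchFn, window_s: float = 0.1, max_batch: int = 8):
        self._translate_batch = translate_batch
        self._window_s = window_s
        self._max_batch = max_batch
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str) -> str:
        """Called once per request; resolves when its batch has been processed."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first item arrives
            deadline = loop.time() + self._window_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = await self._translate_batch([text for text, _ in batch])
            for (_, fut), result in zip(batch, results):  # demultiplex
                fut.set_result(result)


async def _demo() -> tuple[list[str], list[int]]:
    batch_sizes: list[int] = []

    async def fake_translate_batch(texts: list[str]) -> list[str]:
        batch_sizes.append(len(texts))  # one simulated forward pass
        return [t.upper() for t in texts]

    batcher = MicroBatcher(fake_translate_batch)
    worker = asyncio.create_task(batcher.run())
    outputs = await asyncio.gather(*(batcher.submit(t) for t in ["a", "b", "c"]))
    worker.cancel()
    return list(outputs), batch_sizes


outputs, batch_sizes = asyncio.run(_demo())
```

Each caller only awaits `submit`, which is what keeps the batching invisible from the outside.
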

### 4.6 Caching

In-memory LRU cache `(text_hash, from, to) → result`, max 10k entries. This is a big win for the marketplace UI, which re-translates the same posts on every refresh.

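
A sketch of a cache with that key shape, built on `collections.OrderedDict`. The class name `TranslationCache` and the choice of SHA-256 for `text_hash` are assumptions made for illustration; the spec does not fix a hash function.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class TranslationCache:
    """LRU cache keyed by (text_hash, from, to)."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._max_entries = max_entries
        self._entries: OrderedDict = OrderedDict()

    @staticmethod
    def _key(text: str, from_lang: str, to_lang: str) -> tuple[str, str, str]:
        # Assumption: SHA-256 of the UTF-8 text serves as text_hash.
        return (hashlib.sha256(text.encode("utf-8")).hexdigest(), from_lang, to_lang)

    def get(self, text: str, from_lang: str, to_lang: str) -> Optional[str]:
        key = self._key(text, from_lang, to_lang)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, text: str, from_lang: str, to_lang: str, result: str) -> None:
        key = self._key(text, from_lang, to_lang)
        self._entries[key] = result
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry


cache = TranslationCache(max_entries=2)
cache.put("Moin", "nds", "de", "Hallo")
cache.put("Brot", "de", "en", "bread")
hit = cache.get("Moin", "nds", "de")      # refreshes "Moin"
cache.put("Haus", "de", "en", "house")    # evicts "Brot", the oldest entry
```

Hashing the text keeps keys small even when posts are long.
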

---

## 5. Errors

| Condition | Wire code |
|-----------|-----------|
| Pair not supported by any backend | `not_found` |
| Text too long | `bad_request` |
| Detection failed | `bad_request` |
| Backend OOM | `capacity_exceeded` |

---

## 6. Configuration

```python
config.translation.enabled = True
config.translation.backends = [
    TranslationBackendConfig(name="nllb", model="facebook/nllb-200-distilled-600M", device="auto"),
    # Path() does not expand "~" on its own, hence expanduser().
    TranslationBackendConfig(name="plattdeutsch", models_dir=Path("~/.hearthnet/models/plattdeutsch").expanduser()),
]
```

Constants: `TRANSLATION_MAX_CHARS`.

---

## 7. Tests

### Unit
- `test_descriptor_schema_validates`
- `test_params_compatible_pair_must_match`
- `test_auto_detect_substitutes_source_lang`
- `test_text_too_long_rejected`
- `test_cache_hit_returns_immediately`

### Integration
- `test_german_to_english_quality` (BLEU above floor)
- `test_plattdeutsch_pair_registered_when_finetune_present`
- `test_marketplace_one_click_translate_end_to_end`

---

## 8. Cross-references

| What | Where |
|------|-------|
| `trans.text@1.0` wire | [CAP2 §4.10](../CAPABILITY_CONTRACT_v2.md) |
| STT translate-to-EN feature | [M19 §4.3](M19-stt-tts.md) |
| Marketplace one-click | [M08 ext](../../modules/M08-ui.md) |
| Niederrhein context | Christof's domain |

---

## 9. Open questions

1. **Fine-tune in-the-loop.** A community could fine-tune the Plattdeutsch model on its own corpus over time. Reserved.
2. **Document-level translation.** Currently per-string. Document-coherence translation (better than chunked) is Phase 3.
3. **Glossary support.** Domain glossaries (technical terms, names) preserved across translation. Phase 2.5.
docs/p2_p3/M19-stt-tts.md
ADDED
@@ -0,0 +1,331 @@
# M19 - Speech I/O (STT + TTS)

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs, for audio I/O), X04 (config), X03 (observability), `openai-whisper`, `TTS` (Coqui XTTS-v2), `edge-tts` libs
**Depended on by:** M08 UI (voice query button), M22 mobile (voice notes), M18 (STT can chain into translation)

---

## 1. Responsibility

Two capabilities:

- `stt.transcribe@1.0`: audio → text, with optional translate-to-English
- `tts.synthesize@1.0`: text → audio

The two services live in the same module because they share the speech domain and are often paired (voice query → STT → LLM → TTS).

---

## 2. File layout

```
hearthnet/services/speech/
├── __init__.py
├── stt_service.py
├── tts_service.py
└── backends/
    ├── __init__.py
    ├── base.py             # SttBackend, TtsBackend protocols
    ├── whisper.py          # OpenAI Whisper local
    ├── whisper_remote.py   # HF inference API alternative
    ├── xtts.py             # Coqui XTTS-v2 (cloned voices)
    └── edge_tts.py         # Microsoft Edge-TTS (Christof has existing pipeline)
```

---

## 3. STT: public API

### 3.1 `backends/base.py` (STT)

```python
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class SttSegment:
    start_seconds: float
    end_seconds: float
    text: str
    language: str
    speaker: str | None          # only if diarization enabled
    confidence: float | None


@dataclass(frozen=True)
class SttResult:
    segments: list[SttSegment]
    language: str
    duration_seconds: float
    ms: int


class SttBackend(Protocol):
    name: str
    models: list[str]                # "tiny" | "base" | "small" | "medium" | "large-v3"
    languages_supported: list[str]   # ISO 639-1
    supports_diarization: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    # Declared without `async`: implementations are async generators, so the
    # call itself returns the iterator and is consumed with `async for`.
    def transcribe(
        self,
        audio_bytes: bytes,
        *,
        model: str,
        language: str | None,        # "auto" handled by caller
        diarize: bool,
        translate_to_en: bool,
    ) -> AsyncIterator[SttSegment]:
        """Yields segments as they are produced. Backend may produce in big chunks
        or near-realtime depending on model + hardware."""

    def health(self) -> dict: ...
```

### 3.2 `stt_service.py`

```python
class SttService:
    name = "stt"
    version = "1.0"

    def __init__(self, config: SpeechConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One stt.transcribe per (backend, model) combo."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_transcribe(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.11.
        1. Fetch audio blob by CID
        2. Verify duration ≤ STT_MAX_AUDIO_SECONDS
        3. Stream segments
        4. Emit done with total stats"""
```

### 3.3 Concrete STT backends

```python
class WhisperBackend(SttBackend):
    """Local Whisper via openai-whisper or faster-whisper."""

    def __init__(self, models_dir: Path, default_model: str = "large-v3", device: str = "auto"):
        ...

class WhisperRemoteBackend(SttBackend):
    """HF Inference API. requires_internet=True. Used as fallback when local Whisper not available."""

    def __init__(self, model: str = "openai/whisper-large-v3", token_env: str = "HF_TOKEN"):
        ...
```

---

## 4. TTS: public API

### 4.1 `backends/base.py` (TTS)

```python
@dataclass(frozen=True)
class TtsResult:
    audio_format: str            # "ogg_vorbis" | "mp3" | "wav"
    sample_rate: int             # Hz
    duration_seconds: float
    total_bytes: int
    ms: int


class TtsBackend(Protocol):
    name: str
    voices: list[str]
    languages_supported: list[str]
    formats_supported: list[str]
    cloned_voices_supported: bool

    async def warm(self, voice: str) -> None: ...
    async def close(self) -> None: ...

    # Declared without `async`: implementations are async generators, so the
    # call itself returns the iterator and is consumed with `async for`.
    def synthesize(
        self,
        text: str,
        *,
        voice: str,
        language: str,
        speed: float,                # 0.5..2.0; 1.0 default
        output_format: str,          # "ogg_vorbis"|"mp3"|"wav"
        chunk_size_bytes: int = 16384,
    ) -> AsyncIterator[bytes]:
        """Yields raw audio chunks."""

    def health(self) -> dict: ...
```

### 4.2 `tts_service.py`

```python
class TtsService:
    name = "tts"
    version = "1.0"

    def __init__(self, config: SpeechConfig):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One tts.synthesize per (backend, voice) pair (or backend-only if many voices)."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_synthesize(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.12.
        1. Validate text length ≤ TTS_MAX_TEXT_CHARS
        2. Pick backend and voice
        3. Stream chunks (base64 in 'chunk' frame)
        4. Emit done with metadata"""
```
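
Steps 3 and 4 of `handle_synthesize` could look roughly like the sketch below. It is synchronous for brevity, and the frame keys (`type`, `data`, `audio_format`, `total_bytes`) are assumptions chosen for illustration; the authoritative frame schema lives in CAP2 §4.12.

```python
import base64
from typing import Iterable, Iterator


def to_frames(chunks: Iterable[bytes], audio_format: str) -> Iterator[dict]:
    """Wrap raw audio chunks as 'chunk' frames, then emit one 'done' frame."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        yield {"type": "chunk", "data": base64.b64encode(chunk).decode("ascii")}
    yield {"type": "done", "audio_format": audio_format, "total_bytes": total}


def from_frames(frames: Iterable[dict]) -> bytes:
    """Client side: decode and concatenate the chunk payloads."""
    return b"".join(base64.b64decode(f["data"]) for f in frames if f["type"] == "chunk")


frames = list(to_frames([b"\x00\x01", b"\x02"], "ogg_vorbis"))
audio = from_frames(frames)
```

Base64 keeps the binary audio safe inside text-based frames, at the cost of roughly one third more bytes on the wire.
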

### 4.3 Concrete TTS backends

```python
class XttsBackend(TtsBackend):
    """Coqui XTTS-v2 (Christof has the pipeline from his podcast generator).
    Supports voice cloning via reference audio."""

    def __init__(
        self,
        model: str = "tts_models/multilingual/multi-dataset/xtts_v2",
        voices_dir: Path = Path("~/.hearthnet/voices").expanduser(),
        device: str = "auto",
    ):
        ...

class EdgeTtsBackend(TtsBackend):
    """Microsoft Edge-TTS: requires internet, many voices, very natural.
    Used as default when xtts is too slow on a node."""

    def __init__(self, default_voice: str = "de-DE-KatjaNeural"):
        ...
```

---

## 5. Behaviour

### 5.1 STT streaming

For long audio:
- Local Whisper produces segments incrementally (roughly real time on an RTX 4090, slower on CPU)
- Service emits one SSE `segment` frame per finalised segment
- Final `done` frame includes the total duration and the detected language

### 5.2 STT max length

`STT_MAX_AUDIO_SECONDS = 300`. For longer audio, the caller splits the recording into chunks of at most 5 minutes and concatenates the results. Keeping speaker labels consistent across chunks is the caller's responsibility.

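
A caller-side sketch of that splitting and merging, assuming fixed-length windows and segment times that are relative to each chunk. `Segment`, `plan_windows`, and `merge_results` are hypothetical names; a real client might prefer to cut at silences so that words are not split.

```python
from dataclasses import dataclass, replace

STT_MAX_AUDIO_SECONDS = 300


@dataclass(frozen=True)
class Segment:  # trimmed-down stand-in for SttSegment
    start_seconds: float
    end_seconds: float
    text: str


def plan_windows(duration_seconds: float, max_seconds: float = STT_MAX_AUDIO_SECONDS) -> list[tuple[float, float]]:
    """Split [0, duration) into consecutive windows of at most max_seconds."""
    windows: list[tuple[float, float]] = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + max_seconds, duration_seconds)
        windows.append((start, end))
        start = end
    return windows


def merge_results(per_window: list[tuple[float, list[Segment]]]) -> list[Segment]:
    """Shift each window's segment times by the window offset and concatenate."""
    merged: list[Segment] = []
    for offset, segments in per_window:
        for seg in segments:
            merged.append(
                replace(
                    seg,
                    start_seconds=seg.start_seconds + offset,
                    end_seconds=seg.end_seconds + offset,
                )
            )
    return merged


windows = plan_windows(650.0)
merged = merge_results([
    (0.0, [Segment(1.0, 2.0, "a")]),
    (300.0, [Segment(0.5, 1.5, "b")]),
])
```

Speaker labels are deliberately left untouched here, since matching speakers across chunks needs information this sketch does not have.
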

### 5.3 Voice cloning (XTTS)

`XttsBackend` supports voice cloning when given a reference audio file:

```python
config.speech.cloned_voices = [
    ClonedVoiceConfig(name="hannes_v1", reference_path=Path("~/.hearthnet/voices/hannes-3s.wav").expanduser()),
]
```

Each cloned voice is registered as a separate `voice` entry in the descriptor params. Cloning runs once at startup, so later requests are served quickly.

**Privacy note:** Voice cloning is powerful and risky. Communities SHOULD restrict by policy who can register cloned voices (suggested: `trust_required="anchor"` for voice cloning). The MVP allows any member to register one; the risk must be documented.

### 5.4 Audio format negotiation

- Input STT: any common format Whisper accepts (mp3, ogg, wav, m4a). Service normalises via `ffmpeg`.
- Output TTS: `ogg_vorbis` default (smallest), `mp3` widely compatible, `wav` lossless.

### 5.5 Edge-TTS internet dependency

`EdgeTtsBackend` requires internet. It is deregistered automatically by [M09](../../modules/M09-emergency.md) when the node is offline. The local XTTS backend continues to work.

### 5.6 STT → TTS chain (voice assistant pattern)

The voice query button in the M08 UI extension:
```
mic → audio blob via M07 → stt.transcribe → text
text → llm.chat → response text
response text → tts.synthesize → audio chunks → speaker
```

This is composed at the UI layer, not internally in the speech services.

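
A sketch of what that UI-layer composition might look like, with the three capabilities passed in as plain callables. `voice_query` and the `fake_*` stubs are hypothetical; a real client would issue `stt.transcribe`, `llm.chat`, and `tts.synthesize` requests over the bus instead.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

Transcribe = Callable[[str], Awaitable[str]]        # audio CID -> text
Chat = Callable[[str], Awaitable[str]]              # prompt -> reply text
Synthesize = Callable[[str], AsyncIterator[bytes]]  # text -> audio chunks


async def voice_query(
    audio_cid: str,
    transcribe: Transcribe,
    chat: Chat,
    synthesize: Synthesize,
) -> AsyncIterator[bytes]:
    """Chain the three capabilities; audio chunks are yielded as they arrive."""
    text = await transcribe(audio_cid)
    reply = await chat(text)
    async for chunk in synthesize(reply):
        yield chunk


async def fake_transcribe(audio_cid: str) -> str:
    return "wie geht es"


async def fake_chat(prompt: str) -> str:
    return f"Antwort auf: {prompt}"


async def fake_synthesize(text: str) -> AsyncIterator[bytes]:
    # Stand-in for real audio: one "chunk" per word.
    for word in text.split():
        yield word.encode("utf-8")


async def _demo() -> list[bytes]:
    return [c async for c in voice_query("cid-audio", fake_transcribe, fake_chat, fake_synthesize)]


chunks = asyncio.run(_demo())
```

Because the chain is an async generator, playback can start on the first audio chunk instead of waiting for the full synthesis.
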

### 5.7 Christof's existing pipeline reuse

Christof has an established XTTS-v2 + Edge-TTS podcast generator pipeline. The `XttsBackend` and `EdgeTtsBackend` are designed to be drop-ins for that pipeline, sharing the same models directory.

---

## 6. Errors

| Condition | Wire code |
|-----------|-----------|
| Audio > `STT_MAX_AUDIO_SECONDS` | `bad_request` |
| Text > `TTS_MAX_TEXT_CHARS` | `bad_request` |
| Unknown voice | `not_found` |
| Audio decode failed (corrupt blob) | `bad_request` |
| Backend GPU OOM | `capacity_exceeded` |

---

## 7. Configuration

```python
config.speech.enabled = True
config.speech.stt_backends = [
    SttBackendConfig(name="whisper", default_model="large-v3", device="auto"),
]
config.speech.tts_backends = [
    # Path() does not expand "~" on its own, hence expanduser().
    TtsBackendConfig(name="xtts", voices_dir=Path("~/.hearthnet/voices").expanduser()),
    TtsBackendConfig(name="edge_tts", default_voice="de-DE-KatjaNeural"),
]
config.speech.cloned_voices = []  # list[ClonedVoiceConfig]
```

Constants: `STT_MAX_AUDIO_SECONDS`, `TTS_MAX_TEXT_CHARS`.

---

## 8. Tests

### Unit
- `test_stt_descriptor_per_model`
- `test_tts_descriptor_per_voice`
- `test_stt_max_duration_rejected`
- `test_tts_max_length_rejected`

### Integration
- `test_whisper_transcribes_de_audio` (test asset)
- `test_xtts_synthesises_then_decodes_to_correct_duration`
- `test_voice_chain_stt_llm_tts` (end-to-end)
- `test_edge_tts_deregistered_when_offline`

---

## 9. Cross-references

| What | Where |
|------|-------|
| `stt.transcribe@1.0` wire | [CAP2 §4.11](../CAPABILITY_CONTRACT_v2.md) |
| `tts.synthesize@1.0` wire | [CAP2 §4.12](../CAPABILITY_CONTRACT_v2.md) |
| Voice query UI | M08 ext |
| Mobile voice notes | [M22 §4](M22-mobile-native.md) |
| Translation chain | [M18](M18-translation.md) |
| Emergency dereg for internet-bound backends | [M09 §5.2](../../modules/M09-emergency.md) |

---

## 10. Open questions

1. **Streaming STT (mic input → live caption).** Phase 2.5. Requires WebSocket and a different backend init pattern.
2. **Real-time TTS (sub-100 ms first audio).** XTTS needs 500 ms or more; piper-tts is fast but offers only a limited set of voices. Phase 3.
3. **Speaker enrollment.** An explicit "this is who I am" speech sample so diarization can label by name. Phase 2.5.
4. **Audio at-rest privacy.** Should voice notes be end-to-end encrypted? [M23](M23-e2e-encryption.md) supports it; default ON for chat attachments.
docs/p2_p3/M20-vision.md
ADDED
@@ -0,0 +1,369 @@
# M20 - Vision Services

**Spec version:** v1.0 (Phase 2)
**Depends on:** M03 (bus), M07 (blobs), M04 (LLM, extended), X04 (config), X03 (observability)
**Depended on by:** M08 UI (image describe in ask tab; generate in tools), M21 tools (vision used by tool-augmented LLM), M17 OCR (Florence-2 OCR mode shared)

---

## 1. Responsibility

Two capability families:

- `img.describe@1.0`: given an image CID, produce a caption, tags, object list, or OCR
- `img.generate@1.0`: given a prompt, generate an image

Plus: extend the LLM `llm.chat@2.0` request schema with **multimodal content** (text + `image_cid` in messages). The multimodal path goes through M04 backends that declare `modalities: ["text", "vision"]`. M20 is responsible for providing the vision backends those LLMs depend on.

Christof's existing pipelines: Florence-2 (describe), FLUX.1-dev with LoRAs (generate), MiniCPM-V (multimodal). All three are wired in.

---

## 2. File layout

```
hearthnet/services/image/
├── __init__.py
├── describe_service.py
├── generate_service.py
└── backends/
    ├── __init__.py
    ├── base.py               # ImageDescribeBackend, ImageGenerateBackend
    ├── florence2.py          # Microsoft Florence-2 large
    ├── minicpm_v.py          # OpenBMB MiniCPM-V (also usable for chat-with-vision)
    ├── flux.py               # black-forest-labs FLUX.1-dev with LoRA support
    └── stable_diffusion.py   # Optional SD-XL fallback
```

---

## 3. Public API: describe

### 3.1 `backends/base.py`

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ImageDescription:
    caption: str
    detailed_caption: str | None
    tags: list[str]
    objects: list[dict]          # [{label, bbox, confidence}]
    ocr_text: str | None
    language: str
    ms: int


class ImageDescribeBackend(Protocol):
    name: str
    tasks_supported: list[str]   # subset of {"caption","detailed_caption","ocr","objects","tags"}
    languages: list[str]
    max_pixels: int

    async def warm(self) -> None: ...
    async def close(self) -> None: ...

    async def describe(
        self,
        image_bytes: bytes,
        *,
        task: str,
        language: str = "en",
    ) -> ImageDescription: ...

    def health(self) -> dict: ...
```

### 3.2 `describe_service.py`
|
| 76 |
+
|
| 77 |
+
```python
class ImageDescribeService:
    name = "image.describe"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.describe per backend. Params include backend name and tasks_supported."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_describe(self, req: RouteRequest) -> dict:
        """CAP2 §4.13."""
```
|
| 95 |
+
|
| 96 |
+
### 3.3 Concrete describe backends
|
| 97 |
+
|
| 98 |
+
```python
|
| 99 |
+
class Florence2Backend(ImageDescribeBackend):
|
| 100 |
+
def __init__(self, model: str = "microsoft/Florence-2-large", device: str = "auto"):
|
| 101 |
+
...
|
| 102 |
+
|
| 103 |
+
class MinicpmVBackend(ImageDescribeBackend):
|
| 104 |
+
"""Used both standalone (img.describe) and as an LLM vision backend (M04 extension)."""
|
| 105 |
+
def __init__(self, model: str = "openbmb/MiniCPM-V-2_6", device: str = "auto"):
|
| 106 |
+
...
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
---
|
| 110 |
+
|
| 111 |
+
## 4. Public API - generate
|
| 112 |
+
|
| 113 |
+
### 4.1 `backends/base.py`
|
| 114 |
+
|
| 115 |
+
```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class GenerationResult:
    image_bytes: bytes
    width: int
    height: int
    format: str                  # "png" | "webp" | "jpg"
    seed: int
    ms: int


class ImageGenerateBackend(Protocol):
    name: str
    models: list[str]
    loras_available: list[str]
    max_resolution: tuple[int, int]
    min_resolution: tuple[int, int]
    supports_negative_prompt: bool

    async def warm(self, model: str) -> None: ...
    async def close(self) -> None: ...

    async def generate(
        self,
        prompt: str,
        *,
        model: str,
        lora: str | None,
        negative_prompt: str | None,
        width: int,
        height: int,
        steps: int,
        seed: int | None,
        progress_cb: Callable[[int, int], None] | None = None,
    ) -> GenerationResult: ...

    def health(self) -> dict: ...
```
|
| 152 |
+
|
| 153 |
+
### 4.2 `generate_service.py`
|
| 154 |
+
|
| 155 |
+
```python
class ImageGenerateService:
    name = "image.generate"
    version = "1.0"

    def __init__(self, config: VisionConfig, blob_store: BlobStore):
        ...

    def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
        """One img.generate per (backend, model) combo. params declare loras_available."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def health(self) -> dict: ...

    async def handle_generate(self, req: RouteRequest) -> AsyncIterator[dict]:
        """CAP2 §4.14.
        1. Generate (streaming progress frames)
        2. Store resulting image as blob
        3. Emit done with image_cid"""
```
|
| 176 |
+
|
| 177 |
+
### 4.3 Concrete generate backends
|
| 178 |
+
|
| 179 |
+
```python
|
| 180 |
+
class FluxBackend(ImageGenerateBackend):
|
| 181 |
+
"""FLUX.1-dev with LoRA support; Christof's existing pipeline."""
|
| 182 |
+
def __init__(
|
| 183 |
+
self,
|
| 184 |
+
model: str = "black-forest-labs/FLUX.1-dev",
|
| 185 |
+
device: str = "auto",
|
| 186 |
+
loras_dir: Path = Path("~/.hearthnet/loras"),
|
| 187 |
+
):
|
| 188 |
+
...
|
| 189 |
+
|
| 190 |
+
class StableDiffusionBackend(ImageGenerateBackend):
|
| 191 |
+
"""SD-XL fallback for nodes with smaller GPUs."""
|
| 192 |
+
def __init__(self, model: str = "stabilityai/stable-diffusion-xl-base-1.0", device: str = "auto"):
|
| 193 |
+
...
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
---
|
| 197 |
+
|
| 198 |
+
## 5. Multimodal LLM extension (M04 hook)
|
| 199 |
+
|
| 200 |
+
### 5.1 Message content array
|
| 201 |
+
|
| 202 |
+
In `llm.chat@2.0` (CAP2 §4.23), each `messages[].content` may be a list:
|
| 203 |
+
|
| 204 |
+
```json
|
| 205 |
+
[
|
| 206 |
+
{"type": "text", "text": "Was ist auf diesem Bild?"},
|
| 207 |
+
{"type": "image", "image_cid": "blake3:..."}
|
| 208 |
+
]
|
| 209 |
+
```
|
| 210 |
+
|
| 211 |
+
Backends that declare `modalities: ["text", "vision"]` in their descriptor must handle the array form. Backends that do not declare it are handled in one of two ways:
- They are skipped by the router (`params_compatible` returns `False` when a message contains an image and the backend's `modalities` does not include `"vision"`)
|
| 213 |
+
- Or fall back: extract text content only, ignore images (worse UX; not recommended)
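
The router-side check described above can be sketched as follows. This is a minimal illustration, not the spec's actual predicate: the function names `message_has_image` and the two-argument form of `params_compatible` are assumptions made for this example.

```python
from typing import Any


def message_has_image(messages: list[dict[str, Any]]) -> bool:
    """True if any message uses the array content form and carries an image part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(part.get("type") == "image" for part in content):
                return True
    return False


def params_compatible(modalities: list[str], messages: list[dict[str, Any]]) -> bool:
    """Hypothetical predicate: reject image-bearing requests for non-vision backends."""
    if message_has_image(messages):
        return "vision" in modalities
    return True  # text-only requests are acceptable to every backend
```

With this shape, a text-only backend is never offered a request that contains an `image_cid`, so the lossy "ignore images" fallback is only reached if a deployment opts into it.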
|
| 214 |
+
|
| 215 |
+
### 5.2 Vision-capable backends in M04
|
| 216 |
+
|
| 217 |
+
These M04 backends gain a `modalities: ["text","vision"]` declaration in Phase 2:
|
| 218 |
+
|
| 219 |
+
| Backend | Vision support |
|---------|----------------|
| `MinicpmVBackend` (M04 entry; same model as M20's describe) | Yes; native multimodal |
| `AnthropicApiBackend` | Yes; Claude vision via API |
| `OpenAiApiBackend` | Yes; GPT-4V |
| `Llava` (new, optional) | Yes; LLaVA via llama.cpp |
| `LlamaCppBackend` | Yes if model is multimodal (LLaVA-format) |
| `OllamaBackend` | Yes for vision models |
| Others | No |
|
| 228 |
+
|
| 229 |
+
The `M04.LlmService._build_backends` constructs these with their vision flag.
|
| 230 |
+
|
| 231 |
+
### 5.3 Image preprocessing
|
| 232 |
+
|
| 233 |
+
For LLM context, images are:
|
| 234 |
+
- Loaded from blob store via CID
|
| 235 |
+
- Resized to backend's preferred resolution (e.g. 1024×1024 for MiniCPM-V)
|
| 236 |
+
- Encoded base64 or sent as bytes per backend's protocol
|
| 237 |
+
|
| 238 |
+
This is opaque to the caller: the multimodal `messages` array is the contract.
|
| 239 |
+
|
| 240 |
+
---
|
| 241 |
+
|
| 242 |
+
## 6. Behaviour
|
| 243 |
+
|
| 244 |
+
### 6.1 Image describe lifecycle
|
| 245 |
+
|
| 246 |
+
```
caller → bus.call("img.describe", (1,0), {input:{image_cid:..., task:"detailed_caption"}})
  → ImageDescribeService.handle_describe
  → blob_store.read_blob_bytes(image_cid)
  → backend.describe(bytes, task=...)
  → return ImageDescription serialised
```
|
| 253 |
+
|
| 254 |
+
### 6.2 Image generate lifecycle
|
| 255 |
+
|
| 256 |
+
```
caller → bus.stream("img.generate", (1,0), {input:{prompt:"...", steps:20}})
  → ImageGenerateService.handle_generate
  → backend.generate(...) with progress_cb
  → for each step: emit 'progress' frame
  → on completion: blob_store.write_blob(image)
  → emit 'done' frame with image_cid
```
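
The generate lifecycle can be illustrated with a small async generator that bridges a synchronous `progress_cb` to streamed frames. This is a sketch under stated assumptions: `InMemoryBlobStore` and `fake_generate` are stand-ins invented for the example, and the CID uses SHA-256 from the standard library rather than the BLAKE3 CIDs that appear elsewhere in this spec.

```python
import asyncio
import hashlib
from typing import AsyncIterator


class InMemoryBlobStore:
    """Stand-in blob store (hypothetical); keys are sha256 digests, not real CIDs."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def write_blob(self, data: bytes) -> str:
        cid = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs[cid] = data
        return cid


async def fake_generate(steps: int, progress_cb) -> bytes:
    """Pretend backend: reports every step, then returns image bytes."""
    for step in range(1, steps + 1):
        await asyncio.sleep(0)  # yield control, as a real denoising step would
        progress_cb(step, steps)
    return b"fake-png-bytes"


async def handle_generate(steps: int, store: InMemoryBlobStore) -> AsyncIterator[dict]:
    queue: asyncio.Queue = asyncio.Queue()

    def on_progress(step: int, total: int) -> None:
        queue.put_nowait({"event": "progress", "step": step, "total": total})

    async def run() -> None:
        image = await fake_generate(steps, on_progress)
        # Store the image first, then announce completion with its identifier.
        queue.put_nowait({"event": "done", "image_cid": store.write_blob(image)})

    task = asyncio.create_task(run())
    while True:
        frame = await queue.get()
        yield frame
        if frame["event"] == "done":
            break
    await task
```

The queue decouples the backend's callback from the stream consumer, so a slow client does not block the generation loop.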
|
| 264 |
+
|
| 265 |
+
### 6.3 Safety filters
|
| 266 |
+
|
| 267 |
+
- `img.generate` prompts pass through a configurable safety filter list (regex blocklist + optional LLM-based classifier)
|
| 268 |
+
- Generation of identifiable persons is blocked by default (configurable: `config.vision.allow_identifiable_persons`)
|
| 269 |
+
- NSFW filter on output (Stable Diffusion has built-in; FLUX needs separate model)
|
| 270 |
+
- Failed safety check → `bad_request` with `reason: "safety_filter"`
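
A regex blocklist of the kind listed above might look like the following. This is a minimal sketch: `SafetyFilterError`, `compile_blocklist`, and `check_prompt` are hypothetical names, and the optional LLM-based classifier is not shown.

```python
import re


class SafetyFilterError(Exception):
    """Hypothetical exception; maps to wire code bad_request, reason safety_filter."""

    wire_code = "bad_request"
    reason = "safety_filter"


def compile_blocklist(patterns: list[str]) -> list[re.Pattern[str]]:
    """Compile blocklist entries once, case-insensitively."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def check_prompt(prompt: str, blocklist: list[re.Pattern[str]]) -> None:
    """Raise SafetyFilterError if any blocklist pattern matches the prompt."""
    for pattern in blocklist:
        if pattern.search(prompt):
            raise SafetyFilterError(f"prompt matched blocklist pattern {pattern.pattern!r}")
```

Compiling the patterns at service start keeps the per-request cost to a handful of regex searches.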
|
| 271 |
+
|
| 272 |
+
### 6.4 LoRA management
|
| 273 |
+
|
| 274 |
+
`FluxBackend.loras_available` lists LoRAs found in `loras_dir`. Caller can request a specific LoRA in `params.lora`. Loading a LoRA takes a few seconds on first use; cached thereafter.
|
| 275 |
+
|
| 276 |
+
Christof's existing LoRAs (local-style, sketches, etc.) drop into the `loras_dir`. The backend auto-discovers them.
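
Auto-discovery could be as simple as listing weight files in the directory. The sketch below assumes LoRAs are stored as `.safetensors` files and that the file stem is the name callers pass in `params.lora`; both are assumptions, as is the function name `discover_loras`. Note that `pathlib.Path` does not expand `~` by itself, so the sketch calls `expanduser()` explicitly.

```python
from pathlib import Path


def discover_loras(loras_dir: Path, suffix: str = ".safetensors") -> list[str]:
    """Return sorted LoRA names (file stems) found directly in loras_dir."""
    directory = loras_dir.expanduser()  # Path("~/...") is not expanded automatically
    if not directory.is_dir():
        return []
    return sorted(
        p.stem for p in directory.iterdir() if p.is_file() and p.suffix == suffix
    )
```

A request naming a LoRA that is not in this list would then map to the `not_found` wire code from the errors table.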
|
| 277 |
+
|
| 278 |
+
### 6.5 GPU pressure
|
| 279 |
+
|
| 280 |
+
Vision models are heavy. Recommended:
|
| 281 |
+
|
| 282 |
+
- One Florence-2 instance per node (always-loaded)
|
| 283 |
+
- FLUX/SD only loaded on-demand (warm on first request, kept hot for 5 minutes)
|
| 284 |
+
- `max_concurrent = 1` for FLUX; `2` for describe backends
|
| 285 |
+
|
| 286 |
+
These limits are declared in the capability descriptor so the bus throttles correctly.
|
| 287 |
+
|
| 288 |
+
### 6.6 Multimodal LLM call routing
|
| 289 |
+
|
| 290 |
+
When a user sends a multimodal message:
|
| 291 |
+
|
| 292 |
+
```
UI → bus.stream("llm.chat", (2,0), {input:{messages:[{role:"user", content:[{type:"text",text:"..."},{type:"image",image_cid:"..."}]}]}})
  → Router.route filters candidates to those whose modalities include "vision"
  → picks best (e.g. MinicpmV local, fall back to Anthropic API)
  → backend handles base64 / image-token-injection internally
```
|
| 298 |
+
|
| 299 |
+
If no vision-capable backend is online, the call returns `not_found` with a helpful `alt_capabilities` hint pointing to describe-then-text-only fallback (UI can offer this).
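
The describe-then-text-only fallback can be sketched as a message rewrite: each image part is replaced by a caption so that a text-only backend can still answer. The helper name `to_text_only` and the `[Image: ...]` placeholder format are assumptions for illustration; in practice `describe` would wrap a call to `img.describe@1.0`.

```python
from typing import Callable


def to_text_only(messages: list[dict], describe: Callable[[str], str]) -> list[dict]:
    """Replace image parts with captions; leave plain-string content untouched."""
    converted = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            converted.append(message)
            continue
        parts = []
        for part in content:
            if part.get("type") == "image":
                parts.append(f"[Image: {describe(part['image_cid'])}]")
            elif part.get("type") == "text":
                parts.append(part["text"])
        converted.append({**message, "content": "\n".join(parts)})
    return converted
```

The rewritten messages can be sent to any `llm.chat` backend, at the cost of losing whatever detail the caption omits.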
|
| 300 |
+
|
| 301 |
+
---
|
| 302 |
+
|
| 303 |
+
## 7. Errors
|
| 304 |
+
|
| 305 |
+
| Condition | Wire code |
|-----------|-----------|
| Unknown task | `bad_request` |
| Image too large | `bad_request` |
| Prompt safety violation | `bad_request` (reason=safety_filter) |
| LoRA not found | `not_found` |
| GPU OOM | `capacity_exceeded` |
| Backend missing for requested task | `not_implemented` |
|
| 313 |
+
|
| 314 |
+
---
|
| 315 |
+
|
| 316 |
+
## 8. Configuration
|
| 317 |
+
|
| 318 |
+
```python
|
| 319 |
+
config.vision.enabled = True
|
| 320 |
+
config.vision.describe_backends = [
|
| 321 |
+
DescribeBackendConfig(name="florence2", model="microsoft/Florence-2-large", device="auto"),
|
| 322 |
+
DescribeBackendConfig(name="minicpm_v", model="openbmb/MiniCPM-V-2_6", device="auto"),
|
| 323 |
+
]
|
| 324 |
+
config.vision.generate_backends = [
|
| 325 |
+
GenerateBackendConfig(name="flux", model="black-forest-labs/FLUX.1-dev",
|
| 326 |
+
loras_dir=Path("~/.hearthnet/loras"), device="auto"),
|
| 327 |
+
]
|
| 328 |
+
config.vision.allow_identifiable_persons = False
|
| 329 |
+
config.vision.safety_blocklist_file = None # optional regex file
|
| 330 |
+
```
|
| 331 |
+
|
| 332 |
+
---
|
| 333 |
+
|
| 334 |
+
## 9. Tests
|
| 335 |
+
|
| 336 |
+
### Unit
|
| 337 |
+
- `test_describe_descriptor_per_backend`
|
| 338 |
+
- `test_safety_filter_blocks_known_pattern`
|
| 339 |
+
- `test_lora_discovery`
|
| 340 |
+
- `test_oom_returns_capacity_exceeded`
|
| 341 |
+
|
| 342 |
+
### Integration
|
| 343 |
+
- `test_florence2_caption_sample` (test image)
|
| 344 |
+
- `test_flux_generate_with_lora_progress_frames`
|
| 345 |
+
- `test_multimodal_llm_routes_to_vision_backend`
|
| 346 |
+
- `test_describe_then_text_fallback_when_no_vision_llm`
|
| 347 |
+
|
| 348 |
+
---
|
| 349 |
+
|
| 350 |
+
## 10. Cross-references
|
| 351 |
+
|
| 352 |
+
| What | Where |
|------|-------|
| `img.*` wire | [CAP2 §4.13–4.14](../CAPABILITY_CONTRACT_v2.md) |
| Multimodal `llm.chat@2.0` | [CAP2 §4.23](../CAPABILITY_CONTRACT_v2.md) |
| LLM service extension | M04 (extended in Phase 2; see [00-OVERVIEW §1](../00-OVERVIEW.md)) |
| OCR overlap | [M17 §3.2 florence_ocr](M17-ocr.md) |
| Christof's pipelines | external, this is the integration |
|
| 359 |
+
|
| 360 |
+
---
|
| 361 |
+
|
| 362 |
+
## 11. Open questions
|
| 363 |
+
|
| 364 |
+
1. **Video**: Phase 3 considers `video.describe` and `video.generate` (LTX-Video). Not in Phase 2.
2. **Image editing (inpainting)**: Phase 2.5, as an `img.edit@1.0` capability. Reserved.
3. **Control nets (depth, edge, pose)**: Phase 2.5.
4. **3D generation**: Phase 3 with TripoSR or similar.
5. **Safety filter quality**: the regex blocklist is weak. An LLM-as-judge classifier is better but adds latency. Configurable; default off.
6. **LoRA stacking**: caller specifies multiple `loras: ["...","..."]`. Implementable but adds attack surface (prompt-LoRA combos). Defer.
|
docs/p2_p3/M21-tool-calls.md
ADDED
|
@@ -0,0 +1,317 @@
| 1 |
+
# M21 - Tool Calls (LLM Tool-Use)
|
| 2 |
+
|
| 3 |
+
**Spec version:** v1.0 (Phase 2)
|
| 4 |
+
**Depends on:** M03 (bus), M04 (LLM, extended), X06 (WebSocket, for in-stream tool loops), X04 (config), X03 (observability)
|
| 5 |
+
**Depended on by:** M08 UI (ask tab gains tool-augmented mode), M22 mobile (same), any future agent applications
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Let the LLM call other HearthNet capabilities mid-generation. Specifically:
|
| 12 |
+
|
| 13 |
+
- Declare `tools` in the `llm.chat@2.0` request
|
| 14 |
+
- Receive `tool_call_delta` and `tool_call` stream frames from the LLM
|
| 15 |
+
- Execute each tool call against the bus
|
| 16 |
+
- Feed results back into the LLM via `tool_result` (WebSocket) or via a follow-up `llm.chat` call (SSE)
|
| 17 |
+
- Provide `llm.tools.call@1.0` as a convenience that wraps the bus dispatch
|
| 18 |
+
|
| 19 |
+
This module is **a protocol + helper**, not a service in the usual sense. The actual LLM work lives in M04; M21 documents how the tool flow is structured and provides utilities both sides need.
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 2. File layout
|
| 24 |
+
|
| 25 |
+
```
hearthnet/services/llm/
└── tools.py                    # ToolDefinition, ToolCall, ToolResult, ToolExecutor

hearthnet/services/auth/        (already in M16)
    # the auth service also registers llm.tools.call@1.0 as a wrapper capability
```
|
| 32 |
+
|
| 33 |
+
`tools.py` is small (~250 LOC). It lives inside the LLM service package because tool usage is intrinsic to chat completion.
|
| 34 |
+
|
| 35 |
+
---
|
| 36 |
+
|
| 37 |
+
## 3. Public API
|
| 38 |
+
|
| 39 |
+
### 3.1 Data types
|
| 40 |
+
|
| 41 |
+
```python
# hearthnet/services/llm/tools.py
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, Callable


@dataclass(frozen=True)
class ToolDefinition:
    """A tool the caller offers to the LLM.
    Translated by the LLM backend into its native tool format."""
    name: str                               # short identifier visible to the LLM
    description: str                        # human-readable, drives LLM selection
    parameters_schema: dict                 # JSON Schema for arguments
    bound_capability: str | None            # if set, ToolExecutor dispatches via bus
    bound_version: tuple[int, int] | None
    side_effects: bool                      # "is this a write?"; affects retry semantics


@dataclass(frozen=True)
class ToolCall:
    """A request from the LLM to execute a tool."""
    id: str                                 # opaque, generated by the LLM
    name: str
    arguments: dict                         # validated against parameters_schema


@dataclass(frozen=True)
class ToolResult:
    """The result of executing a ToolCall, fed back to the LLM."""
    tool_call_id: str
    name: str
    content: str | dict                     # serialisable; if dict, becomes JSON
    is_error: bool
```
|
| 72 |
+
|
| 73 |
+
### 3.2 `ToolExecutor`
|
| 74 |
+
|
| 75 |
+
```python
class ToolExecutor:
    """Wraps the orchestration loop: forward LLM tool calls to the bus,
    collect results, re-inject into the LLM."""

    def __init__(
        self,
        bus: CapabilityBus,
        tools: list[ToolDefinition],
        *,
        max_iterations: int = 6,
        per_tool_timeout_seconds: int = 30,
    ):
        ...

    @property
    def native_definitions(self) -> list[dict]:
        """Returns tools in the request schema's format (CAP2 §4.23 input.tools)."""

    async def dispatch(self, call: ToolCall) -> ToolResult:
        """Validate call.arguments against the tool's parameters_schema.
        If bound_capability: bus.call(bound_capability, bound_version, {input: call.arguments}).
        Returns ToolResult. Catches and surfaces errors as is_error=True."""

    async def run_chat_with_tools(
        self,
        chat_request_body: dict,
        *,
        stream_to: Callable[[dict], Awaitable[None]] | None = None,
    ) -> dict:
        """Orchestrator helper. Loops:
        1. call bus.stream("llm.chat", (2,0), body)
        2. accumulate text + tool_call frames
        3. on tool_call_complete: dispatch, append tool_result message
        4. re-call llm.chat with extended messages
        5. stop when no more tool calls OR max_iterations reached
        Returns the final assistant message."""
```
|
| 113 |
+
|
| 114 |
+
### 3.3 Wire-level frames (recap of CAP2 §4.23 and §5.1 with the tool flow)
|
| 115 |
+
|
| 116 |
+
LLM emits:
|
| 117 |
+
|
| 118 |
+
```
|
| 119 |
+
event: token
|
| 120 |
+
data: {"text":"I'll search "}
|
| 121 |
+
|
| 122 |
+
event: tool_call_delta
|
| 123 |
+
data: {"id":"tc_1","name":"rag.query","arguments_delta":"{\"query\":\""}
|
| 124 |
+
|
| 125 |
+
event: tool_call_delta
|
| 126 |
+
data: {"id":"tc_1","arguments_delta":"Regenwasser\""}
|
| 127 |
+
|
| 128 |
+
event: tool_call
|
| 129 |
+
data: {"id":"tc_1","name":"rag.query","arguments":{"query":"Regenwasser","corpus":"niederrhein-emergency"}}
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
Caller dispatches and replies (over WebSocket OR by re-calling `llm.chat` with the tool result added to messages):
|
| 133 |
+
|
| 134 |
+
WebSocket:
|
| 135 |
+
```
|
| 136 |
+
client → {"type":"tool_result","tool_call_id":"tc_1","body":{"chunks":[...]}}
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
SSE fallback:
|
| 140 |
+
```
|
| 141 |
+
caller re-calls llm.chat with messages = original_messages + [
|
| 142 |
+
{"role":"assistant","content":"...","tool_calls":[{"id":"tc_1","name":"rag.query","arguments":{...}}]},
|
| 143 |
+
{"role":"tool","tool_call_id":"tc_1","content":"<JSON of tool result>"}
|
| 144 |
+
]
|
| 145 |
+
```
|
| 146 |
+
|
| 147 |
+
Both paths converge: LLM continues and eventually emits `done`.
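
On the receiving side, the frames above can be folded into state with a small accumulator. In the example frames the concatenated `arguments_delta` strings are only a partial preview, and the final `tool_call` frame carries the complete arguments, so this sketch treats the deltas as display-only and the `tool_call` frame as authoritative. The class name `ToolCallAccumulator` and its attributes are assumptions for illustration.

```python
class ToolCallAccumulator:
    """Folds token, tool_call_delta, and tool_call frames into simple state."""

    def __init__(self) -> None:
        self.text = ""                         # assistant text so far
        self.previews: dict[str, str] = {}     # call id -> partial argument JSON text
        self.completed: list[dict] = []        # authoritative tool_call payloads

    def feed(self, event: str, data: dict) -> None:
        if event == "token":
            self.text += data["text"]
        elif event == "tool_call_delta":
            call_id = data["id"]
            self.previews[call_id] = self.previews.get(call_id, "") + data["arguments_delta"]
        elif event == "tool_call":
            # The complete frame supersedes whatever preview was accumulated.
            self.previews.pop(data["id"], None)
            self.completed.append(data)
```

A UI can render `previews` while the model is still writing arguments and dispatch only the entries in `completed`.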
|
| 148 |
+
|
| 149 |
+
---
|
| 150 |
+
|
| 151 |
+
## 4. Behaviour
|
| 152 |
+
|
| 153 |
+
### 4.1 Tool selection heuristics
|
| 154 |
+
|
| 155 |
+
The LLM picks tools based on:
|
| 156 |
+
- Tool descriptions (descriptive English/German helps)
|
| 157 |
+
- `tool_choice` parameter:
|
| 158 |
+
- `"auto"` (default): LLM decides
|
| 159 |
+
- `"none"`: forbid tool use even if tools are declared
|
| 160 |
+
- `"required"`: must call at least one tool
|
| 161 |
+
- `{"name":"rag.query"}`: must call specifically this tool
|
| 162 |
+
|
| 163 |
+
Backends translate these to their native API.
|
| 164 |
+
|
| 165 |
+
### 4.2 Built-in tools
|
| 166 |
+
|
| 167 |
+
When `ToolExecutor` is instantiated by the UI, it can auto-include a set of standard tools bound to common bus capabilities:
|
| 168 |
+
|
| 169 |
+
| Tool name | Bound to | Use case |
|-----------|----------|----------|
| `search_corpus` | `rag.query@1.0` | Search a corpus |
| `list_corpora` | `rag.list_corpora@1.0` | What's available |
| `translate` | `trans.text@1.0` | Translate snippets |
| `find_neighbour` | (custom: list peers in current community) | "Who is around?" |
| `list_marketplace` | `market.list@1.0` | Active posts |
| `describe_image` | `img.describe@1.0` | Inspect uploaded images |
| `transcribe_audio` | `stt.transcribe@1.0` | Chained voice input |
|
| 178 |
+
|
| 179 |
+
These are *suggested defaults*. Real applications pick what fits.
|
| 180 |
+
|
| 181 |
+
### 4.3 Validation
|
| 182 |
+
|
| 183 |
+
`ToolExecutor.dispatch` validates `call.arguments` against `parameters_schema` before calling the bus. Invalid args → `ToolResult(is_error=True, content="invalid_arguments: ...")`. The LLM sees the error and typically self-corrects.
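
To make the validation step concrete, here is a deliberately small checker that covers only `required` and primitive `type` keywords. It is a sketch, not a JSON Schema implementation; a production `dispatch` would more likely delegate to a full validator. The function names are hypothetical.

```python
# Minimal subset of JSON Schema: "required" plus primitive "type" checks.
_JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}


def validate_arguments(arguments: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments are accepted."""
    problems = []
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument '{name}'")
    properties = schema.get("properties", {})
    for name, value in arguments.items():
        expected = _JSON_TYPES.get(properties.get(name, {}).get("type"))
        if expected is None:
            continue  # undeclared or untyped property: not checked in this sketch
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        bool_mismatch = isinstance(value, bool) and expected is not bool
        if bool_mismatch or not isinstance(value, expected):
            problems.append(f"argument '{name}' has the wrong type")
    return problems


def invalid_arguments_result(call_id: str, name: str, problems: list[str]) -> dict:
    """Shape the problems like the ToolResult described above (as a plain dict)."""
    return {
        "tool_call_id": call_id,
        "name": name,
        "content": "invalid_arguments: " + "; ".join(problems),
        "is_error": True,
    }
```

Returning every problem at once, rather than the first one, gives the LLM enough detail to correct all arguments in a single retry.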
|
| 184 |
+
|
| 185 |
+
### 4.4 Iteration limits
|
| 186 |
+
|
| 187 |
+
`max_iterations` (default 6) prevents runaway tool loops. After the limit, `ToolExecutor.run_chat_with_tools` injects a final `tool` message saying "iteration limit reached; finalise your answer" and forces `tool_choice="none"` on the next call.
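
The iteration cap can be sketched as a bounded loop. The example is synchronous for brevity, whereas the real helper is async; `run_with_limit`, the `llm(messages, tool_choice)` callable, and the `execute(call)` callable are hypothetical stand-ins for the bus calls.

```python
from typing import Callable

LIMIT_NOTICE = "iteration limit reached; finalise your answer"


def run_with_limit(
    llm: Callable[[list[dict], str], dict],
    execute: Callable[[dict], str],
    messages: list[dict],
    max_iterations: int = 6,
) -> dict:
    """Loop until the LLM stops requesting tools or the cap is hit."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_iterations):
        reply = llm(messages, "auto")
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply  # plain answer: done
        for call in calls:
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": execute(call)}
            )
    # Cap reached: tell the model to wrap up and forbid further tool use.
    messages.append({"role": "tool", "content": LIMIT_NOTICE})
    return llm(messages, "none")
```

Because the final call uses `tool_choice="none"`, the loop is guaranteed to terminate after at most `max_iterations + 1` LLM calls.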
|
| 188 |
+
|
| 189 |
+
### 4.5 Side-effect tools
|
| 190 |
+
|
| 191 |
+
Tools where `side_effects: True` (like `market.post`, `chat.send`) require explicit confirmation. By default, `ToolExecutor` raises `ToolError("requires_confirmation")` on side-effect calls, expecting the orchestrator (UI) to present a confirmation dialog.
|
| 192 |
+
|
| 193 |
+
UI flow:
|
| 194 |
+
```
|
| 195 |
+
LLM emits tool_call to market.post
|
| 196 |
+
ToolExecutor sees side_effects=True
|
| 197 |
+
emits a 'confirmation_required' frame upstream
|
| 198 |
+
UI shows "Allow LLM to post this?"
|
| 199 |
+
user clicks yes → orchestrator calls ToolExecutor.dispatch_confirmed(call)
|
| 200 |
+
```
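
A minimal version of this gate is sketched below. `ToolError` and `dispatch_confirmed` are named in the text above; the `ConfirmingDispatcher` class, its constructor arguments, and the plain-dict call shape are assumptions made for the example.

```python
from typing import Callable


class ToolError(Exception):
    """Raised when a tool call cannot proceed without orchestrator action."""


class ConfirmingDispatcher:
    def __init__(
        self,
        execute: Callable[[dict], object],
        side_effect_tools: set[str],
        require_confirmation: bool = True,
    ) -> None:
        self._execute = execute
        self._side_effect_tools = side_effect_tools
        self._require_confirmation = require_confirmation

    def dispatch(self, call: dict) -> object:
        """Run read-only tools directly; refuse writes until confirmed."""
        if self._require_confirmation and call["name"] in self._side_effect_tools:
            raise ToolError("requires_confirmation")
        return self._execute(call)

    def dispatch_confirmed(self, call: dict) -> object:
        """Called by the orchestrator after the user approved the action."""
        return self._execute(call)
```

Keeping the confirmed path as a separate method means the LLM-driven loop can never reach it by accident; only the UI code calls it.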
|
| 201 |
+
|
| 202 |
+
### 4.6 Parallel tool calls
|
| 203 |
+
|
| 204 |
+
LLMs (Claude, GPT-4) can emit multiple `tool_call` frames in one turn. `ToolExecutor` dispatches them in parallel, bounded by `config.llm.tools_max_parallel` (default 4). Results are submitted together in the next LLM turn.
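
Bounded parallel dispatch can be sketched with `asyncio.Semaphore` and `asyncio.gather`. The function name `dispatch_all` and the dict-shaped calls and results are assumptions for illustration; failures are turned into error results so that one failing tool does not cancel its siblings.

```python
import asyncio
from typing import Awaitable, Callable


async def dispatch_all(
    calls: list[dict],
    dispatch: Callable[[dict], Awaitable[dict]],
    max_concurrent: int = 4,
) -> list[dict]:
    """Run all tool calls with at most max_concurrent in flight; keep input order."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(call: dict) -> dict:
        async with sem:
            try:
                return await dispatch(call)
            except Exception as exc:
                # Surface the failure to the LLM instead of aborting the batch.
                return {"tool_call_id": call["id"], "content": str(exc), "is_error": True}

    return await asyncio.gather(*(guarded(c) for c in calls))
```

`asyncio.gather` returns results in the order the calls were submitted, which keeps each result aligned with its `tool_call_id` when the batch is sent back to the LLM.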
|
| 205 |
+
|
| 206 |
+
### 4.7 Tool call composition (tools that call tools)
|
| 207 |
+
|
| 208 |
+
A `bound_capability` may itself be `llm.tools.call@1.0`. This allows defining higher-level tools as compositions of bus capabilities + LLM reasoning. Recursion limit = `max_iterations`.
|
| 209 |
+
|
| 210 |
+
### 4.8 Trust and tokens
|
| 211 |
+
|
| 212 |
+
Tool dispatch goes through the bus and inherits the caller's trust level, so the LLM cannot escalate privileges by emitting a tool call. For cross-community tool calls, the caller must hold an appropriate token (M16).
|
| 213 |
+
|
| 214 |
+
### 4.9 LLM backend translation
|
| 215 |
+
|
| 216 |
+
Backends translate `ToolDefinition` to their native protocol:
|
| 217 |
+
|
| 218 |
+
| Backend | Native format |
|---------|--------------|
| `AnthropicApiBackend` | Anthropic Messages tools |
| `OpenAiApiBackend` | OpenAI function calling |
| `OllamaBackend` (some models) | Ollama tool calls |
| `LlamaCppBackend` (with grammar) | JSON-Schema grammar constraint |
| `MinicpmVBackend` | MiniCPM tool format |
| `NemotronBackend` | OpenAI-compatible |
| `OpenBmbBackend` | OpenAI-compatible |
| Others | Tools ignored; backend emits a notice on `tool_choice="required"` |
|
| 228 |
+
|
| 229 |
+
This translation lives inside each backend's `chat()` method.
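
As an illustration of the translation for the two hosted APIs, the sketch below maps a trimmed-down `ToolDefinition` to the tool shapes used by OpenAI Chat Completions (`type: "function"` with a nested `function` object) and by the Anthropic Messages API (`input_schema`). The helper names `to_openai` and `to_anthropic` are hypothetical, and the local dataclass only mirrors the three fields needed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDefinition:
    """Reduced copy of the spec's ToolDefinition, for a self-contained example."""
    name: str
    description: str
    parameters_schema: dict


def to_openai(tool: ToolDefinition) -> dict:
    """OpenAI-style function tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters_schema,
        },
    }


def to_anthropic(tool: ToolDefinition) -> dict:
    """Anthropic-style tool entry; the JSON Schema goes under input_schema."""
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.parameters_schema,
    }
```

Both targets accept JSON Schema for arguments, so `parameters_schema` passes through unchanged and only the envelope differs.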
|
| 230 |
+
|
| 231 |
+
---
|
| 232 |
+
|
| 233 |
+
## 5. `llm.tools.call@1.0` capability
|
| 234 |
+
|
| 235 |
+
Convenience wrapper, used when a caller wants to invoke a bus capability as if it were a tool result, without going through the LLM.
|
| 236 |
+
|
| 237 |
+
Already specified in [CAP2 §4.24](../CAPABILITY_CONTRACT_v2.md).
|
| 238 |
+
|
| 239 |
+
The handler lives in `M04.LlmService.handle_tools_call`:
|
| 240 |
+
|
| 241 |
+
```python
|
| 242 |
+
async def handle_tools_call(self, req: RouteRequest) -> dict:
|
| 243 |
+
"""1. Validate target_body against target_capability's request schema (via bus.schema)
|
| 244 |
+
2. bus.call(target_capability, target_version, target_body)
|
| 245 |
+
3. Return result"""
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
Mostly used by orchestrators that want a single audit-trail capability for "tool execution".
|
| 249 |
+
|
| 250 |
+
---
|
| 251 |
+
|
| 252 |
+
## 6. Configuration
|
| 253 |
+
|
| 254 |
+
```python
|
| 255 |
+
config.llm.tools_enabled = True
|
| 256 |
+
config.llm.tools_max_iterations = 6
|
| 257 |
+
config.llm.tools_per_tool_timeout_seconds = 30
|
| 258 |
+
config.llm.tools_max_parallel = 4
|
| 259 |
+
config.llm.tools_default_set = ["search_corpus","list_corpora","translate","list_marketplace"]
|
| 260 |
+
config.llm.tools_require_confirmation_for_side_effects = True
|
| 261 |
+
```
|
| 262 |
+
|
| 263 |
+
---
|
| 264 |
+
|
| 265 |
+
## 7. Errors
|
| 266 |
+
|
| 267 |
+
| Condition | Wire code |
|-----------|-----------|
| Tool not in declared set | `bad_request` |
| Tool arguments fail schema | `bad_request` |
| Tool execution timed out | `timeout` |
| Tool returned `internal_error` | propagated as `internal_error` |
| Iteration limit reached | (graceful; final answer forced) |
| Caller's token doesn't cover bound capability | `token_scope_insufficient` |
|
| 275 |
+
|
| 276 |
+
---
|
| 277 |
+
|
| 278 |
+
## 8. Tests
|
| 279 |
+
|
| 280 |
+
### Unit
|
| 281 |
+
- `test_tool_definition_to_native_format` (per backend)
|
| 282 |
+
- `test_dispatch_validates_arguments`
|
| 283 |
+
- `test_side_effect_tool_requires_confirmation`
|
| 284 |
+
- `test_iteration_limit_forces_finalisation`
|
| 285 |
+
- `test_parallel_tool_calls_collected`
|
| 286 |
+
|
| 287 |
+
### Integration
|
| 288 |
+
- `test_search_corpus_tool_used_for_grounded_answer`: LLM is asked a question, calls rag.query, answers
- `test_translate_chain`: user types in DE, LLM uses trans.text tool internally
|
| 290 |
+
- `test_market_post_requires_confirmation`
|
| 291 |
+
- `test_recursive_tool_call_limited_by_max_iterations`
|
| 292 |
+
|
| 293 |
+
### Manual
|
| 294 |
+
- Confirm Anthropic Claude, OpenAI GPT-4, Ollama Mistral, MiniCPM-V all produce well-formed tool_call frames on the same test prompt.
|
| 295 |
+
|
| 296 |
+
---
|
| 297 |
+
|
| 298 |
+
## 9. Cross-references
|
| 299 |
+
|
| 300 |
+
| What | Where |
|------|-------|
| `llm.chat@2.0` tools field | [CAP2 §4.23](../CAPABILITY_CONTRACT_v2.md) |
| Tool-call stream frames | [CAP2 §5.1, X06 §6.6](../cross-cutting/X06-websocket.md) |
| `llm.tools.call@1.0` | [CAP2 §4.24](../CAPABILITY_CONTRACT_v2.md) |
| M04 backend extensions | M04 (extended in Phase 2) |
| Token scope for cross-community tool dispatch | [M16 §5.2](M16-tokens.md) |
| Confirmation UI hook | M08 ext |
|
| 308 |
+
|
| 309 |
+
---
|
| 310 |
+
|
| 311 |
+
## 10. Open questions
|
| 312 |
+
|
| 313 |
+
1. **Tool result streaming.** Currently a tool result is atomic. For long-running tool calls (e.g. `img.generate`), the LLM has to wait. Phase 2.5 may stream tool progress back.
|
| 314 |
+
2. **Tool memoisation.** Repeated `search_corpus(q)` in one chat could be cached. Defer.
|
| 315 |
+
3. **Tool authority lineage.** When a tool is called by an LLM running on Node A on behalf of User on Node B, which token does the tool inherit? Currently the user's. This may be insufficient for federation. Phase 2.5.
|
| 316 |
+
4. **Tool calls that issue tokens.** Could a tool be "issue me a token for capability X"? Probably yes; specify carefully to avoid privilege escalation. Defer.
|
| 317 |
+
5. **Tool selection telemetry.** Which tools does the LLM actually pick? Useful for tuning descriptions. Log to trace ring buffer; surface in observability dashboard.
|
docs/p2_p3/M22-mobile-native.md
ADDED
|
@@ -0,0 +1,325 @@
| 1 |
+
# M22 - Mobile Native Client
|
| 2 |
+
|
| 3 |
+
**Spec version:** v1.0 (Phase 2)
|
| 4 |
+
**Depends on:** M01 (identity), M15 (relay tier for push and NAT traversal), M16 (tokens for app auth), M23 (E2E for chat), X06 (WebSocket for live updates), the entire Phase 1 bus protocol as a wire client
|
| 5 |
+
**Depended on by:** end users on iOS/Android (first non-web HearthNet surface)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Native mobile application that:
|
| 12 |
+
|
| 13 |
+
- Onboards into a community (scan invite QR or paste invite blob)
|
| 14 |
+
- Stores keys in the device's secure enclave (iOS Keychain, Android Keystore)
|
| 15 |
+
- Receives push notifications via the relay tier (M15)
|
| 16 |
+
- Calls the community's anchor(s) via the bus protocol over HTTP/SSE or WebSocket
|
| 17 |
+
- Provides UI for chat (1:1 + group), marketplace, ask (LLM), community feed
|
| 18 |
+
- Operates fully when the user's anchor is reachable; degrades gracefully when not
|
| 19 |
+
|
| 20 |
+
This module specifies **the contract between the mobile client and the rest of HearthNet**. The actual Flutter codebase lives in a separate repo (`/mobile-native`) and is not Python.
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
+
## 2. File layout
|
| 25 |
+
|
| 26 |
+
```
|
| 27 |
+
mobile-native/                  # separate Flutter project
├── pubspec.yaml
├── README.md
├── lib/
│   ├── main.dart
│   ├── onboarding/             # invite scan, key gen
│   ├── identity/               # key storage, signing
│   │   ├── secure_storage.dart
│   │   └── signing.dart
│   ├── bus/                    # protocol client
│   │   ├── http_client.dart
│   │   ├── ws_client.dart
│   │   └── sse_client.dart
│   ├── crypto/                 # E2E using cryptography_flutter / libsodium bindings
│   │   ├── x25519.dart
│   │   ├── ratchet.dart
│   │   └── envelope.dart
│   ├── push/                   # APNs / FCM hookup
│   │   └── subscriber.dart
│   ├── ui/                     # screens
│   │   ├── chat.dart
│   │   ├── marketplace.dart
│   │   ├── ask.dart
│   │   └── community.dart
│   └── settings/
├── ios/
├── android/
└── tests/

hearthnet/mobile/               # Python-side helper (in the main package)
├── __init__.py
├── invite.py                   # mobile-targeted invite QR generation
└── push_authority.py           # bus-side service for mobile push token registry
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
The Python `hearthnet/mobile/` package contains the anchor-side helpers used by the existing community. The Flutter code is its own world; this spec governs the wire contract it must implement.
|
| 63 |
+
|
| 64 |
+
---
|
| 65 |
+
|
| 66 |
+
## 3. Onboarding flow
|
| 67 |
+
|
| 68 |
+
```
|
| 69 |
+
User installs HearthNet app
   ↓
App: "Scan invite QR or paste invite link"
   ↓
On scan: parse hnvite:// blob (Phase 1 §13)
   ↓
App generates Ed25519 keypair via libsodium binding
Persists private key in iOS Keychain (kSecAttrAccessibleAfterFirstUnlock)
  or Android Keystore (KeyProperties.AUTH_REQUIRED if biometric set)
   ↓
App calls bus.call("onboard.complete", input={
    "invite_token": "...",
    "node_id_full": "<our new ed25519 pub>",
    "device_class": "mobile",
    "display_name": "<user-entered>",
    "platform": "ios|android",
})
   ↓
Anchor processes invite (M13), emits node.joined event
   ↓
App fetches community manifest (signed); pins it
   ↓
App registers for push:
  - obtain APNs/FCM device token from OS
  - bus.call("relay.push.register", input={device_token, platform})
   ↓
App publishes E2E prekey bundle (e2e.prekeys.published event)
   ↓
Ready to use
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
---
|
| 101 |
+
|
| 102 |
+
## 4. Bus protocol on mobile
|
| 103 |
+
|
| 104 |
+
### 4.1 Same wire as desktop
|
| 105 |
+
|
| 106 |
+
The mobile client speaks the same HTTP/SSE/WebSocket protocol as Phase 1 anchors. It is a **client** that calls into anchors/services in its community; it does not host capabilities itself (mostly; see §4.4).
|
| 107 |
+
|
| 108 |
+
### 4.2 Endpoint selection
|
| 109 |
+
|
| 110 |
+
The mobile client maintains an ordered list of endpoints:
|
| 111 |
+
|
| 112 |
+
1. **Cached anchor endpoints** from last successful manifest fetch
|
| 113 |
+
2. **Local network discovery** (mDNS via Bonjour on iOS / NSD on Android), only when on Wi-Fi
3. **Relay tier** (M15), for NAT traversal when on cellular
4. **DHT lookup** (X05), as a last resort
|
| 116 |
+
|
| 117 |
+
Per call, the client tries endpoints in order; first success wins. Persistent failures bubble up as `relay_unreachable` or `network_unreachable`.
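The ordered fallback above can be sketched in Python, the language of the anchor-side package (the real client is Flutter). `Endpoint`, `BusUnreachable`, `call_with_fallback`, and the `send` callable are hypothetical names for illustration, and reporting a total failure as `network_unreachable` is an assumption about how the error codes are assigned.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Endpoint:
    kind: str  # "cached_anchor", "mdns", "relay" or "dht" (labels assumed)
    url: str

class BusUnreachable(Exception):
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code

def call_with_fallback(endpoints: list[Endpoint],
                       send: Callable[[Endpoint], dict]) -> dict:
    """Try endpoints in priority order; the first success wins."""
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError:
            continue  # fall through to the next endpoint in the list
    # Every endpoint failed; surface a single error to the caller.
    raise BusUnreachable("network_unreachable")
```

The list is expected to be pre-sorted in the order given in this section, so the function itself stays free of policy.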
|
| 118 |
+
|
| 119 |
+
### 4.3 Reconnection
|
| 120 |
+
|
| 121 |
+
WebSocket connections are persistent for tool-call loops and live pubsub. On disconnect, the client reconnects with exponential backoff (1s, 2s, 4s, ..., capped at 60s). Reconnect re-subscribes to all active topics.
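A minimal sketch of the backoff schedule described above, written in Python for illustration. The function name `backoff_delays` is hypothetical; only the 1s start, the doubling, and the 60s cap come from this section.

```python
from itertools import islice
from typing import Iterator

def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays that double each attempt and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)

# First eight delays of the schedule.
first_eight = list(islice(backoff_delays(), 8))
```

After a successful reconnect the client would start a fresh generator, so the next outage begins again at the base delay.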
|
| 122 |
+
|
| 123 |
+
### 4.4 Mobile-as-callable
|
| 124 |
+
|
| 125 |
+
In Phase 2, the mobile client does NOT register capabilities back into the community. It's purely a caller. Phase 3 may allow simple things (`market.list` from local cache while offline).
|
| 126 |
+
|
| 127 |
+
### 4.5 Token-bearer mode
|
| 128 |
+
|
| 129 |
+
For background-fetch (when the app is suspended), the OS may run a brief task. Background tasks use a **bearer token** from M16 (issued at onboarding, refreshed on each app foreground). The token has scope:
|
| 130 |
+
|
| 131 |
+
```json
|
| 132 |
+
{
|
| 133 |
+
"capabilities": ["chat.fetch","marketplace.list"],
|
| 134 |
+
"rate_limit_per_minute": 30
|
| 135 |
+
}
|
| 136 |
+
```
|
| 137 |
+
|
| 138 |
+
This avoids needing the user's biometric to unlock the private key for background polling.
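How an anchor might enforce such a scope can be sketched as follows. This is an assumption-laden illustration, not the M16 implementation: `ScopedToken` and `allows` are hypothetical names, and the sliding 60-second window is one possible reading of `rate_limit_per_minute`.

```python
import time
from collections import deque
from typing import Iterable, Optional

class ScopedToken:
    """Bearer token limited to a capability list and a per-minute call budget."""

    def __init__(self, capabilities: Iterable[str], rate_limit_per_minute: int):
        self.capabilities = frozenset(capabilities)
        self.rate_limit_per_minute = rate_limit_per_minute
        self._calls = deque()  # timestamps of recently allowed calls

    def allows(self, capability: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if capability not in self.capabilities:
            return False
        # Drop timestamps that have left the 60-second window.
        while self._calls and now - self._calls[0] >= 60.0:
            self._calls.popleft()
        if len(self._calls) >= self.rate_limit_per_minute:
            return False
        self._calls.append(now)
        return True
```

A background task holding this token can poll `chat.fetch` without touching the identity key, while any call outside the listed capabilities is refused.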
|
| 139 |
+
|
| 140 |
+
---
|
| 141 |
+
|
| 142 |
+
## 5. Push notifications
|
| 143 |
+
|
| 144 |
+
### 5.1 What triggers a push
|
| 145 |
+
|
| 146 |
+
When the following events occur in the community, anchors send a push to relevant subscribed mobile devices:
|
| 147 |
+
|
| 148 |
+
- `chat.message.sent` where recipient is the mobile user
|
| 149 |
+
- `chat.thread.message.sent` where mobile user is a thread member
|
| 150 |
+
- `marketplace.post.created` where post matches user's subscribed categories
|
| 151 |
+
- `community.alert` (broadcast emergency alert)
|
| 152 |
+
- `node.joined` (subscribed users only)
|
| 153 |
+
|
| 154 |
+
### 5.2 Push payload shape
|
| 155 |
+
|
| 156 |
+
Per [M15 §5.4](M15-relay-tier.md), the payload is minimal:
|
| 157 |
+
|
| 158 |
+
```json
|
| 159 |
+
{
|
| 160 |
+
"event_type": "chat.message.sent",
|
| 161 |
+
"sender_short": "7H4G-...",
|
| 162 |
+
"preview": "Hallo Jana, ich bring..."
|
| 163 |
+
}
|
| 164 |
+
```
|
| 165 |
+
|
| 166 |
+
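The `preview` field is optional cleartext and is only sent for messages that are not E2E-encrypted. For E2E messages, the preview is absent and the app must fetch and decrypt on open.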
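The preview rule can be sketched on the anchor side as below. `build_push_payload` and `max_preview_chars` are hypothetical; the 40-character cut-off is an assumption, since the spec only shows a truncated preview without stating a length.

```python
from typing import Optional

def build_push_payload(event_type: str, sender_short: str,
                       body_text: Optional[str], *, e2e: bool,
                       max_preview_chars: int = 40) -> dict:
    """Build the minimal push payload; include a preview only for cleartext messages."""
    payload = {"event_type": event_type, "sender_short": sender_short}
    if not e2e and body_text:
        payload["preview"] = body_text[:max_preview_chars]
    return payload
```

Keeping the E2E branch free of any body-derived field means the relay never sees message content for encrypted chats.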
|
| 167 |
+
|
| 168 |
+
### 5.3 iOS vs Android specifics
|
| 169 |
+
|
| 170 |
+
**iOS:** APNs payload with `aps.alert.title` and `aps.alert.body`. Background mode enabled to fetch on receive (`content-available: 1`).
|
| 171 |
+
|
| 172 |
+
**Android:** FCM `data` message (not `notification`, so the app controls display). Handled by the app's `FirebaseMessagingService`.
|
| 173 |
+
|
| 174 |
+
### 5.4 Quiet hours
|
| 175 |
+
|
| 176 |
+
User-configurable. Push silenced 22:00-07:00 by default; emergency alerts override.
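Because the default window wraps past midnight, the check needs two cases. The sketch below is illustrative: `is_quiet` and `should_alert` are hypothetical names, and treating `community.alert` as the emergency alert that overrides quiet hours is an assumption based on §5.1.

```python
from datetime import time

def is_quiet(now: time, start: time = time(22, 0), end: time = time(7, 0)) -> bool:
    """True if `now` falls inside the quiet window [start, end)."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end

def should_alert(event_type: str, now: time) -> bool:
    """Emergency alerts always sound; everything else respects quiet hours."""
    return event_type == "community.alert" or not is_quiet(now)
```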
|
| 177 |
+
|
| 178 |
+
### 5.5 Mute and per-thread settings
|
| 179 |
+
|
| 180 |
+
Per-thread mute, per-category marketplace silence. Stored in `mobile.preferences` event (self-only, encrypted-at-rest on device).
|
| 181 |
+
|
| 182 |
+
---
|
| 183 |
+
|
| 184 |
+
## 6. Secure key storage
|
| 185 |
+
|
| 186 |
+
### 6.1 iOS - Keychain Services
|
| 187 |
+
|
| 188 |
+
```dart
|
| 189 |
+
// lib/identity/secure_storage.dart (sketch)
|
| 190 |
+
const _accessibility = 'kSecAttrAccessibleAfterFirstUnlock';
|
| 191 |
+
|
| 192 |
+
Future<void> storePrivateKey(String label, Uint8List bytes) async {
|
| 193 |
+
await KeychainAccess.setData(
|
| 194 |
+
label: label,
|
| 195 |
+
data: bytes,
|
| 196 |
+
accessibility: _accessibility,
|
| 197 |
+
accessControl: AccessControl.userPresence, // biometric on key use
|
| 198 |
+
);
|
| 199 |
+
}
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
### 6.2 Android - Keystore
|
| 203 |
+
|
| 204 |
+
Hardware-backed if available (StrongBox on modern Pixels), else TEE.
|
| 205 |
+
|
| 206 |
+
```dart
|
| 207 |
+
// Sketch, as in §6.1; the exact plugin API is still to be confirmed.
final cipher = await CryptographyFlutter.aesGcm(
|
| 208 |
+
keyId: 'hearthnet_identity_v1',
|
| 209 |
+
requireAuth: AuthMethod.biometric,
|
| 210 |
+
);
|
| 211 |
+
```
|
| 212 |
+
|
| 213 |
+
### 6.3 Backup
|
| 214 |
+
|
| 215 |
+
Private keys are NEVER backed up via iCloud / Google Backup. App-level backup uses an encrypted export blob (user-chosen passphrase) that the user is expected to save out-of-band (e.g. password manager, written down).
|
| 216 |
+
|
| 217 |
+
`config.mobile.cloud_backup_allowed = false` is enforced.
|
| 218 |
+
|
| 219 |
+
### 6.4 Lost device
|
| 220 |
+
|
| 221 |
+
If the user loses the device, they:
|
| 222 |
+
|
| 223 |
+
1. Wait out the device's session; eventually messages stop delivering
|
| 224 |
+
2. From another device, call `node.revoke` on this NodeID (anchor co-signs)
|
| 225 |
+
3. The revoked NodeID is then blacklisted; the lost phone, even if recovered, can't authenticate
|
| 226 |
+
|
| 227 |
+
This is identical to the Phase 1 node revocation flow.
|
| 228 |
+
|
| 229 |
+
---
|
| 230 |
+
|
| 231 |
+
## 7. UI surface (mobile)
|
| 232 |
+
|
| 233 |
+
Mirrors the web UI ([M08](../../modules/M08-ui.md)) but native:
|
| 234 |
+
|
| 235 |
+
| Tab | Content |
|
| 236 |
+
|-----|---------|
|
| 237 |
+
| **Chat** | 1:1 conversations + group threads. Live updates via WS. Voice notes optional (record → STT → send transcript + audio attachment). |
|
| 238 |
+
| **Market** | List + post; image attachments via camera/gallery. |
|
| 239 |
+
| **Ask** | LLM chat. Tool-augmented mode available. Voice input button. |
|
| 240 |
+
| **Community** | Member list, recent events, federation peers. |
|
| 241 |
+
| **Settings** | Push prefs, language, backup, advanced. |
|
| 242 |
+
|
| 243 |
+
All UI is plain Material/Cupertino, with no exotic frameworks. Matches Christof's preference for boring, durable tech.
|
| 244 |
+
|
| 245 |
+
---
|
| 246 |
+
|
| 247 |
+
## 8. Configuration
|
| 248 |
+
|
| 249 |
+
Mobile-side (lives in app):
|
| 250 |
+
|
| 251 |
+
```dart
|
| 252 |
+
const config = {
|
| 253 |
+
'community_id': '<from invite>',
|
| 254 |
+
'anchor_endpoints': [/* from manifest */],
|
| 255 |
+
'relay_url': 'https://relay.hearthnet.de',
|
| 256 |
+
'push_enabled': true,
|
| 257 |
+
'quiet_hours': {'start': '22:00', 'end': '07:00'},
|
| 258 |
+
'background_fetch_minutes': 15,
|
| 259 |
+
};
|
| 260 |
+
```
|
| 261 |
+
|
| 262 |
+
Anchor-side (Python, for the push_authority service):
|
| 263 |
+
|
| 264 |
+
```python
|
| 265 |
+
config.mobile.push_enabled = True
|
| 266 |
+
config.mobile.push_categories_marketplace = ["essentials","emergency"]
|
| 267 |
+
config.mobile.push_quiet_hours_default = ("22:00","07:00")
|
| 268 |
+
```
|
| 269 |
+
|
| 270 |
+
---
|
| 271 |
+
|
| 272 |
+
## 9. Errors
|
| 273 |
+
|
| 274 |
+
| Condition | UI presentation |
|
| 275 |
+
|-----------|-----------------|
|
| 276 |
+
| No network | Offline banner; queued sends |
|
| 277 |
+
| Anchor unreachable | "Your community anchor is offline. Retrying..." |
|
| 278 |
+
| Relay unreachable | Falls back to direct; warns if all fail |
|
| 279 |
+
| Token expired | Silent refresh; only surface if refresh fails |
|
| 280 |
+
| Push delivery failed | No UI; logged for diagnostics |
|
| 281 |
+
| Manifest signature mismatch | Hard block; re-onboard required |
|
| 282 |
+
|
| 283 |
+
---
|
| 284 |
+
|
| 285 |
+
## 10. Tests
|
| 286 |
+
|
| 287 |
+
### Flutter side
|
| 288 |
+
- Widget tests for each tab
|
| 289 |
+
- Integration test for onboarding flow with a mock anchor
|
| 290 |
+
- E2E test using a real anchor in CI (Linux runner running hearthnet)
|
| 291 |
+
|
| 292 |
+
### Anchor-side Python
|
| 293 |
+
- `test_push_subscription_recorded`
|
| 294 |
+
- `test_push_dispatch_on_chat_message_sent`
|
| 295 |
+
- `test_quiet_hours_silences_non_emergency`
|
| 296 |
+
- `test_revocation_revokes_mobile_session`
|
| 297 |
+
|
| 298 |
+
### Manual
|
| 299 |
+
- iOS + Android smoke tests on physical devices
|
| 300 |
+
- Background-fetch verified across 30-minute suspensions
|
| 301 |
+
|
| 302 |
+
---
|
| 303 |
+
|
| 304 |
+
## 11. Cross-references
|
| 305 |
+
|
| 306 |
+
| What | Where |
|
| 307 |
+
|------|-------|
|
| 308 |
+
| Bus protocol | [Phase 1 CAP §5](../../CAPABILITY_CONTRACT.md) |
| Push relay tier | [M15 §5.4](M15-relay-tier.md) |
| Token-bearer auth | [M16 §5.5](M16-tokens.md) |
|
| 311 |
+
| E2E chat | [M23](M23-e2e-encryption.md) |
|
| 312 |
+
| WebSocket | [X06](../cross-cutting/X06-websocket.md) |
|
| 313 |
+
| Invite blobs | [Phase 1 CAP §13](../../CAPABILITY_CONTRACT.md) |
|
| 314 |
+
| Web UI (mirror) | [M08](../../modules/M08-ui.md) |
|
| 315 |
+
|
| 316 |
+
---
|
| 317 |
+
|
| 318 |
+
## 12. Open questions
|
| 319 |
+
|
| 320 |
+
1. **Flutter vs React Native vs native (Swift + Kotlin).** Choosing Flutter for shared codebase and Christof's stated preference for boring/durable stacks. Reconsider if Flutter's keychain support is shaky.
|
| 321 |
+
2. **End-to-end encryption library in Flutter.** Need a libsodium binding that matches the Python side bit-exactly. `flutter_sodium` is well-maintained; verify on both platforms.
|
| 322 |
+
3. **Background fetch reliability.** iOS throttles aggressively. We accept "best effort"; push is the real delivery mechanism.
|
| 323 |
+
4. **Offline mode depth.** Mobile-only LLM (small Phi-3 / Gemma 2B) is Phase 3.
|
| 324 |
+
5. **Web push for PWA.** Could the same flow target a PWA (no native app)? Yes, with FCM web push; documented but not built in Phase 2.
|
| 325 |
+
6. **Family-share licence.** Christof might want to ship the iOS app to family members under his account; App Store policy permits this within Family Sharing.
|
docs/p2_p3/M23-e2e-encryption.md
ADDED
|
@@ -0,0 +1,474 @@
|
| 1 |
+
# M23 - End-to-End Encryption
|
| 2 |
+
|
| 3 |
+
**Spec version:** v1.0 (Phase 2)
|
| 4 |
+
**Depends on:** M01 (identity, key derivation), M07 (blobs, for prekey bundles), X02 (events, for prekey/session events), X04 (config), `pynacl`
|
| 5 |
+
**Depended on by:** M10 chat (extended), M25 group chat, optionally M07 file encryption
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Provide end-to-end encryption between community members for:
|
| 12 |
+
|
| 13 |
+
- **1:1 chat** (via M10 extension): every chat message encrypted with a per-sender ratchet
|
| 14 |
+
- **Group chat** (M25): per-thread sender keys
|
| 15 |
+
- **File envelopes** (M07 extension): chunks optionally wrapped in a per-recipient envelope
|
| 16 |
+
|
| 17 |
+
The cryptographic design borrows from Signal but stays simpler:
|
| 18 |
+
|
| 19 |
+
- **X3DH** for initial key agreement (one identity key + one signed prekey + one one-time prekey)
|
| 20 |
+
- **Double Ratchet** for per-session forward-secrecy
|
| 21 |
+
- **Sender keys** (Signal-style) for group threads
|
| 22 |
+
- **Per-blob envelope** for file encryption
|
| 23 |
+
|
| 24 |
+
This module owns the crypto primitives and session state. M10 and M25 own the message protocol that calls into M23.
|
| 25 |
+
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
## 2. File layout
|
| 29 |
+
|
| 30 |
+
```
|
| 31 |
+
hearthnet/crypto/
├── __init__.py
├── kem.py                  # X25519 key agreement (X3DH)
├── ratchet.py              # Double Ratchet, per-session
├── sender_keys.py          # Group sender keys (M25 helper)
├── envelope.py             # File envelope encryption (chunks)
└── prekeys.py              # Prekey bundle storage and publication
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
The `hearthnet/crypto/` directory is a NEW top-level package in Phase 2.
|
| 41 |
+
|
| 42 |
+
---
|
| 43 |
+
|
| 44 |
+
## 3. Public API
|
| 45 |
+
|
| 46 |
+
### 3.1 `kem.py` - X3DH
|
| 47 |
+
|
| 48 |
+
```python
|
| 49 |
+
# hearthnet/crypto/kem.py
|
| 50 |
+
from dataclasses import dataclass
|
| 51 |
+
|
| 52 |
+
@dataclass(frozen=True)
|
| 53 |
+
class X25519KeyPair:
|
| 54 |
+
private: bytes # 32 bytes
|
| 55 |
+
public: bytes # 32 bytes "x25519:<base64>"
|
| 56 |
+
|
| 57 |
+
def x25519_generate() -> X25519KeyPair: ...
|
| 58 |
+
def x25519_dh(our_priv: bytes, their_pub: bytes) -> bytes:
|
| 59 |
+
"""Computes the shared secret. 32 bytes."""
|
| 60 |
+
|
| 61 |
+
@dataclass(frozen=True)
|
| 62 |
+
class PrekeyBundle:
|
| 63 |
+
"""What a recipient publishes so senders can establish a session without them being online."""
|
| 64 |
+
node_id_full: str
|
| 65 |
+
identity_pub: bytes # long-lived; derived from Ed25519 identity via SignedConversion
|
| 66 |
+
signed_prekey_pub: bytes
|
| 67 |
+
signed_prekey_sig: bytes # Ed25519 signature over signed_prekey_pub by identity Ed25519 key
|
| 68 |
+
one_time_prekeys_pub: list[bytes] # depleted on use
|
| 69 |
+
published_at: int # unix seconds
|
| 70 |
+
|
| 71 |
+
def derive_identity_x25519_from_ed25519(ed_kp: KeyPair) -> X25519KeyPair:
|
| 72 |
+
"""Use the standard nacl conversion. Single x25519 identity key per device."""
|
| 73 |
+
|
| 74 |
+
def build_prekey_bundle(
|
| 75 |
+
ed_kp: KeyPair,
|
| 76 |
+
*,
|
| 77 |
+
num_one_time: int = E2E_PREKEY_BUNDLE_SIZE,
|
| 78 |
+
) -> tuple[PrekeyBundle, X25519KeyPair, list[X25519KeyPair]]:
|
| 79 |
+
"""Returns (bundle, signed_prekey_full, one_time_prekeys_full).
|
| 80 |
+
Caller persists the private halves; publishes only the bundle."""
|
| 81 |
+
|
| 82 |
+
def x3dh_initiator(
|
| 83 |
+
our_identity_x: X25519KeyPair,
|
| 84 |
+
our_ephemeral: X25519KeyPair,
|
| 85 |
+
their_bundle: PrekeyBundle,
|
| 86 |
+
) -> tuple[bytes, dict]:
|
| 87 |
+
"""Returns (shared_secret, session_init_message).
|
| 88 |
+
session_init_message includes our identity_pub + ephemeral_pub + used_otp_index."""
|
| 89 |
+
|
| 90 |
+
def x3dh_responder(
|
| 91 |
+
our_identity_x: X25519KeyPair,
|
| 92 |
+
our_signed_prekey: X25519KeyPair,
|
| 93 |
+
our_one_time_prekey: X25519KeyPair | None,
|
| 94 |
+
their_identity_pub: bytes,
|
| 95 |
+
their_ephemeral_pub: bytes,
|
| 96 |
+
) -> bytes:
|
| 97 |
+
"""Returns shared_secret."""
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
### 3.2 `ratchet.py` - Double Ratchet
|
| 101 |
+
|
| 102 |
+
```python
|
| 103 |
+
# hearthnet/crypto/ratchet.py
|
| 104 |
+
@dataclass
|
| 105 |
+
class RatchetState:
|
| 106 |
+
"""One session, one direction. There are two per session (send + receive)."""
|
| 107 |
+
root_key: bytes
|
| 108 |
+
chain_key: bytes
|
| 109 |
+
counter: int
|
| 110 |
+
epoch: int
|
| 111 |
+
    skipped_messages: dict[tuple[int, int], bytes]  # (epoch, counter) → message_key
|
| 112 |
+
dh_keypair: X25519KeyPair | None
|
| 113 |
+
remote_dh_pub: bytes | None
|
| 114 |
+
|
| 115 |
+
@dataclass
|
| 116 |
+
class RatchetSession:
|
| 117 |
+
"""A bidirectional encrypted session between two NodeIDs."""
|
| 118 |
+
peer_node_id_full: str
|
| 119 |
+
send_state: RatchetState
|
| 120 |
+
recv_state: RatchetState
|
| 121 |
+
|
| 122 |
+
def is_established(self) -> bool: ...
|
| 123 |
+
|
| 124 |
+
def init_session_initiator(shared_secret: bytes, peer_dh_pub: bytes) -> RatchetSession: ...
|
| 125 |
+
def init_session_responder(shared_secret: bytes, our_dh_kp: X25519KeyPair) -> RatchetSession: ...
|
| 126 |
+
|
| 127 |
+
@dataclass(frozen=True)
|
| 128 |
+
class RatchetMessageHeader:
|
| 129 |
+
dh_pub: bytes # current sender's DH pub
|
| 130 |
+
epoch: int
|
| 131 |
+
counter: int
|
| 132 |
+
|
| 133 |
+
def encrypt_message(
|
| 134 |
+
session: RatchetSession,
|
| 135 |
+
plaintext: bytes,
|
| 136 |
+
*,
|
| 137 |
+
aad: bytes = b"",
|
| 138 |
+
) -> tuple[RatchetMessageHeader, bytes]:
|
| 139 |
+
"""Returns (header, ciphertext). Mutates session.send_state."""
|
| 140 |
+
|
| 141 |
+
def decrypt_message(
|
| 142 |
+
session: RatchetSession,
|
| 143 |
+
header: RatchetMessageHeader,
|
| 144 |
+
ciphertext: bytes,
|
| 145 |
+
*,
|
| 146 |
+
aad: bytes = b"",
|
| 147 |
+
) -> bytes:
|
| 148 |
+
"""Verifies + decrypts. Mutates session.recv_state.
|
| 149 |
+
Tolerates up to E2E_RATCHET_MAX_OUT_OF_ORDER out-of-order messages via skipped_messages."""
|
| 150 |
+
|
| 151 |
+
class RatchetError(Exception):
|
| 152 |
+
"""code in {
|
| 153 |
+
'session_not_established','decrypt_failed','out_of_order_too_far',
|
| 154 |
+
'message_too_old','aad_mismatch'}"""
|
| 155 |
+
code: str
|
| 156 |
+
|
| 157 |
+
# Persistence: sessions are serialised via dataclasses_json into a SQLite table.
|
| 158 |
+
```
|
| 159 |
+
|
| 160 |
+
### 3.3 `sender_keys.py` - Group ratchet
|
| 161 |
+
|
| 162 |
+
```python
|
| 163 |
+
# hearthnet/crypto/sender_keys.py
|
| 164 |
+
@dataclass
|
| 165 |
+
class SenderKeyState:
|
| 166 |
+
"""Per (group, sender) sender-key chain. Each sender broadcasts a chain key
|
| 167 |
+
to all group members at thread create / join."""
|
| 168 |
+
thread_id: str
|
| 169 |
+
sender_node_id: str
|
| 170 |
+
chain_key: bytes
|
| 171 |
+
counter: int
|
| 172 |
+
signature_keypair: tuple[bytes, bytes] | None # ed25519, for message signing
|
| 173 |
+
|
| 174 |
+
@dataclass
|
| 175 |
+
class GroupSession:
|
| 176 |
+
"""One per thread; holds all members' SenderKeyStates."""
|
| 177 |
+
thread_id: str
|
| 178 |
+
    sender_keys: dict[str, SenderKeyState]  # sender_node_id → state
|
| 179 |
+
|
| 180 |
+
def init_sender_key(thread_id: str, sender_node_id: str) -> SenderKeyState: ...
|
| 181 |
+
def encrypt_for_group(session: GroupSession, sender_node_id: str, plaintext: bytes) -> tuple[dict, bytes]: ...
|
| 182 |
+
def decrypt_for_group(session: GroupSession, header: dict, ciphertext: bytes) -> bytes: ...
|
| 183 |
+
|
| 184 |
+
def serialise_sender_key_distribution(state: SenderKeyState) -> bytes:
|
| 185 |
+
"""Serialise a sender key for sending to other group members.
|
| 186 |
+
MUST be sent inside a pairwise Double Ratchet session, not in cleartext."""
|
| 187 |
+
|
| 188 |
+
def consume_sender_key_distribution(bytes_blob: bytes, session: GroupSession) -> None: ...
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
### 3.4 `envelope.py` - File envelope
|
| 192 |
+
|
| 193 |
+
```python
|
| 194 |
+
# hearthnet/crypto/envelope.py
|
| 195 |
+
@dataclass(frozen=True)
|
| 196 |
+
class FileEnvelopeHeader:
|
| 197 |
+
recipient_node_ids: list[str]
|
| 198 |
+
    wrapped_keys: dict[str, bytes]  # node_id_short → wrapped symmetric key
|
| 199 |
+
nonce: bytes # 12 bytes
|
| 200 |
+
chunk_size_bytes: int
|
| 201 |
+
|
| 202 |
+
def encrypt_blob_for(
|
| 203 |
+
recipients: list[str], # NodeIDs full
|
| 204 |
+
plaintext: bytes,
|
| 205 |
+
sender_kp: KeyPair,
|
| 206 |
+
sessions_provider: Callable[[str], RatchetSession | None],
|
| 207 |
+
) -> tuple[FileEnvelopeHeader, bytes]:
|
| 208 |
+
"""1. Generate random 32-byte symmetric key
|
| 209 |
+
2. For each recipient: encrypt key with their ratchet session (one-shot use of next message key)
|
| 210 |
+
3. ChaCha20-Poly1305-encrypt plaintext with the symmetric key + nonce
|
| 211 |
+
4. Return (header, ciphertext)"""
|
| 212 |
+
|
| 213 |
+
def decrypt_blob_for_self(
|
| 214 |
+
header: FileEnvelopeHeader,
|
| 215 |
+
ciphertext: bytes,
|
| 216 |
+
our_node_id_full: str,
|
| 217 |
+
sessions_provider: Callable[[str], RatchetSession | None],
|
| 218 |
+
) -> bytes:
|
| 219 |
+
"""1. Find our wrapped key in header.wrapped_keys
|
| 220 |
+
2. Decrypt symmetric key via sender's ratchet session
|
| 221 |
+
3. ChaCha20-Poly1305-decrypt"""
|
| 222 |
+
```
|
| 223 |
+
|
| 224 |
+
### 3.5 `prekeys.py` - Publication and consumption
|
| 225 |
+
|
| 226 |
+
```python
|
| 227 |
+
# hearthnet/crypto/prekeys.py
|
| 228 |
+
class PrekeyStore:
|
| 229 |
+
"""Persists this node's private prekey material and the bundles we've consumed for others."""
|
| 230 |
+
|
| 231 |
+
def __init__(self, db_path: Path):
|
| 232 |
+
...
|
| 233 |
+
|
| 234 |
+
def publish_self(
|
| 235 |
+
self,
|
| 236 |
+
ed_kp: KeyPair,
|
| 237 |
+
event_log: EventLog,
|
| 238 |
+
*,
|
| 239 |
+
num_one_time: int = E2E_PREKEY_BUNDLE_SIZE,
|
| 240 |
+
) -> None:
|
| 241 |
+
"""1. Build PrekeyBundle (kem.build_prekey_bundle)
|
| 242 |
+
2. Persist private halves locally
|
| 243 |
+
3. Emit e2e.prekeys.published event with public halves"""
|
| 244 |
+
|
| 245 |
+
def get_peer_bundle(self, peer_node_id_full: str) -> PrekeyBundle | None:
|
| 246 |
+
"""Look in local cache; if absent, fetch from peer via bus.
|
| 247 |
+
Returns one bundle including one consumable one-time prekey if available."""
|
| 248 |
+
|
| 249 |
+
def consume_one_time_prekey(self, our_otp_index: int) -> X25519KeyPair | None:
|
| 250 |
+
"""Server-side: when someone uses one of our one-time prekeys, return + remove it."""
|
| 251 |
+
|
| 252 |
+
def refill_one_time_prekeys_if_low(self, ed_kp: KeyPair, event_log: EventLog) -> int:
|
| 253 |
+
"""If fewer than E2E_PREKEY_BUNDLE_SIZE / 4 remain, publish a new bundle.
|
| 254 |
+
Returns count added."""
|
| 255 |
+
```
|
| 256 |
+
|
| 257 |
+
---
|
| 258 |
+
|
| 259 |
+
## 4. Behaviour
|
| 260 |
+
|
| 261 |
+
### 4.1 Session establishment lifecycle (1:1)
|
| 262 |
+
|
| 263 |
+
```
|
| 264 |
+
Alice wants to send Bob an encrypted message; no session exists.
   ↓
PrekeyStore.get_peer_bundle("ed25519:bob") → fetch from Bob's most recent e2e.prekeys.published event
   ↓
x3dh_initiator(alice_identity_x, alice_ephemeral, bob_bundle) → (shared_secret, init_msg)
   ↓
init_session_initiator(shared_secret, bob_signed_prekey_pub) → RatchetSession
   ↓
encrypt_message(session, plaintext) → (header, ciphertext)
   ↓
Alice sends: chat.message.sent event with data.body = {
    "e2e": true,
    "header": { x3dh_init: init_msg, ratchet_header: header },
    "ciphertext": "<base64>"
}
   ↓
Bob receives event.
  x3dh_responder(...) → shared_secret
  init_session_responder(...) → RatchetSession
  decrypt_message(...) → plaintext
  Emit e2e.session.established event so Alice can clean up retries
   ↓
Subsequent messages: just header + ciphertext (no x3dh_init).
|
| 287 |
+
```
|
| 288 |
+
|
| 289 |
+
### 4.2 Group session establishment
|
| 290 |
+
|
| 291 |
+
```
|
| 292 |
+
Thread creator emits chat.thread.created with members and an ed25519:thread_signing_root.
   ↓
Each member generates a SenderKeyState for themselves in this thread.
   ↓
Each member, in a pairwise loop, sends their sender key distribution to each other member
inside their 1:1 ratchet sessions (so non-thread-members never see it).
   ↓
Once everyone has everyone's sender keys, encrypt/decrypt happens with sender keys
(chain ratchet only; no DH ratchet on the group session itself).
|
| 301 |
+
```
|
| 302 |
+
|
| 303 |
+
When a member is added later, the inviter must re-distribute all existing senders' current chain states to the new member, positioned at the message from which the new member should be able to read (usually the current state, not history).
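One reason the current state is the natural starting point is that a hash-based chain only moves forward. The sketch below illustrates this with an HMAC-SHA256 chain step; the KDF constants (`0x01`, `0x02`) follow the pattern in Signal's Double Ratchet specification and are an assumption here, as are the names `advance` and `message_key_at`.

```python
import hashlib
import hmac

def advance(chain_key: bytes) -> tuple[bytes, bytes]:
    """One chain step: returns (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

def message_key_at(chain_key: bytes, counter: int, target: int) -> bytes:
    """Derive the message key for `target` from chain state held at `counter`."""
    if target < counter:
        raise ValueError("chain state cannot produce keys for earlier messages")
    message_key = b""
    for _ in range(target - counter + 1):
        chain_key, message_key = advance(chain_key)
    return message_key
```

A member who receives the chain state at counter 3 can read message 4 onwards but has no way to recompute the keys for messages 0 to 2.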
|
| 304 |
+
|
| 305 |
+
When a member is removed, existing sender keys are still known to them. **All members must rotate their sender keys** to achieve forward secrecy after removal. UI prompts this.
|
| 306 |
+
|
| 307 |
+
### 4.3 Out-of-order messages
|
| 308 |
+
|
| 309 |
+
Up to `E2E_RATCHET_MAX_OUT_OF_ORDER` (32) skipped message keys are cached per session. Beyond that, `out_of_order_too_far` is raised; the message is dropped and the sender notified (out-of-band) that they should rekey.
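The bookkeeping behind this limit might look like the following sketch. `skip_to`, `OutOfOrderTooFar`, and the HMAC-based `_step` KDF are hypothetical stand-ins for whatever `ratchet.py` actually uses; only the limit of 32 and the `(epoch, counter)` key shape come from the spec.

```python
import hashlib
import hmac

E2E_RATCHET_MAX_OUT_OF_ORDER = 32  # value stated in this section

class OutOfOrderTooFar(Exception):
    """Stands in for RatchetError(code='out_of_order_too_far')."""

def _step(chain_key: bytes) -> tuple[bytes, bytes]:
    # Assumed KDF: HMAC-SHA256 with constant inputs.
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

def skip_to(chain_key: bytes, counter: int, target: int, epoch: int,
            skipped: dict) -> tuple[bytes, int]:
    """Advance the receive chain to `target`, caching keys for skipped counters."""
    if len(skipped) + (target - counter) > E2E_RATCHET_MAX_OUT_OF_ORDER:
        raise OutOfOrderTooFar(f"gap to counter {target} exceeds the cache limit")
    while counter < target:
        chain_key, message_key = _step(chain_key)
        skipped[(epoch, counter)] = message_key  # kept for late arrivals
        counter += 1
    return chain_key, counter
```

When a late message finally arrives, its key is popped from `skipped`, which frees room for future gaps.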
|
| 310 |
+
|
| 311 |
+
### 4.4 Rekeying
|
| 312 |
+
|
| 313 |
+
After `E2E_RATCHET_REKEY_AFTER_MESSAGES` messages on the same DH ratchet, the next message includes a new DH ephemeral. Standard Double Ratchet behaviour. Transparent to users.
|
| 314 |
+
|
| 315 |
+
### 4.5 Session loss recovery
|
| 316 |
+
|
| 317 |
+
If a node's session state is lost (disk corruption, fresh install with same keys), the peer doesn't know, so messages will fail to decrypt. Recovery flow:
|
| 318 |
+
|
| 319 |
+
1. Decrypting node returns `e2e_decrypt_failed` via pubsub
|
| 320 |
+
2. Sending node sees this and re-initiates X3DH
|
| 321 |
+
3. New session replaces old; resends recent messages
|
| 322 |
+
|
| 323 |
+
UI shows "session was reset" so users know context might have been lost.
|
| 324 |
+
|
| 325 |
+
### 4.6 Identity X25519 derivation
|
| 326 |
+
|
| 327 |
+
We derive a per-device X25519 identity key from the Ed25519 identity key, using libsodium's `crypto_sign_ed25519_pk_to_curve25519` for the public half (and `crypto_sign_ed25519_sk_to_curve25519` for the private half). This way:
|
| 328 |
+
|
| 329 |
+
- Only one identity key to maintain
|
| 330 |
+
- Anyone with the public Ed25519 (in the community manifest) can derive the X25519 pub
|
| 331 |
+
- Signed prekey signatures use the Ed25519 key (already established as device identity)
|
| 332 |
+
|
| 333 |
+
### 4.7 Prekey publication
|
| 334 |
+
|
| 335 |
+
Each node publishes a fresh `e2e.prekeys.published` event on startup if their last one is > 24h old. The event contains:
|
| 336 |
+
|
| 337 |
+
- `identity_pubkey` (X25519 form)
|
| 338 |
+
- `signed_prekey` (with Ed25519 signature)
|
| 339 |
+
- `one_time_prekeys[]` (up to `E2E_PREKEY_BUNDLE_SIZE` = 20)
|
| 340 |
+
|
| 341 |
+
Consumers find a peer's bundle by reading their latest `e2e.prekeys.published` event from the log.
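
The publish-if-stale rule and the bundle lookup can be sketched like this. The event shape (a dict with `event_type`, `sender`, `lamport`, and `body`) and the choice of "latest" as the highest Lamport value are assumptions for illustration; the spec only states the 24h threshold and the event name.

```python
from typing import Any, Optional

PREKEY_MAX_AGE_SECONDS = 24 * 60 * 60  # "> 24h old" from this section


def needs_republish(last_published_at: Optional[float], now: float) -> bool:
    """True if no bundle was ever published or the last one is older than 24h."""
    return last_published_at is None or now - last_published_at > PREKEY_MAX_AGE_SECONDS


def latest_prekey_bundle(events: list, peer: str) -> Optional[dict]:
    """Return the body of the peer's most recent `e2e.prekeys.published` event."""
    candidates = [
        e for e in events
        if e["event_type"] == "e2e.prekeys.published" and e["sender"] == peer
    ]
    if not candidates:
        return None
    # Assumed ordering: the highest Lamport timestamp is the latest bundle.
    latest: dict[str, Any] = max(candidates, key=lambda e: e["lamport"])
    return latest["body"]
```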
|
| 342 |
+
|
| 343 |
+
### 4.8 What is NOT E2E
|
| 344 |
+
|
| 345 |
+
Even with M23 active:
|
| 346 |
+
- Event envelope (sender, recipient, lamport, event_type, wall_clock) is cleartext within the community
|
| 347 |
+
- Signatures over events remain valid for community-level audit
|
| 348 |
+
- Message *metadata* leaks to community members (who talked to whom and when), just not content
|
| 349 |
+
|
| 350 |
+
This is intentional: communities are trust roots; complete anonymity within a community is not a goal.
|
| 351 |
+
|
| 352 |
+
### 4.9 File envelope
|
| 353 |
+
|
| 354 |
+
For file blobs, `encrypt_blob_for(recipients, plaintext, ...)` produces a single ciphertext, with a small per-recipient header. Senders pick recipients explicitly (e.g. group thread members for an attachment). Bystanders cannot decrypt even if they fetch the blob via M07 `file.read`.
|
| 355 |
+
|
| 356 |
+
The blob's CID is the hash of the **ciphertext**, so the same plaintext sent to different recipient sets has different CIDs. Costs more storage; needed for security.
|
| 357 |
+
|
| 358 |
+
---
|
| 359 |
+
|
| 360 |
+
## 5. Persistence
|
| 361 |
+
|
| 362 |
+
### 5.1 Sessions table
|
| 363 |
+
|
| 364 |
+
```sql
|
| 365 |
+
CREATE TABLE ratchet_sessions (
    peer_node_id_full   TEXT PRIMARY KEY,
    session_blob        BLOB NOT NULL,      -- serialised RatchetSession
    established_at      INTEGER NOT NULL,
    last_used           INTEGER NOT NULL
);
|
| 371 |
+
```
|
| 372 |
+
|
| 373 |
+
### 5.2 Group sessions table
|
| 374 |
+
|
| 375 |
+
```sql
|
| 376 |
+
CREATE TABLE group_sessions (
    thread_id       TEXT PRIMARY KEY,
    session_blob    BLOB NOT NULL,
    updated_at      INTEGER NOT NULL
);
|
| 381 |
+
```
|
| 382 |
+
|
| 383 |
+
### 5.3 Prekey private halves
|
| 384 |
+
|
| 385 |
+
```sql
|
| 386 |
+
CREATE TABLE prekey_private (
    kind            TEXT NOT NULL,      -- 'identity'|'signed_prekey'|'one_time'
    index_or_id     TEXT NOT NULL,      -- '0' for identity; 'spk_v1' for signed; otp index for OTPs
    private_key     BLOB NOT NULL,
    consumed_at     INTEGER,            -- only set for one-time, when used
    PRIMARY KEY (kind, index_or_id)
);
|
| 393 |
+
```
|
| 394 |
+
|
| 395 |
+
Files locked at 0600. Backed up nightly via `hearthnet export` (encrypted with user passphrase).
|
| 396 |
+
|
| 397 |
+
---
|
| 398 |
+
|
| 399 |
+
## 6. Errors
|
| 400 |
+
|
| 401 |
+
`RatchetError` codes (M23-internal):
|
| 402 |
+
- `session_not_established`
|
| 403 |
+
- `decrypt_failed`
|
| 404 |
+
- `out_of_order_too_far`
|
| 405 |
+
- `message_too_old`
|
| 406 |
+
- `aad_mismatch`
|
| 407 |
+
|
| 408 |
+
Wire mapping per [CAP2 §9](../CAPABILITY_CONTRACT_v2.md):
- `e2e_session_missing` ← `session_not_established`
- `e2e_decrypt_failed` ← `decrypt_failed`, `aad_mismatch`
- `ratchet_out_of_order` ← `out_of_order_too_far`
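
The wire mapping listed above can be captured as a lookup table. The names `WIRE_CODE_BY_RATCHET_ERROR` and `to_wire_code` are hypothetical. Note that this section lists no wire code for `message_too_old`, so the sketch returns `None` for it rather than guessing.

```python
from typing import Optional

# Internal RatchetError code -> wire error code, as listed in this section.
WIRE_CODE_BY_RATCHET_ERROR = {
    "session_not_established": "e2e_session_missing",
    "decrypt_failed": "e2e_decrypt_failed",
    "aad_mismatch": "e2e_decrypt_failed",
    "out_of_order_too_far": "ratchet_out_of_order",
}


def to_wire_code(ratchet_error: str) -> Optional[str]:
    """Map an internal code to its wire code; None if no mapping is listed."""
    return WIRE_CODE_BY_RATCHET_ERROR.get(ratchet_error)
```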
|
| 412 |
+
|
| 413 |
+
---
|
| 414 |
+
|
| 415 |
+
## 7. Configuration
|
| 416 |
+
|
| 417 |
+
```python
|
| 418 |
+
config.e2e.enabled = True
|
| 419 |
+
config.e2e.chat_default_enabled = True # new 1:1 chats default to E2E
|
| 420 |
+
config.e2e.group_default_enabled = True
|
| 421 |
+
config.e2e.file_default_enabled = False # opt-in per blob
|
| 422 |
+
config.e2e.prekey_refill_count = E2E_PREKEY_BUNDLE_SIZE
|
| 423 |
+
config.e2e.rekey_after_messages = E2E_RATCHET_REKEY_AFTER_MESSAGES
|
| 424 |
+
config.e2e.max_out_of_order = E2E_RATCHET_MAX_OUT_OF_ORDER
|
| 425 |
+
```
|
| 426 |
+
|
| 427 |
+
---
|
| 428 |
+
|
| 429 |
+
## 8. Tests
|
| 430 |
+
|
| 431 |
+
### Unit
|
| 432 |
+
- `test_x25519_dh_symmetric`
|
| 433 |
+
- `test_x3dh_initiator_responder_agree`
|
| 434 |
+
- `test_ratchet_encrypt_decrypt_roundtrip`
|
| 435 |
+
- `test_ratchet_out_of_order_within_window`
|
| 436 |
+
- `test_ratchet_out_of_order_too_far_rejected`
|
| 437 |
+
- `test_rekey_after_n_messages`
|
| 438 |
+
- `test_group_sender_key_distribution_pairwise_only`
|
| 439 |
+
- `test_blob_envelope_recipient_only_can_decrypt`
|
| 440 |
+
|
| 441 |
+
### Integration
|
| 442 |
+
- `test_two_node_first_message_x3dh_session_persists`
|
| 443 |
+
- `test_session_recovery_after_disk_wipe`
|
| 444 |
+
- `test_group_add_member_can_decrypt_subsequent`
|
| 445 |
+
- `test_group_remove_member_cannot_decrypt_after_rotation`
|
| 446 |
+
- `test_file_envelope_2_recipients`
|
| 447 |
+
|
| 448 |
+
### Adversarial
|
| 449 |
+
- `test_replay_old_ratchet_message_rejected`
|
| 450 |
+
- `test_modified_ciphertext_decrypt_fails`
|
| 451 |
+
- `test_one_time_prekey_consumed_once`
|
| 452 |
+
|
| 453 |
+
---
|
| 454 |
+
|
| 455 |
+
## 9. Cross-references
|
| 456 |
+
|
| 457 |
+
| What | Where |
|------|-------|
| `e2e.*` events | [CAP2 §7](../CAPABILITY_CONTRACT_v2.md) |
| Encrypted chat body envelope | [CAP2 §1.1, §7.2 chat.message.sent](../CAPABILITY_CONTRACT_v2.md) |
| Chat service hook | [M10 ext](../../modules/M10-chat.md) (Phase 2 extension) |
| Group chat | [M25](M25-group-chat.md) |
| File envelope use | [M07 ext](../../modules/M07-file-blobs.md) |
| Identity key conversion | [M01](../../modules/M01-identity.md) |
|
| 465 |
+
|
| 466 |
+
---
|
| 467 |
+
|
| 468 |
+
## 10. Open questions
|
| 469 |
+
|
| 470 |
+
1. **Post-quantum readiness.** X25519 + ChaCha20-Poly1305 is not PQ-safe; the exposed part is the X25519 key agreement rather than the 256-bit symmetric cipher. Hybrid (X25519 + ML-KEM-768) is Phase 3.
|
| 471 |
+
2. **Verification of session identity.** Signal does safety numbers; HearthNet can do the same. UI ergonomics deferred.
|
| 472 |
+
3. **Multi-device per identity.** If a user has anchor + mobile + laptop, do they share keys or have separate ones? Currently separate (each device is a separate NodeID; group threads include all of them). Could unify with a "linked devices" Phase 3 feature.
|
| 473 |
+
4. **Forward secrecy on group membership change.** Current spec asks members to rotate sender keys on removal. UX of forcing this needs design.
|
| 474 |
+
5. **Cryptographic auditing.** This module should be reviewed by a real cryptographer before going to civil-defence pilots. Listed in `THREAT_MODEL_v2.md`.
|
docs/p2_p3/M24-rerank.md
ADDED
|
@@ -0,0 +1,227 @@
| 1 |
+
# M24 - Reranking Service
|
| 2 |
+
|
| 3 |
+
**Spec version:** v2.0
|
| 4 |
+
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M01 Identity](../../modules/M01-identity.md), [X03 Observability](../../cross-cutting/X03-observability.md)
|
| 5 |
+
**Depended on by:** [M05 RAG](../../modules/M05-rag.md) (extension), [M06 Marketplace](../../modules/M06-marketplace.md) (extension)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Re-score a candidate list of documents against a query using a cross-encoder, producing a higher-precision ordering than dense retrieval alone can deliver.
|
| 12 |
+
|
| 13 |
+
The capability is intentionally narrow: take query + N short docs, return ranked list. The service does **not** retrieve documents, does **not** fetch from blobs, does **not** know about corpora. Callers (typically `rag.query` and `market.search`) do retrieval first, then ask the reranker to refine the top 100.
|
| 14 |
+
|
| 15 |
+
This is the smallest service in Phase 2 (one model, one method, no streaming) and the most underrated. Adding it to the RAG pipeline lifts answer quality more than any other Phase 2 module.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. File layout
|
| 20 |
+
|
| 21 |
+
```
|
| 22 |
+
hearthnet/services/rerank/
├── __init__.py
├── service.py              # RerankService: capability registration
├── selection.py            # Picks a backend; loads on demand
└── backends/
    ├── base.py             # RerankBackend ABC
    ├── bge_reranker.py     # BGE-reranker-v2-m3 (default)
    └── cross_encoder.py    # Sentence-transformers fallback
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
## 3. Public API
|
| 35 |
+
|
| 36 |
+
### 3.1 `RerankBackend` (ABC)
|
| 37 |
+
|
| 38 |
+
```python
|
| 39 |
+
from typing import Any, Protocol


class RerankBackend(Protocol):
    name: str                 # e.g. "BAAI/bge-reranker-v2-m3"
    max_doc_chars: int        # truncate longer docs

    async def score(self, query: str, documents: list[str]) -> list[float]:
        """Return one score per document, same length and order."""
        ...

    async def health(self) -> dict[str, Any]:
        """Return {"ok": bool, "loaded": bool, "model_id": ..., ...}."""
        ...
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
### 3.2 `BgeRerankerBackend`
|
| 53 |
+
|
| 54 |
+
```python
|
| 55 |
+
class BgeRerankerBackend:
    def __init__(self, model_id: str = "BAAI/bge-reranker-v2-m3", device: str = "auto", max_batch: int = 32):
        ...

    async def score(self, query: str, documents: list[str]) -> list[float]:
        # Tokenise (query, doc) pairs in batches of max_batch
        # Forward pass; pooled logit becomes the score
        # Higher = more relevant
        ...
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
### 3.3 `RerankService`
|
| 67 |
+
|
| 68 |
+
```python
|
| 69 |
+
class RerankService:
    """Bus-facing facade. Picks backend, enforces RERANK_MAX_DOCS, emits metrics."""

    def __init__(self, bus: CapabilityBus, settings: RerankSettings, observability: Observability):
        ...

    async def start(self) -> None:
        # Register `rerank.text@1.0` on the bus
        ...

    async def rerank_text(self, body: RerankRequest) -> RerankResponse:
        # 1. Validate len(documents) <= RERANK_MAX_DOCS (else bad_request)
        # 2. Pick backend per `body.params.model` or default
        # 3. Truncate docs to backend.max_doc_chars
        # 4. Call backend.score
        # 5. Sort descending, take top_k (or all)
        # 6. Emit `rerank.latency_ms` metric
        ...
|
| 87 |
+
```
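
The core of `rerank_text` (steps 1, 3, 4 and 5 in the comments above) can be sketched with a toy backend. Everything here is illustrative: `OverlapBackend`, `Doc`, `BadRequest`, and the value `RERANK_MAX_DOCS = 100` are assumptions, since this module names the constant without giving its value. Backend selection (step 2) and metrics (step 6) are left out.

```python
import asyncio
from dataclasses import dataclass

RERANK_MAX_DOCS = 100  # assumed value; the spec names the constant only


class BadRequest(Exception):
    """Hypothetical stand-in for the `bad_request` wire error."""


@dataclass
class Doc:
    id: str
    text: str


class OverlapBackend:
    """Toy backend: score = number of query words that appear in the document."""
    max_doc_chars = 2048

    async def score(self, query: str, documents: list) -> list:
        words = set(query.lower().split())
        return [float(len(words & set(d.lower().split()))) for d in documents]


async def rerank_text(backend, query: str, documents: list, top_k: int = 10) -> list:
    # 1. Validate the request.
    if not query or not documents or len(documents) > RERANK_MAX_DOCS:
        raise BadRequest("empty query, empty documents, or too many documents")
    # 3. Truncate from the right to the backend's limit.
    texts = [d.text[: backend.max_doc_chars] for d in documents]
    # 4. One score per document, in the same order as the input.
    scores = await backend.score(query, texts)
    # 5. Sort descending by score and keep at most top_k (id, score) pairs.
    ranked = sorted(zip((d.id for d in documents), scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]
```

Calling `asyncio.run(rerank_text(OverlapBackend(), query, docs, top_k=2))` returns the two best `(id, score)` pairs. A real backend would replace the word-overlap score with a cross-encoder forward pass.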
|
| 88 |
+
|
| 89 |
+
### 3.4 Request / response dataclasses
|
| 90 |
+
|
| 91 |
+
```python
|
| 92 |
+
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RerankDoc:
    id: str
    text: str


@dataclass
class RerankRequest:
    query: str
    documents: list[RerankDoc]
    top_k: int = 10
    params: dict[str, Any] = field(default_factory=dict)   # {"model": "..."}


@dataclass
class RerankedDoc:
    id: str
    score: float


@dataclass
class RerankResponse:
    ranked: list[RerankedDoc]
    meta: dict[str, Any]
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
---
|
| 116 |
+
|
| 117 |
+
## 4. Behaviour
|
| 118 |
+
|
| 119 |
+
### 4.1 Backend selection
|
| 120 |
+
|
| 121 |
+
`params.model` is matched against installed backends (key = HuggingFace model id). Default is `BAAI/bge-reranker-v2-m3` because it handles ≥100 languages including German and Latin (relevant for the OCR'd historical doc corpus).
|
| 122 |
+
|
| 123 |
+
If `params.model` is supplied but unknown, the service returns `model_not_found` (see §5) with the list of installed backends.
|
| 124 |
+
|
| 125 |
+
### 4.2 Cold start
|
| 126 |
+
|
| 127 |
+
Backend is loaded lazily on first call. First call latency budget: ≤ 60s on the RTX 5090 (model ~2 GB on disk). Subsequent calls: ≤ 200ms for 50 docs at ~512 chars each.
|
| 128 |
+
|
| 129 |
+
The service publishes `model_loaded` and `model_loading` health states; `rerank.text` calls during loading wait up to `RERANK_LOAD_TIMEOUT_SECONDS` (default 60) then return `unavailable`.
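
One way to get "load once, wait a bounded time, otherwise `unavailable`" is a shared load task guarded by a timeout. The sketch below is an assumption about how this could be done with `asyncio`, not the service's actual code; `LazyBackend` and `Unavailable` are hypothetical names.

```python
import asyncio

RERANK_LOAD_TIMEOUT_SECONDS = 60.0  # default stated above


class Unavailable(Exception):
    """Hypothetical stand-in for the retryable `unavailable` error."""


class LazyBackend:
    """Starts loading on first use; each caller waits for a bounded time."""

    def __init__(self, loader, timeout_s: float = RERANK_LOAD_TIMEOUT_SECONDS):
        self._loader = loader        # async callable returning the loaded backend
        self._timeout_s = timeout_s
        self._task = None            # shared load task, created on first call

    async def get(self):
        if self._task is None:
            self._task = asyncio.ensure_future(self._loader())
        try:
            # shield() keeps the load running even if this caller times out.
            return await asyncio.wait_for(asyncio.shield(self._task), self._timeout_s)
        except asyncio.TimeoutError:
            raise Unavailable("model still loading") from None
```

Because the load task is shielded, a caller that times out does not cancel the load; a retry after the model has finished loading returns immediately.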
|
| 130 |
+
|
| 131 |
+
### 4.3 Score semantics
|
| 132 |
+
|
| 133 |
+
Scores are **raw logits**, not normalised probabilities. They are comparable within a single call but not across calls or backends. Callers MUST NOT compare a 0.91 score from BGE to a 0.91 from cross-encoder/ms-marco: the scales differ.
|
| 134 |
+
|
| 135 |
+
### 4.4 Truncation
|
| 136 |
+
|
| 137 |
+
Documents longer than `backend.max_doc_chars` (default 2048) are truncated. The service logs `rerank.docs_truncated` counter. Truncation is from the right; callers who care about specific spans should pre-summarise or chunk before passing in.
|
| 138 |
+
|
| 139 |
+
### 4.5 No streaming
|
| 140 |
+
|
| 141 |
+
`rerank.text@1.0` is non-streaming. Even at 100 docs the latency is well under 1s on GPU. If a Phase-3 use case demands streaming (e.g. 1000-doc reranks for academic search), introduce `rerank.text@2.0` with `progress` frames; do not retrofit v1.
|
| 142 |
+
|
| 143 |
+
### 4.6 Integration with RAG (M05 extension)
|
| 144 |
+
|
| 145 |
+
`rag.query` in Phase 2 grows an internal pipeline:
|
| 146 |
+
|
| 147 |
+
```
|
| 148 |
+
1. Hybrid retrieval (dense + BM25) → top 100 candidates
2. Optional call to rerank.text@1.0 → top 10
3. Pass top 10 to llm.chat as context
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
The hop to `rerank.text` is done via the bus, not via direct import. This keeps the policy ("which model?", "is reranking available?") in the service and out of the RAG core.
|
| 154 |
+
|
| 155 |
+
If `rerank.text@1.0` is unavailable in the local mesh, RAG falls back to dense scores alone and logs `rag.rerank_skipped` counter (not an error).
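
The fallback path can be sketched as below. `call_rerank` stands in for a bus call to `rerank.text@1.0`, and `CapabilityUnavailable` for whatever error the bus raises when no provider is reachable; both are hypothetical, as is the use of a `Counter` in place of the observability layer.

```python
import asyncio
from collections import Counter


class CapabilityUnavailable(Exception):
    """Hypothetical error raised when no rerank provider is reachable."""


metrics = Counter()  # stand-in for the observability counters


async def refine_candidates(call_rerank, query: str, candidates: list, top_k: int = 10) -> list:
    """candidates: (doc_id, text, dense_score) tuples from the retrieval step."""
    try:
        # Ask the reranker (via the bus in the real system) for the top_k ids.
        docs = [(doc_id, text) for doc_id, text, _ in candidates]
        return await call_rerank(query, docs, top_k)
    except CapabilityUnavailable:
        # Counted, not treated as an error: fall back to dense scores alone.
        metrics["rag.rerank_skipped"] += 1
        by_dense = sorted(candidates, key=lambda c: c[2], reverse=True)
        return [doc_id for doc_id, _, _ in by_dense[:top_k]]
```

Keeping the fallback inside the RAG caller means the answer path never fails just because the reranker is absent; only the ordering quality degrades.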
|
| 156 |
+
|
| 157 |
+
### 4.7 Integration with Marketplace (M06 extension)
|
| 158 |
+
|
| 159 |
+
`market.search` follows the same pattern when the query is natural-language. For tag-based queries it skips reranking.
|
| 160 |
+
|
| 161 |
+
---
|
| 162 |
+
|
| 163 |
+
## 5. Errors
|
| 164 |
+
|
| 165 |
+
| Code | Cause |
|------|-------|
| `bad_request` | `len(documents) > RERANK_MAX_DOCS`, empty query, malformed payload |
| `unavailable` | Backend loading or hardware unavailable |
| `model_not_found` | Requested `params.model` is not installed |
|
| 170 |
+
|
| 171 |
+
`unavailable` is retryable; the other two are not.
|
| 172 |
+
|
| 173 |
+
---
|
| 174 |
+
|
| 175 |
+
## 6. Configuration
|
| 176 |
+
|
| 177 |
+
```toml
|
| 178 |
+
[services.rerank]
|
| 179 |
+
enabled = true
|
| 180 |
+
default_model = "BAAI/bge-reranker-v2-m3"
|
| 181 |
+
device = "auto" # "auto" | "cuda" | "cpu"
|
| 182 |
+
max_batch = 32
|
| 183 |
+
max_doc_chars = 2048
|
| 184 |
+
load_timeout_seconds = 60
|
| 185 |
+
trust_required = "member"
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
Behind a feature flag: when `enabled=false`, the capability simply does not register and RAG falls back to dense-only.
|
| 189 |
+
|
| 190 |
+
---
|
| 191 |
+
|
| 192 |
+
## 7. Tests
|
| 193 |
+
|
| 194 |
+
### 7.1 Unit
|
| 195 |
+
- Sorting: scores `[0.1, 0.9, 0.5]` produce ranked order `[1, 2, 0]`
|
| 196 |
+
- Truncation: 4000-char doc gets truncated to 2048 before scoring
|
| 197 |
+
- `top_k` honoured; returns at most `top_k` results
|
| 198 |
+
- Bad request when `documents=[]` or `len > RERANK_MAX_DOCS`
|
| 199 |
+
|
| 200 |
+
### 7.2 Integration
|
| 201 |
+
- End-to-end: `rag.query` with reranking vs without, on the niederrhein-emergency corpus, asserts at least one expected document moves into top 3 with rerank that wasn't there without
|
| 202 |
+
- Cross-language: German query, mixed German/English candidates, BGE reranker should put the German candidate first when relevance is equal
|
| 203 |
+
|
| 204 |
+
### 7.3 Performance
|
| 205 |
+
- 100 docs @ 1024 chars: p50 ≤ 300ms on RTX 5090; p95 ≤ 600ms
- CPU fallback (no GPU): p50 ≤ 4s for 50 docs (acceptable; degraded)
|
| 207 |
+
|
| 208 |
+
### 7.4 Failure-mode
|
| 209 |
+
- Backend crash mid-call: caller receives `unavailable`; service self-heals on next call
|
| 210 |
+
- Concurrent calls: 20 parallel reranks should not deadlock; backend serialises behind a single semaphore
|
| 211 |
+
|
| 212 |
+
---
|
| 213 |
+
|
| 214 |
+
## 8. Cross-references
|
| 215 |
+
|
| 216 |
+
- Capability spec: [CAPABILITY_CONTRACT_v2 §4.15](../CAPABILITY_CONTRACT_v2.md#415-reranktext10)
|
| 217 |
+
- Used by: M05 RAG extension, M06 Marketplace extension
|
| 218 |
+
- Observability: emits `rerank.calls_total`, `rerank.latency_ms`, `rerank.docs_truncated`, `rerank.errors_total{code}`
|
| 219 |
+
|
| 220 |
+
---
|
| 221 |
+
|
| 222 |
+
## 9. Open questions
|
| 223 |
+
|
| 224 |
+
1. **Reciprocal rank fusion** with dense scores as the alternative when rerank is unavailable: worth implementing in M05 as the fallback path?
2. **ColBERT-style late interaction**: heavier model, higher quality. Worth a second backend, or wait for Phase 3 to evaluate?
3. **Reranker for code/diff content**: different model family (e.g. `BAAI/bge-code-reranker`). Should `params.model` selection be auto-inferred from query/doc content?
4. **Caching**: query+doc-pair hash → score, evict LRU. Worth it for repeated queries in chat-driven RAG sessions, or premature optimisation?
|
docs/p2_p3/M25-group-chat.md
ADDED
|
@@ -0,0 +1,293 @@
| 1 |
+
# M25 - Group Chat
|
| 2 |
+
|
| 3 |
+
**Spec version:** v2.0
|
| 4 |
+
**Depends on:** [M10 Chat 1:1](../../modules/M10-chat.md), [M23 E2E Encryption](M23-e2e-encryption.md), [M16 Capability Tokens](M16-tokens.md), [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md)
|
| 5 |
+
**Depended on by:** UI (web + M22 mobile), [M14 Federation](M14-federation.md) (cross-community threads)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Multi-party threaded conversations with the same guarantees as 1:1 chat: end-to-end encryption (optional but default on), event-log-anchored history, no central server required, members can come and go.
|
| 12 |
+
|
| 13 |
+
A thread is a long-lived object identified by a ULID. It has an authoritative member list maintained in the event log, an encryption "group session" (M23 sender keys), and message history. Threads do not currently support reactions, replies, threading-within-thread, or rich content; those are explicit non-goals for Phase 2 and may arrive in Phase 3 once usage informs design.
|
| 14 |
+
|
| 15 |
+
Group threads are the substrate the **Nachbarschaftshilfe** use case wants: a Sankt-Martins-Comité planning thread, a "Wer hat Werkzeug?" workshop thread, a household coordination thread between Christof, Jana, and grandparents.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. File layout
|
| 20 |
+
|
| 21 |
+
```
|
| 22 |
+
hearthnet/services/chat/
├── thread_service.py       # ThreadService: capability registration & dispatch
├── thread_views.py         # Materialised views: thread list, member list, history
├── thread_store.py         # Read-only projections; not the source of truth
├── group_session.py        # Wraps M23 sender keys for a thread
└── moderation.py           # Phase-2: remove-member, archive (minimal)
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
---
|
| 31 |
+
|
| 32 |
+
## 3. Public API
|
| 33 |
+
|
| 34 |
+
### 3.1 Dataclasses
|
| 35 |
+
|
| 36 |
+
```python
|
| 37 |
+
@dataclass(frozen=True)
class Thread:
    thread_id: ThreadID
    name: str
    created_at: datetime
    created_by: NodeID
    members: frozenset[NodeID]
    e2e_enabled: bool
    ratchet_root: str | None            # x25519 pubkey of group session root, None if cleartext
    archived: bool

@dataclass(frozen=True)
class ThreadMessage:
    event_id: EventID
    thread_id: ThreadID
    client_id: ClientID
    sender: NodeID
    sent_at: datetime
    body: str | None                    # cleartext if e2e_enabled=False
    encrypted: EncryptedPayload | None
    attachments: list[Attachment]
    delivered_to: frozenset[NodeID]     # tracked via chat.thread.message.delivered events
|
| 59 |
+
```
|
| 60 |
+
|
| 61 |
+
### 3.2 `ThreadService`
|
| 62 |
+
|
| 63 |
+
```python
|
| 64 |
+
class ThreadService:
    """Capability handlers for chat.thread.*"""

    def __init__(
        self,
        bus: CapabilityBus,
        event_log: EventLog,
        identity: Identity,
        encryption: EncryptionService,   # M23
        view_store: ThreadViewStore,
        observability: Observability,
    ): ...

    async def start(self) -> None:
        # Registers: chat.thread.create, .send, .history, .leave, .add_member, .archive
        ...

    # --- handlers (selected) ---
    async def create(self, body: CreateThreadBody) -> CreateThreadResult: ...
    async def send(self, body: SendThreadBody) -> SendThreadResult: ...
    async def history(self, body: HistoryBody) -> HistoryResult: ...
    async def leave(self, body: LeaveBody) -> LeaveResult: ...
    async def add_member(self, body: AddMemberBody) -> AddMemberResult: ...
    async def archive(self, body: ArchiveBody) -> ArchiveResult: ...
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
### 3.3 `ThreadViewStore`
|
| 91 |
+
|
| 92 |
+
```python
|
| 93 |
+
class ThreadViewStore:
    """Read model. Backed by SQLite; rebuilt from the event log on cold start."""

    def list_for_member(self, node_id: NodeID) -> list[Thread]: ...
    def get_thread(self, thread_id: ThreadID) -> Thread | None: ...
    def get_messages(self, thread_id: ThreadID, since_lamport: int = 0, limit: int = 200) -> list[ThreadMessage]: ...
    def members_of(self, thread_id: ThreadID) -> frozenset[NodeID]: ...

    # Internal: subscribed to the event log
    async def apply(self, event: Event) -> None: ...
|
| 103 |
+
```
|
| 104 |
+
|
| 105 |
+
### 3.4 `GroupSession`
|
| 106 |
+
|
| 107 |
+
Thin wrapper around M23 sender keys; one per thread.
|
| 108 |
+
|
| 109 |
+
```python
|
| 110 |
+
class GroupSession:
    def __init__(self, thread_id: ThreadID, ratchet: SenderKeyRatchet): ...

    def encrypt(self, plaintext: bytes) -> EncryptedPayload: ...
    def decrypt(self, sender: NodeID, payload: EncryptedPayload) -> bytes: ...
    def rekey(self) -> None: ...
    def add_member(self, new_member: NodeID, their_identity_pubkey: bytes) -> None: ...
    def remove_member(self, leaving_member: NodeID) -> None: ...
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
## 4. Behaviour
|
| 123 |
+
|
| 124 |
+
### 4.1 Thread creation
|
| 125 |
+
|
| 126 |
+
`chat.thread.create@1.0` flow:
|
| 127 |
+
|
| 128 |
+
1. Caller emits `chat.thread.created` event into the event log with:
   - `thread_id` (newly minted ULID)
   - initial member list
   - `e2e_enabled` flag
   - if e2e_enabled: a freshly generated `ratchet_root_pubkey` and a per-member encrypted **sender key** payload (see [M23 §6.3](M23-e2e-encryption.md))
2. Each member's node sees the event arrive, decrypts the sender key payload addressed to itself, and constructs the GroupSession.
3. The view store materialises a new Thread row.
|
| 135 |
+
|
| 136 |
+
If any member is offline at creation, they will receive the event when they next sync. Their GroupSession constructs lazily on first decrypt.
|
| 137 |
+
|
| 138 |
+
### 4.2 Sending
|
| 139 |
+
|
| 140 |
+
`chat.thread.send@1.0` flow:
|
| 141 |
+
|
| 142 |
+
1. Verify caller is in `Thread.members`.
|
| 143 |
+
2. If `e2e_enabled`, encrypt body with the GroupSession's current sender key. The ciphertext is opaque to the event log: even other community members who are not in the thread cannot read it.
|
| 144 |
+
3. Emit `chat.thread.message.sent` event.
|
| 145 |
+
4. The event reaches all members (regular event-log propagation, no thread-specific transport).
|
| 146 |
+
5. Each member's GroupSession decrypts; the message appears in their UI.
|
| 147 |
+
|
| 148 |
+
### 4.3 Membership changes
|
| 149 |
+
|
| 150 |
+
#### Adding a member
|
| 151 |
+
|
| 152 |
+
1. Any existing member can issue `chat.thread.add_member` (Phase 2; later phases may add policies like "only admin can add").
|
| 153 |
+
2. The caller's GroupSession is **rekeyed**: a new sender key is generated, encrypted under each existing member's pubkey and the new member's pubkey, and emitted in the `chat.thread.member.added` event.
|
| 154 |
+
3. The new member cannot read **prior** messages because they joined at the new epoch. (This is by design and standard for sender-key group encryption: forward-secrecy is preserved.) Old messages remain encrypted with the old sender key, which the new member never sees.
|
| 155 |
+
|
| 156 |
+
#### Removing a member
|
| 157 |
+
|
| 158 |
+
`chat.thread.remove_member` (or self-leave via `chat.thread.leave`):
|
| 159 |
+
|
| 160 |
+
1. Emit `chat.thread.member.removed`.
|
| 161 |
+
2. The remaining members rekey the GroupSession (similar to add but excluding the removed member). New messages are not readable by the removed member.
|
| 162 |
+
3. The removed member's UI marks the thread as "you left" and stops decrypting incoming messages. Their event log still contains old messages they can still read; they just can't read new ones.
|
| 163 |
+
|
| 164 |
+
### 4.4 History
|
| 165 |
+
|
| 166 |
+
`chat.thread.history@1.0`:
|
| 167 |
+
|
| 168 |
+
- **Self-only** capability (you can only ask for history of threads you're a member of).
|
| 169 |
+
- Returns from local view store. No cross-node query needed, since every member already has the events.
|
| 170 |
+
- Pagination by `since_lamport` + `limit`. Messages return in **logical (Lamport) order**, not wall-clock order, to match what other members will see.
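
Pagination by `since_lamport` and `limit` might look like the sketch below. It assumes `since_lamport` is exclusive and that ties on the Lamport value are broken by `event_id`; neither detail is stated in this spec, and `StoredMessage` is a hypothetical simplification of `ThreadMessage`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredMessage:
    event_id: str
    lamport: int
    body: str


def history_page(messages: list, since_lamport: int = 0, limit: int = 200) -> list:
    """Return up to `limit` messages with lamport > since_lamport, in Lamport order."""
    newer = [m for m in messages if m.lamport > since_lamport]
    # Assumed tie-break on event_id so that every member sees the same order.
    newer.sort(key=lambda m: (m.lamport, m.event_id))
    return newer[:limit]
```

A caller pages by passing the last returned `lamport` as the next `since_lamport`. If two messages share a Lamport value across a page boundary, this simple cursor would skip one of them, so a real implementation would likely need a composite cursor.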
|
| 171 |
+
|
| 172 |
+
### 4.5 Read-receipts / delivery tracking
|
| 173 |
+
|
| 174 |
+
Each member's node emits `chat.thread.message.delivered` (lightweight, no payload beyond `event_id` reference) when they materialise a message. UI shows "delivered to 4/5" by counting these events. Optional: `policy.chat.delivery_receipts_enabled` (default true) controls whether they're emitted.
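
Counting receipts for the UI label could be as simple as the following sketch. The receipt shape (`sender`, `event_id_ref`) is hypothetical, and so is the choice to count against the recipient set passed in by the caller; the spec does not say whether the sender is part of the denominator.

```python
def delivery_summary(recipients: frozenset, receipts: list, message_event_id: str) -> str:
    """Format 'delivered to X/Y' from receipt events that reference one message."""
    delivered = {
        r["sender"]
        for r in receipts
        if r["event_id_ref"] == message_event_id and r["sender"] in recipients
    }
    # A set is used so that duplicate receipts from one node count once.
    return f"delivered to {len(delivered)}/{len(recipients)}"
```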
|
| 175 |
+
|
| 176 |
+
### 4.6 Archiving
|
| 177 |
+
|
| 178 |
+
`chat.thread.archived` is a soft state. Archived threads are hidden from the default thread list, no longer rekey on membership change, and no longer accept sends. Members can still read history. An archived thread can be unarchived by any member.
|
| 179 |
+
|
| 180 |
+
There is no "delete thread". Events are immutable. A thread that is archived and whose messages are all expired (via X02 retention policies) becomes effectively gone.
|
| 181 |
+
|
| 182 |
+
### 4.7 Attachments
|
| 183 |
+
|
| 184 |
+
`attachments` carry `cid` (blob CID) and `name`. The blob itself is uploaded via `file.put` separately. Members of the thread are by definition authorised to fetch the blob; the bus enforces this via a capability token issued automatically when sending an attachment in an E2E thread:
|
| 185 |
+
|
| 186 |
+
```
|
| 187 |
+
On send-with-attachment:
  1. Service issues a short-lived (24h) token via M16 with:
       scope.capabilities = ["file.fetch@1.0"]
       scope.params_constraints.cid = [attachment.cid]
       audience = thread.members (excluding self)
  2. Token is included in the encrypted message body.
  3. Recipients use the token when fetching the blob from whichever node holds it.
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
This avoids the "file is restricted but everyone in the thread should access it" coordination problem.
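
The token contents described above can be sketched as a plain claims dictionary. The actual token format and signing belong to M16 and are not shown here; the field names `issuer`, `issued_at`, and `expires_at`, and the function name, are assumptions for illustration.

```python
import time
from typing import Any, Optional

ATTACHMENT_TOKEN_TTL_SECONDS = 24 * 60 * 60  # "short-lived (24h)"


def attachment_token_claims(
    sender: str, members: frozenset, cid: str, now: Optional[float] = None
) -> dict:
    """Build the unsigned claims for a per-attachment fetch token."""
    issued_at = time.time() if now is None else now
    claims: dict[str, Any] = {
        "issuer": sender,
        # Audience is every thread member except the sender.
        "audience": sorted(members - {sender}),
        "scope": {
            "capabilities": ["file.fetch@1.0"],
            "params_constraints": {"cid": [cid]},
        },
        "issued_at": issued_at,
        "expires_at": issued_at + ATTACHMENT_TOKEN_TTL_SECONDS,
    }
    return claims
```

Constraining the token to a single CID keeps the grant narrow: a recipient can fetch this attachment and nothing else the sender stores.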
|
| 197 |
+
|
| 198 |
+
### 4.8 Federation of threads
|
| 199 |
+
|
| 200 |
+
A thread MAY include members from federated communities. Mechanics:
|
| 201 |
+
|
| 202 |
+
- The thread's `community_id` (the one in event headers) is the *creator's* community.
|
| 203 |
+
- Members from federated communities subscribe to the thread's events via the standard federation event-bridge (see [M14 §6](M14-federation.md)).
- Federated members are full participants (they can send, leave, be removed), provided the federation manifest grants `chat.thread.send@1.0`.
|
| 205 |
+
- The view store on a federated member's node carries the foreign-community thread alongside their local-community threads, distinguished by `Thread.community_id` field for UI purposes.
|
| 206 |
+
|
| 207 |
+
If federation is revoked, foreign members are silently removed from the thread on the next rekey.
|
| 208 |
+
|
| 209 |
+
### 4.9 Throughput and limits
|
| 210 |
+
|
| 211 |
+
- `THREAD_MAX_MEMBERS = 200` (Phase-2 conservative; larger groups should be a different module).
|
| 212 |
+
- `THREAD_MAX_MESSAGE_BYTES = 64 * 1024` for the cleartext body.
|
| 213 |
+
- `THREAD_RATE_LIMIT_PER_SENDER_PER_MINUTE = 60` (anti-spam, enforced by ThreadService).
|
| 214 |
+
- Beyond these β `bad_request` or `too_many_requests`.
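
As a rough illustration of how the size and rate limits could be enforced, here is a minimal sliding-window sketch. `SendLimiter` is a hypothetical helper, not the ThreadService implementation, and mapping an oversized body to `bad_request` is an assumption; the text only names the two error codes.

```python
from collections import defaultdict, deque

THREAD_MAX_MESSAGE_BYTES = 64 * 1024
THREAD_RATE_LIMIT_PER_SENDER_PER_MINUTE = 60


class SendLimiter:
    """Hypothetical helper: sliding 60-second window per sender."""

    def __init__(self, limit=THREAD_RATE_LIMIT_PER_SENDER_PER_MINUTE, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._sent = defaultdict(deque)  # sender -> timestamps of accepted sends

    def check(self, sender: str, body: bytes, now: float) -> str:
        """Return "ok", "bad_request", or "too_many_requests"."""
        if len(body) > THREAD_MAX_MESSAGE_BYTES:
            return "bad_request"  # assumed mapping for oversized bodies
        window = self._sent[sender]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_s:
            window.popleft()
        if len(window) >= self.limit:
            return "too_many_requests"
        window.append(now)
        return "ok"
```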

---

## 5. Errors

| Code | Cause |
|------|-------|
| `bad_request` | Empty member list, malformed body, member list contains caller twice |
| `unauthorized` | Caller not a member of the thread (for send/history/leave/add) |
| `not_found` | `thread_id` unknown |
| `e2e_session_missing` | Caller has no GroupSession yet (sender keys not received) |
| `e2e_decrypt_failed` | Local key state corrupt; UI should prompt for a manual rekey |
| `too_many_requests` | Rate limit exceeded |
| `policy_violation` | E.g. trying to add a member outside of federation scope |

---

## 6. Configuration

```toml
[services.chat.thread]
enabled = true
max_members = 200
max_message_bytes = 65536
rate_limit_per_sender_per_minute = 60
delivery_receipts_enabled = true
allow_federated_members = true

[services.chat.thread.archival]
auto_archive_after_days_idle = 0   # 0 = never auto-archive
```

---

## 7. Tests

### 7.1 Unit
- Create thread with 3 members; verify GroupSession is constructable by each member from the `chat.thread.created` payload
- Send + decrypt round-trip
- Add member; old messages remain undecryptable for them, new ones work
- Remove member; their session can't decrypt new messages
- Self-leave; cleanup is graceful (no orphan state)
- History pagination: 1000 messages, fetch 200 + 200 + 200... covers all

### 7.2 Integration
- Three nodes on one LAN form a thread; messages propagate via gossip
- Same with one member partitioned; their replay on reconnect works
- E2E on/off threads coexist; switching one to the other is not supported (must create a new thread)
- Federation: a federated peer's node receives the `chat.thread.created` event via the bridge and constructs a working GroupSession

### 7.3 Adversarial
- A non-member tries to call `chat.thread.send` → `unauthorized`
- A non-member subscribes to `chat.thread.message.<id>` pubsub: receives encrypted blobs they can't decrypt (no information leak beyond traffic patterns and member list)
- Replay: an old `chat.thread.message.sent` event replayed by an IP-level adversary is rejected via the per-message nonce in the E2E header
- Rekey storm: 100 sequential add/remove operations finish within 30s on the dev rig; no deadlock

### 7.4 Performance
- 50-member thread, 1 msg/s: p95 deliver-to-decrypt latency < 500ms on LAN
- History fetch of 10,000 messages: < 2s on SSD

---

## 8. Cross-references

- Capability spec: [CAPABILITY_CONTRACT_v2 §4.16–4.19](../CAPABILITY_CONTRACT_v2.md)
- Encryption primitives: [M23 §6 sender keys](M23-e2e-encryption.md)
- Event types: [CAPABILITY_CONTRACT_v2 §7.1](../CAPABILITY_CONTRACT_v2.md#71-new-event-types)
- Federation: [M14 §6](M14-federation.md)

---

## 9. Open questions

1. **Reactions / replies / rich content**: explicitly out of Phase 2. Worth a survey of community use before designing. (Likely Phase 3 add-on, gated on "are people actually asking for it?")
2. **Per-thread retention policy**: currently inherits the community-wide retention. Different threads might want different policies (planning thread = 30 days, household chat = forever).
3. **Read-only threads** (announcements): pseudo-thread where only one member can send. Worth a flag or worth a dedicated capability?
4. **Thread search**: could plug into `rag.*`. Indexing of decrypted message text would be opt-in per thread; raises privacy concerns.
5. **Cross-thread mentions / linking**: e.g. "see thread X for context". Probably as a UI affordance (markdown link), not a protocol feature.
6. **Disappearing messages**: Signal-style auto-expiry per-thread. Useful for sensitive coordination; adds complexity. Phase 3 candidate.

docs/p2_p3/M26-distributed-inference.md
ADDED
@@ -0,0 +1,325 @@
# M26 - Distributed Inference

**Spec version:** v3.0 (*experimental*)
**Depends on:** [X08 Tensor Transport](../cross-cutting/X08-tensor-transport.md), [X06 WebSocket](../../phase-2/cross-cutting/X06-websocket.md), [M04 LLM](../../modules/M04-llm.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [X03 Observability](../../cross-cutting/X03-observability.md)
**Depended on by:** Optional `experimental.distributed_llm.chat` backend in M04

---

## 1. Responsibility

Run a single LLM forward pass across multiple machines on the same LAN (or, in extremely careful setups, a single federation). Take a 7B model that doesn't fit on any one anchor's GPU and split it: anchor A holds layers 0–7, anchor B holds 8–15, anchor C holds 16–23, etc. A request orchestrator chains them and streams tokens back to the user.

This is a **research module**. It exists for two reasons:

1. **Resilience.** When a community's biggest GPU breaks, the next-biggest fleet of GPUs can still serve mid-sized models cooperatively.
2. **Reach.** A community of three households, each with consumer hardware, can collectively run a model none of them could run alone.

It is explicitly **not** for serving production user-facing LLM traffic at scale. The latency is worse than local inference (typically 2–4× per token), the orchestration is fragile (one shard offline = retry the whole pipeline), and the GPU memory savings come at significant complexity cost. Communities should default to local inference; this module exists for the cases where local isn't enough.

---

## 2. Non-goals (loud and clear)

- **Large models.** 70B-class models are out of scope. The math says you'd need ten 24 GB GPUs to host one, which is the wrong problem for a neighbourhood mesh to solve.
- **Cross-WAN sharding.** Inference across the public internet is uneconomical (latency, bandwidth). Limit to same LAN or same-VPN federation.
- **Heterogeneous shards across model versions.** All shards in a pipeline must serve the **exact same model and weights checksum**. No partial-model recovery.
- **Replacing local inference.** When `policy.research.enable = false`, this module is inert.

---

## 3. File layout

```
hearthnet/distributed_inference/
├── __init__.py
├── shard.py                     # Shard, ShardDescriptor, ShardServer
├── pipeline.py                  # Pipeline, PipelineOrchestrator
├── routing.py                   # Picks a set of shards that cover [0..N] layers
├── health.py                    # Heartbeats, failover detection
└── backends/
    ├── base.py
    ├── petals_like.py           # uses bigscience/petals client/server primitives
    └── small_model_layered.py   # custom impl for small models (≤ 3B)
```

---

## 4. Public API

### 4.1 Dataclasses

```python
from dataclasses import dataclass
from datetime import datetime

# ShardID, NodeID, and Endpoint are HearthNet types assumed to be defined elsewhere.

@dataclass(frozen=True)
class ShardDescriptor:
    shard_id: ShardID                  # "<model_id>:<lo>-<hi>"
    model_id: str                      # HF model id
    weights_sha256: str                # full model weights hash; shards must match
    layer_range: tuple[int, int]       # inclusive
    vram_required_mb: int
    max_concurrent_streams: int
    host: NodeID
    endpoint: Endpoint                 # ws://...
    advertised_at: datetime

@dataclass
class Pipeline:
    pipeline_id: str
    model_id: str
    weights_sha256: str
    total_layers: int
    ordered_shards: list[ShardDescriptor]
    established_at: datetime

@dataclass
class ShardHealth:
    shard_id: ShardID
    online: bool
    last_seen: datetime
    p95_latency_ms: float
    queue_depth: int
```

### 4.2 `ShardServer`

```python
class ShardServer:
    """Hosts one contiguous shard. Loaded on demand; lazy-evictable under memory pressure."""

    def __init__(self, descriptor: ShardDescriptor, model_loader: ModelLoader, settings: ShardSettings): ...

    async def start(self) -> None:
        # Load weights for the layer range; register `experimental.distributed_llm.shard.serve` on the bus
        ...

    async def forward(self, activations_in: TensorChunkStream) -> TensorChunkStream:
        """The hot path. Receives activations, runs layers, emits activations."""
        ...

    async def health(self) -> ShardHealth: ...

    async def evict(self) -> None:
        """Free VRAM; triggered by host memory manager."""
        ...
```

### 4.3 `PipelineOrchestrator`

```python
class PipelineOrchestrator:
    """
    Chooses shards to cover the model's layers, opens streams to each, and
    pumps activations through them in order. Handles failover.
    """

    def __init__(
        self,
        bus: CapabilityBus,
        router: ShardRouter,
        health: ShardHealthTracker,
        observability: Observability,
    ): ...

    async def chat(
        self,
        request: LlmChatRequest,
        params: DistributedChatParams,
    ) -> AsyncIterator[StreamFrame]:
        # 1. Resolve a Pipeline covering all layers of the target model
        # 2. Open WS streams to each shard via X08 tensor transport
        # 3. For each token step:
        #      embedding → shard 0 → shard 1 → ... → shard N → token sample
        # 4. Yield `token_delta` frames; emit `shard_status` and `shard_failover` diagnostics
        # 5. On any shard failure, attempt re-routing once; if that fails and
        #    `params.fallback_to_local`, fall back to local inference and emit a
        #    `pipeline_aborted` frame
        ...
```

### 4.4 `ShardRouter`

```python
class ShardRouter:
    """
    Given a model_id and an `experimental.shard.advertised` event stream,
    pick a covering set of shards minimising:
      - total network hops
      - max per-shard queue depth
      - chance of overlap with the caller's own GPU (avoid self-as-shard)
    """

    def __init__(self, store: ShardStore, settings: RoutingSettings): ...

    async def pick(self, model_id: str, weights_sha256: str) -> Pipeline: ...
    async def repick(self, pipeline: Pipeline, exclude: set[ShardID]) -> Pipeline: ...
```

---

## 5. Behaviour

### 5.1 Shard advertisement and discovery

A node hosting a shard emits `experimental.shard.advertised` events into the community event log. The event carries `ShardDescriptor` fields plus a timestamp. Advertisements expire after `DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S * 4` (default 120s); shard hosts must re-advertise via heartbeat.

When a node opts out (`policy.research.enable=false`), it does not emit advertisements. Existing advertisements expire normally.

The shard store is a local read model built from these events, indexed by `(model_id, weights_sha256, layer_range)`.
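
A minimal sketch of such a read model, assuming advertisements arrive as plain dicts carrying the `ShardDescriptor` field names. `build_shard_index` is a hypothetical helper, and keeping only the newest advertisement per host is an assumption rather than something the spec states.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S = 30
ADVERT_TTL = timedelta(seconds=DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S * 4)  # 120s


def build_shard_index(adverts: list, now: datetime) -> dict:
    """Index unexpired adverts by (model_id, weights_sha256, layer_range)."""
    latest = {}
    for ad in adverts:
        if now - ad["advertised_at"] > ADVERT_TTL:
            continue  # expired: the host stopped heartbeating
        key = (ad["model_id"], ad["weights_sha256"], ad["layer_range"], ad["host"])
        # Keep only the newest advertisement per host (assumption).
        if key not in latest or ad["advertised_at"] > latest[key]["advertised_at"]:
            latest[key] = ad
    index = defaultdict(list)
    for (model_id, sha, layers, _host), ad in latest.items():
        index[(model_id, sha, layers)].append(ad)
    return dict(index)
```

Rebuilding the index from the event stream on each query is wasteful; a real store would presumably update it incrementally as events arrive.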

### 5.2 Pipeline construction

`ShardRouter.pick`:

1. Filter advertisements to those matching `model_id` and `weights_sha256`.
2. Greedy cover: starting from layer 0, pick the shard with the lowest queue depth that includes the next uncovered layer; advance the cursor; repeat. Return failure if any layer is uncoverable.
3. Prefer shards on the same LAN if possible (LAN advertisements have a lower "hop weight" metric attached by Discovery).
4. Avoid sharding to **self** as the first shard; embedding and sampling should stay on the orchestrator.

Constructed pipelines are not persisted; they're per-call.
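
The greedy cover in step 2 can be sketched as follows. `Advert` is a hypothetical, trimmed-down stand-in for `ShardDescriptor` plus health data, and the tie-break (prefer the shard that reaches furthest) is an assumption; the spec only says "lowest queue depth". Hop weight and self-avoidance (steps 3 and 4) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advert:
    """Hypothetical stand-in for ShardDescriptor plus its current queue depth."""
    shard_id: str
    lo: int           # first layer held (inclusive)
    hi: int           # last layer held (inclusive)
    queue_depth: int


def greedy_cover(adverts, total_layers):
    """Return shards covering layers 0..total_layers-1, or None if a layer is uncoverable."""
    chosen = []
    cursor = 0  # next uncovered layer
    while cursor < total_layers:
        candidates = [a for a in adverts if a.lo <= cursor <= a.hi]
        if not candidates:
            return None
        # Lowest queue depth wins; ties go to the shard reaching furthest (assumption).
        best = min(candidates, key=lambda a: (a.queue_depth, -a.hi))
        chosen.append(best)
        cursor = best.hi + 1
    return chosen
```

With shards 0-7, 4-11, and 8-15 at equal queue depth, this picks 0-7 and then 8-15, skipping the overlapping shard. If an overlapping shard were picked, it would have to run only its uncovered layers, which this sketch does not model.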

### 5.3 Forward pass

Per token:

```
[orchestrator] embedding → [shard 0] layers 0..7 → [shard 1] layers 8..15 → ... → [orchestrator] sample
```

Activations flow as **fp16 tensors** by default (configurable to fp32 for debugging). Each hop is a WebSocket binary frame stream (see [X08](../cross-cutting/X08-tensor-transport.md)). The orchestrator interleaves token N and token N+1: as soon as shard 0 finishes token N, the orchestrator pushes token N+1's embedding into shard 0 while shard 1 is still processing N. With this pipeline parallelism, the steady-state time per token approaches the latency of the slowest shard.

### 5.4 Failure handling

If a shard's stream errors or stalls past `DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S`:

1. The orchestrator emits a `shard_status` frame with `status:"degraded"`.
2. It calls `router.repick(pipeline, exclude={failed_shard_id})`.
3. If repick succeeds, it opens a fresh stream to the replacement and emits a `shard_failover` frame. **In-flight tokens are restarted** (no mid-token recovery).
4. If repick fails and `params.fallback_to_local` is set, the orchestrator silently restarts the call as a local-only `llm.chat@2.0` against any local model that matches.
5. Otherwise, it emits a `pipeline_aborted` frame and returns `shard_unavailable`.

`DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES` (default 2): if failover happens that many times in one call, give up and fall back to local.
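
One way to read the decision logic above is as a small pure function. `next_action` is a hypothetical name, and two points are assumptions: the failure that would trigger the Nth failover triggers the fallback instead, and a call that hits the limit with `fallback_to_local` unset is aborted.

```python
DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES = 2


def next_action(failures_so_far: int, repick_ok: bool, fallback_to_local: bool) -> str:
    """Decide what to do after one more shard failure in the current call.

    Returns "failover", "fallback_to_local", or "abort".
    """
    failures = failures_so_far + 1
    give_up = failures >= DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES
    if repick_ok and not give_up:
        return "failover"            # steps 2-3: replacement shard found
    if fallback_to_local:
        return "fallback_to_local"   # step 4: restart as local-only llm.chat
    return "abort"                   # step 5: pipeline_aborted / shard_unavailable
```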

### 5.5 Streaming and backpressure

Tensor-chunk streams use a window of `TENSOR_FLOW_CONTROL_WINDOW` chunks (default 16). Each chunk is at most `TENSOR_CHUNK_BYTES` (1 MB). If the downstream shard's send queue fills, the orchestrator pauses upstream until ACKs drain. See [X08 §4](../cross-cutting/X08-tensor-transport.md).
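
The windowing idea can be illustrated with an `asyncio.Semaphore`. This is a sketch under assumptions, not the X08 wire protocol: `WindowedSender` and `split_chunks` are hypothetical names, and 1 MB is taken as 1024 * 1024 bytes.

```python
import asyncio

TENSOR_FLOW_CONTROL_WINDOW = 16
TENSOR_CHUNK_BYTES = 1024 * 1024  # assuming "1 MB" means 1 MiB


def split_chunks(payload: bytes, chunk_bytes: int = TENSOR_CHUNK_BYTES) -> list:
    """Cut a serialized tensor into chunks of at most chunk_bytes."""
    return [payload[i:i + chunk_bytes] for i in range(0, len(payload), chunk_bytes)]


class WindowedSender:
    """Hypothetical sender: at most `window` chunks may be unacknowledged."""

    def __init__(self, send, window: int = TENSOR_FLOW_CONTROL_WINDOW):
        self._send = send                        # async callable taking one chunk
        self._slots = asyncio.Semaphore(window)  # one slot per unacknowledged chunk

    async def send_chunk(self, chunk: bytes) -> None:
        await self._slots.acquire()  # blocks while the window is full
        await self._send(chunk)

    def on_ack(self) -> None:
        self._slots.release()        # an ACK frees one slot
```

A sender blocked in `send_chunk` is exactly the "pause upstream" behaviour: nothing more is read from the previous shard until the downstream one acknowledges.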

### 5.6 Concurrency

A shard's `max_concurrent_streams` is honoured strictly. If the orchestrator's call would exceed it, the orchestrator picks a different shard (via `router.repick`) rather than queuing.

A shard's GPU memory budget is enforced by the shard host's own resource manager; a shard exceeding its budget gets evicted and re-advertises with `vram_required_mb` updated next time it loads.

### 5.7 Models supported

Phase 3 launches with two backend choices:

| Backend | Models | Notes |
|---------|--------|-------|
| `small_model_layered` | Qwen2.5-{1.5B,3B,7B}, Llama-3.2-{1B,3B}, MiniCPM-3 | Custom HearthNet impl; PyTorch model surgery to expose per-layer forward |
| `petals_like` | (vendored from BigScience Petals) | Optional; only if user installs `hearthnet[petals]` extra |

The `small_model_layered` backend handles models up to roughly 7B parameters cleanly; beyond that the activation transport becomes the bottleneck.

### 5.8 Security boundary

A shard host receives activation tensors, which **leak training data residue**. Treat activations as sensitive: do not log them, do not persist them, and do not retain them past the forward pass. Each call is authenticated with a per-call signature; the caller's identity is recorded in metrics but not in logs of tensor contents.

A malicious shard could degrade outputs subtly. Detection is hard in general; the orchestrator does **basic sanity checks** (norm bounds, NaN/Inf detection) but cannot detect adversarial corruption. Communities should only enable distributed inference among members they trust as much as they trust the LLM service operator.
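
A minimal sketch of the kind of sanity check described, in plain Python for clarity. `check_activations` and the `max_l2_norm` threshold are hypothetical; a real implementation would run on the tensor library's own types and use a bound calibrated per model.

```python
import math


def check_activations(values, max_l2_norm: float):
    """Return None if `values` look sane, else a short reason string."""
    sq_sum = 0.0
    for v in values:
        if not math.isfinite(v):      # catches NaN, +Inf, and -Inf
            return "non_finite"
        sq_sum += v * v
    if math.sqrt(sq_sum) > max_l2_norm:
        return "norm_out_of_bounds"
    return None
```

As the text says, this only catches gross corruption: a shard that returns finite, plausibly scaled but wrong activations passes both checks.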

### 5.9 Privacy threat surface

A shard sees the activations of every request routed through it. With effort, a shard host can reconstruct approximate input text (especially the prompt) from activations of intermediate layers. This is **a real concern, not a theoretical one**.

Mitigations (none perfect):
- Restrict participation to members at trust level `trusted` or higher.
- Mix activations with a small amount of noise at the orchestrator (research; not yet implemented).
- Use this module only for queries the requester would already trust the community with.

### 5.10 Observability

Per call, emit:
- `distributed_inference.pipeline_construct_ms`
- `distributed_inference.first_token_ms`
- `distributed_inference.tokens_per_second`
- `distributed_inference.shard_latency_ms{shard_id}` histograms
- `distributed_inference.failovers_total`
- `distributed_inference.fallback_to_local_total`

---

## 6. Errors

| Code | Cause |
|------|-------|
| `experimental_disabled` | `policy.research.enable=false` |
| `shard_unavailable` | No shard covers a required layer range, or all candidates are at max concurrency |
| `pipeline_stalled` | No progress within timeout |
| `weights_mismatch` | A shard's advertised `weights_sha256` differs from requested |
| `bad_request` | Unknown model, malformed pipeline params |

---

## 7. Configuration

```toml
[research.distributed_inference]
enabled = false
backend = "small_model_layered"
max_shards_per_request = 16
shard_health_timeout_seconds = 30
fallback_to_local = true
activation_dtype = "fp16"     # "fp16" | "fp32"
allow_self_as_shard = false
max_concurrent_pipelines = 4

[research.distributed_inference.host]
serve_shards = false
shard_eviction_idle_seconds = 600
shard_max_vram_mb = 20000
```

---

## 8. Tests

### 8.1 Unit
- ShardRouter cover algorithm: 16-layer model + 3 advertised shards (0-7, 4-11, 8-15) → picks {0-7, 8-15}, ignores overlap shard
- Sanity bounds on activations: NaN injection triggers `pipeline_stalled` (via failed health check on subsequent chunk)
- Pipeline construction with weights mismatch → `weights_mismatch`

### 8.2 Integration (LAN)
- Two-node setup, 1.5B model split as 0-7 / 8-15; happy-path tokens/sec measured; baseline single-machine inference also measured; ratio reported (expect 0.4–0.6× local)
- Shard host kill mid-stream; failover to a third node; total call still succeeds; latency penalty bounded
- Concurrent two-pipeline test on three nodes; no deadlock; per-call latency degrades < 2×

### 8.3 Adversarial
- Malicious shard returns garbage activations: orchestrator's NaN/Inf detector catches the call; metric `distributed_inference.shard_corruption_detected_total` increments; pipeline aborts
- Slowloris shard (returns one chunk per second): `pipeline_stalled` after timeout; failover succeeds

### 8.4 Performance budget
- 3B model, 2-shard pipeline, RTX 5090 + RTX 4090: ≥ 8 tokens/sec sustained
- First-token latency ≤ 800ms
- Construction-to-first-byte ≤ 500ms
- Tensor-chunk overhead per hop ≤ 25ms p95

---

## 9. Cross-references

- Capability spec: [CAPABILITY_CONTRACT_v3 §4.1–4.3](../CAPABILITY_CONTRACT_v3.md)
- Tensor transport: [X08](../cross-cutting/X08-tensor-transport.md)
- Base LLM service: [M04](../../modules/M04-llm.md)
- Trust levels: [M01](../../modules/M01-identity.md)

---

## 10. Open research questions

1. **Activation privacy.** Can we add fast-to-compute noise that preserves inference accuracy but defeats activation-inversion attacks? Cite the Geiping et al. inversion paper as the threat baseline.
2. **Mid-token recovery.** Currently a shard failure restarts the in-flight token. Could we use micro-checkpointing (every K tokens) to recover without a restart? Latency cost?
3. **Heterogeneous shards.** Could a 4090 host the early layers (heavier compute per layer) and a 3060 the later ones, while remaining balanced? Probably yes; automated load assignment is the research question.
4. **Async pipeline.** Currently the orchestrator interleaves at the token level. Could it interleave at the layer level (one shard processes token N+2 while another processes N+1) for higher throughput? In theory yes; coordination protocol unclear.
5. **Mixed local + distributed.** When the orchestrator could host some layers itself (it has a GPU), should it? When? Currently `allow_self_as_shard=false`. A heuristic that considers compute headroom would be richer.
6. **Adversarial detection.** Beyond NaN/Inf, can we cheaply detect activation tampering by comparing to a small "shadow inference" on a tiny model? Cost vs. benefit unclear.
7. **Pricing / incentive.** A shard host pays in GPU time. A community-internal token-economy is explicitly out of scope (00-OVERVIEW §8). But a *reputational* signal ("this anchor served 4000 shard-tokens this week") could be helpful. Should it be a metric?
8. **Backend strategy.** `petals_like` vs `small_model_layered`: which delivers better quality / latency / robustness for our target models? An honest A/B is the answer.

docs/p2_p3/M27-moe-routing.md
ADDED
@@ -0,0 +1,378 @@
# M27 - MoE Expert Routing

**Spec version:** v3.0 (*experimental*)
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M04 LLM](../../modules/M04-llm.md), [M10 Chat](../../modules/M10-chat.md), [M25 Group Chat](../../phase-2/modules/M25-group-chat.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md)
**Depended on by:** Optional routing layer in M03 Bus dispatcher

---

## 1. Responsibility

"MoE" here means **Mixture of Experts** in a generalised sense: given a question, route to the best expert. The experts can be:

- a **model** running locally on some node ("Llama-3.2-3B is good at code"),
- a **service** capability ("the niederrhein-emergency RAG corpus knows this"),
- a **human** with declared expertise ("Maria in Issum has organised Sankt Martins for 20 years"),
- an **external** API or another community via federation.

The router takes a query summary plus tags, asks "which expert would do this best?", and returns the top-K candidates with scores. The caller chooses one (or the system chooses automatically, subject to a confidence threshold) and the chosen expert serves the request.
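
The top-K and auto-select steps can be sketched in a few lines. `top_k` and `auto_select` are hypothetical helpers operating on plain `(expert_id, score)` pairs; the deterministic tie-break on `expert_id` is an assumption, not part of the spec.

```python
def top_k(scored, k: int):
    """Return the k highest-scoring (expert_id, score) pairs, best first.

    Ties are broken by expert_id so the result is deterministic (assumption).
    """
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


def auto_select(candidates, confidence_threshold: float):
    """Return the best expert_id if its score clears the threshold, else None.

    None means the system is not confident enough and the caller should choose.
    """
    if candidates and candidates[0][1] >= confidence_threshold:
        return candidates[0][0]
    return None
```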

This module is a research bet that **knowing-who-to-ask is more valuable than scaling-the-model**. A 3B model that knows when to defer to a neighbour with first-hand knowledge will outperform a 70B model that has to confabulate. Whether that bet pays off is exactly what the research is for.

---

## 2. Non-goals

- **Auction-style routing.** Experts do not bid; there is no money flowing. Routing is by score, not price.
- **Mandatory routing.** The router is a *recommender*. The caller can always choose to run a query against a default LLM. MoE routing is opt-in per capability or per call.
- **Replacing RAG.** RAG still does retrieval inside a corpus. The router decides *which corpus* (or which non-RAG expert); that is a different layer.
- **Routing without consent.** A human expert never gets pinged unless they have explicitly registered availability for the topic.

---

## 3. File layout

```
hearthnet/moe/
├── __init__.py
├── router.py              # MoeRouter: the capability handler
├── scorer.py              # Learned scoring model; rule-based fallback
├── expert_registry.py     # Tracks registered experts and their declared topics
├── human_in_the_loop.py   # Coordinates handoff to a human; manages timeouts
└── feedback.py            # Records routing outcomes to train the scorer
```

---

## 4. Public API

### 4.1 Dataclasses

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass(frozen=True)
class ExpertDescriptor:
    expert_id: ExpertID                # "human:<NodeID>" | "model:<id>" | "service:<cap_name>" | "external:<url>"
    kind: ExpertKind
    topics: frozenset[str]
    capabilities: frozenset[str]       # for kind=service: which bus capability to invoke
    availability: AvailabilityWindow
    consent_to_route: bool
    registered_at: datetime
    expires_at: datetime | None
    # Defaulted fields must come last in a dataclass.
    score_bias: float = 0.0            # operator nudge; positive favours, negative disfavours

@dataclass
class RouteCandidate:
    expert_id: ExpertID
    kind: ExpertKind
    score: float
    expected_latency_minutes: int      # in minutes; humans typically take hours, models seconds
    rationale: str
    name: str | None                   # for display

@dataclass
class RouteResult:
    candidates: list[RouteCandidate]
    rationale: str
    routed_at: datetime

@dataclass
class Handoff:
    handoff_id: str
    expert_id: ExpertID
    context_summary: str
    initiated_at: datetime
    deadline_at: datetime
    state: Literal["pending", "accepted", "declined", "completed", "timed_out"]
    thread_id: ThreadID | None
```
|
| 89 |
+
|
| 90 |
+
### 4.2 `MoeRouter`
|
| 91 |
+
|
| 92 |
+
```python
|
| 93 |
+
class MoeRouter:
    """Capability handler for experimental.moe.route@1.0"""

    def __init__(
        self,
        bus: CapabilityBus,
        registry: ExpertRegistry,
        scorer: Scorer,
        feedback: FeedbackStore,
        settings: MoeSettings,
        observability: Observability,
    ): ...

    async def start(self) -> None: ...

    async def route(self, body: RouteBody) -> RouteResult: ...
    async def handoff(self, body: HandoffBody) -> Handoff: ...
    async def feedback_outcome(self, handoff_id: str, outcome: Outcome) -> None: ...
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
### 4.3 `Scorer`
|
| 114 |
+
|
| 115 |
+
```python
|
| 116 |
+
from typing import Protocol


class Scorer(Protocol):
    name: str

    def score(
        self,
        request_summary: str,
        tags: list[str],
        candidates: list[ExpertDescriptor],
        context: ScoringContext,
    ) -> list[float]:
        """Return one score per candidate, same order."""
        ...


class RuleBasedScorer:
    """The default scorer. Pure rules: tag overlap, recency of expert activity, availability, bias. No ML."""
    ...


class LearnedScorer:
    """A small classifier trained on feedback outcomes. Off by default; activated once
    MOE_ROUTER_TRAIN_MIN_EXAMPLES historical handoffs with outcomes are recorded."""
    ...
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
### 4.4 `ExpertRegistry`
|
| 140 |
+
|
| 141 |
+
```python
|
| 142 |
+
class ExpertRegistry:
    """
    Materialised view of `experimental.moe.expert.registered` events.
    Indexed by topic for fast routing.
    """

    def __init__(self, event_log: EventLog): ...

    def by_topic(self, tag: str) -> list[ExpertDescriptor]: ...
    def by_id(self, expert_id: ExpertID) -> ExpertDescriptor | None: ...
    def available_now(self) -> list[ExpertDescriptor]: ...
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
### 4.5 `HumanInTheLoopCoordinator`
|
| 156 |
+
|
| 157 |
+
```python
|
| 158 |
+
class HumanInTheLoopCoordinator:
    """
    Manages handoff-to-human flows.
    - Sends a chat invitation (M25 thread) to the chosen human.
    - Awaits acceptance within HANDOFF_RESPONSE_DEADLINE_MINUTES.
    - Falls back to next-best candidate if declined or timed out.
    - Stores audit trail (who routed where, why, outcome).
    """

    def __init__(self, bus: CapabilityBus, thread_service: ThreadService, registry: ExpertRegistry, settings: HitlSettings): ...

    async def initiate(self, handoff: Handoff) -> None: ...
    async def on_response(self, handoff_id: str, accepted: bool) -> None: ...
    async def on_completion(self, handoff_id: str, outcome: Outcome) -> None: ...
|
| 172 |
+
```
|
| 173 |
+
|
| 174 |
+
---
|
| 175 |
+
|
| 176 |
+
## 5. Behaviour
|
| 177 |
+
|
| 178 |
+
### 5.1 Expert registration
|
| 179 |
+
|
| 180 |
+
A node calls `experimental.moe.expert.register@1.0` to register an expert. For `kind="human"`:
|
| 181 |
+
|
| 182 |
+
- The caller must be the human in question (self-registration) OR an anchor registering on behalf of a member with an explicit consent token.
|
| 183 |
+
- `consent_to_route=true` is mandatory; humans without it are silently excluded from routing.
|
| 184 |
+
- Topics are free-form strings, lowercased, kebab-case. The registry will compute embeddings of topics so that `"sankt_martins"` and `"sankt martins"` match.
|
| 185 |
+
|
| 186 |
+
For `kind="model"` or `kind="service"`:
|
| 187 |
+
|
| 188 |
+
- The node hosting the model/service self-registers.
|
| 189 |
+
- Topics describe what the model is good at, e.g. `["code","python"]` or `["niederrhein-history","local-genealogy"]`.
|
| 190 |
+
|
| 191 |
+
For `kind="external"`:
|
| 192 |
+
|
| 193 |
+
- An anchor registers a third-party endpoint (HF Inference, OpenAI, Anthropic, another HearthNet community via federation).
|
| 194 |
+
- External experts are not consulted unless `policy.research.moe_allow_external=true`.
|
| 195 |
+
|
| 196 |
+
### 5.2 Routing flow
|
| 197 |
+
|
| 198 |
+
`experimental.moe.route@1.0`:
|
| 199 |
+
|
| 200 |
+
1. Caller submits `{request_summary, tags, top_k}`.
|
| 201 |
+
2. The registry returns all currently-available experts whose topics overlap `tags` or whose semantic similarity to `request_summary` is above `MOE_TOPIC_SIMILARITY_THRESHOLD` (default 0.55).
|
| 202 |
+
3. Scorer scores each candidate.
|
| 203 |
+
4. Apply score biases (per-expert operator nudges, per-community policy).
|
| 204 |
+
5. Sort descending; return top-K with rationales.
|
| 205 |
+
6. The caller decides whether to:
|
| 206 |
+
- Route automatically (if top score ≥ `MOE_AUTO_ROUTE_THRESHOLD`, default 0.85),
|
| 207 |
+
- Present the user with the candidate list to choose,
|
| 208 |
+
- Or fall back to default LLM.
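
Steps 2 to 5 above can be sketched as a small pure function. This is a minimal illustration rather than the module's implementation: `Expert`, `overlap_score`, and `route_top_k` are hypothetical names, a plain Jaccard tag overlap stands in for the `Scorer`, semantic similarity is omitted, and the consent filter is applied to every expert kind for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    """Hypothetical, trimmed-down stand-in for ExpertDescriptor."""
    expert_id: str
    topics: frozenset[str]
    consent_to_route: bool = True
    score_bias: float = 0.0


def overlap_score(tags: set[str], expert: Expert) -> float:
    """Jaccard overlap between request tags and expert topics (0.0 to 1.0)."""
    union = tags | expert.topics
    return len(tags & expert.topics) / len(union) if union else 0.0


def route_top_k(tags: set[str], experts: list[Expert], top_k: int = 3) -> list[tuple[str, float]]:
    """Filter by consent and tag overlap, score, apply bias, sort descending."""
    scored = []
    for expert in experts:
        if not expert.consent_to_route or not (tags & expert.topics):
            continue  # step 2: keep only consenting experts whose topics overlap
        base = overlap_score(tags, expert)                      # step 3
        biased = min(1.0, max(0.0, base + expert.score_bias))   # step 4, clamped to [0, 1]
        scored.append((expert.expert_id, biased))
    scored.sort(key=lambda pair: pair[1], reverse=True)         # step 5
    return scored[:top_k]


experts = [
    Expert("human:maria", frozenset({"sankt-martins", "local-history"})),
    Expert("model:general", frozenset({"code", "python"})),
    Expert("human:lukas", frozenset({"sankt-martins"}), consent_to_route=False),
]
ranking = route_top_k({"sankt-martins"}, experts)
# Step 6 is the caller's decision; 0.85 is the documented default threshold.
auto_route = bool(ranking) and ranking[0][1] >= 0.85
```

In this toy run only one expert survives the filter, and its score stays below the auto-route threshold, so a caller would present the candidate list or fall back to the default LLM.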
|
| 209 |
+
|
| 210 |
+
### 5.3 Handoff to a human expert
|
| 211 |
+
|
| 212 |
+
`experimental.moe.expert.handoff@1.0`:
|
| 213 |
+
|
| 214 |
+
1. The coordinator creates a new E2E group thread (M25) with the requester + chosen human.
|
| 215 |
+
2. The requester's question (or its summary) is posted to the thread.
|
| 216 |
+
3. The chosen human receives a notification.
|
| 217 |
+
4. If the human accepts within `HANDOFF_RESPONSE_DEADLINE_MINUTES` (default 60, configurable), the thread proceeds normally.
|
| 218 |
+
5. If declined or timed out: the coordinator silently picks the next candidate (or falls back to a model). The requester is informed once total wait exceeds `HANDOFF_WAIT_BUDGET_MINUTES` (default 30).
|
| 219 |
+
|
| 220 |
+
The human's UI shows handoffs as a low-priority "questions for you" inbox; they are not interruptive notifications by default. Policy can flip this for community types where instant response matters (e.g. civdef pilot, M31).
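
The fallback in step 5 amounts to walking the ranked candidates until one accepts. A minimal synchronous sketch follows; `first_accepting` and the `ask` callback are hypothetical, and a real coordinator would await responses asynchronously and enforce the deadlines.

```python
from typing import Callable, Optional


def first_accepting(
    candidates: list[str],
    ask: Callable[[str], Optional[bool]],
) -> tuple[Optional[str], list[tuple[str, str]]]:
    """Walk candidates in ranked order until one accepts.

    `ask` returns True (accepted), False (declined), or None (timed out).
    Returns the accepting expert (or None) plus an audit trail of states.
    """
    states = {True: "accepted", False: "declined", None: "timed_out"}
    audit: list[tuple[str, str]] = []
    for expert_id in candidates:
        answer = ask(expert_id)
        audit.append((expert_id, states[answer]))
        if answer:
            return expert_id, audit
    return None, audit  # nobody accepted: the caller falls back to a model


# Canned responses stand in for real human replies.
answers = {"human:a": False, "human:b": None, "human:c": True}
chosen, audit = first_accepting(["human:a", "human:b", "human:c"], answers.get)
```

The audit trail records only who was asked and what happened, which matches the intent that the question's content stays out of the log.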
|
| 221 |
+
|
| 222 |
+
### 5.4 Routing model + RAG hybrid
|
| 223 |
+
|
| 224 |
+
A common pattern: a chat session begins, the user asks a question. The orchestrator:
|
| 225 |
+
|
| 226 |
+
1. Sends `request_summary` (the question + recent thread context) to the router.
|
| 227 |
+
2. If the top expert is a `service` expert pointing at `rag.query@1.0` against a specific corpus → run RAG against that corpus.
3. If the top expert is a `model` expert → run `llm.chat` against that model.
4. If the top expert is a `human` expert and the user explicitly opted in to human handoff → initiate handoff.
|
| 230 |
+
5. Else default model.
|
| 231 |
+
|
| 232 |
+
This is opt-in; the chat UI surfaces a "ask a neighbour?" affordance when human handoff is a credible option.
|
| 233 |
+
|
| 234 |
+
### 5.5 Feedback loop
|
| 235 |
+
|
| 236 |
+
After every routed call, the caller (or a UI signal: "did this answer your question?") records an `Outcome`:
|
| 237 |
+
|
| 238 |
+
```python
|
| 239 |
+
@dataclass
class Outcome:
    handoff_id: str
    helpful: bool | None                    # None if no signal
    user_rating: int | None                 # 1-5
    completion_time_seconds: float | None
    handed_off_again: bool                  # the user asked the question elsewhere
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
Feedback is stored in `FeedbackStore` (SQLite). Once `MOE_ROUTER_TRAIN_MIN_EXAMPLES` (default 200) outcomes exist, the `LearnedScorer` becomes available and is retrained every `MOE_ROUTER_RETRAIN_EVERY_HOURS` (default 24). Communities can flip back to `RuleBasedScorer` at any time.
|
| 249 |
+
|
| 250 |
+
### 5.6 Privacy in routing
|
| 251 |
+
|
| 252 |
+
Routing is **observable** by definition: the request summary is sent to the router, which inspects it to pick an expert. Implications:
|
| 253 |
+
|
| 254 |
+
- The router lives on the **caller's own node**; the request summary is not transmitted off-node for routing.
|
| 255 |
+
- When the chosen expert is on a different node, the request body is sent over the bus as usual (signed, optionally E2E if it's a chat thread).
|
| 256 |
+
- The router does not log full request summaries. It logs `tags`, the candidate list, and the chosen expert. The summary is held in memory for the duration of the call.
|
| 257 |
+
- For handoff to humans, the human sees the actual question, because they need it to answer. The handoff event in the audit trail records "request handed off to X", not the question's content.
|
| 258 |
+
|
| 259 |
+
### 5.7 Cross-community routing (federation)
|
| 260 |
+
|
| 261 |
+
A federation manifest can include the scope `moe.route@1.0`. In that case the router can include experts from federated communities in its candidate list. Cross-community handoff:
|
| 262 |
+
|
| 263 |
+
- Initiates a federated thread (M25 + M14): the thread's events are bridged to the federated community.
|
| 264 |
+
- The expert's identity (e.g. "Lukas from Geldern") is visible to the requester.
|
| 265 |
+
- Federation scope must include `chat.thread.send@1.0` and `chat.thread.history@1.0` for the thread to function.
|
| 266 |
+
|
| 267 |
+
### 5.8 Operational policy
|
| 268 |
+
|
| 269 |
+
Communities can configure:
|
| 270 |
+
|
| 271 |
+
```yaml
|
| 272 |
+
moe:
  enabled: true
  auto_route_threshold: 0.85
  topic_similarity_threshold: 0.55
  human_handoff:
    enabled: true
    response_deadline_minutes: 60
    wait_budget_minutes: 30
    allowed_during_quiet_hours: false   # no human pings 22:00-06:00 local
  external_experts: false
  cross_community: false
|
| 283 |
+
```
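
The quiet-hours window in the policy wraps past midnight, which is easy to get wrong with a naive range check. A minimal sketch, assuming a hypothetical helper `in_quiet_hours` that works on local wall-clock times:

```python
from datetime import time


def in_quiet_hours(now: time, start: time = time(22, 0), end: time = time(6, 0)) -> bool:
    """True if `now` falls in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= now < end      # same-day window, e.g. 13:00 to 15:00
    return now >= start or now < end   # wrapping window, e.g. 22:00 to 06:00


late_evening = in_quiet_hours(time(23, 30))
midday = in_quiet_hours(time(12, 0))
```

A coordinator would presumably skip human candidates while this returns true and `allowed_during_quiet_hours` is false; that wiring is not shown here.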
|
| 284 |
+
|
| 285 |
+
### 5.9 Anti-abuse
|
| 286 |
+
|
| 287 |
+
- **Rate limit per requester:** `MOE_REQUESTS_PER_USER_PER_HOUR` (default 60). Prevents one user from spamming the human-expert pool.
|
| 288 |
+
- **Per-expert cooldown:** an expert is not offered the same user's request within `MOE_PER_EXPERT_COOLDOWN_MINUTES` (default 30).
|
| 289 |
+
- **Decline penalty:** an expert who declines 3 handoffs in a row gets temporarily marked `availability=false` until they update their registration.
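
The per-requester rate limit can be pictured as a sliding one-hour window per user. The sketch below is an assumption about one reasonable approach, not the module's code; `HourlyRateLimiter` is a hypothetical name and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class HourlyRateLimiter:
    """Sliding-window limiter per user (hypothetical helper)."""

    def __init__(self, limit: int = 60, window_seconds: float = 3600.0):
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, user: str, now: float) -> bool:
        hits = self._hits[user]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # forget requests older than the window
        if len(hits) >= self.limit:
            return False    # over budget: the caller would return too_many_requests
        hits.append(now)
        return True


limiter = HourlyRateLimiter(limit=60)
# 100 requests spread over 10 minutes (one every 6 seconds)
results = [limiter.allow("user-a", now=i * 6.0) for i in range(100)]
```

Rejected requests are not recorded, so a user who keeps retrying does not extend their own lockout.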
|
| 290 |
+
|
| 291 |
+
---
|
| 292 |
+
|
| 293 |
+
## 6. Errors
|
| 294 |
+
|
| 295 |
+
| Code | Cause |
|
| 296 |
+
|------|-------|
|
| 297 |
+
| `experimental_disabled` | Research not enabled |
|
| 298 |
+
| `bad_request` | Empty `request_summary`, malformed tags |
|
| 299 |
+
| `not_found` | `expert_id` does not exist (in `handoff`) |
|
| 300 |
+
| `handoff_declined` | Chosen expert declined and no fallback was permitted |
|
| 301 |
+
| `handoff_timed_out` | No response within deadline |
|
| 302 |
+
| `policy_violation` | Cross-community handoff but federation does not allow |
|
| 303 |
+
|
| 304 |
+
---
|
| 305 |
+
|
| 306 |
+
## 7. Configuration
|
| 307 |
+
|
| 308 |
+
```toml
|
| 309 |
+
[research.moe]
|
| 310 |
+
enabled = false
|
| 311 |
+
scorer = "rule_based" # "rule_based" | "learned"
|
| 312 |
+
auto_route_threshold = 0.85
|
| 313 |
+
topic_similarity_threshold = 0.55
|
| 314 |
+
top_k_default = 3
|
| 315 |
+
requests_per_user_per_hour = 60
|
| 316 |
+
per_expert_cooldown_minutes = 30
|
| 317 |
+
allow_external = false
|
| 318 |
+
allow_cross_community = false
|
| 319 |
+
|
| 320 |
+
[research.moe.human_handoff]
|
| 321 |
+
enabled = true
|
| 322 |
+
response_deadline_minutes = 60
|
| 323 |
+
wait_budget_minutes = 30
|
| 324 |
+
allowed_during_quiet_hours = false
|
| 325 |
+
quiet_hours_start = "22:00"
|
| 326 |
+
quiet_hours_end = "06:00"
|
| 327 |
+
|
| 328 |
+
[research.moe.learned_scorer]
|
| 329 |
+
train_min_examples = 200
|
| 330 |
+
retrain_every_hours = 24
|
| 331 |
+
model_kind = "logistic_regression" # small, interpretable
|
| 332 |
+
```
|
| 333 |
+
|
| 334 |
+
---
|
| 335 |
+
|
| 336 |
+
## 8. Tests
|
| 337 |
+
|
| 338 |
+
### 8.1 Unit
|
| 339 |
+
- RuleBasedScorer: tag-overlap dominance test (4 candidates, exact tag match scores highest)
|
| 340 |
+
- Availability filter: expert with `availability` window not covering "now" is excluded
|
| 341 |
+
- Cooldown: same user calls twice within `per_expert_cooldown_minutes` → second call excludes that expert
|
| 342 |
+
|
| 343 |
+
### 8.2 Integration
|
| 344 |
+
- Three-node community: two humans + one model registered as experts for `{cooking, niederrhein-history}`. Query about Sankt-Martins-Lieder → human expert chosen; handoff flow completes.
|
| 345 |
+
- Handoff decline: chosen expert declines, fallback picks next candidate; user sees a single thread experience without knowing about the decline.
|
| 346 |
+
- Cross-community: federation manifest grants `moe.route`; query routed to an expert in the federated community; thread bridged correctly.
|
| 347 |
+
|
| 348 |
+
### 8.3 Adversarial
|
| 349 |
+
- Spam: one user submits 100 routes in 10 minutes → the rate limit admits the first 60 and rejects the rest with `too_many_requests`.
|
| 350 |
+
- Decline-storm: an expert declines 10 in a row → after the third, that expert is auto-unavailable; not offered as candidate until they re-register.
|
| 351 |
+
- Score injection: a community member tries to set `score_bias=999` on their own expert record → registration is rejected (only an anchor may set `score_bias` outside `[-1, 1]`).
|
| 352 |
+
|
| 353 |
+
### 8.4 UX
|
| 354 |
+
- Top-K presentation in chat UI: candidates show as a 3-button affordance under the user's question; user picks one; thread morphs accordingly.
|
| 355 |
+
- Outcome capture: thumbs-up/down on the answer records an `Outcome`; visible in router metrics dashboard.
|
| 356 |
+
|
| 357 |
+
---
|
| 358 |
+
|
| 359 |
+
## 9. Cross-references
|
| 360 |
+
|
| 361 |
+
- Capability spec: [CAPABILITY_CONTRACT_v3 §4.4-4.6](../CAPABILITY_CONTRACT_v3.md)
|
| 362 |
+
- Group chat (handoff substrate): [M25](../../phase-2/modules/M25-group-chat.md)
|
| 363 |
+
- Federation: [M14](../../phase-2/modules/M14-federation.md)
|
| 364 |
+
|
| 365 |
+
---
|
| 366 |
+
|
| 367 |
+
## 10. Open research questions
|
| 368 |
+
|
| 369 |
+
1. **What signal predicts a good route?** Tag overlap is shallow. Embeddings of past handoffs vs current request might do better. The `LearnedScorer` is the placeholder; the actual feature engineering is unsolved.
|
| 370 |
+
2. **Calibration.** Is a score of 0.85 actually 85% likely to be a good route? Reliability diagrams from feedback data needed.
|
| 371 |
+
3. **Negative experts.** Should the router learn that "Llama-3.2 is *bad* at Niederrhein-Plattdeutsch" and avoid it? Currently only positive scores.
|
| 372 |
+
4. **Cold-start.** A new community has no feedback data and no `LearnedScorer`. Bootstrapping by federation (borrowing experts from a more-mature peer)?
|
| 373 |
+
5. **Human consent UX.** What is the right number of handoffs per week before it becomes a burden? Per-person ceiling, per-community ceiling, dynamic?
|
| 374 |
+
6. **Privacy of the rationale.** Should the rationale ("Maria worked on Sankt Martins for 20 years") be visible to the requester? It can reveal information about Maria. Default: rationale is shown to requester only when the expert opts in to that.
|
| 375 |
+
7. **Refusal protocol.** When a model expert "refuses" (e.g. "I cannot answer this"), should the router re-route, or trust the refusal? Mistaken refusals are a known LLM failure mode.
|
| 376 |
+
8. **Expert overlap.** Two experts both register for `{sankt_martins}`. Both equally good. What's the tiebreaker that doesn't always favour the same person? Round-robin? Random? Both score 0.85 and the caller chooses?
|
| 377 |
+
9. **Network effects.** As more people register as experts, does the score signal get diluted? Empirical question.
|
| 378 |
+
10. **Audit and review.** A community might want a quarterly "who was routed to most, on what topics" view, both for fairness and for spotting overworked experts. UX for surfacing this respectfully.
|
docs/p2_p3/M28-fedlearn.md
ADDED
|
@@ -0,0 +1,348 @@
| 1 |
+
# M28: Federated Learning (LoRA Aggregation)
|
| 2 |
+
|
| 3 |
+
**Spec version:** v3.0 (*experimental*)
|
| 4 |
+
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M04 LLM](../../modules/M04-llm.md), [M14 Federation](../../phase-2/modules/M14-federation.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [X06 WebSocket](../../phase-2/cross-cutting/X06-websocket.md)
|
| 5 |
+
**Depended on by:** nothing in MVP; opt-in research feature
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Federated learning of small **LoRA adapters** on top of a shared base model. Each node trains locally on its own data, sends only the **adapter weight deltas** (not raw data, not full weights) to an aggregator, and receives back an averaged adapter that subsequent nodes can use or further refine.
|
| 12 |
+
|
| 13 |
+
The bet: a 3B-parameter base model with a *community-tuned* LoRA adapter ("how people in our village actually phrase things, what jargon our Feuerwehr uses, what the local agricultural calendar looks like") is more useful for the community than a generic 3B model, and we can do this without any node ever shipping its private data off-box.
|
| 14 |
+
|
| 15 |
+
This module deliberately stays at LoRA scope only. Full fine-tunes, distillation, and continual pre-training are explicitly out, both because they are bandwidth-hostile and because the privacy story for full-weight federation is significantly harder.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. Non-goals
|
| 20 |
+
|
| 21 |
+
- **Federating raw data.** Never. Training data stays on the node that owns it.
|
| 22 |
+
- **Full fine-tunes.** LoRA only. If a use case truly needs more, that's a different research project.
|
| 23 |
+
- **Cross-base-model aggregation.** All participants in a round must run the same base model at the same quantisation. Heterogeneous aggregation is open research.
|
| 24 |
+
- **Mandatory participation.** Every node decides per-round whether to join. There is no "you must contribute back" rule.
|
| 25 |
+
- **Aggregator centralisation.** Any node can host an aggregator. There is no privileged aggregator role.
|
| 26 |
+
- **Hiding participation.** Whether you joined a round is visible to other participants in that round; only your data and your gradients are private.
|
| 27 |
+
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
## 3. File layout
|
| 31 |
+
|
| 32 |
+
```
|
| 33 |
+
hearthnet/fedlearn/
├── __init__.py
├── coordinator.py   # Orchestrates a round: announce, gather, aggregate, distribute
├── participant.py   # Local-side: respond to round announcements, train, submit
├── trainer.py       # Wraps M04 LLM in a LoRA training loop (peft + bitsandbytes)
├── aggregator.py    # FedAvg with optional secure aggregation
├── delta.py         # Serialise/deserialise LoRA deltas (state-dict subset)
├── privacy.py       # Optional DP-noise injection and gradient clipping
└── manifest.py      # Round manifest: base model id, hyperparams, signature
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## 4. Public API
|
| 47 |
+
|
| 48 |
+
### 4.1 Dataclasses
|
| 49 |
+
|
| 50 |
+
```python
|
| 51 |
+
from dataclasses import dataclass
from datetime import datetime
from typing import NewType

RoundID = NewType("RoundID", str)  # ULID


@dataclass(frozen=True)
class RoundManifest:
    round_id: RoundID
    coordinator: NodeID
    base_model_id: str                     # exact model id from M04 ("qwen2.5:3b-instruct-q4_K_M")
    base_model_sha: str                    # SHA-256 of base weights; mismatch = exclusion
    lora_target_modules: tuple[str, ...]   # which linear layers carry LoRA (e.g. "q_proj","v_proj")
    lora_rank: int                         # 4 ≤ r ≤ FEDLEARN_MAX_LORA_RANK
    lora_alpha: int
    lora_dropout: float
    train_steps: int                       # max local SGD steps per participant
    learning_rate: float
    batch_size: int
    seed: int                              # for deterministic init of LoRA matrices
    dp_noise_scale: float                  # 0.0 = off
    clip_norm: float                       # gradient clip; must be > 0 if DP on
    min_participants: int                  # round aborts if fewer participants submit
    max_participants: int
    deadline: datetime                     # UTC; submissions after this dropped
    topic: str                             # free-form: "niederrhein-emergency", "village-chat"
    consent_text: str                      # human-readable; participant must accept
    coordinator_sig: bytes                 # detached Ed25519 over the manifest


@dataclass
class ParticipantSubmission:
    round_id: RoundID
    participant: NodeID
    delta_bytes: bytes                     # serialised LoRA state-dict
    delta_sha: str
    num_samples: int                       # for weighted FedAvg
    train_loss: float                      # for telemetry only
    submitted_at: datetime
    signature: bytes                       # Ed25519 over (round_id, participant, delta_sha, num_samples)


@dataclass
class RoundResult:
    round_id: RoundID
    aggregated_delta_sha: str
    n_participants: int
    total_samples: int
    aggregator: NodeID
    completed_at: datetime
    manifest_sha: str
    download_url: str                      # capability bus uri for fetching the aggregated delta
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
### 4.2 Capabilities
|
| 100 |
+
|
| 101 |
+
```python
|
| 102 |
+
from typing import Literal

# Signatures only; bodies are elided with `...` so the listing parses as Python.
async def fedlearn_round_announce(manifest: RoundManifest) -> RoundID: ...
async def fedlearn_round_list(topic: str | None = None) -> list[RoundManifest]: ...
async def fedlearn_round_join(round_id: RoundID, consent: bool) -> JoinReceipt: ...
async def fedlearn_round_submit(submission: ParticipantSubmission) -> SubmitReceipt: ...
async def fedlearn_round_status(round_id: RoundID) -> RoundStatus: ...
async def fedlearn_round_finalize(round_id: RoundID) -> RoundResult: ...  # coordinator-only
async def fedlearn_adapter_fetch(sha: str) -> bytes: ...
async def fedlearn_adapter_apply(sha: str, scope: Literal["session", "node"]) -> ApplyReceipt: ...
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
All capabilities are in the `experimental.fedlearn.*` namespace and only registered on the bus when `experimental.fedlearn = true` in the node config.
|
| 113 |
+
|
| 114 |
+
### 4.3 Coordinator class
|
| 115 |
+
|
| 116 |
+
```python
|
| 117 |
+
class RoundCoordinator:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 llm: LLMService,
                 fedlearn_config: FedLearnConfig): ...

    async def announce_round(self, draft: RoundManifestDraft) -> RoundID: ...
    async def collect_submissions(self, round_id: RoundID) -> list[ParticipantSubmission]: ...
    async def aggregate(self, round_id: RoundID) -> bytes: ...
    async def finalize_and_publish(self, round_id: RoundID) -> RoundResult: ...

    # internal
    async def _validate_submission(self, sub: ParticipantSubmission, manifest: RoundManifest) -> None: ...
    async def _emit(self, evt: Event) -> None: ...
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
### 4.4 Participant class
|
| 135 |
+
|
| 136 |
+
```python
|
| 137 |
+
class RoundParticipant:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 llm: LLMService,
                 data_provider: TrainingDataProvider,
                 fedlearn_config: FedLearnConfig): ...

    async def consider_round(self, manifest: RoundManifest) -> Decision: ...
    async def train(self, manifest: RoundManifest) -> ParticipantSubmission: ...
    async def submit(self, submission: ParticipantSubmission) -> SubmitReceipt: ...
    async def apply_aggregated(self, result: RoundResult, scope: Literal["session", "node"]) -> ApplyReceipt: ...
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
### 4.5 Aggregator
|
| 152 |
+
|
| 153 |
+
```python
|
| 154 |
+
class FedAvgAggregator:
    def __init__(self, manifest: RoundManifest): ...

    def add(self, submission: ParticipantSubmission, delta: dict[str, Tensor]) -> None: ...
    def aggregate(self) -> dict[str, Tensor]: ...  # weighted by num_samples


class SecureFedAvgAggregator(FedAvgAggregator):
    """Optional: pairwise masking so the aggregator sees only the sum, never individual deltas."""
    def __init__(self, manifest: RoundManifest, mask_scheme: Literal["additive_pairwise"] = "additive_pairwise"): ...
|
| 163 |
+
```
|
| 164 |
+
|
| 165 |
+
### 4.6 Privacy helpers
|
| 166 |
+
|
| 167 |
+
```python
|
| 168 |
+
# Signatures only; bodies are elided with `...` so the listing parses as Python.
def clip_gradient(state_dict: dict[str, Tensor], max_norm: float) -> dict[str, Tensor]: ...
def add_dp_noise(state_dict: dict[str, Tensor], scale: float, rng: Generator) -> dict[str, Tensor]: ...
def epsilon_estimate(scale: float, clip: float, n_steps: int, batch: int, dataset_size: int) -> float: ...
|
| 171 |
+
```
|
| 172 |
+
|
| 173 |
+
---
|
| 174 |
+
|
| 175 |
+
## 5. Behaviour
|
| 176 |
+
|
| 177 |
+
### 5.1 Round lifecycle
|
| 178 |
+
|
| 179 |
+
```
|
| 180 |
+
ANNOUNCED ──join──▶ JOINED ──train──▶ TRAINED ──submit──▶ SUBMITTED ──┐
    │                                                                 │
    │                               ┌─────────── aggregate ◀──────────┘
    │                               ▼
    └────deadline reached────▶ AGGREGATING ──finalize──▶ COMPLETED
                                    │
                                    └──min_participants not met──▶ ABORTED
|
| 187 |
+
```
|
| 188 |
+
|
| 189 |
+
State transitions are recorded as events (`fedlearn.round.*`) on the coordinator's event log. Participants see their own state mirrored via subscription.
|
| 190 |
+
|
| 191 |
+
### 5.2 Manifest signing
|
| 192 |
+
|
| 193 |
+
Manifest is canonicalised (JCS, like federation manifests in M14 §5.2), then signed with Ed25519 by the coordinator's node key. Participants must verify the signature before training. A manifest with an invalid signature is dropped silently and logged as a security event (`security.signature.invalid`).
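
To illustrate why canonicalisation matters before signing, the sketch below hashes a manifest in a key-order-independent way. It only approximates JCS (RFC 8785) using the standard `json` module: sorted keys and compact separators match JCS for simple strings and integers, but float formatting and key ordering for non-ASCII names can differ. The helper names are hypothetical, excluding `coordinator_sig` from the signed bytes is an assumption based on the signature being detached, and the Ed25519 step itself is left out because it needs a third-party library.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Approximate JCS: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over the canonical form, leaving out the signature field itself."""
    unsigned = {k: v for k, v in manifest.items() if k != "coordinator_sig"}
    return hashlib.sha256(canonical_bytes(unsigned)).hexdigest()


a = {"round_id": "r1", "lora_rank": 8, "topic": "village-chat"}
b = {"topic": "village-chat", "coordinator_sig": "ignored", "round_id": "r1", "lora_rank": 8}
same = manifest_digest(a) == manifest_digest(b)
```

Both dictionaries produce the same digest even though their key order differs, so a coordinator and a participant that serialise the manifest independently would sign and verify the same bytes.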
|
| 194 |
+
|
| 195 |
+
### 5.3 Consent flow
|
| 196 |
+
|
| 197 |
+
When `fedlearn.round.join` is called, the participant module must:
|
| 198 |
+
|
| 199 |
+
1. Check `experimental.fedlearn` is enabled in node config. If not → `experimental_disabled`.
2. Display `manifest.consent_text` to the operator via the M11 Notifications path. The operator must explicitly accept. The acceptance is stored as a signed `fedlearn.consent.granted` event.
3. Verify coordinator signature. If invalid → `signature_invalid` (we deliberately don't say *whose* signature; bystanders learn nothing useful).
4. Check `base_model_sha` against the locally-installed base model. If mismatch → `base_model_mismatch`. Do not download a different base on demand; this is a hard error.
5. Check resource budget: estimate VRAM and disk for the training run from `lora_rank * len(target_modules) * hidden_size`. If insufficient → `insufficient_resources`.
6. If all checks pass → emit `fedlearn.round.joined`, return `JoinReceipt`.
|
| 205 |
+
|
| 206 |
+
### 5.4 Local training
|
| 207 |
+
|
| 208 |
+
The trainer wraps M04's LLM handle in a HuggingFace `peft.LoraConfig` and uses `bitsandbytes` 4-bit base + fp16 LoRA matrices. Training data is provided by an injected `TrainingDataProvider`; the module never reaches into other modules' storage. Typical providers:
|
| 209 |
+
|
| 210 |
+
- `ChatHistoryProvider` (asks M10 for redacted, consented chat turns),
|
| 211 |
+
- `KBProvider` (asks M07 for documents tagged for training),
|
| 212 |
+
- `CustomFileProvider` (operator-curated training set).
|
| 213 |
+
|
| 214 |
+
After `train_steps` steps or convergence (loss plateau over a window), the trainer extracts the LoRA state-dict, applies optional gradient clipping and DP noise (if `manifest.dp_noise_scale > 0`), serialises, signs, and returns a `ParticipantSubmission`.
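
The clip-then-noise step can be sketched in plain Python on flat lists of floats. This is an illustration under stated assumptions: the function names are hypothetical, clipping uses the global L2 norm across all tensors, and the noise standard deviation is taken to be `scale * clip_norm` (a common Gaussian-mechanism convention), which the spec does not confirm.

```python
import math
import random


def clip_by_global_norm(delta: dict[str, list[float]], max_norm: float) -> dict[str, list[float]]:
    """Scale the whole delta down so its global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for values in delta.values() for v in values))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return {name: [v * factor for v in values] for name, values in delta.items()}


def add_gaussian_noise(
    delta: dict[str, list[float]], scale: float, clip_norm: float, rng: random.Random
) -> dict[str, list[float]]:
    """Add zero-mean Gaussian noise; scale == 0.0 means DP is off."""
    if scale == 0.0:
        return {name: list(values) for name, values in delta.items()}
    sigma = scale * clip_norm  # assumption: noise is calibrated to the clip norm
    return {name: [v + rng.gauss(0.0, sigma) for v in values] for name, values in delta.items()}


raw = {"lora_A": [3.0, 4.0]}
clipped = clip_by_global_norm(raw, max_norm=1.0)
noised = add_gaussian_noise(clipped, scale=0.5, clip_norm=1.0, rng=random.Random(0))
```

Clipping comes first so that the sensitivity of each submission is bounded before noise calibrated to that bound is added.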
|
| 215 |
+
|
| 216 |
+
### 5.5 Aggregation
|
| 217 |
+
|
| 218 |
+
The default aggregator is weighted **FedAvg**: each adapter weight is weighted by `num_samples` and averaged across submissions. After aggregation, the coordinator emits `fedlearn.round.aggregated` and stores the aggregated delta via the capability bus (using the same content-addressed file path that M06 Files uses).
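
Weighted FedAvg computes, for each adapter tensor, the sum of each participant's delta multiplied by that participant's share of the total sample count. A minimal sketch on flat lists of floats follows; `fedavg` is a hypothetical helper, and a real aggregator would operate on tensors.

```python
def fedavg(submissions: list[tuple[int, dict[str, list[float]]]]) -> dict[str, list[float]]:
    """Average per-tensor deltas, weighting each submission by its num_samples."""
    total = sum(num_samples for num_samples, _ in submissions)
    if total <= 0:
        raise ValueError("total num_samples must be positive")
    reference = submissions[0][1]
    averaged: dict[str, list[float]] = {}
    for name, ref_values in reference.items():
        acc = [0.0] * len(ref_values)
        for num_samples, delta in submissions:
            if len(delta[name]) != len(ref_values):
                raise ValueError(f"shape mismatch for {name}")
            weight = num_samples / total
            for i, value in enumerate(delta[name]):
                acc[i] += weight * value
        averaged[name] = acc
    return averaged


# One participant with 1 sample, another with 3 samples.
result = fedavg([
    (1, {"lora_A": [0.0, 2.0]}),
    (3, {"lora_A": [4.0, 2.0]}),
])
```

The participant with three samples pulls the first coordinate three quarters of the way towards its own value, while the second coordinate is unchanged because both agree on it.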
|
| 219 |
+
|
| 220 |
+
If the round was declared with `secure=true` in the draft, `SecureFedAvgAggregator` is used: each participant pair establishes an additive mask, masks cancel in the sum, and the aggregator never sees individual deltas. This costs an extra round-trip between participants before submission (the *mask exchange phase*) and requires `min_participants ≥ 3`.
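
The cancellation property of pairwise additive masks can be demonstrated with integer arithmetic. In this sketch each pair of participants shares a seed; the lower-numbered node adds the derived mask and the higher-numbered node subtracts it, so the masks vanish in the sum. Everything here is illustrative: the names are hypothetical, seeds are hard-coded rather than derived from a key agreement, real float deltas would first need fixed-point encoding, and dropout handling is ignored.

```python
import random
from itertools import combinations

MODULUS = 2**32  # all arithmetic is done modulo this value


def pair_mask(seed: int, length: int) -> list[int]:
    """Deterministic mask that both members of a pair derive from the shared seed."""
    rng = random.Random(seed)
    return [rng.randrange(MODULUS) for _ in range(length)]


def mask_update(node: int, update: list[int], pair_seeds: dict[tuple[int, int], int]) -> list[int]:
    """Apply every pairwise mask that involves `node` to its update."""
    masked = list(update)
    for (low, high), seed in pair_seeds.items():
        if node not in (low, high):
            continue
        sign = 1 if node == low else -1  # lower id adds, higher id subtracts
        mask = pair_mask(seed, len(update))
        masked = [(m + sign * k) % MODULUS for m, k in zip(masked, mask)]
    return masked


updates = {0: [1, 2, 3], 1: [10, 20, 30], 2: [100, 200, 300]}
pair_seeds = {pair: 1000 + i for i, pair in enumerate(combinations(sorted(updates), 2))}
masked = {node: mask_update(node, update, pair_seeds) for node, update in updates.items()}
# The aggregator only ever sums masked vectors.
total = [sum(column) % MODULUS for column in zip(*masked.values())]
```

Each masked vector looks random on its own, yet the column sums equal the sums of the original updates.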

### 5.6 Distribution

The aggregated adapter is published as a content-addressed file. Participants who joined the round get a `fedlearn.round.completed` event with the SHA. They can choose to:

- **Session apply**: load into a single LLM session via M04 (`llm.session.apply_adapter`),
- **Node apply**: install as the default adapter for the node (requires explicit operator action),
- **Discard**: do nothing.

Non-participants can also fetch and apply adapters they trust. There is no DRM and no whitelist: the aggregated delta is just a file with a SHA.

### 5.7 Failure modes

- **Coordinator vanishes mid-round:** participants wait until `deadline`, then any participant can call `fedlearn.round.finalize_takeover(round_id)`, which constructs the aggregated delta from received submissions and re-publishes. The takeover is signed by the takeover node and is visible as such.
- **A participant submits garbage:** validation in `_validate_submission` checks tensor shapes, dtypes, finite-ness (no NaN/Inf), and that the delta is structurally a valid LoRA state-dict for the manifest's `lora_target_modules`. Garbage submissions are dropped and logged.
- **Sybil flooding:** all participants must be authenticated with M01 identity and the manifest can require a minimum reputation/trust score (this is open research; for v3.0 the field exists in the manifest but is not yet enforced).
- **Adversarial gradient (poisoning):** out of scope for v3.0; documented in Open Research Questions §10.
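
The structural checks described for garbage submissions might look like the sketch below. It assumes a delta maps `"<module>.lora_A"` and `"<module>.lora_B"` to nested lists of floats with shapes `(rank, hidden)` and `(hidden, rank)`; the key naming, the square-projection assumption, and the `DeltaInvalid` class are hypothetical, and dtype checks are omitted because plain lists carry no dtype.

```python
import math

Matrix = list[list[float]]


class DeltaInvalid(ValueError):
    """Raised when a submission fails structural validation (`delta_invalid`)."""


def expected_shapes(target_modules: list[str], lora_rank: int,
                    hidden_size: int) -> dict[str, tuple[int, int]]:
    """Hypothetical key layout: one A and one B matrix per target module."""
    shapes = {}
    for module in target_modules:
        shapes[f"{module}.lora_A"] = (lora_rank, hidden_size)
        shapes[f"{module}.lora_B"] = (hidden_size, lora_rank)
    return shapes


def validate_submission(delta: dict[str, Matrix], target_modules: list[str],
                        lora_rank: int, hidden_size: int) -> None:
    shapes = expected_shapes(target_modules, lora_rank, hidden_size)
    missing = shapes.keys() - delta.keys()
    extra = delta.keys() - shapes.keys()
    if missing or extra:
        raise DeltaInvalid(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, (rows, cols) in shapes.items():
        matrix = delta[name]
        if len(matrix) != rows or any(len(row) != cols for row in matrix):
            raise DeltaInvalid(f"{name}: expected shape {rows}x{cols}")
        if not all(math.isfinite(x) for row in matrix for x in row):
            raise DeltaInvalid(f"{name}: contains NaN or Inf")


if __name__ == "__main__":
    good = {"q_proj.lora_A": [[0.0] * 3] * 2, "q_proj.lora_B": [[0.0] * 2] * 3}
    validate_submission(good, ["q_proj"], lora_rank=2, hidden_size=3)
    print("ok")
```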

---

## 6. Errors

| Code | When |
|-----------------------------------|---------------------------------------------------------------------------|
| `experimental_disabled` | Caller invokes a fedlearn capability with the flag off |
| `signature_invalid` | Manifest or submission signature does not verify |
| `base_model_mismatch` | Local base model SHA differs from manifest |
| `insufficient_resources` | Estimated VRAM/disk exceeds budget |
| `consent_required` | `join()` called without an explicit consent record |
| `round_full` | `max_participants` reached |
| `round_closed` | Submission after deadline |
| `delta_invalid` | Submitted state-dict fails structural validation |
| `fedlearn_aggregation_failed` | Aggregation produced NaN/Inf or insufficient submissions |
| `fedlearn_min_participants_unmet` | Round closes with fewer than `min_participants` valid submissions |
| `fedlearn_aggregator_unreachable` | `finalize()` called while coordinator is offline and takeover not triggered |
| `adapter_not_found` | `fedlearn.adapter.fetch` for an unknown SHA |

---

## 7. Configuration

```python
@dataclass(frozen=True)
class FedLearnConfig:
    enabled: bool = False                                            # master switch; default off
    max_lora_rank: int = FEDLEARN_MAX_LORA_RANK                      # 64
    max_lora_target_modules: int = FEDLEARN_MAX_LORA_TARGET_MODULES  # 8
    max_train_steps: int = FEDLEARN_MAX_TRAIN_STEPS                  # 1000
    max_round_participants: int = FEDLEARN_MAX_PARTICIPANTS          # 32
    min_round_participants: int = FEDLEARN_MIN_PARTICIPANTS          # 3
    dp_noise_scale_default: float = FEDLEARN_DP_NOISE_SCALE_DEFAULT  # 0.0 (off)
    clip_norm_default: float = FEDLEARN_CLIP_NORM_DEFAULT            # 1.0
    submission_max_bytes: int = FEDLEARN_SUBMISSION_MAX_BYTES        # 64 MiB
    require_secure_aggregation: bool = False
    auto_apply_aggregated: bool = False                              # never auto-apply by default
    training_vram_budget_mb: int = 8192
    training_disk_budget_mb: int = 4096
```

All `FEDLEARN_*` constants live in `hearthnet/constants.py` so a single source of truth governs both validation and documentation generation.

---

## 8. Tests

### 8.1 Unit

- `test_manifest_canonicalisation_stable`: re-encoding does not change SHA.
- `test_manifest_signature_roundtrip`.
- `test_delta_serialisation_roundtrip`: tensors preserve dtype and shape.
- `test_fedavg_weighted_arithmetic`: manually averaged deltas match aggregator output to within fp16 noise.
- `test_dp_noise_zero_is_identity`: `add_dp_noise(d, scale=0.0)` is a no-op.
- `test_clip_gradient_norm`: post-clip norm ≤ `max_norm`.
- `test_secure_aggregation_masks_cancel`: sum of masks across all pairs is zero.

### 8.2 Property

- Across random shapes, `fedavg([d, d, d]) == d`.
- Across random submissions, `fedavg(submissions)` is finite when all inputs are finite.

### 8.3 Integration

- Two-node loopback round on a 0.5B base model: announce → join → train (synthetic data, 10 steps) → submit → aggregate → apply. Aggregated adapter must be loadable and must not increase perplexity by more than 2x on a held-out set (sanity, not quality).
- Coordinator-failure round: simulate coordinator going offline after submissions received; takeover by another participant produces an aggregated delta with the same SHA.
- Sybil-defence stub: round with `min_participants=3` and only 2 valid submissions aborts with `fedlearn_min_participants_unmet`.

### 8.4 Negative

- Wrong base SHA → `base_model_mismatch`.
- Submission with NaN in one tensor → `delta_invalid`.
- Submission missing one of the target modules → `delta_invalid`.
- Manifest signed by an untrusted identity → `signature_invalid`.
- Disabled flag → `experimental_disabled` even for read-only queries.

---

## 9. Cross-references

- **Phase 1 M04 LLM**: provides the local model handle, exposes `llm.session.apply_adapter` and `llm.adapter.list`.
- **Phase 1 M07 Knowledge Base**: `KBProvider` reads tagged documents for training.
- **Phase 2 M14 Federation**: federated rounds across communities use the federation transport for manifest distribution and submission. Cross-community rounds require both communities' DPOs to sign the round consent.
- **Phase 2 M16 Tokens**: round participation tokens (`fedlearn-participant` scope) are issued by the coordinator and bound to a single round.
- **Phase 2 M25 Group Chat**: `village-chat` rounds typically draw training data from group chat history (consented turns only).
- **Phase 3 M30 Evidence/EBKH**: aggregated adapters can be tracked as claims in the evidence graph; "adapter X improved perplexity on held-out set Y" is a `claim.assert`.

---

## 10. Open research questions

1. **Gradient poisoning defence.** Coordinated malicious participants can submit deltas that, when aggregated, degrade or backdoor the adapter. Robust aggregation rules (Krum, trimmed mean) are a partial defence; an authenticated-data attestation (per-submission proof that gradients were computed on real, non-cherry-picked data) is the harder question. v3.0 ships FedAvg only; v3.1 may add Krum behind a flag.

2. **Heterogeneous base models.** Today, every participant in a round must run the same base model at the same quantisation. Cross-base aggregation (e.g., projecting LoRA from Qwen-3B-Q4 to Qwen-3B-Q5 or even Qwen-3B → Qwen-7B) is open. The naive approach (re-projecting via a translation matrix learnt from a calibration set) loses accuracy quickly.

3. **Adaptive DP-noise.** Fixed `dp_noise_scale` is crude. Per-round noise calibration as a function of `min_participants` and `lora_rank` would tighten the privacy/utility tradeoff. Out of scope for v3.0.

4. **Reputation-weighted FedAvg.** Weighting submissions by `num_samples * trust_score` instead of `num_samples` alone. Requires a credible trust signal, which the broader HearthNet design has not yet committed to.

5. **Continual rounds.** Today each round produces a stand-alone adapter. Stacking rounds (round N tunes on top of round N-1's aggregate) raises questions about drift, fairness, and rollback. Probably belongs in a future M28b.

6. **Cross-task adapters.** A `niederrhein-emergency` adapter and a `village-chat` adapter are trained separately. Whether they can be cleanly combined at inference time (LoRA composition) is a known-hard problem and explicitly not promised here.

7. **Hardware-class fairness.** A round held by a participant with an RTX 5090 might exclude phone-class participants by setting `train_steps` too high. A "ranked tier" with separate aggregations per tier is one possibility. Currently the manifest is a single-tier flat artefact.

8. **Audit of training data.** Even though raw data never leaves the node, the *fact that training happened on consented data* is currently un-auditable from the outside. A future zero-knowledge attestation of "this delta was computed on N samples each tagged training=true" would be useful. Out of scope.

---

*Last updated: spec v3.0.*
|
docs/p2_p3/M29-lora-beacons.md
ADDED
|
@@ -0,0 +1,322 @@
# M29: LoRa Hardware Beacons

**Spec version:** v3.0 (*experimental*)
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M02 Transport](../../modules/M02-transport.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M11 Notifications](../../modules/M11-notifications.md), [M01 Identity](../../modules/M01-identity.md)
**Depended on by:** Civil Defense (M31) optionally consumes beacon-presence signals

---

## 1. Responsibility

Optional **out-of-band presence and panic-button channel** over 868 MHz LoRa hardware. When the internet is out, the cellular network is down, or the power grid is wobbly, the LoRa stack still carries a 32-byte "I exist" ping or a tiny panic message between neighbours up to a few kilometres apart.

This module is explicitly **not a data channel**. No AI traffic, no chat content, no file transfer. The bandwidth is laughably small (sub-100 bytes per minute per node in a normal duty-cycle regime), the latency is awful, and the airwave is shared. What LoRa is good at is "I'm still here, and the gateway in the next village is still reachable."

The module exposes a small set of capabilities for sending and receiving beacons, mapping LoRa device IDs to HearthNet identities, and surfacing the resulting connectivity graph to the rest of the stack as a fallback signal. Hardware is a USB-attached LoRa stick (RFM95W, SX1276, SX1262 chipsets) bridged via serial.

---

## 2. Non-goals

- **General-purpose meshing of HearthNet over LoRa.** Bandwidth and duty cycle make this impossible at any useful scale.
- **Encryption of beacon contents.** Beacons carry identity hash + sequence + minimal flags only. Anything sensitive belongs on a different channel.
- **Replacing TETRA/BOS.** Emergency services have their own radio. This is a *neighbour-to-neighbour* fallback, not a replacement for professional emergency comms.
- **Hardware abstraction layer for every LoRa chipset.** v3.0 supports a small whitelist (RFM95W, sx1276, sx1262 via Meshtastic-firmware sticks). Others are open contributions.
- **Long-distance routing.** No multi-hop store-and-forward in v3.0. A beacon goes one hop or it doesn't go.
- **Legal interpretation of national radio regulations.** Each operator is responsible for complying with their local rules (BNetzA in Germany, ETSI in EU, FCC in US). The module enforces *configured* duty-cycle limits but cannot enforce the law on the operator.

---

## 3. File layout

```
hearthnet/lora/
├── __init__.py
├── service.py          # LoraBeaconService: the capability handler
├── serial_bridge.py    # USB-serial framing to the LoRa stick
├── frame.py            # Encode/decode the 32-byte beacon frame
├── duty_cycle.py       # Track airtime, enforce duty-cycle limits
├── peer_map.py         # LoRa device ID to NodeID mapping (with TOFU verification)
└── adapters/
    ├── __init__.py
    ├── meshtastic.py   # Meshtastic firmware stick
    ├── rfm95w.py       # Bare RFM95W via serial-port gateway firmware
    └── sx126x.py       # sx1262 module
```

---

## 4. Public API

### 4.1 Dataclasses

```python
LoraBeaconID = NewType("LoraBeaconID", str)   # device-local sequence + prefix
LoraDeviceID = NewType("LoraDeviceID", str)   # hardware ID from the stick

@dataclass(frozen=True)
class LoraBeacon:
    beacon_id: LoraBeaconID
    sender_hash: bytes        # 4-byte truncated SHA-256 of sender NodeID
    sequence: int             # u16, wraps
    flags: int                # u8 bit-field; see §5.1
    rssi: int | None          # dBm, on receive only
    snr: float | None         # dB, on receive only
    timestamp: datetime       # local clock at decode

@dataclass(frozen=True)
class LoraPeer:
    node_id: NodeID
    device_id: LoraDeviceID
    sender_hash: bytes
    first_seen: datetime
    last_seen: datetime
    rssi_recent: int | None
    verified_tofu: bool       # True after operator confirmation

@dataclass(frozen=True)
class DutyCycleStatus:
    region: Literal["EU868", "US915", "AS923"]
    window_seconds: int
    airtime_used_ms: int
    airtime_budget_ms: int    # e.g. 36000 ms in EU868 1% window
    next_tx_allowed_at: datetime
```

### 4.2 Capabilities

All under `experimental.lora.*`:

```python
async def lora_status() -> LoraStatus: ...
async def lora_beacon_send(flags: int = 0) -> LoraBeaconID: ...
async def lora_panic_send() -> LoraBeaconID: ...  # sets FLAG_PANIC; bypasses normal pacing
async def lora_peer_list() -> list[LoraPeer]: ...
async def lora_peer_verify(device_id: LoraDeviceID, node_id: NodeID) -> VerifyReceipt: ...
async def lora_recent_beacons(since: datetime | None = None) -> list[LoraBeacon]: ...
async def lora_duty_cycle() -> DutyCycleStatus: ...
async def lora_subscribe_beacons() -> AsyncIterator[LoraBeacon]: ...
```

### 4.3 Service class

```python
class LoraBeaconService:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 notifications: NotificationService,
                 identity: IdentityService,
                 config: LoraConfig): ...

    async def start(self) -> None: ...   # opens serial, begins RX loop
    async def stop(self) -> None: ...
    async def send_beacon(self, flags: int = 0) -> LoraBeaconID: ...
    async def send_panic(self) -> LoraBeaconID: ...
    async def on_frame_received(self, raw: bytes, rssi: int, snr: float) -> None: ...
    async def _drain_rx(self) -> None: ...
    def duty_cycle_status(self) -> DutyCycleStatus: ...
```

### 4.4 Serial bridge

```python
class SerialBridge:
    def __init__(self, port: str, baud: int = 115200, adapter: LoraAdapter = ...): ...
    async def open(self) -> None: ...
    async def close(self) -> None: ...
    async def write(self, frame: bytes) -> None: ...
    async def read(self) -> AsyncIterator[bytes]: ...

class LoraAdapter(Protocol):
    """Per-chipset/firmware framing rules."""
    name: str
    def encode_tx(self, payload: bytes) -> bytes: ...
    def decode_rx(self, raw: bytes) -> tuple[bytes, int, float]: ...   # payload, rssi, snr
    def at_init_commands(self) -> list[bytes]: ...
```

---

## 5. Behaviour

### 5.1 Beacon frame

Strictly 32 bytes, big-endian:

```
offset  size  field
0       1     version (currently 0x01)
1       4     sender_hash (SHA-256(NodeID)[:4])
5       2     sequence (u16, wraps)
7       1     flags
8       1     reserved (0x00)
9       4     unix_seconds (sender's clock, u32; informational only)
13      19    payload (currently zero-padded; reserved for future use)
```

No payload content is carried beyond identity-hash + sequence + flags + clock. The flags field carries:

```python
FLAG_PANIC       = 0x01   # urgent attention requested
FLAG_OK          = 0x02   # explicit "I'm fine" (operator pressed an OK button)
FLAG_GATEWAY     = 0x04   # this node has an alternate transport currently up
FLAG_LOW_BATTERY = 0x08   # device-level low-battery indicator
# Bits 0x10..0x80 are reserved (FLAG_RESERVED_*).
```

Frames are not encrypted. Frames are *not* anonymous either: the sender hash is small enough to collide (4 bytes), but stable enough that a passive observer can correlate beacons from the same sender over time. This is documented and acceptable for the threat model: LoRa airwaves are observable by construction.

### 5.2 RX path

1. `SerialBridge` yields raw frames as they arrive.
2. `LoraAdapter.decode_rx` peels off the chipset framing and returns the 32-byte payload + RSSI + SNR.
3. `service.on_frame_received` validates the frame (length == 32, version == 0x01) and looks up `sender_hash` in `peer_map`.
4. If `sender_hash` matches a verified peer in `peer_map`, the beacon is recorded against that peer.
5. If `sender_hash` is unknown, a `lora.peer.unknown` event is emitted with a TOFU verification prompt for the operator.
6. If `FLAG_PANIC` is set, a high-priority notification is raised via M11 regardless of peer-verification status.
7. The beacon is published on the bus subscription `experimental.lora.beacon.received`.

### 5.3 TX path and duty cycle

Beaconing follows a fixed cadence, `beacon_period_seconds` (default `LORA_BEACON_PERIOD_SECONDS_DEFAULT`, 600 s = 10 minutes). Each transmission's airtime is computed from spreading factor and bandwidth (typical: SF9, BW125, CR 4/5 gives roughly 250 ms per 32-byte frame with an 8-symbol preamble, explicit header and CRC) and added to the duty-cycle window.
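
Airtime can be estimated with the time-on-air formula from the Semtech SX127x datasheets. The sketch below is an illustration rather than the module's `duty_cycle.py`: the function name is hypothetical, the defaults mirror `LoraConfig`, and the 8-symbol preamble, explicit header, enabled CRC, and the 16 ms rule for low-data-rate optimisation are assumptions about the radio configuration.

```python
import math


def lora_airtime_ms(payload_bytes: int,
                    spreading_factor: int = 9,
                    bandwidth_khz: float = 125.0,
                    coding_rate_denom: int = 5,      # 4/5 .. 4/8
                    preamble_symbols: int = 8,
                    explicit_header: bool = True,
                    crc: bool = True) -> float:
    """Time on air in milliseconds for one LoRa packet."""
    symbol_ms = (2 ** spreading_factor) / bandwidth_khz
    # Low-data-rate optimisation is conventionally enabled when a symbol
    # lasts longer than 16 ms (e.g. SF11/SF12 at 125 kHz).
    de = 1 if symbol_ms > 16.0 else 0
    ih = 0 if explicit_header else 1
    numerator = 8 * payload_bytes - 4 * spreading_factor + 28 + 16 * int(crc) - 20 * ih
    denominator = 4 * (spreading_factor - 2 * de)
    payload_symbols = 8 + max(math.ceil(numerator / denominator) * coding_rate_denom, 0)
    preamble_ms = (preamble_symbols + 4.25) * symbol_ms
    return preamble_ms + payload_symbols * symbol_ms


if __name__ == "__main__":
    # 32-byte beacon at SF9 / 125 kHz / CR 4/5: about 247 ms under these assumptions.
    print(round(lora_airtime_ms(32), 1))
```

Other preamble, header, or CRC settings change the result, so the tracker should compute airtime from the adapter's actual radio parameters rather than a constant.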

The duty-cycle window enforces the region's regulation:

| Region | Window | Budget |
|--------|--------|--------|
| EU868 | 3600 s | 36 s (1%) |
| US915 | 3600 s | unlimited (FHSS) but config still applies |
| AS923 | 3600 s | 36 s (1%) |

If a normal `send_beacon` call would exceed the budget, it is **deferred** until the budget allows. `send_panic` ignores the duty-cycle limit when `duty_cycle_override_for_panic` is enabled (the default), on the premise that an emergency transmission takes precedence; whether local regulations actually permit this is for the operator to confirm (see §2). The operator is notified that the duty-cycle override was used, and the event log records `lora.duty_cycle.overridden`.
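
A sliding-window airtime budget like the one in the table can be tracked with a simple queue of past transmissions. This is a sketch under assumptions: the class and method names are hypothetical, timestamps are plain seconds passed in by the caller, and the panic override is left to the caller.

```python
from collections import deque


class DutyCycleTracker:
    """Tracks airtime spent within a sliding window (e.g. 36 000 ms per 3600 s)."""

    def __init__(self, window_seconds: float = 3600.0, budget_ms: int = 36_000):
        self.window_seconds = window_seconds
        self.budget_ms = budget_ms
        self._tx: deque[tuple[float, int]] = deque()   # (timestamp_s, airtime_ms)

    def _expire(self, now: float) -> None:
        # Drop transmissions that have left the window.
        while self._tx and self._tx[0][0] <= now - self.window_seconds:
            self._tx.popleft()

    def used_ms(self, now: float) -> int:
        self._expire(now)
        return sum(airtime for _, airtime in self._tx)

    def can_send(self, airtime_ms: int, now: float) -> bool:
        return self.used_ms(now) + airtime_ms <= self.budget_ms

    def record(self, airtime_ms: int, now: float) -> None:
        self._tx.append((now, airtime_ms))

    def next_tx_allowed_at(self, airtime_ms: int, now: float) -> float:
        """Earliest time at which a frame of this airtime fits into the budget."""
        if airtime_ms > self.budget_ms:
            raise ValueError("frame airtime exceeds the whole budget")
        used = self.used_ms(now)
        if used + airtime_ms <= self.budget_ms:
            return now
        for timestamp, airtime in self._tx:
            used -= airtime                      # this entry will have expired
            if used + airtime_ms <= self.budget_ms:
                return timestamp + self.window_seconds
        return now


if __name__ == "__main__":
    tracker = DutyCycleTracker()
    tracker.record(247, now=0.0)
    print(tracker.used_ms(now=10.0), tracker.can_send(247, now=10.0))
```

A deferred `send_beacon` could sleep until `next_tx_allowed_at` and then transmit.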

### 5.4 Peer mapping (TOFU)

The first time a `sender_hash` is received, the module emits a notification: *"A new LoRa peer with hash 0xABCD1234 was heard. Do you recognise this device?"* The operator can:

- **Verify by NodeID**: provide a HearthNet NodeID; the module checks that `SHA-256(NodeID)[:4] == sender_hash` and stores the verified mapping.
- **Mark as unknown**: store the hash with no NodeID; future beacons from this hash will still be tracked but flagged unknown.
- **Block**: drop all beacons from this hash; never prompt again.

Hash collisions (two different NodeIDs producing the same 4-byte hash) are possible but unlikely. When two operators independently verify the same hash to different NodeIDs, the conflict is surfaced as a `lora.peer.conflict` event for manual resolution.
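
The verify, mark-unknown, and block choices can be pictured as a small in-memory map. This sketch is illustrative only: the class, exception, and method names are hypothetical, persistence and events are omitted, the NodeID is assumed to be hashed as a UTF-8 string, and the hash function is injectable purely so that a collision can be simulated.

```python
import hashlib
from typing import Callable


def default_sender_hash(node_id: str) -> bytes:
    return hashlib.sha256(node_id.encode("utf-8")).digest()[:4]


class PeerUnknown(Exception):
    """Maps to `lora_peer_unknown`."""


class PeerConflict(Exception):
    """Maps to `lora_peer_conflict`."""


class PeerMap:
    def __init__(self, hash_fn: Callable[[str], bytes] = default_sender_hash):
        self._hash_fn = hash_fn
        self._seen: set[bytes] = set()
        self._verified: dict[bytes, str] = {}
        self._blocked: set[bytes] = set()

    def observe(self, shash: bytes) -> str:
        """Classify an incoming hash: 'blocked', 'verified', 'unknown' or 'new'."""
        if shash in self._blocked:
            return "blocked"
        if shash in self._verified:
            return "verified"
        if shash in self._seen:
            return "unknown"
        self._seen.add(shash)
        return "new"                      # first contact: prompt the operator

    def verify(self, shash: bytes, node_id: str) -> None:
        if shash not in self._seen:
            raise PeerUnknown(shash.hex())
        if self._hash_fn(node_id) != shash:
            raise ValueError("NodeID does not hash to this sender_hash")
        existing = self._verified.get(shash)
        if existing is not None and existing != node_id:
            raise PeerConflict(f"{shash.hex()} already maps to {existing}")
        self._verified[shash] = node_id

    def block(self, shash: bytes) -> None:
        self._blocked.add(shash)


if __name__ == "__main__":
    peers = PeerMap()
    h = default_sender_hash("node-a")
    print(peers.observe(h))               # first contact
    peers.verify(h, "node-a")
    print(peers.observe(h))
```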

### 5.5 Beacon-presence signal

Other modules can subscribe to `experimental.lora.beacon.received` to incorporate "this peer is alive on LoRa even though the internet says they're offline" into their own logic. M31 Civil Defense in particular uses this to corroborate that a target node is alive during an outage incident.

The presence signal is *advisory*: a node that beacons on LoRa is alive in the radio sense, but that says nothing about whether the operator is responsive or whether higher-layer services are available there.

### 5.6 Failure modes

- **No stick attached or USB error:** `lora.status()` reports `unavailable`. The module starts in a disabled state; no errors are raised on startup, only logged.
- **Stick attached but firmware mismatch:** `at_init_commands` fail; the adapter raises `lora_hardware_unsupported` and the service stays disabled.
- **Receive flood:** the RX queue is bounded (`LORA_RX_QUEUE_MAX` default 256). Overflow drops oldest entries and emits a `lora.rx.dropped` event.
- **Clock skew:** beacons carry the sender's clock, but the receiver never trusts it for ordering; local arrival timestamp is authoritative.
- **Adversarial flooding:** an attacker on 868 MHz can spam frames; the duty-cycle limits *us* but not *them*. The service rate-limits beacons per `sender_hash` at the RX side (`LORA_PEER_RX_MAX_PER_MINUTE`, default 20) to avoid filling notifications. Excess beacons from one hash are dropped silently after the rate limit; this is a known DoS vector and documented in §10.
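
The per-hash RX rate limit can be sketched as a sliding 60-second window per `sender_hash`. The class name and the choice of a sliding (rather than fixed-bucket) window are assumptions; the default of 20 mirrors `LORA_PEER_RX_MAX_PER_MINUTE`.

```python
from collections import defaultdict, deque


class PeerRxRateLimiter:
    """Allows at most max_per_minute beacons per sender_hash in any 60 s window."""

    def __init__(self, max_per_minute: int = 20):
        self.max_per_minute = max_per_minute
        self._hits: dict[bytes, deque[float]] = defaultdict(deque)

    def allow(self, sender_hash: bytes, now: float) -> bool:
        hits = self._hits[sender_hash]
        # Forget arrivals older than one minute.
        while hits and hits[0] <= now - 60.0:
            hits.popleft()
        if len(hits) >= self.max_per_minute:
            return False                  # dropped silently by the caller
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PeerRxRateLimiter()
    surfaced = sum(limiter.allow(b"\xab\xcd\x12\x34", now=float(i)) for i in range(30))
    print(surfaced)                       # 30 frames in 30 s: only the first 20 pass
```

As the text notes, an attacker who rotates hashes sidesteps this limiter; it only protects against a single noisy sender.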

---

## 6. Errors

| Code | When |
|-------------------------------|------------------------------------------------------------------|
| `experimental_disabled` | Capability called with the flag off |
| `lora_hardware_unavailable` | No stick present or serial port not opened |
| `lora_hardware_unsupported` | Adapter init failed; firmware not whitelisted |
| `lora_duty_cycle_exhausted` | Non-panic send requested with budget at zero and override off |
| `lora_peer_unknown` | `lora.peer.verify` for a sender_hash we've never seen |
| `lora_peer_conflict` | `verify()` would create a (hash → two distinct NodeIDs) mapping |
| `lora_frame_malformed` | RX frame fails structural validation |

---

## 7. Configuration

```python
@dataclass(frozen=True)
class LoraConfig:
    enabled: bool = False
    serial_port: str = "/dev/ttyUSB0"              # also Windows COM4, etc.
    serial_baud: int = 115200
    adapter: Literal["meshtastic", "rfm95w", "sx126x"] = "meshtastic"
    region: Literal["EU868", "US915", "AS923"] = "EU868"
    spreading_factor: int = 9                      # 7..12; higher = more range, less rate
    bandwidth_khz: int = 125
    coding_rate_denom: int = 5                     # 4/5
    tx_power_dbm: int = 14                         # legal max for EU868
    beacon_period_seconds: int = LORA_BEACON_PERIOD_SECONDS_DEFAULT   # 600
    panic_burst_count: int = 3                     # PANIC sends this many frames rapid-fire
    panic_burst_gap_ms: int = 800
    rx_queue_max: int = LORA_RX_QUEUE_MAX          # 256
    peer_rx_max_per_minute: int = LORA_PEER_RX_MAX_PER_MINUTE         # 20
    tofu_auto_accept: bool = False                 # never auto-trust new hashes by default
    duty_cycle_override_for_panic: bool = True
```

Constants live in `hearthnet/constants.py`.

---

## 8. Tests

### 8.1 Unit

- `test_frame_encode_decode_roundtrip`: random payloads encode to exactly 32 bytes and round-trip.
- `test_sender_hash_matches_nodeid`: `SHA-256(NodeID)[:4]` matches the field in the encoded frame.
- `test_duty_cycle_tracks_airtime`: synthetic transmissions accumulate; budget drains; recovers over time.
- `test_panic_overrides_duty_cycle`: `send_panic` succeeds at zero budget when override is enabled.
- `test_panic_blocked_when_override_disabled`: `send_panic` returns `lora_duty_cycle_exhausted` when override is off.
- `test_peer_rx_rate_limit`: 30 frames from one hash within a minute → only 20 surface.

### 8.2 Integration (loopback)

- Mock `SerialBridge` echoes TX as RX after a configurable delay. Verify a sent beacon shows up in `recent_beacons` and on the subscription.
- Two simulated nodes (separate SerialBridges connected via an in-memory channel): A sends, B receives, B's peer_map contains A after TOFU verification, RSSI/SNR are populated.

### 8.3 Hardware-in-the-loop (optional)

- With a real LoRa stick, send N beacons and verify duty-cycle accounting matches what the firmware reports.
- Range test: two sticks at increasing distance; record packet-loss vs distance.

### 8.4 Negative

- Disabled flag → all capabilities return `experimental_disabled`.
- No serial port → `lora_hardware_unavailable` on status.
- Truncated frame → `lora_frame_malformed`, dropped.
- Conflicting verify → `lora_peer_conflict`.

---

## 9. Cross-references

- **Phase 1 M01 Identity**: `SHA-256(NodeID)[:4]` is the sender hash; verification uses M01's NodeID type.
- **Phase 1 M02 Transport**: LoRa is *not* a Transport in the M02 sense. It does not carry capability-bus traffic; it lives parallel to M02 as an alternative signalling channel. The two share no code.
- **Phase 1 M11 Notifications**: high-priority panic-beacon notifications and TOFU prompts route through M11.
- **Phase 1 X02 Event Log**: `lora.*` events.
- **Phase 3 M31 Civil Defense**: beacon-presence is one corroborating signal for "is the target node alive" during an incident.
- **Phase 3 X09 Conformance Suite**: LoRa is an optional capability; conformance tests use a mock serial bridge.

---

## 10. Open research questions

1. **Mesh routing.** Multi-hop store-and-forward over LoRa is well-explored in the Meshtastic project. Whether HearthNet should adopt it (and inherit the bandwidth tradeoffs) or keep one-hop simplicity is unsettled. Probably belongs in M29b.

2. **Authenticated beacons.** Adding even a 4-byte MAC would let receivers reject forged sender-hashes. This costs payload space we don't have today. A 64-byte frame variant (`version 0x02`) with HMAC-truncated-to-8-bytes is the obvious extension.

3. **DoS robustness.** Per-hash rate limiting is naive; an attacker just rotates hashes. The defence on 868 MHz is mostly the regulatory duty-cycle and physical proximity, neither of which we control in software. Documented as a known limitation.

4. **Sleep-and-wake duty cycles.** Battery-powered nodes (a panic button by the bedside) want to sleep most of the time and wake on demand. Class-A/B/C LoRaWAN-style scheduling is the standard answer. Out of scope for v3.0.

5. **Chipset coverage.** v3.0 supports a small whitelist. Each new chipset is an adapter shaped exactly like the existing ones; contributors are encouraged.

6. **GPS integration.** Many LoRa sticks ship with a GPS module. We deliberately did not surface location data in v3.0: location is privacy-sensitive and the use case is unclear. A future `FLAG_HAS_GPS` + paired side-channel might make sense for civil-defence scenarios.

7. **Integration with civil-defence radio.** TETRA-BOS and BOS-Digitalfunk are professional networks we have no business interoperating with. But a *unidirectional* "did the BOS station broadcast a known alert" listener might be useful. Legally complex.

8. **Network coding.** When multiple nearby nodes beacon, the airwave fills. Cooperative beacon scheduling (so neighbours don't transmit on top of each other) is a fun problem. Currently each node beacons independently and collisions are accepted.

---

*Last updated: spec v3.0.*
|
docs/p2_p3/M30-evidence-ebkh.md
# M30: Evidence Graph & EBKH Integration

**Spec version:** v3.0 (*experimental*)
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M01 Identity](../../modules/M01-identity.md), [M06 Files](../../modules/M06-files.md), [M07 Knowledge Base](../../modules/M07-knowledge-base.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [M21 Tool Calls](../../phase-2/modules/M21-tool-calls.md)
**Depended on by:** M31 Civil Defense (alerts may carry an evidence chain); RAG quality reports

---

## 1. Responsibility

A **content-addressed claim graph** layered alongside the append-only event log, plus an integration adapter for Christof's existing **EBKH v3+** (event-sourced knowledge hub with OSINT capabilities, PostGIS, and the class-reference graph already described in his upstream work).

The event log answers "what happened, in what order". The evidence graph answers a different question: "what is asserted, by whom, on what basis, and what counterclaims exist?" This is necessary because in a community AI mesh, *information has provenance*: a claim about when the Sankt Martins parade starts, where the local emergency assembly point is, or what the Volksbank's IBAN is must be traceable to a source, must be cross-checkable, and must support dispute. The event log doesn't do that; it just records that someone said something.

The module is small in line count but conceptually load-bearing: it defines what counts as evidence, what a claim is, how claims compose, and how the rest of the stack queries provenance. The EBKH adapter wires this to Christof's already-built PostGIS + OSINT infrastructure so the two systems converge rather than duplicate.

---
## 2. Non-goals

- **Replacing the event log.** The event log is still the source of truth for *what happened*. The evidence graph is a *derived* view over claims extracted from events plus claims asserted directly.
- **Adjudicating truth.** The module records claims, disputes, and attestations. It does not decide who is right.
- **General-purpose knowledge graph.** This is not a Wikidata clone. The schema is deliberately narrow: claim, source, attestation, dispute, derivation.
- **Replacing RAG.** RAG retrieves passages. The evidence graph annotates those passages with provenance and lets a downstream summariser say "according to source X (last verified Y)". Different layer.
- **OSINT collection.** EBKH already collects from external sources; this module *integrates* with that collector, it does not duplicate it.
- **Mandatory adoption.** RAG and chat can still operate without consulting the evidence graph. The graph is queried when callers explicitly want provenance.

---
## 3. File layout

```
hearthnet/evidence/
├── __init__.py
├── service.py            # EvidenceService: capability handler
├── claim.py              # Claim, ClaimSource, Attestation, Dispute, Derivation dataclasses
├── store.py              # ClaimStore: append-only with content-addressed ClaimIDs
├── query.py              # Provenance traversal: trace(), neighbours(), conflicts()
├── extractor.py          # Pulls claims out of events (chat, KB ingest, federation)
├── ebkh_adapter.py       # Bridge to Christof's EBKH v3+ via JSON-RPC
└── trust.py              # Per-source trust scoring (advisory, no enforcement)
```

---
## 4. Public API

### 4.1 Dataclasses

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, NewType

# NodeID is assumed to come from M01 Identity; it is not redefined here.

ClaimID = NewType("ClaimID", str)      # SHA-256 over canonical claim record
SourceID = NewType("SourceID", str)    # "url:https://...", "node:<NodeID>", "doc:<FileID>", "ebkh:<entity_uri>"

EvidenceLevel = Literal[
    "unverified",        # claim made, no corroboration
    "cited",             # claim has at least one source
    "cross_referenced",  # ≥2 independent sources agree
    "attested",          # an identified party with skin in the game has signed
    "disputed",          # at least one counterclaim with sources
]

@dataclass(frozen=True)
class ClaimSource:
    source_id: SourceID
    accessed_at: datetime
    content_sha: str | None            # SHA of the cited content if available
    excerpt: str | None                # short verbatim excerpt; max 280 chars

@dataclass(frozen=True)
class Claim:
    claim_id: ClaimID
    subject: str                       # canonical subject string
    predicate: str                     # e.g. "starts_at", "located_at", "operated_by"
    object: str                        # canonical object string
    asserted_by: NodeID
    asserted_at: datetime
    sources: tuple[ClaimSource, ...]
    confidence: float                  # asserter's self-rated confidence [0,1]
    derived_from: tuple[ClaimID, ...]  # parent claims if this is a derivation
    signature: bytes                   # asserter's Ed25519 signature

@dataclass(frozen=True)
class Attestation:
    claim_id: ClaimID
    attester: NodeID
    attested_at: datetime
    rationale: str                     # human-readable "I know this because..."
    role: str                          # "first-hand witness", "official record holder", "expert"
    signature: bytes

@dataclass(frozen=True)
class Dispute:
    claim_id: ClaimID                  # the claim being disputed
    counterclaim_id: ClaimID           # the claim made in response
    disputer: NodeID
    disputed_at: datetime
    rationale: str
    signature: bytes

@dataclass(frozen=True)
class ProvenanceTrace:
    claim: Claim
    sources: tuple[ClaimSource, ...]
    attestations: tuple[Attestation, ...]
    disputes: tuple[Dispute, ...]
    derivation_tree: tuple[Claim, ...]  # walked depth-first, deduplicated
    evidence_level: EvidenceLevel
    trust_score: float                  # advisory; from trust.py
```
### 4.2 Capabilities

All under `experimental.evidence.*`:

```python
async def evidence_claim_assert(draft: ClaimDraft) -> ClaimID: ...
async def evidence_claim_dispute(claim_id: ClaimID, counterclaim: ClaimDraft, rationale: str) -> ClaimID: ...
async def evidence_claim_attest(claim_id: ClaimID, role: str, rationale: str) -> AttestationReceipt: ...
async def evidence_claim_get(claim_id: ClaimID) -> Claim: ...
async def evidence_claim_query(subject: str | None = None,
                               predicate: str | None = None,
                               object: str | None = None,
                               min_evidence: EvidenceLevel = "unverified") -> list[Claim]: ...
async def evidence_provenance_trace(claim_id: ClaimID, max_depth: int = 5) -> ProvenanceTrace: ...
async def evidence_subject_summary(subject: str) -> SubjectSummary: ...
async def evidence_ebkh_sync(direction: Literal["pull","push","bidi"] = "bidi") -> SyncReport: ...
```
### 4.3 Service class

```python
class EvidenceService:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 identity: IdentityService,
                 store: ClaimStore,
                 extractor: ClaimExtractor,
                 ebkh: EbkhAdapter | None,
                 trust: TrustScorer,
                 config: EvidenceConfig): ...

    async def assert_claim(self, draft: ClaimDraft) -> ClaimID: ...
    async def dispute(self, claim_id: ClaimID, counterclaim: ClaimDraft, rationale: str) -> ClaimID: ...
    async def attest(self, claim_id: ClaimID, role: str, rationale: str) -> AttestationReceipt: ...
    async def trace(self, claim_id: ClaimID, max_depth: int) -> ProvenanceTrace: ...
    async def summarise_subject(self, subject: str) -> SubjectSummary: ...
    async def evidence_level(self, claim_id: ClaimID) -> EvidenceLevel: ...
```
### 4.4 Claim store

```python
class ClaimStore:
    async def put(self, claim: Claim) -> ClaimID: ...        # idempotent on ClaimID
    async def get(self, claim_id: ClaimID) -> Claim | None: ...
    async def by_subject(self, subject: str) -> list[Claim]: ...
    async def by_triple(self, subject: str, predicate: str, object: str | None) -> list[Claim]: ...
    async def disputes_of(self, claim_id: ClaimID) -> list[Dispute]: ...
    async def attestations_of(self, claim_id: ClaimID) -> list[Attestation]: ...
    async def derivatives_of(self, claim_id: ClaimID) -> list[Claim]: ...
```

The store is append-only. A "retraction" is itself a claim (`predicate="retracted"`, `object="<claim_id>"`) and is treated as a special kind of dispute by the trace algorithm.
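
A minimal in-memory sketch of the append-only and retraction-as-claim behaviour follows. It is synchronous for brevity, and `MiniClaim`, `InMemoryClaimStore`, and `retractions_of` are hypothetical names used only for illustration; the subject chosen for the retraction record is likewise an assumption, since the spec fixes only the predicate and object.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MiniClaim:
    """Trimmed-down stand-in for Claim (hypothetical)."""
    claim_id: str
    subject: str
    predicate: str
    object: str


class InMemoryClaimStore:
    def __init__(self) -> None:
        self._claims: dict[str, MiniClaim] = {}

    def put(self, claim: MiniClaim) -> str:
        # Idempotent on claim_id: an existing record is never overwritten.
        self._claims.setdefault(claim.claim_id, claim)
        return claim.claim_id

    def get(self, claim_id: str) -> MiniClaim | None:
        return self._claims.get(claim_id)

    def retractions_of(self, claim_id: str) -> list[MiniClaim]:
        # A retraction is an ordinary claim whose object is the retracted ClaimID.
        return [c for c in self._claims.values()
                if c.predicate == "retracted" and c.object == claim_id]


store = InMemoryClaimStore()
original = MiniClaim("id1", "parade", "starts_at", "17:00")
retraction = MiniClaim("id2", "parade", "retracted", "id1")
store.put(original)
store.put(original)      # second put is a no-op
store.put(retraction)
```

The original record stays retrievable after the retraction is stored; nothing is deleted or mutated.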
### 4.5 Extractor

```python
class ClaimExtractor:
    """Watches the event log and proposes claims from candidate events.

    Proposals are *suggestions*, not auto-asserted. A claim only enters
    the store when an identified asserter signs it.
    """
    async def consume(self, evt: Event) -> list[ClaimDraft]: ...
    def register_pattern(self, predicate: str, matcher: Callable[[Event], ClaimDraft | None]) -> None: ...
```

Patterns shipped in v3.0:

- KB ingest event → `claim(doc_sha, "contains_text", text_hash)` with the doc as source.
- Tool-call event (M21) with HTTP fetch → `claim(url, "served", content_sha)` with the URL as source.
- Federation manifest event → `claim(remote_node, "advertises_capability", cap_name)` with the manifest as source.
- LoRa beacon (M29) reception → *not* auto-extracted; presence is logged but not claimed.
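
The pattern mechanism can be illustrated with a small, synchronous sketch. The `Event` and `ClaimDraft` shapes, the event kind string `"tool.call.http_fetch"`, and the `PatternRegistry` class are assumptions made for this example; the real types live elsewhere in the stack and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical minimal event shape."""
    kind: str
    payload: dict


@dataclass(frozen=True)
class ClaimDraft:
    """Hypothetical minimal draft shape."""
    subject: str
    predicate: str
    object: str
    source_id: str


Matcher = Callable[[Event], Optional[ClaimDraft]]


class PatternRegistry:
    def __init__(self) -> None:
        self._patterns: dict[str, Matcher] = {}

    def register_pattern(self, predicate: str, matcher: Matcher) -> None:
        self._patterns[predicate] = matcher

    def consume(self, evt: Event) -> list[ClaimDraft]:
        # Every matcher sees every event; a matcher returns None when it does not apply.
        drafts = []
        for matcher in self._patterns.values():
            draft = matcher(evt)
            if draft is not None:
                drafts.append(draft)
        return drafts


def tool_fetch_matcher(evt: Event) -> ClaimDraft | None:
    """Mirrors the 'served' pattern: claim(url, "served", content_sha)."""
    if evt.kind != "tool.call.http_fetch":
        return None
    url = evt.payload["url"]
    return ClaimDraft(url, "served", evt.payload["content_sha"], f"url:{url}")


registry = PatternRegistry()
registry.register_pattern("served", tool_fetch_matcher)
```

Drafts returned by `consume` are only proposals; as the docstring above says, nothing enters the store until an asserter signs.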
### 4.6 EBKH adapter

```python
class EbkhAdapter:
    def __init__(self, endpoint: str, token: str, postgis_dsn: str | None = None): ...

    async def push_claim(self, claim: Claim) -> EbkhRef: ...
    async def pull_entity(self, entity_uri: str) -> list[Claim]: ...
    async def query_spatial(self, bbox: Bbox, predicate: str | None = None) -> list[Claim]: ...
    async def sync(self, direction: Literal["pull","push","bidi"]) -> SyncReport: ...
```

The adapter speaks EBKH's JSON-RPC over HTTPS with an Ed25519-bound bearer token (issued via M16). Spatial queries piggy-back on EBKH's PostGIS layer, which is useful for civil-defence claims like "Sammelplatz is at geom(...)". For nodes without EBKH installed, the adapter is `None` and capabilities still function on the local claim store only.
### 4.7 Trust scoring

`TrustScorer` produces an advisory `[0,1]` score for a source. The function is intentionally simple and visible:

```python
class TrustScorer:
    def score_source(self, source: ClaimSource, context: TrustContext) -> float: ...
    def score_asserter(self, node_id: NodeID, context: TrustContext) -> float: ...
```

Inputs include: how long the source has been known, how many of its prior claims were not disputed, whether it's signed by a verified identity, and whether it's in an operator-curated allowlist or blocklist. The score is **always shown alongside the claim, never hidden**, and never causes a claim to be omitted from query results; claims are only re-ranked. Operators can override individual scores.

---
## 5. Behaviour

### 5.1 Canonicalisation and ClaimID

A claim's identity is its `ClaimID`, defined as:

```
ClaimID = base32-no-pad( SHA-256( JCS({
    subject, predicate, object,
    asserted_by, asserted_at_iso8601,
    sources: [{source_id, accessed_at_iso, content_sha} ...],
    confidence_5dp,
    derived_from: [...sorted ClaimIDs...],
}) ) )
```

The signature is *not* part of the ClaimID. Because `asserted_by` is part of the hashed record, a different asserter making an otherwise identical claim gets a different ClaimID; the signature adds non-repudiation on top. A claim asserted twice by the same node at the same instant with the same sources is genuinely the same claim, and the store deduplicates it.
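
The ClaimID recipe can be sketched as follows. Several points are assumptions rather than spec: `json.dumps` with sorted keys and compact separators only approximates JCS (RFC 8785), which differs for floats and some key orderings; sources are sorted here so that list order cannot change the ID; and the confidence is serialised as a 5-decimal string. The function names are hypothetical.

```python
from __future__ import annotations

import base64
import hashlib
import json


def canonical_record(subject, predicate, obj, asserted_by, asserted_at,
                     sources, confidence, derived_from) -> dict:
    """Build the record that gets hashed. Timestamps are ISO 8601 strings."""
    return {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "asserted_by": asserted_by,
        "asserted_at": asserted_at,
        # Assumption: only the three hashed source fields are kept, in sorted order.
        "sources": sorted(
            ({"source_id": s["source_id"],
              "accessed_at": s["accessed_at"],
              "content_sha": s.get("content_sha")} for s in sources),
            key=lambda s: (s["source_id"], s["accessed_at"]),
        ),
        # Assumption: "confidence_5dp" means a fixed 5-decimal string.
        "confidence": f"{confidence:.5f}",
        "derived_from": sorted(derived_from),
    }


def claim_id(record: dict) -> str:
    # Approximation of JCS: sorted keys, no insignificant whitespace, UTF-8.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    # base32 without '=' padding; a 32-byte digest yields 52 characters.
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

Note that the signature never enters `canonical_record`, so the ID can be computed before the claim is signed.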
### 5.2 Evidence level computation

```
unverified        = no sources
cited             = ≥1 source
cross_referenced  = ≥2 sources, distinct source_ids, ≥1 not from asserter's own node
attested          = ≥1 attestation with role in {"first-hand witness","official record holder"}
disputed          = ≥1 unretracted dispute by a node with trust_score ≥ EVIDENCE_DISPUTE_MIN_TRUST
```

Levels are not mutually exclusive in nature, but the API returns the *strongest applicable level* with `disputed` taking precedence over everything else if present. This way callers default to "show that there's a dispute" rather than burying it under a stronger-sounding label.
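
The rules above can be read as a precedence ladder. The sketch below is one possible reading, not the shipped implementation: the flat argument shapes are hypothetical, and "not from asserter's own node" is interpreted as a source whose ID differs from `node:<asserter>`.

```python
from __future__ import annotations

STRONG_ROLES = {"first-hand witness", "official record holder"}
DISPUTE_MIN_TRUST = 0.3  # stands in for EVIDENCE_DISPUTE_MIN_TRUST


def evidence_level(source_ids: list[str],
                   asserter_node: str,
                   attestation_roles: list[str],
                   dispute_trust_scores: list[float]) -> str:
    """dispute_trust_scores holds the trust score of each unretracted disputer."""
    # 'disputed' wins over everything else when a sufficiently trusted dispute exists.
    if any(t >= DISPUTE_MIN_TRUST for t in dispute_trust_scores):
        return "disputed"
    if any(role in STRONG_ROLES for role in attestation_roles):
        return "attested"
    distinct = set(source_ids)
    own = f"node:{asserter_node}"
    if len(distinct) >= 2 and any(s != own for s in distinct):
        return "cross_referenced"
    if distinct:
        return "cited"
    return "unverified"
```

Checking `disputed` first is what keeps a dispute from being buried under a stronger-sounding label.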
### 5.3 Provenance trace algorithm

`trace(claim_id, max_depth)` does a depth-first walk over `derived_from` edges, deduplicates by ClaimID, and collects every source, attestation, and dispute encountered. The walk stops at `max_depth` (default 5) or at a cycle (cycles shouldn't exist by construction, but we guard anyway).

The result is a flat tuple in topological order from root claim outward. UI is expected to render this as a tree or a list, with disputes inlined wherever they occur.
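
A compact sketch of the walk, restricted to the ordering of ClaimIDs, is shown below. The `derived_from` mapping stands in for store lookups and `trace_order` is a hypothetical name. Reverse post-order is used because a plain depth-first pre-order is not guaranteed to be topological on diamond-shaped graphs; in this sketch the depth cutoff depends on the path by which a claim is first reached.

```python
from __future__ import annotations


def trace_order(root: str,
                derived_from: dict[str, tuple[str, ...]],
                max_depth: int = 5) -> list[str]:
    """Return reachable ClaimIDs, root first, each claim before its parents."""
    seen: set[str] = set()
    post: list[str] = []

    def visit(claim_id: str, depth: int) -> None:
        if claim_id in seen:          # deduplicates and also guards against cycles
            return
        seen.add(claim_id)
        if depth < max_depth:         # claims at max_depth are kept but not expanded
            for parent in derived_from.get(claim_id, ()):
                visit(parent, depth + 1)
        post.append(claim_id)

    visit(root, 0)
    return list(reversed(post))


# Diamond: root derives from A and B, both of which derive from C.
diamond = {"root": ("A", "B"), "A": ("C",), "B": ("C",)}
order = trace_order("root", diamond)
```

In the diamond example, `C` appears exactly once and after both `A` and `B`.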
### 5.4 Subject summary

`summarise_subject(subject)` is the workhorse for the rest of the stack. It returns:

- All claims with this subject, grouped by predicate.
- For each predicate, the strongest claim by evidence level and trust score.
- All disputes affecting this subject.
- A flat list of distinct sources contributing.

This is what a RAG pipeline calls to add provenance to its retrieved passages, and what civil-defence (M31) calls to verify a target before publishing an alert.
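
The grouping and "strongest claim per predicate" step might look like the sketch below. The dict-based claim shape and the function name are hypothetical. One assumption deserves emphasis: for *selection*, disputed claims are ranked below undisputed ones, which the spec does not state explicitly.

```python
from __future__ import annotations

from collections import defaultdict

# Assumption: selection rank differs from the API's reporting precedence,
# so a disputed claim is never preferred over an undisputed one.
SELECTION_RANK = {
    "disputed": 0,
    "unverified": 1,
    "cited": 2,
    "cross_referenced": 3,
    "attested": 4,
}


def strongest_per_predicate(claims: list[dict]) -> dict[str, dict]:
    """claims: dicts with keys predicate, object, evidence_level, trust_score."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for claim in claims:
        grouped[claim["predicate"]].append(claim)
    # Evidence level decides first; trust score breaks ties within a level.
    return {
        predicate: max(group, key=lambda c: (SELECTION_RANK[c["evidence_level"]],
                                             c["trust_score"]))
        for predicate, group in grouped.items()
    }
```

All claims in each group would still be returned to the caller alongside the pick, so nothing is hidden by the selection.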
### 5.5 EBKH sync

`evidence.ebkh.sync` runs in three modes:

- **pull**: fetch claims for subjects in our local store from EBKH, add as new claims (asserted by the EBKH node identity).
- **push**: send our locally-asserted claims to EBKH; EBKH stores them tagged with our node ID.
- **bidi**: both, in that order.

Sync is idempotent. Each side stores the other-side's claim records; nothing is overwritten. Conflicts (same triple, different sources) become co-existing claims, and disputes can be raised normally.

EBKH's existing PostGIS schema is reused for spatial predicates. The adapter does *not* try to model the full EBKH schema in our claim graph; it surfaces what is asked for and lets EBKH remain the authoritative store for OSINT-collected material.

### 5.6 Failure modes

- **Claim signature invalid** on receipt from another node → reject; emit `security.signature.invalid`.
- **Dispute on a non-existent claim** → `claim_not_found`.
- **Cyclic derivation** → reject the new claim; `evidence_cycle_detected`. (This can only happen via malicious crafting; honest derivation cannot cycle.)
- **EBKH unreachable** during sync → return a `SyncReport` with `partial=true` and the unreachable error; do not fail the calling operation.
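
The cyclic-derivation check can be sketched as an ancestor walk performed before a claim is stored. The function name and the `derived_from` mapping (ClaimID to parent ClaimIDs, standing in for store lookups) are hypothetical.

```python
from __future__ import annotations


def would_create_cycle(new_id: str,
                       parents: tuple[str, ...],
                       derived_from: dict[str, tuple[str, ...]]) -> bool:
    """True if new_id is already an ancestor of one of its own declared parents."""
    stack = list(parents)
    seen: set[str] = set()
    while stack:
        claim_id = stack.pop()
        if claim_id == new_id:
            return True               # the new claim would derive from itself
        if claim_id in seen:
            continue
        seen.add(claim_id)
        stack.extend(derived_from.get(claim_id, ()))
    return False


# Existing chain: c derives from b, b derives from a.
existing = {"b": ("a",), "c": ("b",)}
cyclic = would_create_cycle("a", ("c",), existing)    # a claiming to derive from c
acyclic = would_create_cycle("d", ("c",), existing)   # a fresh claim derived from c
```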
---

## 6. Errors

| Code                              | When                                                                     |
|-----------------------------------|--------------------------------------------------------------------------|
| `experimental_disabled`           | Capability called with the flag off                                      |
| `claim_not_found`                 | Operation references a ClaimID we don't have                             |
| `claim_signature_invalid`         | Signature doesn't verify against asserter's identity                     |
| `evidence_cycle_detected`         | Proposed claim's derivation chain forms a cycle                          |
| `evidence_contradiction`          | (advisory) two claims with the same subject and predicate but conflicting objects |
| `ebkh_unavailable`                | EBKH endpoint not configured or unreachable                              |
| `trust_below_threshold`           | (advisory) attached to results; not an error condition by itself         |

`evidence_contradiction` is *advisory*: it is returned in query results as a flag, not raised as an exception. The system never silently picks a winner.

---
## 7. Configuration

```python
from dataclasses import dataclass

from hearthnet.constants import (
    EVIDENCE_CLAIM_TTL_DAYS_DEFAULT,
    EVIDENCE_DISPUTE_MIN_TRUST,
    EVIDENCE_MAX_PROVENANCE_DEPTH,
)

@dataclass(frozen=True)
class EvidenceConfig:
    enabled: bool = False
    auto_extract: bool = True                                    # let extractor propose drafts
    extract_patterns: tuple[str, ...] = ("kb_ingest","tool_fetch","federation_manifest")
    claim_ttl_days: int = EVIDENCE_CLAIM_TTL_DAYS_DEFAULT        # 365
    trust_default: float = 0.5
    dispute_min_trust: float = EVIDENCE_DISPUTE_MIN_TRUST        # 0.3
    ebkh_endpoint: str | None = None
    ebkh_token_scope: str = "evidence-sync"
    ebkh_sync_interval_minutes: int = 60
    max_provenance_depth: int = EVIDENCE_MAX_PROVENANCE_DEPTH    # 8
    summary_max_predicates: int = 32
```

Constants live in `hearthnet/constants.py`. `claim_ttl_days` does not delete claims; it marks them as stale for query purposes, and the actual record is permanent.

---
## 8. Tests

### 8.1 Unit

- `test_claim_id_canonicalisation`: re-ordering the source list or changing whitespace does not affect the ClaimID.
- `test_claim_signature_roundtrip`.
- `test_evidence_level_disputed_wins`: a claim with two sources *and* a dispute returns `disputed`.
- `test_provenance_trace_dedup`: a diamond derivation graph yields each ancestor once.
- `test_extractor_kb_ingest_pattern`: a KB ingest event produces a draft with the right predicate.
- `test_retraction_is_dispute`: a retraction shows up in `disputes_of`.

### 8.2 Property

- For random claims, `evidence_level(c)` is monotonic when adding sources/attestations and falls to `disputed` on adding a dispute.
- For random derivation DAGs, `trace(root)` yields exactly the reachable set.

### 8.3 Integration

- KB ingest → claim drafted → operator asserts → query returns it.
- Dispute lifecycle: assert claim, attest it, dispute it, see `evidence_level=disputed`, retract the dispute (as a dispute-of-the-dispute), verify level returns to `attested`.
- EBKH adapter against a mock JSON-RPC endpoint: round-trip a spatial claim, verify the bbox query returns it.
- Federated extraction: a federation manifest event from M14 produces a claim about advertised capabilities, which is then visible to subject-query `summarise_subject(<remote_node_id>)`.

### 8.4 Negative

- Cyclic `derived_from` input → `evidence_cycle_detected`.
- Claim signed by an unknown identity → `claim_signature_invalid`.
- EBKH endpoint configured but unreachable → `sync` returns partial; capability does not raise.

---
## 9. Cross-references

- **Phase 1 M01 Identity**: every claim is signed; signatures verified against M01.
- **Phase 1 M06 Files**: `doc:<FileID>` sources resolve to content-addressed files.
- **Phase 1 M07 Knowledge Base**: KB ingest events feed the extractor.
- **Phase 1 X02 Event Log**: `evidence.claim.*`, `evidence.dispute.*`, `evidence.attestation.*` events.
- **Phase 2 M21 Tool Calls**: fetched URLs are extracted as claims for downstream provenance.
- **Phase 3 M31 Civil Defense**: alerts carry a top-level claim ID, allowing recipients to trace why an alert was issued.
- **Phase 3 X09 Conformance Suite**: provenance-trace correctness is part of the experimental suite.
- **External: EBKH v3+**: Christof's existing event-sourced knowledge hub; PostGIS-backed; this module is the integration point.

---
## 10. Open research questions

1. **Claim semantics.** "subject/predicate/object" is intentionally loose. Whether to adopt RDF, JSON-LD, or a custom ontology with a controlled vocabulary is unsettled. v3.0 accepts free strings and ships a small recommended vocabulary in the docs.

2. **Trust composition.** When a derived claim depends on three parent claims of varying trust, what's the derived trust? Min, product, weighted-average? Currently the trust scorer ignores derivation. A future version may compose explicitly.

3. **Dispute escalation.** Today a dispute is a single counter-claim. In practice, communities will want threaded discussion attached to disputes. Whether this belongs here or in M10/M25 chat is a design call.

4. **Time-bound claims.** "The Volksbank opens at 9:00" is true on weekdays but not Sundays. The schema has no first-class temporal modality. A pragmatic workaround is encoding temporal qualifiers into the predicate ("opens_at_weekday"), but a proper temporal logic layer would be cleaner. Out of scope.

5. **Confidentiality.** Some claims are sensitive (a neighbour's medical condition, a Feuerwehr member's home address). The current model has no claim-level access control. The capability-bus tokens (M16) can scope access to the evidence service entirely, but not at a per-claim granularity. Open.

6. **OSINT integration boundaries.** EBKH ingests from external feeds. When does an external feed become a sufficiently authoritative source to upgrade evidence level? The pragmatic stance in v3.0 is "operator decides via the trust allowlist". A future version may automate this.

7. **Visualisation.** Provenance trees get big fast. A graph visualisation widget (probably d3 in plain HTML) would help operators. Specced but unbuilt.

8. **Federated claim propagation.** Two communities federate, and one asserts a claim relevant to the other. Should it auto-mirror? Today, no: claims propagate only when explicitly queried via `ebkh_sync` or fetched on demand. A push model would be possible but worsens consent.

---

*Last updated: spec v3.0.*
docs/p2_p3/M31-civil-defense.md
# M31: Civil Defense (NRW Bevölkerungsschutz Pilot)

**Spec version:** v3.0 (*experimental*)
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M01 Identity](../../modules/M01-identity.md), [M11 Notifications](../../modules/M11-notifications.md), [M14 Federation](../../phase-2/modules/M14-federation.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [M30 Evidence](./M30-evidence-ebkh.md), [M29 LoRa Beacons](./M29-lora-beacons.md), [M22 Mobile Native](../../phase-2/modules/M22-mobile-native.md)
**Depended on by:** nothing. This is a terminal module; civil defense is a downstream consumer of everything else.

---

## 1. Responsibility

A scoped pilot for **NRW Bevölkerungsschutz**: integrate HearthNet with the role structures that Germany's civil-defence ecosystem actually uses (THW, DRK, Feuerwehr, Katastrophenschutz) so that during an incident, role-certified members can publish authenticated alerts, coordinate locally, and produce a tamper-evident audit trail that survives legal review.

This module is deliberately **regional and regulated**. It does *not* try to be a global civil-defence platform. It encodes the role taxonomy, certificate semantics, and audit-retention rules that apply in Nordrhein-Westfalen, with hooks for other German Länder and EU regions to plug in later. The pilot lives in Issum and the Niederrhein because that's where Christof can actually walk into a Feuerwehrhaus and get this tested with humans who will use it under stress.

Where the rest of HearthNet aims for "soft consensus across neighbours", this module aims for "hard provenance, signed by an authority, retained per legal mandate". Different ergonomics. Different threat model.

---
## 2. Non-goals

- **Replacing official alert systems.** NINA, KATWARN, Cell-Broadcast, and BOS radio remain the authoritative channels. M31 is *complementary*: it works when official channels are degraded, congested, or geographically miss the affected area, and it carries the *local context* that mass-broadcast systems can't.
- **Issuing legally binding evacuation orders.** Those come from the Krisenstab and are out of any AI-mediated system's authority.
- **Modelling every German Land.** v3.0 targets NRW; M31 has a region adapter so others can be added, but the module ships with NRW only.
- **Replacing TETRA-BOS.** Professional emergency-services radio is its own thing. We coexist; we don't interop.
- **Automatic identity verification of certificate holders.** A role certificate carries who issued it and who it was issued to. *Verifying* that a holder is who they claim is the issuer's responsibility, not ours. We check the signature chain; we don't re-do the background check.
- **Persistent geolocation of helpers.** We record where alerts target and where reported incidents are. We do not continuously track helpers' phones.

---
## 3. File layout

```
hearthnet/civdef/
├── __init__.py
├── service.py            # CivilDefenseService: capability handler
├── alert.py              # Alert, AlertEnvelope, AlertSeverity dataclasses
├── role.py               # RoleCertificate, role schemas per region
├── audit.py              # Tamper-evident audit chain + export
├── regions/
│   ├── __init__.py
│   ├── nrw.py            # NRW role taxonomy & issuer trust roots
│   └── _stubs.py         # other Länder placeholders
├── target.py             # Geographic / role / channel targeting
└── ack.py                # Acknowledgement collection
```

---
|
| 48 |
+
|
| 49 |
+
## 4. Public API

### 4.1 Dataclasses

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, NewType

# NodeID, Bbox, and ClaimID are not defined in this module; they are assumed
# to come from other HearthNet modules (identity, geo, M30 evidence).

AlertID = NewType("AlertID", str)        # ULID
AlertSeverity = Literal["info", "advisory", "warning", "emergency", "extreme"]

@dataclass(frozen=True)
class RoleCertificate:
    cert_id: str
    holder: NodeID
    role: str                            # canonical role, e.g. "DE.NRW.THW.OV.Leiter"
    region: str                          # "DE.NRW.KreisKleve"
    issuer: NodeID                       # issuing authority's HearthNet identity
    issuer_chain: tuple[NodeID, ...]     # chain back to a trust root
    issued_at: datetime
    expires_at: datetime
    scopes: frozenset[str]               # what this cert is allowed to do
    signature: bytes
    revocation_url: str | None

@dataclass(frozen=True)
class AlertTarget:
    region: str                          # "DE.NRW.KreisKleve.Issum"
    bbox: Bbox | None                    # optional precise geo target
    roles: tuple[str, ...]               # which roles should see this; empty = public
    channels: tuple[Literal["push", "lora", "federation", "local"], ...]

@dataclass(frozen=True)
class Alert:
    alert_id: AlertID
    severity: AlertSeverity
    title: str                           # ≤ 80 chars
    body: str                            # ≤ 1000 chars
    target: AlertTarget
    instructions: tuple[str, ...]        # short imperative lines
    published_at: datetime
    expires_at: datetime
    publisher: NodeID
    publisher_role: str
    publisher_cert: str                  # cert_id
    evidence_claim: ClaimID | None       # link to M30 claim chain if relevant
    correlation_id: str | None           # links to NINA/KATWARN ID if mirrored
    signature: bytes                     # publisher signs the alert
    issuer_attestation: bytes | None     # optional co-sign by a higher-tier issuer

@dataclass(frozen=True)
class AlertEnvelope:
    alert: Alert
    federation_hops: tuple[NodeID, ...]  # forward path for audit
    received_at: datetime
    received_via: Literal["bus", "federation", "lora_signal", "manual"]

@dataclass(frozen=True)
class Ack:
    alert_id: AlertID
    acker: NodeID
    acked_at: datetime
    status: Literal["received", "acting", "need_help", "standing_down", "mistaken"]
    note: str                            # ≤ 280 chars
    signature: bytes

@dataclass(frozen=True)
class AuditEntry:
    seq: int                             # monotonic per audit chain
    alert_id: AlertID
    event: str                           # "published","forwarded","acked","mirrored","cancelled"
    actor: NodeID
    at: datetime
    payload_sha: str
    prev_sha: str                        # chain-link to previous audit entry
    signature: bytes
```

### 4.2 Capabilities

All under `experimental.civdef.*`:

```python
async def civdef_alert_publish(draft: AlertDraft) -> AlertID: ...
async def civdef_alert_cancel(alert_id: AlertID, reason: str) -> CancelReceipt: ...
async def civdef_alert_list(active_only: bool = True,
                            severity_min: AlertSeverity = "info") -> list[Alert]: ...
async def civdef_alert_get(alert_id: AlertID) -> AlertEnvelope: ...
async def civdef_alert_subscribe(target_filter: AlertTarget | None = None) -> AsyncIterator[AlertEnvelope]: ...
async def civdef_alert_ack(alert_id: AlertID, status: AckStatus, note: str = "") -> AckReceipt: ...
async def civdef_alert_acks(alert_id: AlertID) -> list[Ack]: ...
async def civdef_role_register(cert: RoleCertificate) -> RegisterReceipt: ...
async def civdef_role_list() -> list[RoleCertificate]: ...
async def civdef_role_revoke(cert_id: str, reason: str) -> RevokeReceipt: ...   # issuer-only
async def civdef_audit_export(alert_id: AlertID | None = None,
                              since: datetime | None = None,
                              format: Literal["jsonl", "pdf"] = "jsonl") -> bytes: ...
```

### 4.3 Service class

```python
class CivilDefenseService:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 identity: IdentityService,
                 notifications: NotificationService,
                 federation: FederationService,
                 evidence: EvidenceService | None,
                 region: RegionAdapter,
                 audit_store: AuditChainStore,
                 config: CivDefConfig): ...

    async def publish_alert(self, draft: AlertDraft, publisher_cert: RoleCertificate) -> AlertID: ...
    async def cancel_alert(self, alert_id: AlertID, reason: str, by_cert: RoleCertificate) -> None: ...
    async def receive_alert(self, envelope: AlertEnvelope) -> None: ...
    async def register_role(self, cert: RoleCertificate) -> None: ...
    async def revoke_role(self, cert_id: str, by_cert: RoleCertificate, reason: str) -> None: ...
    async def ack(self, alert_id: AlertID, status: AckStatus, note: str) -> AckReceipt: ...
    async def export_audit(self, ...) -> bytes: ...
```

### 4.4 Region adapter

```python
from typing import Protocol

class RegionAdapter(Protocol):
    region_code: str
    trust_roots: tuple[NodeID, ...]                          # public keys of recognised issuers
    role_schema: dict[str, RoleSpec]                         # role name → spec
    audit_retention_years: int
    # Despite the field name, this maps each role to the MAXIMUM severity it can publish.
    mandatory_severity_minimums: dict[str, AlertSeverity]

    def validate_role(self, cert: RoleCertificate) -> None: ...
    def validate_alert(self, draft: AlertDraft, publisher_cert: RoleCertificate) -> None: ...
```

`regions/nrw.py` ships the NRW taxonomy with roles drawn from real-world structure: `DE.NRW.<Kreis>.<Gemeinde>.<Org>.<Role>`, e.g. `DE.NRW.Kleve.Issum.Feuerwehr.Wehrleiter`, `DE.NRW.Kleve.THW.OV.Leiter`, `DE.NRW.Kleve.DRK.Ortsverein.Bereitschaftsleiter`, `DE.NRW.Kleve.KatS.Stabsleiter`. Each role declares the maximum severity it may publish, the geographic scope it may target, and whether it may co-sign cross-org alerts.
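
The per-role severity cap can be illustrated with a minimal sketch. It assumes that the order of the values in the `AlertSeverity` literal is the ascending severity order; the helper names `severity_rank` and `may_publish` are hypothetical and not part of the module's API.

```python
from typing import Literal, get_args

AlertSeverity = Literal["info", "advisory", "warning", "emergency", "extreme"]

# Assumption: the Literal lists severities from least to most severe.
SEVERITY_ORDER: tuple[str, ...] = get_args(AlertSeverity)


def severity_rank(severity: str) -> int:
    """Return the position of a severity level in the assumed ascending order."""
    return SEVERITY_ORDER.index(severity)


def may_publish(role_max: str, requested: str) -> bool:
    """True if a role capped at `role_max` may publish an alert at `requested`."""
    return severity_rank(requested) <= severity_rank(role_max)
```

Under this sketch, a role capped at `emergency` could publish `warning` or `emergency` but not `extreme`; a real `validate_alert` would presumably turn the negative case into `civdef_cert_out_of_scope`.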

### 4.5 Audit chain store

```python
class AuditChainStore:
    """Append-only, signed, hash-chained audit log.

    Retention is governed by config.audit_retention_years; default is 10 (NRW pragmatic baseline,
    operator must confirm against current Landesarchivgesetz at deployment time).
    """
    async def append(self, entry: AuditEntry) -> None: ...
    async def latest(self) -> AuditEntry | None: ...
    async def get_range(self, start_seq: int, end_seq: int) -> list[AuditEntry]: ...
    async def verify_chain(self, start: int = 0, end: int | None = None) -> VerifyReport: ...
    async def export(self, ...) -> bytes: ...
```

---

## 5. Behaviour

### 5.1 Role certification

Role certificates form a chain to a regional trust root. NRW's trust roots are configured at deployment time and should match published issuer keys (Innenministerium NRW, the Kreis Kleve administration, etc.). As of v3.0 these authorities *do not* publish HearthNet-compatible keys, so the pilot uses a substitute issuance ceremony in which the local Wehrleiter signs certificates after manual identity verification. A migration path to real institutional keys is documented.

A certificate may be:

- **Issued**: signed by an authority that itself chains to a trust root.
- **Active**: within validity window and not revoked.
- **Revoked**: explicitly revoked by issuer; revocation is itself signed and appended to the audit chain.
- **Expired**: past `expires_at`.

Service operations that require a role check the certificate at every invocation. Revocations propagate via federation; a node that has received a revocation must refuse delivery of any later alert signed by the revoked cert and emit `civdef.alert.dropped.revoked`.
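
A minimal sketch of the per-invocation state check follows. `CertWindow` is a hypothetical, reduced stand-in for `RoleCertificate`, the `not_yet_valid` state is an addition not named in the list above, and signature-chain verification is deliberately left out. Treating a certificate as still valid exactly at `expires_at` is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertWindow:
    """Hypothetical reduced view of a RoleCertificate: only what the state check needs."""
    cert_id: str
    issued_at: datetime
    expires_at: datetime


def cert_state(cert: CertWindow, now: datetime, revoked_ids: frozenset[str]) -> str:
    """Classify a certificate as 'revoked', 'expired', 'not_yet_valid' or 'active'."""
    if cert.cert_id in revoked_ids:
        return "revoked"            # revocation takes precedence over the time window
    if now > cert.expires_at:
        return "expired"            # past expires_at
    if now < cert.issued_at:
        return "not_yet_valid"      # assumed extra state, not in the spec's list
    return "active"
```

A caller would run this check on every operation and refuse anything other than `active`, which matches the rule that certificates are checked at every invocation rather than cached.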

### 5.2 Alert publication

```
publish_alert(draft, cert):
  1. cert.holder must equal self.identity                            → else civdef_cert_not_owned
  2. cert active, not revoked, not expired                           → else civdef_cert_invalid
  3. region.validate_role(cert)                                      → else civdef_cert_unrecognised
  4. region.validate_alert(draft, cert)  (severity / scope match)    → else civdef_cert_out_of_scope
  5. Construct Alert with publisher_role from cert.role
  6. Sign Alert with self.identity
  7. (optional) collect issuer_attestation if config requires co-sign
  8. Append to audit chain: event="published"
  9. Emit civdef.alert.published event
 10. Distribute:
       - "local"      → notifications via M11 to local subscribers
       - "push"       → mobile-native delivery via M22
       - "federation" → M14 forwarding to federated nodes matching target.region
       - "lora"       → if M29 enabled, set FLAG_PANIC on the next beacon as a presence-of-alert signal
 11. Optionally mirror to evidence graph (M30) as a claim record
 12. Return AlertID
```

If the publisher loses connectivity mid-publish, the audit-chain `published` entry has already been appended locally, so the alert is recoverable on reconnect and re-distributes from there. Redistribution is idempotent on AlertID.

### 5.3 Targeting

`AlertTarget` is a set of orthogonal filters:

- **region**: hierarchical region code; matches by prefix (`DE.NRW.Kleve` matches `DE.NRW.Kleve.Issum`).
- **bbox**: optional geographic bounding box (overrides region for the precise area).
- **roles**: empty means public; non-empty restricts visibility to certificate holders of those roles.
- **channels**: which delivery mechanisms to use.

A receiving node filters on its own identity's location, registered roles, and active subscriptions. The filter is enforced **client-side at delivery** as well as **publisher-side at distribution**, so a node that falsely claims a role is never sent role-only content (the federation forwarder uses publisher-side filtering when forwarding `roles`-restricted alerts).
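
The region prefix rule can be sketched as follows. The sketch assumes that matching is done on whole dot-separated segments rather than on raw string prefixes, so that `DE.NRW.Kleve` would not match a hypothetical `DE.NRW.Kleverland`; the function name `region_matches` is illustrative only.

```python
def region_matches(target_region: str, node_region: str) -> bool:
    """True if `target_region` equals `node_region` or is a dot-segment prefix of it."""
    target = target_region.split(".")
    node = node_region.split(".")
    # Compare segment by segment so partial segment names never match.
    return len(target) <= len(node) and node[: len(target)] == target
```

A plain `str.startswith` check would be shorter but would accept partial segment names, which is why the sketch splits on the dots first.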

### 5.4 Acknowledgements

When a role-targeted alert arrives, the recipient may ack with a status:

- `received`: read confirmation.
- `acting`: operationally taking action (e.g., Feuerwehr en route).
- `need_help`: recipient cannot act; help requested.
- `standing_down`: alert handled, recipient disengages.
- `mistaken`: the recipient believes this alert is in error; an attached `note` should explain.

Acks are signed, appended to the audit chain, and visible to the publisher via `civdef.alert.acks(alert_id)`. Public alerts (no `roles` filter) suppress acks unless `config.allow_public_ack=true`, to prevent ack floods on widely distributed alerts.

### 5.5 Cancellation

Cancellation requires a certificate with cancel scope (typically the original publisher or a same-or-higher role in the same region). A cancellation:

1. Records the cancellation in the audit chain.
2. Emits `civdef.alert.cancelled` to all original delivery channels.
3. Marks the alert inactive, so `civdef_alert_list` queries with `active_only=true` no longer return it.

The original alert is not deleted. Audit retention applies to the cancellation as well.

### 5.6 Audit chain

The audit chain is an append-only, hash-chained, signed log specific to this module. Each entry's `prev_sha` is the SHA-256 of the previous entry's canonicalised body, creating a tamper-evident chain. `verify_chain` walks from genesis (or a checkpoint) verifying signatures and hashes; failure raises `civdef_audit_chain_broken` and is surfaced as a high-priority operator notification.

Audit entries cover: alert published, alert forwarded (with federation hop), alert acked, alert cancelled, role certificate registered, role certificate revoked, audit chain checkpointed. Export produces `jsonl` (machine-readable, default) or `pdf` (operator-readable for legal review, generated via the public `pdf` skill).

Retention is governed by `CIVDEF_AUDIT_RETENTION_YEARS` (default 10). Operators must validate this against the current NRW Landesarchivgesetz at deployment; the constant is a recommendation, not the law.

### 5.7 Federation interaction

Alerts cross federation boundaries via M14. The federation manifest must declare `civdef` as an advertised capability; otherwise the alert is not forwarded into the neighbouring community. Forwarding nodes append themselves to `AlertEnvelope.federation_hops` for audit, but do not re-sign the alert (the publisher's signature is the source of truth). The receiving community independently audits the alert against its own role schemas; if the publisher's role is not recognised, the alert is delivered with a `civdef.alert.foreign_role` flag and is *not* surfaced as a high-severity push.

### 5.8 LoRa interaction

LoRa beacons (M29) carry no alert content; they carry only presence. When the local node receives an alert with `severity ∈ {emergency, extreme}` and LoRa is enabled, the node sets `FLAG_PANIC` on its next beacon and increases beacon cadence to the panic-burst configured in M29. This is a *signal* that something is happening, not a *content* channel. Receivers must consult bus or notifications for the actual alert content.

### 5.9 Failure modes

- **Publisher's cert revoked after publish, before propagation completes**: federation forwarders that have received the revocation drop the in-flight alert; nodes that have not yet seen the revocation propagate normally. Eventually consistent; documented limitation.
- **Audit chain corruption** (disk failure, manual tampering): `verify_chain` detects; the module enters degraded mode where new publishes are blocked until an operator acknowledges and re-checkpoints. Reads continue.
- **Trust root key compromise**: out of scope for v3.0 to *recover* automatically; documented incident response: revoke all certs chaining to the compromised root, rotate root, reissue.
- **Mass-ack flood**: `allow_public_ack` defaults to `false`, and acks are rate-limited per alert and per node via `CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE`.

---

## 6. Errors

| Code                              | When                                                              |
|-----------------------------------|-------------------------------------------------------------------|
| `experimental_disabled`           | Capability called with the flag off                               |
| `civdef_cert_not_owned`           | Publish/ack with a cert whose holder ≠ caller's identity          |
| `civdef_cert_invalid`             | Certificate expired, revoked, or signature broken                 |
| `civdef_cert_unrecognised`        | Issuer chain doesn't terminate at a configured trust root         |
| `civdef_cert_out_of_scope`        | Cert's role/region doesn't authorise the requested action         |
| `civdef_alert_not_found`          | Operation references an unknown AlertID                           |
| `civdef_alert_target_invalid`     | Target region/bbox malformed or outside the issuer's scope        |
| `civdef_audit_chain_broken`       | Hash or signature mismatch in the audit chain                     |
| `civdef_role_revoked`             | Operation attempted with a revoked certificate                    |
| `civdef_region_unsupported`       | No region adapter loaded for the requested region                 |
| `civdef_ack_rate_limited`         | Ack rate exceeded for this alert from this node                   |

---

## 7. Configuration

```python
from dataclasses import dataclass, field

from hearthnet.constants import (
    CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE,
    CIVDEF_AUDIT_RETENTION_YEARS,
)

@dataclass(frozen=True)
class CivDefConfig:
    enabled: bool = False
    region: str = "DE.NRW"
    audit_retention_years: int = CIVDEF_AUDIT_RETENTION_YEARS                # 10
    require_issuer_cosign: dict[AlertSeverity, bool] = field(default_factory=lambda: {
        "info": False, "advisory": False, "warning": False,
        "emergency": True, "extreme": True,
    })
    allow_public_ack: bool = False
    ack_max_per_minute_per_node: int = CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE    # 5
    federation_forward: bool = True
    lora_panic_signal: bool = True
    severity_push_threshold: AlertSeverity = "warning"      # below this, no mobile push
    trust_roots_extra: tuple[NodeID, ...] = ()              # operator-added roots
    region_adapter_overrides: dict[str, str] = field(default_factory=dict)
```

Constants are centralised in `hearthnet/constants.py`.

---

## 8. Tests

### 8.1 Unit

- `test_role_cert_chain_to_root`: cert with valid chain → accepted; broken chain → rejected.
- `test_role_cert_expired`: past `expires_at` → `civdef_cert_invalid`.
- `test_alert_signature_roundtrip`.
- `test_target_region_prefix_match`: `DE.NRW.Kleve` matches `DE.NRW.Kleve.Issum`, not `DE.NRW.Wesel`.
- `test_audit_chain_link`: appending entries chains correctly; `verify_chain` returns ok.
- `test_audit_chain_tamper_detected`: flip a byte in the middle; `verify_chain` reports the break.
- `test_severity_cap_per_role`: Wehrleiter publishing `extreme` → `civdef_cert_out_of_scope` if schema caps at `emergency`.
- `test_revocation_propagates`: revoke cert; subsequent alerts from that cert dropped.

### 8.2 Integration

- Two-node alert flow: node A (Wehrleiter cert) publishes `warning` alert targeting `DE.NRW.Kleve.Issum`; node B (resident in Issum, no cert) receives via M11 push.
- Role-targeted alert: A publishes alert with `roles=("DE.NRW.Kleve.THW.OV.Leiter",)`; B (without cert) does not receive; C (with cert) does.
- Federation: A publishes in community X; X federates to Y; Y's resident D receives with `federation_hops=[X]`.
- Cancellation: A cancels; B's alert list moves it to inactive.
- Audit export: publish, ack, cancel; export `jsonl`; round-trip parses and `verify_chain` passes.

### 8.3 Negative / adversarial

- Forged cert chain (random issuer key) → `civdef_cert_unrecognised`.
- Targeting `DE.BY` (outside NRW) from an NRW-only cert → `civdef_alert_target_invalid`.
- Ack flood beyond rate limit → `civdef_ack_rate_limited`.
- Tampered audit chain → publish blocked until operator re-checkpoint.

### 8.4 Tabletop

- Manual scenarios with Issum Feuerwehr volunteers: simulated Hochwasser event, simulated grid outage, simulated industrial incident on the A57. Goals: latency from alert publication to first ack, false-positive ack rate, operator-perceived clarity of UI under stress.

---

## 9. Cross-references

- **Phase 1 M01 Identity**: every cert, alert, ack, and audit entry is signed against M01 identities.
- **Phase 1 M11 Notifications**: alerts surface via notifications with priority mapped from `severity`.
- **Phase 2 M14 Federation**: alerts cross community boundaries via federation.
- **Phase 2 M16 Tokens**: cert validation reuses M16's signature primitives; alert distribution endpoints require `civdef-receive` scoped tokens.
- **Phase 2 M22 Mobile Native**: mobile push for `severity ≥ severity_push_threshold`.
- **Phase 3 M29 LoRa Beacons**: `FLAG_PANIC` corroboration during emergencies.
- **Phase 3 M30 Evidence**: alerts may carry an `evidence_claim` ClaimID; recipients can `evidence.provenance.trace` to see the reasoning chain.
- **Phase 3 X09 Conformance Suite**: civdef has a dedicated conformance section because of audit-chain integrity requirements.

---

## 10. Open research questions

1. **Real institutional keys.** v3.0 uses substitute issuance because NRW authorities do not (yet) publish HearthNet-compatible keys. The migration path, which means getting the Innenministerium or Kreis Kleve to publish keys and sign initial role certs, is a political process, not a technical one. Documented; out of code scope.

2. **NINA / KATWARN bridge.** A read-only mirror that pulls public NINA alerts and republishes them locally with a `correlation_id` is plausible and would be valuable. Whether it's M31's job or a separate bridge module is undecided.

3. **Multi-Land schema.** The NRW role taxonomy is concrete; Bayern, Niedersachsen, Hessen each have variations (especially around KatS structures). A community-contributed `regions/` directory is the plan; v3.0 ships only NRW.

4. **Co-signing UX.** When `require_issuer_cosign` is true for a severity (the default for `emergency` and `extreme`), the publisher must obtain a co-signature from a higher-tier issuer, which is latency-sensitive. A pre-delegated "emergency co-sign authority" mechanism (similar to OCSP stapling for certs) is the obvious extension. Not in v3.0.

5. **Public-ack ergonomics.** Public alerts with `allow_public_ack=true` would let citizens self-report ("I am safe", "I need help"), but the failure modes (ack flood, false reports) are severe enough that v3.0 defaults this off. A future tier with rate limits and ack-content moderation is plausible.

6. **Legal retention.** `CIVDEF_AUDIT_RETENTION_YEARS=10` is the operator-friendly default. Actual legal retention varies (NRW Landesarchivgesetz, federal data retention rules for civil-defence records, GDPR exceptions for vital interests). The deployment guide must explicitly walk operators through this; we cannot guess from code.

7. **Cross-border alerts.** Issum borders the Netherlands. An alert about a Dutch industrial incident might originate from a Dutch system. Cross-border interop is interesting and outside v3.0 scope. The `region` adapter pattern doesn't preclude it.

8. **Drills and false-alarm semantics.** A drill should look real enough to be useful and clearly different enough to not panic non-participants. A `drill=true` flag on Alert is the obvious addition; v3.0 omits it pending feedback from real drill rehearsals.

---

*Last updated: spec v3.0.*

docs/p2_p3/M32-protocol-standard.md
ADDED
|
@@ -0,0 +1,323 @@
# M32: Protocol Standardisation & Conformance

**Spec version:** v3.0 (*experimental*)
**Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M14 Federation](../../phase-2/modules/M14-federation.md), [X09 Conformance Suite](../cross-cutting/X09-conformance-suite.md)
**Depended on by:** anyone building an alternate implementation of HearthNet

---

## 1. Responsibility

Turn the HearthNet specs from "Christof's working code with documentation" into something a **second team** could implement compatibly. This module is the bookkeeping for that: a versioned protocol document set, a conformance reporting capability, governance for how the spec changes, and a registry of known implementations and their conformance levels.

The premise is that, in the long term, a single implementation is fragile. If HearthNet only works as long as one person maintains it, it doesn't survive. Standardising the protocol (the wire formats, the capability contracts, the federation semantics) so other implementations can interoperate is the path to durability, and it sets up the social and legal scaffolding (versioning policy, change process, conformance claims) that other projects will need to take HearthNet seriously.

This is not "rewrite HearthNet as an RFC". It's "package what already exists in a form that can be cited, versioned, and conformance-tested". The reference implementation is HearthNet itself; the protocol document is the contract; the conformance suite (X09) is the proof.

---

## 2. Non-goals

- **Submitting to IETF / W3C in v3.0.** That's a multi-year governance process and is out of scope. We make ourselves *ready* for it by structuring the documents and adopting RFC-style versioning conventions, but we don't file anything.
- **Patent licensing.** We assume there are no patented techniques in HearthNet's core (the capability bus, event log, federation, and so on all build on well-known primitives). We document this assumption but do not run a patent review.
- **Trademark of "HearthNet".** Out of scope. If the protocol survives, someone (probably ki-fusion-labs.de) eventually claims the name; v3.0 doesn't deal with that.
- **Conformance certification with a fee structure.** Conformance reports are self-published and free. No "HearthNet Inside™ certification programme".
- **Backward compatibility forever.** Protocol versions can have breaking changes between major versions, with documented migration. Within a minor version, compatibility is required.

---

## 3. File layout

The module is unusual in that most of its content lives *outside* the `hearthnet/` Python package, in a top-level `protocol/` directory at the repository root.

```
protocol/                              # repo-root sibling of hearthnet/
├── README.md
├── VERSION                            # current protocol version (e.g. 3.0.0)
├── CHANGELOG.md
├── governance.md                      # change process, decision rights
├── versioning.md                      # semver-with-twists rules
├── reference-implementations.md       # registry
├── core/
│   ├── 01-identity-and-addressing.md
│   ├── 02-transport.md
│   ├── 03-capability-bus.md
│   ├── 04-event-log.md
│   ├── 05-tokens.md
│   ├── 06-federation.md
│   └── ...
└── experimental/
    ├── 30-evidence.md
    └── 31-civil-defense.md

hearthnet/protocol/
├── __init__.py
├── service.py                         # ProtocolService: registry and report capability
├── registry.py                        # In-memory + persisted registry of known impls
└── report.py                          # Conformance report dataclass + serialiser
```

The `protocol/` directory is the **specification artefact**: versioned alongside code but conceptually independent. The Python `hearthnet/protocol/` module is the thin runtime surface that lets a HearthNet node expose its conformance information on the bus.

---

## 4. Public API
|
| 65 |
+
|
| 66 |
+
### 4.1 Dataclasses
|
| 67 |
+
|
| 68 |
+
```python
|
| 69 |
+
@dataclass(frozen=True)
|
| 70 |
+
class ProtocolVersion:
|
| 71 |
+
major: int
|
| 72 |
+
minor: int
|
| 73 |
+
patch: int
|
| 74 |
+
suffix: str = "" # "", "rc1", "experimental"
|
| 75 |
+
|
| 76 |
+
def __str__(self) -> str: ... # "3.0.0" or "3.0.0-rc1"
|
| 77 |
+
def is_compatible_with(self, other: ProtocolVersion) -> bool: ... # same major
|
| 78 |
+
|
| 79 |
+
@dataclass(frozen=True)
|
| 80 |
+
class ImplementationDescriptor:
|
| 81 |
+
name: str # e.g. "hearthnet-reference"
|
| 82 |
+
vendor: str # e.g. "ki-fusion-labs.de"
|
| 83 |
+
version: str # implementation version, e.g. "0.4.2"
|
| 84 |
+
protocol_versions: tuple[ProtocolVersion, ...] # which protocol versions supported
|
| 85 |
+
homepage_url: str | None
|
| 86 |
+
contact: str | None
|
| 87 |
+
|
| 88 |
+
@dataclass(frozen=True)
|
| 89 |
+
class ConformanceReport:
|
| 90 |
+
implementation: ImplementationDescriptor
|
| 91 |
+
protocol_version: ProtocolVersion
|
| 92 |
+
suite_version: str # X09 conformance suite version
|
| 93 |
+
ran_at: datetime
|
| 94 |
+
sections: dict[str, SectionResult] # section name β result
|
| 95 |
+
overall: Literal["pass","fail","partial","skipped"]
|
| 96 |
+
signature: bytes # implementation signs its own report
|
| 97 |
+
|
| 98 |
+
@dataclass(frozen=True)
|
| 99 |
+
class SectionResult:
|
| 100 |
+
name: str
|
| 101 |
+
total: int
|
| 102 |
+
passed: int
|
| 103 |
+
failed: int
|
| 104 |
+
skipped: int
|
| 105 |
+
failures: tuple[FailureDetail, ...]
|
| 106 |
+
```
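
The `ProtocolVersion` stubs above leave the method bodies open. The following is a minimal sketch of how they might be filled in, assuming the version grammar is `MAJOR.MINOR.PATCH` with an optional `-suffix`. The `parse` classmethod and the regular expression are illustrative additions, not part of the specified dataclass.

```python
import re
from dataclasses import dataclass

# Assumed grammar: MAJOR.MINOR.PATCH with an optional "-suffix".
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.]+))?")


@dataclass(frozen=True)
class ProtocolVersion:
    major: int
    minor: int
    patch: int
    suffix: str = ""

    @classmethod
    def parse(cls, text: str) -> "ProtocolVersion":
        """Hypothetical helper: build a version from its string form."""
        match = _VERSION_RE.fullmatch(text)
        if match is None:
            raise ValueError(f"not a protocol version: {text!r}")
        major, minor, patch, suffix = match.groups()
        return cls(int(major), int(minor), int(patch), suffix or "")

    def __str__(self) -> str:
        base = f"{self.major}.{self.minor}.{self.patch}"
        return f"{base}-{self.suffix}" if self.suffix else base

    def is_compatible_with(self, other: "ProtocolVersion") -> bool:
        # Compatibility is decided by the major version alone.
        return self.major == other.major
```

Under this sketch a `3.0.0` node and a `3.2.0` node compare as compatible, while `2.x` and `3.x` do not.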

### 4.2 Capabilities

```python
async def protocol_version_list() -> list[ProtocolVersion]: ...
async def protocol_self_describe() -> ImplementationDescriptor: ...
async def protocol_conformance_report(suite_version: str | None = None) -> ConformanceReport: ...
async def protocol_registry_list() -> list[ImplementationDescriptor]: ...
async def protocol_registry_announce(descriptor: ImplementationDescriptor) -> AnnounceReceipt: ...
```

These are stable (non-experimental) capabilities: the protocol must include its own self-description and conformance-reporting capabilities, otherwise interop is impossible.

### 4.3 Service class

```python
class ProtocolService:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 federation: FederationService,
                 registry: ImplementationRegistry,
                 conformance_runner: ConformanceRunner | None,
                 config: ProtocolConfig): ...

    def supported_versions(self) -> list[ProtocolVersion]: ...
    def self_descriptor(self) -> ImplementationDescriptor: ...
    async def run_conformance(self, suite_version: str | None = None) -> ConformanceReport: ...
    async def announce(self, descriptor: ImplementationDescriptor) -> None: ...  # to local registry + federation
    async def registry_list(self) -> list[ImplementationDescriptor]: ...
```

### 4.4 Implementation registry

```python
class ImplementationRegistry:
    """Local registry of known implementations.

    Populated by:
    - self (this node, on startup)
    - federation peers' announcements
    - operator-curated additions
    """
    async def upsert(self, descriptor: ImplementationDescriptor, source: NodeID) -> None: ...
    async def list(self) -> list[ImplementationDescriptor]: ...
    async def known_by_name(self, name: str) -> list[ImplementationDescriptor]: ...
```

---

## 5. Behaviour

### 5.1 Versioning policy

Protocol versions follow **semver with explicit stability tiers**:

- **Major** (`X.0.0`): breaking changes to wire formats or capability contracts. A new major version requires explicit migration documentation. Old majors remain readable for migration purposes for at least 2 years.
- **Minor** (`X.Y.0`): additive changes only, such as new capabilities, new event types, and new optional fields. Backward compatibility within the major is required: a `3.0` impl talking to a `3.2` impl must work, with the `3.0` side ignoring new fields/events.
- **Patch** (`X.Y.Z`): clarification, typo fixes, no functional change.
- **Suffix** `-experimental`: capabilities in the `experimental.*` namespace; can change without bumping major.

Each protocol document carries a frontmatter `protocol-version: 3.X.Y` that names the earliest version that contains it. The `protocol/VERSION` file holds the current latest version. The `CHANGELOG.md` lists every diff between versions.

### 5.2 Stability tiers

Capabilities are tagged with one of:

- `stable`: frozen at the major version; any change is a breaking change.
- `provisional`: expected to become stable; minor-version breaking changes allowed with deprecation period.
- `experimental`: `experimental.*` namespace; may change or vanish.

The `protocol_self_describe` capability reports which capabilities the implementation supports and at which tier. A capability marked `stable` in the protocol but implemented at `experimental` tier in the node is a configuration error and is logged at startup.
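
The startup check described above could look roughly like the sketch below. The `Tier` enum, the `find_tier_mismatches` helper, and the capability names in the usage example are all hypothetical; the spec only states that the mismatch is a configuration error that is logged at startup.

```python
import logging
from enum import Enum

logger = logging.getLogger("hearthnet.protocol")


class Tier(Enum):
    STABLE = "stable"
    PROVISIONAL = "provisional"
    EXPERIMENTAL = "experimental"


def find_tier_mismatches(spec_tiers: dict, node_tiers: dict) -> list:
    """Return capabilities the spec marks stable but the node runs as experimental.

    Both arguments map capability name -> Tier. Capabilities the node does not
    implement are skipped. A stable capability implemented as provisional is
    not flagged, because the spec text does not say how to treat that case.
    """
    mismatches = []
    for name in sorted(spec_tiers):
        node_tier = node_tiers.get(name)
        if node_tier is None:
            continue
        if spec_tiers[name] is Tier.STABLE and node_tier is Tier.EXPERIMENTAL:
            logger.error("capability %s is stable in the spec but experimental here", name)
            mismatches.append(name)
    return mismatches
```

For example, with a spec table of `{"protocol.self_describe": Tier.STABLE}` and a node table that lists the same capability as `Tier.EXPERIMENTAL`, the function logs one error and returns `["protocol.self_describe"]`.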

### 5.3 Conformance reporting

A `ConformanceReport` is the artefact produced by running the X09 conformance suite against a running node. The report is:

- **Self-signed** by the implementation. There is no central authority that "certifies" reports.
- **Reproducible**: the suite version is in the report; running the same suite against the same impl should produce equivalent results modulo timestamps.
- **Public**: implementations are encouraged to publish their reports openly (in their repo, on their website, federated via the registry).
- **Honest about partial conformance**: `partial` is a valid outcome and is more useful than a misleading `pass`.

Reports do not expire, but a report from suite version `1.0.0` is not equivalent to a report from `2.0.0`. The X09 suite versions independently of the protocol.
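
How the `overall` field is derived from the per-section tallies is not spelled out in this section. The sketch below shows one plausible aggregation rule, purely as an assumption: nothing ran gives `skipped`, no failures gives `pass`, no passes gives `fail`, and any mix gives `partial`. `SectionTally` is a trimmed, hypothetical stand-in for `SectionResult`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionTally:
    """Hypothetical, reduced form of SectionResult used only in this sketch."""
    name: str
    passed: int
    failed: int
    skipped: int


def summarise(sections: list) -> str:
    """Collapse section tallies into one of pass / fail / partial / skipped."""
    passed = sum(s.passed for s in sections)
    failed = sum(s.failed for s in sections)
    if passed == 0 and failed == 0:
        return "skipped"   # nothing was actually executed
    if failed == 0:
        return "pass"
    if passed == 0:
        return "fail"
    return "partial"       # some checks pass, some fail
```

An implementation that passes `core` but fails a federation section would report `partial` under this rule, which matches the intent that `partial` is preferable to a misleading `pass`.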

### 5.4 Implementation registry & federation

Each HearthNet node announces its `ImplementationDescriptor` to its federation peers on connect. Peers add it to their local registry. Operators can query their local registry for "who else is out there" via `protocol_registry_list`.

The registry is *advisory*. There is no trust beyond "this peer claimed this descriptor at this time". A peer claiming to be an implementation it isn't is a security incident, but the registry doesn't authenticate vendor names; it only verifies that the node signed its descriptor.

### 5.5 Governance (documented, not enforced)

The `protocol/governance.md` document describes how protocol changes happen:

1. **Proposal**: any contributor writes a "change note" as a PR to `protocol/`. It includes motivation, exact spec diff, migration story, and conformance impact.
2. **Discussion**: open period (default 4 weeks) for review.
3. **Decision**: maintainers (initially just Christof, ideally expanding to a small group) accept, reject, or request revision. Rejections are logged with rationale.
4. **Merge**: accepted change merges to `protocol/main` with version bump per §5.1 rules.
5. **Release**: tagged release of the protocol document set independent of code releases.

This is a process document; the module does not *enforce* governance technically. Enforcement is social.

### 5.6 Reference implementations registry

`protocol/reference-implementations.md` is a living document listing known implementations. v3.0 entries:

- **hearthnet-reference** (Python, this codebase). Status: complete, all stable + provisional + experimental capabilities.
- (placeholder for a second impl when one exists).

Adding an implementation to this document requires a PR demonstrating at minimum a passing conformance report for `core` sections of the X09 suite at the current major version.

### 5.7 Migration documentation

Each major version transition ships a `protocol/migration/X-to-Y.md` document. v3.0 includes:

- `protocol/migration/2-to-3.md`: placeholder for now; v3.0 introduces only experimental capabilities, so a strict v2→v3 migration is effectively a no-op for stable code paths.

### 5.8 Failure modes

- **Mismatched protocol versions in federation**: handled by M14 federation manifest version negotiation. M32 itself doesn't intervene at runtime; it just reports.
- **Conformance suite not present**: `protocol.conformance.report` returns `skipped` overall with reason `suite_not_installed`.
- **Conflicting registry entries** (same `name` claimed by two distinct vendors): both stored; the registry list returns both; operators decide.
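
The conflicting-entries rule can be illustrated with a small in-memory registry. This is a sketch under the assumption that entries are keyed on `(name, vendor)`. `Descriptor` and `InMemoryRegistry` are hypothetical, synchronous stand-ins for `ImplementationDescriptor` and the async `ImplementationRegistry`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical, reduced form of ImplementationDescriptor."""
    name: str
    vendor: str
    version: str


class InMemoryRegistry:
    def __init__(self) -> None:
        # Assumed key: (name, vendor). Two vendors claiming one name coexist.
        self._entries = {}

    def upsert(self, descriptor: Descriptor) -> None:
        # Re-announcing the same (name, vendor) overwrites, so upsert is idempotent.
        self._entries[(descriptor.name, descriptor.vendor)] = descriptor

    def known_by_name(self, name: str) -> list:
        return [d for d in self._entries.values() if d.name == name]

    def list(self) -> list:
        return list(self._entries.values())
```

With this keying, a second vendor announcing `hearthnet-reference` does not displace the first; `known_by_name` returns both and the operator decides which to trust.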

---

## 6. Errors

| Code                              | When                                                       |
|-----------------------------------|------------------------------------------------------------|
| `protocol_version_unknown`        | Operation references a protocol version not in our table   |
| `protocol_suite_not_installed`    | Conformance report requested but X09 not available         |
| `protocol_descriptor_invalid`     | Announcement with malformed descriptor                     |
| `protocol_unsupported_capability` | Federation negotiation finds no compatible major version   |

---

## 7. Configuration

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProtocolConfig:
    enabled: bool = True                                 # this one is enabled by default
    supported_versions: tuple[str, ...] = ("3.0.0",)
    default_announce_version: str = "3.0.0"
    descriptor: ImplementationDescriptor = field(default_factory=lambda: ImplementationDescriptor(
        name="hearthnet-reference",
        vendor="ki-fusion-labs.de",
        version="0.4.2",
        protocol_versions=(ProtocolVersion(3, 0, 0),),
        homepage_url="https://ki-fusion-labs.de/hearthnet",
        contact=None,
    ))
    announce_to_federation: bool = True
    conformance_auto_run_on_startup: bool = False
    registry_max_entries: int = 4096
```

---

## 8. Tests

### 8.1 Unit

- `test_protocol_version_compat`: same major compatible; different major not.
- `test_descriptor_signature_roundtrip`.
- `test_registry_upsert_idempotent`.
- `test_conformance_report_signed`: self-signed report's signature verifies against the implementation's identity.
- `test_protocol_version_parse`: `"3.0.0"`, `"3.0.0-rc1"`, `"3.0.0-experimental"` parse correctly.

### 8.2 Integration

- Two nodes (same impl, same version): announce → registry shows both.
- Two nodes (same impl, different versions): registry shows both; federation negotiates highest compatible.
- Run conformance suite against the reference impl: must pass `core/*` sections by definition (the suite is built to match the spec).

### 8.3 Spec-document tests

- `test_protocol_documents_present`: every protocol document referenced in `protocol/README.md` exists.
- `test_protocol_version_consistent`: `protocol/VERSION` matches the `default_announce_version` in code.
- `test_changelog_format`: `protocol/CHANGELOG.md` parses as a sequence of versioned entries with semver-valid ordering.

### 8.4 Negative

- Malformed descriptor → `protocol_descriptor_invalid`.
- Federation peer announces protocol `99.0.0` → registry stores it, federation negotiation declines.

---

## 9. Cross-references

- **All Phase 1, 2, 3 modules**: they collectively *are* the protocol. M32's job is the meta-layer.
- **Phase 2 M14 Federation**: federation manifest carries protocol version; negotiation uses M32's compat rules.
- **Phase 3 X09 Conformance Suite**: produces the data that M32's report capability surfaces.

---

## 10. Open research questions

1. **Independent implementation.** The protocol is real only when a second implementation exists. v3.0 ships only the reference. A small "minimal HearthNet" written in Go or Rust as a contrast implementation would prove the spec is implementable from the documents alone. Concrete next step, but not in v3.0.

2. **Formal verification of wire formats.** TLS-style formal proofs of the federation handshake and capability-bus dispatch would be valuable. Out of v3.0 scope; documented as a research direction.

3. **Governance bootstrapping.** "Christof decides" is fine for now and honest about the project's state. Transitioning to a multi-maintainer model needs a path: a Tech Steering Committee, a foundation, or simply a documented succession plan. Currently undefined.

4. **Standards-body engagement.** If the protocol matures, IETF (for federation/transport) and W3C (for capability-bus semantics if they look RPC-like) are plausible homes. v3.0 deliberately avoids premature standards engagement; the bar is "second implementation exists and is interoperable".

5. **Legal entity.** ki-fusion-labs.de is currently the operating entity. Whether a separate legal entity (e.V., foundation) is needed for a multi-vendor protocol is a real question. Out of code scope.

6. **Trademark and naming.** "HearthNet" as a trademark is undefined; the protocol could be renamed to something more obviously generic at standardisation time. The reference impl can keep the name.

7. **Optionality flags vs separate profiles.** A node might support `core` only, or `core + federation`, or everything. Whether to model this as per-capability optionality (current approach) or named profiles (`HearthNet-Core`, `HearthNet-Federated`, `HearthNet-Civdef`) is a design question that needs feedback from a second impl team.

8. **Conformance suite drift.** The X09 suite is the source of truth for "what does conformance mean"; the protocol documents describe the *intent*. When the two disagree, currently the suite wins (because it's executable). This is pragmatic but not principled. A future version may flip this and use the suite to *test* the documents, with the documents as primary.

---

*Last updated: spec v3.0.*
docs/p2_p3/X05-dht.md
ADDED
@@ -0,0 +1,374 @@
# X05 - Distributed Hash Table (DHT)

**Spec version:** v1.0 (Phase 2)
**Depends on:** M01 (identity), X01 (transport), X04 (config), X03 (observability)
**Depended on by:** M14 (federation discovery), M07 ext (background blob replication via content routing), M02 ext (cross-LAN peer discovery), M15 (relay bootstrap)

---

## 1. Responsibility

Provide a Kademlia-style DHT over the internet that lets:

- A node find peers of its own community across LANs (cross-LAN extension of M02)
- A node find sources of a specific CID across communities (extension of M07's local source index)
- A node bootstrap into a federation (find an anchor of community X without knowing its IP)

Out of scope:

- Permanent storage in the DHT: DHT entries are TTL'd advertisements only
- Anonymity (no onion routing; there's no anonymity goal)
- Sybil resistance: communities are the trust roots; DHT is an unreliable hint layer

---

## 2. File layout

```
hearthnet/dht/
├── __init__.py
├── kademlia.py     # KademliaNode, routing table
├── routing.py      # FindNode, FindValue, Store, Ping RPCs
├── storage.py      # local DHT k/v store with TTL
└── bootstrap.py    # bootstrap peer list, NAT-aware reachability
```

---

## 3. Concepts

### 3.1 Key space

XOR-distance over a 256-bit key space. Keys are derived as:

- For peers: `key = blake3(node_id_full)[:32]`
- For CIDs: `key = blake3(cid_string)[:32]`
- For communities: `key = blake3(community_id_full)[:32]`
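
The key derivation and the XOR metric can be sketched as below. One loud assumption: BLAKE3 is not in the Python standard library, so this example substitutes `hashlib.blake2b` with a 32-byte digest purely to stay self-contained. A real node would call a BLAKE3 binding instead. The helper names `derive_key`, `xor_distance`, and `k_closest` are illustrative.

```python
import hashlib


def derive_key(identifier: str) -> bytes:
    """Map a node ID, CID, or community ID to a 32-byte DHT key.

    Stand-in hash: the spec calls for blake3; blake2b is used here only
    because it ships with Python.
    """
    return hashlib.blake2b(identifier.encode("utf-8"), digest_size=32).digest()


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance: the two keys XORed and read as a big-endian integer."""
    if len(a) != len(b):
        raise ValueError("keys must have the same length")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def k_closest(target: bytes, candidates: list, k: int = 8) -> list:
    """Return the k candidate keys closest to target under the XOR metric."""
    return sorted(candidates, key=lambda c: xor_distance(target, c))[:k]
```

The metric is symmetric and zero only for identical keys, which is what lets every node agree on which peers are "closest" to a given key.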

### 3.2 Bucket structure

Standard Kademlia: 256 buckets of size `DHT_REPLICATION_K = 8` (from Phase 2 constants). Concurrent lookups: `DHT_ALPHA = 3`.
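
In standard Kademlia, the bucket a contact belongs to is determined by the position of the highest differing bit between the two keys. The sketch below shows that mapping for 256-bit keys. The function name `bucket_index` is an illustrative choice, not taken from the codebase.

```python
from typing import Optional


def bucket_index(own_key: bytes, other_key: bytes) -> Optional[int]:
    """Return the bucket (0..255) that other_key falls into, or None for our own key.

    Bucket i holds contacts whose XOR distance d satisfies 2**i <= d < 2**(i + 1),
    so the index is the position of the most significant set bit of d.
    """
    distance = int.from_bytes(own_key, "big") ^ int.from_bytes(other_key, "big")
    if distance == 0:
        return None  # a node does not store itself in its routing table
    return distance.bit_length() - 1
```

A key that differs only in its last bit lands in bucket 0, and one that differs in its first bit lands in bucket 255, so far-away regions of the key space are covered by few buckets and nearby regions by many.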

### 3.3 Values stored

The DHT does **not** store community state. It stores small, signed advertisements:

| Value type | Key | Value (signed) | TTL |
|------------|-----|----------------|-----|
| Peer presence | `blake3(node_id)` | `{endpoints, community_id, expires_at}` | matches manifest TTL (30s) |
| CID source | `blake3(cid)` | `{node_id, last_seen}` | 1 hour |
| Community bootstrap | `blake3(community_id)` | `{anchor_node_ids, endpoints, manifest_url}` | 24 hours |

The DHT is a **hint cache**. Authoritative state lives in community event logs.

---

## 4. Public API

### 4.1 `kademlia.py`

```python
# hearthnet/dht/kademlia.py
from dataclasses import dataclass

@dataclass(frozen=True)
class DhtContact:
    node_key: bytes              # 32 bytes
    node_id_full: str
    endpoint: Endpoint
    last_seen: float

@dataclass(frozen=True)
class DhtValue:
    """A stored advertisement. The payload is a signed dict."""
    key: bytes
    payload: dict                # has 'signature' field
    expires_at: int              # unix seconds

class KademliaNode:
    """One node's view of the DHT.
    Provides high-level find_node / find_value / store APIs."""

    def __init__(
        self,
        kp: KeyPair,
        endpoint: Endpoint,
        transport_client: HttpClient,
        bootstrap_endpoints: list[Endpoint],
    ):
        ...

    async def start(self) -> None:
        """Bootstrap: ping bootstrap_endpoints, populate routing table."""

    async def stop(self) -> None: ...

    # --- public lookups ---

    async def find_node(self, target_key: bytes) -> list[DhtContact]:
        """Return the k closest contacts to target_key."""

    async def find_value(self, key: bytes) -> list[DhtValue]:
        """Return values stored at this key (or empty)."""

    async def store(self, value: DhtValue) -> int:
        """Replicate to k closest nodes. Returns count of successful stores."""

    # --- maintenance ---

    async def refresh_buckets(self) -> None:
        """Per DHT_REFRESH_SECONDS: ping a random key in each bucket to liveness-check it."""

    async def republish_values(self) -> None:
        """Per DHT_REPUBLISH_SECONDS: re-store our own advertisements so TTL doesn't expire them."""

    # --- introspection ---

    def routing_table_size(self) -> int: ...
    def stored_values(self) -> int: ...
```

### 4.2 `routing.py`

Wire RPCs exposed by the bus transport (X01) as additional endpoints under `/dht/v1/`:

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/dht/v1/ping` | POST | Liveness, exchange contact info |
| `/dht/v1/find_node` | POST | Return k closest contacts to a key |
| `/dht/v1/find_value` | POST | Return values at a key OR closest contacts if absent |
| `/dht/v1/store` | POST | Accept a value into local storage (if we're among the k closest) |

```python
# hearthnet/dht/routing.py
async def serve_ping(req: dict) -> dict: ...
async def serve_find_node(req: dict, kademlia: KademliaNode) -> dict: ...
async def serve_find_value(req: dict, kademlia: KademliaNode) -> dict: ...
async def serve_store(req: dict, kademlia: KademliaNode) -> dict: ...

# Request / response shapes documented inline in each function.
```

### 4.3 `storage.py`

```python
# hearthnet/dht/storage.py
class DhtStore:
    """Local key-value store with TTL eviction.
    Backing: SQLite in <DATA>/dht/store.sqlite."""

    def __init__(self, db_path: Path):
        ...

    def put(self, value: DhtValue) -> bool:
        """Idempotent. Returns True if stored (we're in k closest), False if rejected."""

    def get(self, key: bytes) -> list[DhtValue]:
        """Return non-expired values for this key."""

    def evict_expired(self) -> int: ...

    def size(self) -> int: ...
```
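
To make the TTL semantics of the store concrete, here is a minimal SQLite-backed sketch. It is not the `DhtStore` implementation: the table layout, the class name `TtlStore`, and the explicit `now` parameter (added so expiry can be exercised deterministically) are assumptions, and the "are we among the k closest" admission check is left out.

```python
import json
import sqlite3
import time
from typing import Optional


class TtlStore:
    """Hypothetical TTL'd key/value table, loosely modelled on DhtStore."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS dht_values ("
            " key BLOB NOT NULL, payload TEXT NOT NULL, expires_at INTEGER NOT NULL,"
            " PRIMARY KEY (key, payload))"
        )

    def put(self, key: bytes, payload: dict, expires_at: int) -> None:
        # Canonical JSON makes re-storing the same advertisement idempotent:
        # the row is replaced and only its expiry moves.
        body = json.dumps(payload, sort_keys=True)
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO dht_values (key, payload, expires_at) VALUES (?, ?, ?)",
                (key, body, expires_at),
            )

    def get(self, key: bytes, now: Optional[int] = None) -> list:
        now = int(time.time()) if now is None else now
        rows = self._db.execute(
            "SELECT payload FROM dht_values WHERE key = ? AND expires_at > ? ORDER BY payload",
            (key, now),
        ).fetchall()
        return [json.loads(row[0]) for row in rows]

    def evict_expired(self, now: Optional[int] = None) -> int:
        now = int(time.time()) if now is None else now
        with self._db:
            cursor = self._db.execute("DELETE FROM dht_values WHERE expires_at <= ?", (now,))
        return cursor.rowcount
```

Filtering on `expires_at` in `get` means a stale row is never returned even if `evict_expired` has not run yet; eviction only reclaims space.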

### 4.4 `bootstrap.py`

```python
# hearthnet/dht/bootstrap.py

DEFAULT_BOOTSTRAP_NODES: list[Endpoint] = [
    # Filled at packaging time with community-run bootstrap endpoints.
    # Christof's relay.hearthnet.de will be a default.
]

async def is_reachable(endpoint: Endpoint, timeout_seconds: float = 5) -> bool:
    """Send a ping; return True if responded."""

async def discover_external_ip() -> str | None:
    """Use STUN against a public STUN server to learn our external IP.
    Used by relay-assisted bootstrap to advertise reachable endpoints."""
```

---

## 5. Behaviour

### 5.1 Advertisement lifecycle

```
Node starts → KademliaNode.start()
            → ping bootstrap_endpoints; build initial routing table
            → store our peer presence: store(DhtValue(blake3(node_id), {...}, ttl=30s))
            → store community bootstrap: store(DhtValue(blake3(community_id), {anchors, ...}, ttl=24h))
            → for each pinned CID: store(DhtValue(blake3(cid), {node_id, ...}, ttl=1h))
            ↓
Every MANIFEST_REPUBLISH_INTERVAL_SECONDS: re-store peer presence
Every DHT_REPUBLISH_SECONDS: re-store all our advertisements
Every DHT_REFRESH_SECONDS: refresh routing table buckets
```

### 5.2 Lookup integration with M02

When [M02 PeerRegistry](../../modules/M02-discovery.md) doesn't find a peer for a known community on the LAN:

```python
# M02 extension (Phase 2)
async def find_remote_peers(community_id: str) -> list[PeerRecord]:
    if dht is None:
        return []
    # find_value returns stored advertisements (DhtValue), not contacts
    values = await dht.find_value(blake3(community_id))
    candidates = [parse_community_bootstrap(v.payload) for v in values]
    return await fetch_manifests_and_filter(candidates, community_id)
```

### 5.3 Lookup integration with M07

When [M07 TransferManager](../../modules/M07-file-blobs.md) needs sources for a CID and the local `file.cid.advertised` index is empty:

```python
# M07 extension (Phase 2)
async def find_remote_sources(cid: str) -> list[str]:  # NodeIDs
    values = await dht.find_value(blake3(cid))
    return [parse_source_advert(v.payload).node_id for v in values]
```

### 5.4 Signature requirement on stored values

Every DHT value's `payload` must contain a `signature` field signed by the advertiser. Receivers reject values whose signature does not validate against the advertiser's claimed NodeID. Cost is small; protection is essential: without it, anyone can poison the DHT.

### 5.5 NAT traversal hooks

The DHT itself does not do hole-punching. It cooperates with [M15 Relay Tier](../M15-relay-tier.md):

- If our advertised endpoint is unreachable (NAT'd), we additionally advertise `via_relay: "<relay_url>"` in the value payload
- Peers wanting to reach us see the relay hint and route through it
- Direct peer-to-peer over NAT (STUN/TURN) is Phase 3

### 5.6 Privacy of the DHT

The DHT is a public-internet-facing component (by definition). It leaks:

- Which NodeIDs exist
- Which communities exist
- Which CIDs are popular

It does **not** leak:

- The contents of any blob
- The contents of community event logs
- Who's actually a member of a community (membership is in the signed manifest, fetched out of band)

This is acceptable for a system whose goal is community resilience, not anonymity.

### 5.7 Anti-spam

- Per-source rate limit on `store` calls: max 100 per minute per node
- Stored value size cap: 4 KB
- Per-bucket eviction prefers values with higher signature reputation (Phase 3)
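
The first two limits above can be enforced with a small admission gate in front of `store`. The sketch below assumes a sliding 60-second window per source, reads "4 KB" as 4096 bytes, and measures size on the JSON-serialised payload. All three are assumptions, as is the `StoreGate` name. It returns the `DhtError` names `value_too_large` and `rate_limited` from §7.

```python
import json
from collections import defaultdict, deque
from typing import Optional

MAX_STORES_PER_MINUTE = 100    # from the anti-spam list above
MAX_VALUE_BYTES = 4 * 1024     # "4 KB", assumed to mean 4096 bytes
WINDOW_SECONDS = 60.0


class StoreGate:
    """Hypothetical admission check run before a value is written to local storage."""

    def __init__(self) -> None:
        # source NodeID -> timestamps of recently accepted store calls
        self._recent = defaultdict(deque)

    def check(self, source_node_id: str, payload: dict, now: float) -> Optional[str]:
        """Return None if the store is admitted, else an error name."""
        if len(json.dumps(payload).encode("utf-8")) > MAX_VALUE_BYTES:
            return "value_too_large"
        window = self._recent[source_node_id]
        # Drop timestamps that have slid out of the 60-second window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_STORES_PER_MINUTE:
            return "rate_limited"
        window.append(now)
        return None
```

Rejected calls are not recorded in the window, so a source that backs off regains its full allowance after a minute.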

### 5.8 Bootstrap reachability

`bootstrap_endpoints` (from config) are tried in order. If all fail, the node logs a warning and continues with mDNS+UDP only. The DHT is best-effort.
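
A best-effort bootstrap loop along these lines might look as follows. This is a sketch only: endpoints are plain strings here, the `ping` callable is injected (standing in for something like `is_reachable`), and treating `OSError` as "unreachable" is an assumption.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("hearthnet.dht")


async def reachable_bootstraps(
    endpoints: list,
    ping: Callable[[str], Awaitable[bool]],
) -> list:
    """Try each bootstrap endpoint in order and return those that answered.

    An empty result is not fatal: the caller keeps running on LAN discovery.
    """
    reachable = []
    for endpoint in endpoints:
        try:
            ok = await ping(endpoint)
        except OSError:
            ok = False  # network errors count as "not reachable"
        if ok:
            reachable.append(endpoint)
    if not reachable:
        logger.warning("no DHT bootstrap endpoint reachable; continuing with mDNS+UDP only")
    return reachable
```

Returning a list rather than raising keeps the DHT optional at runtime, in line with the best-effort stance above.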

---

## 6. Wire format (request/response examples)

### 6.1 `POST /dht/v1/find_value`

Request:
```json
{
  "key": "blake3:<hex of 32 bytes>",
  "from": "ed25519:<our NodeID>",
  "trace_id": "01HXR...",
  "signature": "ed25519:<over the above three fields canonicalised>"
}
```

Response (value found):
```json
{
  "values": [
    {
      "key": "blake3:...",
      "payload": {"node_id":"...","endpoints":[...],"signature":"ed25519:..."},
      "expires_at": 1717942800
    }
  ]
}
```

Response (not found, get closer contacts):
```json
{
  "values": [],
  "closer": [{"node_id_full":"ed25519:...","endpoint":{"host":"...","port":7080}}, "..."]
}
```

---

## 7. Errors

`DhtError`:

- `bootstrap_failed`: no bootstrap endpoint reachable
- `lookup_timeout`: couldn't find value or contacts within DHT_LOOKUP_TIMEOUT
- `store_unauthorized`: payload signature invalid
- `value_too_large`: > 4 KB
- `rate_limited`: per-source store rate exceeded

These don't always map to wire codes, because most DHT activity is internal to the node. When they bubble up to a caller, `dht_lookup_failed` is the wire code.

---

## 8. Configuration

```python
config.dht.enabled = False                    # opt-in; phase 1 default off
config.dht.bootstrap_endpoints = [...]
config.dht.public_endpoint_override = None    # for nodes behind NAT, manual override
config.dht.advertise_cids = True              # also advertise pinned CIDs
config.dht.advertise_community = True
```

Constants used: `DHT_REPLICATION_K=8`, `DHT_ALPHA=3`, `DHT_REFRESH_SECONDS=3600`, `DHT_REPUBLISH_SECONDS=86400`.

---

## 9. Tests

### Unit
- `test_xor_distance_metric`
- `test_routing_table_insert_eviction`
- `test_signed_value_verification`
- `test_unsigned_value_rejected`
- `test_ttl_eviction`

### Integration
- `test_three_node_dht_find_value`: three KademliaNodes in process, store, find
- `test_bootstrap_picks_up_existing_dht`
- `test_partition_then_reconnect_converges`
- `test_value_republish_keeps_alive`

### Property-based
- `test_kademlia_eventual_consistency_under_churn` (Hypothesis-driven)

---

## 10. Cross-references

| What | Where |
|------|-------|
| Used by federation bootstrap | [M14 §4.3](../modules/M14-federation.md) |
| Used by background blob replication | M07 ext (see [00-OVERVIEW §1](../00-OVERVIEW.md)) |
| Wire error code | `dht_lookup_failed` in [CAP2 §9](../CAPABILITY_CONTRACT_v2.md) |
| Phase 1 alternative (mDNS/UDP) | [M02](../../modules/M02-discovery.md) |
| Phase 3 sybil resistance | TBD |

---

## 11. Open questions

1. **libp2p reuse vs custom Python.** libp2p has a Python port but it's heavyweight. A focused 1000-LOC Kademlia matches our needs and stays auditable. Decision: custom for now; can swap.
2. **NAT hole punching.** Currently relay-only. STUN/TURN integration is Phase 3.
3. **Public DHT vs federated DHTs.** Should the DHT itself be federated (per-community DHT joined via cross-sig)? Maybe. Defer.
4. **Onion routing.** Out of scope. HearthNet has no anonymity goal.
docs/p2_p3/X06-websocket.md
ADDED
|
@@ -0,0 +1,316 @@
| 1 |
+
# X06 - WebSocket Upgrade
|
| 2 |
+
|
| 3 |
+
**Spec version:** v1.0 (Phase 2)
|
| 4 |
+
**Depends on:** X01 (transport), X03 (observability), `websockets` Python library
|
| 5 |
+
**Depended on by:** X01 transport server (in-place extension), M21 (tool-call loops), M25 (group chat live), M22 (mobile push delivery)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Add bidirectional WebSocket transport alongside the existing HTTP/1.1 + SSE in [X01](../../cross-cutting/X01-transport.md). Use cases:
|
| 12 |
+
|
| 13 |
+
- Tool-call loops in `llm.chat` where the server needs to ask the client to execute a tool mid-stream
|
| 14 |
+
- Live pubsub topics that fan out many messages per second (group chat, federation heartbeats)
|
| 15 |
+
- Mobile clients on flaky cellular where reconnect is expensive
|
| 16 |
+
|
| 17 |
+
WebSockets do **not** replace the request/response model. They are an *upgrade* available on specific endpoints when both ends support the v2 contract.
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## 2. File layout
|
| 22 |
+
|
| 23 |
+
```
|
| 24 |
+
hearthnet/transport/
|
| 25 |
+
└── websocket.py      # WebSocket server-side handler + client-side wrapper
|
| 26 |
+
```
|
| 27 |
+
|
| 28 |
+
Single file; the protocol is small.
|
| 29 |
+
|
| 30 |
+
---
|
| 31 |
+
|
| 32 |
+
## 3. Endpoints supporting upgrade
|
| 33 |
+
|
| 34 |
+
| Endpoint | Behaviour |
|----------|-----------|
| `/bus/v1/call` | When `Upgrade: websocket` present and capability descriptor supports streaming, upgrade and use frame protocol on the WS instead of SSE |
| `/pubsub/v1/subscribe` | When upgraded, server pushes messages on topic without long-polling |
| `/sync/v1/events` | NOT upgraded: sync is bursty and short-lived; HTTP fits |
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## 4. WebSocket frame protocol
|
| 43 |
+
|
| 44 |
+
WebSocket frames carry the **same JSON event/data envelope** as SSE. This is deliberate: handlers can be written once and dispatched to either transport.
|
| 45 |
+
|
| 46 |
+
### 4.1 Outbound (server → client)
|
| 47 |
+
|
| 48 |
+
Each WebSocket message is one JSON object:
|
| 49 |
+
|
| 50 |
+
```json
|
| 51 |
+
{"event": "token", "data": {"text": "Hallo "}, "seq": 12}
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
`seq` is monotonic per-stream from the server. Used for backpressure ACKs.
|
| 55 |
+
|
| 56 |
+
### 4.2 Inbound (client → server)
|
| 57 |
+
|
| 58 |
+
Three kinds of messages:
|
| 59 |
+
|
| 60 |
+
#### Backpressure ACK
|
| 61 |
+
|
| 62 |
+
```json
|
| 63 |
+
{"type":"ack","upto":8}
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
#### Tool result (mid-stream)
|
| 67 |
+
|
| 68 |
+
```json
|
| 69 |
+
{"type":"tool_result","tool_call_id":"tc_01HXR...","body":{...}}
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
Used in tool-call loops (see [M21](../modules/M21-tool-calls.md)).
|
| 73 |
+
|
| 74 |
+
#### Cancel
|
| 75 |
+
|
| 76 |
+
```json
|
| 77 |
+
{"type":"cancel"}
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
Cleanly stops the current operation. Server must abort within 200 ms and emit a final `error` or `done` frame.
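
To make the three inbound message kinds concrete, here is a minimal parsing sketch. It is not the HearthNet implementation: `ClientFrame`, `BadFrame`, and `parse_client_frame` are hypothetical names, and the per-type field checks are assumptions inferred from the JSON examples above.

```python
import json
from dataclasses import dataclass

VALID_TYPES = {"ack", "tool_result", "cancel"}


class BadFrame(ValueError):
    """Hypothetical local error for malformed JSON or an invalid `type`."""


@dataclass(frozen=True)
class ClientFrame:
    type: str   # "ack" | "tool_result" | "cancel"
    data: dict  # every field of the message except `type`


def parse_client_frame(raw: str) -> ClientFrame:
    """Decode one inbound WebSocket text message into a ClientFrame."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BadFrame("malformed JSON") from exc
    kind = obj.get("type") if isinstance(obj, dict) else None
    if not isinstance(kind, str) or kind not in VALID_TYPES:
        raise BadFrame("invalid or missing 'type'")
    # Assumed required fields, based on the examples in this section.
    if kind == "ack" and not isinstance(obj.get("upto"), int):
        raise BadFrame("ack requires an integer 'upto'")
    if kind == "tool_result" and not isinstance(obj.get("tool_call_id"), str):
        raise BadFrame("tool_result requires a string 'tool_call_id'")
    return ClientFrame(kind, {k: v for k, v in obj.items() if k != "type"})
```

Rejecting unknown `type` values early keeps the dispatch code on the server free of per-handler validation.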
|
| 81 |
+
|
| 82 |
+
### 4.3 Control frames
|
| 83 |
+
|
| 84 |
+
Standard WebSocket pings/pongs. `WEBSOCKET_PING_SECONDS = 30` between pings.
|
| 85 |
+
|
| 86 |
+
---
|
| 87 |
+
|
| 88 |
+
## 5. Public API
|
| 89 |
+
|
| 90 |
+
### 5.1 Server side
|
| 91 |
+
|
| 92 |
+
```python
# hearthnet/transport/websocket.py
from __future__ import annotations  # lets the stub reference types declared elsewhere

from dataclasses import dataclass


class WebSocketSession:
    """Wraps a WebSocket connection from the server's perspective."""

    def __init__(self, ws: WebSocket, kp: KeyPair):
        ...

    @property
    def closed(self) -> bool: ...
    @property
    def remote_node_id(self) -> str: ...

    async def emit(self, event: str, data: dict) -> None:
        """Send a frame; respect flow control."""

    async def emit_token(self, token: dict) -> None: ...
    async def emit_progress(self, current: int, total: int, stage: str) -> None: ...
    async def emit_error(self, code: ErrorCode, **kwargs) -> None: ...
    async def emit_done(self, **meta) -> None: ...

    async def receive(self) -> WsClientFrame | None:
        """Block until a client frame arrives, or None on close."""

    async def close(self, code: int = 1000) -> None: ...


@dataclass(frozen=True)
class WsClientFrame:
    type: str   # "ack" | "tool_result" | "cancel"
    data: dict
```
|
| 123 |
+
|
| 124 |
+
### 5.2 Client side
|
| 125 |
+
|
| 126 |
+
```python
class WebSocketClient:
    """Used by HttpClient (X01) when stream() is called with `prefer_ws=True`."""

    def __init__(
        self,
        url: str,
        kp: KeyPair,
        community_id: str,
        pinned_certs: PinnedCerts,
    ):
        ...

    async def open(self) -> None: ...
    async def close(self) -> None: ...

    async def send_call(
        self,
        capability: str,
        version: str,
        body: dict,
        *,
        trace_id: str,
    ) -> None:
        """Initial call frame. Authentication via X-HearthNet-* headers
        and a signed call-envelope sent as the first WS message."""

    # Plain `def`: `async for` expects __aiter__ to return an async iterator,
    # not a coroutine.
    def __aiter__(self) -> AsyncIterator[Frame]:
        """Yields Frame objects (same shape as SSE Frame)."""

    async def send_tool_result(self, tool_call_id: str, body: dict) -> None: ...
    async def send_ack(self, upto: int) -> None: ...
    async def cancel(self) -> None: ...
```
|
| 160 |
+
|
| 161 |
+
### 5.3 Upgrade negotiation on the server
|
| 162 |
+
|
| 163 |
+
X01's [HttpServer](../../cross-cutting/X01-transport.md) gets a small dispatch shim:
|
| 164 |
+
|
| 165 |
+
```python
# in hearthnet/transport/server.py (Phase 2 extension)
async def dispatch_call(request: Request):
    # The Upgrade header value is case-insensitive, so normalise before comparing.
    upgrade = (request.headers.get("upgrade") or "").lower()
    if upgrade == "websocket" and capability_supports_stream(...):
        return await dispatch_via_websocket(request)
    else:
        return await dispatch_via_sse_or_json(request)
```
|
| 173 |
+
|
| 174 |
+
`capability_supports_stream` checks the descriptor's `stream_schema` is not None.
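
The decision can be isolated from the server framework for testing. The sketch below is illustrative only: `Descriptor` is a hypothetical stand-in for the real capability descriptor, and `choose_transport` is a name invented here.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Descriptor:
    """Hypothetical stand-in for a capability descriptor."""
    name: str
    stream_schema: Optional[dict] = None


def capability_supports_stream(descriptor: Descriptor) -> bool:
    # As described above: streaming is supported when stream_schema is set.
    return descriptor.stream_schema is not None


def wants_websocket(headers: Mapping[str, str]) -> bool:
    """True when the request carries `Upgrade: websocket` (case-insensitive)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("upgrade", "").strip().lower() == "websocket"


def choose_transport(headers: Mapping[str, str], descriptor: Descriptor) -> str:
    if wants_websocket(headers) and capability_supports_stream(descriptor):
        return "websocket"
    return "sse_or_json"
```

Keeping the check as a pure function means the fallback path can be unit-tested without opening a socket.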
|
| 175 |
+
|
| 176 |
+
---
|
| 177 |
+
|
| 178 |
+
## 6. Behaviour
|
| 179 |
+
|
| 180 |
+
### 6.1 Handshake
|
| 181 |
+
|
| 182 |
+
```
client → GET /bus/v1/call
         Connection: Upgrade
         Upgrade: websocket
         Sec-WebSocket-Protocol: hearthnet-bus.v2
         (other X-HearthNet-* headers)
   ↓
server: validates capability + initial signature
        responds 101 Switching Protocols if v2 capable
        responds 426 Upgrade Required (with downgrade hint) if not v2
   ↓
client sends first message: signed call envelope
        {"type":"call","envelope":{...},"signature":"ed25519:..."}
   ↓
server: validates signature, dispatches to bus
   ↓
server streams response frames; client streams ACKs / tool_results / cancels
```
|
| 200 |
+
|
| 201 |
+
### 6.2 Flow control
|
| 202 |
+
|
| 203 |
+
Same window-based FC as SSE (`STREAM_WINDOW_FRAMES = 16`, ACK every 8). Server checks `flow_control.send()` before each emit; client sends `ack` messages every 8 received frames.
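
A minimal sketch of the sender-side window, assuming `seq` starts at 1 and that an ACK's `upto` acknowledges every frame up to and including that sequence number. `SendWindow` and `should_ack` are hypothetical names; the real `flow_control` object in X01 may look different.

```python
STREAM_WINDOW_FRAMES = 16  # from the text above
ACK_EVERY = 8              # client ACK cadence from the text above


class SendWindow:
    """Tracks frames in flight so the server never exceeds the window."""

    def __init__(self, window: int = STREAM_WINDOW_FRAMES) -> None:
        self.window = window
        self.next_seq = 1     # assumption: first frame carries seq 1
        self.acked_upto = 0

    def can_send(self) -> bool:
        in_flight = (self.next_seq - 1) - self.acked_upto
        return in_flight < self.window

    def send(self) -> int:
        """Reserve and return the next seq, or fail if the window is full."""
        if not self.can_send():
            raise RuntimeError("window full; wait for an ack")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, upto: int) -> None:
        # Ignore stale ACKs and never acknowledge frames that were not sent.
        self.acked_upto = max(self.acked_upto, min(upto, self.next_seq - 1))


def should_ack(received_seq: int, every: int = ACK_EVERY) -> bool:
    """Client side: send an ack after every `every` received frames."""
    return received_seq % every == 0
```

With a window of 16 and an ACK every 8 frames, the sender stalls only if two consecutive ACKs are lost or delayed.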
|
| 204 |
+
|
| 205 |
+
### 6.3 Idle handling
|
| 206 |
+
|
| 207 |
+
If no message in either direction for `WEBSOCKET_IDLE_CLOSE_SECONDS` (120s), server closes with code 1000. Client may reopen.
|
| 208 |
+
|
| 209 |
+
### 6.4 Failure modes
|
| 210 |
+
|
| 211 |
+
| Symptom | Behaviour |
|
| 212 |
+
|---------|-----------|
|
| 213 |
+
| Client disconnect mid-stream | Server's task receives `CancelledError`, aborts the underlying capability within 200ms |
|
| 214 |
+
| Network drop | Either side's WS library raises; current stream is `error`-terminated locally |
|
| 215 |
+
| Server overload | Server may decline upgrade with 503 + retry hint; client falls back to SSE |
|
| 216 |
+
| Protocol version mismatch | Server replies 426 with `Sec-WebSocket-Protocol` listing supported versions |
|
| 217 |
+
|
| 218 |
+
### 6.5 Pubsub via WS
|
| 219 |
+
|
| 220 |
+
Subscribing to a topic via WS:
|
| 221 |
+
|
| 222 |
+
```
client GET /pubsub/v1/subscribe?topic=marketplace.post.created
       Upgrade: websocket
   ↓
server upgrades; sends backlog (if `since_seq` provided) then live messages
   ↓
each message: {"event":"published","data":{...},"seq":N}
   ↓
client sends ACKs to allow server to advance flow control
```
|
| 232 |
+
|
| 233 |
+
This replaces the long-polling pattern from Phase 1 §8 for clients that hold the connection. The long-poll endpoint remains for non-WS clients.
|
| 234 |
+
|
| 235 |
+
### 6.6 Tool-call loop (used by [M21](../modules/M21-tool-calls.md))
|
| 236 |
+
|
| 237 |
+
```
server emits:
  {"event":"token","data":{"text":"..."}}
  {"event":"tool_call_delta","data":{"id":"tc_1","name":"rag.query","arguments_delta":"..."}}
  ...
  {"event":"tool_call","data":{"id":"tc_1","arguments":{"query":"...","corpus":"..."}}}
client must respond:
  {"type":"tool_result","tool_call_id":"tc_1","body":{...result of bus.call("rag.query",...)...}}
server continues:
  {"event":"token","data":{"text":"Based on the documents..."}}
  ...
  {"event":"done","data":{...}}
```
|
| 250 |
+
|
| 251 |
+
Without WebSocket, the SSE-only fallback is for the *caller* (UI) to execute the tool and re-call `llm.chat` with the tool result added to messages. Both paths work; WS is more efficient.
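
On the WS path, the client side of the loop could look roughly like the sketch below. It assumes frames arrive as already-decoded dicts, that the tool name is learned from `tool_call_delta` frames (the final `tool_call` frame in the example above carries only `id` and `arguments`), and that `send` and `run_tool` are caller-supplied coroutines. All three function names are hypothetical.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def run_tool_loop(
    frames: AsyncIterator[dict],
    send: Callable[[dict], Awaitable[None]],
    run_tool: Callable[[str, dict], Awaitable[dict]],
) -> str:
    """Consume server frames, answer tool calls inline, return the full text."""
    text: list[str] = []
    tool_names: dict[str, str] = {}  # tool_call id -> capability name
    async for frame in frames:
        event, data = frame["event"], frame.get("data", {})
        if event == "token":
            text.append(data["text"])
        elif event == "tool_call_delta":
            if "name" in data:
                tool_names[data["id"]] = data["name"]
        elif event == "tool_call":
            result = await run_tool(tool_names[data["id"]], data["arguments"])
            await send({"type": "tool_result", "tool_call_id": data["id"], "body": result})
        elif event == "error":
            raise RuntimeError(f"stream failed: {data}")
        elif event == "done":
            break
    return "".join(text)
```

In a real client, `run_tool` would dispatch to something like `bus.call(name, arguments)`; here it is left abstract so the loop stays transport-agnostic.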
|
| 252 |
+
|
| 253 |
+
---
|
| 254 |
+
|
| 255 |
+
## 7. Errors
|
| 256 |
+
|
| 257 |
+
`WebSocketError` codes (local domain):
|
| 258 |
+
|
| 259 |
+
- `upgrade_refused`: server returned 426 or 503
- `version_unsupported`: protocol mismatch
- `idle_timeout`
- `bad_frame`: malformed JSON or invalid `type`
|
| 263 |
+
|
| 264 |
+
On the wire, errors carried inside the WS as `event: error` frames map to the standard wire codes in [CAP §9](../../CAPABILITY_CONTRACT.md).
|
| 265 |
+
|
| 266 |
+
---
|
| 267 |
+
|
| 268 |
+
## 8. Configuration
|
| 269 |
+
|
| 270 |
+
```python
|
| 271 |
+
config.transport.websocket_enabled = True
|
| 272 |
+
config.transport.websocket_idle_close_seconds = WEBSOCKET_IDLE_CLOSE_SECONDS
|
| 273 |
+
config.transport.websocket_ping_seconds = WEBSOCKET_PING_SECONDS
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
---
|
| 277 |
+
|
| 278 |
+
## 9. Tests
|
| 279 |
+
|
| 280 |
+
### Unit
|
| 281 |
+
- `test_ws_frame_shape_matches_sse`
|
| 282 |
+
- `test_signed_call_envelope_first_message`
|
| 283 |
+
- `test_invalid_signature_closes_connection`
|
| 284 |
+
- `test_idle_close_after_timeout`
|
| 285 |
+
|
| 286 |
+
### Integration
|
| 287 |
+
- `test_two_node_ws_call_round_trip`
|
| 288 |
+
- `test_ws_stream_tokens_then_done`
|
| 289 |
+
- `test_ws_tool_result_inline`
|
| 290 |
+
- `test_ws_cancel_within_200ms`
|
| 291 |
+
- `test_ws_fallback_to_sse_when_426`
|
| 292 |
+
- `test_pubsub_via_ws_backlog_plus_live`
|
| 293 |
+
|
| 294 |
+
### Chaos
|
| 295 |
+
- `test_ws_dropped_packet_recovery` (using `tc`)
|
| 296 |
+
|
| 297 |
+
---
|
| 298 |
+
|
| 299 |
+
## 10. Cross-references
|
| 300 |
+
|
| 301 |
+
| What | Where |
|------|-------|
| Endpoint upgrade | [CAP2 §5.1](../CAPABILITY_CONTRACT_v2.md) |
| Frame protocol shared with SSE | [X01 §6](../../cross-cutting/X01-transport.md), [CAP §5.3](../../CAPABILITY_CONTRACT.md) |
| Tool-call loop | [M21](../modules/M21-tool-calls.md) |
| Mobile client benefits | [M22 §5](../modules/M22-mobile-native.md) |
| Phase 3 considerations (WebTransport / QUIC) | TBD |
|
| 308 |
+
|
| 309 |
+
---
|
| 310 |
+
|
| 311 |
+
## 11. Open questions
|
| 312 |
+
|
| 313 |
+
1. **HTTP/3 / WebTransport**: Phase 3 candidate; better on mobile, doesn't need TCP setup time on reconnect.
2. **Binary frames**: JSON works; binary CBOR could save bytes. Defer until profiling shows it matters.
3. **Multiplexing many capability calls on one WS**: currently one WS per call. Multiplexing is possible but adds complexity. Defer.
4. **WSS certificate handling**: same TLS pinning as HTTPS; works because WS goes over the same TLS connection.
|
docs/p2_p3/X07-federated-metrics.md
ADDED
|
@@ -0,0 +1,330 @@
| 1 |
+
# X07 - Federated Metrics
|
| 2 |
+
|
| 3 |
+
**Spec version:** v2.0
|
| 4 |
+
**Depends on:** [X03 Observability](../../cross-cutting/X03-observability.md), [M14 Federation](../modules/M14-federation.md), [M16 Tokens](../modules/M16-tokens.md), [X04 Config](../../cross-cutting/X04-config.md)
|
| 5 |
+
**Depended on by:** Operator dashboards (out of band), federation health UI
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Responsibility
|
| 10 |
+
|
| 11 |
+
Take the per-node Prometheus metrics produced by [X03](../../cross-cutting/X03-observability.md) and aggregate them, with consent, into a community-level view and, where federation grants it, into a federation-level view.
|
| 12 |
+
|
| 13 |
+
X03 gives each node a private view of itself. X07 gives:
|
| 14 |
+
|
| 15 |
+
- **The community founder** a dashboard like "how healthy is the mesh today, where are the hot spots, what's the GPU saturation across all anchors?"
|
| 16 |
+
- **A federated peer** a much narrower view (opt-in, aggregated, no per-node identifiers), like "Geldern reports 18 active members and 4.2k events/day".
|
| 17 |
+
|
| 18 |
+
The design rule is: **less information at greater distance**. Per-node detail stays on the node. The community sees aggregates. Federated peers see anonymised aggregates. There is no global "every node and what it does" surface.
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## 2. File layout
|
| 23 |
+
|
| 24 |
+
```
hearthnet/observability/
├── federated.py             # FederatedMetricsExporter & Aggregator
├── otlp_export.py           # Optional OpenTelemetry OTLP push
├── aggregation_views.py     # SQL-like views over time-series
└── consent.py               # Per-metric publish consent
```
|
| 31 |
+
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
## 3. Public API
|
| 35 |
+
|
| 36 |
+
### 3.1 `FederatedMetricsExporter`
|
| 37 |
+
|
| 38 |
+
```python
class FederatedMetricsExporter:
    """
    Pulls metrics from the local Prometheus registry, applies consent rules,
    and publishes aggregated subsets either:
      - to the community's aggregator anchor (mesh-internal)
      - to an external OTLP collector (optional, off by default)
      - to federated peers via the bus
    """

    def __init__(
        self,
        observability: Observability,
        consent: ConsentPolicy,
        bus: CapabilityBus,
        settings: FederatedMetricsSettings,
    ): ...

    async def start(self) -> None: ...
    async def stop(self) -> None: ...

    # Triggered by tick; publishes to internal bus topic
    async def publish_community(self) -> None: ...

    # Triggered when federated peer requests
    async def publish_federated(self, peer_community_id: NodeID) -> AggregatedSnapshot: ...

    # OTLP push, off by default
    async def push_otlp(self, endpoint: str) -> None: ...
```
|
| 68 |
+
|
| 69 |
+
### 3.2 `MetricsAggregator`
|
| 70 |
+
|
| 71 |
+
Runs on the **aggregator anchor** (any anchor designated by community policy; default is the founder's node):
|
| 72 |
+
|
| 73 |
+
```python
class MetricsAggregator:
    """
    Subscribes to `observability.metrics.tick.*` events from all members,
    keeps a 7-day rolling window, exposes:
      - GET /metrics/community (Prometheus format, community-wide)
      - capability `observability.community_snapshot@1.0`
    """

    def __init__(self, bus: CapabilityBus, event_log: EventLog, store: TimeSeriesStore): ...

    async def start(self) -> None: ...

    async def community_snapshot(self) -> CommunityMetrics: ...
    async def federated_snapshot(self, peer_id: NodeID) -> AggregatedSnapshot: ...
```
|
| 89 |
+
|
| 90 |
+
### 3.3 Snapshot dataclasses
|
| 91 |
+
|
| 92 |
+
```python
@dataclass
class NodeMetricsTick:
    """What each node publishes every METRICS_TICK_SECONDS (default 60)."""
    node_id: NodeID
    timestamp: datetime
    cpu_pct: float
    mem_used_mb: int
    mem_total_mb: int
    gpu_pct: float | None
    gpu_mem_used_mb: int | None
    disk_used_gb: float
    disk_total_gb: float
    capability_calls_per_min: dict[str, int]  # by capability
    error_rate_per_min: dict[str, float]
    p95_latency_ms_by_cap: dict[str, float]
    online_seconds: int                       # since last restart

@dataclass
class CommunityMetrics:
    """Aggregated over the community. Has per-node detail (members see members)."""
    timestamp: datetime
    nodes_total: int
    nodes_online: int
    nodes: list[NodeMetricsTick]
    capability_calls_per_min_total: dict[str, int]
    events_per_min: int
    storage_used_gb: float
    federation_links: int

@dataclass
class AggregatedSnapshot:
    """For federated peers. No per-node detail, no identifiers, banded values."""
    timestamp: datetime
    community_id: NodeID
    nodes_online_band: str                               # "10-20", "20-50", etc.
    daily_active_members_band: str
    capability_calls_per_day_top: list[tuple[str, str]]  # [(cap, band)]
    error_rate_band: str
    federation_links_count: int
```
|
| 133 |
+
|
| 134 |
+
### 3.4 `ConsentPolicy`
|
| 135 |
+
|
| 136 |
+
```python
@dataclass
class ConsentPolicy:
    """
    Loaded from policy.yaml. Controls what leaves the node.
    """
    publish_to_community: set[str]   # metric names included in NodeMetricsTick
    publish_to_federated: set[str]   # subset, applied to AggregatedSnapshot
    publish_to_external: bool        # OTLP push on/off
    aggregation_min_nodes: int       # don't expose a metric unless ≥ N nodes contribute
    banding: dict[str, list[int]]    # metric → bucket edges
```
|
| 148 |
+
|
| 149 |
+
---
|
| 150 |
+
|
| 151 |
+
## 4. Behaviour
|
| 152 |
+
|
| 153 |
+
### 4.1 Tick lifecycle
|
| 154 |
+
|
| 155 |
+
Every `METRICS_TICK_SECONDS` (default 60s) each node:
|
| 156 |
+
|
| 157 |
+
1. Snapshots its local Prometheus registry.
|
| 158 |
+
2. Filters per `ConsentPolicy.publish_to_community`.
|
| 159 |
+
3. Constructs a `NodeMetricsTick`.
|
| 160 |
+
4. Publishes to bus topic `observability.metrics.tick.<community_id>` over WebSocket pubsub (efficient: many small messages, low latency).
|
| 161 |
+
5. Also writes a local rolling-window copy for debug.
|
| 162 |
+
|
| 163 |
+
The aggregator anchor subscribes to the topic, ingests into its time-series store, and computes `CommunityMetrics` on demand.
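
Steps 1 to 3 of the tick lifecycle can be sketched as a pure function. This is an illustration under assumptions: the registry snapshot is modelled as a flat dict keyed by metric names such as `node.cpu_pct`, and the returned dict is a simplified stand-in for `NodeMetricsTick`. Both function names are hypothetical.

```python
from datetime import datetime, timezone


def filter_by_consent(snapshot: dict[str, object], allowed: set[str]) -> dict[str, object]:
    """Keep only the metrics that local policy allows to leave the node."""
    return {name: value for name, value in snapshot.items() if name in allowed}


def build_tick(node_id: str, snapshot: dict[str, object], allowed: set[str]) -> dict[str, object]:
    """Snapshot -> consent filter -> tick payload (simplified shape)."""
    return {
        "node_id": node_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": filter_by_consent(snapshot, allowed),
    }
```

Filtering before the payload is constructed means a metric outside `publish_to_community` never reaches the publishing code path at all.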
|
| 164 |
+
|
| 165 |
+
### 4.2 Aggregator selection
|
| 166 |
+
|
| 167 |
+
The community policy contains:
|
| 168 |
+
|
| 169 |
+
```yaml
observability:
  aggregator_anchor: ed25519:<NodeID>   # optional; if absent, any anchor self-elects
  aggregator_failover_seconds: 600
```
|
| 174 |
+
|
| 175 |
+
If the configured aggregator is offline for `aggregator_failover_seconds`, another anchor self-elects (lowest NodeID hash wins). A live community-wide view tolerates the aggregator going offline; nodes keep publishing ticks and a new aggregator picks up where the old one left off (with a brief gap).
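
The self-election rule can be expressed deterministically so that every anchor reaches the same answer without coordination. The sketch assumes "NodeID hash" means the SHA-256 hex digest of the NodeID string, which the text above does not specify; `node_hash` and `elect_aggregator` are hypothetical names.

```python
import hashlib
from typing import Optional


def node_hash(node_id: str) -> str:
    """Assumed hash for ordering anchors: SHA-256 hex digest of the NodeID."""
    return hashlib.sha256(node_id.encode()).hexdigest()


def elect_aggregator(
    configured: Optional[str],
    online_anchors: set[str],
    offline_seconds: float = 0.0,
    failover_seconds: int = 600,
) -> Optional[str]:
    """Return the NodeID that should currently act as aggregator, if any."""
    if configured is not None:
        if configured in online_anchors:
            return configured
        if offline_seconds < failover_seconds:
            # Still inside the grace period: nobody takes over yet.
            return configured
    if not online_anchors:
        return None
    return min(online_anchors, key=node_hash)
```

Because the winner depends only on the set of online anchors, two anchors with the same view never both claim the role.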
|
| 176 |
+
|
| 177 |
+
### 4.3 What gets exposed to whom
|
| 178 |
+
|
| 179 |
+
| Metric category | Self | Other members | Aggregator anchor | Federated peers | External OTLP |
|------------------|------|----------------|-------------------|------------------|---------------|
| CPU / mem / GPU per-node | ✅ | per policy | ✅ | ❌ | ❌ |
| Per-capability call counts | ✅ | ✅ | ✅ | banded only | optional |
| Per-capability latencies | ✅ | aggregated | ✅ | ❌ | ❌ |
| Error rates | ✅ | aggregated | ✅ | banded only | optional |
| Federation link count | ✅ | ✅ | ✅ | exact count | ❌ |
| File counts / sizes | ✅ | ❌ | aggregated | banded | ❌ |
| Identity of which node did what | ✅ | per policy | ❌ (anonymised on ingest) | ❌ | ❌ |
|
| 188 |
+
|
| 189 |
+
The aggregator does **not** store per-node identity in its long-term time series. It computes per-node views on the fly for the founder UI but persists only anonymised aggregates after `MEMBER_DETAIL_RETENTION_HOURS` (default 24).
|
| 190 |
+
|
| 191 |
+
### 4.4 Banding
|
| 192 |
+
|
| 193 |
+
Federated snapshots use bands rather than exact numbers to prevent triangulation across multiple federations:
|
| 194 |
+
|
| 195 |
+
```yaml
banding:
  nodes_online: [0, 5, 10, 20, 50, 100, 500]
  daily_active_members: [0, 3, 10, 30, 100]
  capability_calls_per_day: [0, 100, 1000, 10000, 100000]
  error_rate: [0, 0.01, 0.05, 0.10]
```
|
| 202 |
+
|
| 203 |
+
Result: `"nodes_online_band": "10-20"` instead of `19`.
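
A small sketch of the banding step, assuming bucket edges are lower-inclusive and that values at or above the last edge fall into an open-ended top band (written here as `"500+"`). Both conventions are assumptions; the spec above only shows the `"10-20"` case. `band` is a hypothetical helper name.

```python
from bisect import bisect_right


def band(value: float, edges: list[float]) -> str:
    """Map an exact value to its band label, e.g. 19 -> "10-20"."""
    if not edges or value < edges[0]:
        raise ValueError("value is below the first bucket edge")
    # Index of the last edge that is <= value.
    i = bisect_right(edges, value) - 1
    if i == len(edges) - 1:
        return f"{edges[i]}+"  # assumed label for the open-ended top band
    return f"{edges[i]}-{edges[i + 1]}"
```

Using `bisect_right` keeps the lookup logarithmic in the number of edges and makes the boundary rule explicit: a value equal to an edge belongs to the band that starts at that edge.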
|
| 204 |
+
|
| 205 |
+
### 4.5 OTLP push (external)
|
| 206 |
+
|
| 207 |
+
Off by default. When `publish_to_external=true`:
|
| 208 |
+
|
| 209 |
+
- Pushes to a configured OTLP endpoint (could be Grafana Cloud, self-hosted Tempo/Mimir, or your own collector).
|
| 210 |
+
- Only metrics in `publish_to_external` set leave the node.
|
| 211 |
+
- The receiver gets aggregated, banded data β same restrictions as a federated peer.
|
| 212 |
+
- TLS required; OTLP headers carry an API token (set via env var, not in policy file).
|
| 213 |
+
|
| 214 |
+
This is the path for an operator (Christof) who wants a single Grafana dashboard across all his bofrost-managed communities. The protections still apply: the external collector cannot reconstruct who did what.
|
| 215 |
+
|
| 216 |
+
### 4.6 Trackio integration
|
| 217 |
+
|
| 218 |
+
Phase 1 already supports per-node Trackio logging. X07 adds: **the aggregator** can push a community-level summary to a Trackio space, useful for hackathon demos and HF leaderboard-style displays.
|
| 219 |
+
|
| 220 |
+
`policy.observability.trackio_community_space` (URL) is configurable. The aggregator anchor pushes `CommunityMetrics` rows hourly. Per-node detail is excluded from this path; only aggregates go.
|
| 221 |
+
|
| 222 |
+
### 4.7 Federated peer queries
|
| 223 |
+
|
| 224 |
+
A federated peer asks for our snapshot via:
|
| 225 |
+
|
| 226 |
+
```
|
| 227 |
+
POST /bus/v1/call
|
| 228 |
+
X-HearthNet-Community: <their community>
|
| 229 |
+
Capability: observability.federated_snapshot@1.0
|
| 230 |
+
Body: {"input": {"window_hours": 24}}
|
| 231 |
+
```
|
| 232 |
+
|
| 233 |
+
Bus checks federation scope, calls the aggregator, returns `AggregatedSnapshot`. The peer's UI may display "Geldern: 10-20 nodes online, light activity today" alongside their own community.
|
| 234 |
+
|
| 235 |
+
### 4.8 Cost & sizing
|
| 236 |
+
|
| 237 |
+
A `NodeMetricsTick` is roughly 500 bytes of JSON. At 1 tick / 60 s per node, a 50-node community publishes 50 × 500 B / 60 s ≈ 420 B/s on the metrics topic. Negligible.
|
| 238 |
+
|
| 239 |
+
The aggregator's time-series store is **DuckDB** (Phase 2 choice; SQLite would also work). Retention: 7 days at full per-node resolution, then daily roll-ups for 90 days, then weekly forever.
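
The sizing arithmetic can be reproduced with a few lines. The 500-byte tick size and 60-second interval come from the text above; the raw-volume figure counts JSON bytes on the topic only and says nothing about how much space DuckDB actually uses after ingestion. The function names are hypothetical.

```python
TICK_BYTES = 500     # approximate JSON size of one NodeMetricsTick
TICK_SECONDS = 60    # METRICS_TICK_SECONDS default
SECONDS_PER_DAY = 86_400


def topic_bytes_per_second(nodes: int) -> float:
    """Average bandwidth on the metrics topic for a community of `nodes`."""
    return nodes * TICK_BYTES / TICK_SECONDS


def raw_bytes_for_days(nodes: int, days: int) -> int:
    """Raw JSON volume published over `days` at full per-node resolution."""
    ticks_per_day = SECONDS_PER_DAY // TICK_SECONDS
    return nodes * TICK_BYTES * ticks_per_day * days
```

For a 50-node community this gives roughly 417 B/s and 252 MB of raw JSON across the 7-day full-resolution window.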
|
| 240 |
+
|
| 241 |
+
---
|
| 242 |
+
|
| 243 |
+
## 5. Errors
|
| 244 |
+
|
| 245 |
+
| Code | Cause |
|
| 246 |
+
|------|-------|
|
| 247 |
+
| `unavailable` | Aggregator anchor offline |
|
| 248 |
+
| `aggregation_too_few_nodes` | < `aggregation_min_nodes` nodes contributed; refusing to disclose |
|
| 249 |
+
| `federation_forbidden` | Peer requested a metric category not in federation scope |
|
| 250 |
+
| `consent_denied` | Local policy forbids this metric from leaving the node |
|
| 251 |
+
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
## 6. Configuration
|
| 255 |
+
|
| 256 |
+
```toml
|
| 257 |
+
[observability.federated]
|
| 258 |
+
enabled = true
|
| 259 |
+
metrics_tick_seconds = 60
|
| 260 |
+
aggregator_failover_seconds = 600
|
| 261 |
+
member_detail_retention_hours = 24
|
| 262 |
+
aggregation_min_nodes = 3
|
| 263 |
+
publish_to_external = false
|
| 264 |
+
otlp_endpoint = ""
|
| 265 |
+
otlp_token_env = "OTLP_TOKEN"
|
| 266 |
+
trackio_community_space = ""
|
| 267 |
+
|
| 268 |
+
[observability.federated.consent.publish_to_community]
|
| 269 |
+
metrics = [
|
| 270 |
+
"node.cpu_pct", "node.mem_pct", "node.gpu_pct",
|
| 271 |
+
"node.online_seconds", "node.capability_calls_per_min",
|
| 272 |
+
"node.p95_latency_by_capability",
|
| 273 |
+
]
|
| 274 |
+
|
| 275 |
+
[observability.federated.consent.publish_to_federated]
|
| 276 |
+
metrics = [
|
| 277 |
+
"community.nodes_online", "community.daily_active_members",
|
| 278 |
+
"community.capability_calls_top", "community.federation_links",
|
| 279 |
+
]
|
| 280 |
+
|
| 281 |
+
[observability.federated.consent.banding]
|
| 282 |
+
"community.nodes_online" = [0, 5, 10, 20, 50, 100]
|
| 283 |
+
"community.daily_active_members" = [0, 3, 10, 30, 100]
|
| 284 |
+
"community.capability_calls_top" = [0, 100, 1000, 10000]
|
| 285 |
+
"community.error_rate" = [0, 0.01, 0.05, 0.10]
|
| 286 |
+
```
|
| 287 |
+
|
| 288 |
+
---
|
| 289 |
+
|
| 290 |
+
## 7. Tests
|
| 291 |
+
|
| 292 |
+
### 7.1 Unit
|
| 293 |
+
- Banding: value 17 with bands `[0,5,10,20,50]` returns `"10-20"`
|
| 294 |
+
- Aggregation refuses when contributors < min: `aggregation_too_few_nodes`
|
| 295 |
+
- Consent: a metric not in `publish_to_community` set is excluded from tick
|
| 296 |
+
- AggregatedSnapshot construction strips all NodeID fields
|
| 297 |
+
|
| 298 |
+
### 7.2 Integration
|
| 299 |
+
- 5 nodes publish ticks for 5 minutes; aggregator's snapshot reflects 5 contributors with correct totals
|
| 300 |
+
- Aggregator kill / failover: a second anchor takes over within 10 minutes, snapshot resumes
|
| 301 |
+
- Federated peer requests snapshot; receives banded version; cannot infer specific node counts
|
| 302 |
+
|
| 303 |
+
### 7.3 Adversarial
|
| 304 |
+
- Malicious node publishes inflated counters → outlier detection drops obvious outliers (>3σ) from the aggregate
- Federated peer requests snapshot for window the aggregator hasn't filled → `aggregation_too_few_nodes`
- OTLP endpoint compromised: leaked data contains only banded aggregates; per-node attribution impossible
|
| 307 |
+
|
| 308 |
+
### 7.4 Privacy
|
| 309 |
+
- Asserts: no NodeID, IP, or device-identifying string is present in `AggregatedSnapshot`
|
| 310 |
+
- Asserts: after `MEMBER_DETAIL_RETENTION_HOURS` the aggregator's persisted store contains no per-node rows
|
| 311 |
+
|
| 312 |
+
---
|
| 313 |
+
|
| 314 |
+
## 8. Cross-references
|
| 315 |
+
|
| 316 |
+
- Capability: `observability.community_snapshot@1.0`, `observability.federated_snapshot@1.0` (introduced here, listed in [CAPABILITY_CONTRACT_v2 §3](../CAPABILITY_CONTRACT_v2.md#3-complete-new-capabilities-list))
|
| 317 |
+
- Bus topic: `observability.metrics.tick.<community_id>`
|
| 318 |
+
- Underlying primitives: [X03](../../cross-cutting/X03-observability.md)
|
| 319 |
+
- Federation scope: [M14 §5](../modules/M14-federation.md)
|
| 320 |
+
- Policy schema: [X04](../../cross-cutting/X04-config.md)
|
| 321 |
+
|
| 322 |
+
---
|
| 323 |
+
|
| 324 |
+
## 9. Open questions
|
| 325 |
+
|
| 326 |
+
1. **Differential privacy**: adding Laplace noise to federated snapshots. Worth it for stronger guarantees, or does banding already suffice given small N?
|
| 327 |
+
2. **Federation gossip of snapshots**: should snapshots propagate transitively (A→B→C sees A's banded numbers), or strictly point-to-point? Phase-2 default: point-to-point.
|
| 328 |
+
3. **Per-capability cost accounting**: exposing GPU-seconds per capability call would help operators reason about cost and about who is consuming what. It also reveals usage patterns, so it needs a consent design.
|
| 329 |
+
4. **Histogram vs banded scalars**: banded scalars are simple but lose distribution shape. Full Prometheus histograms with aggregated buckets might be a better federated unit. Trade-off: bytes on the wire vs richness.
|
| 330 |
+
5. **Aggregator beyond a single anchor**: at large scale (100+ nodes) a single aggregator becomes a bottleneck. Sharded aggregation (per-capability-prefix?) is a Phase-3 problem.
|
docs/p2_p3/X08-tensor-transport.md
ADDED
|
@@ -0,0 +1,302 @@
|
| 1 |
+
# X08: Tensor Transport
|
| 2 |
+
|
| 3 |
+
**Spec version:** v3.0 (*experimental*)
|
| 4 |
+
**Depends on:** [X06 WebSocket](../../phase-2/cross-cutting/X06-websocket.md), [M02 Transport](../../modules/M02-transport.md), [M01 Identity](../../modules/M01-identity.md)
|
| 5 |
+
**Depended on by:** [M26 Distributed Inference](../modules/M26-distributed-inference.md)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Purpose
|
| 10 |
+
|
| 11 |
+
A binary, framed, flow-controlled transport for **tensor data** between HearthNet nodes, specifically the activations and gradients moved during M26 distributed inference. The text-oriented capability bus and JSON-shaped event envelopes are a poor fit for this traffic: tensors are large and dense, and they benefit from binary representation, streaming, and explicit flow control.
|
| 12 |
+
|
| 13 |
+
X08 lives parallel to the bus, not on top of it. A tensor session is *negotiated* via the bus (M26 calls `pipeline.shard.connect` which returns an X08 endpoint URL and a session token), then the actual bytes move over a dedicated WebSocket binary channel.
|
| 14 |
+
|
| 15 |
+
Scope: bidirectional tensor streaming, fp16 by default, optional zstd compression above a threshold, 16-byte fixed-size headers, chunked payloads, ack-based flow control. Not a general-purpose RPC.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. Non-goals
|
| 20 |
+
|
| 21 |
+
- **Replacing capability-bus traffic.** Control plane stays on the bus. X08 carries data only.
|
| 22 |
+
- **Persistent storage of tensors.** X08 is point-to-point, in-memory, ephemeral. Storage is the caller's job.
|
| 23 |
+
- **Cross-version negotiation of the frame format.** v3.0 ships one frame format. A future version bumps the major.
|
| 24 |
+
- **End-to-end encryption beyond TLS.** WebSocket runs over TLS via M02. Per-frame application-layer crypto is out of scope (the threat model doesn't require it because session establishment is authenticated, and the WSS hop is encrypted).
|
| 25 |
+
- **Reliable broadcast.** Sessions are 1:1. Multi-receiver fan-out is M26's problem if it needs it.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## 3. Wire format
|
| 30 |
+
|
| 31 |
+
### 3.1 Frame
|
| 32 |
+
|
| 33 |
+
Every frame is a single WebSocket binary message. Frame layout (big-endian):
|
| 34 |
+
|
| 35 |
+
```
|
| 36 |
+
offset size field
|
| 37 |
+
0 1 version (currently 0x01)
|
| 38 |
+
1 1 frame_type
|
| 39 |
+
2 2 reserved (must be 0x0000)
|
| 40 |
+
4 4 session_seq (u32, monotonic per session)
|
| 41 |
+
8 4 payload_length (u32, bytes of body)
|
| 42 |
+
12 4 flags
|
| 43 |
+
16 ... body (payload_length bytes)
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
The header is always 16 bytes. Body is opaque to the framing layer; its interpretation depends on `frame_type`.
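
As a minimal sketch of the layout above, the 16-byte header maps onto a single big-endian `struct` format. The helper names `pack_frame` and `unpack_frame` are hypothetical and are not part of the `hearthnet` library; the validation shown (version, reserved field, length) is one plausible reading of the table.

```python
import struct

# version, frame_type, reserved, session_seq, payload_length, flags (big-endian)
HEADER = struct.Struct(">BBHIII")
VERSION = 0x01


def pack_frame(frame_type: int, session_seq: int, body: bytes = b"", flags: int = 0) -> bytes:
    """Prefix `body` with the 16-byte frame header."""
    return HEADER.pack(VERSION, frame_type, 0, session_seq, len(body), flags) + body


def unpack_frame(frame: bytes) -> tuple[int, int, int, bytes]:
    """Return (frame_type, session_seq, flags, body) or raise ValueError."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than the 16-byte header")
    version, frame_type, reserved, seq, length, flags = HEADER.unpack_from(frame)
    if version != VERSION:
        raise ValueError("unsupported frame version")
    if reserved != 0:
        raise ValueError("reserved field must be 0x0000")
    body = frame[HEADER.size:]
    if len(body) != length:
        raise ValueError("payload_length does not match body size")
    return frame_type, seq, flags, body
```

Because each frame is one WebSocket binary message, the sketch treats the message boundary as the frame boundary and only cross-checks `payload_length` against it.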
|
| 47 |
+
|
| 48 |
+
### 3.2 Frame types
|
| 49 |
+
|
| 50 |
+
```
|
| 51 |
+
0x01 TENSOR_DATA body = tensor chunk (see §3.4)
|
| 52 |
+
0x02 TENSOR_END body = empty; marks last chunk of a tensor
|
| 53 |
+
0x03 ACK body = empty; acknowledges receipt up to session_seq
|
| 54 |
+
0x04 CONTROL_NACK body = utf-8 error reason
|
| 55 |
+
0x05 CONTROL_HELLO body = HelloMsg (json, utf-8)
|
| 56 |
+
0x06 CONTROL_BYE body = utf-8 reason, optional
|
| 57 |
+
0x07 CONTROL_FLOWCTL body = FlowCtlMsg (json, utf-8)
|
| 58 |
+
0x08 CONTROL_PING body = 8 bytes (echo nonce)
|
| 59 |
+
0x09 CONTROL_PONG body = 8 bytes (echoed nonce)
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
Frame types `0x10..0xFF` are reserved for future extensions. Current implementations MUST close the session on any frame type they do not recognise.
|
| 63 |
+
|
| 64 |
+
### 3.3 Flags
|
| 65 |
+
|
| 66 |
+
```
|
| 67 |
+
0x00000001 COMPRESSED payload is zstd-compressed
|
| 68 |
+
0x00000002 FINAL last frame in this tensor (also implied by TENSOR_END)
|
| 69 |
+
0x00000004 GRAD payload is a gradient (informational; for telemetry)
|
| 70 |
+
0x00000008 ENCRYPTED reserved for future per-frame encryption
|
| 71 |
+
0xFFFFFFF0 reserved
|
| 72 |
+
```
|
| 73 |
+
|
| 74 |
+
### 3.4 Tensor chunk body
|
| 75 |
+
|
| 76 |
+
A `TENSOR_DATA` body is:
|
| 77 |
+
|
| 78 |
+
```
|
| 79 |
+
offset size field
|
| 80 |
+
0 2 tensor_id (u16, scoped to this session)
|
| 81 |
+
2 1 dtype (0x01=fp16, 0x02=fp32, 0x03=bf16, 0x04=int8)
|
| 82 |
+
3 1 n_dims (1..8)
|
| 83 |
+
4 n_dims*4 shape (u32 per dim, big-endian)
|
| 84 |
+
... data_bytes (compressed if COMPRESSED flag set)
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
`tensor_id` lets a session carry multiple concurrent tensors (e.g., parallel pipeline stages). A given `tensor_id` may be split across multiple `TENSOR_DATA` frames and is terminated by a `TENSOR_END` with the same `tensor_id`.
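
The chunk body header can be sketched the same way. `pack_chunk`, `unpack_chunk` and the `DTYPES` mapping are illustrative names; the byte layout follows the table above, and `data` is treated as opaque bytes.

```python
import struct

DTYPES = {"fp16": 0x01, "fp32": 0x02, "bf16": 0x03, "int8": 0x04}


def pack_chunk(tensor_id: int, dtype: str, shape: tuple[int, ...], data: bytes) -> bytes:
    """Build a TENSOR_DATA body: tensor_id, dtype, n_dims, shape, then data."""
    if not 1 <= len(shape) <= 8:
        raise ValueError("n_dims must be in 1..8")
    head = struct.pack(f">HBB{len(shape)}I", tensor_id, DTYPES[dtype], len(shape), *shape)
    return head + data


def unpack_chunk(body: bytes) -> tuple[int, int, tuple[int, ...], bytes]:
    """Return (tensor_id, dtype_code, shape, data_bytes)."""
    tensor_id, dtype_code, n_dims = struct.unpack_from(">HBB", body, 0)
    if not 1 <= n_dims <= 8:
        raise ValueError("n_dims must be in 1..8")
    shape = struct.unpack_from(f">{n_dims}I", body, 4)
    data = body[4 + 4 * n_dims:]
    return tensor_id, dtype_code, shape, data
```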
|
| 88 |
+
|
| 89 |
+
### 3.5 HelloMsg
|
| 90 |
+
|
| 91 |
+
```json
|
| 92 |
+
{
|
| 93 |
+
"session_id": "<ulid>",
|
| 94 |
+
"session_token": "<m16-token>",
|
| 95 |
+
"from": "<NodeID>",
|
| 96 |
+
"to": "<NodeID>",
|
| 97 |
+
"purpose": "pipeline.shard.forward",
|
| 98 |
+
"negotiation": {
|
| 99 |
+
"preferred_dtype": "fp16",
|
| 100 |
+
"compression": "zstd",
|
| 101 |
+
"max_chunk_bytes": 1048576,
|
| 102 |
+
"flow_window": 16
|
| 103 |
+
}
|
| 104 |
+
}
|
| 105 |
+
```
|
| 106 |
+
|
| 107 |
+
Both parties exchange `CONTROL_HELLO` on connect; mismatched purposes or invalid tokens terminate the session with `CONTROL_BYE`.
|
| 108 |
+
|
| 109 |
+
### 3.6 FlowCtlMsg
|
| 110 |
+
|
| 111 |
+
```json
|
| 112 |
+
{ "window": 16, "credits_added": 8 }
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
Receiver-initiated. It says "I can accept N more in-flight chunks beyond what I've already acked". See §4.3.
|
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## 4. Behaviour
|
| 120 |
+
|
| 121 |
+
### 4.1 Session lifecycle
|
| 122 |
+
|
| 123 |
+
```
|
| 124 |
+
CONNECT ──hello exchange──▶ READY ──tensor data──▶ STREAMING ──end/bye──▶ CLOSED
   │
   ├── auth fails ──▶ NACK ──▶ CLOSED
   └── timeout ──▶ CLOSED
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
A session is opened by the side that initiated the bus call (the M26 caller for forward passes; the shard server for activations sent back if reverse direction is needed). The HelloMsg `session_token` is an M16 token scoped to the bus capability that authorised this session (e.g., `pipeline-shard-forward`); the receiver validates it before accepting any `TENSOR_DATA`.
|
| 131 |
+
|
| 132 |
+
### 4.2 Sequencing
|
| 133 |
+
|
| 134 |
+
`session_seq` is a u32 that starts at 1 and increments with each frame the sender emits. After reaching 2^32-1 it wraps back to 1, although in practice a single session is expected to stay far below that limit. Wrap is supported by the protocol but is not exercised by tests.
|
| 135 |
+
|
| 136 |
+
The receiver tracks the highest `session_seq` it has processed and acknowledges via `ACK` frames whose `session_seq` echoes the highest contiguous received seq.
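
A receiver-side view of these rules might look like the sketch below. `next_seq` and `SeqTracker` are hypothetical names; the sketch assumes one tracker per direction and treats any gap as fatal, on the grounds that WebSocket already delivers messages in order.

```python
SEQ_MAX = 2**32 - 1


def next_seq(seq: int) -> int:
    """Return the sequence number that follows `seq`, wrapping to 1 after 2^32-1."""
    return 1 if seq >= SEQ_MAX else seq + 1


class SeqTracker:
    """Tracks the highest contiguous session_seq seen from the peer."""

    def __init__(self) -> None:
        self.last_contiguous = 0  # nothing received yet; the first frame must be 1

    def accept(self, seq: int) -> int:
        """Validate an incoming seq and return the value to echo in an ACK."""
        expected = next_seq(self.last_contiguous)
        if seq != expected:
            # The caller would send a NACK and close the session here.
            raise ValueError(f"sequence gap: expected {expected}, got {seq}")
        self.last_contiguous = seq
        return seq
```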
|
| 137 |
+
|
| 138 |
+
### 4.3 Flow control
|
| 139 |
+
|
| 140 |
+
The receiver advertises a *credit window* in `CONTROL_FLOWCTL`. The sender may have at most `window` un-acked frames in flight. Initial window is set in `HelloMsg.negotiation.flow_window` (default `TENSOR_FLOW_CONTROL_WINDOW=16`). The receiver replenishes credits by sending `FLOWCTL` with `credits_added > 0` as it processes frames.
|
| 141 |
+
|
| 142 |
+
If the sender's in-flight count reaches the window, it pauses until an `ACK` or `FLOWCTL` arrives. There is no timeout-based unblock; if the receiver disappears, the underlying WebSocket eventually closes and the session ends.
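
The sender's side of this windowing can be sketched as a small bookkeeping class. `SendWindow` is a hypothetical name. The sketch assumes that ACKs are cumulative, that the `window` field of a `FlowCtlMsg` carries the receiver's current window size, and that every sent frame counts against the window; it also ignores sequence wrap. The spec does not pin these details down, so treat this as one possible reading.

```python
class SendWindow:
    """Sender-side accounting for ack-based flow control."""

    def __init__(self, window: int = 16) -> None:
        self.window = window
        self.last_sent = 0   # highest session_seq sent so far
        self.last_acked = 0  # highest session_seq acknowledged by the peer

    @property
    def in_flight(self) -> int:
        return self.last_sent - self.last_acked

    def can_send(self) -> bool:
        return self.in_flight < self.window

    def on_send(self) -> int:
        """Reserve the next sequence number, or fail if the window is full."""
        if not self.can_send():
            raise RuntimeError("window full: wait for ACK or FLOWCTL")
        self.last_sent += 1
        return self.last_sent

    def on_ack(self, acked_seq: int) -> None:
        """Apply a cumulative ACK."""
        if acked_seq > self.last_sent:
            raise ValueError("peer acknowledged a frame that was never sent")
        self.last_acked = max(self.last_acked, acked_seq)

    def on_flowctl(self, window: int, credits_added: int) -> None:
        # Assumption: `window` is authoritative; `credits_added` is informational.
        self.window = window
```

In an asyncio implementation, `can_send()` would typically gate an `asyncio.Event` or condition that the send loop awaits.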
|
| 143 |
+
|
| 144 |
+
### 4.4 Compression
|
| 145 |
+
|
| 146 |
+
The `COMPRESSED` flag is set per frame, not per session. The sender chooses; the receiver MUST support zstd (level 3 default). Compression is applied to the *body* (everything after the 16-byte header). The header's `payload_length` reflects the compressed size; the uncompressed shape is recovered from the tensor chunk header after decompression.
|
| 147 |
+
|
| 148 |
+
Compression is enabled when the raw body exceeds `TENSOR_COMPRESSION_THRESHOLD_BYTES` (default 64 KiB). Below this, the framing overhead dominates and compression is skipped.
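
The threshold rule can be sketched as follows. To keep the example runnable on the standard library alone, `zlib` stands in for zstd here; a real implementation would call a zstd binding instead. `maybe_compress` and `maybe_decompress` are hypothetical helper names.

```python
import zlib

COMPRESSED = 0x00000001
THRESHOLD_BYTES = 64 * 1024  # TENSOR_COMPRESSION_THRESHOLD_BYTES default


def maybe_compress(raw: bytes, flags: int = 0,
                   threshold: int = THRESHOLD_BYTES) -> tuple[int, bytes]:
    """Compress only when the raw body exceeds the threshold."""
    if len(raw) <= threshold:
        return flags, raw
    # zlib is a stand-in for zstd level 3 in this sketch.
    return flags | COMPRESSED, zlib.compress(raw, level=3)


def maybe_decompress(flags: int, payload: bytes) -> bytes:
    """Undo maybe_compress based on the COMPRESSED flag."""
    return zlib.decompress(payload) if flags & COMPRESSED else payload
```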
|
| 149 |
+
|
| 150 |
+
### 4.5 Chunking
|
| 151 |
+
|
| 152 |
+
A tensor larger than `TENSOR_CHUNK_BYTES` (default 1 MiB) is split into multiple `TENSOR_DATA` frames sharing the same `tensor_id`. The split is on raw-byte boundaries (after compression if compressed); the receiver concatenates raw bytes per `tensor_id` and then, on `TENSOR_END`, decompresses (if needed) and reconstructs the tensor using the shape declared in the *first* chunk for that `tensor_id`. Subsequent chunks for the same `tensor_id` repeat the dtype/shape header; the receiver MUST verify consistency or close the session with a NACK.
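
Splitting and reassembly can be sketched like this. `split_chunks` and `Reassembler` are illustrative names, dtype and shape are passed as plain values rather than parsed from the wire, and decompression is left to the caller.

```python
def split_chunks(data: bytes, chunk_bytes: int = 1024 * 1024) -> list[bytes]:
    """Split raw bytes into chunks of at most `chunk_bytes` each."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)] or [b""]


class Reassembler:
    """Collects chunks per tensor_id and checks dtype/shape consistency."""

    def __init__(self) -> None:
        self._meta: dict[int, tuple[int, tuple[int, ...]]] = {}
        self._parts: dict[int, bytearray] = {}

    def add(self, tensor_id: int, dtype: int, shape: tuple[int, ...], data: bytes) -> None:
        meta = (dtype, tuple(shape))
        # The first chunk fixes dtype and shape; later chunks must repeat them.
        if self._meta.setdefault(tensor_id, meta) != meta:
            raise ValueError("dtype/shape changed between chunks")  # NACK + close
        self._parts.setdefault(tensor_id, bytearray()).extend(data)

    def finish(self, tensor_id: int) -> tuple[int, tuple[int, ...], bytes]:
        """Called on TENSOR_END: return (dtype, shape, concatenated bytes)."""
        dtype, shape = self._meta.pop(tensor_id)
        return dtype, shape, bytes(self._parts.pop(tensor_id))
```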
|
| 153 |
+
|
| 154 |
+
### 4.6 Keepalive
|
| 155 |
+
|
| 156 |
+
Either side may send `CONTROL_PING` at any time; the peer must respond with `CONTROL_PONG` echoing the nonce. When a session has seen no PING/PONG and no data for `TENSOR_KEEPALIVE_SECONDS` (default 30), a PING is sent; if the peer fails to respond within 2× that interval, the session is closed.
|
| 157 |
+
|
| 158 |
+
### 4.7 Backpressure & cancellation
|
| 159 |
+
|
| 160 |
+
A caller cancelling a pipeline operation (M26) sends `CONTROL_BYE` with a reason. The receiver may discard in-flight tensors for the cancelled session. There is no "graceful drain": cancellation is fast and lossy.
|
| 161 |
+
|
| 162 |
+
### 4.8 Failure modes
|
| 163 |
+
|
| 164 |
+
- **Decompression fails**: NACK + close. The caller in M26 retries with the failover shard.
|
| 165 |
+
- **Tensor shape inconsistency across chunks**: NACK + close.
|
| 166 |
+
- **Auth failure on HelloMsg**: NACK + close before any data flows.
|
| 167 |
+
- **Unknown frame type**: close with NACK reason `unknown_frame_type`.
|
| 168 |
+
- **Sequence gap**: NACK + close. There is no out-of-order recovery; WebSocket delivers in order, so a gap means corruption.
|
| 169 |
+
- **Window overrun by sender**: NACK + close, because the sender violated flow control.
|
| 170 |
+
|
| 171 |
+
---
|
| 172 |
+
|
| 173 |
+
## 5. API
|
| 174 |
+
|
| 175 |
+
X08 is a library, not a capability surface. Public Python API:
|
| 176 |
+
|
| 177 |
+
```python
|
| 178 |
+
class TensorSession:
|
| 179 |
+
@classmethod
|
| 180 |
+
async def connect(cls,
|
| 181 |
+
url: str,
|
| 182 |
+
token: AuthToken,
|
| 183 |
+
*,
|
| 184 |
+
purpose: str,
|
| 185 |
+
remote: NodeID,
|
| 186 |
+
negotiation: SessionNegotiation | None = None) -> TensorSession: ...
|
| 187 |
+
@classmethod
|
| 188 |
+
async def accept(cls,
|
| 189 |
+
ws: WebSocket,
|
| 190 |
+
*,
|
| 191 |
+
expected_purpose: str,
|
| 192 |
+
validate_token: Callable[[AuthToken], None]) -> TensorSession: ...
|
| 193 |
+
|
| 194 |
+
async def send_tensor(self, tensor_id: int, t: Tensor, *, gradient: bool = False) -> None: ...
|
| 195 |
+
async def recv_tensor(self) -> RecvTensor: ...
|
| 196 |
+
async def close(self, reason: str = "") -> None: ...
|
| 197 |
+
|
| 198 |
+
@property
|
| 199 |
+
def session_id(self) -> str: ...
|
| 200 |
+
@property
|
| 201 |
+
def stats(self) -> SessionStats: ...
|
| 202 |
+
|
| 203 |
+
@dataclass(frozen=True)
|
| 204 |
+
class RecvTensor:
|
| 205 |
+
tensor_id: int
|
| 206 |
+
tensor: Tensor
|
| 207 |
+
is_grad: bool
|
| 208 |
+
|
| 209 |
+
@dataclass(frozen=True)
|
| 210 |
+
class SessionStats:
|
| 211 |
+
bytes_sent: int
|
| 212 |
+
bytes_received: int
|
| 213 |
+
bytes_compressed_out: int
|
| 214 |
+
bytes_uncompressed_out: int
|
| 215 |
+
frames_sent: int
|
| 216 |
+
frames_received: int
|
| 217 |
+
rtt_estimate_ms: float
|
| 218 |
+
```
|
| 219 |
+
|
| 220 |
+
Implementations: `hearthnet/transport/tensor/` houses `session.py`, `frame.py`, `flow.py`, `compress.py`.
|
| 221 |
+
|
| 222 |
+
---
|
| 223 |
+
|
| 224 |
+
## 6. Configuration
|
| 225 |
+
|
| 226 |
+
```python
|
| 227 |
+
@dataclass(frozen=True)
|
| 228 |
+
class TensorTransportConfig:
|
| 229 |
+
default_dtype: Literal["fp16","fp32","bf16","int8"] = "fp16"
|
| 230 |
+
chunk_bytes: int = TENSOR_CHUNK_BYTES # 1048576
|
| 231 |
+
flow_control_window: int = TENSOR_FLOW_CONTROL_WINDOW # 16
|
| 232 |
+
compression_threshold_bytes: int = TENSOR_COMPRESSION_THRESHOLD_BYTES # 65536
|
| 233 |
+
compression_level: int = 3 # zstd
|
| 234 |
+
keepalive_seconds: int = TENSOR_KEEPALIVE_SECONDS # 30
|
| 235 |
+
max_session_lifetime_seconds: int = 3600 # hard cap
|
| 236 |
+
max_concurrent_sessions: int = 64
|
| 237 |
+
rx_buffer_bytes_max: int = 64 * 1024 * 1024 # 64 MiB
|
| 238 |
+
```
|
| 239 |
+
|
| 240 |
+
Constants in `hearthnet/constants.py`.
|
| 241 |
+
|
| 242 |
+
---
|
| 243 |
+
|
| 244 |
+
## 7. Tests
|
| 245 |
+
|
| 246 |
+
### 7.1 Unit
|
| 247 |
+
|
| 248 |
+
- `test_frame_header_layout`: pack/unpack roundtrip for all frame types.
- `test_tensor_chunk_body`: pack/unpack roundtrip for all dtypes and ranks.
- `test_compression_roundtrip`: compressed body decompresses to identity.
- `test_chunking_reassembly`: 5 MiB tensor split into 5 chunks reassembles to identical bytes.
- `test_unknown_frame_type_closes`: receiver rejects 0xFF.
- `test_flow_control_blocks_at_window`: sender pauses at window edge, resumes on ACK.
- `test_seq_gap_closes`: injecting a missing seq forces NACK + close.
|
| 255 |
+
|
| 256 |
+
### 7.2 Property
|
| 257 |
+
|
| 258 |
+
- Random tensor shapes and dtypes: send → receive → equal modulo dtype precision.
|
| 259 |
+
- Random chunk sizes that always sum to the same total: reassembly identical.
|
| 260 |
+
|
| 261 |
+
### 7.3 Integration
|
| 262 |
+
|
| 263 |
+
- Loopback session over an in-memory WebSocket pair: send 10 tensors of varying size, verify all received, stats consistent.
|
| 264 |
+
- Two-process loopback: same as above but over a real localhost WSS.
|
| 265 |
+
- Cancellation mid-stream: sender sends half a tensor, receives BYE, no further frames sent.
|
| 266 |
+
- Auth failure: connect with bad token → NACK on hello.
|
| 267 |
+
|
| 268 |
+
### 7.4 Negative
|
| 269 |
+
|
| 270 |
+
- Send to a wrong purpose → hello mismatch → close.
- Send oversized tensor (exceeds `rx_buffer_bytes_max`) → receiver NACKs with `tensor_too_large`.
- Corrupt frame in the middle of a tensor: receiver detects via shape inconsistency or decompression failure → close.
|
| 273 |
+
|
| 274 |
+
---
|
| 275 |
+
|
| 276 |
+
## 8. Cross-references
|
| 277 |
+
|
| 278 |
+
- **Phase 1 M02 Transport**: provides the underlying WebSocket (WSS, TLS, certificate pinning).
- **Phase 2 X06 WebSocket**: defines the WebSocket framing and reconnection semantics that X08 layers on.
- **Phase 2 M16 Tokens**: session tokens authorise tensor transport sessions.
- **Phase 3 M26 Distributed Inference**: the primary consumer; defines purposes like `pipeline.shard.forward`, `pipeline.shard.backward`.
- **Phase 3 X09 Conformance Suite**: includes optional `tensor_transport` section, only run when M26 is enabled.
|
| 283 |
+
|
| 284 |
+
---
|
| 285 |
+
|
| 286 |
+
## 9. Open questions
|
| 287 |
+
|
| 288 |
+
1. **Per-frame encryption.** The `ENCRYPTED` flag is reserved. The use case is post-quantum hardening above TLS, or end-to-end above a federation-relay path that terminates TLS at the relay. Not in v3.0.
|
| 289 |
+
|
| 290 |
+
2. **Adaptive compression.** Fixed zstd level 3 is fine for typical activations. Per-session adaptive level (lower for hot, higher for warm tensors) is plausible. Out of scope.
|
| 291 |
+
|
| 292 |
+
3. **GPU-direct transport.** Activations sit in GPU memory and round-tripping through CPU memory for serialisation is wasteful. Direct GPU-to-network (NVLink/RDMA) is interesting but assumes a specific hardware topology that HearthNet doesn't have. Not in v3.0.
|
| 293 |
+
|
| 294 |
+
4. **Multipath.** Sending tensor chunks over multiple parallel WebSocket sessions to bond bandwidth is appealing but complicates ordering. v3.0 sticks to one session.
|
| 295 |
+
|
| 296 |
+
5. **Sequence wrap.** Practically irrelevant; correctness at wrap is asserted but not battle-tested.
|
| 297 |
+
|
| 298 |
+
6. **Flow control on the wire.** Currently we layer flow control on top of WebSocket, which already has some. The duplication is intentional (we want app-level explicit windowing for backpressure into the inference scheduler) but worth revisiting.
|
| 299 |
+
|
| 300 |
+
---
|
| 301 |
+
|
| 302 |
+
*Last updated: spec v3.0.*
|
docs/p2_p3/X09-conformance-suite.md
ADDED
|
@@ -0,0 +1,316 @@
|
| 1 |
+
# X09: Conformance Suite
|
| 2 |
+
|
| 3 |
+
**Spec version:** v3.0 (*experimental*)
|
| 4 |
+
**Depends on:** Every other module (the suite tests them); no runtime dependency in production
|
| 5 |
+
**Depended on by:** [M32 Protocol Standardisation](../modules/M32-protocol-standard.md)
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Purpose
|
| 10 |
+
|
| 11 |
+
A black-box, implementation-agnostic test suite that defines what "HearthNet-compliant" means in practice. The suite spins up an instance of an implementation, drives it through specified interactions, observes the wire format and the capability behaviour, and produces a `ConformanceReport` (defined in M32).
|
| 12 |
+
|
| 13 |
+
Where the spec documents say "the system MUST do X", the conformance suite contains a test that observes whether the system does X. If a behaviour is described in a spec but not tested by X09, the spec wins in principle but the suite wins in practice, so we treat closing that gap as a continuous effort.
|
| 14 |
+
|
| 15 |
+
The suite is designed so that an alternate implementation (a future Go or Rust HearthNet) can be tested by the same suite. This is the entire point: it makes "interoperable" a measurable property.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 2. Non-goals
|
| 20 |
+
|
| 21 |
+
- **Replacing per-module unit tests.** Each module ships its own unit and property tests as described in its spec. X09 sits one level higher and treats the implementation as a black box.
|
| 22 |
+
- **Performance benchmarks.** Conformance is correctness, not speed. A future X10 may handle benchmarks.
|
| 23 |
+
- **Security audits.** Out of scope. The suite includes some negative-path tests but is not a pen-test.
|
| 24 |
+
- **Visual / UX testing.** The web UI is exercised only via its capability-bus and HTTP API surfaces.
|
| 25 |
+
- **Locking in implementation detail.** Tests assert on observable behaviour (wire formats, capability responses, event log entries), never on internal state.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## 3. File layout
|
| 30 |
+
|
| 31 |
+
The suite lives at repository root, sibling to `hearthnet/` and `protocol/`:
|
| 32 |
+
|
| 33 |
+
```
|
| 34 |
+
conformance/
|
| 35 |
+
├── README.md
├── VERSION                  # suite version, e.g. "1.0.0"
├── pyproject.toml           # standalone tool, runnable without hearthnet/
├── runner.py                # entry point: `python -m conformance.runner --target=...`
├── report.py                # builds ConformanceReport from results
├── harness/
│   ├── target.py            # abstraction over a "system under test" (SUT)
│   ├── docker_target.py     # SUT in a docker container
│   ├── local_target.py      # SUT on the local network at a URL
│   ├── fixtures.py          # synthetic identities, tokens, files
│   └── wire_capture.py      # records bus / WS traffic for diffing
├── suites/
│   ├── core/
│   │   ├── identity/
│   │   ├── transport/
│   │   ├── bus/
│   │   ├── events/
│   │   ├── tokens/
│   │   ├── files/
│   │   ├── kb/
│   │   └── llm/
│   ├── services/
│   │   ├── chat/
│   │   ├── group_chat/
│   │   ├── ocr/
│   │   ├── translation/
│   │   └── stt_tts/
│   ├── federation/
│   ├── experimental/
│   │   ├── distributed_inference/
│   │   ├── moe/
│   │   ├── fedlearn/
│   │   ├── evidence/
│   │   └── civdef/
│   └── operability/
│       ├── shutdown_clean/
│       ├── restart_persistence/
│       └── observability/
└── vectors/                 # test vectors: canonical inputs and expected outputs
    ├── identity/
    ├── tokens/
    ├── federation/
    └── tensor_transport/
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
The whole `conformance/` directory is published as part of every protocol release, with the `VERSION` file aligning with the protocol's release cadence (but versioned independently; see §4.1).
|
| 81 |
+
|
| 82 |
+
---
|
| 83 |
+
|
| 84 |
+
## 4. Architecture
|
| 85 |
+
|
| 86 |
+
### 4.1 Suite versioning
|
| 87 |
+
|
| 88 |
+
Suite version follows semver. `suite_version` in `ConformanceReport` is the suite that produced the report. A protocol version is paired with a *minimum* suite version that is sufficient to test it. Newer suite versions test more thoroughly; older suite versions may not exercise newer protocol features.
|
| 89 |
+
|
| 90 |
+
### 4.2 Target abstraction
|
| 91 |
+
|
| 92 |
+
```python
|
| 93 |
+
class Target(Protocol):
|
| 94 |
+
"""A system under test (SUT). The suite never touches the SUT's internals."""
|
| 95 |
+
base_url: str
|
| 96 |
+
admin_token: AuthToken
|
| 97 |
+
|
| 98 |
+
async def start(self) -> None: ... # for managed targets like docker
|
| 99 |
+
async def stop(self) -> None: ...
|
| 100 |
+
async def reset(self) -> None: ... # blank slate (for tests that need it)
|
| 101 |
+
async def bus_call(self, capability: str, payload: dict) -> dict: ...
|
| 102 |
+
async def event_subscribe(self, types: list[str]) -> AsyncIterator[Event]: ...
|
| 103 |
+
async def http_get(self, path: str, headers: dict | None = None) -> Response: ...
|
| 104 |
+
async def http_post(self, path: str, body: bytes, headers: dict | None = None) -> Response: ...
|
| 105 |
+
async def ws_connect(self, path: str, subprotocol: str | None = None) -> WebSocket: ...
|
| 106 |
+
async def capture_wire(self) -> WireCapture: ... # for federation/tensor tests
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
Two concrete `Target` implementations ship:
|
| 110 |
+
|
| 111 |
+
- `LocalTarget`: SUT runs as a long-lived process accessible at a known URL. Simplest; used in CI against the reference implementation.
|
| 112 |
+
- `DockerTarget`: SUT runs in a docker container that the suite spawns. Useful for testing alternate implementations packaged as containers.
|
| 113 |
+
|
| 114 |
+
Authors of an alternate implementation supply their own `Target` subclass if needed.
|
| 115 |
+
|
| 116 |
+
### 4.3 Test format
|
| 117 |
+
|
| 118 |
+
Tests are plain `pytest` cases under `suites/`. They use the target as an injected fixture:
|
| 119 |
+
|
| 120 |
+
```python
|
| 121 |
+
# suites/core/identity/test_node_id_format.py
import re
|
| 122 |
+
|
| 123 |
+
async def test_node_id_is_base32_no_pad(target: Target) -> None:
|
| 124 |
+
r = await target.bus_call("identity.self.describe", {})
|
| 125 |
+
node_id = r["node_id"]
|
| 126 |
+
assert re.fullmatch(r"[A-Z2-7]+", node_id), "NodeID must be base32 with no padding"
|
| 127 |
+
assert len(node_id) >= 52
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
Each test asserts at most one *spec requirement*. The test docstring names the spec section it covers; the runner uses this to produce traceability from `SectionResult.failures` back to the relevant module spec.
|
| 131 |
+
|
| 132 |
+
### 4.4 Wire vectors
|
| 133 |
+
|
| 134 |
+
For wire-format tests (federation manifest, tensor transport frame, token JWS envelope) the suite carries canonical byte vectors in `vectors/`. Tests assert that:
|
| 135 |
+
|
| 136 |
+
- The SUT, given a known input, produces a byte-equal output (after canonicalisation where applicable).
|
| 137 |
+
- The SUT, given a known byte vector, parses it without errors and produces the expected semantic content.
|
| 138 |
+
|
| 139 |
+
This catches subtle interop bugs: the kind of "we both speak JSON-with-tiny-differences" issue that has historically killed federated systems.
|
| 140 |
+
|
| 141 |
+
### 4.5 Report aggregation
|
| 142 |
+
|
| 143 |
+
After a run, `report.py`:
|
| 144 |
+
|
| 145 |
+
1. Collects per-test pass/fail/skip results.
|
| 146 |
+
2. Groups by suite path → `SectionResult`.
|
| 147 |
+
3. Computes `overall`:
|
| 148 |
+
- `pass` if all `core/*` and all `services/*` sections passed (experimental and operability may fail without affecting `pass`).
|
| 149 |
+
- `partial` if `core/*` passed but anything else failed.
|
| 150 |
+
- `fail` if any `core/*` test failed.
|
| 151 |
+
- `skipped` if no sections ran.
|
| 152 |
+
4. Signs the report with the SUT's identity (the SUT signs its own report; there is no external authority).
|
| 153 |
+
5. Emits `report.json` and a human-readable `report.html`.
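
The `overall` rule in step 3 can be sketched as a pure function. `compute_overall` is a hypothetical name, and the input shape (a mapping from suite path to `"pass"`, `"fail"` or `"skipped"`) is an assumption. The sketch reads the rules as: `experimental/` and `operability/` failures never downgrade the result, while a failure in any other non-core section yields `partial`.

```python
EXEMPT_PREFIXES = ("experimental/", "operability/")


def compute_overall(sections: dict[str, str]) -> str:
    """Derive the report's `overall` field from per-section results."""
    ran = {path: res for path, res in sections.items() if res != "skipped"}
    if not ran:
        return "skipped"
    # Any core failure is fatal.
    if any(res == "fail" for path, res in ran.items() if path.startswith("core/")):
        return "fail"
    # Assumption: only non-exempt sections can downgrade `pass` to `partial`.
    if any(res == "fail" for path, res in ran.items()
           if not path.startswith(EXEMPT_PREFIXES)):
        return "partial"
    return "pass"
```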
|
| 154 |
+
|
| 155 |
+
### 4.6 Reproducibility
|
| 156 |
+
|
| 157 |
+
Every run produces a `run_manifest.json` containing:
|
| 158 |
+
|
| 159 |
+
- Suite version, suite git commit.
|
| 160 |
+
- Target type and configuration (without secrets).
|
| 161 |
+
- Random seed (suite seeds all RNGs deterministically for reproducibility).
|
| 162 |
+
- Test selection (which suites/tests were run vs skipped).
|
| 163 |
+
- Timestamps.
|
| 164 |
+
|
| 165 |
+
Replaying with the same manifest against the same SUT version must produce equivalent results modulo timestamps.
|
| 166 |
+
|
| 167 |
+
---
|
| 168 |
+
|
| 169 |
+
## 5. Required sections
|
| 170 |
+
|
| 171 |
+
A claim of "HearthNet-compliant at protocol version 3.0.0" requires passing **every test** under:
|
| 172 |
+
|
| 173 |
+
- `suites/core/identity/`
|
| 174 |
+
- `suites/core/transport/`
|
| 175 |
+
- `suites/core/bus/`
|
| 176 |
+
- `suites/core/events/`
|
| 177 |
+
- `suites/core/tokens/`
|
| 178 |
+
- `suites/core/files/`
|
| 179 |
+
- `suites/core/kb/` *(minimum: ingest, query)*
|
| 180 |
+
- `suites/core/llm/` *(minimum: chat capability, error handling)*
|
| 181 |
+
|
| 182 |
+
Plus passing the relevant *advertised-capability* sections under `suites/services/` for any service the implementation advertises. An implementation advertising `chat.thread.*` but not running `suites/services/chat/` is non-compliant by omission.
|
| 183 |
+
|
| 184 |
+
Federation is required for any implementation that advertises federation; otherwise it's optional. Experimental sections are *always* optional and `partial` is a valid honest outcome.
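
The "non-compliant by omission" rule can be expressed as a set difference between the sections that must pass and the sections that did. This is a minimal sketch: `missing_sections` is a hypothetical helper, and it assumes the caller has already mapped advertised capabilities to service names (for example `chat.thread.*` to `chat`), a mapping this section does not define.

```python
# Core sections listed above as mandatory for a compliance claim.
REQUIRED_CORE = ("identity", "transport", "bus", "events",
                 "tokens", "files", "kb", "llm")


def missing_sections(advertised_services: set[str],
                     passed_sections: set[str]) -> list[str]:
    """Return required suite paths that have no passing run.

    advertised_services: service names derived from advertised capabilities
    (assumption: derivation happens elsewhere).
    passed_sections: suite paths whose tests all passed.
    """
    required = [f"suites/core/{name}/" for name in REQUIRED_CORE]
    required += [f"suites/services/{name}/" for name in sorted(advertised_services)]
    return [path for path in required if path not in passed_sections]
```

An empty result is necessary for a compliance claim under this reading; a non-empty result names exactly which sections were failed or never run.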
|
| 185 |
+
|
| 186 |
+
---
|
| 187 |
+
|
| 188 |
+
## 6. Behaviour
|
| 189 |
+
|
| 190 |
+
### 6.1 Pre-flight
|
| 191 |
+
|
| 192 |
+
Before running tests, the runner:
|
| 193 |
+
|
| 194 |
+
1. Confirms `target.start()` succeeded.
|
| 195 |
+
2. Calls `protocol.self_describe` and `protocol.version_list` to discover what to test.
|
| 196 |
+
3. Confirms `protocol_version` returned by the SUT is compatible with the suite's supported versions; if not, fails fast with `protocol_version_unsupported`.
|
| 197 |
+
4. Resets the SUT (`target.reset()`).
|
| 198 |
+
5. Loads vector files into memory.
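
The first four pre-flight steps might look like the following sketch. Several things here are assumptions rather than part of the spec: the `Target.call` method, the shape of the `protocol.self_describe` response (`protocol_version`, `capabilities`), the `target_start_failed` error code, and the rule that "compatible" means "same major version". Vector loading (step 5) is omitted.

```python
import asyncio
from typing import Any, Protocol

SUPPORTED_MAJOR_VERSIONS = {3}  # assumption: compatibility is decided by major version


class PreflightError(RuntimeError):
    """Raised when pre-flight cannot proceed; the message is the error code."""


class Target(Protocol):
    # Hypothetical interface; only start() and reset() are named in the spec text.
    async def start(self) -> bool: ...
    async def call(self, capability: str) -> dict[str, Any]: ...
    async def reset(self) -> None: ...


async def preflight(target: Target) -> dict[str, Any]:
    # Step 1: the target must have started.
    if not await target.start():
        raise PreflightError("target_start_failed")  # hypothetical code
    # Step 2: discover what to test.
    described = await target.call("protocol.self_describe")
    versions = await target.call("protocol.version_list")
    # Step 3: fail fast on an unsupported protocol version.
    major = int(str(described["protocol_version"]).split(".")[0])
    if major not in SUPPORTED_MAJOR_VERSIONS:
        raise PreflightError("protocol_version_unsupported")
    # Step 4: start every run from a clean SUT.
    await target.reset()
    return {"capabilities": described.get("capabilities", []), "versions": versions}


def run_preflight(target: Target) -> dict[str, Any]:
    """Synchronous convenience wrapper for callers outside an event loop."""
    return asyncio.run(preflight(target))
```

The version check runs before `reset()` so that an incompatible SUT is never modified by the suite.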
|
| 199 |
+
|
| 200 |
+
### 6.2 Test isolation
|
| 201 |
+
|
| 202 |
+
Each test must be independent; execution order must not matter. Tests that need a clean slate request `target.reset()` in a fixture; tests that need shared state declare it via pytest fixtures with explicit scope.
|
| 203 |
+
|
| 204 |
+
Tests use synthetic identities and tokens generated per-test, never the real operator's keys.
|
| 205 |
+
|
| 206 |
+
### 6.3 Graceful skipping
|
| 207 |
+
|
| 208 |
+
A test that requires a capability not advertised by the SUT is *skipped*, not failed:
|
| 209 |
+
|
| 210 |
+
```python
@requires_capability("experimental.fedlearn.round.announce")
async def test_fedlearn_round_announce_signs_manifest(target: Target) -> None:
    ...
```
|
| 215 |
+
|
| 216 |
+
`requires_capability` queries `protocol.self_describe`. Skipped tests appear in the report as `skipped` with a reason. They never flip `overall` to `fail`.
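
A decorator with this behaviour could be sketched as follows. This is not the suite's actual implementation: it assumes the target exposes an async `call` method, that `protocol.self_describe` returns a `capabilities` list of exact capability names (no wildcard matching), and it uses the standard library's `unittest.SkipTest` as a stand-in for whatever skip mechanism the runner uses.

```python
import functools
import unittest
from typing import Any, Awaitable, Callable

AsyncTest = Callable[..., Awaitable[None]]


def requires_capability(capability: str) -> Callable[[AsyncTest], AsyncTest]:
    """Skip the decorated async test unless the SUT advertises `capability`."""
    def decorate(test: AsyncTest) -> AsyncTest:
        @functools.wraps(test)
        async def wrapper(target: Any, *args: Any, **kwargs: Any) -> None:
            # Hypothetical call shape; the real Target API may differ.
            described = await target.call("protocol.self_describe")
            if capability not in described.get("capabilities", []):
                # A skip with a reason, never a failure.
                raise unittest.SkipTest(f"capability not advertised: {capability}")
            await test(target, *args, **kwargs)
        return wrapper
    return decorate
```

Because the check happens inside the wrapper, it runs against the live SUT at test time rather than at import time, so the same test module can be reused across targets with different capability sets.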
|
| 217 |
+
|
| 218 |
+
### 6.4 Wire capture mode
|
| 219 |
+
|
| 220 |
+
For federation and tensor-transport sections, the runner may attach a `WireCapture` to record the raw bytes flowing between two SUT instances (or between an SUT and the suite's own simulator). The captured frames are checked against vectors and against the schema documented in the relevant cross-cutting spec.
|
| 221 |
+
|
| 222 |
+
Wire-capture mode requires the operator to have configured the SUT to log raw traffic to a known location (typically a Unix socket the suite reads). For SUTs that can't expose raw traffic, the suite falls back to behavioural assertions only and notes `partial` if wire vectors couldn't be verified.
|
| 223 |
+
|
| 224 |
+
### 6.5 Operability sections
|
| 225 |
+
|
| 226 |
+
`operability/` sections test resilience properties:
|
| 227 |
+
|
| 228 |
+
- `shutdown_clean`: send SIGTERM (or container stop); verify no events are lost, the audit chain (M31) verifies, and the SUT restarts cleanly.
|
| 229 |
+
- `restart_persistence`: data created before restart is queryable after restart.
|
| 230 |
+
- `observability`: standard event types fire as expected; X03 observability conformance.
|
| 231 |
+
|
| 232 |
+
### 6.6 Reporting failures
|
| 233 |
+
|
| 234 |
+
Each failure records:
|
| 235 |
+
|
| 236 |
+
- Spec section reference (e.g. `M14 §5.2 canonicalisation`).
|
| 237 |
+
- The actual observed value or behaviour.
|
| 238 |
+
- The expected value or behaviour.
|
| 239 |
+
- A reproduction recipe (capability call + payload, or wire vector identifier).
|
| 240 |
+
|
| 241 |
+
This is what makes a `partial` report useful: the failures are debuggable.
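
As an illustration, the four fields above map naturally onto a small immutable record. The class name, field names, and JSON layout below are hypothetical; the report schema in `protocol/` remains the authority on the real shape.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FailureRecord:
    # Hypothetical field names mirroring the four bullets above.
    spec_ref: str   # e.g. "M14 §5.2 canonicalisation"
    observed: str   # what the SUT actually did or returned
    expected: str   # what the spec requires
    repro: str      # capability call + payload, or wire vector identifier


def failure_to_json(record: FailureRecord) -> str:
    """Serialise with stable key order so reports diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)


def failure_from_json(text: str) -> FailureRecord:
    """Inverse of failure_to_json; raises TypeError on unknown or missing keys."""
    return FailureRecord(**json.loads(text))
```

Keeping `observed` and `expected` as separate fields, rather than one free-text message, lets tooling render a diff between them.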
|
| 242 |
+
|
| 243 |
+
---
|
| 244 |
+
|
| 245 |
+
## 7. Configuration
|
| 246 |
+
|
| 247 |
+
```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ConformanceConfig:
    target_kind: Literal["local", "docker", "custom"] = "local"
    target_url: str = "http://127.0.0.1:7900"
    target_admin_token: str | None = None  # acquired out-of-band
    docker_image: str | None = None
    suite_filter: tuple[str, ...] = ()     # glob patterns; empty = all required + advertised
    skip_experimental: bool = False
    skip_operability: bool = False
    wire_capture: bool = False
    output_dir: str = "./conformance-report"
    parallel: int = 1                      # 1 by default to avoid test-isolation surprises
    seed: int = 0xC0FFEE
```
|
| 262 |
+
|
| 263 |
+
A typical CI invocation:
|
| 264 |
+
|
| 265 |
+
```
python -m conformance.runner \
    --target=docker \
    --docker-image=hearthnet:latest \
    --output-dir=./report
```
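
For illustration, the three flags in this invocation could be parsed with `argparse` roughly as follows. This is a sketch, not the runner's real entry point: only the flags shown above are covered, the defaults are taken from the configuration defaults, and the rule that `--target=docker` requires `--docker-image` is an assumption.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="conformance.runner")
    parser.add_argument("--target", choices=("local", "docker", "custom"),
                        default="local", help="kind of SUT to test")
    parser.add_argument("--docker-image", default=None,
                        help="image to start when --target=docker")
    parser.add_argument("--output-dir", default="./conformance-report",
                        help="directory that receives the report files")
    return parser


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a docker target is meaningless without an image.
    if args.target == "docker" and args.docker_image is None:
        parser.error("--target=docker requires --docker-image")
    return args
```

`argparse` exposes `--docker-image` as `args.docker_image` and `--output-dir` as `args.output_dir`, which lines up with the corresponding configuration field names.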
|
| 271 |
+
|
| 272 |
+
---
|
| 273 |
+
|
| 274 |
+
## 8. Tests of the suite itself
|
| 275 |
+
|
| 276 |
+
The suite has its own tests, kept under `conformance/tests/`:
|
| 277 |
+
|
| 278 |
+
- `test_runner_smoke`: runs the suite against the reference impl and expects `overall=pass` for `core/*`.
|
| 279 |
+
- `test_skip_logic`: tests for capabilities the SUT does not advertise are skipped, not failed.
|
| 280 |
+
- `test_seed_deterministic`: given a seed, two consecutive runs produce identical reports modulo timestamps.
|
| 281 |
+
- `test_report_schema`: the generated `ConformanceReport` validates against the schema in `protocol/`.
|
| 282 |
+
- `test_vector_integrity`: every file in `vectors/` parses with the canonical loader.
|
| 283 |
+
- `test_known_partial`: runs the reference impl with `experimental.*` disabled and verifies that `overall` is `pass`, not `partial`. Experimental tests are skipped rather than failed in this configuration, and skips do not change `overall`.
|
| 284 |
+
|
| 285 |
+
---
|
| 286 |
+
|
| 287 |
+
## 9. Cross-references
|
| 288 |
+
|
| 289 |
+
- **M32 Protocol Standardisation**: consumes `ConformanceReport`; the suite is the source of "what conformance means".
|
| 290 |
+
- **Every module spec**: the suite's tests reference the spec section they verify.
|
| 291 |
+
- **X02 Event Log, X03 Observability**: operability tests assert on these.
|
| 292 |
+
- **X06 WebSocket, X08 Tensor Transport**: wire-capture vectors live in `vectors/`.
|
| 293 |
+
|
| 294 |
+
---
|
| 295 |
+
|
| 296 |
+
## 10. Open questions
|
| 297 |
+
|
| 298 |
+
1. **Adversarial tests.** v3.0 has minimal negative-path coverage in `core/`. A future suite version with a `security/` section that probes for known classes of mistakes (auth bypass, signature reuse, event-log forgery attempts) would be valuable. Out of scope for v3.0.
|
| 299 |
+
|
| 300 |
+
2. **Conformance for partial implementations.** The current model gates `pass` on all required `core/*` passing. A future tiered model (`HearthNet-Bronze` = identity + transport + bus only; `HearthNet-Silver` adds services; `HearthNet-Gold` adds federation) is appealing for low-resource implementations. Not in v3.0.
|
| 301 |
+
|
| 302 |
+
3. **Differential testing.** Once two implementations exist, running them side-by-side with the same input and asserting identical observable behaviour is the strongest interop test. The harness supports this in principle (two targets), but no tests in v3.0 actually use it because only one implementation exists.
|
| 303 |
+
|
| 304 |
+
4. **Vector generation.** Today vectors are hand-curated. Tooling to *regenerate* vectors from the reference implementation and detect drift would prevent test rot. Planned, not implemented.
|
| 305 |
+
|
| 306 |
+
5. **Reporting hub.** A public registry that collects published conformance reports from various implementations would help users assess interop status. Out of scope for the suite itself; M32's `protocol.registry.*` capabilities are the closest current analogue.
|
| 307 |
+
|
| 308 |
+
6. **Performance regression guardrails.** Not conformance, but obviously valuable. A separate X10 (TBD) may handle this.
|
| 309 |
+
|
| 310 |
+
7. **Long-haul tests.** Some bugs (memory leaks, slow drifts in audit chains) appear only after hours. The suite is built for short runs; a "soak mode" with `--duration=24h` would test these. Open.
|
| 311 |
+
|
| 312 |
+
8. **Federation interop with non-HearthNet systems.** Out of scope. The suite verifies HearthNet ↔ HearthNet federation only.
|
| 313 |
+
|
| 314 |
+
---
|
| 315 |
+
|
| 316 |
+
*Last updated: spec v3.0.*
|