Commit 46bc545 by Chris4K · verified · 1 Parent(s): f03bdc5
docs/p2_p3/00-OVERVIEW.md ADDED
@@ -0,0 +1,324 @@
1
+ # HearthNet Phase 2: Spec Set Overview
2
+
3
+ **Phase 2 scope:** post-hackathon, 1–3 months of work. Hardens the MVP into something other communities can adopt.
4
+
5
+ **Stance toward Phase 1:** strictly additive. Phase 1 specs are immutable from Phase 2's view. New modules plug into the bus; nothing in Phase 1 needs rewriting. Where a Phase 1 module is "extended", it is *extended*: same public API, with new capabilities or backends added behind the existing facade.
6
+
7
+ ---
8
+
9
+ ## 0. What changes vs Phase 1
10
+
11
+ | Concern | Phase 1 (MVP) | Phase 2 |
12
+ |---------|---------------|---------|
13
+ | Discovery | mDNS + UDP (LAN only) | + DHT (cross-LAN), + relay-assisted NAT traversal |
14
+ | Transport | HTTP/1.1 + SSE long-poll pubsub | + WebSocket upgrade for bidirectional + pubsub |
15
+ | Trust | Per-request signing | + Capability tokens (delegation, federation) |
16
+ | Cross-community | Out of scope | Federation: signed peering, scoped capability access |
17
+ | Encryption | TLS-in-transit + signed-at-rest within community | + E2E (X25519 + ChaCha20-Poly1305) for chat & optionally files |
18
+ | Chat | 1:1 only | + Group chat (`chat.thread.*`), + store-and-forward via anchors |
19
+ | LLM | Text only | + Vision (`llm.chat` with image content), + Tool calls (`tool_call_delta`) |
20
+ | RAG | Digital PDFs | + OCR for scanned PDFs and images, + reranking |
21
+ | Services | LLM, embed, RAG, file, market, chat | + OCR, Translation, STT, TTS, Image generation, Rerank |
22
+ | Mobile | Web view served by anchor | + Native client (Flutter/RN) with push via relay tier |
23
+ | Files | Direct fetch on demand | + Resumable PUT, + Background replication, + At-rest encryption |
24
+ | Relay | None | + Hosted relay tier for NAT traversal, federation discovery, push |
25
+ | Observability | Local Prometheus + ring buffer + optional Trackio | + OTLP export, + Federated metrics aggregation |
26
+
27
+ ---
28
+
29
+ ## 1. Module map (Phase 2 additions)
30
+
31
+ ### New numbered modules
32
+
33
+ | ID | Module | Spec file | Concern |
34
+ |-----|------------------------------|----------------------------------------|-------------------------------------------------------|
35
+ | M14 | Federation | `modules/M14-federation.md` | Cross-community trust, federation manifests, scoped access |
36
+ | M15 | Relay Tier | `modules/M15-relay-tier.md` | Hosted HTTPS relay (NAT traversal, federation discovery, mobile push) |
37
+ | M16 | Capability Tokens | `modules/M16-tokens.md` | Short-lived delegation tokens (OAuth-flavoured) |
38
+ | M17 | OCR Service | `modules/M17-ocr.md` | `ocr.image`, `ocr.pdf`: Tesseract / TrOCR / multilingual |
39
+ | M18 | Translation Service | `modules/M18-translation.md` | `trans.text`: NLLB-backed, DE↔EN↔Plattdeutsch |
40
+ | M19 | Speech I/O | `modules/M19-stt-tts.md` | `stt.transcribe` (Whisper), `tts.synthesize` (XTTS/Edge) |
41
+ | M20 | Vision Services | `modules/M20-vision.md` | `img.describe`, `img.generate`, multimodal LLM input |
42
+ | M21 | Tool Calls | `modules/M21-tool-calls.md` | `tool_call_delta` frames, OpenAI/Anthropic-compatible |
43
+ | M22 | Mobile Native | `modules/M22-mobile-native.md` | Flutter/RN client with push |
44
+ | M23 | E2E Encryption | `modules/M23-e2e-encryption.md` | X25519 + ChaCha20-Poly1305 for chat (and optional files) |
45
+ | M24 | Reranking | `modules/M24-rerank.md` | `rerank.text`: BGE-reranker, used by RAG and search |
46
+ | M25 | Group Chat | `modules/M25-group-chat.md` | `chat.thread.*`: multi-party conversations |
47
+
48
+ ### New cross-cutting modules
49
+
50
+ | ID | Module | Spec file | Concern |
51
+ |-----|-----------------|----------------------------------------|---------------------------------------------------------|
52
+ | X05 | DHT | `cross-cutting/X05-dht.md` | Kademlia-style cross-LAN peer + content discovery |
53
+ | X06 | WebSocket | `cross-cutting/X06-websocket.md` | Bidirectional upgrade for `/bus/v1/call` and `/pubsub` |
54
+ | X07 | Federated Metrics | `cross-cutting/X07-federated-metrics.md` | Optional OTLP export + per-community aggregation |
55
+
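The "Kademlia-style" discovery named for X05 rests on an XOR distance between node IDs. The following is a minimal sketch of that metric only, not the X05 design: the ID derivation (`node_id`, SHA-256 over a public key) is an assumption made for illustration, and `DHT_REPLICATION_K` mirrors the bucket size listed among the Phase 2 constants.

```python
import hashlib
from typing import List

DHT_REPLICATION_K = 8  # bucket size, taken from the Phase 2 constants


def node_id(public_key: bytes) -> int:
    """Hypothetical ID derivation: a 256-bit integer from a hash of the key."""
    return int.from_bytes(hashlib.sha256(public_key).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance metric: bitwise XOR of the two IDs."""
    return a ^ b


def k_closest(target: int, known: List[int], k: int = DHT_REPLICATION_K) -> List[int]:
    """Return up to k known IDs, ordered by XOR distance to the target."""
    return sorted(known, key=lambda n: xor_distance(n, target))[:k]
```

A lookup would repeatedly ask the closest known peers for even closer ones; that iteration is left out here.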
56
+ ### Modifications to Phase 1 modules
57
+
58
+ These do not get new spec files; the Phase 1 spec is *extended* in place at the next major version. The Phase 1 IMPLEMENTATION_REFERENCE will gain entries, flagged in [`IMPLEMENTATION_REFERENCE.md`](IMPLEMENTATION_REFERENCE.md) §0.
59
+
60
+ | Phase 1 module | Extension |
61
+ |----------------|-----------|
62
+ | M04 LLM | New backends gain multimodal + tools support; descriptors carry `modalities`, `tools_supported` flags |
63
+ | M05 RAG | Auto-reindex on embedding model change; hybrid (keyword+dense) search; `rerank.text` integration |
64
+ | M07 File/Blobs | Resumable PUT (server-side partial-transfer index); background replication; at-rest encryption envelope |
65
+ | M10 Chat | Calls into M23 for encryption; calls into M25 for group threads |
66
+ | M02 Discovery | Calls into X05 DHT when peers not found via mDNS/UDP |
67
+ | X01 Transport | Calls into X06 WebSocket on `Upgrade: websocket` header |
68
+ | M09 Emergency | Phase-2 captive-portal probe |
69
+
70
+ ---
71
+
72
+ ## 2. Dependency graph (Phase 2 additions on top of Phase 1)
73
+
74
+ ```
+ ┌──────────────────────────────────────────────┐
+ │             Phase 1 (unchanged)              │
+ │         X04  X03  X02  X01  M01..M13         │
+ └────┬──────────┬──────────┬────────────┬──────┘
+      │          │          │            │
+      ▼          ▼          ▼            ▼
+ ┌─────────┐  ┌─────┐  ┌─────────┐  ┌─────────┐
+ │   X05   │  │ X06 │  │   X07   │  │   M16   │
+ │   DHT   │  │ WS  │  │  Fed-M  │  │ Tokens  │
+ └────┬────┘  └──┬──┘  └─────────┘  └────┬────┘
+      │          │                       │
+      └─────┬────┘                       │
+            ▼                            ▼
+        ┌────────┐                   ┌────────┐
+        │  M14   │◄──────────────────┤  M15   │
+        │Federat.│                   │ Relay  │
+        └───┬────┘                   └────────┘
+            │
+            ▼
+        ┌────────┐
+        │  M22   │
+        │ Mobile │
+        └────────┘
+
+ ┌─── Independent services (each plug into the bus) ───┐
+ │ M17 OCR    M18 Trans    M19 STT/TTS    M20 Vision   │
+ │ M21 Tools (extends M04)    M24 Rerank               │
+ └─────────────────────────────────────────────────────┘
+
+ ┌─── Chat extensions ───┐
+ │ M23 E2E               │
+ │ M25 Group             │
+ └───────────────────────┘
+ ```
109
+
110
+ Hard rules carried over from Phase 1:
111
+ - No service imports another service. Talk via the bus.
112
+ - No layer below the bus imports anything above it.
113
+
114
+ ---
115
+
116
+ ## 3. File tree additions
117
+
118
+ ```
+ hearthnet/
+ ├── federation/  # M14
+ │   ├── __init__.py
+ │   ├── manifest.py
+ │   ├── peering.py
+ │   └── relay_client.py
+ │
+ ├── relay/  # M15 (separate deployable, lives in same repo)
+ │   ├── __init__.py
+ │   ├── server.py
+ │   ├── nat_traversal.py
+ │   ├── push.py
+ │   └── tier.py
+ │
+ ├── identity/
+ │   └── tokens.py  # M16 (now real, not stub)
+ │
+ ├── dht/  # X05
+ │   ├── __init__.py
+ │   ├── kademlia.py
+ │   ├── routing.py
+ │   └── storage.py
+ │
+ ├── transport/
+ │   └── websocket.py  # X06
+ │
+ ├── observability/
+ │   └── federated.py  # X07
+ │
+ ├── crypto/  # M23 (new top-level)
+ │   ├── __init__.py
+ │   ├── ratchet.py
+ │   ├── kem.py
+ │   └── envelope.py
+ │
+ ├── services/
+ │   ├── ocr/  # M17
+ │   │   ├── __init__.py
+ │   │   ├── service.py
+ │   │   └── backends/
+ │   │       ├── tesseract.py
+ │   │       ├── trocr.py
+ │   │       └── multilingual.py
+ │   ├── translation/  # M18
+ │   │   ├── __init__.py
+ │   │   ├── service.py
+ │   │   └── backends/
+ │   │       ├── nllb.py
+ │   │       └── plattdeutsch.py
+ │   ├── speech/  # M19
+ │   │   ├── __init__.py
+ │   │   ├── stt_service.py
+ │   │   ├── tts_service.py
+ │   │   └── backends/
+ │   │       ├── whisper.py
+ │   │       ├── xtts.py
+ │   │       └── edge_tts.py
+ │   ├── image/  # M20
+ │   │   ├── __init__.py
+ │   │   ├── describe_service.py
+ │   │   ├── generate_service.py
+ │   │   └── backends/
+ │   │       ├── florence2.py
+ │   │       ├── minicpm_v.py
+ │   │       └── flux.py
+ │   ├── rerank/  # M24
+ │   │   ├── __init__.py
+ │   │   ├── service.py
+ │   │   └── backends/
+ │   │       └── bge_reranker.py
+ │   ├── llm/
+ │   │   └── tools.py  # M21 (extends M04)
+ │   ├── chat/
+ │   │   ├── encryption.py  # M23 hook
+ │   │   ├── thread_service.py  # M25
+ │   │   └── thread_views.py  # M25
+ │   └── file/
+ │       ├── resume.py  # M07 extension
+ │       └── replication.py  # M07 extension
+
+ mobile-native/  # M22: separate codebase, Flutter project
+ └── (lives in /mobile-native, not in the Python package)
+ ```
202
+
203
+ ---
204
+
205
+ ## 4. Canonical conventions (delta from Phase 1)
206
+
207
+ ### 4.1 New type aliases
208
+
209
+ ```python
210
+ # additions to hearthnet/types.py
211
+
212
+ TokenID = str # ULID
213
+ ThreadID = str # ULID
214
+ FederationID = str # composite: "<community_a>:<community_b>"
215
+ TensorChunkID = str # blake3:<hex>, used in M26/X08 (Phase 3) only
216
+ PushDeviceID = str # opaque, assigned by relay tier
217
+ RatchetEpoch = int # per-thread monotonic
218
+ EncryptedPayload = bytes # ciphertext, base64 in JSON
219
+
220
+ # Extended Literal types:
221
+ TrustLevel = Literal["unknown","member","trusted","anchor","federated"]
222
+ Stability = Literal["experimental","beta","stable","deprecated"]
223
+ ```
224
+
225
+ ### 4.2 New constants
226
+
227
+ ```python
228
+ # additions to hearthnet/constants.py
229
+
230
+ TOKEN_DEFAULT_TTL_SECONDS = 3600
231
+ TOKEN_MAX_TTL_SECONDS = 86400
232
+ FEDERATION_MANIFEST_TTL_SECONDS = 86400
233
+ FEDERATION_HEARTBEAT_SECONDS = 300
234
+ DHT_REPLICATION_K = 8 # bucket size
235
+ DHT_ALPHA = 3 # concurrent lookups
236
+ DHT_REFRESH_SECONDS = 3600
237
+ DHT_REPUBLISH_SECONDS = 86400
238
+ WEBSOCKET_PING_SECONDS = 30
239
+ WEBSOCKET_IDLE_CLOSE_SECONDS = 120
240
+ RELAY_REGISTRATION_TTL_SECONDS = 7200
241
+ RELAY_PUSH_RETRY_MAX = 5
242
+ E2E_RATCHET_MAX_OUT_OF_ORDER = 32
243
+ E2E_RATCHET_REKEY_AFTER_MESSAGES = 100
244
+ E2E_PREKEY_BUNDLE_SIZE = 20
245
+ OCR_DEFAULT_DPI = 300
246
+ OCR_MAX_PAGES_PER_REQUEST = 50
247
+ TRANSLATION_MAX_CHARS = 4000
248
+ STT_MAX_AUDIO_SECONDS = 300
249
+ TTS_MAX_TEXT_CHARS = 5000
250
+ RERANK_MAX_DOCS = 100
251
+ FILE_REPLICATION_DESIRED_COPIES = 3
252
+ FILE_RESUME_PARTIAL_TTL_SECONDS = 3600
253
+ ```
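As an illustration of how the two token TTL constants might interact, here is a minimal sketch. The helper names (`effective_ttl`, `expiry`) are hypothetical, and clamping an over-long request to the maximum (rather than rejecting it) is an assumption; M16 may specify different behaviour.

```python
from typing import Optional

TOKEN_DEFAULT_TTL_SECONDS = 3600
TOKEN_MAX_TTL_SECONDS = 86400


def effective_ttl(requested: Optional[int] = None) -> int:
    """Pick the TTL for a new token: the default when unspecified, capped at the max."""
    if requested is None:
        return TOKEN_DEFAULT_TTL_SECONDS
    if requested <= 0:
        raise ValueError("TTL must be positive")
    # Assumption: over-long requests are clamped, not refused.
    return min(requested, TOKEN_MAX_TTL_SECONDS)


def expiry(issued_at: int, requested_ttl: Optional[int] = None) -> int:
    """Unix timestamp after which the token would count as expired."""
    return issued_at + effective_ttl(requested_ttl)
```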
254
+
255
+ ### 4.3 Capability namespace allocations (Phase 2 promotes from reserved)
256
+
257
+ | Prefix | Status in Phase 1 | Status in Phase 2 |
258
+ |--------|-------------------|---------------------|
259
+ | `federation.*` | beta (reserved) | stable |
260
+ | `ocr.*` | reserved | stable |
261
+ | `trans.*` | reserved | stable |
262
+ | `stt.*` `tts.*` | reserved | stable |
263
+ | `img.*` | reserved | stable |
264
+ | `rerank.*` | (new) | stable |
265
+ | `chat.thread.*` | reserved | stable |
266
+ | `chat.forward.*` | reserved | stable |
267
+ | `file.put.resume@1.0` | (new) | stable |
268
+
269
+ ---
270
+
271
+ ## 5. Build order (Phase 2)
272
+
273
+ | Step | Modules / extensions | What you can demo |
274
+ |------|----------------------------------|-------------------------------------------------|
275
+ | P2-1 | M16 Tokens | Delegate one capability call via a token |
276
+ | P2-2 | X05 DHT (basic) | Two LANs find each other through a public DHT |
277
+ | P2-3 | M14 Federation | Two communities cross-sign, query each other |
278
+ | P2-4 | M15 Relay Tier | NAT'd peers reach each other via your relay |
279
+ | P2-5 | X06 WebSocket | Lower-latency pubsub |
280
+ | P2-6 | M24 Rerank | RAG queries get better answers |
281
+ | P2-7 | M17 OCR + M05 RAG hook | Scanned PDFs become searchable |
282
+ | P2-8 | M18 Translation | DE → EN of a marketplace post |
283
+ | P2-9 | M19 STT/TTS | "Sprich mit HearthNet" voice query |
284
+ | P2-10 | M21 Tool calls + M04 ext | LLM can call `rag.query` as a tool |
285
+ | P2-11 | M20 Vision | "Was siehst du auf diesem Bild?" |
286
+ | P2-12 | M23 E2E + M10 ext | Chat is now end-to-end encrypted |
287
+ | P2-13 | M25 Group chat | Three-way conversation |
288
+ | P2-14 | M07 ext (resume, replication, encrypt) | Bigger files, more resilient |
289
+ | P2-15 | M22 Mobile native | iOS / Android app on a real phone |
290
+ | P2-16 | X07 Federated metrics + observability polish | Real dashboards for operators |
291
+
292
+ Each step is independently demoable. None of them depends on changes to Phase 1; they all attach via the bus.
293
+
294
+ ---
295
+
296
+ ## 6. Spec versioning
297
+
298
+ - Capability Contract bumps to **v2.0** (additive within Phase 2; major bump only on breaking changes).
299
+ - Contract version in node manifests becomes `"2.0"`; peers running Phase 1 see `contract_version=2.0` and reject the manifest with `schema_mismatch` unless they have a compatibility shim.
300
+ - **Compatibility shim:** a Phase 2 node may negotiate down for Phase 1 peers by serving a v1.0 view of its manifest at `/manifest?contract_version=1.0`. Optional. Phase 2 SHOULD include the shim for one minor release window.
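The version check and the negotiate-down request described above could look roughly like this from the side of a peer that only understands v1.0. This is a sketch under assumptions: the function names are hypothetical, and only the `contract_version` field, the `schema_mismatch` code, and the `/manifest?contract_version=1.0` path come from the text.

```python
from typing import Optional, Set
from urllib.parse import urlencode


def manifest_path(contract_version: Optional[str] = None) -> str:
    """Path to request a manifest, optionally pinned to a contract version."""
    if contract_version is None:
        return "/manifest"
    return "/manifest?" + urlencode({"contract_version": contract_version})


def check_manifest(manifest: dict, supported: Set[str]) -> str:
    """Return "ok" if the peer understands the manifest, else "schema_mismatch"."""
    if manifest.get("contract_version") in supported:
        return "ok"
    return "schema_mismatch"
```

A v1.0-only peer that gets `schema_mismatch` for a `"2.0"` manifest could then retry with `manifest_path("1.0")`, which only succeeds if the remote node ships the shim.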
301
+
302
+ ---
303
+
304
+ ## 7. What is intentionally NOT in Phase 2
305
+
306
+ Pushed to Phase 3 (see [`../phase-3/00-OVERVIEW.md`](../phase-3/00-OVERVIEW.md)):
307
+
308
+ - Distributed-tensor inference (Petals-style)
309
+ - MoE expert routing
310
+ - Federated learning on LoRA layers
311
+ - LoRa long-distance beacons
312
+ - EBKH evidence layer integration
313
+ - Civil-defence pilot
314
+ - Protocol-standardisation work
315
+ - Conformance test suite for multi-implementation interop
316
+
317
+ ---
318
+
319
+ ## 8. Out-of-band documents (Phase 2)
320
+
321
+ - **THREAT_MODEL_v2.md**: formal security write-up for federation + E2E + tokens
322
+ - **RELAY_OPERATIONS.md**: for whoever runs `relay.hearthnet.de` (likely Christof on Hetzner)
323
+ - **MOBILE_BUILD.md**: Flutter build, code-signing, store-submission notes
324
+ - **MIGRATION_v1_to_v2.md**: for existing Phase-1 communities upgrading
docs/p2_p3/00-OVERVIEW_p3.md ADDED
@@ -0,0 +1,288 @@
1
+ # HearthNet Phase 3: Spec Set Overview
2
+
3
+ **Phase 3 scope:** research-shaped, 6–12 months. This is where HearthNet stops being a product and starts being a protocol. Each module here is an investment in a long-term capability where the engineering is the easy part; the hard part is establishing trust, governance, and standards.
4
+
5
+ **Stance:** Phase 3 specs are **roadmaps**, not contracts. Where a Phase-1/2 spec answers "what does this *do*?", a Phase-3 spec answers "what would we *build* if we were ready to commit?". Concrete enough to start, loose enough to be wrong about details without invalidating the direction.
6
+
7
+ ---
8
+
9
+ ## 0. Reading these specs
10
+
11
+ Phase 3 specs deviate from the Phase 1 / 2 template in three respects:
12
+
13
+ 1. **Stability tag is `experimental` for new capabilities** unless explicitly promoted later. Mesh nodes ignore experimental capabilities unless the operator opts in via `policy.research.enable = true`.
14
+ 2. **Each module carries an "Open research questions" section** that is longer than the spec itself, by design. Phase 3 modules answer *some* of their open questions before shipping; the rest stay open.
15
+ 3. **Acceptance criteria are described, not enumerated**. The point isn't to grade an implementation against a checklist; it's to say "we'll know this is working when…"
16
+
17
+ If you read a Phase 3 spec and feel uncertain about how something works, that uncertainty is faithful to the state of the work. The spec is doing its job by being honest about that.
18
+
19
+ ---
20
+
21
+ ## 1. Module map (Phase 3)
22
+
23
+ ### New numbered modules
24
+
25
+ | ID | Module | Spec file | Concern |
26
+ |-----|------------------------------|-------------------------------------------------|----------------------------------------------------------------------|
27
+ | M26 | Distributed Inference | `modules/M26-distributed-inference.md` | Layer-sharded LLMs across nodes (Petals-style), small models only |
28
+ | M27 | MoE Expert Routing | `modules/M27-moe-routing.md` | Route queries to the right expert (machine or human) via learned scorer |
29
+ | M28 | Federated Learning | `modules/M28-fedlearn.md` | FedAvg on LoRA layers; per-community fine-tuning without sharing data |
30
+ | M29 | LoRa Long-Distance Beacons | `modules/M29-lora-beacons.md` | 868 MHz "community alive" beacons; no AI traffic; emergency-only |
31
+ | M30 | Evidence / EBKH | `modules/M30-evidence-ebkh.md` | Claim graph alongside the event log; provenance + verifiability |
32
+ | M31 | Civil Defence Pilot | `modules/M31-civil-defense.md` | THW / DRK / KatS bridge; compliance profile; audit trail |
33
+ | M32 | Protocol Standardisation | `modules/M32-protocol-standard.md` | Reference implementation, conformance suite, governance for the spec |
34
+
35
+ ### New cross-cutting modules
36
+
37
+ | ID | Module | Spec file | Concern |
38
+ |-----|-----------------------|---------------------------------------------------|------------------------------------------------------|
39
+ | X08 | Tensor Transport | `cross-cutting/X08-tensor-transport.md` | High-throughput chunked tensor passing for M26 |
40
+ | X09 | Conformance Suite | `cross-cutting/X09-conformance-suite.md` | Black-box tests defining what "HearthNet-compliant" means |
41
+
42
+ ### Modifications to earlier modules
43
+
44
+ | Phase 1/2 module | Phase 3 extension |
45
+ |------------------|-------------------|
46
+ | M03 Bus | Optional MoE routing layer between dispatcher and handler (M27) |
47
+ | M04 LLM | Optional `experimental.distributed_llm.chat@1.0` backend (M26) |
48
+ | X02 Event log | Optional `evidence.*` claim records side-by-side with events (M30) |
49
+ | M14 Federation | Federated learning rounds use federation as the trust substrate (M28) |
50
+ | X03 Observability | Per-call expert-routing trace; per-shard tensor-transport metrics (M27, X08) |
51
+
52
+ ---
53
+
54
+ ## 2. Dependency graph (Phase 3 additions on top of Phases 1–2)
55
+
56
+ ```
+ ┌─────────────────────────────────────────────┐
+ │        Phase 1 + Phase 2 (unchanged)        │
+ └────┬────────────────┬──────────────┬────────┘
+      │                │              │
+      ▼                ▼              ▼
+ ┌──────────┐     ┌──────────┐   ┌──────────┐
+ │   X08    │     │   M27    │   │   M30    │
+ │  Tensor  │     │   MoE    │   │   EBKH   │
+ │ Transp.  │     │ Routing  │   │ Evidence │
+ └────┬─────┘     └────┬─────┘   └────┬─────┘
+      ▼                │              │
+ ┌──────────┐          │              │
+ │   M26    │          │              │
+ │ Distrib. │          │              │
+ │  Infer.  │          │              │
+ └──────────┘          │              │
+                       ▼              ▼
+                  ┌──────────┐   ┌──────────┐
+                  │   M28    │   │   M31    │
+                  │ FedLearn │   │ CivDef.  │
+                  └──────────┘   └──────────┘
+
+ Standalone (no software deps, governance / hardware):
+ ┌──────────┐
+ │   M29    │  (hardware)
+ │   LoRa   │
+ │ Beacons  │
+ └──────────┘
+ ┌──────────┐
+ │   X09    │  (process)
+ │ Conform. │
+ └──────────┘
+ ┌──────────┐
+ │   M32    │  (governance)
+ │ Standard │
+ └──────────┘
+ ```
94
+
95
+ Most Phase 3 modules are independent of each other. The exceptions:
96
+ - M26 depends on X08
97
+ - M27 informs M26 (MoE routing picks which expert/shard)
98
+ - M28 reuses M14 federation for cross-community rounds
99
+ - M31 reuses M30 for evidence-grade emergency claims
100
+
101
+ ---
102
+
103
+ ## 3. File tree additions
104
+
105
+ ```
+ hearthnet/
+ ├── distributed_inference/  # M26
+ │   ├── __init__.py
+ │   ├── shard.py
+ │   ├── pipeline.py
+ │   ├── routing.py
+ │   └── backends/
+ │       ├── petals_like.py
+ │       └── small_model_layered.py
+ │
+ ├── moe/  # M27
+ │   ├── __init__.py
+ │   ├── router.py
+ │   ├── scorer.py
+ │   └── human_in_the_loop.py
+ │
+ ├── fedlearn/  # M28
+ │   ├── __init__.py
+ │   ├── coordinator.py
+ │   ├── round.py
+ │   ├── lora_diff.py
+ │   └── aggregation.py
+ │
+ ├── lora_beacons/  # M29: hardware integration; tiny Python surface
+ │   ├── __init__.py
+ │   ├── beacon_bridge.py  # serial protocol to a LoRa USB stick
+ │   └── policy.py
+ │
+ ├── evidence/  # M30
+ │   ├── __init__.py
+ │   ├── claim.py
+ │   ├── claim_graph.py
+ │   ├── provenance.py
+ │   └── ebkh_bridge.py  # bridge to Christof's EBKH v3+
+ │
+ ├── civil_defense/  # M31
+ │   ├── __init__.py
+ │   ├── profile.py  # THW / DRK / KatS member types
+ │   ├── audit.py
+ │   └── nrw_katastrophenschutz.py
+ │
+ ├── transport/
+ │   └── tensor.py  # X08
+ │
+ └── conformance/  # X09
+     ├── __init__.py
+     ├── runner.py
+     ├── suites/
+     │   ├── identity.py
+     │   ├── transport.py
+     │   ├── bus.py
+     │   ├── services.py
+     │   └── federation.py
+     └── report.py
+
+ protocol/  # M32: separate top-level dir at repo root
+ ├── README.md
+ ├── spec/  # the protocol spec, decoupled from the impl
+ │   ├── 00-overview.md  # mirror of CAPABILITY_CONTRACT but
+ │   ├── 01-identity.md  # implementation-agnostic
+ │   └── ...
+ └── governance/
+     ├── CHANGELOG.md
+     ├── CONTRIBUTING.md
+     └── ROADMAP.md
+ ```
172
+
173
+ ---
174
+
175
+ ## 4. Conventions delta from Phase 2
176
+
177
+ ### 4.1 New `experimental` namespace
178
+
179
+ A Phase-3 capability MAY be advertised as `experimental.<name>@<ver>`. Mesh nodes default to **not registering** experimental capabilities; the operator must opt in via:
180
+
181
+ ```toml
182
+ [policy.research]
183
+ enable = true
184
+ enabled_capabilities = ["experimental.distributed_llm.chat@1.0", "experimental.fedlearn.round.*"]
185
+ ```
186
+
187
+ Once a capability is sufficiently proven, it is promoted out of the `experimental.` prefix in a contract bump.
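The opt-in rule in this section can be sketched as a small gate function. This is an illustration under assumptions, not the registry's actual logic: `is_enabled` is a hypothetical name, the policy is taken as the already-parsed `[policy.research]` table, and the trailing `*` is assumed to have shell-style glob semantics.

```python
from fnmatch import fnmatchcase


def is_enabled(capability: str, research_policy: dict) -> bool:
    """Decide whether a node should register a capability under the research policy."""
    if not capability.startswith("experimental."):
        return True  # this policy only gates the experimental namespace
    if not research_policy.get("enable", False):
        return False  # default: experimental capabilities are not registered
    patterns = research_policy.get("enabled_capabilities", [])
    # Assumption: entries are glob patterns, matched case-sensitively.
    return any(fnmatchcase(capability, pattern) for pattern in patterns)
```

With the TOML above parsed into a dict, `experimental.distributed_llm.chat@1.0` would be registered, while an experimental capability that matches no listed pattern would stay hidden.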
188
+
189
+ ### 4.2 New type aliases
190
+
191
+ ```python
192
+ # additions to hearthnet/types.py
193
+
194
+ ShardID = str # "<model_id>:<layer_range>"
195
+ ExpertID = str # opaque, refers to a routable subsystem
196
+ ClaimID = str # ULID
197
+ RoundID = str # fedlearn round identifier (ULID)
198
+ LoraBeaconID = str # 8-byte hex, hardware-issued
199
+ EvidenceLevel = Literal["unverified","cited","cross_referenced","attested","disputed"]
200
+ ExpertKind = Literal["model","human","service","external"]
201
+ ```
202
+
203
+ ### 4.3 New constants
204
+
205
+ ```python
206
+ # additions to hearthnet/constants.py (Phase 3)
207
+
208
+ # Distributed inference (M26)
209
+ DISTRIBUTED_MAX_SHARDS_PER_REQUEST = 16
210
+ DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S = 30
211
+ DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES = 2
212
+
213
+ # MoE routing (M27)
214
+ MOE_ROUTER_TOP_K = 3
215
+ MOE_ROUTER_TRAIN_MIN_EXAMPLES = 200
216
+ MOE_ROUTER_RETRAIN_EVERY_HOURS = 24
217
+
218
+ # Federated learning (M28)
219
+ FEDLEARN_MAX_ROUND_MINUTES = 120
220
+ FEDLEARN_MIN_PARTICIPANTS = 3
221
+ FEDLEARN_MAX_LORA_RANK = 64
222
+ FEDLEARN_GRAD_CLIP = 1.0
223
+ FEDLEARN_DP_NOISE_SCALE_DEFAULT = 0.0 # differential privacy is off by default
224
+
225
+ # Evidence (M30)
226
+ EVIDENCE_CLAIM_TTL_DAYS_DEFAULT = 365
227
+ EVIDENCE_MAX_PROVENANCE_DEPTH = 16
228
+
229
+ # Civil defence (M31)
230
+ CIVDEF_AUDIT_RETENTION_YEARS = 10
231
+ CIVDEF_HEARTBEAT_SECONDS = 60
232
+
233
+ # Tensor transport (X08)
234
+ TENSOR_CHUNK_BYTES = 1_048_576 # 1 MiB
235
+ TENSOR_FLOW_CONTROL_WINDOW = 16 # chunks
236
+ TENSOR_COMPRESSION_THRESHOLD_BYTES = 65_536
237
+
238
+ # LoRa beacons (M29)
239
+ LORA_BEACON_PERIOD_SECONDS_DEFAULT = 600 # 10 minutes
240
+ LORA_BEACON_MAX_PAYLOAD_BYTES = 32
241
+ ```
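To make the tensor-transport constants concrete, here is a minimal sketch of chunking a serialized tensor. It is not the X08 wire format: the helper names are hypothetical, `zlib` is a stand-in codec, and reading the threshold as "compress chunks at or above this size" is an assumption.

```python
import zlib
from typing import List, Tuple

TENSOR_CHUNK_BYTES = 1_048_576
TENSOR_COMPRESSION_THRESHOLD_BYTES = 65_536


def split_chunks(payload: bytes, chunk_bytes: int = TENSOR_CHUNK_BYTES) -> List[bytes]:
    """Cut a serialized tensor into fixed-size chunks; the last one may be shorter."""
    return [payload[i:i + chunk_bytes] for i in range(0, len(payload), chunk_bytes)]


def encode_chunk(chunk: bytes) -> Tuple[bool, bytes]:
    """Return (compressed?, data); small chunks are sent as-is."""
    if len(chunk) >= TENSOR_COMPRESSION_THRESHOLD_BYTES:
        return True, zlib.compress(chunk)  # zlib is an illustrative choice
    return False, chunk
```

A payload of 2 MiB plus 10 bytes would yield three chunks, with only the 10-byte tail left uncompressed.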
242
+
243
+ ---
244
+
245
+ ## 5. Build order (Phase 3)
246
+
247
+ Phase 3 is not a release; it is a set of long-running tracks. Suggested ordering by independence + value:
248
+
249
+ | Track | Modules | Outcome |
250
+ |-------|----------------------------------|-------------------------------------------------------------------------------|
251
+ | A | X09 Conformance + M32 Standard | Other people can build HearthNet-compliant nodes |
252
+ | B | M30 Evidence / EBKH | Marketplace claims and emergency posts carry provenance |
253
+ | C | M27 MoE Routing (machines only) | Better answers for free; routes RAG queries to best-suited backend |
254
+ | D | M27 + M28 (human routing) | Neighbour gets pinged when their expertise matches |
255
+ | E | M28 FedLearn | Communities co-train a small LoRA without sharing source data |
256
+ | F | X08 + M26 Distributed Inference | Two anchors jointly serve a 7B model; large models become feasible LAN-wide |
257
+ | G | M29 LoRa Beacons | Resilient "I am alive" pings during regional internet outages |
258
+ | H | M31 Civil Defence Pilot | A real Niederrhein THW Ortsverband uses HearthNet for an exercise |
259
+
260
+ Tracks can run in parallel. None of them block the existing Phase-2 system.
261
+
262
+ ---
263
+
264
+ ## 6. Spec versioning
265
+
266
+ - Capability Contract bumps to **v3.0** but the bump is *additive*. v2 nodes coexist with v3 nodes; experimental capabilities simply aren't seen by v2 nodes.
267
+ - The first concrete deliverable of Track A (M32) is to **decouple** the protocol spec from the implementation. After that, the contract has its own version track separate from the Python implementation's version.
268
+
269
+ ---
270
+
271
+ ## 7. Out-of-band documents (Phase 3)
272
+
273
+ - **RESEARCH_AGENDA.md**: the deeper "why" for each module; intended audience: PhD students and grant reviewers
274
+ - **GOVERNANCE.md**: how spec changes are proposed, reviewed, and accepted; ties into M32
275
+ - **ETHICS_REVIEW.md**: the framework for evaluating MoE-driven routing-to-humans (M27) and fedlearn-on-personal-data (M28)
276
+ - **CIVDEF_AGREEMENT_TEMPLATE.md**: the MoU template for a civil-defence pilot
277
+
278
+ ---
279
+
280
+ ## 8. What is NOT in Phase 3
281
+
282
+ Even with all of Phase 3 done, the following remain explicit non-goals:
283
+
284
+ - A central directory of communities. There is no "HearthNet.com" listing all communities. Discovery is via word of mouth + DHT + federation. Pushed indefinitely.
285
+ - An app store for capabilities. Capabilities are code in the source tree, reviewed by maintainers. Not pluggable at runtime by untrusted code.
286
+ - A consensus protocol (Paxos, Raft). Communities do not vote on shared state beyond event-log gossip. Federation does not imply consensus.
287
+ - A cryptocurrency / token economy. Not even for fedlearn incentives. Reputational signals only.
288
+ - AGI. Even the distributed inference module targets at-most-mid-sized models (7B-class). The thesis is "small models close to people are more useful than large models far away", and Phase 3 doesn't change that.
docs/p2_p3/CAPABILITY_CONTRACT_v2.md ADDED
@@ -0,0 +1,899 @@
1
+ # HearthNet Capability Contract: Phase 2 additions (v2.0)
2
+
3
+ **Spec version:** v2.0
4
+ **Last touched:** 2026-06-09
5
+ **Builds on:** [`../CAPABILITY_CONTRACT.md`](../CAPABILITY_CONTRACT.md) (v1.0)
6
+
7
+ This document is **additive** to v1.0. Everything in v1.0 still holds unless explicitly overridden here. Bumping a node's `contract_version` to `"2.0"` means: "I implement all of v1 plus the additions below."
8
+
9
+ ---
10
+
11
+ ## 1. Conventions delta
12
+
13
+ ### 1.1 New encoded forms
14
+
15
+ - **Token format:** `hntoken://v1/<base64-url-nopad of canonical-JSON of token body + signature>`. See [M16 §3](modules/M16-tokens.md).
16
+ - **Federation peering blob:** `hnfed://v1/<base64>`, analogous to the invite blob and signed by both community roots (cross-signature).
17
+ - **Encrypted payload header:** when a chat body is E2E-encrypted, the event's `data.body` becomes a `{"e2e": true, "header": {...}, "ciphertext": "<base64>"}` object. See [M23 §4](modules/M23-e2e-encryption.md).
18
+
19
+ ### 1.2 Sign-over-method choice
20
+
21
+ Phase 2 capability tokens use the **JWS-flavoured** envelope, not the canonical-JSON envelope. Rationale: tokens are short-lived and frequently passed through HTTP intermediaries; JWS is the lingua franca.
22
+
23
+ ```
24
+ hntoken_envelope = base64url(header) + "." + base64url(payload) + "." + base64url(signature)
25
+ ```
26
+
27
+ Both forms continue to use Ed25519.
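The envelope above can be assembled as in the following minimal sketch. It assumes compact, key-sorted JSON for the header and payload (the spec does not state the serialization here) and takes the Ed25519 signer as a hypothetical `sign` callable, because the Python standard library has no Ed25519 implementation. The signing input follows §6.2: `base64url(header) + "." + base64url(payload)`.

```python
import base64
import json
from typing import Callable


def b64url_nopad(raw: bytes) -> str:
    """URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url_nopad: restore padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _compact_json(obj: dict) -> bytes:
    # Assumption: key-sorted, whitespace-free JSON.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_envelope(header: dict, payload: dict,
                   sign: Callable[[bytes], bytes]) -> str:
    """Return header.payload.signature, each segment base64url-encoded.

    `sign` is a placeholder for an Ed25519 signing function.
    """
    signing_input = b64url_nopad(_compact_json(header)) + "." + \
        b64url_nopad(_compact_json(payload))
    signature = sign(signing_input.encode("ascii"))
    return signing_input + "." + b64url_nopad(signature)
```

In a real node the `sign` callable would wrap the node's Ed25519 private key; any library providing Ed25519 could be plugged in.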
28
+
29
+ ### 1.3 New error codes (additive)
30
+
31
+ | Code | Meaning |
32
+ |------|---------|
33
+ | `federation_forbidden` | The caller's community is not federated with ours for this capability |
34
+ | `token_invalid` | Token signature failed |
35
+ | `token_expired` | Token past `exp` |
36
+ | `token_scope_insufficient` | Token does not grant this capability |
37
+ | `relay_unreachable` | Configured relay tier is down |
38
+ | `e2e_session_missing` | Caller did not establish an X3DH session before sending encrypted message |
39
+ | `e2e_decrypt_failed` | Ciphertext could not be decrypted (key mismatch, ratchet drift) |
40
+ | `dht_lookup_failed` | DHT lookup timed out before finding sources |
41
+ | `not_federated` | Federation manifest does not exist between these communities |
42
+
43
+ ---
44
+
45
+ ## 2. Capability namespace: Phase 2 stable set
46
+
47
+ Promoted from "reserved" in v1.0:
48
+
49
+ | Prefix | Now | Defined |
50
+ |--------|-----|---------|
51
+ | `federation.*` | stable | [M14](modules/M14-federation.md) |
52
+ | `ocr.*` | stable | [M17](modules/M17-ocr.md) |
53
+ | `trans.*` | stable | [M18](modules/M18-translation.md) |
54
+ | `stt.*` `tts.*` | stable | [M19](modules/M19-stt-tts.md) |
55
+ | `img.*` | stable | [M20](modules/M20-vision.md) |
56
+ | `rerank.*` | stable | [M24](modules/M24-rerank.md) |
57
+ | `chat.thread.*` | stable | [M25](modules/M25-group-chat.md) |
58
+ | `chat.forward.*` | stable | [M14](modules/M14-federation.md) (via relay) |
59
+ | `auth.*` | stable (new) | [M16](modules/M16-tokens.md) |
60
+
61
+ ---
62
+
63
+ ## 3. Complete new capabilities list
64
+
65
+ | Name | Stability | Stream? | Trust required | Section |
+ |------|-----------|---------|----------------|---------|
+ | `federation.peer.add@1.0` | stable | no | anchor (with co-sig) | §4.1 |
+ | `federation.peer.remove@1.0` | stable | no | anchor (with co-sig) | §4.2 |
+ | `federation.peer.list@1.0` | stable | no | member | §4.3 |
+ | `federation.proxy@1.0` | stable | yes | federated | §4.4 |
+ | `auth.token.issue@1.0` | stable | no | member | §4.5 |
+ | `auth.token.revoke@1.0` | stable | no | issuer or trusted | §4.6 |
+ | `auth.token.introspect@1.0` | stable | no | self | §4.7 |
+ | `ocr.image@1.0` | stable | no | member | §4.8 |
+ | `ocr.pdf@1.0` | stable | yes (progress) | trusted | §4.9 |
+ | `trans.text@1.0` | stable | no | member | §4.10 |
+ | `stt.transcribe@1.0` | stable | yes (segments) | member | §4.11 |
+ | `tts.synthesize@1.0` | stable | yes (audio chunks) | member | §4.12 |
+ | `img.describe@1.0` | stable | no | member | §4.13 |
+ | `img.generate@1.0` | stable | yes (progress) | trusted | §4.14 |
+ | `rerank.text@1.0` | stable | no | member | §4.15 |
+ | `chat.thread.create@1.0` | stable | no | member | §4.16 |
+ | `chat.thread.send@1.0` | stable | no | thread member | §4.17 |
+ | `chat.thread.history@1.0` | stable | no | thread member | §4.18 |
+ | `chat.thread.leave@1.0` | stable | no | thread member | §4.19 |
+ | `chat.forward.put@1.0` | stable | yes | anchor with forward | §4.20 |
+ | `chat.forward.fetch@1.0` | stable | yes | self | §4.21 |
+ | `file.put.resume@1.0` | stable | yes | trusted | §4.22 |
+ | `llm.chat@2.0` (UPDATE) | stable | yes | member | §4.23 |
+ | `llm.tools.call@1.0` (NEW, used by `llm.chat` tool flow) | stable | no | member | §4.24 |
91
+
92
+ ---
93
+
94
+ ## 4. Per-capability specifications
95
+
96
+ ### 4.1 `federation.peer.add@1.0`
97
+
98
+ Establish a federation link with another community.
99
+
100
+ **Request:**
101
+ ```json
102
+ {
103
+ "params": {},
104
+ "input": {
105
+ "client_id": "01HXR...",
106
+ "peer_community_id": "ed25519:<other community root pubkey>",
107
+ "peer_endpoints": [{"transport":"https","host":"...","port":7080}],
108
+ "co_signers": [{"node_id":"...","signature":"..."}, "...", "..."],
109
+ "scope": {
110
+ "capabilities": ["rag.query","market.list"],
111
+ "data_visibility":"public_corpora_only"
112
+ },
113
+ "expires_at": "2027-06-09T00:00:00Z"
114
+ }
115
+ }
116
+ ```
117
+
118
+ `co_signers` must carry at least `policy.min_signatures_to_federate` signatures (new policy field; default 3, see [M14 §5](modules/M14-federation.md)). The remote community must also have us in their federation manifest before federated calls work.
119
+
120
+ **Response:**
121
+ ```json
122
+ {"output": {"event_id": "01HXS...", "federation_id": "ed25519:A:ed25519:B"}, "meta": {"ms": 14}}
123
+ ```
124
+
125
+ Emits `federation.peer.added` event.
126
+
127
+ **Errors:** `unauthorized`, `bad_request`, `not_found` (peer endpoints unreachable).
128
+
129
+ ### 4.2 `federation.peer.remove@1.0`
130
+
131
+ Terminate a federation link.
132
+
133
+ **Request:**
134
+ ```json
135
+ {
136
+ "params": {},
137
+ "input": {
138
+ "client_id": "01HXR...",
139
+ "peer_community_id": "ed25519:...",
140
+ "reason": "policy_violation|unused|mutual",
141
+ "co_signers": [...]
142
+ }
143
+ }
144
+ ```
145
+
146
+ Emits `federation.peer.removed`.
147
+
148
+ ### 4.3 `federation.peer.list@1.0`
149
+
150
+ List active federations.
151
+
152
+ **Response:**
153
+ ```json
154
+ {
155
+ "output": {
156
+ "peers": [
157
+ {
158
+ "community_id": "ed25519:...",
159
+ "name": "Geldern Demo",
160
+ "scope": {"capabilities":["rag.query"]},
161
+ "established_at": "...",
162
+ "expires_at": "...",
163
+ "last_heartbeat": "..."
164
+ }
165
+ ]
166
+ },
167
+ "meta": {"ms": 2}
168
+ }
169
+ ```
170
+
171
+ ### 4.4 `federation.proxy@1.0`
172
+
173
+ A federated peer asks *our* community to forward a capability call to one of *our* members. This is how cross-community RAG query works: peer's anchor calls `federation.proxy` on our anchor, which then internally routes to `rag.query` on whichever local node has the corpus.
174
+
175
+ **Request:**
176
+ ```json
177
+ {
178
+ "params": {"target_capability": "rag.query@1.0"},
179
+ "input": {
180
+ "client_id": "01HXR...",
181
+ "token": "hntoken://v1/...",
182
+ "body": { /* the body of the underlying capability */ }
183
+ }
184
+ }
185
+ ```
186
+
187
+ **Response:** Whatever the target capability returns. Streams pass through transparently.
188
+
189
+ The proxy verifies the token's scope includes `target_capability`. Returns `federation_forbidden` otherwise.
190
+
191
+ ### 4.5 `auth.token.issue@1.0`
192
+
193
+ Issue a capability token.
194
+
195
+ **Request:**
196
+ ```json
197
+ {
198
+ "params": {},
199
+ "input": {
200
+ "client_id": "01HXR...",
201
+ "subject": "ed25519:<recipient NodeID>",
202
+ "scope": {
203
+ "capabilities": ["rag.query@1.0", "embed.text@1.0"],
204
+ "corpora": ["niederrhein-emergency"],
205
+ "rate_limit_per_minute": 60
206
+ },
207
+ "ttl_seconds": 3600,
208
+ "audience": "ed25519:<community_id where token is presented, optional>"
209
+ }
210
+ }
211
+ ```
212
+
213
+ **Response:**
214
+ ```json
215
+ {
216
+ "output": {"token": "hntoken://v1/eyJhbGc...", "token_id": "01HXS..."},
217
+ "meta": {"ms": 4}
218
+ }
219
+ ```
220
+
221
+ See [M16](modules/M16-tokens.md) for token body schema.
222
+
223
+ ### 4.6 `auth.token.revoke@1.0`
224
+
225
+ Revoke a previously-issued token.
226
+
227
+ **Request:**
228
+ ```json
229
+ {"params": {}, "input": {"client_id":"01HXR...","token_id":"01HXR..."}}
230
+ ```
231
+
232
+ Emits `auth.token.revoked` event.
233
+
234
+ ### 4.7 `auth.token.introspect@1.0`
235
+
236
+ Self-only: check whether a token is still valid.
237
+
238
+ **Request:** `{"params":{},"input":{"token":"hntoken://v1/..."}}`
239
+
240
+ **Response:** `{"output":{"active":bool,"scope":{...},"expires_at":"..."},"meta":{...}}`
241
+
242
+ ### 4.8 `ocr.image@1.0`
243
+
244
+ Extract text from a single image.
245
+
246
+ **Request:**
247
+ ```json
248
+ {
249
+ "params": {"backend": "tesseract", "languages": ["deu","eng"]},
250
+ "input": {
251
+ "image_cid": "blake3:...",
252
+ "preprocess": {"deskew": true, "denoise": false}
253
+ }
254
+ }
255
+ ```
256
+
257
+ **Response:**
258
+ ```json
259
+ {
260
+ "output": {
261
+ "text": "Trinkwasser ohne Strom ...",
262
+ "blocks": [
263
+ {"text":"Trinkwasser ohne Strom","bbox":[10,20,300,40],"confidence":0.94}
264
+ ],
265
+ "language": "de"
266
+ },
267
+ "meta": {"backend":"tesseract","ms":820}
268
+ }
269
+ ```
270
+
271
+ ### 4.9 `ocr.pdf@1.0`
272
+
273
+ Extract text from a (scanned) PDF. Streams per-page progress.
274
+
275
+ **Request:**
276
+ ```json
277
+ {
278
+ "params": {"backend":"multilingual","languages":["deu","lat"]},
279
+ "input": {
280
+ "doc_cid": "blake3:...",
281
+ "page_range": [1, 50],
282
+ "preprocess": {"deskew": true},
283
+ "store_text": true
284
+ }
285
+ }
286
+ ```
287
+
288
+ **Stream frames:**
289
+ ```
290
+ event: progress
291
+ data: {"current": 3, "total": 12, "stage": "OCRing page 3"}
292
+
293
+ event: page
294
+ data: {"page": 3, "text": "...", "confidence_mean": 0.91}
295
+
296
+ event: done
297
+ data: {"pages": 12, "stored_cid": "blake3:...", "ms": 18342}
298
+ ```
299
+
300
+ If `store_text:true`, the extracted text is stored as a new blob and its CID returned. Useful for piping into `rag.ingest`.
301
+
302
+ ### 4.10 `trans.text@1.0`
303
+
304
+ Translate between languages.
305
+
306
+ **Request:**
307
+ ```json
308
+ {
309
+ "params": {"backend":"nllb"},
310
+ "input": {
311
+ "text": "Brauche Wasserkanister",
312
+ "from": "de",
313
+ "to": "en",
314
+ "domain": "everyday"
315
+ }
316
+ }
317
+ ```
318
+
319
+ **Response:**
320
+ ```json
321
+ {
322
+ "output": {"text":"Need water canister", "confidence": 0.97},
323
+ "meta": {"backend":"nllb","model":"nllb-200-distilled-600M","ms":312}
324
+ }
325
+ ```
326
+
327
+ Plattdeutsch supported as `nds`. Marketplace UI offers one-click translate on a foreign-language post.
328
+
329
+ ### 4.11 `stt.transcribe@1.0`
330
+
331
+ Transcribe an audio blob.
332
+
333
+ **Request:**
334
+ ```json
335
+ {
336
+ "params": {"backend":"whisper","model":"large-v3"},
337
+ "input": {
338
+ "audio_cid": "blake3:...",
339
+ "language": "auto",
340
+ "diarize": false,
341
+ "translate_to_en": false
342
+ }
343
+ }
344
+ ```
345
+
346
+ **Stream frames:**
347
+ ```
348
+ event: segment
349
+ data: {"start": 0.0, "end": 4.2, "text": "Hallo, ich brauche...", "language":"de"}
350
+
351
+ event: segment
352
+ data: {"start": 4.2, "end": 8.1, "text": "Hilfe mit dem Generator."}
353
+
354
+ event: done
355
+ data: {"language":"de","ms":2100,"duration_seconds":18.4}
356
+ ```
357
+
358
+ ### 4.12 `tts.synthesize@1.0`
359
+
360
+ Synthesize speech from text.
361
+
362
+ **Request:**
363
+ ```json
364
+ {
365
+ "params": {"backend":"xtts","voice":"hannes_v1","language":"de"},
366
+ "input": {
367
+ "text": "Das Regenwasser muss zuerst gefiltert werden.",
368
+ "speed": 1.0,
369
+ "format": "ogg_vorbis"
370
+ }
371
+ }
372
+ ```
373
+
374
+ **Stream frames:**
375
+ ```
376
+ event: chunk
377
+ data: {"i":0,"size_bytes":16384,"data_b64":"..."}
378
+
379
+ event: done
380
+ data: {"total_bytes":91247,"duration_seconds":4.2,"format":"ogg_vorbis","ms":1832}
381
+ ```
382
+
383
+ ### 4.13 `img.describe@1.0`
384
+
385
+ Describe what's in an image.
386
+
387
+ **Request:**
388
+ ```json
389
+ {
390
+ "params": {"backend":"florence2"},
391
+ "input": {
392
+ "image_cid": "blake3:...",
393
+ "task": "detailed_caption",
394
+ "language": "de"
395
+ }
396
+ }
397
+ ```
398
+
399
+ `task` ∈ `{"caption","detailed_caption","ocr","objects","tags"}`.
400
+
401
+ **Response:**
402
+ ```json
403
+ {
404
+ "output": {
405
+ "caption": "Ein Schaltplan einer einfachen Wasserfilteranlage mit ...",
406
+ "tags": ["schaltplan","wasserfilter","skizze"],
407
+ "objects": [{"label":"pipe","bbox":[10,20,80,90]}]
408
+ },
409
+ "meta": {"backend":"florence2","ms":640}
410
+ }
411
+ ```
412
+
413
+ ### 4.14 `img.generate@1.0`
414
+
415
+ Generate an image from a text prompt.
416
+
417
+ **Request:**
418
+ ```json
419
+ {
420
+ "params": {"backend":"flux","model":"flux.1-dev","lora":"local-style-v1"},
421
+ "input": {
422
+ "prompt": "ein einfacher schaltplan einer wasserfilteranlage, schwarz auf weiss",
423
+ "negative_prompt": "color, photorealistic",
424
+ "width": 1024,
425
+ "height": 1024,
426
+ "steps": 20,
427
+ "seed": 12345
428
+ }
429
+ }
430
+ ```
431
+
432
+ **Stream frames:**
433
+ ```
434
+ event: progress
435
+ data: {"step":5,"total":20}
436
+
437
+ event: done
438
+ data: {"image_cid":"blake3:...","width":1024,"height":1024,"ms":12800}
439
+ ```
440
+
441
+ ### 4.15 `rerank.text@1.0`
442
+
443
+ Rerank a list of documents against a query.
444
+
445
+ **Request:**
446
+ ```json
447
+ {
448
+ "params": {"model":"BAAI/bge-reranker-v2-m3"},
449
+ "input": {
450
+ "query": "Wie reinige ich Regenwasser ohne Strom?",
451
+ "documents": [
452
+ {"id":"doc1","text":"..."},
453
+ {"id":"doc2","text":"..."}
454
+ ],
455
+ "top_k": 10
456
+ }
457
+ }
458
+ ```
459
+
460
+ **Response:**
461
+ ```json
462
+ {
463
+ "output": {
464
+ "ranked": [
465
+ {"id":"doc2","score":0.91},
466
+ {"id":"doc1","score":0.42}
467
+ ]
468
+ },
469
+ "meta": {"model":"BAAI/bge-reranker-v2-m3","ms":42}
470
+ }
471
+ ```
472
+
473
+ ### 4.16 `chat.thread.create@1.0`
474
+
475
+ Create a multi-party thread.
476
+
477
+ **Request:**
478
+ ```json
479
+ {
480
+ "params": {},
481
+ "input": {
482
+ "client_id": "01HXR...",
483
+ "name": "Nachbarschaftshilfe Mai",
484
+ "members": ["ed25519:...","ed25519:...","ed25519:..."],
485
+ "e2e_enabled": true
486
+ }
487
+ }
488
+ ```
489
+
490
+ **Response:** `{"output":{"thread_id":"01HXR...","event_id":"01HXR..."},"meta":{...}}`
491
+
492
+ ### 4.17 `chat.thread.send@1.0`
493
+
494
+ Send to a thread. Body is E2E-encrypted when `e2e_enabled`.
495
+
496
+ **Request:**
497
+ ```json
498
+ {
499
+ "params": {"thread_id":"01HXR..."},
500
+ "input": {
501
+ "client_id": "01HXR...",
502
+ "body": "...", // cleartext or {"e2e":true,...} envelope
503
+ "attachments": [{"cid":"blake3:...","name":"..."}]
504
+ }
505
+ }
506
+ ```
507
+
508
+ ### 4.18 `chat.thread.history@1.0`
509
+
510
+ Self-only history retrieval for a thread.
511
+
512
+ **Request:** `{"params":{"thread_id":"01HXR..."},"input":{"since_lamport":4000,"limit":200}}`
513
+
514
+ ### 4.19 `chat.thread.leave@1.0`
515
+
516
+ Leave a thread.
517
+
518
+ ### 4.20 `chat.forward.put@1.0`
519
+
520
+ Store-and-forward: leave a chat message with an anchor for later delivery.
521
+
522
+ **Stream initiator pattern** identical to `file.put`. Anchors that opt into the role register this capability.
523
+
524
+ ### 4.21 `chat.forward.fetch@1.0`
525
+
526
+ Self-only: collect queued messages from an anchor.
527
+
528
+ ### 4.22 `file.put.resume@1.0`
529
+
530
+ Resume a partial PUT.
531
+
532
+ **Request:**
533
+ ```json
534
+ {"params":{},"input":{"manifest_cid":"blake3:...","client_id":"01HXR..."}}
535
+ ```
536
+
537
+ **Response (server tells client which chunks are missing):**
538
+ ```
539
+ event: ready
540
+ data: {"missing":[3,4,5,8]}
541
+
542
+ (client sends only those chunks)
543
+
544
+ event: done
545
+ data: {"received":4}
546
+ ```
547
+
548
+ Server keeps partial transfer state for `FILE_RESUME_PARTIAL_TTL_SECONDS` (1 hour). After that, partial transfers are discarded and client must restart.
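The server-side bookkeeping behind the `ready` frame might look like the sketch below. It assumes 0-based chunk indices and measures the TTL from the last activity on the transfer; neither detail is stated in this section, and the function name is hypothetical.

```python
from typing import List, Optional, Set

FILE_RESUME_PARTIAL_TTL_SECONDS = 3600  # 1 hour, as stated above


def missing_chunks(total_chunks: int, received: Set[int],
                   last_activity: float, now: float) -> Optional[List[int]]:
    """Return the chunk indices the client still has to send.

    Returns None when the partial state has expired, in which case the
    client must restart the transfer from scratch.
    """
    if now - last_activity > FILE_RESUME_PARTIAL_TTL_SECONDS:
        return None  # partial state discarded
    return [i for i in range(total_chunks) if i not in received]
```

With nine chunks in the manifest and chunks 0, 1, 2, 6 and 7 already stored, the function yields the `[3, 4, 5, 8]` list shown in the `ready` frame.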
549
+
550
+ ### 4.23 `llm.chat@2.0` (update)
551
+
552
+ Backward-compatible **minor bump** (still `name="llm.chat"`, callers can still ask for `@>=1.0` and be matched). New optional fields:
553
+
554
+ ```json
555
+ {
556
+ "params": {"model":"...","modalities":["text","vision"]},
557
+ "input": {
558
+ "messages": [
559
+ {
560
+ "role": "user",
561
+ "content": [
562
+ {"type": "text", "text": "Was siehst du?"},
563
+ {"type": "image", "image_cid": "blake3:..."}
564
+ ]
565
+ }
566
+ ],
567
+ "tools": [
568
+ {
569
+ "name": "rag.query",
570
+ "description": "Search the niederrhein-emergency corpus",
571
+ "parameters_schema": { /* JSON Schema for tool args */ }
572
+ }
573
+ ],
574
+ "tool_choice": "auto"
575
+ }
576
+ }
577
+ ```
578
+
579
+ ### 4.24 `llm.tools.call@1.0`
580
+
581
+ When an LLM emits a `tool_call_delta` stream frame followed by `tool_call` end, the **caller** is responsible for executing the tool. To make this composable, the LLM service offers `llm.tools.call` as a convenience that wraps "execute one bus call, return its output as a tool message". Callers MAY use it; the more general flow is to have the orchestrator (UI / agent) handle it.
582
+
583
+ **Request:**
584
+ ```json
585
+ {
586
+ "params": {},
587
+ "input": {
588
+ "tool_call_id": "tc_01HXR...",
589
+ "target_capability":"rag.query@1.0",
590
+ "target_body": { /* the tool's args, validated against the tool's parameters_schema */ }
591
+ }
592
+ }
593
+ ```
594
+
595
+ **Response:** mirrors target capability's response.
596
+
597
+ ---
598
+
599
+ ## 5. Wire format changes
600
+
601
+ ### 5.1 WebSocket upgrade
602
+
603
+ For `/bus/v1/call`, clients MAY include:
604
+
605
+ ```
606
+ Connection: Upgrade
607
+ Upgrade: websocket
608
+ Sec-WebSocket-Protocol: hearthnet-bus.v2
609
+ ```
610
+
611
+ The server responds with a 101 if it supports WebSocket (Phase 2 nodes do). Once upgraded, the connection is bidirectional and persistent for the life of the request, which is useful for tool-call loops and streaming RAG.
612
+
613
+ Frames over WebSocket are the same JSON event-name + data envelope as SSE, just delivered as binary or text WebSocket frames instead of `data:` lines.
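As a rough illustration of that equivalence, the sketch below parses one SSE frame into an event name plus JSON data and re-serializes it as a WebSocket text message. The `{"event": ..., "data": ...}` object shape is an assumption made for this example; the authoritative framing is defined in X06.

```python
import json


def parse_sse_frame(block: str) -> dict:
    """Parse one SSE frame ('event:' and 'data:' lines) into a dict."""
    event = "message"  # SSE default when no 'event:' line is present
    data_lines = []
    for line in block.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    return {"event": event, "data": json.loads("\n".join(data_lines))}


def to_ws_text(frame: dict) -> str:
    """Serialize the same frame as a single WebSocket text message.

    Assumption: one JSON object per message with 'event' and 'data' keys.
    """
    return json.dumps(frame, separators=(",", ":"))
```

A bridge between the two transports could then forward `to_ws_text(parse_sse_frame(block))` for each frame it reads from an SSE stream.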
614
+
615
+ See [X06](cross-cutting/X06-websocket.md).
616
+
617
+ ### 5.2 Token-bearer requests
618
+
619
+ When a caller carries a capability token instead of (or in addition to) a per-request signature:
620
+
621
+ ```
622
+ X-HearthNet-Token: hntoken://v1/<base64>
623
+ ```
624
+
625
+ The server validates the token (signature, expiry, scope) and uses the token's subject (`sub`) as the effective caller for the trust check. The token's issuer (`iss`) must be a member of a federated community.
626
+
627
+ If both `X-HearthNet-Signature` and `X-HearthNet-Token` are present, signature is checked first; token is used to widen scope (e.g. "the caller is a federated peer, but for this single call they presented a token granting access").
628
+
629
+ ### 5.3 Federation routing
630
+
631
+ When a node receives a call where `X-HearthNet-Community` β‰  our community ID:
632
+
633
+ 1. Look up federation manifest for the calling community.
634
+ 2. If absent → `not_federated` (404).
635
+ 3. If present but scope does not include the requested capability → `federation_forbidden` (403).
636
+ 4. Else, dispatch normally; record federation usage in metrics.
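The four steps above can be sketched as a single decision function. The in-memory `manifests` mapping (calling community to granted capability names), the `usage` counter and the choice to match on the capability name without its `@version` suffix are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def route_incoming(caller_community: str, our_community: str,
                   capability: str,
                   manifests: Dict[str, List[str]],
                   usage: Optional[Dict[str, int]] = None
                   ) -> Tuple[int, Optional[str]]:
    """Return (http_status, error_code) for an incoming call."""
    if caller_community == our_community:
        return 200, None  # local call, no federation check needed
    granted = manifests.get(caller_community)        # step 1
    if granted is None:
        return 404, "not_federated"                  # step 2
    name = capability.split("@", 1)[0]               # "rag.query@1.0" -> "rag.query"
    if name not in granted:
        return 403, "federation_forbidden"           # step 3
    if usage is not None:                            # step 4: record usage
        usage[caller_community] = usage.get(caller_community, 0) + 1
    return 200, None
```

Keeping the decision separate from the actual dispatch makes the error mapping easy to test without a running bus.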
637
+
638
+ ---
639
+
640
+ ## 6. Manifests
641
+
642
+ ### 6.1 Federation manifest (new)
643
+
644
+ ```json
645
+ {
646
+ "schema_version": 1,
647
+ "federation_id": "<community_a>:<community_b>",
648
+ "community_a": "ed25519:...",
649
+ "community_b": "ed25519:...",
650
+ "established_at": "2026-06-09T10:00:00Z",
651
+ "expires_at": "2027-06-09T10:00:00Z",
652
+ "scope": {
653
+ "a_grants_b": {"capabilities":["rag.query"], "corpora":["public-emergency"]},
654
+ "b_grants_a": {"capabilities":["rag.query"]}
655
+ },
656
+ "bootstrap_endpoints_a": [{"transport":"https","host":"...","port":7080}],
657
+ "bootstrap_endpoints_b": [{"transport":"https","host":"...","port":7080}],
658
+ "signatures": {
659
+ "a": {"signed_by":"ed25519:<anchor of A>","signature":"...","co_signers":[{...},{...}]},
660
+ "b": {"signed_by":"ed25519:<anchor of B>","signature":"...","co_signers":[{...},{...}]}
661
+ }
662
+ }
663
+ ```
664
+
665
+ Both sides must sign with their `min_signatures_to_federate` threshold. The federation manifest lives in **both** communities' event logs.
666
+
667
+ ### 6.2 Token body (new)
668
+
669
+ JWS-style. Header:
670
+
671
+ ```json
672
+ {"alg":"EdDSA","typ":"hntoken","v":1}
673
+ ```
674
+
675
+ Payload:
676
+
677
+ ```json
678
+ {
679
+ "iss": "ed25519:<issuer NodeID>",
680
+ "sub": "ed25519:<subject NodeID>",
681
+ "aud": "ed25519:<audience community, optional>",
682
+ "iat": 1717939200,
683
+ "exp": 1717942800,
684
+ "jti": "01HXR...",
685
+ "scope": {
686
+ "capabilities": ["rag.query@1.0"],
687
+ "params_constraints": {"corpus":["niederrhein-emergency"]},
688
+ "rate_limit_per_minute": 60
689
+ }
690
+ }
691
+ ```
692
+
693
+ Signature: Ed25519 over `base64url(header) + "." + base64url(payload)`.
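A receiving node could check a presented token roughly as follows. This is a sketch under assumptions: `verify` is a hypothetical callable wrapping Ed25519 verification against the issuer's key (not available in the standard library), capabilities are matched by exact string, and a token counts as expired once `now >= exp`, following common JWT practice. The returned strings are the error codes from §9.

```python
import base64
import json
from typing import Callable, Collection, Optional


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def validate_token(envelope: str, capability: str, now: int,
                   verify: Callable[[bytes, bytes, str], bool],
                   revoked: Collection[str] = ()) -> Optional[str]:
    """Return None if the token is acceptable, else an error code."""
    try:
        header_b64, payload_b64, sig_b64 = envelope.split(".")
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:  # wrong segment count, bad base64, bad JSON
        return "token_invalid"
    if not isinstance(payload, dict):
        return "token_invalid"
    signing_input = (header_b64 + "." + payload_b64).encode("ascii")
    if not verify(signing_input, signature, payload.get("iss", "")):
        return "token_invalid"
    if payload.get("jti") in revoked:
        return "token_revoked"
    if now >= payload.get("exp", 0):
        return "token_expired"
    scope = payload.get("scope", {})
    if capability not in scope.get("capabilities", []):
        return "token_scope_insufficient"
    return None
```

Checking the signature before any claim keeps unauthenticated input from influencing the revocation or scope lookups.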
694
+
695
+ ### 6.3 Node manifest delta
696
+
697
+ Phase 2 nodes set `contract_version: "2.0"`. Additional fields in `capabilities[].params`:
698
+
699
+ ```json
700
+ {
701
+ "name": "llm.chat",
702
+ "version": "2.0",
703
+ "params": {
704
+ "model": "...",
705
+ "modalities": ["text","vision"],
706
+ "tools_supported": true,
707
+ "max_tools_per_call": 16,
708
+ "requires_internet": false
709
+ }
710
+ }
711
+ ```
712
+
713
+ ---
714
+
715
+ ## 7. Events (additive to v1.0 §7.2)
716
+
717
+ ### 7.1 New event types
718
+
719
+ ```
720
+ federation.peer.added
721
+ federation.peer.removed
722
+ federation.heartbeat
723
+ auth.token.issued
724
+ auth.token.revoked
725
+ chat.thread.created
726
+ chat.thread.member.added
727
+ chat.thread.member.removed
728
+ chat.thread.message.sent
729
+ chat.thread.message.delivered
730
+ chat.thread.archived
731
+ e2e.prekeys.published
732
+ e2e.session.established
733
+ e2e.session.broken
734
+ file.replication.scheduled
735
+ file.replication.completed
736
+ ocr.document.indexed
737
+ ```
738
+
739
+ ### 7.2 Selected schemas
740
+
741
+ #### `federation.peer.added`
742
+
743
+ ```json
744
+ {
745
+ "peer_community_id": "ed25519:...",
746
+ "federation_id": "...",
747
+ "scope": {...},
748
+ "co_signers": [{...},{...},{...}]
749
+ }
750
+ ```
751
+
752
+ #### `auth.token.issued`
753
+
754
+ Stored without the signature payload (just metadata for audit):
755
+
756
+ ```json
757
+ {
758
+ "token_id": "01HXR...",
759
+ "subject": "ed25519:...",
760
+ "scope": {...},
761
+ "expires_at":"...",
762
+ "audience": "ed25519:..."
763
+ }
764
+ ```
765
+
766
+ #### `auth.token.revoked`
767
+
768
+ ```json
769
+ {"token_id":"01HXR...","reason":"manual|policy|compromise"}
770
+ ```
771
+
772
+ #### `chat.thread.created`
773
+
774
+ ```json
775
+ {
776
+ "thread_id": "01HXR...",
777
+ "client_id": "01HXR...",
778
+ "name": "Nachbarschaftshilfe Mai",
779
+ "members": ["ed25519:...","ed25519:..."],
780
+ "e2e_enabled": true,
781
+ "ratchet_root_pubkey": "x25519:..."
782
+ }
783
+ ```
784
+
785
+ #### `chat.thread.message.sent`
786
+
787
+ ```json
788
+ {
789
+ "thread_id": "01HXR...",
790
+ "client_id": "01HXR...",
791
+ "body": {"e2e":true,"header":{...},"ciphertext":"..."} | "<cleartext>",
792
+ "attachments": [...]
793
+ }
794
+ ```
795
+
796
+ #### `e2e.prekeys.published`
797
+
798
+ ```json
799
+ {
800
+ "node_id": "ed25519:...",
801
+ "identity_pubkey": "x25519:...",
802
+ "signed_prekey": {"pubkey":"x25519:...","signature":"ed25519:..."},
803
+ "one_time_prekeys": ["x25519:...","x25519:...","..."]
804
+ }
805
+ ```
806
+
807
+ #### `file.replication.scheduled`
808
+
809
+ ```json
810
+ {
811
+ "cid": "blake3:...",
812
+ "desired_copies": 3,
813
+ "current_copies": 1,
814
+ "candidate_holders": ["ed25519:...","ed25519:..."]
815
+ }
816
+ ```
817
+
818
+ #### `ocr.document.indexed`
819
+
820
+ ```json
821
+ {
822
+ "doc_cid": "blake3:...",
823
+ "text_cid": "blake3:...",
824
+ "pages": 12,
825
+ "languages": ["de","la"],
826
+ "ocr_backend": "multilingual"
827
+ }
828
+ ```
829
+
830
+ ### 7.3 Federation events propagate cross-community
831
+
832
+ Events with `event_type ∈ {federation.*, auth.token.issued, auth.token.revoked}` MAY be cross-published into a federated community's event log. The community receiving such an event records the originating community in `data._source_community`. This is the only case where an event's `community_id` does not equal the log it lives in.
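A receiving community might apply this rule as in the sketch below. It assumes events are plain dictionaries with `event_type` and `data` keys; the helper names are hypothetical.

```python
import copy

_CROSS_PUBLISHABLE_EXACT = {"auth.token.issued", "auth.token.revoked"}


def is_cross_publishable(event_type: str) -> bool:
    """True for federation.* and the two auth.token.* audit events."""
    return (event_type.startswith("federation.")
            or event_type in _CROSS_PUBLISHABLE_EXACT)


def import_federated_event(event: dict, source_community: str) -> dict:
    """Return a copy of the event tagged with its originating community."""
    if not is_cross_publishable(event["event_type"]):
        raise ValueError("event type may not be cross-published")
    imported = copy.deepcopy(event)  # leave the caller's event untouched
    imported.setdefault("data", {})["_source_community"] = source_community
    return imported
```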
833
+
834
+ ---
835
+
836
+ ## 8. Pub-sub topics (additive)
837
+
838
+ | Topic | Producer | Subscriber |
839
+ |-------|----------|------------|
840
+ | `federation.peer.added` | member adding | all members |
841
+ | `federation.peer.heartbeat.<peer_community>` | federation client loop | UI |
842
+ | `auth.token.issued` | issuer | issuer + subject |
843
+ | `chat.thread.message.<thread_id>` | sender | thread members |
844
+ | `e2e.prekey.request.<our_short_id>` | sender wanting session | recipient |
845
+ | `e2e.session.handshake.<our_short_id>` | initiator | responder |
846
+ | `file.replication.request.<cid_prefix>` | replication scheduler | all anchors |
847
+ | `mobile.push.<device_id>` | sender | push relay tier (M15) |
848
+
849
+ ---
850
+
851
+ ## 9. Errors: complete delta (additive to v1.0 §9)
852
+
853
+ | Code | When | Retry? |
854
+ |------|------|--------|
855
+ | `federation_forbidden` | Caller's community not federated for this capability | no |
856
+ | `not_federated` | No federation manifest with caller's community | no |
857
+ | `token_invalid` | Token signature bad | no |
858
+ | `token_expired` | Token past `exp` | no, request a new token |
859
+ | `token_scope_insufficient` | Token does not include this capability | no |
860
+ | `token_revoked` | Token id in revoked list | no |
861
+ | `relay_unreachable` | Configured relay tier down | yes, exp backoff |
862
+ | `e2e_session_missing` | No active X3DH session | yes, after key exchange |
863
+ | `e2e_decrypt_failed` | Ciphertext can't be decrypted | no, request rekey |
864
+ | `dht_lookup_failed` | DHT did not find sources in time | yes |
865
+ | `ratchet_out_of_order` | Message too far out of order; sender must rewind | maybe |
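A client could encode the Retry? column along these lines. The sketch treats only the three "yes" rows as automatically retryable and leaves `ratchet_out_of_order` ("maybe") to the caller; the backoff base, factor and cap are illustrative values, not taken from the spec.

```python
from typing import List

# Codes marked "yes" in the table above.
AUTO_RETRYABLE = {"relay_unreachable", "e2e_session_missing", "dht_lookup_failed"}


def should_retry(code: str) -> bool:
    """True if the table marks this error code as retryable."""
    return code in AUTO_RETRYABLE


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   attempts: int = 5, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule: base * factor**n, capped at `cap`."""
    return [min(base * factor ** n, cap) for n in range(attempts)]
```

For `e2e_session_missing`, the retry only makes sense once the key exchange has completed, so a caller would trigger that step first instead of sleeping.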
866
+
867
+ ---
868
+
869
+ ## 10. Versioning and migration
870
+
871
+ ### 10.1 Mixed-version mesh
872
+
873
+ A v1.0 node and a v2.0 node may coexist on the same LAN, but:
874
+
875
+ - A v2.0 node calling a v1.0 node for a Phase 2 capability gets `not_found` (v1 didn't register it).
876
+ - A v1.0 node calling a v2.0 node for a v1 capability works fine (additive contract).
877
+ - v2.0 routes around v1.0 nodes for any capability that requires v2 features.
878
+
879
+ ### 10.2 Migration of an existing community
880
+
881
+ When the founder upgrades to v2.0:
882
+
883
+ 1. New `policy.min_signatures_to_federate` field added with default 3
884
+ 2. New event types unlock; old log still replays cleanly
885
+ 3. Existing nodes prompted to upgrade via `community.policy.updated` event
886
+ 4. After 30 days, federation capabilities won't dispatch to non-upgraded nodes
887
+
888
+ See `MIGRATION_v1_to_v2.md` (out of band).
889
+
890
+ ---
891
+
892
+ ## 11. Out of scope still (deferred to Phase 3)
893
+
894
+ - Distributed-tensor inference capabilities (`experimental.distributed_llm.chat`)
895
+ - MoE-style expert routing (lives inside the bus as a learned scorer)
896
+ - Federated learning capabilities (`fedlearn.*`)
897
+ - LoRa long-distance beacons (no capability, hardware-only)
898
+ - Evidence-layer integration (`evidence.*` namespace reserved here, defined in Phase 3)
899
+ - Conformance test suite as a protocol surface
docs/p2_p3/CAPABILITY_CONTRACT_v3.md ADDED
@@ -0,0 +1,651 @@
1
+ # HearthNet Capability Contract: Phase 3 additions (v3.0)
2
+
3
+ **Spec version:** v3.0
4
+ **Last touched:** 2026-06-09
5
+ **Builds on:** [`../phase-2/CAPABILITY_CONTRACT_v2.md`](../phase-2/CAPABILITY_CONTRACT_v2.md) (v2.0)
6
+
7
+ This document is **additive** to v2.0. Phase 3 capabilities are mostly under the `experimental.` namespace; nodes default to ignoring them unless `policy.research.enable = true`.
8
+
9
+ ---
10
+
11
+ ## 1. Conventions delta
12
+
13
+ ### 1.1 The `experimental.` namespace
14
+
15
+ A capability of the form `experimental.<name>@<ver>` is treated specially:
16
+
17
+ - Nodes do not register experimental capabilities by default.
18
+ - The bus discovery layer (M02 / X05) excludes experimental capabilities from outbound advertisements unless the local policy opts in.
19
+ - A capability in `experimental.` MAY change its contract between minor versions without a contract bump. Callers MUST tolerate breakage.
20
+ - Once a capability is sufficiently proven and stable, it is **promoted** out of `experimental.` in a future minor contract bump. The prior `experimental.` form is then deprecated; both forms work for one minor-release window.
21
+
22
+ Promotion criteria (a soft checklist; the protocol working group, see M32, decides):
23
+ - ≥ 2 independent implementations.
24
+ - Used in production by ≥ 3 communities for ≥ 90 days.
25
+ - Open security review with no unresolved high-severity findings.
26
+ - Per-call cost (compute, latency, bytes) within a 2× factor of the budget in the capability spec.
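Because the checklist is soft and the working group decides, any tooling can only report which criteria are unmet. A minimal sketch, with hypothetical field names and the thresholds taken from the list above:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromotionEvidence:
    independent_implementations: int
    production_communities: int
    production_days: int
    security_review_published: bool
    unresolved_high_severity: int
    cost_ratio_to_budget: float  # measured per-call cost / budgeted cost


def unmet_criteria(e: PromotionEvidence) -> List[str]:
    """List the checklist items that are not yet satisfied."""
    unmet = []
    if e.independent_implementations < 2:
        unmet.append("implementations")
    if e.production_communities < 3 or e.production_days < 90:
        unmet.append("production_use")
    if not e.security_review_published or e.unresolved_high_severity > 0:
        unmet.append("security_review")
    if e.cost_ratio_to_budget > 2.0:
        unmet.append("cost_budget")
    return unmet
```

An empty list means the capability is a candidate for promotion, not that promotion is automatic.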
27
+
28
+ ### 1.2 Claim records (new top-level concept)
29
+
30
+ Phase 3 introduces a second persistent surface alongside the event log: the **claim graph**. Where the event log records "X did Y at T", the claim graph records "X asserts P, citing E".
31
+
32
+ ```json
33
+ {
34
+ "schema_version": 1,
35
+ "claim_id": "01HXR...",
36
+ "claim_type": "factual|preference|policy|sighting|...",
37
+ "predicate": {"subject":"...","verb":"...","object":"...","modifiers":{...}},
38
+ "evidence": [{"kind":"event_ref","value":"01HXR..."}, {"kind":"document_cid","value":"blake3:..."}],
39
+ "asserted_by": "ed25519:...",
40
+ "asserted_at": "...",
41
+ "evidence_level": "unverified|cited|cross_referenced|attested|disputed",
42
+ "supersedes": ["01HXR..."],
43
+ "signature": "..."
44
+ }
45
+ ```
46
+
47
+ Claims live in a separate Merkle-DAG store ([M30 §4](modules/M30-evidence-ebkh.md)). They are not events. Events describe what *happened*; claims describe what is *believed*.
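A minimal sketch of a shape check for the claim record shown above. The function name `validate_claim` and the choice of which fields count as required are assumptions made for illustration; the authoritative schema is whatever M30 defines, and signature verification is deliberately left out.

```python
from typing import Any

# Evidence levels as listed in the claim record example above.
EVIDENCE_LEVELS = {"unverified", "cited", "cross_referenced", "attested", "disputed"}

# Assumption: every field in the example except "supersedes" is mandatory.
REQUIRED_FIELDS = (
    "schema_version", "claim_id", "claim_type", "predicate",
    "evidence", "asserted_by", "asserted_at", "evidence_level", "signature",
)


def validate_claim(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    level = record.get("evidence_level")
    if level is not None and level not in EVIDENCE_LEVELS:
        problems.append(f"unknown evidence_level: {level!r}")

    predicate = record.get("predicate")
    if isinstance(predicate, dict):
        for part in ("subject", "verb", "object"):
            if part not in predicate:
                problems.append(f"predicate lacks {part!r}")

    for item in record.get("evidence", []):
        if not {"kind", "value"} <= item.keys():
            problems.append(f"evidence entry lacks kind/value: {item!r}")
    return problems
```

A node could run a check like this before accepting a record into the claim store; it only inspects structure and says nothing about whether the claim is true or the signature valid.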
48
+
49
+ ### 1.3 New error codes (additive)
50
+
51
+ | Code | Meaning |
52
+ |------|---------|
53
+ | `experimental_disabled` | Caller asked for an experimental capability that the node has not opted into |
54
+ | `shard_unavailable` | Distributed inference: one shard host failed mid-stream |
55
+ | `pipeline_stalled` | Distributed inference: no progress within timeout |
56
+ | `fedlearn_round_quorum` | Federated learning: too few participants for this round |
57
+ | `fedlearn_diff_invalid` | Submitted LoRA diff failed schema or bounds check |
58
+ | `evidence_contradiction` | A new claim directly contradicts a previously-attested claim |
59
+ | `civdef_audit_required` | Operation rejected because civil-defence audit policy is active and the call is unsigned by an authorised role |
60
+
61
+ ---
62
+
63
+ ## 2. Capability namespace allocations
64
+
65
+ Promoted from "reserved" in v2.0 or introduced new:
66
+
67
+ | Prefix | Status | Defined |
68
+ |--------|--------|---------|
69
+ | `experimental.distributed_llm.*` | experimental | [M26](modules/M26-distributed-inference.md) |
70
+ | `experimental.moe.*` | experimental | [M27](modules/M27-moe-routing.md) |
71
+ | `experimental.fedlearn.*` | experimental | [M28](modules/M28-fedlearn.md) |
72
+ | `evidence.*` | stable | [M30](modules/M30-evidence-ebkh.md) |
73
+ | `civdef.*` | stable (when civdef profile active) | [M31](modules/M31-civil-defense.md) |
74
+ | `protocol.*` | stable | [M32](modules/M32-protocol-standard.md) (conformance reporting) |
75
+
76
+ ---
77
+
78
+ ## 3. Phase 3 capabilities
79
+
80
+ | Name | Stability | Stream? | Trust required | Section |
81
+ |------|-----------|---------|----------------|---------|
82
+ | `experimental.distributed_llm.chat@1.0` | experimental | yes | member + research opt-in | §4.1 |
83
+ | `experimental.distributed_llm.shard.advertise@1.0` | experimental | no | trusted + research opt-in | §4.2 |
84
+ | `experimental.distributed_llm.shard.serve@1.0` | experimental | yes | trusted + research opt-in | §4.3 |
85
+ | `experimental.moe.route@1.0` | experimental | no | member + research opt-in | §4.4 |
86
+ | `experimental.moe.expert.register@1.0` | experimental | no | self + research opt-in | §4.5 |
87
+ | `experimental.moe.expert.handoff@1.0` | experimental | yes | as configured | §4.6 |
88
+ | `experimental.fedlearn.round.start@1.0` | experimental | no | anchor + research opt-in | §4.7 |
89
+ | `experimental.fedlearn.round.participate@1.0` | experimental | yes | member + research opt-in | §4.8 |
90
+ | `experimental.fedlearn.round.aggregate@1.0` | experimental | no | round coordinator | §4.9 |
91
+ | `experimental.fedlearn.lora.publish@1.0` | experimental | no | anchor | §4.10 |
92
+ | `evidence.claim.assert@1.0` | stable | no | member | §4.11 |
93
+ | `evidence.claim.dispute@1.0` | stable | no | member | §4.12 |
94
+ | `evidence.claim.attest@1.0` | stable | no | trusted | §4.13 |
95
+ | `evidence.claim.query@1.0` | stable | no | member | §4.14 |
96
+ | `evidence.provenance.trace@1.0` | stable | no | member | §4.15 |
97
+ | `civdef.alert.publish@1.0` | stable | no | authorised KatS role | §4.16 |
98
+ | `civdef.role.register@1.0` | stable | no | anchor (with role-cert) | §4.17 |
99
+ | `civdef.audit.export@1.0` | stable | yes | authorised auditor | §4.18 |
100
+ | `protocol.conformance.report@1.0` | stable | no | self | §4.19 |
101
+ | `protocol.version.list@1.0` | stable | no | unknown | §4.20 |
102
+
103
+ ---
104
+
105
+ ## 4. Per-capability specifications
106
+
107
+ ### 4.1 `experimental.distributed_llm.chat@1.0`
108
+
109
+ Like `llm.chat@2.0` but the inference is sharded across multiple shard-server nodes. The caller's node acts as the orchestrator and streams tokens back to the user.
110
+
111
+ **Request:**
112
+ ```json
113
+ {
114
+ "params": {
115
+ "model": "Qwen2.5-7B-Instruct",
116
+ "sharding": "auto",
117
+ "fallback_to_local": true
118
+ },
119
+ "input": {
120
+ "messages": [...]
121
+ }
122
+ }
123
+ ```
124
+
125
+ **Stream frames:** same as `llm.chat@2.0` (`token_delta`, `done`), plus diagnostic frames:
126
+
127
+ ```
128
+ event: shard_status
129
+ data: {"shards":[
130
+ {"shard_id":"Qwen2.5-7B:0-7","host":"ed25519:...","status":"online","latency_ms":4},
131
+ {"shard_id":"Qwen2.5-7B:8-15","host":"ed25519:...","status":"online","latency_ms":7}
132
+ ]}
133
+
134
+ event: shard_failover
135
+ data: {"failed_shard":"Qwen2.5-7B:8-15","replacement":"Qwen2.5-7B:8-15@other_host"}
136
+ ```
137
+
138
+ **Errors:** `shard_unavailable`, `pipeline_stalled`, `experimental_disabled`.
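The diagnostic frames above could be consumed roughly as follows. This is a sketch under assumptions: `parse_frames` is a hypothetical helper, a blank line is taken to end a frame, and lines without a `data:` prefix are treated as continuations of the current payload (which is how the multi-line `shard_status` example is laid out, but is not confirmed by the spec).

```python
import json
from typing import Iterable, Iterator


def parse_frames(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from 'event:' / 'data:' lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line.strip() == "":
            # Blank line: the frame is complete, so decode and emit it.
            if event is not None and data:
                yield event, json.loads("".join(data))
            event, data = None, []
        else:
            # Assumption: an unprefixed line continues a multi-line payload.
            data.append(line.strip())
    # Emit a trailing frame that was not followed by a blank line.
    if event is not None and data:
        yield event, json.loads("".join(data))
```

An orchestrator UI might use the `shard_failover` frames to show that a replacement host took over, while passing `token_delta` frames straight through to the user.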
139
+
140
+ ### 4.2 `experimental.distributed_llm.shard.advertise@1.0`
141
+
142
+ A node informs the bus that it is willing to serve a specific shard range.
143
+
144
+ **Request:**
145
+ ```json
146
+ {
147
+ "params": {},
148
+ "input": {
149
+ "shard_id": "Qwen2.5-7B:0-7",
150
+ "model_id": "Qwen2.5-7B-Instruct",
151
+ "layer_range": [0, 7],
152
+ "max_concurrent_streams": 2,
153
+ "vram_required_mb": 6800
154
+ }
155
+ }
156
+ ```
157
+
158
+ Emits `experimental.shard.advertised` into the event log. Other nodes can then call `experimental.distributed_llm.shard.serve` on the advertising node to use the shard.
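Before an orchestrator builds a pipeline from advertised shards, it presumably needs to confirm that the chosen shards cover every layer exactly once. The sketch below assumes `layer_range` is inclusive on both ends (inferred from the adjacent `0-7` and `8-15` examples) and that layers are numbered from 0; `covers_all_layers` is a hypothetical helper, not part of the spec.

```python
def covers_all_layers(layer_ranges: list[tuple[int, int]], num_layers: int) -> bool:
    """True if the inclusive ranges tile layers 0..num_layers-1 with no gap or overlap."""
    expected_start = 0
    for start, end in sorted(layer_ranges):
        # Each range must begin exactly where the previous one stopped.
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start == num_layers
```

For a 16-layer model, `[(0, 7), (8, 15)]` passes, while `[(0, 7), (9, 15)]` fails because layer 8 has no host.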
159
+
160
+ ### 4.3 `experimental.distributed_llm.shard.serve@1.0`
161
+
162
+ Tensor-passing inner call. Not normally invoked by user code; used by orchestrators only. See [X08](cross-cutting/X08-tensor-transport.md) for the wire format.
163
+
164
+ ### 4.4 `experimental.moe.route@1.0`
165
+
166
+ Decide which expert (model, human, service) to route a request to.
167
+
168
+ **Request:**
169
+ ```json
170
+ {
171
+ "params": {},
172
+ "input": {
173
+ "request_summary": "User asks about Sankt Martins parade route in Issum, 2026.",
174
+ "tags": ["local_knowledge","event_planning"],
175
+ "top_k": 3
176
+ }
177
+ }
178
+ ```
179
+
180
+ **Response:**
181
+ ```json
182
+ {
183
+ "output": {
184
+ "routes": [
185
+ {"expert_id":"human:ed25519:...", "kind":"human", "score":0.91, "name":"Maria K."},
186
+ {"expert_id":"corpus:niederrhein-events", "kind":"service", "score":0.74, "endpoint":"rag.query@1.0"},
187
+ {"expert_id":"model:llama3-70b-instruct", "kind":"model", "score":0.41}
188
+ ],
189
+ "rationale": "Sankt Martins is a local cultural event; humans with annotated knowledge of Issum specifically score highest."
190
+ },
191
+ "meta": {"ms": 28}
192
+ }
193
+ ```
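The final selection step of routing, once every candidate has a score, can be sketched as a plain top-k pick. How scores are produced (the learned scorer of M27) is out of scope here; `pick_routes` and its `min_score` cut-off are hypothetical names introduced for illustration.

```python
import heapq


def pick_routes(candidates: list[dict], top_k: int = 3, min_score: float = 0.0) -> list[dict]:
    """Return up to top_k candidates, highest score first."""
    # Assumption: candidates below min_score are never worth routing to.
    eligible = [c for c in candidates if c["score"] >= min_score]
    return heapq.nlargest(top_k, eligible, key=lambda c: c["score"])
```

With the response above as input, the human expert (0.91) would be returned first, then the corpus service (0.74), then the model (0.41).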
194
+
195
+ ### 4.5 `experimental.moe.expert.register@1.0`
196
+
197
+ A node (or a human via their node) declares itself an expert on some topics.
198
+
199
+ ```json
200
+ {
201
+ "params": {},
202
+ "input": {
203
+ "expert_kind": "human",
204
+ "topics": ["sankt_martins","niederrhein_local_history"],
205
+ "availability": {"weekdays_19_21_local": true},
206
+ "consent_to_route": true
207
+ }
208
+ }
209
+ ```
210
+
211
+ ### 4.6 `experimental.moe.expert.handoff@1.0`
212
+
213
+ When a route to a human expert is chosen, this capability hands the conversation off to the expert's UI (typically: chat thread invite, optional E2E).
214
+
215
+ ```json
216
+ {
217
+ "params": {},
218
+ "input": {
219
+ "expert_id": "human:ed25519:...",
220
+ "context_summary": "...",
221
+ "permitted_replies": ["text","attachment"],
222
+ "deadline_minutes": 60
223
+ }
224
+ }
225
+ ```
226
+
227
+ ### 4.7 `experimental.fedlearn.round.start@1.0`
228
+
229
+ Anchor opens a federated learning round.
230
+
231
+ ```json
232
+ {
233
+ "params": {},
234
+ "input": {
235
+ "round_id": "01HXR...",
236
+ "base_model": "Qwen2.5-3B-Instruct",
237
+ "lora_config": {"r":16,"alpha":32,"target_modules":["q_proj","v_proj"]},
238
+ "training_corpus": "niederrhein-emergency",
239
+ "min_participants": 3,
240
+ "max_minutes": 120,
241
+ "objective": "next_token_loss",
242
+ "dp_noise_scale": 0.0
243
+ }
244
+ }
245
+ ```
246
+
247
+ ### 4.8 `experimental.fedlearn.round.participate@1.0`
248
+
249
+ A node opts into a round and streams its computed LoRA diff back.
250
+
251
+ **Stream frames:**
252
+ ```
253
+ event: phase
254
+ data: {"phase":"training","step":0,"total":200}
255
+
256
+ event: phase
257
+ data: {"phase":"training","step":200,"total":200}
258
+
259
+ event: diff
260
+ data: {"lora_diff_cid":"blake3:...","examples_seen":4321,"loss_end":0.84}
261
+ ```
262
+
263
+ ### 4.9 `experimental.fedlearn.round.aggregate@1.0`
264
+
265
+ Coordinator aggregates submitted diffs (FedAvg, weighted by `examples_seen`).
266
+
267
+ ```json
268
+ {
269
+ "params": {},
270
+ "input": {
271
+ "round_id": "01HXR...",
272
+ "diff_cids": ["blake3:...","blake3:...","blake3:..."]
273
+ }
274
+ }
275
+ ```
276
+
277
+ **Response:** `{"output":{"aggregated_lora_cid":"blake3:...","participants_used":3,"dropped":[]},"meta":{...}}`
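FedAvg weighted by `examples_seen` means each coordinate of the aggregate is the sum of every participant's value times its example count, divided by the total example count. The sketch below works on flat lists of floats for clarity; real LoRA diffs are tensors fetched by CID, and `fedavg` is a hypothetical helper name.

```python
def fedavg(diffs: list[list[float]], examples_seen: list[int]) -> list[float]:
    """Weighted average of equally shaped diffs, weights proportional to examples_seen."""
    if not diffs or len(diffs) != len(examples_seen):
        raise ValueError("need one examples_seen entry per diff")
    total = sum(examples_seen)
    if total <= 0:
        raise ValueError("total examples_seen must be positive")
    length = len(diffs[0])
    if any(len(d) != length for d in diffs):
        raise ValueError("all diffs must have the same length")
    # Coordinate-wise weighted mean across participants.
    return [
        sum(d[i] * n for d, n in zip(diffs, examples_seen)) / total
        for i in range(length)
    ]
```

For two participants with diffs `[1.0, 0.0]` and `[3.0, 4.0]` and example counts 1 and 3, the result is `[2.5, 3.0]`: the participant that saw more data pulls the average toward its own update.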
278
+
279
+ ### 4.10 `experimental.fedlearn.lora.publish@1.0`
280
+
281
+ After aggregation, the anchor publishes the new LoRA to the community.
282
+
283
+ ```json
284
+ {
285
+ "params": {},
286
+ "input": {
287
+ "round_id": "01HXR...",
288
+ "aggregated_lora_cid": "blake3:...",
289
+ "base_model": "Qwen2.5-3B-Instruct",
290
+ "version": "niederrhein-emergency-v3"
291
+ }
292
+ }
293
+ ```
294
+
295
+ Emits `experimental.fedlearn.lora.published` event. Nodes that have opted into the corpus' LoRA can pull the new version.
296
+
297
+ ### 4.11 `evidence.claim.assert@1.0`
298
+
299
+ Assert a claim into the claim graph.
300
+
301
+ **Request:**
302
+ ```json
303
+ {
304
+ "params": {},
305
+ "input": {
306
+ "claim_type": "factual",
307
+ "predicate": {"subject":"<Brunnen 12 Issum>","verb":"yields","object":"<200L/h drinkable water>"},
308
+ "evidence": [
309
+ {"kind":"event_ref","value":"01HXR..."},
310
+ {"kind":"document_cid","value":"blake3:..."}
311
+ ],
312
+ "ttl_days": 365
313
+ }
314
+ }
315
+ ```
316
+
317
+ **Response:** `{"output":{"claim_id":"01HXR...","evidence_level":"cited"},"meta":{...}}`
318
+
319
+ ### 4.12 `evidence.claim.dispute@1.0`
320
+
321
+ ```json
322
+ {"params":{},"input":{"claim_id":"01HXR...","reason":"...","counter_evidence":[...]}}
323
+ ```
324
+
325
+ ### 4.13 `evidence.claim.attest@1.0`
326
+
327
+ Trusted member adds an attestation, raising the claim's evidence level.
328
+
329
+ ```json
330
+ {"params":{},"input":{"claim_id":"01HXR...","attestation":"I confirmed personally on 2026-06-08"}}
331
+ ```
332
+
333
+ A claim becomes `attested` after `policy.evidence.attestations_required_for_attested` distinct trusted attestations (default 3).
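The threshold rule above could be evaluated as below. This is a sketch: `level_after_attestations` is a hypothetical helper, attesters are identified by key strings, and how disputes interact with attestations is not modelled because the text above does not specify it.

```python
def level_after_attestations(
    current_level: str,
    attesters: list[str],
    trusted: set[str],
    required: int = 3,
) -> str:
    """Return 'attested' once enough distinct trusted members have attested."""
    # Duplicates and attestations from non-trusted keys do not count.
    distinct_trusted = set(attesters) & trusted
    return "attested" if len(distinct_trusted) >= required else current_level
```

The `required` argument stands in for `policy.evidence.attestations_required_for_attested`, whose default of 3 is taken from the sentence above.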
334
+
335
+ ### 4.14 `evidence.claim.query@1.0`
336
+
337
+ ```json
338
+ {
339
+ "params": {},
340
+ "input": {
341
+ "predicate_pattern": {"subject":"<Brunnen 12 Issum>","verb":"*"},
342
+ "min_evidence_level":"cited",
343
+ "limit": 20
344
+ }
345
+ }
346
+ ```
347
+
348
+ ### 4.15 `evidence.provenance.trace@1.0`
349
+
350
+ Walk the evidence chain backwards.
351
+
352
+ ```json
353
+ {
354
+ "params": {},
355
+ "input": {"claim_id":"01HXR...","max_depth":8}
356
+ }
357
+ ```
358
+
359
+ **Response:** a DAG of `{claim_id, predicate, evidence_summary, asserted_by, asserted_at, evidence_level}` nodes.
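Walking the evidence chain backwards with a depth limit is a bounded breadth-first search. In the sketch below the claim graph is a hypothetical in-memory dict mapping each claim ID to the claim IDs it cites; the real store is the Merkle-DAG described in M30.

```python
from collections import deque


def trace_provenance(graph: dict[str, list[str]], claim_id: str, max_depth: int = 8) -> dict[str, int]:
    """Return {claim_id: depth} for every claim reachable within max_depth hops."""
    depths = {claim_id: 0}
    queue = deque([claim_id])
    while queue:
        current = queue.popleft()
        if depths[current] >= max_depth:
            continue  # do not expand beyond the requested depth
        for cited in graph.get(current, []):
            if cited not in depths:  # first visit is the shortest path in a BFS
                depths[cited] = depths[current] + 1
                queue.append(cited)
    return depths
```

Recording the depth per node lets a caller render the DAG layer by layer, and the visited check keeps shared ancestors from being expanded twice.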
360
+
361
+ ### 4.16 `civdef.alert.publish@1.0`
362
+
363
+ Civil-defence-grade alert. Differs from `emergency.publish@1.0` (Phase 1) in that the caller must hold a `civdef.role` credential and the alert is signed for legal-evidence retention.
364
+
365
+ ```json
366
+ {
367
+ "params": {},
368
+ "input": {
369
+ "client_id": "01HXR...",
370
+ "severity": "warning|alert|emergency|extreme",
371
+ "category": "weather|fire|chemical|flood|infrastructure|other",
372
+ "title": "Stromausfall Issum Mitte",
373
+ "body": "...",
374
+ "areas": [{"polygon":"<geojson>"}],
375
+ "issued_by_role": "thw_ortsverband_geldern",
376
+ "audit_evidence": [{"kind":"role_certificate","value":"..."}]
377
+ }
378
+ }
379
+ ```
380
+
381
+ ### 4.17 `civdef.role.register@1.0`
382
+
383
+ Anchor registers a community member as holding an authorised KatS role.
384
+
385
+ ```json
386
+ {
387
+ "params": {},
388
+ "input": {
389
+ "subject": "ed25519:...",
390
+ "role": "thw_helfer|drk_sanitaeter|feuerwehr|katastrophenschutzbeauftragter",
391
+ "role_certificate_cid": "blake3:...",
392
+ "expires_at": "2027-01-01T00:00:00Z"
393
+ }
394
+ }
395
+ ```
396
+
397
+ ### 4.18 `civdef.audit.export@1.0`
398
+
399
+ Stream the audit trail for a time range, used by KatS auditors.
400
+
401
+ ```json
402
+ {
403
+ "params": {},
404
+ "input": {"from":"2026-04-01T00:00:00Z","to":"2026-06-01T00:00:00Z"}
405
+ }
406
+ ```
407
+
408
+ **Stream frames:** one frame per audit record; signed batches every 1000 records for tamper evidence.
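One way to picture the batch sealing is a generator that emits a seal frame after each full batch, chaining every seal to the previous one so that altering an early record changes all later seals. Everything here is an assumption for illustration: the frame names, the chaining, the sealing of a final partial batch, and above all the use of HMAC-SHA256 as a stand-in for the real signature scheme, which this document does not specify.

```python
import hashlib
import hmac
import json
from typing import Iterable, Iterator


def seal_batches(records: Iterable[dict], key: bytes, batch_size: int = 1000) -> Iterator[tuple[str, dict]]:
    """Yield ('record', rec) frames plus a ('batch_seal', {...}) frame per batch."""
    batch = hashlib.sha256()
    in_batch = 0
    prev_mac = b""
    for record in records:
        # Canonical JSON so the same record always hashes identically.
        batch.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        in_batch += 1
        yield "record", record
        if in_batch == batch_size:
            # Stand-in for a signature; chained to the previous seal.
            prev_mac = hmac.new(key, prev_mac + batch.digest(), hashlib.sha256).digest()
            yield "batch_seal", {"records": in_batch, "mac": prev_mac.hex()}
            batch, in_batch = hashlib.sha256(), 0
    if in_batch:
        # Assumption: a trailing partial batch is sealed as well.
        prev_mac = hmac.new(key, prev_mac + batch.digest(), hashlib.sha256).digest()
        yield "batch_seal", {"records": in_batch, "mac": prev_mac.hex()}
```

An auditor holding the verification material could recompute each seal while reading the stream and stop at the first mismatch.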
409
+
410
+ ### 4.19 `protocol.conformance.report@1.0`
411
+
412
+ Generate a conformance report against the [X09](cross-cutting/X09-conformance-suite.md) suite.
413
+
414
+ ```json
415
+ {
416
+ "params": {},
417
+ "input": {"suite_version":"3.0"}
418
+ }
419
+ ```
420
+
421
+ **Response:** `{"output":{"report_cid":"blake3:...","passed":214,"failed":3,"skipped":17},"meta":{...}}`
422
+
423
+ ### 4.20 `protocol.version.list@1.0`
424
+
425
+ Returns the contract versions this node supports and the conformance suite versions it has passed.
426
+
427
+ ```json
428
+ {
429
+ "output": {
430
+ "contract_versions": ["1.0","2.0","3.0"],
431
+ "conformance_passed": [{"suite":"3.0","report_cid":"blake3:..."}],
432
+ "implementation": {"name":"hearthnet-py","version":"0.7.2","commit":"abc123"}
433
+ }
434
+ }
435
+ ```
436
+
437
+ ---
438
+
439
+ ## 5. Wire format additions
440
+
441
+ ### 5.1 Tensor transport (binary)
442
+
443
+ For `experimental.distributed_llm.shard.serve@1.0` only. WebSocket frames carry **binary payloads** with a 16-byte header:
444
+
445
+ ```
446
+ +---------------+---------------+---------------+
447
+ | 4B chunk_id   | 4B chunk_seq  | 4B total_seq  |
447
+ | 2B flags      | 2B reserved   |
448
+ +---------------+---------------+---------------+
449
+ | tensor chunk (≤ 1 MB)                         |
451
+ +-----------------------------------------------+
452
+ ```
453
+
454
+ `flags`:
455
+ - `0x0001` LAST (this is the final chunk of the message)
456
+ - `0x0002` COMPRESSED (zstd-compressed; only when payload β‰₯ `TENSOR_COMPRESSION_THRESHOLD_BYTES`)
457
+ - `0x0004` FP16 (else FP32)
458
+
459
+ See [X08](cross-cutting/X08-tensor-transport.md) for the protocol.
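The 16-byte header maps naturally onto a fixed `struct` layout. The sketch below makes several assumptions that X08 would need to confirm: big-endian byte order, `chunk_id` identifying the message, `chunk_seq` being a 0-based index, and `total_seq` being the chunk count. Compression is omitted; `split_into_frames` and `parse_frame` are hypothetical helpers.

```python
import struct

# Assumed layout: chunk_id, chunk_seq, total_seq (4 bytes each), flags, reserved (2 bytes each).
HEADER = struct.Struct(">IIIHH")  # big-endian is an assumption; 16 bytes total

FLAG_LAST = 0x0001
FLAG_COMPRESSED = 0x0002
FLAG_FP16 = 0x0004
TENSOR_CHUNK_BYTES = 1_048_576  # 1 MB, as in the Phase 3 constants


def split_into_frames(chunk_id: int, payload: bytes, fp16: bool = True,
                      chunk_bytes: int = TENSOR_CHUNK_BYTES) -> list[bytes]:
    """Split payload into header-prefixed frames of at most chunk_bytes each."""
    pieces = [payload[i:i + chunk_bytes] for i in range(0, len(payload), chunk_bytes)] or [b""]
    frames = []
    for seq, piece in enumerate(pieces):
        flags = FLAG_FP16 if fp16 else 0
        if seq == len(pieces) - 1:
            flags |= FLAG_LAST  # mark the final chunk of the message
        frames.append(HEADER.pack(chunk_id, seq, len(pieces), flags, 0) + piece)
    return frames


def parse_frame(frame: bytes) -> tuple[int, int, int, int, bytes]:
    """Return (chunk_id, chunk_seq, total_seq, flags, payload)."""
    chunk_id, seq, total, flags, _reserved = HEADER.unpack_from(frame)
    return chunk_id, seq, total, flags, frame[HEADER.size:]
```

A receiver can reassemble a message by concatenating payloads in `chunk_seq` order until it sees a frame with the LAST flag set.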
460
+
461
+ ### 5.2 Claim records
462
+
463
+ Claims are stored in their own Merkle-DAG; they reference events but are not events. A claim record header:
464
+
465
+ ```
466
+ X-HearthNet-Claim: 01HXR...
467
+ X-HearthNet-Claim-Asserted-By: ed25519:...
468
+ ```
469
+
470
+ Claim records flow through the same transport but with `Content-Type: application/vnd.hearthnet.claim+json`.
471
+
472
+ ### 5.3 Civil-defence audit signatures
473
+
474
+ `civdef.*` capability calls always require both a per-call signature (per X01) **and** a `civdef.role` credential reference in headers:
475
+
476
+ ```
477
+ X-HearthNet-CivDef-Role: thw_helfer
478
+ X-HearthNet-CivDef-Role-Cert: blake3:...
479
+ ```
480
+
481
+ Without these, the call returns `civdef_audit_required`.
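A handler could reject calls that lack the two headers with a check like the one below. This is only the presence check described above; verifying the referenced certificate (and returning `civdef_role_invalid` when it is expired or revoked) is not shown. `check_civdef_headers` is a hypothetical helper.

```python
from typing import Optional

REQUIRED_CIVDEF_HEADERS = ("X-HearthNet-CivDef-Role", "X-HearthNet-CivDef-Role-Cert")


def check_civdef_headers(headers: dict[str, str]) -> Optional[str]:
    """Return an error code if a required header is absent or empty, else None."""
    # HTTP header names are case-insensitive, so compare in lower case.
    normalized = {name.lower(): value for name, value in headers.items()}
    for name in REQUIRED_CIVDEF_HEADERS:
        if not normalized.get(name.lower(), "").strip():
            return "civdef_audit_required"
    return None
```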
482
+
483
+ ---
484
+
485
+ ## 6. Manifests
486
+
487
+ ### 6.1 Node manifest delta
488
+
489
+ ```json
490
+ {
491
+ "contract_version": "3.0",
492
+ "experimental_capabilities_enabled": false,
493
+ "civdef_profile": {
494
+ "active": false,
495
+ "authority": "",
496
+ "audit_endpoint": ""
497
+ },
498
+ "research_opt_in": {
499
+ "fedlearn": false,
500
+ "distributed_inference": false,
501
+ "moe_human_routing": false
502
+ }
503
+ }
504
+ ```
505
+
506
+ A node with `experimental_capabilities_enabled=false` does not advertise any `experimental.*` capabilities.
507
+
508
+ ### 6.2 Community policy delta
509
+
510
+ ```yaml
511
+ research:
512
+   enable: false
513
+   enabled_capabilities: []
514
+
515
+ fedlearn:
516
+   participate: false
517
+   share_compute_with_federated: false
518
+   dp_noise_scale_min: 0.0
519
+
520
+ civdef:
521
+   active: false
522
+   authority: ""          # e.g. "Kreis Kleve Bevölkerungsschutz"
523
+   audit_export_to: ""
524
+
525
+ evidence:
526
+   attestations_required_for_attested: 3
527
+   default_claim_ttl_days: 365
528
+   retain_disputed_claims_days: 1825
529
+ ```
530
+
531
+ ---
532
+
533
+ ## 7. Events (additive to v2.0 §7.1)
534
+
535
+ ```
536
+ experimental.shard.advertised
537
+ experimental.shard.retired
538
+ experimental.fedlearn.round.opened
539
+ experimental.fedlearn.round.closed
540
+ experimental.fedlearn.lora.published
541
+ experimental.moe.expert.registered
542
+ experimental.moe.expert.unregistered
543
+ evidence.claim.asserted
544
+ evidence.claim.disputed
545
+ evidence.claim.attested
546
+ evidence.claim.superseded
547
+ civdef.alert.published
548
+ civdef.role.registered
549
+ civdef.role.revoked
550
+ civdef.audit.exported
551
+ protocol.conformance.reported
552
+ ```
553
+
554
+ ### Selected schemas
555
+
556
+ #### `evidence.claim.asserted`
557
+
558
+ ```json
559
+ {
560
+ "claim_id": "01HXR...",
561
+ "claim_type": "factual",
562
+ "predicate": {...},
563
+ "asserted_by": "ed25519:...",
564
+ "evidence_level": "cited",
565
+ "claim_payload_cid": "blake3:..."
566
+ }
567
+ ```
568
+
569
+ The event references the full claim record by CID; the claim itself lives in the claim store (M30 §4).
570
+
571
+ #### `civdef.alert.published`
572
+
573
+ ```json
574
+ {
575
+ "client_id": "01HXR...",
576
+ "severity": "alert",
577
+ "category": "infrastructure",
578
+ "title": "Stromausfall Issum Mitte",
579
+ "issued_by": "ed25519:...",
580
+ "issued_by_role":"thw_ortsverband_geldern",
581
+ "areas": [{...}]
582
+ }
583
+ ```
584
+
585
+ #### `experimental.fedlearn.lora.published`
586
+
587
+ ```json
588
+ {
589
+ "round_id": "01HXR...",
590
+ "base_model": "Qwen2.5-3B-Instruct",
591
+ "aggregated_lora_cid": "blake3:...",
592
+ "version": "niederrhein-emergency-v3",
593
+ "participants": 5,
594
+ "objective": "next_token_loss",
595
+ "dp_noise_scale": 0.0
596
+ }
597
+ ```
598
+
599
+ ---
600
+
601
+ ## 8. Pub-sub topics (additive)
602
+
603
+ | Topic | Producer | Subscriber |
604
+ |-------|----------|------------|
605
+ | `experimental.shard.advertised` | shard host | orchestrators |
606
+ | `experimental.fedlearn.round.opened.<round_id>` | coordinator | members |
607
+ | `experimental.moe.expert.registered` | expert | router |
608
+ | `evidence.claim.<claim_id>.changed` | asserter / disputer | watchers |
609
+ | `civdef.alert.<area_hash>` | civdef caller | members in area |
610
+
611
+ ---
612
+
613
+ ## 9. Errors: complete Phase 3 set
614
+
615
+ (additive to v2.0 §9)
616
+
617
+ | Code | When |
618
+ |------|------|
619
+ | `experimental_disabled` | Caller asked for an experimental capability the node has not opted into |
620
+ | `shard_unavailable` | A required shard host failed mid-pipeline |
621
+ | `pipeline_stalled` | No progress within `DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S` |
622
+ | `fedlearn_round_quorum` | Round closed for lack of participants |
623
+ | `fedlearn_diff_invalid` | Submitted diff failed schema or norm bounds |
624
+ | `evidence_contradiction` | A new claim directly contradicts an attested claim; explicit override needed |
625
+ | `civdef_audit_required` | Operation rejected because civdef profile is active and call is not properly signed |
626
+ | `civdef_role_invalid` | Civdef role credential is missing, expired, or revoked |
627
+ | `conformance_failed` | A `protocol.*` operation depends on a passed conformance suite that this node has not passed |
628
+
629
+ ---
630
+
631
+ ## 10. Compatibility
632
+
633
+ - v3 contract nodes are backward-compatible with v2 nodes: v2 nodes simply do not see `experimental.*` capabilities and do not understand claim records (the relevant event types are skipped on replay).
634
+ - A v3 community must include at least one anchor that has passed the X09 conformance suite at level 3.0 (as reported via `protocol.conformance.report@1.0`); otherwise capabilities under `civdef.*` and the claim graph features remain inert (the spec calls this **degraded v3**; everything else still works).
635
+ - Promotion of an `experimental.X` to `X` is a normal v3 minor contract bump; the experimental form remains registered for one minor cycle.
636
+
637
+ ---
638
+
639
+ ## 11. Glossary additions
640
+
641
+ | Term | Meaning |
642
+ |------|---------|
643
+ | Shard | One contiguous range of transformer layers, served by one node |
644
+ | Pipeline | The chain of shards that produces a single LLM forward-pass |
645
+ | Expert | A routable subsystem (model, service, or human) that can answer a class of requests |
646
+ | Round | A federated-learning training session bounded in time |
647
+ | Diff | A LoRA delta tensor produced by one round participant |
648
+ | Claim | A signed assertion of a predicate, with evidence, in the claim graph |
649
+ | Attestation | A signed endorsement by a trusted member that a claim is correct |
650
+ | KatS | Katastrophenschutz (German civil protection) |
651
+ | Conformance | The property of an implementation passing the X09 suite at a stated level |
docs/p2_p3/IMPLEMENTATION_REFERENCE.md ADDED
@@ -0,0 +1,288 @@
1
+ # HearthNet Phase 3: Spec Set Overview
2
+
3
+ **Phase 3 scope:** research-shaped, 6–12 months. This is where HearthNet stops being a product and starts being a protocol. Each module here is an investment in a long-term capability where the engineering is the easy part; the hard part is establishing trust, governance, and standards.
4
+
5
+ **Stance:** Phase 3 specs are **roadmaps**, not contracts. Where a Phase-1/2 spec answers "what does this *do*?", a Phase-3 spec answers "what would we *build* if we were ready to commit?". Concrete enough to start, loose enough to be wrong about details without invalidating the direction.
6
+
7
+ ---
8
+
9
+ ## 0. Reading these specs
10
+
11
+ Phase 3 specs deviate from the Phase 1 / 2 template in three respects:
12
+
13
+ 1. **Stability tag is `experimental` for new capabilities** unless explicitly promoted later. Mesh nodes ignore experimental capabilities unless the operator opts in via `policy.research.enable = true`.
14
+ 2. **Each module carries an "Open research questions" section** that is longer than the spec itself, by design. Phase 3 modules answer *some* of their open questions before shipping; the rest stay open.
15
+ 3. **Acceptance criteria are described, not enumerated**. The point isn't to grade an implementation against a checklist; it's to say "we'll know this is working when…"
16
+
17
+ If you read a Phase 3 spec and feel uncertain about how something works, that uncertainty is faithful to the state of the work. The spec is doing its job by being honest about that.
18
+
19
+ ---
20
+
21
+ ## 1. Module map (Phase 3)
22
+
23
+ ### New numbered modules
24
+
25
+ | ID | Module | Spec file | Concern |
26
+ |-----|------------------------------|-------------------------------------------------|----------------------------------------------------------------------|
27
+ | M26 | Distributed Inference | `modules/M26-distributed-inference.md` | Layer-sharded LLMs across nodes (Petals-style), small models only |
28
+ | M27 | MoE Expert Routing | `modules/M27-moe-routing.md` | Route queries to the right expert (machine or human) via learned scorer |
29
+ | M28 | Federated Learning | `modules/M28-fedlearn.md` | FedAvg on LoRA layers; per-community fine-tuning without sharing data |
30
+ | M29 | LoRa Long-Distance Beacons | `modules/M29-lora-beacons.md` | 868 MHz "community alive" beacons; no AI traffic; emergency-only |
31
+ | M30 | Evidence / EBKH | `modules/M30-evidence-ebkh.md` | Claim graph alongside the event log; provenance + verifiability |
32
+ | M31 | Civil Defence Pilot | `modules/M31-civil-defense.md` | THW / DRK / KatS bridge; compliance profile; audit trail |
33
+ | M32 | Protocol Standardisation | `modules/M32-protocol-standard.md` | Reference implementation, conformance suite, governance for the spec |
34
+
35
+ ### New cross-cutting modules
36
+
37
+ | ID | Module | Spec file | Concern |
38
+ |-----|-----------------------|---------------------------------------------------|------------------------------------------------------|
39
+ | X08 | Tensor Transport | `cross-cutting/X08-tensor-transport.md` | High-throughput chunked tensor passing for M26 |
40
+ | X09 | Conformance Suite | `cross-cutting/X09-conformance-suite.md` | Black-box tests defining what "HearthNet-compliant" means |
41
+
42
+ ### Modifications to earlier modules
43
+
44
+ | Phase 1/2 module | Phase 3 extension |
45
+ |------------------|-------------------|
46
+ | M03 Bus | Optional MoE routing layer between dispatcher and handler (M27) |
47
+ | M04 LLM | Optional `experimental.distributed_llm.chat@1.0` backend (M26) |
48
+ | X02 Event log | Optional `evidence.*` claim records side-by-side with events (M30) |
49
+ | M14 Federation | Federated learning rounds use federation as the trust substrate (M28) |
50
+ | X03 Observability | Per-call expert-routing trace; per-shard tensor-transport metrics (M27, X08) |
51
+
52
+ ---
53
+
54
+ ## 2. Dependency graph (Phase 3 additions on top of Phases 1–2)
55
+
56
+ ```
57
+ ┌─────────────────────────────────────────────┐
58
+ │        Phase 1 + Phase 2 (unchanged)        │
59
+ └────┬────────────────┬──────────────┬────────┘
60
+      │                │              │
61
+      ▼                ▼              ▼
62
+ ┌──────────┐    ┌──────────┐    ┌──────────┐
63
+ │   X08    │    │   M27    │    │   M30    │
64
+ │  Tensor  │    │   MoE    │    │   EBKH   │
65
+ │  Transp. │    │  Routing │    │  Evidence│
66
+ └─────┬────┘    └─────┬────┘    └────┬─────┘
67
+       ▼               │              │
68
+ ┌──────────┐          │              │
69
+ │   M26    │          │              │
70
+ │  Distrib.│          │              │
71
+ │  Infer.  │          │              │
72
+ └──────────┘          │              │
73
+                       ▼              ▼
74
+                 ┌──────────┐    ┌──────────┐
75
+                 │   M28    │    │   M31    │
76
+                 │  FedLearn│    │  CivDef. │
77
+                 └──────────┘    └──────────┘
78
+
79
+ Standalone (no software deps, governance / hardware):
80
+ ┌──────────┐
81
+ │   M29    │  (hardware)
82
+ │   LoRa   │
83
+ │  Beacons │
84
+ └──────────┘
85
+ ┌──────────┐
86
+ │   X09    │  (process)
87
+ │  Conform.│
88
+ └──────────┘
89
+ ┌──────────┐
90
+ │   M32    │  (governance)
91
+ │  Standard│
92
+ └──────────┘
93
+ ```
94
+
95
+ Most Phase 3 modules are independent of each other. The exceptions:
96
+ - M26 depends on X08
97
+ - M27 informs M26 (MoE routing picks which expert/shard)
98
+ - M28 reuses M14 federation for cross-community rounds
99
+ - M31 reuses M30 for evidence-grade emergency claims
100
+
101
+ ---
102
+
103
+ ## 3. File tree additions
104
+
105
+ ```
106
+ hearthnet/
107
+ ├── distributed_inference/        # M26
108
+ │   ├── __init__.py
109
+ │   ├── shard.py
110
+ │   ├── pipeline.py
111
+ │   ├── routing.py
112
+ │   └── backends/
113
+ │       ├── petals_like.py
114
+ │       └── small_model_layered.py
115
+ │
116
+ ├── moe/                          # M27
117
+ │   ├── __init__.py
118
+ │   ├── router.py
119
+ │   ├── scorer.py
120
+ │   └── human_in_the_loop.py
121
+ │
122
+ ├── fedlearn/                     # M28
123
+ │   ├── __init__.py
124
+ │   ├── coordinator.py
125
+ │   ├── round.py
126
+ │   ├── lora_diff.py
127
+ │   └── aggregation.py
128
+ │
129
+ ├── lora_beacons/                 # M29: hardware integration; tiny Python surface
130
+ │   ├── __init__.py
131
+ │   ├── beacon_bridge.py          # serial protocol to a LoRa USB stick
132
+ │   └── policy.py
133
+ │
134
+ ├── evidence/                     # M30
135
+ │   ├── __init__.py
136
+ │   ├── claim.py
137
+ │   ├── claim_graph.py
138
+ │   ├── provenance.py
139
+ │   └── ebkh_bridge.py            # bridge to Christof's EBKH v3+
140
+ │
141
+ ├── civil_defense/                # M31
142
+ │   ├── __init__.py
143
+ │   ├── profile.py                # THW / DRK / KatS member types
144
+ │   ├── audit.py
145
+ │   └── nrw_katastrophenschutz.py
146
+ │
147
+ ├── transport/
148
+ │   └── tensor.py                 # X08
149
+ │
150
+ └── conformance/                  # X09
151
+     ├── __init__.py
152
+     ├── runner.py
153
+     ├── suites/
154
+     │   ├── identity.py
155
+     │   ├── transport.py
156
+     │   ├── bus.py
157
+     │   ├── services.py
158
+     │   └── federation.py
159
+     └── report.py
160
+
161
+ protocol/                         # M32: separate top-level dir at repo root
162
+ ├── README.md
163
+ ├── spec/                         # the protocol spec, decoupled from the impl
164
+ │   ├── 00-overview.md            # mirror of CAPABILITY_CONTRACT but
165
+ │   ├── 01-identity.md            # implementation-agnostic
166
+ │   └── ...
167
+ └── governance/
168
+     ├── CHANGELOG.md
169
+     ├── CONTRIBUTING.md
170
+     └── ROADMAP.md
171
+ ```
172
+
173
+ ---
174
+
175
+ ## 4. Conventions delta from Phase 2
176
+
177
+ ### 4.1 New `experimental` namespace
178
+
179
+ A Phase-3 capability MAY be advertised as `experimental.<name>@<ver>`. Mesh nodes default to **not registering** experimental capabilities; the operator must opt in via:
180
+
181
+ ```toml
182
+ [policy.research]
183
+ enable = true
184
+ enabled_capabilities = ["experimental.distributed_llm.chat@1.0", "experimental.fedlearn.round.*"]
185
+ ```
186
+
187
+ Once a capability is sufficiently proven, it is promoted out of the `experimental.` prefix in a contract bump.
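The opt-in rule can be sketched as a small predicate over the `[policy.research]` table. Two things here are assumptions: that the `*` in `enabled_capabilities` is a shell-style wildcard (matched below with `fnmatchcase`), and that an empty list enables nothing even when `enable = true`. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def is_experimental_enabled(capability: str, research_policy: dict) -> bool:
    """Decide whether a node should register or advertise the given capability."""
    if not capability.startswith("experimental."):
        return True  # non-experimental capabilities are unaffected by this policy
    if not research_policy.get("enable", False):
        return False
    patterns = research_policy.get("enabled_capabilities", [])
    # Assumption: entries are exact names or shell-style glob patterns.
    return any(fnmatchcase(capability, pattern) for pattern in patterns)
```

With the TOML above loaded into a dict, `experimental.fedlearn.round.start@1.0` matches the `experimental.fedlearn.round.*` pattern, while `experimental.moe.route@1.0` stays disabled.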
188
+
189
+ ### 4.2 New type aliases
190
+
191
+ ```python
192
+ # additions to hearthnet/types.py
193
+
194
+ ShardID = str # "<model_id>:<layer_range>"
195
+ ExpertID = str # opaque, refers to a routable subsystem
196
+ ClaimID = str # ULID
197
+ RoundID = str # fedlearn round identifier (ULID)
198
+ LoraBeaconID = str # 8-byte hex, hardware-issued
199
+ EvidenceLevel = Literal["unverified","cited","cross_referenced","attested","disputed"]
200
+ ExpertKind = Literal["model","human","service","external"]
201
+ ```
202
+
203
+ ### 4.3 New constants
204
+
205
+ ```python
206
+ # additions to hearthnet/constants.py (Phase 3)
207
+
208
+ # Distributed inference (M26)
209
+ DISTRIBUTED_MAX_SHARDS_PER_REQUEST = 16
210
+ DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S = 30
211
+ DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES = 2
212
+
213
+ # MoE routing (M27)
214
+ MOE_ROUTER_TOP_K = 3
215
+ MOE_ROUTER_TRAIN_MIN_EXAMPLES = 200
216
+ MOE_ROUTER_RETRAIN_EVERY_HOURS = 24
217
+
218
+ # Federated learning (M28)
219
+ FEDLEARN_MAX_ROUND_MINUTES = 120
220
+ FEDLEARN_MIN_PARTICIPANTS = 3
221
+ FEDLEARN_MAX_LORA_RANK = 64
222
+ FEDLEARN_GRAD_CLIP = 1.0
223
+ FEDLEARN_DP_NOISE_SCALE_DEFAULT = 0.0 # differential privacy is off by default
224
+
225
+ # Evidence (M30)
226
+ EVIDENCE_CLAIM_TTL_DAYS_DEFAULT = 365
227
+ EVIDENCE_MAX_PROVENANCE_DEPTH = 16
228
+
229
+ # Civil defence (M31)
230
+ CIVDEF_AUDIT_RETENTION_YEARS = 10
231
+ CIVDEF_HEARTBEAT_SECONDS = 60
232
+
233
+ # Tensor transport (X08)
234
+ TENSOR_CHUNK_BYTES = 1_048_576 # 1 MiB
235
+ TENSOR_FLOW_CONTROL_WINDOW = 16 # chunks
236
+ TENSOR_COMPRESSION_THRESHOLD_BYTES = 65_536
237
+
238
+ # LoRa beacons (M29)
239
+ LORA_BEACON_PERIOD_SECONDS_DEFAULT = 600 # 10 minutes
240
+ LORA_BEACON_MAX_PAYLOAD_BYTES = 32
241
+ ```
242
+
243
+ ---
244
+
245
+ ## 5. Build order (Phase 3)
246
+
247
+ Phase 3 is not a release; it is a set of long-running tracks. Suggested ordering by independence + value:
248
+
249
+ | Track | Modules | Outcome |
250
+ |-------|----------------------------------|-------------------------------------------------------------------------------|
251
+ | A | X09 Conformance + M32 Standard | Other people can build HearthNet-compliant nodes |
252
+ | B | M30 Evidence / EBKH | Marketplace claims and emergency posts carry provenance |
253
+ | C | M27 MoE Routing (machines only) | Better answers for free; routes RAG queries to best-suited backend |
254
+ | D | M27 + M28 (human routing) | Neighbour gets pinged when their expertise matches |
255
+ | E | M28 FedLearn | Communities co-train a small LoRA without sharing source data |
256
+ | F | X08 + M26 Distributed Inference | Two anchors jointly serve a 7B model; large models become feasible LAN-wide |
257
+ | G | M29 LoRa Beacons | Resilient "I am alive" pings during regional internet outages |
258
+ | H | M31 Civil Defence Pilot | A real Niederrhein THW Ortsverband uses HearthNet for an exercise |
259
+
260
+ Tracks can run in parallel. None of them block the existing Phase-2 system.
261
+
262
+ ---
263
+
264
+ ## 6. Spec versioning
265
+
266
+ - Capability Contract bumps to **v3.0** but the bump is *additive*. v2 nodes coexist with v3 nodes; experimental capabilities simply aren't seen by v2 nodes.
267
+ - The first concrete deliverable of Track A (M32) is to **decouple** the protocol spec from the implementation. After that, the contract has its own version track separate from the Python implementation's version.
268
+
269
+ ---
270
+
271
+ ## 7. Out-of-band documents (Phase 3)
272
+
273
+ - **RESEARCH_AGENDA.md**: the deeper "why" for each module; intended audience: PhD students and grant reviewers
273
+ - **GOVERNANCE.md**: how spec changes are proposed, reviewed, and accepted; ties into M32
274
+ - **ETHICS_REVIEW.md**: the framework for evaluating MoE-driven routing-to-humans (M27) and fedlearn-on-personal-data (M28)
275
+ - **CIVDEF_AGREEMENT_TEMPLATE.md**: the MoU template for a civil-defence pilot
277
+
278
+ ---
279
+
280
+ ## 8. What is NOT in Phase 3
281
+
282
+ Even with all of Phase 3 done, the following remain explicit non-goals:
283
+
284
+ - A central directory of communities. There is no "HearthNet.com" listing all communities. Discovery is via word of mouth + DHT + federation. Pushed indefinitely.
285
+ - An app store for capabilities. Capabilities are code in the source tree, reviewed by maintainers. Not pluggable at runtime by untrusted code.
286
+ - A consensus protocol (Paxos, Raft). Communities do not vote on shared state beyond event-log gossip. Federation does not imply consensus.
287
+ - A cryptocurrency / token economy. Not even for fedlearn incentives. Reputational signals only.
288
+ - AGI. Even the distributed inference module targets at-most-mid-sized models (7B-class). The thesis is "small models close to people are more useful than large models far away", and Phase 3 doesn't change that.
docs/p2_p3/IMPLEMENTATION_REFERENCE_p3.md ADDED
@@ -0,0 +1,493 @@
1
+ # HearthNet Phase 3: Implementation Reference
2
+
3
+ **Spec set:** v3.0 (*experimental*)
4
+ **Status:** Research-shaped. Names and contracts may shift before promotion to stable.
5
+
6
+ This document is the symbol index and quick-reference for everything introduced in Phase 3. It mirrors the structure of Phase 1's and Phase 2's `IMPLEMENTATION_REFERENCE.md`, so a maintainer can find the relevant spec section, file, class, capability, event, or constant by name.
7
+
8
+ Read this with the Phase 1 + Phase 2 references already in hand. Phase 3 is purely additive; it does not redefine anything from earlier phases.
9
+
10
+ ---
11
+
12
+ ## 1. Module index
13
+
14
+ | ID | Name | Status | Spec file |
15
+ |----|------|--------|-----------|
16
+ | M26 | Distributed Inference (layer sharding) | experimental | [phase-3/modules/M26-distributed-inference.md](modules/M26-distributed-inference.md) |
17
+ | M27 | MoE Expert Routing | experimental | [phase-3/modules/M27-moe-routing.md](modules/M27-moe-routing.md) |
18
+ | M28 | Federated Learning (LoRA aggregation) | experimental | [phase-3/modules/M28-fedlearn.md](modules/M28-fedlearn.md) |
19
+ | M29 | LoRa Hardware Beacons | experimental | [phase-3/modules/M29-lora-beacons.md](modules/M29-lora-beacons.md) |
20
+ | M30 | Evidence Graph & EBKH Integration | experimental | [phase-3/modules/M30-evidence-ebkh.md](modules/M30-evidence-ebkh.md) |
21
+ | M31 | Civil Defense (NRW Bevölkerungsschutz) | experimental | [phase-3/modules/M31-civil-defense.md](modules/M31-civil-defense.md) |
22
+ | M32 | Protocol Standardisation & Conformance | provisional | [phase-3/modules/M32-protocol-standard.md](modules/M32-protocol-standard.md) |
23
+ | X08 | Tensor Transport | experimental | [phase-3/cross-cutting/X08-tensor-transport.md](cross-cutting/X08-tensor-transport.md) |
24
+ | X09 | Conformance Suite | provisional | [phase-3/cross-cutting/X09-conformance-suite.md](cross-cutting/X09-conformance-suite.md) |
25
+
26
+ All `experimental` modules are gated by per-module feature flags in `hearthnet/config.py` and default to `False`.
27
+
28
+ ---
29
+
30
+ ## 2. Capability index
31
+
32
+ All Phase 3 capabilities (except `protocol.*`) live in the `experimental.*` namespace. The bus refuses to register them unless the corresponding feature flag is on.
33
+
34
+ ### 2.1 Distributed Inference (M26)
35
+
36
+ | Capability | Module | Notes |
37
+ |------------|--------|-------|
38
+ | `experimental.distributed_llm.shard.list` | M26 | Local advertisement of shards we host |
39
+ | `experimental.distributed_llm.shard.connect` | M26 | Negotiate an X08 tensor session |
40
+ | `experimental.distributed_llm.shard.forward` | M26 | Run forward through a hosted shard |
41
+ | `experimental.distributed_llm.pipeline.plan` | M26 | Construct a layer pipeline for a model |
42
+ | `experimental.distributed_llm.pipeline.run` | M26 | Execute a planned pipeline |
43
+ | `experimental.distributed_llm.pipeline.status` | M26 | Pipeline state and stats |
44
+
45
+ ### 2.2 MoE Routing (M27)
46
+
47
+ | Capability | Module | Notes |
48
+ |------------|--------|-------|
49
+ | `experimental.moe.expert.register` | M27 | Register self as an expert for some topics |
50
+ | `experimental.moe.expert.list` | M27 | List registered experts |
51
+ | `experimental.moe.expert.unregister` | M27 | Withdraw expert registration |
52
+ | `experimental.moe.route.query` | M27 | Get top-K experts for a query |
53
+ | `experimental.moe.route.handoff` | M27 | Initiate human-in-the-loop handoff |
54
+ | `experimental.moe.feedback.record` | M27 | Record outcome for scorer training |
55
+
56
+ ### 2.3 Federated Learning (M28)
57
+
58
+ | Capability | Module | Notes |
59
+ |------------|--------|-------|
60
+ | `experimental.fedlearn.round.announce` | M28 | Coordinator announces a round |
61
+ | `experimental.fedlearn.round.list` | M28 | List open rounds |
62
+ | `experimental.fedlearn.round.join` | M28 | Participant joins a round |
63
+ | `experimental.fedlearn.round.submit` | M28 | Submit gradient delta |
64
+ | `experimental.fedlearn.round.status` | M28 | Get round state |
65
+ | `experimental.fedlearn.round.finalize` | M28 | Coordinator finalises and aggregates |
66
+ | `experimental.fedlearn.adapter.fetch` | M28 | Fetch an aggregated adapter by SHA |
67
+ | `experimental.fedlearn.adapter.apply` | M28 | Apply adapter to session or node |
68
+
69
+ ### 2.4 LoRa Beacons (M29)
70
+
71
+ | Capability | Module | Notes |
72
+ |------------|--------|-------|
73
+ | `experimental.lora.status` | M29 | Hardware and link status |
74
+ | `experimental.lora.beacon.send` | M29 | Send a normal beacon |
75
+ | `experimental.lora.panic.send` | M29 | Send a panic burst |
76
+ | `experimental.lora.peer.list` | M29 | Known LoRa peers |
77
+ | `experimental.lora.peer.verify` | M29 | TOFU-confirm a peer's NodeID binding |
78
+ | `experimental.lora.recent_beacons` | M29 | Recent RX'd beacons |
79
+ | `experimental.lora.duty_cycle` | M29 | Current duty-cycle budget status |
80
+
81
+ ### 2.5 Evidence Graph (M30)
82
+
83
+ | Capability | Module | Notes |
84
+ |------------|--------|-------|
85
+ | `experimental.evidence.claim.assert` | M30 | Assert a new claim |
86
+ | `experimental.evidence.claim.dispute` | M30 | Dispute an existing claim |
87
+ | `experimental.evidence.claim.attest` | M30 | Attest to an existing claim |
88
+ | `experimental.evidence.claim.get` | M30 | Fetch a single claim by ID |
89
+ | `experimental.evidence.claim.query` | M30 | Query claims by triple |
90
+ | `experimental.evidence.provenance.trace` | M30 | Walk the derivation graph |
91
+ | `experimental.evidence.subject.summary` | M30 | Multi-claim summary for a subject |
92
+ | `experimental.evidence.ebkh.sync` | M30 | Sync with external EBKH endpoint |
93
+
94
+ ### 2.6 Civil Defense (M31)
95
+
96
+ | Capability | Module | Notes |
97
+ |------------|--------|-------|
98
+ | `experimental.civdef.alert.publish` | M31 | Role-cert-gated alert publication |
99
+ | `experimental.civdef.alert.cancel` | M31 | Cancel an active alert |
100
+ | `experimental.civdef.alert.list` | M31 | List active alerts (filtered) |
101
+ | `experimental.civdef.alert.get` | M31 | Fetch an alert envelope |
102
+ | `experimental.civdef.alert.subscribe` | M31 | Subscribe to alerts matching a filter |
103
+ | `experimental.civdef.alert.ack` | M31 | Acknowledge an alert |
104
+ | `experimental.civdef.alert.acks` | M31 | List acks for an alert |
105
+ | `experimental.civdef.role.register` | M31 | Register a role certificate |
106
+ | `experimental.civdef.role.list` | M31 | List registered certificates |
107
+ | `experimental.civdef.role.revoke` | M31 | Revoke a certificate |
108
+ | `experimental.civdef.audit.export` | M31 | Export tamper-evident audit chain |
109
+
110
+ ### 2.7 Protocol (M32): stable
111
+
112
+ | Capability | Module | Notes |
113
+ |------------|--------|-------|
114
+ | `protocol.version.list` | M32 | Versions supported |
115
+ | `protocol.self.describe` | M32 | Implementation descriptor |
116
+ | `protocol.conformance.report` | M32 | Run / fetch conformance report |
117
+ | `protocol.registry.list` | M32 | Known implementations |
118
+ | `protocol.registry.announce` | M32 | Announce self to registry |
119
+
120
+ ---
121
+
122
+ ## 3. Event types
123
+
124
+ All Phase 3 event types follow the convention `<area>.<entity>.<verb>` and are recorded in the X02 event log.
125
+
126
+ ### 3.1 Distributed inference
127
+
128
+ ```
129
+ distributed_llm.shard.advertised
130
+ distributed_llm.shard.withdrawn
131
+ distributed_llm.pipeline.planned
132
+ distributed_llm.pipeline.started
133
+ distributed_llm.pipeline.shard_failed
134
+ distributed_llm.pipeline.failover
135
+ distributed_llm.pipeline.completed
136
+ distributed_llm.pipeline.cancelled
137
+ ```
138
+
139
+ ### 3.2 MoE
140
+
141
+ ```
142
+ moe.expert.registered
143
+ moe.expert.unregistered
144
+ moe.route.computed
145
+ moe.handoff.initiated
146
+ moe.handoff.accepted
147
+ moe.handoff.declined
148
+ moe.handoff.timed_out
149
+ moe.feedback.recorded
150
+ ```
151
+
152
+ ### 3.3 Federated learning
153
+
154
+ ```
155
+ fedlearn.round.announced
156
+ fedlearn.round.joined
157
+ fedlearn.round.consent.granted
158
+ fedlearn.round.submitted
159
+ fedlearn.round.aggregated
160
+ fedlearn.round.completed
161
+ fedlearn.round.aborted
162
+ fedlearn.round.takeover
163
+ fedlearn.adapter.published
164
+ fedlearn.adapter.applied
165
+ ```
166
+
167
+ ### 3.4 LoRa
168
+
169
+ ```
170
+ lora.beacon.sent
171
+ lora.beacon.received
172
+ lora.panic.sent
173
+ lora.panic.received
174
+ lora.peer.unknown
175
+ lora.peer.verified
176
+ lora.peer.conflict
177
+ lora.duty_cycle.exhausted
178
+ lora.duty_cycle.overridden
179
+ lora.rx.dropped
180
+ ```
181
+
182
+ ### 3.5 Evidence
183
+
184
+ ```
185
+ evidence.claim.asserted
186
+ evidence.claim.attested
187
+ evidence.dispute.opened
188
+ evidence.dispute.retracted
189
+ evidence.ebkh.synced
190
+ evidence.ebkh.sync_partial
191
+ ```
192
+
193
+ ### 3.6 Civil defense
194
+
195
+ ```
196
+ civdef.alert.published
197
+ civdef.alert.forwarded
198
+ civdef.alert.acked
199
+ civdef.alert.cancelled
200
+ civdef.alert.dropped.revoked
201
+ civdef.alert.foreign_role
202
+ civdef.role.registered
203
+ civdef.role.revoked
204
+ civdef.audit.checkpointed
205
+ civdef.audit.broken
206
+ ```
207
+
208
+ ### 3.7 Protocol
209
+
210
+ ```
211
+ protocol.descriptor.announced
212
+ protocol.registry.updated
213
+ protocol.conformance.ran
214
+ ```
215
+
216
+ ---
217
+
218
+ ## 4. Type aliases (added to `hearthnet/types.py`)
219
+
220
+ ```python
221
+ ShardID = NewType("ShardID", str) # "model:layer_range[:tier]"
222
+ ExpertID = NewType("ExpertID", str) # "human:..." | "model:..." | "service:..." | "external:..."
223
+ ExpertKind = Literal["human","model","service","external"]
224
+ ClaimID = NewType("ClaimID", str) # base32 of SHA-256 canonical claim
225
+ SourceID = NewType("SourceID", str)
226
+ EvidenceLevel = Literal["unverified","cited","cross_referenced","attested","disputed"]
227
+ RoundID = NewType("RoundID", str) # ULID
228
+ LoraBeaconID = NewType("LoraBeaconID", str)
229
+ LoraDeviceID = NewType("LoraDeviceID", str)
230
+ AlertID = NewType("AlertID", str) # ULID
231
+ AlertSeverity = Literal["info","advisory","warning","emergency","extreme"]
232
+ AckStatus = Literal["received","acting","need_help","standing_down","mistaken"]
233
+
234
+ @dataclass(frozen=True)
235
+ class ProtocolVersion: major: int; minor: int; patch: int; suffix: str = ""
236
+
237
+ # Reused from earlier phases, listed here for completeness (defined elsewhere):
238
+ # NodeID: M01
239
+ # EventID: X02
240
+ # AuthToken: M16
241
+ # Bbox: M07 spatial extensions
242
+ # Tensor: local LLM tensor type, dtype-tagged
243
+ ```
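
The `ClaimID` comment above ("base32 of SHA-256 canonical claim") can be illustrated with a short sketch. The canonical form is not defined in this reference, so the sorted-key compact JSON used here is an assumption, as are the helper name `claim_id` and the choice to lowercase the result and strip padding.

```python
import base64
import hashlib
import json


def claim_id(claim: dict) -> str:
    """Derive a content-addressed ID for a claim.

    Assumption: the canonical form is compact JSON with sorted keys, UTF-8 encoded.
    """
    canonical = json.dumps(
        claim, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()  # 32 bytes
    # Base32 of 32 bytes is 52 characters plus "=" padding; the padding is dropped here.
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


# Hypothetical claim triple, for illustration only.
example = {"subject": "well:42", "predicate": "status", "object": "potable"}
reordered = {"object": "potable", "subject": "well:42", "predicate": "status"}
```

Because the ID depends only on the canonical bytes, two nodes that assert the same triple arrive at the same `ClaimID` regardless of key order.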
244
+
245
+ ---
246
+
247
+ ## 5. Centralised constants (`hearthnet/constants.py`, Phase 3 additions)
248
+
249
+ ```python
250
+ # --- Distributed inference (M26) ---
251
+ DISTRIBUTED_LLM_MAX_SHARDS_PER_PIPELINE = 16
252
+ DISTRIBUTED_LLM_SHARD_HEARTBEAT_SECONDS = 5
253
+ DISTRIBUTED_LLM_FAILOVER_TIMEOUT_SECONDS = 10
254
+ DISTRIBUTED_LLM_MAX_PIPELINE_LATENCY_TOKENS_PER_S = 2.0 # advisory throughput floor, in tokens/s
255
+ DISTRIBUTED_LLM_DEFAULT_DTYPE = "fp16"
256
+
257
+ # --- MoE routing (M27) ---
258
+ MOE_TOP_K_DEFAULT = 3
259
+ MOE_LEARNED_SCORER_MIN_FEEDBACK_SAMPLES = 200
260
+ MOE_HUMAN_HANDOFF_DEFAULT_TIMEOUT_HOURS = 24
261
+ MOE_HUMAN_HANDOFF_COOLDOWN_HOURS = 2
262
+ MOE_HUMAN_RATE_LIMIT_PER_DAY = 5
263
+
264
+ # --- Federated learning (M28) ---
265
+ FEDLEARN_MAX_LORA_RANK = 64
266
+ FEDLEARN_MAX_LORA_TARGET_MODULES = 8
267
+ FEDLEARN_MAX_TRAIN_STEPS = 1000
268
+ FEDLEARN_MAX_PARTICIPANTS = 32
269
+ FEDLEARN_MIN_PARTICIPANTS = 3
270
+ FEDLEARN_DP_NOISE_SCALE_DEFAULT = 0.0 # off
271
+ FEDLEARN_CLIP_NORM_DEFAULT = 1.0
272
+ FEDLEARN_SUBMISSION_MAX_BYTES = 64 * 1024 * 1024
273
+
274
+ # --- LoRa beacons (M29) ---
275
+ LORA_BEACON_PERIOD_SECONDS_DEFAULT = 600 # 10 min
276
+ LORA_BEACON_MAX_PAYLOAD_BYTES = 32
277
+ LORA_RX_QUEUE_MAX = 256
278
+ LORA_PEER_RX_MAX_PER_MINUTE = 20
279
+ LORA_PANIC_BURST_COUNT = 3
280
+ LORA_PANIC_BURST_GAP_MS = 800
281
+
282
+ # --- Evidence (M30) ---
283
+ EVIDENCE_CLAIM_TTL_DAYS_DEFAULT = 365
284
+ EVIDENCE_DISPUTE_MIN_TRUST = 0.3
285
+ EVIDENCE_MAX_PROVENANCE_DEPTH = 8
286
+
287
+ # --- Civil defense (M31) ---
288
+ CIVDEF_AUDIT_RETENTION_YEARS = 10 # operator must validate against current law
289
+ CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE = 5
290
+ CIVDEF_ALERT_TITLE_MAX_CHARS = 80
291
+ CIVDEF_ALERT_BODY_MAX_CHARS = 1000
292
+
293
+ # --- Tensor transport (X08) ---
294
+ TENSOR_CHUNK_BYTES = 1 * 1024 * 1024 # 1 MiB
295
+ TENSOR_FLOW_CONTROL_WINDOW = 16
296
+ TENSOR_COMPRESSION_THRESHOLD_BYTES = 64 * 1024
297
+ TENSOR_KEEPALIVE_SECONDS = 30
298
+ TENSOR_MAX_SESSION_LIFETIME_SECONDS = 3600
299
+
300
+ # --- Conformance suite (X09) ---
301
+ CONFORMANCE_DEFAULT_SEED = 0xC0FFEE
302
+ CONFORMANCE_DEFAULT_OUTPUT_DIR = "./conformance-report"
303
+ ```
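
To make the tensor-transport constants concrete, here is a minimal arithmetic sketch. It assumes the flow-control window is counted in chunks and that payloads at or above the compression threshold are candidates for compression; the `plan_transfer` helper is hypothetical and not part of X08.

```python
import math

TENSOR_CHUNK_BYTES = 1 * 1024 * 1024            # 1 MiB
TENSOR_FLOW_CONTROL_WINDOW = 16                 # assumed to be counted in chunks
TENSOR_COMPRESSION_THRESHOLD_BYTES = 64 * 1024


def plan_transfer(nbytes: int) -> dict:
    """Estimate how a tensor of `nbytes` would be split and buffered."""
    chunks = math.ceil(nbytes / TENSOR_CHUNK_BYTES) if nbytes > 0 else 0
    return {
        "chunks": chunks,
        # Upper bound on unacknowledged bytes the sender keeps in flight.
        "max_in_flight_bytes": min(chunks, TENSOR_FLOW_CONTROL_WINDOW) * TENSOR_CHUNK_BYTES,
        # Assumption: the threshold is inclusive.
        "compress": nbytes >= TENSOR_COMPRESSION_THRESHOLD_BYTES,
    }


small = plan_transfer(8 * 1024)          # e.g. one fp16 hidden state of 4096 values
large = plan_transfer(40 * 1024 * 1024)  # a 40 MiB activation batch
```

Under these assumptions the sender never holds more than 16 MiB unacknowledged, which is the practical meaning of a 16-chunk window at a 1 MiB chunk size.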
304
+
305
+ ---
306
+
307
+ ## 6. Error codes (Phase 3 additions)
308
+
309
+ | Code | Module | When |
310
+ |------|--------|------|
311
+ | `experimental_disabled` | shared | Capability called with the feature flag off |
312
+ | `shard_unavailable` | M26 | No replica for the required layer range |
313
+ | `shard_unreachable` | M26 | All replicas connectivity-failed |
314
+ | `pipeline_failed` | M26 | Aggregate failure of an in-flight pipeline |
315
+ | `pipeline_cancelled` | M26 | Pipeline cancelled by caller |
316
+ | `tensor_too_large` | X08 | Tensor exceeds rx_buffer_bytes_max |
317
+ | `unknown_frame_type` | X08 | Frame type outside the defined set |
318
+ | `expert_unknown` | M27 | Referenced expert is not registered |
319
+ | `expert_unavailable` | M27 | Expert known but currently outside availability window |
320
+ | `human_handoff_declined` | M27 | Human expert explicitly declined |
321
+ | `human_handoff_timed_out` | M27 | Handoff exceeded timeout without ack |
322
+ | `consent_required` | M28 | join() without explicit operator consent |
323
+ | `base_model_mismatch` | M28 | Local base model SHA differs from manifest |
324
+ | `insufficient_resources` | M28 | Estimated VRAM/disk exceeds budget |
325
+ | `delta_invalid` | M28 | Submitted state-dict fails structural validation |
326
+ | `fedlearn_aggregation_failed` | M28 | Aggregation produced NaN/Inf |
327
+ | `fedlearn_min_participants_unmet` | M28 | Round closed below quorum |
328
+ | `fedlearn_aggregator_unreachable` | M28 | Finalize attempted while coordinator offline |
329
+ | `adapter_not_found` | M28 | Fetch for unknown adapter SHA |
330
+ | `lora_hardware_unavailable` | M29 | No stick present |
331
+ | `lora_hardware_unsupported` | M29 | Adapter init failed |
332
+ | `lora_duty_cycle_exhausted` | M29 | Non-panic send with empty budget |
333
+ | `lora_peer_unknown` | M29 | Verify against unseen sender_hash |
334
+ | `lora_peer_conflict` | M29 | Verify would create conflicting binding |
335
+ | `lora_frame_malformed` | M29 | RX frame structurally invalid |
336
+ | `claim_not_found` | M30 | Reference to unknown ClaimID |
337
+ | `claim_signature_invalid` | M30 | Signature doesn't verify |
338
+ | `evidence_cycle_detected` | M30 | Derivation chain forms a cycle |
339
+ | `evidence_contradiction` | M30 | (advisory) conflicting claims on same triple |
340
+ | `ebkh_unavailable` | M30 | EBKH endpoint unreachable |
341
+ | `civdef_cert_not_owned` | M31 | Cert holder ≠ caller identity |
342
+ | `civdef_cert_invalid` | M31 | Cert expired, revoked, or signature broken |
343
+ | `civdef_cert_unrecognised` | M31 | Issuer chain doesn't reach a trust root |
344
+ | `civdef_cert_out_of_scope` | M31 | Cert lacks the requested role/region |
345
+ | `civdef_alert_not_found` | M31 | Operation on unknown AlertID |
346
+ | `civdef_alert_target_invalid` | M31 | Malformed target or outside scope |
347
+ | `civdef_audit_chain_broken` | M31 | Audit chain hash/signature mismatch |
348
+ | `civdef_role_revoked` | M31 | Op with revoked cert |
349
+ | `civdef_region_unsupported` | M31 | No region adapter loaded |
350
+ | `civdef_ack_rate_limited` | M31 | Ack rate exceeded |
351
+ | `protocol_version_unknown` | M32 | Reference to unknown protocol version |
352
+ | `protocol_suite_not_installed` | M32 | Conformance report requested without X09 |
353
+ | `protocol_descriptor_invalid` | M32 | Malformed descriptor announcement |
354
+ | `protocol_unsupported_capability` | M32 | Federation negotiates no compatible major |
355
+
356
+ ---
357
+
358
+ ## 7. File map (top-level)
359
+
360
+ ```
361
+ hearthnet/
362
+ ├── distributed_inference/  # M26
363
+ │   ├── shard.py
364
+ │   ├── orchestrator.py
365
+ │   ├── router.py
366
+ │   ├── plan.py
367
+ │   └── failover.py
368
+ ├── moe/  # M27
369
+ │   ├── router.py
370
+ │   ├── scorer.py
371
+ │   ├── expert_registry.py
372
+ │   ├── human_in_the_loop.py
373
+ │   └── feedback.py
374
+ ├── fedlearn/  # M28
375
+ │   ├── coordinator.py
376
+ │   ├── participant.py
377
+ │   ├── trainer.py
378
+ │   ├── aggregator.py
379
+ │   ├── delta.py
380
+ │   ├── privacy.py
381
+ │   └── manifest.py
382
+ ├── lora/  # M29
383
+ │   ├── service.py
384
+ │   ├── serial_bridge.py
385
+ │   ├── frame.py
386
+ │   ├── duty_cycle.py
387
+ │   ├── peer_map.py
388
+ │   └── adapters/{meshtastic,rfm95w,sx126x}.py
389
+ ├── evidence/  # M30
390
+ │   ├── service.py
391
+ │   ├── claim.py
392
+ │   ├── store.py
393
+ │   ├── query.py
394
+ │   ├── extractor.py
395
+ │   ├── ebkh_adapter.py
396
+ │   └── trust.py
397
+ ├── civdef/  # M31
398
+ │   ├── service.py
399
+ │   ├── alert.py
400
+ │   ├── role.py
401
+ │   ├── audit.py
402
+ │   ├── target.py
403
+ │   ├── ack.py
404
+ │   └── regions/nrw.py
405
+ ├── protocol/  # M32 runtime
406
+ │   ├── service.py
407
+ │   ├── registry.py
408
+ │   └── report.py
409
+ └── transport/
410
+     └── tensor/  # X08
411
+         ├── session.py
412
+         ├── frame.py
413
+         ├── flow.py
414
+         └── compress.py
415
+
416
+ protocol/  # M32 spec artefacts (repo root)
417
+ ├── VERSION
418
+ ├── README.md
419
+ ├── CHANGELOG.md
420
+ ├── governance.md
421
+ ├── versioning.md
422
+ ├── reference-implementations.md
423
+ ├── core/...
424
+ └── experimental/...
425
+
426
+ conformance/  # X09 suite (repo root)
427
+ ├── VERSION
428
+ ├── runner.py
429
+ ├── report.py
430
+ ├── harness/...
431
+ ├── suites/...
432
+ └── vectors/...
433
+ ```
434
+
435
+ ---
436
+
437
+ ## 8. Configuration index
438
+
439
+ Each module defines its own `*Config` dataclass; all are surfaced through the global `HearthnetConfig` and read from `~/.config/hearthnet/config.toml`. Phase 3 additions:
440
+
441
+ ```python
442
+ @dataclass(frozen=True)
443
+ class HearthnetConfig:
444
+     # ... (Phase 1, Phase 2 fields) ...
445
+     distributed_llm: DistributedLlmConfig
446
+     moe: MoeConfig
447
+     fedlearn: FedLearnConfig
448
+     lora: LoraConfig
449
+     evidence: EvidenceConfig
450
+     civdef: CivDefConfig
451
+     tensor_transport: TensorTransportConfig
452
+     protocol: ProtocolConfig
453
+ ```
454
+
455
+ Every Phase 3 config has `enabled: bool = False` except `protocol` (default `True`). The bus dispatcher refuses to register Phase 3 capabilities when their module's enabled flag is False.
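
The gating behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the real bus dispatcher: `ModuleConfig`, `CapabilityError`, and `Bus` are hypothetical stand-ins, and raising `experimental_disabled` at call time follows the error-code table in §6.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModuleConfig:
    enabled: bool = False  # Phase 3 default; `protocol` would set this to True


class CapabilityError(Exception):
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


class Bus:
    """Toy dispatcher that refuses capabilities from disabled modules."""

    def __init__(self, module_configs: dict[str, ModuleConfig]):
        self._configs = module_configs
        self._handlers: dict[str, Callable[..., object]] = {}
        self._refused: set[str] = set()

    def register(self, module: str, capability: str, handler: Callable[..., object]) -> bool:
        if not self._configs.get(module, ModuleConfig()).enabled:
            self._refused.add(capability)  # remembered so calls get a precise error
            return False
        self._handlers[capability] = handler
        return True

    def call(self, capability: str, **params):
        if capability in self._refused:
            raise CapabilityError("experimental_disabled")
        return self._handlers[capability](**params)  # KeyError if never registered


bus = Bus({"moe": ModuleConfig(), "protocol": ModuleConfig(enabled=True)})
moe_ok = bus.register("moe", "experimental.moe.expert.list", lambda: [])
proto_ok = bus.register("protocol", "protocol.version.list", lambda: ["3.0"])
```

Keeping the refusal at registration time means a disabled module never appears in capability advertisements, while callers still receive a specific error code rather than a generic lookup failure.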
456
+
457
+ ---
458
+
459
+ ## 9. Build order (recap from `00-OVERVIEW.md`)
460
+
461
+ Phase 3 has eight tracks, A-H. Most are independent and can be parallelised; Track D builds on Track C, and Track H depends on M30 (Track B):
462
+
463
+ ```
464
+ Track A: X09 conformance suite scaffolding → M32 protocol service
465
+ Track B: M30 evidence + EBKH adapter
466
+ Track C: M27 MoE machine experts (router, registry, scorer)
467
+ Track D: M27 human-in-the-loop coordinator (depends on Track C base)
468
+ Track E: M28 federated LoRA aggregation
469
+ Track F: X08 tensor transport → M26 distributed inference
470
+ Track G: M29 LoRa beacons (hardware-gated)
471
+ Track H: M31 civil defense (depends on M30 evidence)
472
+ ```
473
+
474
+ Tracks A and F unlock the most downstream work (M32 needs X09; M26 needs X08). Tracks G and H are most easily deferred if Phase 3 needs to ship a minimal cut.
475
+
476
+ ---
477
+
478
+ ## 10. Open-question summary
479
+
480
+ Each module spec has its own §10 with detailed open questions. The recurring themes across Phase 3:
481
+
482
+ 1. **Real-world identity binding** (M28 sybil defence, M31 institutional keys, M30 EBKH trust roots, M27 human verification): the cryptographic story is solid; the social/institutional story is the work.
482
+ 2. **Adversarial robustness** (M26 byzantine shards, M28 poisoning, M30 disputed claims, M31 forged certs): all have stub defences and known harder problems.
483
+ 3. **Second implementation** (M32): until a non-reference impl exists, conformance is performative. This is the single most important next step.
484
+ 4. **Cross-Land / cross-border generalisation** (M31 regional adapter, M30 EBKH OSINT scope, M29 regulatory regions): designed for NRW first; structures admit other regions but they're unbuilt.
485
+ 5. **Resource tiers** (M26 phone-class participants, M28 hardware fairness): heterogeneous hardware aggregation is largely unsolved.
486
+ 6. **Privacy / DP calibration** (M28 noise scale, M30 sensitive claims, M29 sender hash): defaults are conservative; tuning is operator-by-operator.
488
+
489
+ Each module also lists module-specific items. Read them.
490
+
491
+ ---
492
+
493
+ *Last updated: spec set v3.0. Phase 3 specs were authored with the intent that any of M26–M31 could be cut from a shipping release without affecting Phase 1 + Phase 2 functionality. M32 + X09 are the long-term durability investment and should ship even when other Phase 3 modules don't.*
docs/p2_p3/M14-federation.md ADDED
@@ -0,0 +1,434 @@
1
+ # M14: Federation
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M01 (identity), M16 (tokens), X02 (events), X05 (DHT, for cross-LAN bootstrap), X01 (transport), M03 (bus), M15 (relay, for NAT traversal)
5
+ **Depended on by:** M22 (mobile uses federation for cross-community discovery), all services indirectly (federated calls route through M14)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Let two **communities** establish a trust link and grant each other scoped access to specific capabilities. Federation is the bridge from "your house" to "the neighbourhood".
12
+
13
+ - Establish a federation between two communities (mutual cross-signature)
14
+ - Enforce scope on cross-community capability calls
15
+ - Route federated calls through a designated **Bridge** node profile
16
+ - Heartbeat between federated communities so liveness is visible
17
+ - Issue federation tokens (via M16) for individual calls
18
+
19
+ Out of scope:
20
+ - Many-community mesh federation: Phase 3 (this module handles bilateral only)
20
+ - Anonymous federation: never
22
+
23
+ ---
24
+
25
+ ## 2. File layout
26
+
27
+ ```
28
+ hearthnet/federation/
29
+ ├── __init__.py
30
+ ├── manifest.py  # FederationManifest builder + verifier
31
+ ├── peering.py  # the cross-sig handshake
32
+ ├── relay_client.py  # outbound calls into peer community
33
+ └── service.py  # FederationService: registers federation.* capabilities
34
+ ```
35
+
36
+ ---
37
+
38
+ ## 3. Federation manifest
39
+
40
+ Format defined in [CAP2 §6.1](../CAPABILITY_CONTRACT_v2.md). Key properties:
41
+
42
+ - Lives in **both** communities' event logs
43
+ - Signed by anchors of both, with per-community `min_signatures_to_federate` co-signatures
44
+ - Carries the **scope** each side grants the other
45
+ - Has an expiry (`expires_at`), typically 1 year, with auto-renewal via heartbeats
46
+
47
+ A node may belong to **one** community (Phase 1) and that community may be federated with **N** other communities. Federation is a community-level relationship, not per-node.
48
+
49
+ ---
50
+
51
+ ## 4. Public API
52
+
53
+ ### 4.1 `manifest.py`
54
+
55
+ ```python
56
+ # hearthnet/federation/manifest.py
57
+ from dataclasses import dataclass
58
+
59
+ @dataclass(frozen=True)
60
+ class FederationScope:
61
+     """What one community grants the other."""
62
+     capabilities: list[str]  # ["rag.query@1.0"]
63
+     params_constraints: dict[str, list[str]]  # {"corpus":["public-emergency"]}
64
+     rate_limit_per_minute: int  # cross-community budget
65
+     data_visibility: str  # "public_corpora_only"|"members_only"|"open"
66
+
67
+ @dataclass(frozen=True)
68
+ class FederationManifest:
69
+     schema_version: int
70
+     federation_id: str
71
+     community_a: str
72
+     community_b: str
73
+     established_at: str
74
+     expires_at: str
75
+     scope_a_to_b: FederationScope  # what A grants B
76
+     scope_b_to_a: FederationScope  # what B grants A
77
+     bootstrap_endpoints_a: list[Endpoint]
78
+     bootstrap_endpoints_b: list[Endpoint]
79
+     signature_a: dict  # {signed_by, signature, co_signers}
80
+     signature_b: dict
81
+
82
+     def grants_to(self, calling_community_id: str) -> FederationScope | None:
83
+         """Returns the scope grant *to* the calling community (if federated, else None)."""
84
+
85
+     def is_expired(self, now: datetime | None = None) -> bool: ...
86
+
87
+ def build_federation_proposal(
88
+     our_community_manifest: CommunityManifest,
89
+     peer_community_manifest_url: str,
90
+     proposed_scope_to_grant: FederationScope,
91
+     proposed_scope_to_receive: FederationScope,
92
+     bootstrap_endpoints: list[Endpoint],
93
+ ) -> 'FederationProposal':
94
+     """Step 1: prepare a proposal. Not yet a manifest, just a draft for the other side."""
95
+
96
+ @dataclass(frozen=True)
97
+ class FederationProposal:
98
+     community_a: str
99
+     community_b: str
100
+     scope_a_to_b: FederationScope
101
+     scope_b_to_a: FederationScope
102
+     bootstrap_endpoints_a: list[Endpoint]
103
+     bootstrap_endpoints_b: list[Endpoint]
104
+     proposer_signature: str
105
+
106
+ def co_sign_federation(
107
+     proposal: FederationProposal,
108
+     signing_kp: KeyPair,
109
+     role: str,  # "a" or "b"
110
+ ) -> dict:
111
+     """Returns {signed_by, signature, co_signers[]} payload."""
112
+
113
+ def finalize_federation_manifest(
114
+     proposal: FederationProposal,
115
+     sig_a: dict,
116
+     sig_b: dict,
117
+ ) -> FederationManifest:
118
+     """Assemble fully-signed manifest after both sides have signed."""
119
+
120
+ def parse_federation_manifest(blob: bytes | dict) -> FederationManifest: ...
121
+ def verify_federation_manifest(
122
+     m: FederationManifest,
123
+     community_a_manifest: CommunityManifest,
124
+     community_b_manifest: CommunityManifest,
125
+ ) -> None:
126
+     """Verify both sides signed, anchors are valid in their communities,
127
+     co-signer counts meet policy, expiry is in the future."""
128
+ ```
129
+
130
+ ### 4.2 `peering.py`
131
+
132
+ ```python
133
+ # hearthnet/federation/peering.py
+ class FederationHandshake:
+     """Conducts the multi-step cross-signing handshake.
+     Stateful; one instance per active proposal."""
+
+     def __init__(
+         self,
+         our_community_manifest: CommunityManifest,
+         our_kp: KeyPair,
+         transport_client: HttpClient,
+         event_log: EventLog,
+     ):
+         ...
+
+     async def initiate(
+         self,
+         peer_endpoints: list[Endpoint],
+         scope_to_grant: FederationScope,
+         scope_to_receive: FederationScope,
+     ) -> FederationProposal:
+         """1. Fetch peer's community manifest.
+         2. Build proposal.
+         3. Sign as community A's anchor.
+         4. POST to peer.
+         5. Receive peer's signed proposal back.
+         6. Verify both signatures and gather more local co-signers if policy requires.
+         Returns the fully-signed proposal ready to finalize."""
+
+     async def accept(self, proposal: FederationProposal) -> FederationManifest:
+         """The other side accepting an incoming proposal.
+         Returns the finalized manifest (publishable to event log)."""
+
+     async def publish(self, manifest: FederationManifest) -> None:
+         """Append federation.peer.added event to local log.
+         Push the manifest to peer so they can do the same."""
168
+ ```
169
+
170
+ ### 4.3 `relay_client.py`
171
+
172
+ ```python
173
+ # hearthnet/federation/relay_client.py
+ class FederationCaller:
+     """Outbound side: makes calls into federated communities.
+     Used by services when their request triggers a federated lookup
+     (e.g. rag.query across federated corpora)."""
+
+     def __init__(
+         self,
+         bus: CapabilityBus,
+         our_kp: KeyPair,
+         our_community_id: str,
+         federation_manifests_provider: Callable[[], list[FederationManifest]],
+     ):
+         ...
+
+     async def call_in_peer(
+         self,
+         peer_community_id: str,
+         capability: str,
+         version: Version,
+         body: dict,
+         *,
+         timeout_seconds: float | None = None,
+     ) -> dict:
+         """1. Look up federation manifest for peer_community_id.
+         2. Verify scope includes (capability, params).
+         3. Issue an auth.token via local M16 with capability scope.
+         4. Pick a peer Bridge endpoint (from manifest.bootstrap_endpoints_b).
+         5. POST /bus/v1/call to peer's federation.proxy@1.0 with token + body.
+         Returns the result. Raises FederationError on scope/auth issues."""
+
+     async def stream_in_peer(...) -> AsyncIterator[Frame]:
+         """Streaming variant."""
206
+ ```
207
+
208
+ ### 4.4 `service.py`
209
+
210
+ ```python
211
+ # hearthnet/federation/service.py
+ class FederationService:
+     name = "federation"
+     version = "1.0"
+
+     def __init__(
+         self,
+         bus: CapabilityBus,
+         event_log: EventLog,
+         replay_engine: ReplayEngine,
+         author_kp: KeyPair,
+         community_manifest_provider: Callable[[], CommunityManifest],
+         revocation_cache: RevocationCache,
+     ):
+         ...
+
+     def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
+         """Registers: federation.peer.add, federation.peer.remove,
+         federation.peer.list, federation.proxy (all @1.0)."""
+
+     async def start(self) -> None: ...
+     async def stop(self) -> None: ...
+     def health(self) -> dict: ...
+
+     # --- handlers ---
+
+     async def handle_peer_add(self, req: RouteRequest) -> dict:
+         """CAP2 §4.1. Run handshake; check co-signatures; emit event."""
+
+     async def handle_peer_remove(self, req: RouteRequest) -> dict:
+         """CAP2 §4.2."""
+
+     async def handle_peer_list(self, req: RouteRequest) -> dict:
+         """CAP2 §4.3."""
+
+     async def handle_proxy(self, req: RouteRequest) -> dict | AsyncIterator[dict]:
+         """CAP2 §4.4. Forward a federated call to the local bus.
+         1. Verify token attached to request.
+         2. Look up federation manifest, check scope.
+         3. Call bus.call(target_capability, version, body) internally.
+         4. Return result (or stream frames)."""
+
+     # --- maintenance ---
+
+     async def heartbeat_loop(self) -> None:
+         """Per FEDERATION_HEARTBEAT_SECONDS, ping each federated peer's
+         federation.peer.list. Update last_heartbeat. Emit
+         federation.heartbeat event."""
259
+ ```
260
+
261
+ ---
262
+
263
+ ## 5. Behaviour
264
+
265
+ ### 5.1 Two-phase handshake
266
+
267
+ ```
268
+ Community A's anchor decides to federate with B.
+         ↓
+ A: peering.initiate(B_endpoints, scope_a_to_b, scope_b_to_a)
+    → fetch B's community manifest
+    → build proposal; sign as A
+    → POST /federation/proposal to B's bridge
+         ↓
+ B receives proposal, presents to a trusted member.
+    human decision (or auto-policy)
+         ↓
+ B: peering.accept(proposal)
+    → sign as B
+    → return signed proposal
+         ↓
+ A: gather more local co-signers if our policy.min_signatures_to_federate > 1
+         ↓
+ A: finalize_federation_manifest(proposal, sig_a, sig_b)
+         ↓
+ A: publish federation.peer.added event locally
+         ↓
+ A: POST manifest to B so they publish too
+         ↓
+ both communities heartbeat each other periodically
291
+ ```
292
+
293
+ ### 5.2 Bridge node profile
294
+
295
+ A community designates one or more nodes with `profile: "bridge"`. Bridge nodes:
296
+
297
+ - Always-on (best-effort)
298
+ - Have a publicly-reachable endpoint or a relay-tier (M15) registration
299
+ - Run `FederationService` and act as the proxy for inbound federated calls
300
+ - Hold the bandwidth budget for cross-community traffic
301
+
302
+ Non-bridge nodes can still **call into** federated communities (via M14 `FederationCaller`); they just don't *serve* cross-community calls.
303
+
304
+ ### 5.3 Scope enforcement (inbound)
305
+
306
+ When `federation.proxy` is invoked:
307
+
308
+ 1. Caller signature verified (Phase 1 §1.3)
+ 2. Caller's community is parsed from token's `iss`
+ 3. Federation manifest lookup; absent → `not_federated`
+ 4. Scope check: `(capability, version)` ∈ scope and params allowed → else `federation_forbidden`
+ 5. Token's signature verified against issuer's community anchors
+ 6. Token's `aud` must match our community
+ 7. Token's `scope` ⊆ federation manifest's scope (caller's community can't grant themselves more than they were granted)
+ 8. Dispatch internally via bus
+ 9. Record metrics: `hearthnet_federation_calls_total{peer_community, capability, result}`
317
+
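Steps 4 and 7 above come down to two set checks. The following is a minimal sketch under stated assumptions: `Scope` is a hypothetical stand-in for `FederationScope` / `TokenScope`, its `capabilities` field holds `name@version` strings, and a parameter that is absent from the request is treated as allowed. None of these details are confirmed by the spec.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scope:
    """Hypothetical stand-in for FederationScope / TokenScope."""
    capabilities: frozenset[str]  # assumed form: {"rag.query@1.0"}
    params_constraints: dict[str, frozenset[str]] = field(default_factory=dict)


def allows(scope: Scope, capability: str, params: dict[str, str]) -> bool:
    """Step 4: the capability is listed and every constrained param is allow-listed."""
    if capability not in scope.capabilities:
        return False
    for key, allowed in scope.params_constraints.items():
        if key in params and params[key] not in allowed:
            return False
    return True


def is_subset(token: Scope, granted: Scope) -> bool:
    """Step 7: the token may not claim more than the manifest grants."""
    if not token.capabilities <= granted.capabilities:
        return False
    for key, allowed in granted.params_constraints.items():
        # A token that leaves a constrained param open would widen the grant.
        if key not in token.params_constraints:
            return False
        if not token.params_constraints[key] <= allowed:
            return False
    return True
```

A proxy handler would call `is_subset(token_scope, manifest_scope)` once per token and `allows(...)` once per request; whether a real implementation caches the subset result is left open here.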
318
+ ### 5.4 Heartbeats and expiry
319
+
320
+ - Every `FEDERATION_HEARTBEAT_SECONDS` (300), each bridge calls `federation.peer.list` on each federated peer
321
+ - If a heartbeat fails 3 times in a row, the peer is marked `degraded` in the local view
322
+ - Federation manifests have `expires_at`. 30 days before expiry, a renewal handshake is auto-initiated. If renewal fails by expiry, the federation lapses; calls return `not_federated`.
323
+
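The heartbeat and expiry rules above can be sketched as a small state tracker. This is an illustration only: `PeerHealth`, `federation_state`, and the state names `"healthy"`, `"renewal_due"`, and `"lapsed"` are hypothetical, and resetting to healthy after a single successful heartbeat is an assumption the spec does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

DEGRADED_AFTER_FAILURES = 3        # "fails 3 times in a row"
RENEW_WINDOW = timedelta(days=30)  # "30 days before expiry"


@dataclass
class PeerHealth:
    """Hypothetical per-peer record kept by a bridge node."""
    consecutive_failures: int = 0
    status: str = "healthy"

    def record(self, ok: bool) -> str:
        """Update the local view after one heartbeat attempt."""
        if ok:
            # Assumption: one success clears the degraded flag.
            self.consecutive_failures = 0
            self.status = "healthy"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= DEGRADED_AFTER_FAILURES:
                self.status = "degraded"
        return self.status


def federation_state(expires_at: datetime, now: datetime) -> str:
    """Classify a manifest as active, due for renewal, or lapsed."""
    if now >= expires_at:
        return "lapsed"
    if now >= expires_at - RENEW_WINDOW:
        return "renewal_due"
    return "active"
```

Keeping the clock as an explicit `now` argument makes the 30-day boundary easy to test without waiting for real time to pass.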
324
+ ### 5.5 Revocation
325
+
326
+ To break a federation:
327
+
328
+ - Either side may call `federation.peer.remove`
329
+ - Co-signature requirements: same as creation (`policy.min_signatures_to_federate`)
330
+ - Event `federation.peer.removed` is published locally; peer is notified and publishes their own
331
+ - All outstanding tokens issued under this federation are implicitly revoked (M16 verifies federation still exists)
332
+
333
+ ### 5.6 Identity import (Phase 2.5 hook)
334
+
335
+ A federated user with NodeID in community A wishing to access community B's services *as themselves* (not via their community A anchor's token) can use `federation.identity.attest` (reserved capability, Phase 2.5). Out of scope for first cut.
336
+
337
+ ### 5.7 Trust transitivity
338
+
339
+ **Not transitive.** A↔B and B↔C do not imply A↔C. Each pair establishes its own manifest. This is intentional: every federation requires explicit consent from both sides.
340
+
341
+ ### 5.8 Conflict: federation with revoked member's community
342
+
343
+ If community A has federated with B, and later A's anchor (the signer) is revoked from A:
344
+
345
+ - The federation manifest's signature *was* valid at sign time
346
+ - Going forward, A's community may renew with a new anchor signature
347
+ - B verifies federation against A's current anchor set on every call; if no current anchor co-signs, the federation is invalid
348
+
349
+ ---
350
+
351
+ ## 6. Discovery integration
352
+
353
+ A community wishing to find a federated peer they haven't talked to in a while:
354
+
355
+ 1. Look up `bootstrap_endpoints` in their stored federation manifest
356
+ 2. Try each; if all fail, fall back to [X05 DHT](../cross-cutting/X05-dht.md): `find_value(blake3(peer_community_id))`
357
+ 3. If DHT also returns nothing, try [M15 relay](M15-relay-tier.md): `relay.lookup_community(peer_community_id)`
358
+ 4. Only after all three fail → mark federation as `unreachable`; UI shows offline indicator
359
+
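The fallback order above (stored endpoints, then DHT, then relay) can be sketched as an ordered list of lookup stages. This is a sketch under assumptions: `locate_peer` and the stage callables are hypothetical, and each stage is assumed to return only endpoints it has confirmed reachable. The real `find_value` and `relay.lookup_community` calls would be wrapped by the caller.

```python
import asyncio
from typing import Awaitable, Callable

Endpoint = str  # simplified stand-in for the spec's Endpoint type
Lookup = Callable[[], Awaitable[list[Endpoint]]]


async def locate_peer(stages: list[tuple[str, Lookup]]) -> tuple[str, list[Endpoint]]:
    """Try each named stage in order; return the first non-empty result."""
    for name, lookup in stages:
        try:
            endpoints = await lookup()
        except Exception:
            # A failing stage is treated like an empty one: fall through.
            continue
        if endpoints:
            return name, endpoints
    # All stages exhausted: caller marks the federation `unreachable`.
    return "unreachable", []
```

Returning the stage name alongside the endpoints lets the caller log which path succeeded, which may help operators notice when stored bootstrap endpoints have gone stale.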
360
+ ---
361
+
362
+ ## 7. Errors
363
+
364
+ `FederationError`:
365
+
366
+ - `not_federated`: no manifest for this peer
+ - `federation_expired`: manifest past expires_at
+ - `scope_violation`: request outside granted scope
+ - `bridge_unreachable`: couldn't reach any of peer's bridges
+ - `co_signer_insufficient`: proposal lacks required signature count
+ - `peer_community_invalid`: peer's manifest failed verification
+
+ Wire mapping per [CAP2 §9](../CAPABILITY_CONTRACT_v2.md).
374
+
375
+ ---
376
+
377
+ ## 8. Configuration
378
+
379
+ ```python
380
+ config.federation.enabled = False # opt-in
381
+ config.federation.bridge_node = False # we serve cross-community calls
382
+ config.federation.relay_url = None # M15 hosted relay for NAT
383
+ config.federation.auto_renew_days_before = 30
384
+ config.federation.max_peer_communities = 16
385
+ config.federation.heartbeat_seconds = FEDERATION_HEARTBEAT_SECONDS
386
+ config.federation.scope_default_rate_limit_per_minute = 60
387
+ ```
388
+
389
+ Constants: `FEDERATION_MANIFEST_TTL_SECONDS`, `FEDERATION_HEARTBEAT_SECONDS`.
390
+
391
+ ---
392
+
393
+ ## 9. Tests
394
+
395
+ ### Unit
396
+ - `test_federation_proposal_builds_correctly`
397
+ - `test_co_sign_signature_verifies`
398
+ - `test_finalize_requires_min_signers`
399
+ - `test_grants_to_returns_scope_for_correct_direction`
400
+ - `test_expired_federation_rejects_calls`
401
+
402
+ ### Integration
403
+ - `test_two_community_federation_round_trip`: A and B in different processes federate, then A queries B's RAG via proxy
404
+ - `test_scope_violation_returns_403`
405
+ - `test_heartbeat_marks_degraded_after_3_failures`
406
+ - `test_revocation_breaks_existing_tokens`
407
+ - `test_renewal_30_days_before_expiry`
408
+
409
+ ### Chaos
410
+ - `test_partition_during_federation_handshake_resumable`
411
+
412
+ ---
413
+
414
+ ## 10. Cross-references
415
+
416
+ | What | Where |
+ |------|-------|
+ | Federation manifest schema | [CAP2 §6.1](../CAPABILITY_CONTRACT_v2.md) |
+ | `federation.*` capabilities | [CAP2 §4.1–4.4](../CAPABILITY_CONTRACT_v2.md) |
+ | Token issuance for cross-community | [M16 §5.1](M16-tokens.md) |
+ | DHT bootstrap | [X05 §4.3](../cross-cutting/X05-dht.md) |
+ | Relay tier NAT traversal | [M15](M15-relay-tier.md) |
+ | Bridge node profile | [Phase 1 PRD §5.4 + this module §5.2](../../HEARTHNET_PRD_v2.md) |
+ | Phase 3 transitive federation | TBD |
425
+
426
+ ---
427
+
428
+ ## 11. Open questions
429
+
430
+ 1. **Multi-party federation (mesh of N>2 communities)**: currently bilateral only. Phase 3 candidate.
+ 2. **Federated marketplace**: should `market.list` cross federations? Reserved scope param; default off.
+ 3. **Federated identity**: single-sign-on across federated communities. Phase 2.5; design depends on token-on-token.
+ 4. **Federation revocation event propagation**: if A↔B and A↔C, and B unilaterally revokes A, should C see this? MVP: no, each pair is independent.
+ 5. **Audit log for federation activity**: should there be a separate "federation_audit" log so cross-community activity is easy to surface to operators? Yes, Phase 2.5.
docs/p2_p3/M15-relay-tier.md ADDED
@@ -0,0 +1,389 @@
1
+ # M15 - Relay Tier
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M01 (identity), M16 (tokens, for relay registration auth), X01 (transport), X05 (DHT), X04 (config)
5
+ **Depended on by:** M14 (federation, when bridges are NAT'd), M22 (mobile push delivery), X05 (DHT bootstrap)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ A **relay node** is a public-internet-reachable service that helps HearthNet nodes that cannot directly reach each other (NAT, mobile networks, dynamic IPs). It provides:
12
+
13
+ - **NAT traversal**: registered nodes receive forwarded traffic from peers
14
+ - **Federation discovery**: a public lookup of "which IP currently runs community X's bridge"
15
+ - **Mobile push**: delivers chat/marketplace notifications to mobile devices via APNs/FCM
16
+ - **DHT bootstrap**: serves as a stable initial DHT endpoint
17
+
18
+ A relay is **infrastructure**: typically one or a few well-known servers per region. Christof's planned `relay.hearthnet.de` (Hetzner) is the reference deployment.
19
+
20
+ Relays do **not**:
21
+ - Store any community state long-term
22
+ - See cleartext of E2E-encrypted chat
23
+ - Make any trust decisions; they are credential-free transport for already-authenticated traffic
24
+ - Replace community anchors
25
+
26
+ ---
27
+
28
+ ## 2. File layout
29
+
30
+ ```
31
+ hearthnet/relay/              # client-side helpers, ships with normal Hearthnet
+ ├── __init__.py
+ ├── client.py                 # RelayClient: registration, forwarding fetch
+ ├── push_subscriber.py        # iOS APNs / Android FCM token registration
+
+ relay-server/                 # separate deployable, lives in /relay-server in the repo
+ ├── pyproject.toml
+ ├── README.md
+ ├── relay_server/
+ │   ├── __init__.py
+ │   ├── app.py                # FastAPI app
+ │   ├── registration.py       # /relay/v1/register, /heartbeat
+ │   ├── forward.py            # /relay/v1/forward
+ │   ├── lookup.py             # /relay/v1/community/<id>
+ │   ├── push.py               # outbound APNs/FCM
+ │   ├── billing.py            # optional, for paid tier
+ │   ├── storage.py            # SQLite for registrations + push device tokens
+ │   └── observability.py      # logging, metrics
+ ├── deploy/
+ │   ├── docker-compose.yml
+ │   ├── caddy.Caddyfile       # TLS termination
+ │   └── systemd/relay.service
+ └── tests/
54
+ ```
55
+
56
+ The relay server is a thin FastAPI app deployable to any VPS. ~1500 LOC.
57
+
58
+ ---
59
+
60
+ ## 3. Wire surface (relay server endpoints)
61
+
62
+ | Endpoint | Method | Purpose |
63
+ |----------|--------|---------|
64
+ | `/relay/v1/register` | POST | A node registers itself for forwarding |
65
+ | `/relay/v1/heartbeat` | POST | Keep registration alive |
66
+ | `/relay/v1/deregister` | POST | Cleanly remove a registration |
67
+ | `/relay/v1/forward/<node_short>` | POST | A peer wanting to reach a registered node sends the encapsulated request here |
68
+ | `/relay/v1/community/<community_id>` | GET | Look up current bridge endpoints for a community |
69
+ | `/relay/v1/push/register` | POST | Register an APNs/FCM device token for push delivery |
70
+ | `/relay/v1/push/<device_id>` | POST | Send a push notification (typically from another node) |
71
+ | `/relay/v1/dht/bootstrap` | GET | Return a list of known-good DHT contacts |
72
+ | `/relay/v1/health` | GET | Operator endpoint |
73
+
74
+ Auth: every endpoint except `/health` requires a HearthNet signature OR a relay-issued token (registered nodes get a session token for performance).
75
+
76
+ ---
77
+
78
+ ## 4. Public API (client-side)
79
+
80
+ ### 4.1 `client.py`
81
+
82
+ ```python
83
+ # hearthnet/relay/client.py
+ @dataclass(frozen=True)
+ class RelayRegistration:
+     relay_url: str
+     node_id_full: str
+     expires_at: int     # unix seconds
+     session_token: str  # short-lived bearer for subsequent calls
+
+ class RelayClient:
+     """Used by federation bridges and mobile clients to be reachable through a relay."""
+
+     def __init__(
+         self,
+         relay_url: str,
+         kp: KeyPair,
+         community_id: str,
+     ):
+         ...
+
+     async def register(
+         self,
+         *,
+         capabilities_offered: list[str] | None = None,
+         external_endpoint_hint: Endpoint | None = None,
+     ) -> RelayRegistration:
+         """Register us with this relay. Relay will forward inbound /relay/v1/forward/<our_short>
+         calls to our actual endpoint via reverse-WebSocket (we hold a persistent WS to the relay).
+         Returns registration; client should hold an open WS until deregister."""
+
+     async def heartbeat(self) -> None:
+         """Refresh registration. Should be called every RELAY_REGISTRATION_TTL_SECONDS / 2."""
+
+     async def deregister(self) -> None: ...
+
+     async def maintain(self) -> None:
+         """Long-running task: keeps registration alive, reconnects on failure."""
+
+     # --- lookups (no registration required) ---
+
+     async def lookup_community(self, community_id: str) -> list[Endpoint]:
+         """Find current bridge endpoints for a community."""
+
+     async def dht_bootstrap_endpoints(self) -> list[Endpoint]: ...
+
+     async def send_push(
+         self,
+         device_id: str,
+         payload: dict,
+         *,
+         push_token: str,  # token from M16 with relay.push scope
+     ) -> None:
+         """Send a push notification via this relay."""
135
+ ```
136
+
137
+ ### 4.2 `push_subscriber.py`
138
+
139
+ ```python
140
+ # hearthnet/relay/push_subscriber.py
+ class PushSubscriber:
+     """Mobile-side: registers an APNs / FCM device token with the relay
+     so the relay can deliver push notifications for chat / marketplace events."""
+
+     def __init__(
+         self,
+         relay_url: str,
+         kp: KeyPair,
+         community_id: str,
+         platform: str,  # "ios" | "android" | "web"
+     ):
+         ...
+
+     async def register(self, device_token: str) -> str:
+         """Returns our PushDeviceID. Stored locally for later push send authorization."""
+
+     async def unregister(self) -> None: ...
158
+ ```
159
+
160
+ ### 4.3 Reverse-WebSocket pattern for forwarding
161
+
162
+ NAT'd nodes can't accept inbound connections. So:
163
+
164
+ 1. Node POSTs `/relay/v1/register` and receives its registration (including the session token)
+ 2. Node opens a WebSocket to the relay; the upgrade handshake is an HTTP GET, which the relay answers with 101 Switching Protocols
+ 3. Node holds the WS open; relay sends forwarded calls down it
+ 4. Node processes and responds back through the same WS
168
+
169
+ The relay server's `forward.py` proxies between the inbound HTTP caller and the registered node's WS. The inbound caller sees a normal HTTP/SSE response.
170
+
171
+ ---
172
+
173
+ ## 5. Server-side internals (sketch)
174
+
175
+ ### 5.1 Registration table
176
+
177
+ ```sql
178
+ CREATE TABLE registrations (
179
+ node_id_full TEXT PRIMARY KEY,
180
+ community_id TEXT NOT NULL,
181
+ external_ip TEXT,
182
+ ws_session_id TEXT, -- in-memory WS connection id
183
+ capabilities_offered TEXT, -- JSON
184
+ registered_at INTEGER,
185
+ expires_at INTEGER,
186
+ last_heartbeat INTEGER
187
+ );
188
+ CREATE INDEX idx_reg_community ON registrations(community_id);
189
+ ```
190
+
191
+ ### 5.2 Push table
192
+
193
+ ```sql
194
+ CREATE TABLE push_devices (
195
+ device_id TEXT PRIMARY KEY, -- ULID, assigned by relay
196
+ node_id_full TEXT NOT NULL,
197
+ community_id TEXT NOT NULL,
198
+ platform TEXT NOT NULL,
199
+ device_token TEXT NOT NULL, -- APNs / FCM token (kept secret on relay)
200
+ registered_at INTEGER,
201
+ last_active INTEGER
202
+ );
203
+ CREATE INDEX idx_push_node ON push_devices(node_id_full);
204
+ ```
205
+
206
+ ### 5.3 Forwarding flow
207
+
208
+ ```
209
+ peer P wants to call NAT'd node N (only knows N's NodeID and relay URL)
+         ↓
+ P → POST https://relay.hearthnet.de/relay/v1/forward/<N_short>
+     headers: standard X-HearthNet-* (signed by P)
+     body: original capability call
+         ↓
+ relay looks up N's WS session
+     if absent → 503 relay_unreachable
+     if present → wraps request as a WS message and sends to N's WS
+         ↓
+ N processes → sends response frames back through WS
+         ↓
+ relay streams those frames as HTTP/SSE response to P
+         ↓
+ relay never inspects body (E2E content); relay does check signatures
+ are valid (against peer manifests it caches) to prevent abuse
225
+ ```
226
+
227
+ ### 5.4 Push flow
228
+
229
+ ```
230
+ sender wants to push to mobile user U
+         ↓
+ sender → bus.call("auth.token.issue", scope={"capabilities":["relay.push"],"audience":"<device_id>"})
+         ↓
+ sender → POST relay/v1/push/<device_id> with token + payload
+         ↓
+ relay verifies token; resolves device_id → APNs/FCM token
+         ↓
+ relay sends via APNs/FCM
+         ↓
+ Apple/Google delivers to the device
+         ↓
+ mobile app opens, calls bus to fetch new chat / event
243
+ ```
244
+
245
+ The payload itself is opaque to the relay (`{"event_type":"chat.message.sent","sender_short":"7H4G-..."}`). The mobile app fetches the actual content via the bus when it opens.
246
+
247
+ ### 5.5 Federation lookup flow
248
+
249
+ A community publishes its bridge endpoints to the relay with a POST to `/relay/v1/community/<id>`, signed by an anchor. The relay caches `{community_id → [endpoints, last_updated]}` for 24 hours. GET lookups cost nothing and are lightweight, but must still be signed to deter spam.
250
+
251
+ ---
252
+
253
+ ## 6. Behaviour
254
+
255
+ ### 6.1 Trust model
256
+
257
+ The relay is **untrusted-but-honest**. It:
258
+
259
+ - Sees who is talking to whom (NodeID-level)
260
+ - Sees signature envelopes (but not the plaintext of E2E-encrypted content)
261
+ - Can deny service (DoS), refuse to forward, or rate-limit
262
+ - Can NOT impersonate anyone (no private keys)
263
+ - Can NOT decrypt E2E content (no DH secrets)
264
+ - Can NOT modify forwarded bytes without breaking signatures
265
+
266
+ Operators of relays are accountable through public reputation: the relay's URL is in plain sight in community configs. A misbehaving relay gets blackballed by communities.
267
+
268
+ ### 6.2 Rate limiting
269
+
270
+ | Endpoint | Limit |
271
+ |----------|-------|
272
+ | `/register` | 10 per hour per node |
273
+ | `/forward` | 10 RPS per (peer, target_node) |
274
+ | `/community/<id>` GET | 100 RPS total |
275
+ | `/push/<device>` | 60 per hour per (sender, device) |
276
+
277
+ Requests over the limit receive HTTP 429 with `retry_after_ms`.
278
+
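One way a relay could enforce these per-key limits is a sliding-window counter. The sketch below is illustrative rather than the relay's actual implementation: `SlidingWindowLimiter` and its method names are hypothetical, state is kept in memory only, and the key is whatever tuple the endpoint uses, for example `(peer, target_node)` for `/forward`.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits: dict[tuple, deque] = defaultdict(deque)

    def check(self, key: tuple) -> int:
        """Return 0 if the call is allowed, else a retry_after_ms hint."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return 0
        # The oldest hit leaves the window at hits[0] + window.
        return max(1, int((hits[0] + self.window - now) * 1000))
```

For the table above, `/forward` would use `SlidingWindowLimiter(10, 1.0)` and `/push/<device>` would use `SlidingWindowLimiter(60, 3600.0)`; a non-zero return value maps to the 429 response.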
279
+ ### 6.3 Tier policy (Christof's hosted instance)
280
+
281
+ | Tier | Community size | Push notifications | Cost |
+ |------|----------------|--------------------|------|
+ | Free | ≤ 5 nodes per community | 100/day | €0 |
+ | Hearth | ≤ 50 nodes | 5000/day | €5/month |
+ | Anchor | unlimited | 50000/day | €25/month |
+ | Self-hosted | unlimited | unlimited | infrastructure |
287
+
288
+ The relay is open-source; any community can run their own. Hosted tier is a convenience layer.
289
+
290
+ ### 6.4 Privacy guarantees
291
+
292
+ - The relay verifies per-call signatures, but a signature reveals only the public NodeID, not user identity in any deeper sense
+ - Callers address the target only by its short form (`forward/<short>`), but the relay still sees both sender and destination
+ - No protection for traffic-pattern privacy (who talks to whom); this is out of scope
295
+ - Logs retain registration + forwarding metadata for 30 days for abuse handling, then purged
296
+
297
+ ### 6.5 Failure modes
298
+
299
+ - Relay down → mobile push delivery delayed; federated lookups fall back to DHT or stored endpoints; direct LAN calls unaffected
+ - Relay overloaded → 429s; clients exponential-backoff
+ - Relay key rotation → relay publishes new pubkey signed by previous key; clients update via standard manifest refresh
302
+
303
+ ---
304
+
305
+ ## 7. Configuration (client side)
306
+
307
+ ```python
308
+ config.relay.enabled = False
309
+ config.relay.urls = ["https://relay.hearthnet.de"]
310
+ config.relay.tier = "free" # informational
311
+ config.relay.register_as_bridge = False # if True, holds persistent WS to relay
312
+ config.relay.push_enabled = False
313
+ config.relay.push_platform = "web"
314
+ ```
315
+
316
+ Constants: `RELAY_REGISTRATION_TTL_SECONDS=7200`, `RELAY_PUSH_RETRY_MAX=5`.
317
+
318
+ ### Relay server config (`relay-server/relay_server/config.py`)
319
+
320
+ ```python
321
+ config.bind = "0.0.0.0:443"
322
+ config.tls_cert_file = "/etc/relay/cert.pem"
323
+ config.tls_key_file = "/etc/relay/key.pem"
324
+ config.database = "/var/lib/relay/relay.db"
325
+ config.apns_cert = "/etc/relay/apns.pem"
326
+ config.fcm_key_file = "/etc/relay/fcm.json"
327
+ config.tier = "free|hearth|anchor"
328
+ config.stripe_secret = None # for paid tiers
329
+ config.admin_token = "<random>" # for operator endpoints
330
+ ```
331
+
332
+ ---
333
+
334
+ ## 8. Errors
335
+
336
+ `RelayError` (client domain):
337
+
338
+ - `relay_unreachable`: TCP fails or 5xx
+ - `registration_expired`: call requires re-register
+ - `forward_target_offline`: target node not currently registered with this relay
+ - `push_token_invalid`: APNs/FCM rejected the device token
+ - `tier_limit_exceeded`: quota for this tier reached
+
+ Wire mapping: `relay_unreachable` is its own code in [CAP2 §9](../CAPABILITY_CONTRACT_v2.md).
345
+
346
+ ---
347
+
348
+ ## 9. Tests
349
+
350
+ ### Client-side unit
351
+ - `test_register_includes_signature`
352
+ - `test_heartbeat_refreshes_expires_at`
353
+ - `test_lookup_returns_endpoints`
354
+
355
+ ### Server-side unit
356
+ - `test_forward_requires_target_registered`
357
+ - `test_signature_required_on_register`
358
+ - `test_rate_limit_per_peer_target`
359
+ - `test_push_dispatch_apns_mock`
360
+
361
+ ### Integration
362
+ - `test_two_nat_peers_communicate_through_relay`
363
+ - `test_federation_bridge_via_relay`
364
+ - `test_push_delivered_to_real_test_device` (manual, with APNs sandbox)
365
+
366
+ ### Operational
367
+ - Smoke tests on the deployed `relay.hearthnet.de` instance run hourly
368
+
369
+ ---
370
+
371
+ ## 10. Cross-references
372
+
373
+ | What | Where |
+ |------|-------|
+ | Token use for push auth | [M16 §5.5](M16-tokens.md) |
+ | Federation routes through relay | [M14 §6](M14-federation.md) |
+ | DHT bootstrap endpoint | [X05 §4.4](../cross-cutting/X05-dht.md) |
+ | Mobile push subscriber | [M22 §6](M22-mobile-native.md) |
+ | Wire `relay_unreachable` | [CAP2 §9](../CAPABILITY_CONTRACT_v2.md) |
380
+
381
+ ---
382
+
383
+ ## 11. Open questions
384
+
385
+ 1. **TURN-style relay vs message relay**: current spec is message-level (peer sends entire capability call). Could also do session-level TCP relay (more efficient for streams). Phase 2.5 candidate.
+ 2. **STUN integration**: clients could try direct connection via STUN before falling back to relay. Phase 3.
+ 3. **Multi-relay redundancy**: a node could register with two relays for HA. MVP picks one; multi is Phase 2.5.
+ 4. **Payment integration**: Stripe webhooks → tier upgrade. Implementation detail, not specced here.
+ 5. **Self-hosting documentation quality**: for the "appliance" go-to-market path, the relay needs a one-command install. Defer to `RELAY_OPERATIONS.md` doc.
docs/p2_p3/M16-tokens.md ADDED
@@ -0,0 +1,391 @@
1
+ # M16 - Capability Tokens
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M01 (identity), X02 (events, for `auth.token.*`), X04 (config), X03 (observability)
5
+ **Depended on by:** M14 (federation), M15 (relay), M22 (mobile), M23 (optionally, for session credentials)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Issue, verify, and revoke short-lived **capability tokens** for delegation. A token says: "the holder of this token may invoke capability X (with these constraints) on behalf of issuer Y, until time Z."
12
+
13
+ Tokens are the mechanism Phase 2 uses for:
14
+ - Federation calls (a federated peer presents a token issued by an anchor of the peer community)
15
+ - Mobile clients (the mobile app presents a token issued during onboarding)
16
+ - Limited-scope sharing (e.g. "let this neighbour query our emergency corpus for the next hour")
17
+
18
+ Per-request Ed25519 signatures (Phase 1 §1.3) remain the default authentication; tokens are an *additional* mechanism.
19
+
20
+ ---
21
+
22
+ ## 2. File layout
23
+
24
+ ```
25
+ hearthnet/identity/
+ └── tokens.py        # CapabilityToken, encode/decode, verify, revocation cache
+
+ hearthnet/services/auth/
+ ├── __init__.py
+ └── service.py       # AuthService: registers auth.token.* capabilities
31
+ ```
32
+
33
+ The token primitives live under `identity/` (low-level crypto). The capability handlers live as a normal service so they go through the bus.
34
+
35
+ ---
36
+
37
+ ## 3. Token envelope
38
+
39
+ Compact JWS-style. Compatible with off-the-shelf JWS decoders that accept `EdDSA`. Length budget: ≤ 800 bytes (fits a QR at error correction M).
40
+
41
+ ### 3.1 Header
42
+
43
+ ```json
44
+ {"alg": "EdDSA", "typ": "hntoken", "v": 1}
45
+ ```
46
+
47
+ ### 3.2 Payload
48
+
49
+ ```json
50
+ {
+   "iss": "ed25519:<issuer NodeID full form>",
+   "sub": "ed25519:<subject NodeID full form>",
+   "aud": "ed25519:<audience community_id, optional>",
+   "iat": 1717939200,
+   "exp": 1717942800,
+   "nbf": 1717939200,
+   "jti": "01HXR...",
+   "scope": {
+     "capabilities": ["rag.query@1.0", "embed.text@1.0"],
+     "params_constraints": {
+       "corpus": ["niederrhein-emergency"],
+       "model": ["bge-small-en-v1.5"]
+     },
+     "rate_limit_per_minute": 60,
+     "max_calls_total": null
+   },
+   "issued_via": "federation|onboarding|manual|relay"
+ }
69
+ ```
70
+
71
+ `sub` MAY be `"*"` for a bearer-style token (anyone holding the token may use it). This is used sparingly: only for federation proxies where the actual subject is unknown at issuance time.
72
+
73
+ ### 3.3 Signature
74
+
75
+ `Ed25519(base64url(header) + "." + base64url(payload))`. Final form:
76
+
77
+ ```
78
+ hntoken://v1/<base64url(header)>.<base64url(payload)>.<base64url(signature)>
79
+ ```
80
+
81
+ Total length: ~600–800 bytes typical.
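The envelope described above can be sketched with the standard library alone. This is a minimal illustration, not the HearthNet implementation: the helper names (`b64url`, `signing_input`, `render`, `split_token`) are hypothetical, the compact JSON separators are an assumed canonical form, and the signature is treated as opaque bytes because Ed25519 signing itself is out of scope here.

```python
import base64
import json

PREFIX = "hntoken://v1/"

def b64url(data: bytes) -> str:
    """Unpadded base64url, as used by JWS compact serialization."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    """Restore the padding that b64url() stripped, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def signing_input(header: dict, payload: dict) -> bytes:
    """Bytes the Ed25519 signature is computed over (section 3.3)."""
    # Assumption: compact separators; the spec does not fix a JSON canonical form.
    h = b64url(json.dumps(header, separators=(",", ":")).encode("utf-8"))
    p = b64url(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
    return f"{h}.{p}".encode("ascii")

def render(header: dict, payload: dict, signature: bytes) -> str:
    """Assemble the final 'hntoken://v1/...' string from an existing signature."""
    return PREFIX + signing_input(header, payload).decode("ascii") + "." + b64url(signature)

def split_token(text: str) -> tuple[dict, dict, bytes]:
    """Structural parse only; does NOT verify the signature."""
    if not text.startswith(PREFIX):
        raise ValueError("token_malformed: missing hntoken://v1/ prefix")
    parts = text[len(PREFIX):].split(".")
    if len(parts) != 3:
        raise ValueError("token_malformed: expected three segments")
    header = json.loads(b64url_decode(parts[0]))
    payload = json.loads(b64url_decode(parts[1]))
    return header, payload, b64url_decode(parts[2])
```

A verifier would recompute `signing_input` from the first two segments as received, rather than re-serialising the parsed JSON, so that formatting differences cannot invalidate a good signature.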
82
+
83
+ ---
84
+
85
+ ## 4. Public API
86
+
87
+ ### 4.1 `hearthnet/identity/tokens.py`
88
+
89
+ ```python
90
+ # hearthnet/identity/tokens.py
+ from dataclasses import dataclass
+ from pathlib import Path
+
+ # KeyPair, CommunityManifest and EventLog are defined elsewhere in HearthNet;
+ # their imports are omitted here.
+
+ @dataclass(frozen=True)
+ class TokenScope:
+     capabilities: list[str]                     # e.g. ["rag.query@1.0"]
+     params_constraints: dict[str, list[str]]    # e.g. {"corpus": ["..."]}
+     rate_limit_per_minute: int
+     max_calls_total: int | None
+
+ @dataclass(frozen=True)
+ class CapabilityToken:
+     """The fully decoded token, ready for verification."""
+     issuer: str
+     subject: str                # "*" for bearer
+     audience: str | None
+     issued_at: int              # unix seconds
+     expires_at: int
+     not_before: int
+     jti: str                    # ULID
+     scope: TokenScope
+     issued_via: str             # "federation"|"onboarding"|...
+     signature: bytes            # raw 64 bytes
+
+     @property
+     def is_bearer(self) -> bool: ...
+
+     def is_active(self, now: int | None = None) -> bool: ...
+
+     def covers(self, capability_name: str, version: tuple[int, int],
+                params: dict | None = None) -> bool:
+         """True iff scope includes the capability and (if params_constraints set)
+         every requested param value is in the allow-list."""
+
+ def issue_token(
+     issuer_kp: KeyPair,
+     subject: str,
+     scope: TokenScope,
+     *,
+     ttl_seconds: int = TOKEN_DEFAULT_TTL_SECONDS,
+     audience: str | None = None,
+     issued_via: str = "manual",
+     not_before_offset: int = 0,
+ ) -> tuple[CapabilityToken, str]:
+     """Build, sign, encode. Returns (token, encoded_str)."""
+
+ def encode_token(tok: CapabilityToken, header_signature: bytes) -> str:
+     """Render to 'hntoken://v1/...'."""
+
+ def decode_token(text: str) -> CapabilityToken:
+     """Parse + structural validation only. Does NOT verify the signature.
+     Raises TokenError on malformed input."""
+
+ def verify_token(
+     tok: CapabilityToken,
+     *,
+     expected_audience: str | None = None,
+     revocation_cache: 'RevocationCache | None' = None,
+     now: int | None = None,
+     community_manifest: CommunityManifest,
+ ) -> None:
+     """Verify signature against issuer's pubkey, expiry, nbf, audience,
+     revocation, and that the issuer is currently a community member
+     (not revoked at the issuer's community level).
+     Raises TokenError with specific code."""
+
+ class RevocationCache:
+     """In-memory + persisted (SQLite) cache of revoked JTIs.
+     Authoritative source is the event log."""
+
+     def __init__(self, db_path: Path):
+         ...
+
+     def add(self, jti: str, revoked_at: int) -> None: ...
+     def is_revoked(self, jti: str) -> bool: ...
+     def hydrate_from_log(self, event_log: EventLog) -> int:
+         """Read all auth.token.revoked events; bring cache up to date.
+         Returns rows added."""
+
+ class TokenError(Exception):
+     """code in {
+     'token_invalid','token_expired','token_not_yet_valid',
+     'token_signature_bad','token_audience_mismatch',
+     'token_revoked','token_scope_insufficient',
+     'token_issuer_revoked','token_malformed'}"""
+     code: str
175
+ ```
176
+
177
+ ### 4.2 `hearthnet/services/auth/service.py`
178
+
179
+ ```python
180
+ # hearthnet/services/auth/service.py
+ from collections.abc import Callable
+
+ # KeyPair, EventLog, CommunityManifest, RevocationCache, CapabilityDescriptor,
+ # ParamsPredicate and RouteRequest are defined elsewhere; imports omitted here.
+
+ class AuthService:
+     """Registers auth.token.issue / revoke / introspect capabilities."""
+
+     name = "auth"
+     version = "1.0"
+
+     def __init__(
+         self,
+         author_kp: KeyPair,
+         event_log: EventLog,
+         community_manifest_provider: Callable[[], CommunityManifest],
+         revocation_cache: RevocationCache,
+     ):
+         ...
+
+     def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
+         """Registers: auth.token.issue@1.0, auth.token.revoke@1.0, auth.token.introspect@1.0."""
+
+     async def start(self) -> None:
+         """Hydrate the revocation cache from event log."""
+
+     async def stop(self) -> None: ...
+     def health(self) -> dict: ...
+
+     # --- handlers ---
+
+     async def handle_issue(self, req: RouteRequest) -> dict:
+         """CAP2 §4.5. Build a CapabilityToken, sign with author_kp, emit auth.token.issued event."""
+
+     async def handle_revoke(self, req: RouteRequest) -> dict:
+         """CAP2 §4.6. Verify caller is issuer (or 'trusted'). Append auth.token.revoked event."""
+
+     async def handle_introspect(self, req: RouteRequest) -> dict:
+         """CAP2 §4.7. Self-only. Returns active status and scope."""
215
+ ```
216
+
217
+ ---
218
+
219
+ ## 5. Behaviour
220
+
221
+ ### 5.1 Token-bearer call lifecycle
222
+
223
+ ```
224
+ caller hits any capability endpoint with:
+     X-HearthNet-Token: hntoken://v1/...
+     (and optionally X-HearthNet-Signature)
+         ↓
+ X01 transport extracts and decodes
+         ↓
+ verify_token(...): signature, expiry, audience, revocation
+         ↓
+ on success:
+     caller_effective_identity = token.subject (or token.issuer if subject == "*")
+     scope_check (does token cover this capability?)
+         ↓
+ bus.handle_call() with the effective caller
+         ↓
+ record token usage in metrics: hearthnet_token_calls_total{issuer, scope_match}
239
+ ```
240
+
241
+ ### 5.2 Co-existence with per-request signing
242
+
243
+ A request MAY carry both `X-HearthNet-Signature` and `X-HearthNet-Token`:
244
+
245
+ - Signature: proves *who* is making this exact call right now
246
+ - Token: proves they're *allowed* to (via delegation)
247
+
248
+ The token's `sub` MUST equal the signature's `From` NodeID, unless `sub == "*"`. Mismatch → `invalid_signature`.
249
+
250
+ This combination is the normal mode for federation: a federated peer's anchor signs with their key (signature) AND carries a token issued by their community's anchor delegating "rag.query is OK".
251
+
252
+ ### 5.3 Issuance authority
253
+
254
+ A node may issue a token iff:
255
+
256
+ - The capabilities in scope are ones the issuer's community offers (or grants via federation)
257
+ - TTL ≤ `policy.capability_token_ttl_seconds` (community-wide policy bound)
+ - The issuer is a `member` (level ≥ member) of the community
259
+
260
+ The handler enforces these before signing.
261
+
262
+ ### 5.4 Revocation
263
+
264
+ A token is revoked by appending `auth.token.revoked` to the event log:
265
+
266
+ - Issuer may revoke their own tokens
267
+ - A `trusted` member may revoke any token (operator override)
268
+ - The community root can revoke any token
269
+
270
+ Once the revoke event is in the log, all gossip-receiving nodes update their `RevocationCache`. Until that propagates, a revoked token may still be honoured briefly; the design accepts up to 60 seconds of lag.
271
+
272
+ ### 5.5 Bearer tokens (`sub == "*"`)
273
+
274
+ Used sparingly:
275
+
276
+ - Federation proxy tokens: peer community gets one bearer token to make federated calls; rotation every 24h
277
+ - Mobile push tokens (M22): one bearer token tied to a `PushDeviceID`, longer TTL
278
+
279
+ Bearer tokens trade revocation granularity for convenience. The `jti` is still unique, so a specific bearer token can be killed.
280
+
281
+ ### 5.6 Replay protection
282
+
283
+ Tokens are not single-use. Replay is mitigated by:
284
+ - Short TTL (default 1h)
285
+ - Audience binding (`aud` field): the server rejects the token if `aud` ≠ its own community
286
+ - Rate-limit budget (`scope.rate_limit_per_minute`)
287
+ - Revocation if abuse detected
288
+
289
+ For one-shot tokens (e.g. password-reset-style flows), set `max_calls_total: 1` and the server tracks usage via a per-jti counter.
290
+
291
+ ### 5.7 Token-on-token (delegation chains)
292
+
293
+ Phase 2: **forbidden**. A token holder cannot issue new tokens. This avoids a delegation tree we cannot audit.
294
+
295
+ Phase 3 may add bounded delegation with a `delegates: int` counter.
296
+
297
+ ---
298
+
299
+ ## 6. Storage
300
+
301
+ ### 6.1 Revocation cache table
302
+
303
+ ```sql
304
+ CREATE TABLE IF NOT EXISTS token_revocations (
+   jti           TEXT PRIMARY KEY,
+   revoked_at    INTEGER NOT NULL,
+   reason        TEXT,
+   via_event_id  TEXT
+ );
+ CREATE INDEX IF NOT EXISTS idx_revocations_time ON token_revocations(revoked_at);
311
+ ```
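A minimal sketch of how a cache could sit on top of this table, using only `sqlite3` from the standard library. The class name `SketchRevocationCache` and the choice to mirror all JTIs in an in-memory set are illustrative assumptions; hydration from the event log is left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS token_revocations (
  jti           TEXT PRIMARY KEY,
  revoked_at    INTEGER NOT NULL,
  reason        TEXT,
  via_event_id  TEXT
);
CREATE INDEX IF NOT EXISTS idx_revocations_time ON token_revocations(revoked_at);
"""

class SketchRevocationCache:
    """In-memory set of revoked JTIs, backed by a SQLite table."""

    def __init__(self, db_path: str = ":memory:"):
        self._db = sqlite3.connect(db_path)
        self._db.executescript(SCHEMA)
        # Load previously persisted revocations into memory for fast lookups.
        rows = self._db.execute("SELECT jti FROM token_revocations")
        self._known = {row[0] for row in rows}

    def add(self, jti: str, revoked_at: int) -> None:
        # INSERT OR IGNORE keeps repeated revoke events idempotent.
        self._db.execute(
            "INSERT OR IGNORE INTO token_revocations (jti, revoked_at) VALUES (?, ?)",
            (jti, revoked_at),
        )
        self._db.commit()
        self._known.add(jti)

    def is_revoked(self, jti: str) -> bool:
        return jti in self._known

    def row_count(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM token_revocations").fetchone()[0]
```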
312
+
313
+ ### 6.2 Rate-limit counters
314
+
315
+ Per-(jti, minute) sliding window in memory. Persisted only when capacity-exceeded events fire (for audit).
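One way to realise a per-`jti` sliding window is a deque of call timestamps per token. The sketch below is an assumption about the data structure, not the shipped limiter; the class name and the explicit `now` argument (which keeps it testable) are hypothetical, and audit persistence is omitted.

```python
from collections import defaultdict, deque

class TokenRateLimiter:
    """Sliding-window call counter keyed by token jti (in memory only)."""

    def __init__(self, window_seconds: float = 60.0):
        self._window = window_seconds
        self._calls: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, jti: str, limit_per_minute: int, now: float) -> bool:
        calls = self._calls[jti]
        # Drop timestamps that have slid out of the window.
        while calls and calls[0] <= now - self._window:
            calls.popleft()
        if len(calls) >= limit_per_minute:
            return False
        calls.append(now)
        return True
```

In a handler, `limit_per_minute` would come from the token's `scope.rate_limit_per_minute` and `now` from a monotonic clock.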
316
+
317
+ ---
318
+
319
+ ## 7. Errors
320
+
321
+ `TokenError` → wire mapping:
322
+
323
+ | TokenError code | Wire code | HTTP |
324
+ |-----------------|-----------|------|
325
+ | `token_malformed` | `bad_request` | 400 |
326
+ | `token_invalid` | `token_invalid` | 401 |
327
+ | `token_signature_bad` | `token_invalid` | 401 |
328
+ | `token_expired` | `token_expired` | 410 |
329
+ | `token_not_yet_valid` | `token_expired` | 410 |
330
+ | `token_audience_mismatch` | `unauthorized` | 401 |
331
+ | `token_revoked` | `token_revoked` | 401 |
332
+ | `token_scope_insufficient` | `token_scope_insufficient` | 403 |
333
+ | `token_issuer_revoked` | `revoked` | 403 |
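The table translates directly into a lookup. The dictionary and `to_wire` helper below are illustrative names; the values are copied from the table above, and how unknown codes should be handled is not specified, so the sketch simply lets `KeyError` propagate.

```python
# TokenError code -> (wire code, HTTP status), transcribed from the table above.
WIRE_MAPPING: dict[str, tuple[str, int]] = {
    "token_malformed": ("bad_request", 400),
    "token_invalid": ("token_invalid", 401),
    "token_signature_bad": ("token_invalid", 401),
    "token_expired": ("token_expired", 410),
    "token_not_yet_valid": ("token_expired", 410),
    "token_audience_mismatch": ("unauthorized", 401),
    "token_revoked": ("token_revoked", 401),
    "token_scope_insufficient": ("token_scope_insufficient", 403),
    "token_issuer_revoked": ("revoked", 403),
}

def to_wire(code: str) -> tuple[str, int]:
    """Map a TokenError code to its wire code and HTTP status."""
    return WIRE_MAPPING[code]
```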
334
+
335
+ ---
336
+
337
+ ## 8. Configuration
338
+
339
+ From [X04](../../cross-cutting/X04-config.md) (extension):
340
+
341
+ ```python
342
+ config.auth.enabled = True
343
+ config.auth.token_default_ttl_seconds = TOKEN_DEFAULT_TTL_SECONDS
344
+ config.auth.token_max_ttl_seconds = TOKEN_MAX_TTL_SECONDS
345
+ config.auth.allow_bearer_tokens = True
346
+ config.auth.federated_only_bearer = True # bearer tokens only issued for federation context
347
+ ```
348
+
349
+ ---
350
+
351
+ ## 9. Tests
352
+
353
+ ### Unit
354
+ - `test_token_encode_decode_roundtrip`
355
+ - `test_token_under_800_bytes`
356
+ - `test_token_signature_verified`
357
+ - `test_token_expired_rejected`
358
+ - `test_token_audience_mismatch_rejected`
359
+ - `test_token_scope_covers_exact_match`
360
+ - `test_token_scope_params_constraint_filtered`
361
+ - `test_revocation_event_updates_cache`
362
+ - `test_bearer_token_with_star_subject`
363
+
364
+ ### Integration
365
+ - `test_federated_call_with_token_succeeds`
366
+ - `test_revoked_token_rejected_within_60_seconds`
367
+ - `test_rate_limit_per_token_enforced`
368
+ - `test_mobile_client_token_authenticates`
369
+
370
+ ---
371
+
372
+ ## 10. Cross-references
373
+
374
+ | What | Where |
+ |------|-------|
+ | Token wire format | [CAP2 §6.2](../CAPABILITY_CONTRACT_v2.md) |
+ | Token-bearer requests | [CAP2 §5.2](../CAPABILITY_CONTRACT_v2.md) |
+ | `auth.token.*` capabilities | [CAP2 §4.5–4.7](../CAPABILITY_CONTRACT_v2.md) |
+ | Used by federation | [M14 §5](M14-federation.md) |
+ | Used by relay tier | [M15 §4](M15-relay-tier.md) |
+ | Used by mobile client | [M22 §4](M22-mobile-native.md) |
+ | Phase 1 identity primitives | [M01](../../modules/M01-identity.md) |
383
+
384
+ ---
385
+
386
+ ## 11. Open questions
387
+
388
+ 1. **Audience as community vs node.** Phase 2 uses the community as audience. Should a single-node audience be supported (one-call-to-one-node tokens)? Probably yes; this would add `aud_kind: "community"|"node"`. Defer.
+ 2. **JWE for confidential scope.** The scope is currently in cleartext, and some scope values are sensitive (corpus names). Wrap the payload in JWE? Defer; out of scope for the token MVP.
+ 3. **Hardware-bound tokens.** Phase 3 idea: a token bound to a TPM-attested device.
+ 4. **Token-on-token (delegation).** Explicitly Phase 3.
docs/p2_p3/M17-ocr.md ADDED
@@ -0,0 +1,305 @@
1
+ # M17: OCR Service
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M03 (bus), M07 (blobs, for reading image/PDF inputs and storing extracted text), M11 (embedding, when integrating with M05 RAG), X04 (config), X03 (observability)
5
+ **Depended on by:** M05 RAG (ingest of scanned PDFs), M20 vision (img.describe can fall back to OCR for text-heavy images)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Provide `ocr.image@1.0` and `ocr.pdf@1.0`. Wrap several OCR backends so the bus can route between them by document type and language. Specifically engineered to handle:
12
+
13
+ - Modern German printed text (Tesseract)
14
+ - Handwriting (TrOCR, Florence-2 in OCR mode)
+ - Historical scripts: Sütterlin, Kurrent, Latin, Arabic, Cyrillic (Christof's multilingual harness)
16
+ - Mixed-language documents (auto-detection)
17
+
18
+ ---
19
+
20
+ ## 2. File layout
21
+
22
+ ```
23
+ hearthnet/services/ocr/
24
+ ├── __init__.py
+ ├── service.py               # OcrService
+ └── backends/
+     ├── __init__.py
+     ├── base.py              # OcrBackend Protocol
+     ├── tesseract.py         # Tesseract via pytesseract
+     ├── trocr.py             # Microsoft TrOCR via transformers
+     ├── multilingual.py      # Christof's self-improving harness (CHURRO, olmOCR-2)
+     └── florence_ocr.py      # Florence-2 OCR mode (overlap with M20)
33
+ ```
34
+
35
+ ---
36
+
37
+ ## 3. Public API
38
+
39
+ ### 3.1 `backends/base.py`
40
+
41
+ ```python
42
+ # hearthnet/services/ocr/backends/base.py
+ from dataclasses import dataclass
+ from typing import Protocol
+
+ @dataclass(frozen=True)
+ class OcrBlock:
+     text: str
+     bbox: tuple[int, int, int, int]     # (x, y, w, h) in pixel coords
+     confidence: float                   # 0..1
+     language: str | None
+
+ @dataclass(frozen=True)
+ class OcrPageResult:
+     page: int                           # 1-indexed
+     text: str                           # concatenated, reading order
+     blocks: list[OcrBlock]
+     languages: list[str]                # detected, ordered by prevalence
+     confidence_mean: float
+     ms: int
+
+ class OcrBackend(Protocol):
+     name: str                           # "tesseract" | "trocr" | "multilingual" | "florence_ocr"
+     languages_supported: list[str]      # ISO 639-2 codes: "deu","eng","lat","ara","rus", ...
+     supports_handwriting: bool
+     max_image_pixels: int
+
+     async def warm(self) -> None: ...
+     async def close(self) -> None: ...
+
+     async def ocr_image(
+         self,
+         image_bytes: bytes,
+         *,
+         languages: list[str] | None,        # None → auto-detect
+         preprocess: dict | None = None,     # {deskew, denoise, dpi}
+     ) -> OcrPageResult: ...
+
+     async def ocr_pdf_page(
+         self,
+         pdf_bytes: bytes,
+         *,
+         page: int,
+         languages: list[str] | None,
+         preprocess: dict | None = None,
+     ) -> OcrPageResult: ...
+
+     def health(self) -> dict: ...
88
+ ```
89
+
90
+ ### 3.2 Concrete backends
91
+
92
+ | File | Class | Notes |
+ |------|-------|-------|
+ | `backends/tesseract.py` | `TesseractBackend(min_confidence: float = 0.5)` | Languages: any installed traineddata. Subprocess via pytesseract. |
+ | `backends/trocr.py` | `TrocrBackend(model: str = "microsoft/trocr-large-handwritten", device: str = "auto")` | Handwriting; CUDA preferred. |
+ | `backends/multilingual.py` | `MultilingualHarnessBackend(model: str = "self-improving-ocr-v1", device: str = "auto", harness_dir: Path)` | Christof's harness (CHURRO, olmOCR-2, retrieval-augmented correction, Kurrent/Sütterlin/Latin/Arabic/Cyrillic). Configured via `harness_dir`. |
+ | `backends/florence_ocr.py` | `FlorenceOcrBackend(model: str = "microsoft/Florence-2-large")` | Reuses M20 vision backend in OCR mode. |
98
+
99
+ `MultilingualHarnessBackend` is the headline integration for Christof's existing work. It exposes the same `OcrBackend` interface and lets the harness's internal page-level VLMs do the heavy lifting.
100
+
101
+ ### 3.3 `service.py`
102
+
103
+ ```python
104
+ # hearthnet/services/ocr/service.py
+ from collections.abc import AsyncIterator, Callable
+
+ class OcrService:
+     name = "ocr"
+     version = "1.0"
+
+     def __init__(self, config: OcrConfig, blob_store: BlobStore, event_log: EventLog):
+         self._backends: dict[str, OcrBackend] = self._build_backends(config)
+
+     def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
+         """One ocr.image entry per backend; one ocr.pdf entry per backend.
+         params include backend name and supported languages."""
+
+     async def start(self) -> None: ...
+     async def stop(self) -> None: ...
+     def health(self) -> dict: ...
+
+     # --- handlers ---
+
+     async def handle_image(self, req: RouteRequest) -> dict:
+         """CAP2 §4.8.
+         1. Resolve image_cid via blob_store
+         2. Pick backend from params.backend
+         3. Run; build response"""
+
+     async def handle_pdf(self, req: RouteRequest) -> AsyncIterator[dict]:
+         """CAP2 §4.9.
+         1. Resolve doc_cid
+         2. For each page in page_range:
+              emit 'progress' frame
+              emit 'page' frame
+         3. If store_text:true, write concatenated text as new blob; emit done with text_cid"""
135
+ ```
136
+
137
+ ### 3.4 `params_compatible` predicate
138
+
139
+ ```python
140
+ def params_compatible(offered: dict, requested: dict) -> bool:
+     # backend must match if specified
+     if "backend" in requested and requested["backend"] != offered.get("backend"):
+         return False
+     # all requested languages must be supported by this backend;
+     # omitted languages or "auto" (see 4.1) place no constraint on the backend
+     requested_langs = set(requested.get("languages") or []) - {"auto"}
+     offered_langs = set(offered.get("languages_supported", []))
+     return requested_langs.issubset(offered_langs)
148
+ ```
149
+
150
+ ---
151
+
152
+ ## 4. Behaviour
153
+
154
+ ### 4.1 Auto-language detection
155
+
156
+ If `languages` is omitted or set to `["auto"]`:
157
+
158
+ 1. Sample 3 random pages
159
+ 2. Run lightweight script detection (Tesseract `osd`)
160
+ 3. Choose top 2 scripts
161
+ 4. Re-run with that language set
162
+
163
+ Backends that don't support `osd` fall back to a fixed default (configured per backend).
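Step 3 of the procedure above, picking the two most frequent scripts across the sampled pages, can be sketched as follows. The function name and the idea that detection yields one script label per sampled page are assumptions; running Tesseract's OSD itself is not shown.

```python
from collections import Counter

def choose_scripts(per_page_scripts: list[str], top_n: int = 2) -> list[str]:
    """Return the top_n most frequent script labels, most frequent first."""
    counts = Counter(per_page_scripts)
    return [script for script, _ in counts.most_common(top_n)]
```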
164
+
165
+ ### 4.2 Preprocessing pipeline
166
+
167
+ `preprocess` dict supports:
168
+ - `deskew: bool` (straighten image)
+ - `denoise: bool` (bilateral filter)
+ - `binarize: bool` (Otsu threshold)
+ - `dpi: int` (target resolution; upscale if lower)
+ - `contrast_normalise: bool`
173
+
174
+ Default: `{"deskew": true, "denoise": false}`. Heavy preprocessing slows ingest noticeably; enable it only for documents that need it.
175
+
176
+ ### 4.3 Quality estimation
177
+
178
+ Each page result reports `confidence_mean`. Below 0.6, the service emits a `low_quality` warning frame and recommends:
179
+ - Trying a different backend (e.g. switch from Tesseract to multilingual harness for historic text)
180
+ - Raising DPI
181
+ - Re-scanning
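The quality check can be sketched as a mean over block confidences compared against the 0.6 threshold. Two things here are assumptions: the mean is unweighted (the spec does not say whether blocks are weighted by length), and a page with no blocks is scored 0.0.

```python
LOW_QUALITY_THRESHOLD = 0.6  # from the paragraph above

def confidence_mean(block_confidences: list[float]) -> float:
    """Unweighted mean of per-block confidences; 0.0 for an empty page (assumption)."""
    if not block_confidences:
        return 0.0
    return sum(block_confidences) / len(block_confidences)

def is_low_quality(block_confidences: list[float],
                   threshold: float = LOW_QUALITY_THRESHOLD) -> bool:
    """True when the page should trigger a 'low_quality' warning frame."""
    return confidence_mean(block_confidences) < threshold
```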
182
+
183
+ ### 4.4 Integration with RAG
184
+
185
+ [M05 §10 open question 4](../../modules/M05-rag.md) is now answered:
186
+
187
+ ```
188
+ RagService.handle_ingest receives a scanned PDF (mime_type=image/scanned-pdf or detected)
+   → bus.call("ocr.pdf", (1,0), {input:{doc_cid:..., store_text:true}})
+   → receive text_cid
+   → ingest the text_cid blob (which is now extracted plaintext) as normal
+   → emit rag.document.ingested event (with metadata noting ocr_backend used)
193
+ ```
194
+
195
+ The OCR text is stored as a separate blob, content-addressed. Re-ingestion is idempotent.
196
+
197
+ ### 4.5 Page-range and parallelism
198
+
199
+ `page_range: [1, 50]` lets callers process partial documents. Pages are OCR'd serially within one call. For very large PDFs, callers should split into ranges and call concurrently; the bus enforces per-capability concurrency.
200
+
201
+ `OCR_MAX_PAGES_PER_REQUEST = 50` is the hard ceiling per call.
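A caller-side helper for splitting a large document into ranges that respect the ceiling might look like this. `split_page_range` is a hypothetical name; ranges are inclusive and 1-indexed, matching `page_range: [1, 50]`.

```python
def split_page_range(first: int, last: int, max_pages: int = 50) -> list[tuple[int, int]]:
    """Split an inclusive, 1-indexed page range into chunks of at most max_pages."""
    if first < 1 or last < first or max_pages < 1:
        raise ValueError("invalid page range")
    ranges = []
    start = first
    while start <= last:
        end = min(start + max_pages - 1, last)
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each resulting tuple can be sent as its own `ocr.pdf` call, and the per-range text reassembled in order.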
202
+
203
+ ### 4.6 PDF text-layer detection
204
+
205
+ Before OCR'ing, the service checks if the PDF has an extractable text layer (via `pypdf`). If yes and confidence is decent (heuristic), it returns the text-layer content directly, which is much cheaper than OCR. The caller can force OCR with `force_ocr: true`.
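The spec does not define the "decent confidence" heuristic, so the following is purely illustrative. It assumes the per-page strings were already extracted upstream (for example with pypdf's `PdfReader(...).pages[i].extract_text()`), and the thresholds are hypothetical placeholders.

```python
def text_layer_usable(page_texts: list[str],
                      min_chars_per_page: int = 50,
                      min_printable_ratio: float = 0.9) -> bool:
    """Heuristic: every page has enough text and it is mostly printable."""
    if not page_texts:
        return False
    for text in page_texts:
        stripped = text.strip()
        if len(stripped) < min_chars_per_page:
            return False  # page looks like a scan with no (or a stub) text layer
        printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
        if printable / len(stripped) < min_printable_ratio:
            return False  # text layer is likely garbage from a bad embedded OCR pass
    return True
```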
206
+
207
+ ### 4.7 Christof's multilingual harness integration
208
+
209
+ The `MultilingualHarnessBackend` wraps Christof's existing self-improving OCR pipeline:
210
+
211
+ - Internal models: CHURRO (page-level VLM), olmOCR-2 (page-level VLM)
212
+ - Retrieval-augmented correction over a script-specific corpus
213
+ - Kurrent + Sütterlin support for German historical documents
214
+ - Latin / Arabic / Cyrillic script recognition
215
+
216
+ Configuration:
217
+
218
+ ```python
219
+ config.ocr.multilingual_harness_dir = Path("/srv/ocr-harness")
220
+ config.ocr.multilingual_max_pages_concurrent = 2
221
+ ```
222
+
223
+ The harness is GPU-intensive. On CPU-only nodes, it deregisters itself at startup.
224
+
225
+ ---
226
+
227
+ ## 5. Storage and lifecycle
228
+
229
+ - Input image/PDF: fetched from blob store via CID
230
+ - Output text: optionally stored as a new blob (`store_text: true`)
231
+ - Side-effect: `ocr.document.indexed` event in the community log (carries text_cid for downstream replication)
232
+
233
+ OCR backends do NOT cache results inside themselves. Reuse comes from caching at the RAG/blob layer (same `doc_cid` → already-extracted-text blob).
234
+
235
+ ---
236
+
237
+ ## 6. Errors
238
+
239
+ | Condition | Wire code |
240
+ |-----------|-----------|
241
+ | Unknown backend | `not_found` |
242
+ | Languages not supported by any backend | `bad_request` |
243
+ | Image too large (> max_image_pixels) | `bad_request` |
244
+ | Page-range exceeds document | `bad_request` |
245
+ | > OCR_MAX_PAGES_PER_REQUEST | `bad_request` |
246
+ | Backend crash | `internal_error` |
247
+ | GPU OOM on multilingual | `capacity_exceeded` (with retry_after) |
248
+
249
+ ---
250
+
251
+ ## 7. Configuration
252
+
253
+ ```python
254
+ config.ocr.enabled = True
+ config.ocr.backends = [
+     OcrBackendConfig(name="tesseract", languages=["deu","eng","fra","lat"]),
+     OcrBackendConfig(name="trocr", model="microsoft/trocr-large-handwritten"),
+     OcrBackendConfig(name="multilingual", harness_dir=Path("/srv/ocr-harness")),
+ ]
+ config.ocr.default_dpi = OCR_DEFAULT_DPI    # 300
+ config.ocr.max_pages_per_request = OCR_MAX_PAGES_PER_REQUEST
+ config.ocr.text_layer_first = True
263
+ ```
264
+
265
+ Constants: `OCR_DEFAULT_DPI`, `OCR_MAX_PAGES_PER_REQUEST`.
266
+
267
+ ---
268
+
269
+ ## 8. Tests
270
+
271
+ ### Unit
272
+ - `test_descriptor_schema_validates_meta_schema`
273
+ - `test_params_compatible_language_subset`
274
+ - `test_text_layer_short_circuits_when_present`
275
+ - `test_force_ocr_bypasses_text_layer`
276
+ - `test_low_quality_emits_warning_frame`
277
+
278
+ ### Integration
279
+ - `test_tesseract_german_print` (with a known sample)
280
+ - `test_trocr_handwriting_sample`
281
+ - `test_multilingual_kurrent_sample` (if harness installed)
282
+ - `test_rag_ingest_scanned_pdf_end_to_end`
283
+ - `test_ocr_pdf_progress_frames`
284
+
285
+ ---
286
+
287
+ ## 9. Cross-references
288
+
289
+ | What | Where |
+ |------|-------|
+ | `ocr.*` wire | [CAP2 §4.8–4.9](../CAPABILITY_CONTRACT_v2.md) |
+ | Blob store dependency | [M07 §3](../../modules/M07-file-blobs.md) |
+ | RAG integration | [M05 §10 q4](../../modules/M05-rag.md) (now resolved) |
+ | Vision overlap (Florence-2 OCR mode) | [M20 §4.3](M20-vision.md) |
+ | `ocr.document.indexed` event | [CAP2 §7.1](../CAPABILITY_CONTRACT_v2.md) |
+ | Christof's harness | external project; this module is the integration surface |
297
+
298
+ ---
299
+
300
+ ## 10. Open questions
301
+
302
+ 1. **Multilingual harness auto-update.** The harness self-improves; should the model versions be event-logged so we can replay deterministically? Yes: record the harness version hash in each `ocr.document.indexed` event.
303
+ 2. **Manuscript-quality preprocessing.** Some historic documents need bespoke preprocessing (e.g. ink-bleed removal). Phase 2.5 might add a `preprocess_profile` enum.
304
+ 3. **Reading order from layout.** Currently we trust the backend's reading order. For multi-column documents, an explicit layout model (LayoutLMv3) might help. Phase 3.
305
+ 4. **Streaming OCR for very large images.** Currently atomic. Could tile and stream. Defer.
docs/p2_p3/M18-translation.md ADDED
@@ -0,0 +1,225 @@
1
+ # M18: Translation Service
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M03 (bus), X04 (config), X03 (observability), `transformers`, `torch`
5
+ **Depended on by:** UI marketplace + chat (one-click translate), M19 STT (with `translate_to_en=true`)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Provide `trans.text@1.0`. Translate between languages, with strong emphasis on:
12
+ - German ↔ English (default)
+ - German ↔ Plattdeutsch (Niederrhein-specific, Christof's domain)
+ - Major European languages
+ - Optionally Arabic, Turkish, Russian, Ukrainian, which are useful in refugee-context emergencies
16
+
17
+ ---
18
+
19
+ ## 2. File layout
20
+
21
+ ```
22
+ hearthnet/services/translation/
23
+ ├── __init__.py
+ ├── service.py
+ └── backends/
+     ├── __init__.py
+     ├── base.py
+     ├── nllb.py              # facebook/nllb-200-distilled-600M
+     └── plattdeutsch.py      # specialised fine-tune, optional
30
+ ```
31
+
32
+ ---
33
+
34
+ ## 3. Public API
35
+
36
+ ### 3.1 `backends/base.py`
37
+
38
+ ```python
39
+ from dataclasses import dataclass
+ from typing import Protocol
+
+ @dataclass(frozen=True)
+ class TranslationResult:
+     text: str
+     from_lang: str          # ISO 639-1 where one exists, else ISO 639-2/3 (e.g. "nds")
+     to_lang: str
+     confidence: float       # 0..1 if backend supports; else 1.0 placeholder
+     ms: int
+
+ class TranslationBackend(Protocol):
+     name: str
+     languages_pairs: list[tuple[str, str]]     # supported (from, to) pairs
+     max_chars: int
+
+     async def warm(self) -> None: ...
+     async def close(self) -> None: ...
+
+     async def translate(
+         self,
+         text: str,
+         *,
+         from_lang: str,         # "auto" supported
+         to_lang: str,
+         domain: str | None,
+     ) -> TranslationResult: ...
+
+     def detect_language(self, text: str) -> str | None: ...
+
+     def health(self) -> dict: ...
67
+ ```
68
+
69
+ ### 3.2 Concrete backends
70
+
71
+ ```python
72
+ class NllbBackend(TranslationBackend):
+     """facebook/nllb-200-distilled-600M (or larger variants).
+     Covers about 200 languages out of the box."""
+
+     def __init__(
+         self,
+         model: str = "facebook/nllb-200-distilled-600M",
+         device: str = "auto",
+         max_chars: int = TRANSLATION_MAX_CHARS,
+     ):
+         ...
+
+ class PlattdeutschBackend(TranslationBackend):
+     """Optional specialised fine-tune.
+     If a Plattdeutsch fine-tune is present in models_dir, registers de↔nds pair.
+     Otherwise no-op (the backend reports zero language pairs and is filtered out)."""
+
+     def __init__(
+         self,
+         models_dir: Path,
+         device: str = "auto",
+     ):
+         ...
95
+ ```
96
+
97
+ ### 3.3 `service.py`
98
+
99
+ ```python
100
+ class TranslationService:
+     name = "translation"
+     version = "1.0"
+
+     def __init__(self, config: TranslationConfig):
+         self._backends: list[TranslationBackend] = self._build_backends(config)
+
+     def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
+         """One trans.text entry per backend. params declare languages_pairs."""
+
+     async def start(self) -> None: ...
+     async def stop(self) -> None: ...
+     def health(self) -> dict: ...
+
+     async def handle_translate(self, req: RouteRequest) -> dict:
+         """CAP2 §4.10."""
116
+ ```
117
+
118
+ ### 3.4 `params_compatible` predicate
119
+
120
+ ```python
121
+ def params_compatible(offered: dict, requested: dict) -> bool:
+     if "backend" in requested and requested["backend"] != offered.get("backend"):
+         return False
+     pair = (requested.get("from"), requested.get("to"))
+     if pair[0] == "auto":
+         # auto-detect; backend must support at least one source → target pair
+         return any(t == pair[1] for (_, t) in offered.get("languages_pairs", []))
+     return pair in offered.get("languages_pairs", [])
129
+ ```
130
+
131
+ ---
132
+
133
+ ## 4. Behaviour
134
+
135
+ ### 4.1 Auto-detection
136
+
137
+ `from: "auto"`:
138
+ 1. Call `detect_language(text)`. NLLB itself expects an explicit source-language code, so the backend needs a separate language-identification step behind this method.
139
+ 2. Substitute detected lang
140
+ 3. Translate
141
+
142
+ ### 4.2 Domain hints
143
+
144
+ `domain: "everyday" | "medical" | "legal" | "emergency"` is advisory. NLLB ignores it; specialised fine-tunes may use it.
145
+
146
+ ### 4.3 Niederrhein focus
147
+
148
+ `PlattdeutschBackend` is Christof's local interest. When installed:
149
+
150
+ - Registers pairs `("de", "nds")` and `("nds", "de")`
151
+ - Optionally `("en", "nds")` if fine-tune extends
152
+ - Used by the marketplace UI's "auf Platt" button in [M08 settings](../../modules/M08-ui.md) ext
153
+
154
+ ### 4.4 Length limits
155
+
156
+ - Single request: ≤ `TRANSLATION_MAX_CHARS` (4000)
157
+ - For longer texts, callers chunk by paragraph and recombine
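Caller-side chunking by paragraph could be sketched as below. The function name, the blank-line paragraph separator, and the decision to raise on an oversized single paragraph are assumptions made for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if len(para) > max_chars:
            # A single oversized paragraph needs finer splitting (e.g. by sentence).
            raise ValueError("paragraph exceeds max_chars")
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

After translating each chunk, joining the results with `"\n\n"` restores the paragraph structure.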
158
+
159
+ ### 4.5 Batching
160
+
161
+ Internally, requests arriving within a 100 ms window are batched, up to 8 strings per forward pass, to improve GPU utilisation. Results are demultiplexed back to their callers, so batching is transparent.
162
+
163
+ ### 4.6 Caching
164
+
165
+ In-memory LRU cache `(text_hash, from, to) → result`, max 10k entries. This is a big win for the marketplace UI, which re-translates the same posts on every refresh.
166
+
167
+ ---
168
+
169
+ ## 5. Errors
170
+
171
+ | Condition | Wire code |
172
+ |-----------|-----------|
173
+ | Pair not supported by any backend | `not_found` |
174
+ | Text too long | `bad_request` |
175
+ | Detection failed | `bad_request` |
176
+ | Backend OOM | `capacity_exceeded` |
177
+
178
+ ---
179
+
180
+ ## 6. Configuration
181
+
182
+ ```python
183
+ config.translation.enabled = True
+ config.translation.backends = [
+     TranslationBackendConfig(name="nllb", model="facebook/nllb-200-distilled-600M", device="auto"),
+     TranslationBackendConfig(name="plattdeutsch", models_dir=Path("~/.hearthnet/models/plattdeutsch")),
+ ]
188
+ ```
189
+
190
+ Constants: `TRANSLATION_MAX_CHARS`.
191
+
192
+ ---
193
+
194
+ ## 7. Tests
195
+
196
+ ### Unit
197
+ - `test_descriptor_schema_validates`
198
+ - `test_params_compatible_pair_must_match`
199
+ - `test_auto_detect_substitutes_source_lang`
200
+ - `test_text_too_long_rejected`
201
+ - `test_cache_hit_returns_immediately`
202
+
203
+ ### Integration
204
+ - `test_german_to_english_quality` (BLEU above floor)
205
+ - `test_plattdeutsch_pair_registered_when_finetune_present`
206
+ - `test_marketplace_one_click_translate_end_to_end`
207
+
208
+ ---
209
+
210
+ ## 8. Cross-references
211
+
212
+ | What | Where |
+ |------|-------|
+ | `trans.text@1.0` wire | [CAP2 §4.10](../CAPABILITY_CONTRACT_v2.md) |
+ | STT translate-to-EN feature | [M19 §4.3](M19-stt-tts.md) |
+ | Marketplace one-click | [M08 ext](../../modules/M08-ui.md) |
+ | Niederrhein context | Christof's domain |
218
+
219
+ ---
220
+
221
+ ## 9. Open questions
222
+
223
+ 1. **Fine-tune in-the-loop.** A community could fine-tune the Plattdeutsch model on its own corpus over time. Reserved.
224
+ 2. **Document-level translation.** Currently per-string. Document-coherence translation (better than chunked) is Phase 3.
225
+ 3. **Glossary support.** Domain glossaries (technical terms, names) preserved across translation. Phase 2.5.
docs/p2_p3/M19-stt-tts.md ADDED
@@ -0,0 +1,331 @@
1
+ # M19: Speech I/O (STT + TTS)
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M03 (bus), M07 (blobs, for audio I/O), X04 (config), X03 (observability), `openai-whisper`, `TTS` (Coqui XTTS-v2), `edge-tts` libs
5
+ **Depended on by:** M08 UI (voice query button), M22 mobile (voice notes), M18 (STT can chain into translation)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Two capabilities:
12
+
13
+ - `stt.transcribe@1.0`: audio → text, with optional translate-to-English
+ - `tts.synthesize@1.0`: text → audio
15
+
16
+ Both services live in the same module because they share the speech domain and often pair (voice query → STT → LLM → TTS).
17
+
18
+ ---
19
+
20
+ ## 2. File layout
21
+
22
+ ```
23
+ hearthnet/services/speech/
+ ├── __init__.py
+ ├── stt_service.py
+ ├── tts_service.py
+ └── backends/
+     ├── __init__.py
+     ├── base.py            # SttBackend, TtsBackend protocols
+     ├── whisper.py         # OpenAI Whisper local
+     ├── whisper_remote.py  # HF inference API alternative
+     ├── xtts.py            # Coqui XTTS-v2 (cloned voices)
+     └── edge_tts.py        # Microsoft Edge-TTS (Christof has existing pipeline)
34
+ ```
35
+
36
+ ---
37
+
38
+ ## 3. STT: public API
39
+
40
+ ### 3.1 `backends/base.py` (STT)
41
+
42
+ ```python
43
+ @dataclass(frozen=True)
44
+ class SttSegment:
45
+ start_seconds: float
46
+ end_seconds: float
47
+ text: str
48
+ language: str
49
+ speaker: str | None # only if diarization enabled
50
+ confidence: float | None
51
+
52
+ @dataclass(frozen=True)
53
+ class SttResult:
54
+ segments: list[SttSegment]
55
+ language: str
56
+ duration_seconds: float
57
+ ms: int
58
+
59
+ class SttBackend(Protocol):
60
+ name: str
61
+ models: list[str] # "tiny" | "base" | "small" | "medium" | "large-v3"
62
+ languages_supported: list[str] # ISO 639-1
63
+ supports_diarization: bool
64
+
65
+ async def warm(self, model: str) -> None: ...
66
+ async def close(self) -> None: ...
67
+
68
+ async def transcribe(
69
+ self,
70
+ audio_bytes: bytes,
71
+ *,
72
+ model: str,
73
+ language: str | None, # "auto" handled by caller
74
+ diarize: bool,
75
+ translate_to_en: bool,
76
+ ) -> AsyncIterator[SttSegment]:
77
+ """Yields segments as they are produced. Backend may produce in big chunks
78
+ or near-realtime depending on model + hardware."""
79
+
80
+ def health(self) -> dict: ...
81
+ ```
82
+
83
+ ### 3.2 `stt_service.py`
84
+
85
+ ```python
86
+ class SttService:
87
+ name = "stt"
88
+ version = "1.0"
89
+
90
+ def __init__(self, config: SpeechConfig, blob_store: BlobStore):
91
+ ...
92
+
93
+ def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
94
+ """One stt.transcribe per (backend, model) combo."""
95
+
96
+ async def start(self) -> None: ...
97
+ async def stop(self) -> None: ...
98
+ def health(self) -> dict: ...
99
+
100
+ async def handle_transcribe(self, req: RouteRequest) -> AsyncIterator[dict]:
101
+ """CAP2 §4.11.
102
+ 1. Fetch audio blob by CID
103
+ 2. Verify duration ≤ STT_MAX_AUDIO_SECONDS
104
+ 3. Stream segments
105
+ 4. Emit done with total stats"""
106
+ ```
107
+
108
+ ### 3.3 Concrete STT backends
109
+
110
+ ```python
111
+ class WhisperBackend(SttBackend):
112
+ """Local Whisper via openai-whisper or faster-whisper."""
113
+
114
+ def __init__(self, models_dir: Path, default_model: str = "large-v3", device: str = "auto"):
115
+ ...
116
+
117
+ class WhisperRemoteBackend(SttBackend):
118
+ """HF Inference API. requires_internet=True. Used as fallback when local Whisper not available."""
119
+
120
+ def __init__(self, model: str = "openai/whisper-large-v3", token_env: str = "HF_TOKEN"):
121
+ ...
122
+ ```
123
+
124
+ ---
125
+
126
+ ## 4. TTS: public API
127
+
128
+ ### 4.1 `backends/base.py` (TTS)
129
+
130
+ ```python
131
+ @dataclass(frozen=True)
132
+ class TtsResult:
133
+ audio_format: str # "ogg_vorbis" | "mp3" | "wav"
134
+ sample_rate: int # Hz
135
+ duration_seconds: float
136
+ total_bytes: int
137
+ ms: int
138
+
139
+ class TtsBackend(Protocol):
140
+ name: str
141
+ voices: list[str]
142
+ languages_supported: list[str]
143
+ formats_supported: list[str]
144
+ cloned_voices_supported: bool
145
+
146
+ async def warm(self, voice: str) -> None: ...
147
+ async def close(self) -> None: ...
148
+
149
+ async def synthesize(
150
+ self,
151
+ text: str,
152
+ *,
153
+ voice: str,
154
+ language: str,
155
+ speed: float, # 0.5..2.0; 1.0 default
156
+ output_format: str, # "ogg_vorbis"|"mp3"|"wav"
157
+ chunk_size_bytes: int = 16384,
158
+ ) -> AsyncIterator[bytes]:
159
+ """Yields raw audio chunks."""
160
+
161
+ def health(self) -> dict: ...
162
+ ```
163
+
164
+ ### 4.2 `tts_service.py`
165
+
166
+ ```python
167
+ class TtsService:
168
+ name = "tts"
169
+ version = "1.0"
170
+
171
+ def __init__(self, config: SpeechConfig):
172
+ ...
173
+
174
+ def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
175
+ """One tts.synthesize per (backend, voice) pair (or backend-only if many voices)."""
176
+
177
+ async def start(self) -> None: ...
178
+ async def stop(self) -> None: ...
179
+ def health(self) -> dict: ...
180
+
181
+ async def handle_synthesize(self, req: RouteRequest) -> AsyncIterator[dict]:
182
+ """CAP2 §4.12.
183
+ 1. Validate text length ≤ TTS_MAX_TEXT_CHARS
184
+ 2. Pick backend and voice
185
+ 3. Stream chunks (base64 in 'chunk' frame)
186
+ 4. Emit done with metadata"""
187
+ ```
188
+
189
+ ### 4.3 Concrete TTS backends
190
+
191
+ ```python
192
+ class XttsBackend(TtsBackend):
193
+ """Coqui XTTS-v2 (Christof has the pipeline from his podcast generator).
194
+ Supports voice cloning via reference audio."""
195
+
196
+ def __init__(
197
+ self,
198
+ model: str = "tts_models/multilingual/multi-dataset/xtts_v2",
199
+ voices_dir: Path = Path("~/.hearthnet/voices"),
200
+ device: str = "auto",
201
+ ):
202
+ ...
203
+
204
+ class EdgeTtsBackend(TtsBackend):
205
+ """Microsoft Edge-TTS: requires internet, many voices, very natural.
206
+ Used as default when xtts is too slow on a node."""
207
+
208
+ def __init__(self, default_voice: str = "de-DE-KatjaNeural"):
209
+ ...
210
+ ```
211
+
212
+ ---
213
+
214
+ ## 5. Behaviour
215
+
216
+ ### 5.1 STT streaming
217
+
218
+ For long audio:
219
+ - Local Whisper produces segments incrementally (~real time on a 4090, slower on CPU)
220
+ - Service emits one SSE `segment` frame per finalised segment
221
+ - Final `done` frame includes total duration and full language detection
222
+
223
+ ### 5.2 STT max length
224
+
225
+ `STT_MAX_AUDIO_SECONDS = 300`. Longer audio: caller chunks into 5-minute segments and concatenates results. Caller's responsibility to manage cross-chunk speaker continuity.
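The caller-side handling of long audio can be sketched as follows. This is an illustrative outline only: `plan_chunks` and `merge_results` are hypothetical helper names, `Segment` is a simplified stand-in for `SttSegment`, and the actual audio slicing (for example with `ffmpeg`) is left out. The point it shows is that segment timestamps from each chunk must be shifted by the chunk's start offset when results are concatenated.

```python
from dataclasses import dataclass, replace

STT_MAX_AUDIO_SECONDS = 300  # limit stated in section 5.2


@dataclass(frozen=True)
class Segment:
    """Simplified stand-in for SttSegment (timing and text only)."""
    start_seconds: float
    end_seconds: float
    text: str


def plan_chunks(duration_seconds: float,
                max_seconds: float = STT_MAX_AUDIO_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) windows that cover the audio in order."""
    windows = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + max_seconds, duration_seconds)
        windows.append((start, end))
        start = end
    return windows


def merge_results(windows: list[tuple[float, float]],
                  per_chunk_segments: list[list[Segment]]) -> list[Segment]:
    """Shift each chunk's segments by the chunk start so times refer to the full audio."""
    merged = []
    for (offset, _end), segments in zip(windows, per_chunk_segments):
        for seg in segments:
            merged.append(replace(seg,
                                  start_seconds=seg.start_seconds + offset,
                                  end_seconds=seg.end_seconds + offset))
    return merged
```

Speaker labels are not reconciled here; as the section notes, cross-chunk speaker continuity stays with the caller.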
226
+
227
+ ### 5.3 Voice cloning (XTTS)
228
+
229
+ `XttsBackend` supports voice cloning when given a reference audio file:
230
+
231
+ ```python
232
+ config.speech.cloned_voices = [
233
+ ClonedVoiceConfig(name="hannes_v1", reference_path=Path("~/.hearthnet/voices/hannes-3s.wav"))
234
+ ]
235
+ ```
236
+
237
+ Each cloned voice is registered as a separate `voice` entry in the descriptor params. Cloning happens once at startup; serves quickly thereafter.
238
+
239
+ **Privacy note:** Voice cloning is powerful and risky. Communities SHOULD policy-restrict who can register cloned voices (suggested: `trust_required="anchor"` for voice cloning). MVP allows any member; document the risk.
240
+
241
+ ### 5.4 Audio format negotiation
242
+
243
+ - Input STT: any common format Whisper accepts (mp3, ogg, wav, m4a). Service normalises via `ffmpeg`.
244
+ - Output TTS: `ogg_vorbis` default (smallest), `mp3` widely-compatible, `wav` lossless.
245
+
246
+ ### 5.5 Edge-TTS internet dependency
247
+
248
+ `EdgeTtsBackend` requires internet. Deregistered automatically by [M09](../../modules/M09-emergency.md) when offline. XTTS local backend continues to work.
249
+
250
+ ### 5.6 STT → TTS chain (voice assistant pattern)
251
+
252
+ The voice query button in M08 UI ext:
253
+ ```
254
+ mic → audio blob via M07 → stt.transcribe → text
+ text → llm.chat → response text
+ response text → tts.synthesize → audio chunks → speaker
257
+ ```
258
+
259
+ This is composed at the UI layer, not internally in the speech services.
260
+
261
+ ### 5.7 Christof's existing pipeline reuse
262
+
263
+ Christof has an established XTTS-v2 + Edge-TTS podcast generator pipeline. The `XttsBackend` and `EdgeTtsBackend` are designed to be drop-ins for that pipeline, sharing the same models directory.
264
+
265
+ ---
266
+
267
+ ## 6. Errors
268
+
269
+ | Condition | Wire code |
270
+ |-----------|-----------|
271
+ | Audio > STT_MAX_AUDIO_SECONDS | `bad_request` |
272
+ | Text > TTS_MAX_TEXT_CHARS | `bad_request` |
273
+ | Unknown voice | `not_found` |
274
+ | Audio decode failed (corrupt blob) | `bad_request` |
275
+ | Backend GPU OOM | `capacity_exceeded` |
276
+
277
+ ---
278
+
279
+ ## 7. Configuration
280
+
281
+ ```python
282
+ config.speech.enabled = True
283
+ config.speech.stt_backends = [
284
+ SttBackendConfig(name="whisper", default_model="large-v3", device="auto"),
285
+ ]
286
+ config.speech.tts_backends = [
287
+ TtsBackendConfig(name="xtts", voices_dir=Path("~/.hearthnet/voices")),
288
+ TtsBackendConfig(name="edge_tts", default_voice="de-DE-KatjaNeural"),
289
+ ]
290
+ config.speech.cloned_voices = [] # list[ClonedVoiceConfig]
291
+ ```
292
+
293
+ Constants: `STT_MAX_AUDIO_SECONDS`, `TTS_MAX_TEXT_CHARS`.
294
+
295
+ ---
296
+
297
+ ## 8. Tests
298
+
299
+ ### Unit
300
+ - `test_stt_descriptor_per_model`
301
+ - `test_tts_descriptor_per_voice`
302
+ - `test_stt_max_duration_rejected`
303
+ - `test_tts_max_length_rejected`
304
+
305
+ ### Integration
306
+ - `test_whisper_transcribes_de_audio` (test asset)
307
+ - `test_xtts_synthesises_then_decodes_to_correct_duration`
308
+ - `test_voice_chain_stt_llm_tts` (end-to-end)
309
+ - `test_edge_tts_deregistered_when_offline`
310
+
311
+ ---
312
+
313
+ ## 9. Cross-references
314
+
315
+ | What | Where |
+ |------|-------|
+ | `stt.transcribe@1.0` wire | [CAP2 §4.11](../CAPABILITY_CONTRACT_v2.md) |
+ | `tts.synthesize@1.0` wire | [CAP2 §4.12](../CAPABILITY_CONTRACT_v2.md) |
+ | Voice query UI | M08 ext |
+ | Mobile voice notes | [M22 §4](M22-mobile-native.md) |
+ | Translation chain | [M18](M18-translation.md) |
+ | Emergency dereg for internet-bound backends | [M09 §5.2](../../modules/M09-emergency.md) |
323
+
324
+ ---
325
+
326
+ ## 10. Open questions
327
+
328
+ 1. **Streaming STT (mic input → live caption)**: Phase 2.5. Requires WebSocket and a different backend init pattern.
+ 2. **Real-time TTS (sub-100ms first audio)**: XTTS is 500ms+; piper-tts is fast but limited voices. Phase 3.
+ 3. **Speaker enrollment**: explicit "this is who I am" speech sample so diarization can label by name. Phase 2.5.
+ 4. **Audio at-rest privacy**: should voice notes be E2E? [M23](M23-e2e-encryption.md) supports it; default ON for chat attachments.
docs/p2_p3/M20-vision.md ADDED
@@ -0,0 +1,369 @@
1
+ # M20: Vision Services
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M03 (bus), M07 (blobs), M04 (LLM, extended), X04 (config), X03 (observability)
5
+ **Depended on by:** M08 UI (image describe in ask tab; generate in tools), M21 tools (vision used by tool-augmented LLM), M17 OCR (Florence-2 OCR mode shared)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Two capability families:
12
+
13
+ - `img.describe@1.0`: given an image CID, produce a caption, tags, object list, or OCR
+ - `img.generate@1.0`: given a prompt, generate an image
15
+
16
+ Plus: extend the LLM `llm.chat@2.0` request schema with **multimodal content** (text + image_cid in messages). The multimodal path goes through M04 backends that declare `modalities: ["text", "vision"]`. M20 is responsible for providing the vision backends those LLMs depend on.
17
+
18
+ Christof's existing pipelines: Florence-2 (describe), FLUX.1-dev with LoRAs (generate), MiniCPM-V (multimodal). All wired in.
19
+
20
+ ---
21
+
22
+ ## 2. File layout
23
+
24
+ ```
25
+ hearthnet/services/image/
+ ├── __init__.py
+ ├── describe_service.py
+ ├── generate_service.py
+ └── backends/
+     ├── __init__.py
+     ├── base.py              # ImageDescribeBackend, ImageGenerateBackend
+     ├── florence2.py         # Microsoft Florence-2 large
+     ├── minicpm_v.py         # OpenBMB MiniCPM-V (also usable for chat-with-vision)
+     ├── flux.py              # black-forest-labs FLUX.1-dev with LoRA support
+     └── stable_diffusion.py  # Optional SD-XL fallback
36
+ ```
37
+
38
+ ---
39
+
40
+ ## 3. Public API: describe
41
+
42
+ ### 3.1 `backends/base.py`
43
+
44
+ ```python
45
+ @dataclass(frozen=True)
46
+ class ImageDescription:
47
+ caption: str
48
+ detailed_caption: str | None
49
+ tags: list[str]
50
+ objects: list[dict] # [{label, bbox, confidence}]
51
+ ocr_text: str | None
52
+ language: str
53
+ ms: int
54
+
55
+ class ImageDescribeBackend(Protocol):
56
+ name: str
57
+ tasks_supported: list[str] # subset of {"caption","detailed_caption","ocr","objects","tags"}
58
+ languages: list[str]
59
+ max_pixels: int
60
+
61
+ async def warm(self) -> None: ...
62
+ async def close(self) -> None: ...
63
+
64
+ async def describe(
65
+ self,
66
+ image_bytes: bytes,
67
+ *,
68
+ task: str,
69
+ language: str = "en",
70
+ ) -> ImageDescription: ...
71
+
72
+ def health(self) -> dict: ...
73
+ ```
74
+
75
+ ### 3.2 `describe_service.py`
76
+
77
+ ```python
78
+ class ImageDescribeService:
79
+ name = "image.describe"
80
+ version = "1.0"
81
+
82
+ def __init__(self, config: VisionConfig, blob_store: BlobStore):
83
+ ...
84
+
85
+ def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
86
+ """One img.describe per backend. Params include backend name and tasks_supported."""
87
+
88
+ async def start(self) -> None: ...
89
+ async def stop(self) -> None: ...
90
+ def health(self) -> dict: ...
91
+
92
+ async def handle_describe(self, req: RouteRequest) -> dict:
93
+ """CAP2 §4.13."""
94
+ ```
95
+
96
+ ### 3.3 Concrete describe backends
97
+
98
+ ```python
99
+ class Florence2Backend(ImageDescribeBackend):
100
+ def __init__(self, model: str = "microsoft/Florence-2-large", device: str = "auto"):
101
+ ...
102
+
103
+ class MinicpmVBackend(ImageDescribeBackend):
104
+ """Used both standalone (img.describe) and as an LLM vision backend (M04 extension)."""
105
+ def __init__(self, model: str = "openbmb/MiniCPM-V-2_6", device: str = "auto"):
106
+ ...
107
+ ```
108
+
109
+ ---
110
+
111
+ ## 4. Public API: generate
112
+
113
+ ### 4.1 `backends/base.py`
114
+
115
+ ```python
116
+ @dataclass(frozen=True)
117
+ class GenerationResult:
118
+ image_bytes: bytes
119
+ width: int
120
+ height: int
121
+ format: str # "png" | "webp" | "jpg"
122
+ seed: int
123
+ ms: int
124
+
125
+ class ImageGenerateBackend(Protocol):
126
+ name: str
127
+ models: list[str]
128
+ loras_available: list[str]
129
+ max_resolution: tuple[int, int]
130
+ min_resolution: tuple[int, int]
131
+ supports_negative_prompt: bool
132
+
133
+ async def warm(self, model: str) -> None: ...
134
+ async def close(self) -> None: ...
135
+
136
+ async def generate(
137
+ self,
138
+ prompt: str,
139
+ *,
140
+ model: str,
141
+ lora: str | None,
142
+ negative_prompt: str | None,
143
+ width: int,
144
+ height: int,
145
+ steps: int,
146
+ seed: int | None,
147
+ progress_cb: Callable[[int, int], None] | None = None,
148
+ ) -> GenerationResult: ...
149
+
150
+ def health(self) -> dict: ...
151
+ ```
152
+
153
+ ### 4.2 `generate_service.py`
154
+
155
+ ```python
156
+ class ImageGenerateService:
157
+ name = "image.generate"
158
+ version = "1.0"
159
+
160
+ def __init__(self, config: VisionConfig, blob_store: BlobStore):
161
+ ...
162
+
163
+ def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
164
+ """One img.generate per (backend, model) combo. params declare loras_available."""
165
+
166
+ async def start(self) -> None: ...
167
+ async def stop(self) -> None: ...
168
+ def health(self) -> dict: ...
169
+
170
+ async def handle_generate(self, req: RouteRequest) -> AsyncIterator[dict]:
171
+ """CAP2 §4.14.
172
+ 1. Generate (streaming progress frames)
173
+ 2. Store resulting image as blob
174
+ 3. Emit done with image_cid"""
175
+ ```
176
+
177
+ ### 4.3 Concrete generate backends
178
+
179
+ ```python
180
+ class FluxBackend(ImageGenerateBackend):
181
+ """FLUX.1-dev with LoRA support; Christof's existing pipeline."""
182
+ def __init__(
183
+ self,
184
+ model: str = "black-forest-labs/FLUX.1-dev",
185
+ device: str = "auto",
186
+ loras_dir: Path = Path("~/.hearthnet/loras"),
187
+ ):
188
+ ...
189
+
190
+ class StableDiffusionBackend(ImageGenerateBackend):
191
+ """SD-XL fallback for nodes with smaller GPUs."""
192
+ def __init__(self, model: str = "stabilityai/stable-diffusion-xl-base-1.0", device: str = "auto"):
193
+ ...
194
+ ```
195
+
196
+ ---
197
+
198
+ ## 5. Multimodal LLM extension (M04 hook)
199
+
200
+ ### 5.1 Message content array
201
+
202
+ In `llm.chat@2.0` (CAP2 §4.23), each `messages[].content` may be a list:
203
+
204
+ ```json
205
+ [
206
+ {"type": "text", "text": "Was ist auf diesem Bild?"},
207
+ {"type": "image", "image_cid": "blake3:..."}
208
+ ]
209
+ ```
210
+
211
+ Backends that declare `modalities: ["text", "vision"]` in their descriptor must handle the array form. Backends that don't either:
212
+ - Are skipped by the router (`params_compatible` returns False when message contains image and `modalities ⊉ {"vision"}`)
213
+ - Or fall back: extract text content only, ignore images (worse UX; not recommended)
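The router-side skip could be expressed along these lines. This is a sketch under assumptions: the real `ParamsPredicate` signature is not shown in this document, so the two-dict signature below and the helper `required_modalities` are hypothetical.

```python
def required_modalities(messages: list[dict]) -> set[str]:
    """Collect the modalities a request needs by scanning content arrays."""
    needed = {"text"}
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # array form: list of typed parts
            for part in content:
                if part.get("type") == "image":
                    needed.add("vision")
    return needed


def params_compatible(descriptor_params: dict, request_input: dict) -> bool:
    """True if the backend's declared modalities cover what the request needs."""
    # Assumption: a backend that declares nothing is treated as text-only.
    offered = set(descriptor_params.get("modalities", ["text"]))
    return required_modalities(request_input.get("messages", [])) <= offered
```

A text-only backend is then filtered out for any request whose messages carry an `image` part, while plain-string content keeps working unchanged.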
214
+
215
+ ### 5.2 Vision-capable backends in M04
216
+
217
+ These M04 backends gain a `modalities: ["text","vision"]` declaration in Phase 2:
218
+
219
+ | Backend | Vision support |
220
+ |---------|----------------|
221
+ | `MinicpmVBackend` (M04 entry; same model as M20's describe) | Yes; native multimodal |
222
+ | `AnthropicApiBackend` | Yes; Claude vision via API |
223
+ | `OpenAiApiBackend` | Yes; GPT-4V |
224
+ | `Llava` (new, optional) | Yes; LLaVA via llama.cpp |
225
+ | `LlamaCppBackend` | Yes if model is multimodal (LLaVA-format) |
226
+ | `OllamaBackend` | Yes for vision models |
227
+ | Others | No |
228
+
229
+ The `M04.LlmService._build_backends` constructs these with their vision flag.
230
+
231
+ ### 5.3 Image preprocessing
232
+
233
+ For LLM context, images are:
234
+ - Loaded from blob store via CID
235
+ - Resized to backend's preferred resolution (e.g. 1024×1024 for MiniCPM-V)
236
+ - Encoded base64 or sent as bytes per backend's protocol
237
+
238
+ This is opaque to the caller: the multimodal `messages` array is the contract.
239
+
240
+ ---
241
+
242
+ ## 6. Behaviour
243
+
244
+ ### 6.1 Image describe lifecycle
245
+
246
+ ```
247
+ caller → bus.call("img.describe", (1,0), {input:{image_cid:..., task:"detailed_caption"}})
+        → ImageDescribeService.handle_describe
+        → blob_store.read_blob_bytes(image_cid)
+        → backend.describe(bytes, task=...)
+        → return ImageDescription serialised
252
+ ```
253
+
254
+ ### 6.2 Image generate lifecycle
255
+
256
+ ```
257
+ caller → bus.stream("img.generate", (1,0), {input:{prompt:"...", steps:20}})
+        → ImageGenerateService.handle_generate
+        → backend.generate(...) with progress_cb
+        → for each step: emit 'progress' frame
+        → on completion: blob_store.write_blob(image)
+        → emit 'done' frame with image_cid
263
+ ```
264
+
265
+ ### 6.3 Safety filters
266
+
267
+ - `img.generate` prompts pass through a configurable safety filter list (regex blocklist + optional LLM-based classifier)
268
+ - Generation of identifiable persons is blocked by default (configurable: `config.vision.allow_identifiable_persons`)
269
+ - NSFW filter on output (Stable Diffusion has built-in; FLUX needs separate model)
270
+ - Failed safety → `bad_request` with `reason: "safety_filter"`
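The regex part of the prompt filter might be sketched as below. This is a hedged illustration: `load_blocklist` and `check_prompt` are hypothetical names, the one-pattern-per-line file format with `#` comments is an assumption, and the error dict only mirrors the wire code and reason named above rather than the real error envelope.

```python
import re
from typing import Iterable, Optional


def load_blocklist(lines: Iterable[str]) -> list:
    """Compile one regex per non-empty, non-comment line (assumed file format)."""
    patterns = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            patterns.append(re.compile(stripped, re.IGNORECASE))
    return patterns


def check_prompt(prompt: str, blocklist: list) -> Optional[dict]:
    """Return an error payload if any pattern matches, otherwise None."""
    for pattern in blocklist:
        if pattern.search(prompt):
            return {"code": "bad_request", "reason": "safety_filter"}
    return None
```

The optional LLM-based classifier and the output-side NSFW filter would run as separate stages and are not covered by this sketch.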
271
+
272
+ ### 6.4 LoRA management
273
+
274
+ `FluxBackend.loras_available` lists LoRAs found in `loras_dir`. Caller can request a specific LoRA in `params.lora`. Loading a LoRA takes a few seconds on first use; cached thereafter.
275
+
276
+ Christof's existing LoRAs (local-style, sketches, etc.) drop into the `loras_dir`. The backend auto-discovers them.
277
+
278
+ ### 6.5 GPU pressure
279
+
280
+ Vision models are heavy. Recommended:
281
+
282
+ - One Florence-2 instance per node (always-loaded)
283
+ - FLUX/SD only loaded on-demand (warm on first request, kept hot for 5 minutes)
284
+ - `max_concurrent = 1` for FLUX; `2` for describe backends
285
+
286
+ These limits are declared in the capability descriptor so the bus throttles correctly.
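To make the effect of a `max_concurrent` limit concrete, here is a small sketch using an `asyncio.Semaphore`. It is not the bus's throttling code, which this document does not show; `ConcurrencyGate` and `demo` are hypothetical names, and the fake job only sleeps to simulate a generation.

```python
import asyncio


class ConcurrencyGate:
    """Allows at most max_concurrent jobs to run at once; records the observed peak."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.active = 0
        self.peak = 0

    async def run(self, job):
        async with self._sem:  # waits here while the limit is reached
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1


async def demo(max_concurrent: int, n_jobs: int) -> int:
    """Run n_jobs fake generations through the gate and return the peak concurrency."""
    gate = ConcurrencyGate(max_concurrent)

    async def fake_generate() -> str:
        await asyncio.sleep(0.01)  # stands in for GPU work
        return "image"

    await asyncio.gather(*(gate.run(fake_generate) for _ in range(n_jobs)))
    return gate.peak
```

With `max_concurrent = 1`, as recommended for FLUX, queued requests simply wait their turn instead of competing for GPU memory.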
287
+
288
+ ### 6.6 Multimodal LLM call routing
289
+
290
+ When a user sends a multimodal message:
291
+
292
+ ```
293
+ UI → bus.stream("llm.chat", (2,0), {input:{messages:[{role:"user", content:[{type:"text",text:"..."},{type:"image",image_cid:"..."}]}]}})
+    → Router.route filters candidates to those with modalities ⊇ {"vision"}
+    → picks best (e.g. MinicpmV local, fall back to Anthropic API)
+    → backend handles base64 / image-token-injection internally
297
+ ```
298
+
299
+ If no vision-capable backend is online, the call returns `not_found` with a helpful `alt_capabilities` hint pointing to describe-then-text-only fallback (UI can offer this).
300
+
301
+ ---
302
+
303
+ ## 7. Errors
304
+
305
+ | Condition | Wire code |
306
+ |-----------|-----------|
307
+ | Unknown task | `bad_request` |
308
+ | Image too large | `bad_request` |
309
+ | Prompt safety violation | `bad_request` (reason=safety_filter) |
310
+ | LoRA not found | `not_found` |
311
+ | GPU OOM | `capacity_exceeded` |
312
+ | Backend missing for requested task | `not_implemented` |
313
+
314
+ ---
315
+
316
+ ## 8. Configuration
317
+
318
+ ```python
319
+ config.vision.enabled = True
320
+ config.vision.describe_backends = [
321
+ DescribeBackendConfig(name="florence2", model="microsoft/Florence-2-large", device="auto"),
322
+ DescribeBackendConfig(name="minicpm_v", model="openbmb/MiniCPM-V-2_6", device="auto"),
323
+ ]
324
+ config.vision.generate_backends = [
325
+ GenerateBackendConfig(name="flux", model="black-forest-labs/FLUX.1-dev",
326
+ loras_dir=Path("~/.hearthnet/loras"), device="auto"),
327
+ ]
328
+ config.vision.allow_identifiable_persons = False
329
+ config.vision.safety_blocklist_file = None # optional regex file
330
+ ```
331
+
332
+ ---
333
+
334
+ ## 9. Tests
335
+
336
+ ### Unit
337
+ - `test_describe_descriptor_per_backend`
338
+ - `test_safety_filter_blocks_known_pattern`
339
+ - `test_lora_discovery`
340
+ - `test_oom_returns_capacity_exceeded`
341
+
342
+ ### Integration
343
+ - `test_florence2_caption_sample` (test image)
344
+ - `test_flux_generate_with_lora_progress_frames`
345
+ - `test_multimodal_llm_routes_to_vision_backend`
346
+ - `test_describe_then_text_fallback_when_no_vision_llm`
347
+
348
+ ---
349
+
350
+ ## 10. Cross-references
351
+
352
+ | What | Where |
+ |------|-------|
+ | `img.*` wire | [CAP2 §4.13–4.14](../CAPABILITY_CONTRACT_v2.md) |
+ | Multimodal `llm.chat@2.0` | [CAP2 §4.23](../CAPABILITY_CONTRACT_v2.md) |
+ | LLM service extension | M04 (extended in Phase 2; see [00-OVERVIEW §1](../00-OVERVIEW.md)) |
+ | OCR overlap | [M17 §3.2 florence_ocr](M17-ocr.md) |
+ | Christof's pipelines | external, this is the integration |
359
+
360
+ ---
361
+
362
+ ## 11. Open questions
363
+
364
+ 1. **Video**: Phase 3 considers `video.describe` and `video.generate` (LTX-Video). Not in Phase 2.
+ 2. **Image editing (inpainting)**: Phase 2.5: `img.edit@1.0` capability. Reserved.
+ 3. **Control nets (depth, edge, pose)**: Phase 2.5.
+ 4. **3D generation**: Phase 3 with TripoSR or similar.
+ 5. **Safety filter quality**: regex blocklist is weak. An LLM-as-judge classifier is better but adds latency. Configurable; default off.
+ 6. **LoRA stacking**: caller specifies multiple `loras: ["...","..."]`. Implementable but adds attack surface (prompt-LoRA combos). Defer.
docs/p2_p3/M21-tool-calls.md ADDED
@@ -0,0 +1,317 @@
1
+ # M21: Tool Calls (LLM Tool-Use)
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M03 (bus), M04 (LLM, extended), X06 (WebSocket, for in-stream tool loops), X04 (config), X03 (observability)
5
+ **Depended on by:** M08 UI (ask tab gains tool-augmented mode), M22 mobile (same), any future agent applications
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Let the LLM call other HearthNet capabilities mid-generation. Specifically:
12
+
13
+ - Declare `tools` in the `llm.chat@2.0` request
14
+ - Receive `tool_call_delta` and `tool_call` stream frames from the LLM
15
+ - Execute each tool call against the bus
16
+ - Feed results back into the LLM via `tool_result` (WebSocket) or via a follow-up `llm.chat` call (SSE)
17
+ - Provide `llm.tools.call@1.0` as a convenience that wraps the bus dispatch
18
+
19
+ This module is **a protocol + helper**, not a service in the usual sense. The actual LLM work lives in M04; M21 documents how the tool flow is structured and provides utilities both sides need.
20
+
21
+ ---
22
+
23
+ ## 2. File layout
24
+
25
+ ```
26
+ hearthnet/services/llm/
27
+ └── tools.py            # ToolDefinition, ToolCall, ToolResult, ToolExecutor
28
+
29
+ hearthnet/services/auth/ (already in M16)
30
+ # the auth service also registers llm.tools.call@1.0 as a wrapper capability
31
+ ```
32
+
33
+ `tools.py` is small (~250 LOC). It lives inside the LLM service package because tool usage is intrinsic to chat completion.
34
+
35
+ ---
36
+
37
+ ## 3. Public API
38
+
39
+ ### 3.1 Data types
40
+
41
+ ```python
42
+ # hearthnet/services/llm/tools.py
43
+ from dataclasses import dataclass
44
+ from typing import AsyncIterator, Awaitable, Callable
45
+
46
+ @dataclass(frozen=True)
47
+ class ToolDefinition:
48
+ """A tool the caller offers to the LLM.
49
+ Translated by the LLM backend into its native tool format."""
50
+ name: str # short identifier visible to the LLM
51
+ description: str # human-readable, drives LLM selection
52
+ parameters_schema: dict # JSON Schema for arguments
53
+ bound_capability: str | None # if set, ToolExecutor dispatches via bus
54
+ bound_version: tuple[int, int] | None
55
+ side_effects: bool # "is this a write?"; affects retry semantics
56
+
57
+ @dataclass(frozen=True)
58
+ class ToolCall:
59
+ """A request from the LLM to execute a tool."""
60
+ id: str # opaque, generated by the LLM
61
+ name: str
62
+ arguments: dict # validated against parameters_schema
63
+
64
+ @dataclass(frozen=True)
65
+ class ToolResult:
66
+ """The result of executing a ToolCall, fed back to the LLM."""
67
+ tool_call_id: str
68
+ name: str
69
+ content: str | dict # serialisable; if dict, becomes JSON
70
+ is_error: bool
71
+ ```
72
+
73
+ ### 3.2 `ToolExecutor`
74
+
75
+ ```python
76
+ class ToolExecutor:
77
+ """Wraps the orchestration loop: forward LLM tool calls to the bus,
78
+ collect results, re-inject into the LLM."""
79
+
80
+ def __init__(
81
+ self,
82
+ bus: CapabilityBus,
83
+ tools: list[ToolDefinition],
84
+ *,
85
+ max_iterations: int = 6,
86
+ per_tool_timeout_seconds: int = 30,
87
+ ):
88
+ ...
89
+
90
+ @property
91
+ def native_definitions(self) -> list[dict]:
92
+ """Returns tools in the request schema's format (CAP2 §4.23 input.tools)."""
93
+
94
+ async def dispatch(self, call: ToolCall) -> ToolResult:
95
+ """Validate call.arguments against the tool's parameters_schema.
96
+ If bound_capability: bus.call(bound_capability, bound_version, {input: call.arguments}).
97
+ Returns ToolResult. Catches and surfaces errors as is_error=True."""
98
+
99
+ async def run_chat_with_tools(
100
+ self,
101
+ chat_request_body: dict,
102
+ *,
103
+ stream_to: Callable[[dict], Awaitable[None]] | None = None,
104
+ ) -> dict:
105
+ """Orchestrator helper. Loops:
106
+ 1. call bus.stream("llm.chat", (2,0), body)
107
+ 2. accumulate text + tool_call frames
108
+ 3. on each completed tool_call frame: dispatch, append tool_result message
109
+ 4. re-call llm.chat with extended messages
110
+ 5. stop when no more tool calls OR max_iterations reached
111
+ Returns the final assistant message."""
112
+ ```
113
+
114
+ ### 3.3 Wire-level frames (recap of CAP2 §4.23 and §5.1 with the tool flow)
115
+
116
+ LLM emits:
117
+
118
+ ```
119
+ event: token
120
+ data: {"text":"I'll search "}
121
+
122
+ event: tool_call_delta
123
+ data: {"id":"tc_1","name":"rag.query","arguments_delta":"{\"query\":\""}
124
+
125
+ event: tool_call_delta
126
+ data: {"id":"tc_1","arguments_delta":"Regenwasser\""}
127
+
128
+ event: tool_call
129
+ data: {"id":"tc_1","name":"rag.query","arguments":{"query":"Regenwasser","corpus":"niederrhein-emergency"}}
130
+ ```
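On the receiving side, `tool_call_delta` frames can be buffered per call id, for example to show progress or to cross-check the final `tool_call` frame. The sketch below is an assumption-laden illustration: `ToolCallAssembler` is a hypothetical helper, and it assumes the `arguments_delta` fragments of one call concatenate to a complete JSON document, which the abbreviated frames above do not demonstrate.

```python
import json


class ToolCallAssembler:
    """Buffers tool_call_delta frames by id and parses the joined arguments."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}
        self._buffers: dict[str, list[str]] = {}

    def feed_delta(self, frame: dict) -> None:
        call_id = frame["id"]
        if "name" in frame:  # the name may only appear on the first delta
            self._names[call_id] = frame["name"]
        self._buffers.setdefault(call_id, []).append(frame.get("arguments_delta", ""))

    def finish(self, call_id: str) -> dict:
        """Join the buffered fragments and decode them as JSON arguments."""
        raw = "".join(self._buffers.pop(call_id))
        return {"id": call_id, "name": self._names.pop(call_id), "arguments": json.loads(raw)}
```

Since the `tool_call` frame already carries the full `arguments` object, a caller that does not need incremental display can ignore the deltas entirely.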
131
+
132
+ Caller dispatches and replies (over WebSocket OR by re-calling `llm.chat` with the tool result added to messages):
133
+
134
+ WebSocket:
135
+ ```
136
+ client → {"type":"tool_result","tool_call_id":"tc_1","body":{"chunks":[...]}}
137
+ ```
138
+
139
+ SSE fallback:
140
+ ```
141
+ caller re-calls llm.chat with messages = original_messages + [
142
+ {"role":"assistant","content":"...","tool_calls":[{"id":"tc_1","name":"rag.query","arguments":{...}}]},
143
+ {"role":"tool","tool_call_id":"tc_1","content":"<JSON of tool result>"}
144
+ ]
145
+ ```
146
+
147
+ Both paths converge: LLM continues and eventually emits `done`.
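A caller has to fold the three frame types above into text plus completed tool calls. The following is a minimal sketch of that accumulation, not the HearthNet implementation: `PendingToolCall` and `FrameAccumulator` are hypothetical names, and treating the final `tool_call` frame as authoritative over the concatenated deltas is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingToolCall:
    """Buffers the frames seen so far for one tool call id (hypothetical helper)."""
    id: str
    name: Optional[str] = None
    arguments_text: str = ""           # concatenated arguments_delta fragments
    arguments: Optional[dict] = None   # set once the complete tool_call frame arrives


@dataclass
class FrameAccumulator:
    text: str = ""
    calls: Dict[str, PendingToolCall] = field(default_factory=dict)

    def feed(self, event: str, data: dict) -> None:
        """Consume one parsed SSE/WebSocket frame."""
        if event == "token":
            self.text += data["text"]
        elif event == "tool_call_delta":
            call = self.calls.setdefault(data["id"], PendingToolCall(id=data["id"]))
            call.name = data.get("name", call.name)
            call.arguments_text += data.get("arguments_delta", "")
        elif event == "tool_call":
            # Assumption: the complete frame wins; deltas are only for progress display.
            call = self.calls.setdefault(data["id"], PendingToolCall(id=data["id"]))
            call.name = data["name"]
            call.arguments = data["arguments"]

    def completed(self) -> List[PendingToolCall]:
        """Tool calls that are ready to dispatch."""
        return [c for c in self.calls.values() if c.arguments is not None]
```

Note that the deltas in the example above concatenate to a JSON fragment without a closing brace, which is why the sketch does not try to parse `arguments_text` itself.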
148
+
149
+ ---
150
+
151
+ ## 4. Behaviour
152
+
153
+ ### 4.1 Tool selection heuristics
154
+
155
+ The LLM picks tools based on:
156
+ - Tool descriptions (descriptive English/German helps)
157
+ - `tool_choice` parameter:
158
+ - `"auto"` (default): LLM decides
159
+ - `"none"`: forbid tool use even if tools are declared
160
+ - `"required"`: must call at least one tool
161
+ - `{"name":"rag.query"}`: must call specifically this tool
162
+
163
+ Backends translate these to their native API.
164
+
165
+ ### 4.2 Built-in tools
166
+
167
+ When `ToolExecutor` is instantiated by the UI, it can auto-include a set of standard tools bound to common bus capabilities:
168
+
169
+ | Tool name | Bound to | Use case |
170
+ |-----------|----------|----------|
171
+ | `search_corpus` | `rag.query@1.0` | Search a corpus |
172
+ | `list_corpora` | `rag.list_corpora@1.0` | What's available |
173
+ | `translate` | `trans.text@1.0` | Translate snippets |
174
+ | `find_neighbour` | (custom: lists peers in the current community) | "Wer ist da?" ("Who is there?") |
175
+ | `list_marketplace` | `market.list@1.0` | Active posts |
176
+ | `describe_image` | `img.describe@1.0` | Inspect uploaded images |
177
+ | `transcribe_audio` | `stt.transcribe@1.0` | Voice-input chained |
178
+
179
+ These are *suggested defaults*. Real applications pick what fits.
180
+
181
+ ### 4.3 Validation
182
+
183
+ `ToolExecutor.dispatch` validates `call.arguments` against `parameters_schema` before calling the bus. Invalid args → `ToolResult(is_error=True, content="invalid_arguments: ...")`. The LLM sees the error and typically self-corrects.
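The validation step can be sketched with the standard library alone. This is an illustration under stated assumptions: it checks only `required` fields and top-level `type` keywords rather than full JSON Schema, and `SketchToolResult`, `validate_arguments`, and `check_call` are hypothetical stand-ins, not the spec's `ToolResult` or `dispatch`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Mapping from JSON Schema type names to Python types (subset, for illustration).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


@dataclass
class SketchToolResult:
    """Stand-in for ToolResult; only content and is_error come from the spec text."""
    tool_call_id: str
    content: str
    is_error: bool = False


def validate_arguments(arguments: dict, schema: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the arguments pass."""
    problems = []
    for key in schema.get("required", []):
        if key not in arguments:
            problems.append(f"missing required field '{key}'")
    for key, value in arguments.items():
        spec = schema.get("properties", {}).get(key, {})
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"field '{key}' should be {spec['type']}")
    return problems


def check_call(tool_call_id: str, arguments: dict, schema: dict) -> Optional[SketchToolResult]:
    """Return an error result for invalid arguments, or None if dispatch may proceed."""
    problems = validate_arguments(arguments, schema)
    if problems:
        return SketchToolResult(tool_call_id, "invalid_arguments: " + "; ".join(problems), is_error=True)
    return None
```

Returning the problem list in `content` gives the LLM enough detail to correct its next call; a production version would more likely delegate to a full JSON Schema validator.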
184
+
185
+ ### 4.4 Iteration limits
186
+
187
+ `max_iterations` (default 6) prevents runaway tool loops. After the limit, `ToolExecutor.run_chat_with_tools` injects a final `tool` message saying "iteration limit reached; finalise your answer" and forces `tool_choice="none"` on the next call.
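A minimal sketch of this limit, assuming two hypothetical callables: `call_llm(messages, tool_choice)` returns an assistant message dict and `dispatch(call)` returns the tool result as a string. It is not the body of `run_chat_with_tools`, only the control flow described above.

```python
import asyncio


async def run_with_limit(call_llm, dispatch, messages, max_iterations=6):
    """Loop LLM turns and tool dispatches until no tool calls remain or the limit is hit."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_iterations):
        reply = await call_llm(messages, "auto")
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply  # the model answered without requesting tools
        for call in calls:
            content = await dispatch(call)
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    # Limit reached: tell the model to wrap up and forbid further tool use.
    messages.append({"role": "tool", "content": "iteration limit reached; finalise your answer"})
    return await call_llm(messages, "none")
```

With `max_iterations=6` the model is called at most seven times: six turns that may use tools and one forced final answer.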
188
+
189
+ ### 4.5 Side-effect tools
190
+
191
+ Tools where `side_effects: True` (like `market.post`, `chat.send`) require explicit confirmation. By default, `ToolExecutor` raises `ToolError("requires_confirmation")` on side-effect calls, expecting the orchestrator (UI) to present a confirmation dialog.
192
+
193
+ UI flow:
194
+ ```
195
+ LLM emits tool_call to market.post
196
+ ToolExecutor sees side_effects=True
197
+ emits a 'confirmation_required' frame upstream
198
+ UI shows "Allow LLM to post this?"
199
+ user clicks yes → orchestrator calls ToolExecutor.dispatch_confirmed(call)
200
+ ```
201
+
202
+ ### 4.6 Parallel tool calls
203
+
204
+ Some LLMs (e.g. Claude, GPT-4) can emit multiple `tool_call` frames in one turn. `ToolExecutor` dispatches them in parallel, with at most `config.llm.tools_max_parallel` (default 4) calls in flight. Results are submitted together in the next LLM turn.
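Bounded parallel dispatch is commonly done with a semaphore around each call. The sketch below assumes a hypothetical async `dispatch(call)` callable and shows only the concurrency bound, not timeouts or error handling.

```python
import asyncio


async def dispatch_all(calls, dispatch, max_parallel=4):
    """Run dispatch(call) for every call, with at most max_parallel running at once."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(call):
        async with semaphore:
            return await dispatch(call)

    # gather() returns results in the same order as the input calls,
    # so results can be matched back to their tool_call ids by position.
    return await asyncio.gather(*(guarded(call) for call in calls))
```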
205
+
206
+ ### 4.7 Tool call composition (tools that call tools)
207
+
208
+ A `bound_capability` may itself be `llm.tools.call@1.0`. This allows defining higher-level tools as compositions of bus capabilities + LLM reasoning. Recursion limit = `max_iterations`.
209
+
210
+ ### 4.8 Trust and tokens
211
+
212
+ Tool dispatch goes through the bus and inherits the caller's trust level, so the LLM cannot escalate privileges by emitting a tool call. For cross-community tool calls, the caller must hold an appropriate token (M16).
213
+
214
+ ### 4.9 LLM backend translation
215
+
216
+ Backends translate `ToolDefinition` to their native protocol:
217
+
218
+ | Backend | Native format |
219
+ |---------|--------------|
220
+ | `AnthropicApiBackend` | Anthropic Messages tools |
221
+ | `OpenAiApiBackend` | OpenAI function calling |
222
+ | `OllamaBackend` (some models) | Ollama tool calls |
223
+ | `LlamaCppBackend` (with grammar) | JSON-Schema grammar constraint |
224
+ | `MinicpmVBackend` | MiniCPM tool format |
225
+ | `NemotronBackend` | OpenAI-compatible |
226
+ | `OpenBmbBackend` | OpenAI-compatible |
227
+ | Others | Tools ignored; backend emits a notice on `tool_choice="required"` |
228
+
229
+ This translation lives inside each backend's `chat()` method.
230
+
231
+ ---
232
+
233
+ ## 5. `llm.tools.call@1.0` capability
234
+
235
+ Convenience wrapper, used when a caller wants to invoke a bus capability as if it were a tool call, without going through the LLM.
236
+
237
+ The capability is already specified in [CAP2 §4.24](../CAPABILITY_CONTRACT_v2.md).
238
+
239
+ The handler lives in `M04.LlmService.handle_tools_call`:
240
+
241
+ ```python
242
+ async def handle_tools_call(self, req: RouteRequest) -> dict:
243
+ """1. Validate target_body against target_capability's request schema (via bus.schema)
244
+ 2. bus.call(target_capability, target_version, target_body)
245
+ 3. Return result"""
246
+ ```
247
+
248
+ Mostly used by orchestrators that want a single audit-trail capability for "tool execution".
249
+
250
+ ---
251
+
252
+ ## 6. Configuration
253
+
254
+ ```python
255
+ config.llm.tools_enabled = True
256
+ config.llm.tools_max_iterations = 6
257
+ config.llm.tools_per_tool_timeout_seconds = 30
258
+ config.llm.tools_max_parallel = 4
259
+ config.llm.tools_default_set = ["search_corpus","list_corpora","translate","list_marketplace"]
260
+ config.llm.tools_require_confirmation_for_side_effects = True
261
+ ```
262
+
263
+ ---
264
+
265
+ ## 7. Errors
266
+
267
+ | Condition | Wire code |
268
+ |-----------|-----------|
269
+ | Tool not in declared set | `bad_request` |
270
+ | Tool arguments fail schema | `bad_request` |
271
+ | Tool execution timed out | `timeout` |
272
+ | Tool returned `internal_error` | propagated as `internal_error` |
273
+ | Iteration limit reached | (graceful: final answer forced) |
274
+ | Caller's token doesn't cover bound capability | `token_scope_insufficient` |
275
+
276
+ ---
277
+
278
+ ## 8. Tests
279
+
280
+ ### Unit
281
+ - `test_tool_definition_to_native_format` (per backend)
282
+ - `test_dispatch_validates_arguments`
283
+ - `test_side_effect_tool_requires_confirmation`
284
+ - `test_iteration_limit_forces_finalisation`
285
+ - `test_parallel_tool_calls_collected`
286
+
287
+ ### Integration
288
+ - `test_search_corpus_tool_used_for_grounded_answer`: LLM is asked a question, calls rag.query, answers
289
+ - `test_translate_chain`: user types in DE, LLM uses trans.text tool internally
290
+ - `test_market_post_requires_confirmation`
291
+ - `test_recursive_tool_call_limited_by_max_iterations`
292
+
293
+ ### Manual
294
+ - Confirm Anthropic Claude, OpenAI GPT-4, Ollama Mistral, MiniCPM-V all produce well-formed tool_call frames on the same test prompt.
295
+
296
+ ---
297
+
298
+ ## 9. Cross-references
299
+
300
+ | What | Where |
301
+ |------|-------|
302
+ | `llm.chat@2.0` tools field | [CAP2 §4.23](../CAPABILITY_CONTRACT_v2.md) |
303
+ | Tool-call stream frames | [CAP2 §5.1, X06 §6.6](../cross-cutting/X06-websocket.md) |
304
+ | `llm.tools.call@1.0` | [CAP2 §4.24](../CAPABILITY_CONTRACT_v2.md) |
305
+ | M04 backend extensions | M04 (extended in Phase 2) |
306
+ | Token scope for cross-community tool dispatch | [M16 §5.2](M16-tokens.md) |
307
+ | Confirmation UI hook | M08 ext |
308
+
309
+ ---
310
+
311
+ ## 10. Open questions
312
+
313
+ 1. **Tool result streaming.** Currently a tool result is atomic. For long-running tool calls (e.g. `img.generate`), the LLM has to wait. Phase 2.5 may stream tool progress back.
314
+ 2. **Tool memoisation.** Repeated `search_corpus(q)` in one chat could be cached. Defer.
315
+ 3. **Tool authority lineage.** When a tool is called by an LLM running on Node A on behalf of User on Node B, which token does the tool inherit? Currently the user's. This may be insufficient for federation. Phase 2.5.
316
+ 4. **Tool calls that issue tokens.** Could a tool be "issue me a token for capability X"? Probably yes; specify carefully to avoid privilege escalation. Defer.
317
+ 5. **Tool selection telemetry.** Which tools does the LLM actually pick? Useful for tuning descriptions. Log to trace ring buffer; surface in observability dashboard.
docs/p2_p3/M22-mobile-native.md ADDED
@@ -0,0 +1,325 @@
1
+ # M22: Mobile Native Client
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M01 (identity), M15 (relay tier for push and NAT traversal), M16 (tokens for app auth), M23 (E2E for chat), X06 (WebSocket for live updates), the entire Phase 1 bus protocol as a wire client
5
+ **Depended on by:** end users on iOS/Android (the first non-web HearthNet surface)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Native mobile application that:
12
+
13
+ - Onboards into a community (scan invite QR or paste invite blob)
14
+ - Stores keys in the platform's secure key storage (iOS Keychain, Android Keystore)
15
+ - Receives push notifications via the relay tier (M15)
16
+ - Calls the community's anchor(s) via the bus protocol over HTTP/SSE or WebSocket
17
+ - Provides UI for chat (1:1 + group), marketplace, ask (LLM), community feed
18
+ - Operates fully when the user's anchor is reachable; degrades gracefully when not
19
+
20
+ This module specifies **the contract between the mobile client and the rest of HearthNet**. The actual Flutter codebase lives in a separate repo (`/mobile-native`) and is not Python.
21
+
22
+ ---
23
+
24
+ ## 2. File layout
25
+
26
+ ```
27
+ mobile-native/ # separate Flutter project
28
+ ├── pubspec.yaml
29
+ ├── README.md
30
+ ├── lib/
31
+ │   ├── main.dart
32
+ │   ├── onboarding/          # invite scan, key gen
33
+ │   ├── identity/            # key storage, signing
34
+ │   │   ├── secure_storage.dart
35
+ │   │   └── signing.dart
36
+ │   ├── bus/                 # protocol client
37
+ │   │   ├── http_client.dart
38
+ │   │   ├── ws_client.dart
39
+ │   │   └── sse_client.dart
40
+ │   ├── crypto/              # E2E using cryptography_flutter / libsodium bindings
41
+ │   │   ├── x25519.dart
42
+ │   │   ├── ratchet.dart
43
+ │   │   └── envelope.dart
44
+ │   ├── push/                # APNs / FCM hookup
45
+ │   │   └── subscriber.dart
46
+ │   ├── ui/                  # screens
47
+ │   │   ├── chat.dart
48
+ │   │   ├── marketplace.dart
49
+ │   │   ├── ask.dart
50
+ │   │   └── community.dart
51
+ │   └── settings/
52
+ ├── ios/
53
+ ├── android/
54
+ └── tests/
55
+
56
+ hearthnet/mobile/            # Python-side helper (in the main package)
57
+ ├── __init__.py
58
+ ├── invite.py                # mobile-targeted invite QR generation
59
+ └── push_authority.py        # bus-side service for mobile push token registry
60
+ ```
61
+
62
+ The Python `hearthnet/mobile/` package contains the anchor-side helpers used by the existing community. The Flutter code is its own world; this spec governs the wire contract it must implement.
63
+
64
+ ---
65
+
66
+ ## 3. Onboarding flow
67
+
68
+ ```
69
+ User installs HearthNet app
70
+ ↓
71
+ App: "Scan invite QR or paste invite link"
72
+ ↓
73
+ On scan: parse hnvite:// blob (Phase 1 §13)
74
+ ↓
75
+ App generates Ed25519 keypair via libsodium binding
76
+ Persists private key in iOS Keychain (kSecAttrAccessibleAfterFirstUnlock)
77
+ or Android Keystore (setUserAuthenticationRequired(true) if biometric set)
78
+ ↓
79
+ App calls bus.call("onboard.complete", input={
80
+ "invite_token": "...",
81
+ "node_id_full": "<our new ed25519 pub>",
82
+ "device_class": "mobile",
83
+ "display_name": "<user-entered>",
84
+ "platform": "ios|android",
85
+ })
86
+ ↓
87
+ Anchor processes invite (M13), emits node.joined event
88
+ ↓
89
+ App fetches community manifest (signed); pins it
90
+ ↓
91
+ App registers for push:
92
+ - obtain APNs/FCM device token from OS
93
+ - bus.call("relay.push.register", input={device_token, platform})
94
+ ↓
95
+ App publishes E2E prekey bundle (e2e.prekeys.published event)
96
+ ↓
97
+ Ready to use
98
+ ```
99
+
100
+ ---
101
+
102
+ ## 4. Bus protocol on mobile
103
+
104
+ ### 4.1 Same wire as desktop
105
+
106
+ The mobile client speaks the same HTTP/SSE/WebSocket protocol as Phase 1 anchors. It is a **client** that calls into anchors/services in its community; it does not host capabilities itself in Phase 2 (see §4.4).
107
+
108
+ ### 4.2 Endpoint selection
109
+
110
+ The mobile client maintains an ordered list of endpoints:
111
+
112
+ 1. **Cached anchor endpoints** from last successful manifest fetch
113
+ 2. **Local network discovery** (mDNS via [Bonjour iOS / NSD Android]) β€” only when on Wi-Fi
114
+ 3. **Relay tier** (M15) β€” for NAT traversal when on cellular
115
+ 4. **DHT lookup** (X05) β€” last resort
116
+
117
+ Per call, the client tries endpoints in order; first success wins. Persistent failures bubble up as `relay_unreachable` or `network_unreachable`.
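The ordered fallback can be sketched as follows. The real client is Flutter; this Python version only illustrates the control flow. `call_with_fallback`, `EndpointsExhausted`, and the `(kind, url)` tuple shape are hypothetical, and the sketch always reports `network_unreachable` because the rule for choosing `relay_unreachable` instead is not stated here.

```python
class EndpointsExhausted(Exception):
    """Raised when every endpoint in the ordered list has failed."""

    def __init__(self, failures):
        super().__init__("network_unreachable")
        self.failures = failures  # list of (kind, url, error), kept for diagnostics


def call_with_fallback(endpoints, attempt):
    """Try endpoints in order; return (kind, result) for the first success.

    endpoints: ordered iterable of (kind, url), e.g. cached, mdns, relay, dht.
    attempt:   callable taking a url and returning a result or raising OSError.
    """
    failures = []
    for kind, url in endpoints:
        try:
            return kind, attempt(url)
        except OSError as exc:  # ConnectionError and TimeoutError are OSError subclasses
            failures.append((kind, url, exc))
    raise EndpointsExhausted(failures)
```

Returning the `kind` alongside the result lets the caller notice, for example, that it fell through to the relay tier and should refresh its cached anchor endpoints.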
118
+
119
+ ### 4.3 Reconnection
120
+
121
+ WebSocket connections are persistent for tool-call loops and live pubsub. On disconnect, the client reconnects with exponential backoff (1s, 2s, 4s, ..., capped at 60s). Reconnect re-subscribes to all active topics.
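The backoff schedule above can be written as a small generator. This is a sketch of the stated sequence only; the name `backoff_delays` is hypothetical, and jitter, which many clients add to avoid synchronized reconnects, is left out because the spec does not mention it.

```python
from itertools import islice


def backoff_delays(base=1.0, cap=60.0):
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


# First eight delays of the default schedule.
first_eight = list(islice(backoff_delays(), 8))
```

A reconnect loop would take the next delay after each failed attempt and create a fresh generator after a successful connection, so the schedule restarts at 1s.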
122
+
123
+ ### 4.4 Mobile-as-callable
124
+
125
+ In Phase 2, the mobile client does NOT register capabilities back into the community. It's purely a caller. Phase 3 may allow simple things (`market.list` from local cache while offline).
126
+
127
+ ### 4.5 Token-bearer mode
128
+
129
+ For background-fetch (when the app is suspended), the OS may run a brief task. Background tasks use a **bearer token** from M16 (issued at onboarding, refreshed on each app foreground). The token has scope:
130
+
131
+ ```json
132
+ {
133
+ "capabilities": ["chat.fetch","market.list"],
134
+ "rate_limit_per_minute": 30
135
+ }
136
+ ```
137
+
138
+ This avoids needing the user's biometric to unlock the private key for background polling.
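How an anchor might enforce such a scope can be sketched as below. Everything here is an assumption for illustration: the class name `BearerScope`, the sliding 60-second window, and passing the clock in explicitly are not taken from M16.

```python
from collections import deque


class BearerScope:
    """Checks a request against a token's capability list and per-minute rate limit."""

    def __init__(self, capabilities, rate_limit_per_minute):
        self.capabilities = frozenset(capabilities)
        self.rate_limit_per_minute = rate_limit_per_minute
        self._recent = deque()  # timestamps (seconds) of accepted requests

    def allows(self, capability, now):
        """Return True and record the request if it is in scope and under the limit."""
        if capability not in self.capabilities:
            return False
        # Drop accepted requests that have left the 60-second window.
        while self._recent and now - self._recent[0] >= 60.0:
            self._recent.popleft()
        if len(self._recent) >= self.rate_limit_per_minute:
            return False
        self._recent.append(now)
        return True
```

Out-of-scope requests are rejected before they count against the rate limit, so a misbehaving background task cannot starve legitimate polling.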
139
+
140
+ ---
141
+
142
+ ## 5. Push notifications
143
+
144
+ ### 5.1 What triggers a push
145
+
146
+ When the following events occur in the community, anchors send a push to relevant subscribed mobile devices:
147
+
148
+ - `chat.message.sent` where recipient is the mobile user
149
+ - `chat.thread.message.sent` where mobile user is a thread member
150
+ - `marketplace.post.created` where post matches user's subscribed categories
151
+ - `community.alert` (broadcast emergency alert)
152
+ - `node.joined` (subscribed users only)
153
+
154
+ ### 5.2 Push payload shape
155
+
156
+ Per [M15 §5.4](M15-relay-tier.md), the payload is minimal:
157
+
158
+ ```json
159
+ {
160
+ "event_type": "chat.message.sent",
161
+ "sender_short": "7H4G-...",
162
+ "preview": "Hallo Jana, ich bring..." // optional cleartext preview if not E2E
163
+ }
164
+ ```
165
+
166
+ For E2E messages, the preview is absent and the app must fetch + decrypt on open.
167
+
168
+ ### 5.3 iOS vs Android specifics
169
+
170
+ **iOS:** APNs payload with `aps.alert.title` and `aps.alert.body`. Background mode enabled to fetch on receive (`content-available: 1`).
171
+
172
+ **Android:** FCM `data` message (not `notification`, so that the app controls display). Handled by the app's `FirebaseMessagingService`.
173
+
174
+ ### 5.4 Quiet hours
175
+
176
+ User-configurable. Push silenced 22:00–07:00 by default; emergency alerts override.
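The default window wraps past midnight, which is easy to get wrong with a plain range check. A minimal sketch, assuming `community.alert` is the only event type that overrides quiet hours and using the hypothetical names `in_quiet_hours` and `should_notify`:

```python
from datetime import time


def in_quiet_hours(now: time, start: time = time(22, 0), end: time = time(7, 0)) -> bool:
    """True if now falls in [start, end), handling windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_notify(event_type: str, now: time) -> bool:
    """Emergency alerts always notify; everything else is silenced during quiet hours."""
    return event_type == "community.alert" or not in_quiet_hours(now)
```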
177
+
178
+ ### 5.5 Mute and per-thread settings
179
+
180
+ Per-thread mute, per-category marketplace silence. Stored in `mobile.preferences` event (self-only, encrypted-at-rest on device).
181
+
182
+ ---
183
+
184
+ ## 6. Secure key storage
185
+
186
+ ### 6.1 iOS: Keychain Services
187
+
188
+ ```dart
189
+ // lib/identity/secure_storage.dart (sketch)
190
+ const _accessibility = 'kSecAttrAccessibleAfterFirstUnlock';
191
+
192
+ Future<void> storePrivateKey(String label, Uint8List bytes) async {
193
+ await KeychainAccess.setData(
194
+ label: label,
195
+ data: bytes,
196
+ accessibility: _accessibility,
197
+ accessControl: AccessControl.userPresence, // biometric on key use
198
+ );
199
+ }
200
+ ```
201
+
202
+ ### 6.2 Android: Keystore
203
+
204
+ StrongBox-backed if available (e.g. on recent Pixel devices), otherwise backed by the TEE.
205
+
206
+ ```dart
207
+ final cipher = await CryptographyFlutter.aesGcm(
208
+ keyId: 'hearthnet_identity_v1',
209
+ requireAuth: AuthMethod.biometric,
210
+ );
211
+ ```
212
+
213
+ ### 6.3 Backup
214
+
215
+ Private keys are NEVER backed up via iCloud / Google Backup. App-level backup uses an encrypted export blob (user-chosen passphrase) that the user is expected to save out-of-band (e.g. password manager, written down).
216
+
217
+ `config.mobile.cloud_backup_allowed = false` enforced.
218
+
219
+ ### 6.4 Lost device
220
+
221
+ If the user loses the device, they:
222
+
223
+ 1. Wait out the device's session; eventually messages stop delivering
224
+ 2. From another device, call `node.revoke` on this NodeID (anchor co-signs)
225
+ 3. The revoked NodeID is then blacklisted; the lost phone, even if recovered, can't authenticate
226
+
227
+ This is identical to the Phase 1 node revocation flow.
228
+
229
+ ---
230
+
231
+ ## 7. UI surface (mobile)
232
+
233
+ Mirrors the web UI ([M08](../../modules/M08-ui.md)) but native:
234
+
235
+ | Tab | Content |
236
+ |-----|---------|
237
+ | **Chat** | 1:1 conversations + group threads. Live updates via WS. Voice notes optional (record → STT → send transcript + audio attachment). |
238
+ | **Market** | List + post; image attachments via camera/gallery. |
239
+ | **Ask** | LLM chat. Tool-augmented mode available. Voice input button. |
240
+ | **Community** | Member list, recent events, federation peers. |
241
+ | **Settings** | Push prefs, language, backup, advanced. |
242
+
243
+ All UI is plain Material/Cupertino, with no exotic frameworks. This matches Christof's preference for boring, durable tech.
244
+
245
+ ---
246
+
247
+ ## 8. Configuration
248
+
249
+ Mobile-side (lives in app):
250
+
251
+ ```dart
252
+ const config = {
253
+ 'community_id': '<from invite>',
254
+ 'anchor_endpoints': [/* from manifest */],
255
+ 'relay_url': 'https://relay.hearthnet.de',
256
+ 'push_enabled': true,
257
+ 'quiet_hours': {'start': '22:00', 'end': '07:00'},
258
+ 'background_fetch_minutes': 15,
259
+ };
260
+ ```
261
+
262
+ Anchor-side (Python, for the push_authority service):
263
+
264
+ ```python
265
+ config.mobile.push_enabled = True
266
+ config.mobile.push_categories_marketplace = ["essentials","emergency"]
267
+ config.mobile.push_quiet_hours_default = ("22:00","07:00")
268
+ ```
269
+
270
+ ---
271
+
272
+ ## 9. Errors
273
+
274
+ | Condition | UI presentation |
275
+ |-----------|-----------------|
276
+ | No network | Offline banner; queued sends |
277
+ | Anchor unreachable | "Your community anchor is offline. Retrying..." |
278
+ | Relay unreachable | Falls back to direct; warns if all fail |
279
+ | Token expired | Silent refresh; only surface if refresh fails |
280
+ | Push delivery failed | No UI; logged for diagnostics |
281
+ | Manifest signature mismatch | Hard block; re-onboard required |
282
+
283
+ ---
284
+
285
+ ## 10. Tests
286
+
287
+ ### Flutter side
288
+ - Widget tests for each tab
289
+ - Integration test for onboarding flow with a mock anchor
290
+ - E2E test using a real anchor in CI (Linux runner running hearthnet)
291
+
292
+ ### Anchor-side Python
293
+ - `test_push_subscription_recorded`
294
+ - `test_push_dispatch_on_chat_message_sent`
295
+ - `test_quiet_hours_silences_non_emergency`
296
+ - `test_revocation_revokes_mobile_session`
297
+
298
+ ### Manual
299
+ - iOS + Android smoke tests on physical devices
300
+ - Background-fetch verified across 30-minute suspensions
301
+
302
+ ---
303
+
304
+ ## 11. Cross-references
305
+
306
+ | What | Where |
307
+ |------|-------|
308
+ | Bus protocol | [Phase 1 CAP §5](../../CAPABILITY_CONTRACT.md) |
309
+ | Push relay tier | [M15 §5.4](M15-relay-tier.md) |
310
+ | Token-bearer auth | [M16 §5.5](M16-tokens.md) |
311
+ | E2E chat | [M23](M23-e2e-encryption.md) |
312
+ | WebSocket | [X06](../cross-cutting/X06-websocket.md) |
313
+ | Invite blobs | [Phase 1 CAP §13](../../CAPABILITY_CONTRACT.md) |
314
+ | Web UI (mirror) | [M08](../../modules/M08-ui.md) |
315
+
316
+ ---
317
+
318
+ ## 12. Open questions
319
+
320
+ 1. **Flutter vs React Native vs native (Swift + Kotlin).** Choosing Flutter for shared codebase and Christof's stated preference for boring/durable stacks. Reconsider if Flutter's keychain support is shaky.
321
+ 2. **End-to-end encryption library in Flutter.** Need a libsodium binding that matches the Python side bit-exactly. `flutter_sodium` is well-maintained; verify on both platforms.
322
+ 3. **Background fetch reliability.** iOS throttles aggressively. We accept "best effort"; push is the real delivery mechanism.
323
+ 4. **Offline mode depth.** Mobile-only LLM (small Phi-3 / Gemma 2B) is Phase 3.
324
+ 5. **Web push for PWA.** Could the same flow target a PWA (no native app)? Yes, with FCM web push; documented but not built in Phase 2.
325
+ 6. **Family-share licence.** Christof might want to ship the iOS app to family members under his account; App Store policy permits this within Family Sharing.
docs/p2_p3/M23-e2e-encryption.md ADDED
@@ -0,0 +1,474 @@
1
+ # M23: End-to-End Encryption
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M01 (identity, key derivation), M07 (blobs, for prekey bundles), X02 (events, for prekey/session events), X04 (config), `pynacl`
5
+ **Depended on by:** M10 chat (extended), M25 group chat, optionally M07 file encryption
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Provide end-to-end encryption between community members for:
12
+
13
+ - **1:1 chat** (via M10 extension): every chat message encrypted with a per-sender ratchet
14
+ - **Group chat** (M25): per-thread sender keys
15
+ - **File envelopes** (M07 extension): chunks optionally wrapped in a per-recipient envelope
16
+
17
+ The cryptographic design borrows from Signal but stays simpler:
18
+
19
+ - **X3DH** for initial key agreement (one identity key + one signed prekey + one one-time prekey)
20
+ - **Double Ratchet** for per-session forward-secrecy
21
+ - **Sender keys** (Signal-style) for group threads
22
+ - **Per-blob envelope** for file encryption
23
+
24
+ This module owns the crypto primitives and session state. M10 and M25 own the message protocol that calls into M23.
25
+
26
+ ---
27
+
28
+ ## 2. File layout
29
+
30
+ ```
31
+ hearthnet/crypto/
32
+ ├── __init__.py
33
+ ├── kem.py            # X25519 key agreement (X3DH)
34
+ ├── ratchet.py        # Double Ratchet, per-session
35
+ ├── sender_keys.py    # Group sender keys (M25 helper)
36
+ ├── envelope.py       # File envelope encryption (chunks)
37
+ └── prekeys.py # Prekey bundle storage and publication
38
+ ```
39
+
40
+ The `hearthnet/crypto/` directory is a NEW top-level package in Phase 2.
41
+
42
+ ---
43
+
44
+ ## 3. Public API
45
+
46
+ ### 3.1 `kem.py`: X3DH
47
+
48
+ ```python
49
+ # hearthnet/crypto/kem.py
50
+ from dataclasses import dataclass
51
+
52
+ @dataclass(frozen=True)
53
+ class X25519KeyPair:
54
+ private: bytes # 32 bytes
55
+ public: bytes # 32 raw bytes; serialised as "x25519:<base64>"
56
+
57
+ def x25519_generate() -> X25519KeyPair: ...
58
+ def x25519_dh(our_priv: bytes, their_pub: bytes) -> bytes:
59
+ """Computes the shared secret. 32 bytes."""
60
+
61
+ @dataclass(frozen=True)
62
+ class PrekeyBundle:
63
+ """What a recipient publishes so senders can establish a session without them being online."""
64
+ node_id_full: str
65
+ identity_pub: bytes # long-lived; derived from Ed25519 identity via SignedConversion
66
+ signed_prekey_pub: bytes
67
+ signed_prekey_sig: bytes # Ed25519 signature over signed_prekey_pub by identity Ed25519 key
68
+ one_time_prekeys_pub: list[bytes] # depleted on use
69
+ published_at: int # unix seconds
70
+
71
+ def derive_identity_x25519_from_ed25519(ed_kp: KeyPair) -> X25519KeyPair:
72
+ """Use the standard nacl conversion. Single x25519 identity key per device."""
73
+
74
+ def build_prekey_bundle(
75
+ ed_kp: KeyPair,
76
+ *,
77
+ num_one_time: int = E2E_PREKEY_BUNDLE_SIZE,
78
+ ) -> tuple[PrekeyBundle, X25519KeyPair, list[X25519KeyPair]]:
79
+ """Returns (bundle, signed_prekey_full, one_time_prekeys_full).
80
+ Caller persists the private halves; publishes only the bundle."""
81
+
82
+ def x3dh_initiator(
83
+ our_identity_x: X25519KeyPair,
84
+ our_ephemeral: X25519KeyPair,
85
+ their_bundle: PrekeyBundle,
86
+ ) -> tuple[bytes, dict]:
87
+ """Returns (shared_secret, session_init_message).
88
+ session_init_message includes our identity_pub + ephemeral_pub + used_otp_index."""
89
+
90
+ def x3dh_responder(
91
+ our_identity_x: X25519KeyPair,
92
+ our_signed_prekey: X25519KeyPair,
93
+ our_one_time_prekey: X25519KeyPair | None,
94
+ their_identity_pub: bytes,
95
+ their_ephemeral_pub: bytes,
96
+ ) -> bytes:
97
+ """Returns shared_secret."""
98
+ ```
99
+
100
+ ### 3.2 `ratchet.py`: Double Ratchet
101
+
102
+ ```python
103
+ # hearthnet/crypto/ratchet.py
104
+ @dataclass
105
+ class RatchetState:
106
+ """One session, one direction. There are two per session (send + receive)."""
107
+ root_key: bytes
108
+ chain_key: bytes
109
+ counter: int
110
+ epoch: int
111
+ skipped_messages: dict[tuple[int, int], bytes] # (epoch, counter) → message_key
112
+ dh_keypair: X25519KeyPair | None
113
+ remote_dh_pub: bytes | None
114
+
115
+ @dataclass
116
+ class RatchetSession:
117
+ """A bidirectional encrypted session between two NodeIDs."""
118
+ peer_node_id_full: str
119
+ send_state: RatchetState
120
+ recv_state: RatchetState
121
+
122
+ def is_established(self) -> bool: ...
123
+
124
+ def init_session_initiator(shared_secret: bytes, peer_dh_pub: bytes) -> RatchetSession: ...
125
+ def init_session_responder(shared_secret: bytes, our_dh_kp: X25519KeyPair) -> RatchetSession: ...
126
+
127
+ @dataclass(frozen=True)
128
+ class RatchetMessageHeader:
129
+ dh_pub: bytes # current sender's DH pub
130
+ epoch: int
131
+ counter: int
132
+
133
+ def encrypt_message(
134
+ session: RatchetSession,
135
+ plaintext: bytes,
136
+ *,
137
+ aad: bytes = b"",
138
+ ) -> tuple[RatchetMessageHeader, bytes]:
139
+ """Returns (header, ciphertext). Mutates session.send_state."""
140
+
141
+ def decrypt_message(
142
+ session: RatchetSession,
143
+ header: RatchetMessageHeader,
144
+ ciphertext: bytes,
145
+ *,
146
+ aad: bytes = b"",
147
+ ) -> bytes:
148
+ """Verifies + decrypts. Mutates session.recv_state.
149
+ Tolerates up to E2E_RATCHET_MAX_OUT_OF_ORDER out-of-order messages via skipped_messages."""
150
+
151
+ class RatchetError(Exception):
152
+ """code in {
153
+ 'session_not_established','decrypt_failed','out_of_order_too_far',
154
+ 'message_too_old','aad_mismatch'}"""
155
+ code: str
156
+
157
+ # Persistence: sessions are serialised via dataclasses_json into a SQLite table.
158
+ ```
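The symmetric half of the ratchet and the `skipped_messages` bookkeeping can be illustrated with the standard library. This sketch uses the HMAC-SHA256 chain step recommended in Signal's Double Ratchet specification (input `0x01` for the message key, `0x02` for the next chain key); whether HearthNet uses exactly this KDF is not stated here. `MAX_SKIP` is a stand-in for `E2E_RATCHET_MAX_OUT_OF_ORDER`, and `ReceiveChain` is a hypothetical simplification of `RatchetState` with no DH ratchet and no AEAD.

```python
import hashlib
import hmac

MAX_SKIP = 64  # stand-in value; the real E2E_RATCHET_MAX_OUT_OF_ORDER is not given here


def kdf_chain(chain_key: bytes):
    """One symmetric ratchet step: returns (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


class ReceiveChain:
    """Receiving side of one chain, tolerating out-of-order delivery."""

    def __init__(self, chain_key: bytes, epoch: int = 0):
        self.chain_key = chain_key
        self.epoch = epoch
        self.counter = 0   # index of the next message key to derive
        self.skipped = {}  # (epoch, counter) -> message_key

    def message_key_for(self, counter: int) -> bytes:
        key = self.skipped.pop((self.epoch, counter), None)
        if key is not None:
            return key  # a message we skipped earlier has now arrived
        if counter < self.counter:
            raise ValueError("message_too_old")
        if counter - self.counter > MAX_SKIP:
            raise ValueError("out_of_order_too_far")
        # Derive and store keys for the messages we have not seen yet.
        while self.counter < counter:
            self.chain_key, skipped_key = kdf_chain(self.chain_key)
            self.skipped[(self.epoch, self.counter)] = skipped_key
            self.counter += 1
        self.chain_key, message_key = kdf_chain(self.chain_key)
        self.counter += 1
        return message_key
```

Each message key is handed out once and then forgotten, so a replayed header fails with `message_too_old` instead of decrypting again.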
159
+
160
+ ### 3.3 `sender_keys.py`: Group ratchet
161
+
162
+ ```python
163
+ # hearthnet/crypto/sender_keys.py
164
+ @dataclass
165
+ class SenderKeyState:
166
+ """Per (group, sender) sender-key chain. Each sender broadcasts a chain key
167
+ to all group members at thread create / join."""
168
+ thread_id: str
169
+ sender_node_id: str
170
+ chain_key: bytes
171
+ counter: int
172
+ signature_keypair: tuple[bytes, bytes] | None # ed25519, for message signing
173
+
174
+ @dataclass
175
+ class GroupSession:
176
+ """One per thread; holds all members' SenderKeyStates."""
177
+ thread_id: str
178
+ sender_keys: dict[str, SenderKeyState] # sender_node_id → state
179
+
180
+ def init_sender_key(thread_id: str, sender_node_id: str) -> SenderKeyState: ...
181
+ def encrypt_for_group(session: GroupSession, sender_node_id: str, plaintext: bytes) -> tuple[dict, bytes]: ...
182
+ def decrypt_for_group(session: GroupSession, header: dict, ciphertext: bytes) -> bytes: ...
183
+
184
+ def serialise_sender_key_distribution(state: SenderKeyState) -> bytes:
185
+ """Serialise a sender key for sending to other group members.
186
+ MUST be sent inside a pairwise Double Ratchet session, not in cleartext."""
187
+
188
+ def consume_sender_key_distribution(bytes_blob: bytes, session: GroupSession) -> None: ...
189
+ ```
190
+
191
+ ### 3.4 `envelope.py`: File envelope
192
+
193
+ ```python
194
+ # hearthnet/crypto/envelope.py
195
+ @dataclass(frozen=True)
196
+ class FileEnvelopeHeader:
197
+ recipient_node_ids: list[str]
198
+ wrapped_keys: dict[str, bytes] # node_id_short → wrapped symmetric key
199
+ nonce: bytes # 12 bytes
200
+ chunk_size_bytes: int
201
+
202
+ def encrypt_blob_for(
203
+ recipients: list[str], # NodeIDs full
204
+ plaintext: bytes,
205
+ sender_kp: KeyPair,
206
+ sessions_provider: Callable[[str], RatchetSession | None],
207
+ ) -> tuple[FileEnvelopeHeader, bytes]:
208
+ """1. Generate random 32-byte symmetric key
209
+ 2. For each recipient: encrypt key with their ratchet session (one-shot use of next message key)
210
+ 3. ChaCha20-Poly1305-encrypt plaintext with the symmetric key + nonce
211
+ 4. Return (header, ciphertext)"""
212
+
213
+ def decrypt_blob_for_self(
214
+ header: FileEnvelopeHeader,
215
+ ciphertext: bytes,
216
+ our_node_id_full: str,
217
+ sessions_provider: Callable[[str], RatchetSession | None],
218
+ ) -> bytes:
219
+ """1. Find our wrapped key in header.wrapped_keys
220
+ 2. Decrypt symmetric key via sender's ratchet session
221
+ 3. ChaCha20-Poly1305-decrypt"""
222
+ ```
223
+
224
+ ### 3.5 `prekeys.py`: Publication and consumption
225
+
226
+ ```python
227
+ # hearthnet/crypto/prekeys.py
228
+ class PrekeyStore:
229
+ """Persists this node's private prekey material and the bundles we've consumed for others."""
230
+
231
+ def __init__(self, db_path: Path):
232
+ ...
233
+
234
+ def publish_self(
235
+ self,
236
+ ed_kp: KeyPair,
237
+ event_log: EventLog,
238
+ *,
239
+ num_one_time: int = E2E_PREKEY_BUNDLE_SIZE,
240
+ ) -> None:
241
+ """1. Build PrekeyBundle (kem.build_prekey_bundle)
242
+ 2. Persist private halves locally
243
+ 3. Emit e2e.prekeys.published event with public halves"""
244
+
245
+ def get_peer_bundle(self, peer_node_id_full: str) -> PrekeyBundle | None:
246
+ """Look in local cache; if absent, fetch from peer via bus.
247
+ Returns one bundle including one consumable one-time prekey if available."""
248
+
249
+ def consume_one_time_prekey(self, our_otp_index: int) -> X25519KeyPair | None:
250
+ """Server-side: when someone uses one of our one-time prekeys, return + remove it."""
251
+
252
+ def refill_one_time_prekeys_if_low(self, ed_kp: KeyPair, event_log: EventLog) -> int:
253
+ """If fewer than E2E_PREKEY_BUNDLE_SIZE / 4 remain, publish a new bundle.
254
+ Returns count added."""
255
+ ```
256
+
257
+ ---
258
+
259
+ ## 4. Behaviour
260
+
261
+ ### 4.1 Session establishment lifecycle (1:1)
262
+
263
+ ```
264
+ Alice wants to send Bob an encrypted message; no session exists.
265
+ ↓
266
+ PrekeyStore.get_peer_bundle("ed25519:bob") → fetch from Bob's most recent e2e.prekeys.published event
267
+ ↓
268
+ x3dh_initiator(alice_identity_x, alice_ephemeral, bob_bundle) → (shared_secret, init_msg)
269
+ ↓
270
+ init_session_initiator(shared_secret, bob_signed_prekey_pub) → RatchetSession
271
+ ↓
272
+ encrypt_message(session, plaintext) → (header, ciphertext)
273
+ ↓
274
+ Alice sends: chat.message.sent event with data.body = {
275
+ "e2e": true,
276
+ "header": { x3dh_init: init_msg, ratchet_header: header },
277
+ "ciphertext": "<base64>"
278
+ }
279
+ ↓
280
+ Bob receives event.
281
+ x3dh_responder(...) → shared_secret
282
+ init_session_responder(...) → RatchetSession
283
+ decrypt_message(...) → plaintext
284
+ Emit e2e.session.established event so Alice can clean up retries
285
+ ↓
286
+ Subsequent messages: just header + ciphertext (no x3dh_init).
287
+ ```
288
+
289
+ ### 4.2 Group session establishment
290
+
291
+ ```
292
+ Thread creator emits chat.thread.created with members and an ed25519:thread_signing_root.
293
+ ↓
294
+ Each member generates a SenderKeyState for themselves in this thread.
295
+ ↓
296
+ Each member, in a pairwise loop, sends their sender key distribution to each other member
297
+ inside their 1:1 ratchet sessions (so non-thread-members never see it).
298
+ ↓
299
+ Once everyone has everyone's sender keys, encrypt/decrypt happens with sender keys
300
+ (chain ratchet only; no DH ratchet on the group session itself).
301
+ ```
302
+
303
+ When a member is added later, the inviter must distribute every existing sender's current chain state to the new member. The new member can therefore read from the current message onward, but usually not earlier history.
304
+
305
+ When a member is removed, they still hold the existing sender keys. **All remaining members must rotate their sender keys** so that the removed member cannot read messages sent after the removal. The UI prompts for this.
306
+
307
+ ### 4.3 Out-of-order messages
308
+
309
+ Up to `E2E_RATCHET_MAX_OUT_OF_ORDER` (32) skipped message keys are cached per session. Beyond that limit, `out_of_order_too_far` is raised: the message is dropped and the sender is notified out-of-band that they should rekey.
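The skipped-key window can be sketched as follows. This is a minimal illustration under stated assumptions, not the M23 implementation: `ReceivingChain`, `kdf_chain`, and the two exception classes are hypothetical names, and the HMAC-SHA256 chain step with the constants 0x01 and 0x02 follows the general Double Ratchet recommendation rather than anything this spec states.

```python
import hashlib
import hmac

E2E_RATCHET_MAX_OUT_OF_ORDER = 32  # value quoted in the spec text


class OutOfOrderTooFar(Exception):
    """Hypothetical stand-in for RatchetError code `out_of_order_too_far`."""


class MessageTooOld(Exception):
    """Hypothetical stand-in for RatchetError code `message_too_old`."""


def kdf_chain(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance the symmetric chain one step: (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


class ReceivingChain:
    """Receiving side of one symmetric chain with a bounded skipped-key cache."""

    def __init__(self, chain_key: bytes) -> None:
        self.chain_key = chain_key
        self.next_index = 0
        self.skipped: dict[int, bytes] = {}

    def message_key_for(self, index: int) -> bytes:
        if index < self.next_index:
            # Either a late arrival whose key was cached, or a replay.
            if index not in self.skipped:
                raise MessageTooOld(index)
            return self.skipped.pop(index)

        gap = index - self.next_index
        if len(self.skipped) + gap > E2E_RATCHET_MAX_OUT_OF_ORDER:
            raise OutOfOrderTooFar(index)

        # Derive and cache a key for every message we are jumping over.
        while self.next_index < index:
            self.chain_key, skipped_key = kdf_chain(self.chain_key)
            self.skipped[self.next_index] = skipped_key
            self.next_index += 1

        self.chain_key, message_key = kdf_chain(self.chain_key)
        self.next_index += 1
        return message_key
```

Each cached key is removed as soon as it is used, so a replayed message index fails instead of decrypting twice. Whether the limit counts cached keys in total (as here) or only the gap of a single jump is an assumption of this sketch.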
310
+
311
+ ### 4.4 Rekeying
312
+
313
+ After `E2E_RATCHET_REKEY_AFTER_MESSAGES` messages on the same DH ratchet, the next message includes a new DH ephemeral. Standard Double Ratchet behaviour. Transparent to users.
314
+
315
+ ### 4.5 Session loss recovery
316
+
317
+ If a node's session state is lost (disk corruption, fresh install with same keys), the peer does not know, so its messages will fail to decrypt. Recovery flow:
318
+
319
+ 1. Decrypting node returns `e2e_decrypt_failed` via pubsub
320
+ 2. Sending node sees this and re-initiates X3DH
321
+ 3. New session replaces old; resends recent messages
322
+
323
+ UI shows "session was reset" so users know context might have been lost.
324
+
325
+ ### 4.6 Identity X25519 derivation
326
+
327
+ We derive a per-device X25519 identity key from the Ed25519 identity key, using libsodium's `crypto_sign_ed25519_sk_to_curve25519` for the secret key and `crypto_sign_ed25519_pk_to_curve25519` for the public key. This way:
328
+
329
+ - Only one identity key to maintain
330
+ - Anyone with the public Ed25519 (in the community manifest) can derive the X25519 pub
331
+ - Signed prekey signatures use the Ed25519 key (already established as device identity)
332
+
333
+ ### 4.7 Prekey publication
334
+
335
+ Each node publishes a fresh `e2e.prekeys.published` event on startup if their last one is > 24h old. The event contains:
336
+
337
+ - `identity_pubkey` (X25519 form)
338
+ - `signed_prekey` (with Ed25519 signature)
339
+ - `one_time_prekeys[]` (up to `E2E_PREKEY_BUNDLE_SIZE` = 20)
340
+
341
+ Consumers find a peer's bundle by reading their latest `e2e.prekeys.published` event from the log.
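The publication rules above reduce to two small decisions, sketched here. The function names are hypothetical, and topping the bundle back up to `E2E_PREKEY_BUNDLE_SIZE` is an assumption: the spec only says a new bundle is published when fewer than a quarter remain.

```python
from typing import Optional

E2E_PREKEY_BUNDLE_SIZE = 20            # value quoted in the spec text
PREKEY_MAX_AGE_SECONDS = 24 * 60 * 60  # "> 24h old" from the spec text


def should_publish_on_startup(last_published_at: Optional[float], now: float) -> bool:
    """Publish if we never published, or the last bundle is older than 24 hours."""
    if last_published_at is None:
        return True
    return now - last_published_at > PREKEY_MAX_AGE_SECONDS


def one_time_prekeys_to_add(remaining: int) -> int:
    """Return how many one-time prekeys to generate (0 if the stock is healthy)."""
    if remaining < E2E_PREKEY_BUNDLE_SIZE / 4:
        # Assumption: refill back to a full bundle.
        return E2E_PREKEY_BUNDLE_SIZE - remaining
    return 0
```

With a bundle size of 20, the refill triggers once four or fewer one-time prekeys remain.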
342
+
343
+ ### 4.8 What is NOT E2E
344
+
345
+ Even with M23 active:
346
+ - Event envelope (sender, recipient, lamport, event_type, wall_clock) is cleartext within the community
347
+ - Signatures over events remain valid for community-level audit
348
+ - Message *metadata* leaks to community members (who talked to whom and when), just not content
349
+
350
+ This is intentional: communities are trust roots; complete anonymity within a community is not a goal.
351
+
352
+ ### 4.9 File envelope
353
+
354
+ For file blobs, `encrypt_blob_for(recipients, plaintext, ...)` produces a single ciphertext, with a small per-recipient header. Senders pick recipients explicitly (e.g. group thread members for an attachment). Bystanders cannot decrypt even if they fetch the blob via M07 `file.read`.
355
+
356
+ The blob's CID is the hash of the **ciphertext**, so the same plaintext sent to different recipient sets has different CIDs. This costs more storage, but it is needed for security.
357
+
358
+ ---
359
+
360
+ ## 5. Persistence
361
+
362
+ ### 5.1 Sessions table
363
+
364
+ ```sql
365
+ CREATE TABLE ratchet_sessions (
366
+ peer_node_id_full TEXT PRIMARY KEY,
367
+ session_blob BLOB NOT NULL, -- serialised RatchetSession
368
+ established_at INTEGER NOT NULL,
369
+ last_used INTEGER NOT NULL
370
+ );
371
+ ```
372
+
373
+ ### 5.2 Group sessions table
374
+
375
+ ```sql
376
+ CREATE TABLE group_sessions (
377
+ thread_id TEXT PRIMARY KEY,
378
+ session_blob BLOB NOT NULL,
379
+ updated_at INTEGER NOT NULL
380
+ );
381
+ ```
382
+
383
+ ### 5.3 Prekey private halves
384
+
385
+ ```sql
386
+ CREATE TABLE prekey_private (
387
+ kind TEXT NOT NULL, -- 'identity'|'signed_prekey'|'one_time'
388
+ index_or_id TEXT NOT NULL, -- '0' for identity; 'spk_v1' for signed; otp index for OTPs
389
+ private_key BLOB NOT NULL,
390
+ consumed_at INTEGER, -- only set for one-time, when used
391
+ PRIMARY KEY (kind, index_or_id)
392
+ );
393
+ ```
394
+
395
+ The database files are restricted to mode 0600. They are backed up nightly via `hearthnet export` (encrypted with the user's passphrase).
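As an illustration of how `consume_one_time_prekey` might sit on top of the `prekey_private` table, here is a minimal sketch using the standard `sqlite3` module. It returns raw key bytes rather than an `X25519KeyPair`, and it marks the row via `consumed_at` instead of deleting it; both choices are assumptions made for the example.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE prekey_private (
    kind         TEXT NOT NULL,
    index_or_id  TEXT NOT NULL,
    private_key  BLOB NOT NULL,
    consumed_at  INTEGER,
    PRIMARY KEY (kind, index_or_id)
);
"""


def consume_one_time_prekey(db: sqlite3.Connection, otp_index: int) -> Optional[bytes]:
    """Return the private key of an unused one-time prekey and mark it consumed."""
    with db:  # one transaction: commits on success, rolls back on error
        row = db.execute(
            "SELECT private_key FROM prekey_private "
            "WHERE kind = 'one_time' AND index_or_id = ? AND consumed_at IS NULL",
            (str(otp_index),),
        ).fetchone()
        if row is None:
            return None  # unknown index, or already used
        db.execute(
            "UPDATE prekey_private SET consumed_at = ? "
            "WHERE kind = 'one_time' AND index_or_id = ?",
            (int(time.time()), str(otp_index)),
        )
        return row[0]


# Demo with an in-memory database and a dummy 32-byte key.
db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute(
    "INSERT INTO prekey_private VALUES ('one_time', '7', ?, NULL)",
    (b"\x11" * 32,),
)
first_use = consume_one_time_prekey(db, 7)
second_use = consume_one_time_prekey(db, 7)
```

The second call returns `None`, which is the behaviour `test_one_time_prekey_consumed_once` appears to be checking.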
396
+
397
+ ---
398
+
399
+ ## 6. Errors
400
+
401
+ `RatchetError` codes (M23-internal):
402
+ - `session_not_established`
403
+ - `decrypt_failed`
404
+ - `out_of_order_too_far`
405
+ - `message_too_old`
406
+ - `aad_mismatch`
407
+
408
+ Wire mapping per [CAP2 §9](../CAPABILITY_CONTRACT_v2.md):
409
+ - `e2e_session_missing` ← `session_not_established`
410
+ - `e2e_decrypt_failed` ← `decrypt_failed`, `aad_mismatch`
411
+ - `ratchet_out_of_order` ← `out_of_order_too_far`
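The mapping above can be written as a lookup table. The dictionary and function names are hypothetical. Note that `message_too_old` has no wire code in the list, so this sketch returns `None` for it rather than guessing one.

```python
from typing import Optional

# Internal RatchetError code -> wire error code, as listed in the spec.
WIRE_CODE_BY_RATCHET_ERROR = {
    "session_not_established": "e2e_session_missing",
    "decrypt_failed": "e2e_decrypt_failed",
    "aad_mismatch": "e2e_decrypt_failed",
    "out_of_order_too_far": "ratchet_out_of_order",
}


def to_wire_code(ratchet_code: str) -> Optional[str]:
    """Translate an M23-internal error code; None means no mapping is specified."""
    return WIRE_CODE_BY_RATCHET_ERROR.get(ratchet_code)
```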
412
+
413
+ ---
414
+
415
+ ## 7. Configuration
416
+
417
+ ```python
418
+ config.e2e.enabled = True
419
+ config.e2e.chat_default_enabled = True # new 1:1 chats default to E2E
420
+ config.e2e.group_default_enabled = True
421
+ config.e2e.file_default_enabled = False # opt-in per blob
422
+ config.e2e.prekey_refill_count = E2E_PREKEY_BUNDLE_SIZE
423
+ config.e2e.rekey_after_messages = E2E_RATCHET_REKEY_AFTER_MESSAGES
424
+ config.e2e.max_out_of_order = E2E_RATCHET_MAX_OUT_OF_ORDER
425
+ ```
426
+
427
+ ---
428
+
429
+ ## 8. Tests
430
+
431
+ ### Unit
432
+ - `test_x25519_dh_symmetric`
433
+ - `test_x3dh_initiator_responder_agree`
434
+ - `test_ratchet_encrypt_decrypt_roundtrip`
435
+ - `test_ratchet_out_of_order_within_window`
436
+ - `test_ratchet_out_of_order_too_far_rejected`
437
+ - `test_rekey_after_n_messages`
438
+ - `test_group_sender_key_distribution_pairwise_only`
439
+ - `test_blob_envelope_recipient_only_can_decrypt`
440
+
441
+ ### Integration
442
+ - `test_two_node_first_message_x3dh_session_persists`
443
+ - `test_session_recovery_after_disk_wipe`
444
+ - `test_group_add_member_can_decrypt_subsequent`
445
+ - `test_group_remove_member_cannot_decrypt_after_rotation`
446
+ - `test_file_envelope_2_recipients`
447
+
448
+ ### Adversarial
449
+ - `test_replay_old_ratchet_message_rejected`
450
+ - `test_modified_ciphertext_decrypt_fails`
451
+ - `test_one_time_prekey_consumed_once`
452
+
453
+ ---
454
+
455
+ ## 9. Cross-references
456
+
457
+ | What | Where |
458
+ |------|-------|
459
+ | `e2e.*` events | [CAP2 §7](../CAPABILITY_CONTRACT_v2.md) |
460
+ | Encrypted chat body envelope | [CAP2 §1.1, §7.2 chat.message.sent](../CAPABILITY_CONTRACT_v2.md) |
461
+ | Chat service hook | [M10 ext](../../modules/M10-chat.md) (Phase 2 extension) |
462
+ | Group chat | [M25](M25-group-chat.md) |
463
+ | File envelope use | [M07 ext](../../modules/M07-file-blobs.md) |
464
+ | Identity key conversion | [M01](../../modules/M01-identity.md) |
465
+
466
+ ---
467
+
468
+ ## 10. Open questions
469
+
470
+ 1. **Post-quantum readiness.** X25519 + ChaCha20-Poly1305 is not PQ-safe. Hybrid (X25519 + ML-KEM-768) is Phase 3.
471
+ 2. **Verification of session identity.** Signal does safety numbers; HearthNet can do the same. UI ergonomics deferred.
472
+ 3. **Multi-device per identity.** If a user has anchor + mobile + laptop, do they share keys or have separate ones? Currently separate (each device is a separate NodeID; group threads include all of them). Could unify with a "linked devices" Phase 3 feature.
473
+ 4. **Forward secrecy on group membership change.** Current spec asks members to rotate sender keys on removal. UX of forcing this needs design.
474
+ 5. **Cryptographic auditing.** This module should be reviewed by a real cryptographer before going to civil-defence pilots. Listed in `THREAT_MODEL_v2.md`.
docs/p2_p3/M24-rerank.md ADDED
@@ -0,0 +1,227 @@
1
+ # M24 - Reranking Service
2
+
3
+ **Spec version:** v2.0
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M01 Identity](../../modules/M01-identity.md), [X03 Observability](../../cross-cutting/X03-observability.md)
5
+ **Depended on by:** [M05 RAG](../../modules/M05-rag.md) (extension), [M06 Marketplace](../../modules/M06-marketplace.md) (extension)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Re-score a candidate list of documents against a query using a cross-encoder, producing a higher-precision ordering than dense retrieval alone can deliver.
12
+
13
+ The capability is intentionally narrow: take query + N short docs, return ranked list. The service does **not** retrieve documents, does **not** fetch from blobs, does **not** know about corpora. Callers (typically `rag.query` and `market.search`) do retrieval first, then ask the reranker to refine the top 100.
14
+
15
+ This is the smallest service in Phase 2 (one model, one method, no streaming) and the most underrated. Adding it to the RAG pipeline is expected to lift answer quality more than any other Phase 2 module.
16
+
17
+ ---
18
+
19
+ ## 2. File layout
20
+
21
+ ```
22
+ hearthnet/services/rerank/
23
+ ├── __init__.py
24
+ ├── service.py # RerankService: capability registration
25
+ ├── selection.py # Picks a backend; loads on demand
26
+ └── backends/
27
+ ├── base.py # RerankBackend interface
28
+ ├── bge_reranker.py # BGE-reranker-v2-m3 (default)
29
+ └── cross_encoder.py # Sentence-transformers fallback
30
+ ```
31
+
32
+ ---
33
+
34
+ ## 3. Public API
35
+
36
+ ### 3.1 `RerankBackend` (Protocol)
37
+
38
+ ```python
39
+ class RerankBackend(Protocol):
40
+ name: str # e.g. "BAAI/bge-reranker-v2-m3"
41
+ max_doc_chars: int # truncate longer docs
42
+
43
+ async def score(self, query: str, documents: list[str]) -> list[float]:
44
+ """Return one score per document, same length and order."""
45
+ ...
46
+
47
+ async def health(self) -> dict[str, Any]:
48
+ """Return {"ok": bool, "loaded": bool, "model_id": ..., ...}."""
49
+ ...
50
+ ```
51
+
52
+ ### 3.2 `BgeRerankerBackend`
53
+
54
+ ```python
55
+ class BgeRerankerBackend:
56
+ def __init__(self, model_id: str = "BAAI/bge-reranker-v2-m3", device: str = "auto", max_batch: int = 32):
57
+ ...
58
+
59
+ async def score(self, query: str, documents: list[str]) -> list[float]:
60
+ # Tokenise (query, doc) pairs in batches of max_batch
61
+ # Forward pass; pooled logit becomes the score
62
+ # Higher = more relevant
63
+ ...
64
+ ```
65
+
66
+ ### 3.3 `RerankService`
67
+
68
+ ```python
69
+ class RerankService:
70
+ """Bus-facing facade. Picks backend, enforces RERANK_MAX_DOCS, emits metrics."""
71
+
72
+ def __init__(self, bus: CapabilityBus, settings: RerankSettings, observability: Observability):
73
+ ...
74
+
75
+ async def start(self) -> None:
76
+ # Register `rerank.text@1.0` on the bus
77
+ ...
78
+
79
+ async def rerank_text(self, body: RerankRequest) -> RerankResponse:
80
+ # 1. Validate len(documents) <= RERANK_MAX_DOCS (else bad_request)
81
+ # 2. Pick backend per `body.params.model` or default
82
+ # 3. Truncate docs to backend.max_doc_chars
83
+ # 4. Call backend.score
84
+ # 5. Sort descending, take top_k (or all)
85
+ # 6. Emit `rerank.latency_ms` metric
86
+ ...
87
+ ```
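The six steps in `rerank_text` can be sketched end to end with a toy backend. Everything here is illustrative: `FixedScoreBackend` and `BadRequest` are hypothetical, the function takes the query and documents directly instead of a `RerankRequest`, backend selection and metrics are omitted, and the value of `RERANK_MAX_DOCS` is assumed because the spec names the constant without giving it.

```python
import asyncio
from dataclasses import dataclass

RERANK_MAX_DOCS = 100  # assumed value; not given in the spec


class BadRequest(Exception):
    """Hypothetical stand-in for the `bad_request` error code."""


@dataclass
class RerankDoc:
    id: str
    text: str


@dataclass
class RerankedDoc:
    id: str
    score: float


class FixedScoreBackend:
    """Toy backend returning preset scores; stands in for a real cross-encoder."""

    name = "toy/fixed-scores"
    max_doc_chars = 2048

    def __init__(self, scores):
        self._scores = list(scores)
        self.seen = []  # records what the backend was asked to score

    async def score(self, query, documents):
        self.seen = list(documents)
        return self._scores[: len(documents)]


async def rerank_text(backend, query, documents, top_k=10):
    # 1. Validate the request.
    if not query or not documents or len(documents) > RERANK_MAX_DOCS:
        raise BadRequest("empty query, empty documents, or too many documents")
    # 3. Truncate from the right before scoring.
    texts = [doc.text[: backend.max_doc_chars] for doc in documents]
    # 4. One score per document, same order as the input.
    scores = await backend.score(query, texts)
    # 5. Sort descending and keep at most top_k.
    ranked = [RerankedDoc(doc.id, s) for doc, s in zip(documents, scores)]
    ranked.sort(key=lambda r: r.score, reverse=True)
    return ranked[:top_k]
```

Running it with scores `[0.1, 0.9, 0.5]` for documents `"0"`, `"1"`, `"2"` yields the order `"1"`, `"2"`, `"0"`, matching the sorting case listed under the unit tests.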
88
+
89
+ ### 3.4 Request / response dataclasses
90
+
91
+ ```python
92
+ @dataclass
93
+ class RerankDoc:
94
+ id: str
95
+ text: str
96
+
97
+ @dataclass
98
+ class RerankRequest:
99
+ query: str
100
+ documents: list[RerankDoc]
101
+ top_k: int = 10
102
+ params: dict[str, Any] = field(default_factory=dict) # {"model": "..."}
103
+
104
+ @dataclass
105
+ class RerankedDoc:
106
+ id: str
107
+ score: float
108
+
109
+ @dataclass
110
+ class RerankResponse:
111
+ ranked: list[RerankedDoc]
112
+ meta: dict[str, Any]
113
+ ```
114
+
115
+ ---
116
+
117
+ ## 4. Behaviour
118
+
119
+ ### 4.1 Backend selection
120
+
121
+ `params.model` is matched against installed backends (key = Hugging Face model id). Default is `BAAI/bge-reranker-v2-m3` because it handles ≥100 languages including German and Latin (relevant for the OCR'd historical doc corpus).
122
+
123
+ If `params.model` is supplied but unknown → return `model_not_found` (see section 5) with the list of installed backends.
124
+
125
+ ### 4.2 Cold start
126
+
127
+ Backend is loaded lazily on first call. First call latency budget: ≤ 60s on the RTX 5090 (model ~2 GB on disk). Subsequent calls: ≤ 200ms for 50 docs at ~512 chars each.
128
+
129
+ The service publishes `model_loaded` and `model_loading` health states; `rerank.text` calls during loading wait up to `RERANK_LOAD_TIMEOUT_SECONDS` (default 60) then return `unavailable`.
130
+
131
+ ### 4.3 Score semantics
132
+
133
+ Scores are **raw logits**, not normalised probabilities. They are comparable within a single call but not across calls or backends. Callers MUST NOT compare a 0.91 score from BGE to a 0.91 from cross-encoder/ms-marco, because the scales differ.
134
+
135
+ ### 4.4 Truncation
136
+
137
+ Documents longer than `backend.max_doc_chars` (default 2048) are truncated, and the service increments the `rerank.docs_truncated` counter. Truncation is from the right (the start of the document is kept); callers who care about specific spans should pre-summarise or chunk before passing documents in.
138
+
139
+ ### 4.5 No streaming
140
+
141
+ `rerank.text@1.0` is non-streaming. Even at 100 docs the latency is well under 1s on GPU. If a Phase-3 use case demands streaming (e.g. 1000-doc reranks for academic search), introduce `rerank.text@2.0` with `progress` frames; do not retrofit v1.
142
+
143
+ ### 4.6 Integration with RAG (M05 extension)
144
+
145
+ `rag.query` in Phase 2 grows an internal pipeline:
146
+
147
+ ```
148
+ 1. Hybrid retrieval (dense + BM25) → top 100 candidates
149
+ 2. Optional call to rerank.text@1.0 → top 10
150
+ 3. Pass top 10 to llm.chat as context
151
+ ```
152
+
153
+ The hop to `rerank.text` is done via the bus, not via direct import. This keeps the policy ("which model?", "is reranking available?") in the service and out of the RAG core.
154
+
155
+ If `rerank.text@1.0` is unavailable in the local mesh, RAG falls back to dense scores alone and increments the `rag.rerank_skipped` counter (this is not an error).
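The fallback behaviour can be sketched as a small function. This is an assumption-laden illustration: `select_context`, `Unavailable`, and the `rerank` callable are hypothetical stand-ins for the bus call to `rerank.text@1.0`, and a plain dictionary stands in for the observability counters.

```python
import asyncio


class Unavailable(Exception):
    """Hypothetical stand-in for `unavailable` or a missing provider on the bus."""


async def select_context(candidates, rerank, counters, top_k=10):
    """candidates: (doc_id, retrieval_score) pairs from first-stage retrieval."""
    try:
        ordered_ids = await rerank([doc_id for doc_id, _ in candidates])
    except Unavailable:
        # Not an error: count the skip and order by the retrieval score instead.
        counters["rag.rerank_skipped"] = counters.get("rag.rerank_skipped", 0) + 1
        by_score = sorted(candidates, key=lambda pair: pair[1], reverse=True)
        ordered_ids = [doc_id for doc_id, _ in by_score]
    return ordered_ids[:top_k]


async def reverse_reranker(doc_ids):
    return list(reversed(doc_ids))  # toy reranker, only for the demo


async def offline_reranker(doc_ids):
    raise Unavailable("rerank.text@1.0 not reachable")
```

Because the caller only sees an ordered list of ids, the LLM step does not need to know whether reranking actually happened.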
156
+
157
+ ### 4.7 Integration with Marketplace (M06 extension)
158
+
159
+ `market.search` follows the same pattern when the query is natural-language. For tag-based queries it skips reranking.
160
+
161
+ ---
162
+
163
+ ## 5. Errors
164
+
165
+ | Code | Cause |
166
+ |------|-------|
167
+ | `bad_request` | `len(documents) > RERANK_MAX_DOCS`, empty query, malformed payload |
168
+ | `unavailable` | Backend loading or hardware unavailable |
169
+ | `model_not_found` | Requested `params.model` is not installed |
170
+
171
+ `unavailable` is retryable; the other two are not.
172
+
173
+ ---
174
+
175
+ ## 6. Configuration
176
+
177
+ ```toml
178
+ [services.rerank]
179
+ enabled = true
180
+ default_model = "BAAI/bge-reranker-v2-m3"
181
+ device = "auto" # "auto" | "cuda" | "cpu"
182
+ max_batch = 32
183
+ max_doc_chars = 2048
184
+ load_timeout_seconds = 60
185
+ trust_required = "member"
186
+ ```
187
+
188
+ Behind a feature flag: when `enabled=false`, the capability simply does not register and RAG falls back to dense-only.
189
+
190
+ ---
191
+
192
+ ## 7. Tests
193
+
194
+ ### 7.1 Unit
195
+ - Sorting: scores `[0.1, 0.9, 0.5]` produce ranked order `[1, 2, 0]`
196
+ - Truncation: 4000-char doc gets truncated to 2048 before scoring
197
+ - `top_k` honoured; returns at most `top_k` results
198
+ - Bad request when `documents=[]` or `len > RERANK_MAX_DOCS`
199
+
200
+ ### 7.2 Integration
201
+ - End-to-end: `rag.query` with reranking vs without, on the niederrhein-emergency corpus, asserts at least one expected document moves into top 3 with rerank that wasn't there without
202
+ - Cross-language: German query, mixed German/English candidates, BGE reranker should put the German candidate first when relevance is equal
203
+
204
+ ### 7.3 Performance
205
+ - 100 docs @ 1024 chars: p50 ≤ 300ms on RTX 5090; p95 ≤ 600ms
206
+ - CPU fallback (no GPU): p50 ≤ 4s for 50 docs (acceptable; degraded)
207
+
208
+ ### 7.4 Failure-mode
209
+ - Backend crash mid-call: caller receives `unavailable`; service self-heals on next call
210
+ - Concurrent calls: 20 parallel reranks should not deadlock; backend serialises behind a single semaphore
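The concurrency case can be illustrated with `asyncio.Semaphore`. `SerialisedBackend`, `fake_score`, and `run_parallel_reranks` are hypothetical names; the sketch only shows that 20 parallel calls complete while at most one is inside the backend at any time.

```python
import asyncio


class SerialisedBackend:
    """Wraps a scoring coroutine so that only one call runs at a time."""

    def __init__(self, inner_score, max_concurrent: int = 1) -> None:
        self._inner_score = inner_score
        self._gate = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def score(self, query, documents):
        async with self._gate:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await self._inner_score(query, documents)
            finally:
                self.in_flight -= 1


async def fake_score(query, documents):
    await asyncio.sleep(0)  # yield control so other calls could interleave
    return [0.0] * len(documents)


async def run_parallel_reranks(n_calls: int = 20):
    backend = SerialisedBackend(fake_score)  # built inside the running loop
    results = await asyncio.gather(
        *(backend.score("q", ["a", "b"]) for _ in range(n_calls))
    )
    return backend.peak_in_flight, results
```

Using `async with` releases the semaphore even if the backend raises, which is what keeps a crashed call from blocking later ones.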
211
+
212
+ ---
213
+
214
+ ## 8. Cross-references
215
+
216
+ - Capability spec: [CAPABILITY_CONTRACT_v2 §4.15](../CAPABILITY_CONTRACT_v2.md#415-reranktext10)
217
+ - Used by: M05 RAG extension, M06 Marketplace extension
218
+ - Observability: emits `rerank.calls_total`, `rerank.latency_ms`, `rerank.docs_truncated`, `rerank.errors_total{code}`
219
+
220
+ ---
221
+
222
+ ## 9. Open questions
223
+
224
+ 1. **Reciprocal rank fusion** with dense scores as the alternative when rerank is unavailable: worth implementing in M05 as the fallback path?
225
+ 2. **ColBERT-style late interaction**: heavier model, higher quality. Worth a second backend, or wait for Phase 3 to evaluate?
226
+ 3. **Reranker for code/diff content**: different model family (e.g. `BAAI/bge-code-reranker`). Should `params.model` selection be auto-inferred from query/doc content?
227
+ 4. **Caching**: query+doc-pair hash → score, evict LRU. Worth it for repeated queries in chat-driven RAG sessions, or premature optimisation?
docs/p2_p3/M25-group-chat.md ADDED
@@ -0,0 +1,293 @@
1
+ # M25 - Group Chat
2
+
3
+ **Spec version:** v2.0
4
+ **Depends on:** [M10 Chat 1:1](../../modules/M10-chat.md), [M23 E2E Encryption](M23-e2e-encryption.md), [M16 Capability Tokens](M16-tokens.md), [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md)
5
+ **Depended on by:** UI (web + M22 mobile), [M14 Federation](M14-federation.md) (cross-community threads)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Multi-party threaded conversations with the same guarantees as 1:1 chat: end-to-end encryption (optional but default on), event-log-anchored history, no central server required, members can come and go.
12
+
13
+ A thread is a long-lived object identified by a ULID. It has an authoritative member list maintained in the event log, an encryption "group session" (M23 sender keys), and message history. Threads do not currently support reactions, replies, threading-within-thread, or rich content. Those are explicit non-goals for Phase 2 and may arrive in Phase 3 once usage informs design.
14
+
15
+ Group threads are the substrate the **Nachbarschaftshilfe** (neighbourhood help) use case wants: a Sankt-Martins-Comité planning thread, a "Wer hat Werkzeug?" ("Who has tools?") workshop thread, a household coordination thread between Christof, Jana, and grandparents.
16
+
17
+ ---
18
+
19
+ ## 2. File layout
20
+
21
+ ```
22
+ hearthnet/services/chat/
23
+ ├── thread_service.py # ThreadService: capability registration & dispatch
24
+ ├── thread_views.py # Materialised views: thread list, member list, history
25
+ ├── thread_store.py # Read-only projections; not the source of truth
26
+ ├── group_session.py # Wraps M23 sender keys for a thread
27
+ └── moderation.py # Phase-2: remove-member, archive (minimal)
28
+ ```
29
+
30
+ ---
31
+
32
+ ## 3. Public API
33
+
34
+ ### 3.1 Dataclasses
35
+
36
+ ```python
37
+ @dataclass(frozen=True)
38
+ class Thread:
39
+ thread_id: ThreadID
40
+ name: str
41
+ created_at: datetime
42
+ created_by: NodeID
43
+ members: frozenset[NodeID]
44
+ e2e_enabled: bool
45
+ ratchet_root: str | None # x25519 pubkey of group session root, None if cleartext
46
+ archived: bool
47
+
48
+ @dataclass(frozen=True)
49
+ class ThreadMessage:
50
+ event_id: EventID
51
+ thread_id: ThreadID
52
+ client_id: ClientID
53
+ sender: NodeID
54
+ sent_at: datetime
55
+ body: str | None # cleartext if e2e_enabled=False
56
+ encrypted: EncryptedPayload | None
57
+ attachments: list[Attachment]
58
+ delivered_to: frozenset[NodeID] # tracked via chat.thread.message.delivered events
59
+ ```
60
+
61
+ ### 3.2 `ThreadService`
62
+
63
+ ```python
64
+ class ThreadService:
65
+ """Capability handlers for chat.thread.*"""
66
+
67
+ def __init__(
68
+ self,
69
+ bus: CapabilityBus,
70
+ event_log: EventLog,
71
+ identity: Identity,
72
+ encryption: EncryptionService, # M23
73
+ view_store: ThreadViewStore,
74
+ observability: Observability,
75
+ ): ...
76
+
77
+ async def start(self) -> None:
78
+ # Registers: chat.thread.create, .send, .history, .leave, .add_member, .archive
79
+ ...
80
+
81
+ # --- handlers (selected) ---
82
+ async def create(self, body: CreateThreadBody) -> CreateThreadResult: ...
83
+ async def send(self, body: SendThreadBody) -> SendThreadResult: ...
84
+ async def history(self, body: HistoryBody) -> HistoryResult: ...
85
+ async def leave(self, body: LeaveBody) -> LeaveResult: ...
86
+ async def add_member(self, body: AddMemberBody) -> AddMemberResult: ...
87
+ async def archive(self, body: ArchiveBody) -> ArchiveResult: ...
88
+ ```
89
+
90
+ ### 3.3 `ThreadViewStore`
91
+
92
+ ```python
93
+ class ThreadViewStore:
94
+ """Read model. Backed by SQLite; rebuilt from the event log on cold start."""
95
+
96
+ def list_for_member(self, node_id: NodeID) -> list[Thread]: ...
97
+ def get_thread(self, thread_id: ThreadID) -> Thread | None: ...
98
+ def get_messages(self, thread_id: ThreadID, since_lamport: int = 0, limit: int = 200) -> list[ThreadMessage]: ...
99
+ def members_of(self, thread_id: ThreadID) -> frozenset[NodeID]: ...
100
+
101
+ # Internal: subscribed to the event log
102
+ async def apply(self, event: Event) -> None: ...
103
+ ```
104
+
105
+ ### 3.4 `GroupSession`
106
+
107
+ Thin wrapper around M23 sender keys; one per thread.
108
+
109
+ ```python
110
+ class GroupSession:
111
+ def __init__(self, thread_id: ThreadID, ratchet: SenderKeyRatchet): ...
112
+
113
+ def encrypt(self, plaintext: bytes) -> EncryptedPayload: ...
114
+ def decrypt(self, sender: NodeID, payload: EncryptedPayload) -> bytes: ...
115
+ def rekey(self) -> None: ...
116
+ def add_member(self, new_member: NodeID, their_identity_pubkey: bytes) -> None: ...
117
+ def remove_member(self, leaving_member: NodeID) -> None: ...
118
+ ```
119
+
120
+ ---
121
+
122
+ ## 4. Behaviour
123
+
124
+ ### 4.1 Thread creation
125
+
126
+ `chat.thread.create@1.0` flow:
127
+
128
+ 1. Caller emits `chat.thread.created` event into the event log with:
129
+ - `thread_id` (newly minted ULID)
130
+ - initial member list
131
+ - `e2e_enabled` flag
132
+ - if e2e_enabled: a freshly generated `ratchet_root_pubkey` and a per-member encrypted **sender key** payload (see [M23 §6.3](M23-e2e-encryption.md))
133
+ 2. Each member's node sees the event arrive, decrypts the sender key payload addressed to itself, and constructs the GroupSession.
134
+ 3. The view store materialises a new Thread row.
135
+
136
+ If any member is offline at creation, they will receive the event when they next sync. Their GroupSession constructs lazily on first decrypt.
137
+
138
+ ### 4.2 Sending
139
+
140
+ `chat.thread.send@1.0` flow:
141
+
142
+ 1. Verify caller is in `Thread.members`.
143
+ 2. If `e2e_enabled`, encrypt body with the GroupSession's current sender key. The ciphertext is opaque to the event log: even other community members who are not in the thread cannot read it.
144
+ 3. Emit `chat.thread.message.sent` event.
145
+ 4. The event reaches all members (regular event-log propagation, no thread-specific transport).
146
+ 5. Each member's GroupSession decrypts; the message appears in their UI.
147
+
148
+ ### 4.3 Membership changes
149
+
150
+ #### Adding a member
151
+
152
+ 1. Any existing member can issue `chat.thread.add_member` (Phase 2; later phases may add policies like "only admin can add").
153
+ 2. The caller's GroupSession is **rekeyed**: a new sender key is generated, encrypted under each existing member's pubkey and the new member's pubkey, and emitted in the `chat.thread.member.added` event.
154
+ 3. The new member cannot read **prior** messages, because they joined at the new epoch. (This is by design and standard for sender-key group encryption: forward-secrecy is preserved.) Old messages remain encrypted with the old sender key, which the new member never sees.
155
+
156
+ #### Removing a member
157
+
158
+ `chat.thread.remove_member` (or self-leave via `chat.thread.leave`):
159
+
160
+ 1. Emit `chat.thread.member.removed`.
161
+ 2. The remaining members rekey the GroupSession (similar to add but excluding the removed member). New messages are not readable by the removed member.
162
+ 3. The removed member's UI marks the thread as "you left" and stops decrypting incoming messages. Their event log still contains the old messages, which they can still read; they cannot read new ones.
163
+
164
+ ### 4.4 History
165
+
166
+ `chat.thread.history@1.0`:
167
+
168
+ - **Self-only** capability (you can only ask for history of threads you're a member of).
169
+ - Returns from local view store. No cross-node query needed, since every member already has the events.
170
+ - Pagination by `since_lamport` + `limit`. Messages return in **logical (Lamport) order**, not wall-clock order, to match what other members will see.
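Pagination by `since_lamport` and `limit` might look like the sketch below. `StoredMessage` and `fetch_all` are hypothetical, and treating `since_lamport` as an exclusive lower bound is an assumption. Lamport values can tie across senders, so a production cursor would likely need a tie-breaker such as the event id; the demo sidesteps this by using unique values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredMessage:
    lamport: int
    event_id: str


def get_messages(messages, since_lamport=0, limit=200):
    """Return up to `limit` messages with lamport > since_lamport, in Lamport order."""
    newer = [m for m in messages if m.lamport > since_lamport]
    newer.sort(key=lambda m: (m.lamport, m.event_id))
    return newer[:limit]


def fetch_all(messages, limit=200):
    """Walk the whole history page by page, advancing the Lamport cursor."""
    cursor = 0
    collected = []
    pages = 0
    while True:
        page = get_messages(messages, since_lamport=cursor, limit=limit)
        if not page:
            break
        collected.extend(page)
        cursor = page[-1].lamport
        pages += 1
    return collected, pages
```

For 1000 messages and a limit of 200 the walk takes five non-empty pages, which is the scenario named in the unit tests.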
171
+
172
+ ### 4.5 Read-receipts / delivery tracking
173
+
174
+ Each member's node emits `chat.thread.message.delivered` (lightweight, no payload beyond `event_id` reference) when they materialise a message. UI shows "delivered to 4/5" by counting these events. Optional: `policy.chat.delivery_receipts_enabled` (default true) controls whether they're emitted.
175
+
176
+ ### 4.6 Archiving
177
+
178
+ `chat.thread.archived` is a soft state. Archived threads are hidden from the default thread list, no longer rekey on membership change, and no longer accept sends. Members can still read history. An archived thread can be unarchived by any member.
179
+
180
+ There is no "delete thread". Events are immutable. A thread that is archived and whose messages are all expired (via X02 retention policies) becomes effectively gone.
181
+
182
+ ### 4.7 Attachments
183
+
184
+ `attachments` carry `cid` (blob CID) and `name`. The blob itself is uploaded via `file.put` separately. Members of the thread are by definition authorised to fetch the blob; the bus enforces this via a capability token issued automatically when sending an attachment in an E2E thread:
185
+
186
+ ```
187
+ On send-with-attachment:
188
+ 1. Service issues a short-lived (24h) token via M16 with:
189
+ scope.capabilities = ["file.fetch@1.0"]
190
+ scope.params_constraints.cid = [attachment.cid]
191
+ audience = thread.members (excluding self)
192
+ 2. Token is included in the encrypted message body.
193
+ 3. Recipients use the token when fetching the blob from whichever node holds it.
194
+ ```
195
+
196
+ This avoids the "file is restricted but everyone in the thread should access it" coordination problem.
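The scope check implied by the attachment flow can be sketched as plain data plus a predicate. This is only an illustration of the scoping logic: `AttachmentToken`, `issue_attachment_token`, and `token_permits` are hypothetical names, and signatures, revocation, and the real M16 token format are deliberately left out.

```python
from dataclasses import dataclass

ATTACHMENT_TOKEN_TTL_SECONDS = 24 * 60 * 60  # "short-lived (24h)" in the flow above


@dataclass(frozen=True)
class AttachmentToken:
    capabilities: frozenset
    allowed_cids: frozenset
    audience: frozenset
    expires_at: float


def issue_attachment_token(sender, members, cid, now):
    """Scope a token to one blob and to every thread member except the sender."""
    return AttachmentToken(
        capabilities=frozenset({"file.fetch@1.0"}),
        allowed_cids=frozenset({cid}),
        audience=frozenset(members) - {sender},
        expires_at=now + ATTACHMENT_TOKEN_TTL_SECONDS,
    )


def token_permits(token, caller, capability, cid, now):
    """True only if caller, capability, CID, and time all fall inside the scope."""
    return (
        now < token.expires_at
        and caller in token.audience
        and capability in token.capabilities
        and cid in token.allowed_cids
    )
```

A node holding the blob would run a check like `token_permits` before serving it, after verifying the token's signature through whatever mechanism M16 defines.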
197
+
198
+ ### 4.8 Federation of threads
199
+
200
+ A thread MAY include members from federated communities. Mechanics:
201
+
202
+ - The thread's `community_id` (the one in event headers) is the *creator's* community.
203
+ - Members from federated communities subscribe to the thread's events via the standard federation event-bridge (see [M14 §6](M14-federation.md)).
204
+ - Federated members are full participants (they can send, leave, be removed), provided the federation manifest grants `chat.thread.send@1.0`.
205
+ - The view store on a federated member's node carries the foreign-community thread alongside their local-community threads, distinguished by `Thread.community_id` field for UI purposes.
206
+
207
+ If federation is revoked, foreign members are silently removed from the thread on the next rekey.
208
+
209
+ ### 4.9 Throughput and limits
210
+
211
+ - `THREAD_MAX_MEMBERS = 200` (Phase-2 conservative; larger groups should be a different module).
212
+ - `THREAD_MAX_MESSAGE_BYTES = 64 * 1024` for the cleartext body.
213
+ - `THREAD_RATE_LIMIT_PER_SENDER_PER_MINUTE = 60` (anti-spam, enforced by ThreadService).
214
+ - Beyond these → `bad_request` or `too_many_requests`.
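The per-sender rate limit could be enforced with a sliding window, sketched below. The spec fixes only the limit of 60 per minute; the sliding-window choice, the `SenderRateLimiter` name, and passing `now` explicitly (to keep the example deterministic) are assumptions.

```python
from collections import defaultdict, deque

THREAD_RATE_LIMIT_PER_SENDER_PER_MINUTE = 60  # from the limits listed above


class SenderRateLimiter:
    """Sliding-window limiter: at most `limit` sends per sender per window."""

    def __init__(self, limit=THREAD_RATE_LIMIT_PER_SENDER_PER_MINUTE, window_seconds=60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # sender -> timestamps of accepted sends

    def allow(self, sender: str, now: float) -> bool:
        timestamps = self._sent[sender]
        # Forget sends that have left the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # caller would answer with too_many_requests
        timestamps.append(now)
        return True
```

Rejected sends are not recorded, so a sender who keeps retrying does not extend their own lockout.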
215
+
216
+ ---
217
+
218
+ ## 5. Errors
219
+
220
+ | Code | Cause |
221
+ |------|-------|
222
+ | `bad_request` | Empty member list, malformed body, member list contains caller twice |
223
+ | `unauthorized` | Caller not a member of the thread (for send/history/leave/add) |
224
+ | `not_found` | `thread_id` unknown |
225
+ | `e2e_session_missing` | Caller has no GroupSession yet (sender keys not received) |
226
+ | `e2e_decrypt_failed` | Local key state corrupt; UI should prompt for a manual rekey |
227
+ | `too_many_requests` | Rate limit exceeded |
228
+ | `policy_violation` | E.g. trying to add member outside of federation scope |
229
+
230
+ ---
231
+
232
+ ## 6. Configuration
233
+
234
+ ```toml
235
+ [services.chat.thread]
236
+ enabled = true
237
+ max_members = 200
238
+ max_message_bytes = 65536
239
+ rate_limit_per_sender_per_minute = 60
240
+ delivery_receipts_enabled = true
241
+ allow_federated_members = true
242
+
243
+ [services.chat.thread.archival]
244
+ auto_archive_after_days_idle = 0 # 0 = never auto-archive
245
+ ```
246
+
247
+ ---
248
+
249
+ ## 7. Tests
250
+
251
+ ### 7.1 Unit
252
+ - Create thread with 3 members; verify GroupSession is constructable by each member from the `chat.thread.created` payload
253
+ - Send + decrypt round-trip
254
+ - Add member; old messages remain undecryptable for them, new ones work
255
+ - Remove member; their session can't decrypt new messages
256
+ - Self-leave; cleanup is graceful (no orphan state)
257
+ - History pagination: 1000 messages, fetch 200 + 200 + 200... covers all
258
+
259
+ ### 7.2 Integration
260
+ - Three nodes on one LAN form a thread; messages propagate via gossip
261
+ - Same with one member partitioned; their replay on reconnect works
262
+ - E2E on/off threads coexist; switching one to the other is not supported (must create a new thread)
263
+ - Federation: a federated peer's node receives the `chat.thread.created` event via the bridge and constructs a working GroupSession
264
+
265
+ ### 7.3 Adversarial
266
+ - A non-member tries to call `chat.thread.send` → `unauthorized`
267
+ - A non-member subscribes to `chat.thread.message.<id>` pubsub: they receive encrypted blobs they cannot decrypt (no information leaks beyond traffic patterns and the member list)
268
+ - Replay: an old `chat.thread.message.sent` event replayed by an IP-level adversary is rejected via the per-message nonce in the E2E header
269
+ - Rekey storm: 100 sequential add/remove operations finish within 30s on the dev rig; no deadlock
270
+
271
+ ### 7.4 Performance
272
+ - 50-member thread, 1 msg/s: p95 deliver-to-decrypt latency < 500ms on LAN
273
+ - History fetch of 10,000 messages: < 2s on SSD
274
+
275
+ ---
276
+
277
+ ## 8. Cross-references
278
+
279
+ - Capability spec: [CAPABILITY_CONTRACT_v2 §4.16–4.19](../CAPABILITY_CONTRACT_v2.md)
280
+ - Encryption primitives: [M23 §6 sender keys](M23-e2e-encryption.md)
281
+ - Event types: [CAPABILITY_CONTRACT_v2 §7.1](../CAPABILITY_CONTRACT_v2.md#71-new-event-types)
282
+ - Federation: [M14 §6](M14-federation.md)
283
+
284
+ ---
285
+
286
+ ## 9. Open questions
287
+
288
+ 1. **Reactions / replies / rich content**: explicitly out of scope for Phase 2. Worth a survey of community use before designing. (Likely a Phase 3 add-on, gated on "are people actually asking for it?")
289
+ 2. **Per-thread retention policy**: currently inherits the community-wide retention. Different threads might want different policies (planning thread = 30 days, household chat = forever).
290
+ 3. **Read-only threads** (announcements): a pseudo-thread where only one member can send. Worth a flag, or worth a dedicated capability?
291
+ 4. **Thread search**: could plug into `rag.*`. Indexing of decrypted message text would be opt-in per thread; raises privacy concerns.
292
+ 5. **Cross-thread mentions / linking**: e.g. "see thread X for context". Probably a UI affordance (markdown link), not a protocol feature.
293
+ 6. **Disappearing messages**: Signal-style auto-expiry per thread. Useful for sensitive coordination; adds complexity. Phase 3 candidate.
docs/p2_p3/M26-distributed-inference.md ADDED
@@ -0,0 +1,325 @@
1
+ # M26: Distributed Inference
2
+
3
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [X08 Tensor Transport](../cross-cutting/X08-tensor-transport.md), [X06 WebSocket](../../phase-2/cross-cutting/X06-websocket.md), [M04 LLM](../../modules/M04-llm.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [X03 Observability](../../cross-cutting/X03-observability.md)
5
+ **Depended on by:** Optional `experimental.distributed_llm.chat` backend in M04
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Run a single LLM forward pass across multiple machines on the same LAN (or, in extremely careful setups, a single federation). Take a 7B model that doesn't fit on any one anchor's GPU and split it: anchor A holds layers 0–7, anchor B holds 8–15, anchor C holds 16–23, etc. A request orchestrator chains them and streams tokens back to the user.
12
+
13
+ This is a **research module**. It exists for two reasons:
14
+
15
+ 1. **Resilience.** When a community's biggest GPU breaks, the next-biggest fleet of GPUs can still serve mid-sized models cooperatively.
16
+ 2. **Reach.** A community of three households, each with consumer hardware, can collectively run a model none of them could run alone.
17
+
18
+ It is explicitly **not** for serving production user-facing LLM traffic at scale. Per-token latency is worse than local inference (typically 2–4× slower), the orchestration is fragile (one shard going offline means retrying the whole pipeline), and the GPU memory savings come at a significant complexity cost. Communities should default to local inference; this module exists for the cases where local isn't enough.
19
+
20
+ ---
21
+
22
+ ## 2. Non-goals (loud and clear)
23
+
24
+ - **Large models.** 70B-class models are out of scope. The math says you'd need ten 24 GB GPUs to host one, which is the wrong problem for a neighbourhood mesh to solve.
25
+ - **Cross-WAN sharding.** Inference across the public internet is uneconomical (latency, bandwidth). Limit to same LAN or same-VPN federation.
26
+ - **Heterogeneous shards across model versions.** All shards in a pipeline must serve the **exact same model and weights checksum**. No partial-model recovery.
27
+ - **Replacing local inference.** When `policy.research.enable = false`, this module is inert.
28
+
29
+ ---
30
+
31
+ ## 3. File layout
32
+
33
+ ```
34
+ hearthnet/distributed_inference/
35
+ ├── __init__.py
36
+ ├── shard.py                    # Shard, ShardDescriptor, ShardServer
37
+ ├── pipeline.py                 # Pipeline, PipelineOrchestrator
38
+ ├── routing.py                  # Picks a set of shards that cover [0..N] layers
39
+ ├── health.py                   # Heartbeats, failover detection
40
+ └── backends/
41
+     ├── base.py
42
+     ├── petals_like.py          # uses bigscience/petals client/server primitives
43
+     └── small_model_layered.py  # custom impl for small models (≤ 3B)
44
+ ```
45
+
46
+ ---
47
+
48
+ ## 4. Public API
49
+
50
+ ### 4.1 Dataclasses
51
+
52
+ ```python
53
+ @dataclass(frozen=True)
54
+ class ShardDescriptor:
55
+     shard_id: ShardID              # "<model_id>:<lo>-<hi>"
56
+     model_id: str                  # HF model id
57
+     weights_sha256: str            # full model weights hash; shards must match
58
+     layer_range: tuple[int, int]   # inclusive
59
+     vram_required_mb: int
60
+     max_concurrent_streams: int
61
+     host: NodeID
62
+     endpoint: Endpoint             # ws://...
63
+     advertised_at: datetime
64
+
65
+ @dataclass
66
+ class Pipeline:
67
+     pipeline_id: str
68
+     model_id: str
69
+     weights_sha256: str
70
+     total_layers: int
71
+     ordered_shards: list[ShardDescriptor]
72
+     established_at: datetime
73
+
74
+ @dataclass
75
+ class ShardHealth:
76
+     shard_id: ShardID
77
+     online: bool
78
+     last_seen: datetime
79
+     p95_latency_ms: float
80
+     queue_depth: int
81
+ ```
82
+
83
+ ### 4.2 `ShardServer`
84
+
85
+ ```python
86
+ class ShardServer:
87
+     """Hosts one contiguous shard. Loaded on demand; lazy-evictable under memory pressure."""
88
+
89
+     def __init__(self, descriptor: ShardDescriptor, model_loader: ModelLoader, settings: ShardSettings): ...
90
+
91
+     async def start(self) -> None:
92
+         # Load weights for the layer range; register `experimental.distributed_llm.shard.serve` on the bus
93
+         ...
94
+
95
+     async def forward(self, activations_in: TensorChunkStream) -> TensorChunkStream:
96
+         """The hot path. Receives activations, runs layers, emits activations."""
97
+         ...
98
+
99
+     async def health(self) -> ShardHealth: ...
100
+
101
+     async def evict(self) -> None:
102
+         """Free VRAM; triggered by host memory manager."""
103
+         ...
104
+ ```
105
+
106
+ ### 4.3 `PipelineOrchestrator`
107
+
108
+ ```python
109
+ class PipelineOrchestrator:
110
+     """
111
+     Chooses shards to cover the model's layers, opens streams to each, and
112
+     pumps activations through them in order. Handles failover.
113
+     """
114
+
115
+     def __init__(
116
+         self,
117
+         bus: CapabilityBus,
118
+         router: ShardRouter,
119
+         health: ShardHealthTracker,
120
+         observability: Observability,
121
+     ): ...
122
+
123
+     async def chat(
124
+         self,
125
+         request: LlmChatRequest,
126
+         params: DistributedChatParams,
127
+     ) -> AsyncIterator[StreamFrame]:
128
+         # 1. Resolve a Pipeline covering all layers of the target model
129
+         # 2. Open WS streams to each shard via X08 tensor transport
130
+         # 3. For each token step:
131
+         #      embedding → shard 0 → shard 1 → ... → shard N → token sample
132
+         # 4. Yield `token_delta` frames; emit `shard_status` and `shard_failover` diagnostics
133
+         # 5. On any shard failure, attempt re-routing once; if that fails and
134
+         #    `params.fallback_to_local`, fall back to local inference and emit a
135
+         #    `pipeline_aborted` frame
136
+         ...
137
+ ```
138
+
139
+ ### 4.4 `ShardRouter`
140
+
141
+ ```python
142
+ class ShardRouter:
143
+     """
144
+     Given a model_id and an `experimental.shard.advertised` event stream,
145
+     pick a covering set of shards minimising:
146
+       - total network hops
147
+       - max per-shard queue depth
148
+       - chance of overlap with the caller's own GPU (avoid self-as-shard)
149
+     """
150
+
151
+     def __init__(self, store: ShardStore, settings: RoutingSettings): ...
152
+
153
+     async def pick(self, model_id: str, weights_sha256: str) -> Pipeline: ...
154
+     async def repick(self, pipeline: Pipeline, exclude: set[ShardID]) -> Pipeline: ...
155
+ ```
156
+
157
+ ---
158
+
159
+ ## 5. Behaviour
160
+
161
+ ### 5.1 Shard advertisement and discovery
162
+
163
+ A node hosting a shard emits `experimental.shard.advertised` events into the community event log. The event carries `ShardDescriptor` fields plus a timestamp. Advertisements expire after `DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S * 4` (default 120s); shard hosts must re-advertise via heartbeat.
164
+
165
+ When a node opts out (`policy.research.enable=false`), it does not emit advertisements. Existing advertisements expire normally.
166
+
167
+ The shard store is a local read model built from these events, indexed by `(model_id, weights_sha256, layer_range)`.
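
A minimal sketch of such a read model follows, assuming advertisements arrive as plain records. `Advert`, `ShardStore.on_advertised`, and `ShardStore.live` are hypothetical names, and timestamps are simplified to float seconds. Because two hosts may advertise the same layer range, this sketch keeps one entry per host under each `(model_id, weights_sha256, layer_range)` key; that is also an assumption.

```python
from dataclasses import dataclass

DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S = 30
ADVERT_TTL_S = DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S * 4  # 120 s, as stated above


@dataclass(frozen=True)
class Advert:
    model_id: str
    weights_sha256: str
    layer_range: tuple   # (lo, hi), inclusive
    host: str
    advertised_at: float  # seconds; simplified from datetime


class ShardStore:
    """Local read model of `experimental.shard.advertised` events with TTL expiry."""

    def __init__(self, ttl_s=ADVERT_TTL_S):
        self._ttl_s = ttl_s
        self._by_key = {}  # (model_id, weights_sha256, layer_range) -> {host: Advert}

    def on_advertised(self, ad: Advert) -> None:
        key = (ad.model_id, ad.weights_sha256, ad.layer_range)
        # A heartbeat re-advertisement simply replaces the previous entry.
        self._by_key.setdefault(key, {})[ad.host] = ad

    def live(self, model_id: str, weights_sha256: str, now: float) -> list:
        """Return non-expired advertisements for one model, ordered by layer range."""
        out = []
        for (m, w, _), hosts in self._by_key.items():
            if (m, w) != (model_id, weights_sha256):
                continue
            out.extend(ad for ad in hosts.values()
                       if now - ad.advertised_at < self._ttl_s)
        return sorted(out, key=lambda ad: (ad.layer_range, ad.host))
```

Expiry is evaluated lazily at read time, so a node that stops advertising drops out without any explicit removal event.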
168
+
169
+ ### 5.2 Pipeline construction
170
+
171
+ `ShardRouter.pick`:
172
+
173
+ 1. Filter advertisements to those matching `model_id` and `weights_sha256`.
174
+ 2. Greedy cover: starting from layer 0, pick the shard with the lowest queue depth that includes the next uncovered layer; advance the cursor; repeat. Returns failure if any layer is uncoverable.
175
+ 3. Prefer shards on the same LAN if possible (LAN advertisements have a lower "hop weight" metric attached by Discovery).
176
+ 4. Avoid sharding to **self** as the first shard: embedding and sampling should stay on the orchestrator.
177
+
178
+ Constructed pipelines are not persisted; they're per-call.
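
The greedy cover in step 2 can be sketched as below. `Candidate` and `greedy_cover` are hypothetical names. Breaking queue-depth ties in favour of the shard that reaches furthest is an assumption, chosen because it reproduces the §8.1 expectation that shards 0-7, 4-11, and 8-15 yield {0-7, 8-15}. LAN preference and self-avoidance (steps 3 and 4) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    shard_id: str
    lo: int
    hi: int              # inclusive
    queue_depth: int = 0


def greedy_cover(candidates, total_layers):
    """Return shards covering layers 0..total_layers-1 in order, or None if a layer is uncoverable."""
    chosen = []
    cursor = 0  # next uncovered layer
    while cursor < total_layers:
        covering = [c for c in candidates if c.lo <= cursor <= c.hi]
        if not covering:
            return None
        # Lowest queue depth first; on ties prefer the shard reaching furthest (assumption).
        best = min(covering, key=lambda c: (c.queue_depth, -c.hi))
        chosen.append(best)
        cursor = best.hi + 1
    return chosen
```

When a chosen shard overlaps layers that are already covered, the sketch does not say which host runs the overlapping layers; the spec would have to settle that.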
179
+
180
+ ### 5.3 Forward pass
181
+
182
+ Per token:
183
+
184
+ ```
185
+ [orchestrator] embedding → [shard 0] layers 0..7 → [shard 1] layers 8..15 → ... → [orchestrator] sample
186
+ ```
187
+
188
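+ Activations flow as **fp16 tensors** by default (configurable to fp32 for debugging). Each hop is a WebSocket binary frame stream (see [X08](../cross-cutting/X08-tensor-transport.md)). The orchestrator interleaves token-N and token-N+1: as soon as shard 0 finishes token N, the orchestrator pushes token N+1's embedding into shard 0 while shard 1 is still processing N. With this pipeline parallelism, the steady-state time per token approaches the latency of the slowest shard rather than the sum of all shard latencies. Interleaving requires the next embedding to be known in advance, so it applies to prompt tokens and to independent concurrent requests; during autoregressive decoding of a single sequence, token N+1 cannot enter shard 0 until token N has been sampled.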
189
+
190
+ ### 5.4 Failure handling
191
+
192
+ If a shard's stream errors or stalls past `DISTRIBUTED_SHARD_HEALTH_TIMEOUT_S`:
193
+
194
+ 1. The orchestrator emits a `shard_status` frame with `status:"degraded"`.
195
+ 2. Calls `router.repick(pipeline, exclude={failed_shard_id})`.
196
+ 3. If repick succeeds, opens a fresh stream to the replacement and emits `shard_failover` frame. **In-flight tokens are restarted** (no mid-token recovery).
197
+ 4. If repick fails and `params.fallback_to_local`, the orchestrator silently restarts the call as a local-only `llm.chat@2.0` against any local model that matches.
198
+ 5. Else: emit `pipeline_aborted` frame and return `shard_unavailable`.
199
+
200
+ `DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES` (default 2): if failover happens that many times in one call, give up and fall back to local.
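
The decision taken on each shard failure can be sketched as a pure function. `on_shard_failure` and its action strings are hypothetical. The sketch reads `DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES` as "stop attempting failover once that many failovers have already happened in this call", which is one possible interpretation of the rule above.

```python
DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES = 2


def on_shard_failure(failovers_so_far, repick, fallback_to_local,
                     max_failovers=DISTRIBUTED_FALLBACK_TO_LOCAL_AFTER_FAILURES):
    """Return (frames_to_emit, action) for one shard failure.

    `repick` is a zero-argument callable returning True when a replacement
    shard was found; it is only invoked while failover is still allowed.
    """
    frames = [{"type": "shard_status", "status": "degraded"}]      # step 1
    if failovers_so_far < max_failovers and repick():               # steps 2-3
        frames.append({"type": "shard_failover"})
        return frames, "restart_in_flight_token"
    if fallback_to_local:                                           # step 4
        return frames, "restart_as_local_llm_chat"
    frames.append({"type": "pipeline_aborted"})                     # step 5
    return frames, "error:shard_unavailable"
```

Keeping the decision separate from stream handling makes the failover rules testable without any network I/O.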
201
+
202
+ ### 5.5 Streaming and backpressure
203
+
204
+ Tensor-chunk streams use a window of `TENSOR_FLOW_CONTROL_WINDOW` chunks (default 16). Each chunk is at most `TENSOR_CHUNK_BYTES` (1 MB). If the downstream shard's send queue fills, the orchestrator pauses upstream until ACKs drain. See [X08 §4](../cross-cutting/X08-tensor-transport.md).
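
With the defaults, at most 16 × 1 MB = 16 MB of activations are unacknowledged per stream. The sketch below illustrates credit-based windowing with `asyncio.Semaphore`. `WindowedSender`, `split_chunks`, and `demo` are hypothetical names, and the real X08 wire protocol is not modelled.

```python
import asyncio

TENSOR_FLOW_CONTROL_WINDOW = 16
TENSOR_CHUNK_BYTES = 1024 * 1024


def split_chunks(payload: bytes, chunk_bytes=TENSOR_CHUNK_BYTES):
    """Split a serialized tensor into chunks of at most `chunk_bytes`."""
    return [payload[i:i + chunk_bytes] for i in range(0, len(payload), chunk_bytes)]


class WindowedSender:
    """Allows at most `window` unacknowledged chunks in flight."""

    def __init__(self, send, window=TENSOR_FLOW_CONTROL_WINDOW):
        self._send = send                      # async callable taking one chunk
        self._credits = asyncio.Semaphore(window)
        self.in_flight = 0
        self.max_in_flight = 0

    async def send_chunk(self, chunk: bytes) -> None:
        await self._credits.acquire()          # pauses upstream when the window is full
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await self._send(chunk)

    def on_ack(self) -> None:
        self.in_flight -= 1
        self._credits.release()                # one ACK frees one slot


async def demo(n_chunks=40, window=4):
    """Push chunks through a slow receiver and report the peak in-flight count."""
    received = asyncio.Queue()
    sender = WindowedSender(received.put, window=window)

    async def receiver():
        for _ in range(n_chunks):
            await received.get()
            await asyncio.sleep(0)             # simulate processing before the ACK
            sender.on_ack()

    task = asyncio.create_task(receiver())
    for i in range(n_chunks):
        await sender.send_chunk(bytes([i % 256]))
    await task
    return sender.max_in_flight
```

Running `asyncio.run(demo())` returns a peak in-flight count that never exceeds the configured window.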
205
+
206
+ ### 5.6 Concurrency
207
+
208
+ A shard's `max_concurrent_streams` is honoured strictly. If the orchestrator's call would exceed it, the orchestrator picks a different shard (via `router.repick`) rather than queuing.
209
+
210
+ A shard's GPU memory budget is enforced by the shard host's own resource manager; a shard exceeding its budget gets evicted and re-advertises with `vram_required_mb` updated next time it loads.
211
+
212
+ ### 5.7 Models supported
213
+
214
+ Phase 3 launches with two backend choices:
215
+
216
+ | Backend | Models | Notes |
217
+ |---------|--------|-------|
218
+ | `small_model_layered` | Qwen2.5-{1.5B,3B,7B}, Llama-3.2-{1B,3B}, MiniCPM-3 | Custom HearthNet impl; PyTorch model surgery to expose per-layer forward |
219
+ | `petals_like` | (vendored from BigScience Petals) | Optional; only if user installs `hearthnet[petals]` extra |
220
+
221
+ The `small_model_layered` backend handles models up to roughly 7B parameters cleanly; beyond that the activation transport becomes the bottleneck.
222
+
223
+ ### 5.8 Security boundary
224
+
225
+ A shard host receives activation tensors, which **leak information about the request's input (see §5.9) and may carry training-data residue**. Treat activations as sensitive: do not log them, do not persist them, and do not retain them past the forward pass. Every call carries signed authentication; the caller's identity is recorded in metrics but never alongside tensor contents in logs.
226
+
227
+ A malicious shard could degrade outputs subtly. Detection is hard in general; the orchestrator does **basic sanity checks** (norm bounds, NaN/Inf detection) but cannot detect adversarial corruption. Communities should only enable distributed inference among members they trust as much as they trust the LLM service operator.
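
The basic sanity checks could look like the following sketch, which operates on a flat sequence of floats. `check_activations` and the `max_l2_norm` threshold are hypothetical; a real implementation would run on tensors and pick bounds per model. As noted above, this catches broken output, not deliberate tampering.

```python
import math


def check_activations(values, max_l2_norm):
    """Return None if the chunk looks sane, otherwise a short reason string."""
    sum_sq = 0.0
    for v in values:
        if not math.isfinite(v):      # rejects NaN, +Inf and -Inf
            return "non_finite"
        sum_sq += v * v
    if math.sqrt(sum_sq) > max_l2_norm:
        return "norm_out_of_bounds"
    return None
```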
228
+
229
+ ### 5.9 Privacy threat surface
230
+
231
+ A shard sees the activations of every request routed through it. With effort, a shard host can reconstruct approximate input text (especially the prompt) from activations of intermediate layers. This is **a real concern, not a theoretical one**.
232
+
233
+ Mitigations (none perfect):
234
+ - Restrict participation to members at trust level `trusted` or higher.
235
+ - Mix activations with a small amount of noise at the orchestrator (research; not yet implemented).
236
+ - Use this module only for queries the requester would already trust the community with.
237
+
238
+ ### 5.10 Observability
239
+
240
+ Per call, emit:
241
+ - `distributed_inference.pipeline_construct_ms`
242
+ - `distributed_inference.first_token_ms`
243
+ - `distributed_inference.tokens_per_second`
244
+ - `distributed_inference.shard_latency_ms{shard_id}` histograms
245
+ - `distributed_inference.failovers_total`
246
+ - `distributed_inference.fallback_to_local_total`
247
+
248
+ ---
249
+
250
+ ## 6. Errors
251
+
252
+ | Code | Cause |
253
+ |------|-------|
254
+ | `experimental_disabled` | `policy.research.enable=false` |
255
+ | `shard_unavailable` | No shard covers a required layer range, or all candidates are at max concurrency |
256
+ | `pipeline_stalled` | No progress within timeout |
257
+ | `weights_mismatch` | A shard's advertised `weights_sha256` differs from requested |
258
+ | `bad_request` | Unknown model, malformed pipeline params |
259
+
260
+ ---
261
+
262
+ ## 7. Configuration
263
+
264
+ ```toml
265
+ [research.distributed_inference]
266
+ enabled = false
267
+ backend = "small_model_layered"
268
+ max_shards_per_request = 16
269
+ shard_health_timeout_seconds = 30
270
+ fallback_to_local = true
271
+ activation_dtype = "fp16" # "fp16" | "fp32"
272
+ allow_self_as_shard = false
273
+ max_concurrent_pipelines = 4
274
+
275
+ [research.distributed_inference.host]
276
+ serve_shards = false
277
+ shard_eviction_idle_seconds = 600
278
+ shard_max_vram_mb = 20000
279
+ ```
280
+
281
+ ---
282
+
283
+ ## 8. Tests
284
+
285
+ ### 8.1 Unit
286
+ - ShardRouter cover algorithm: 16-layer model + 3 advertised shards (0-7, 4-11, 8-15) → picks {0-7, 8-15}, ignores overlap shard
287
+ - Sanity bounds on activations: NaN injection triggers `pipeline_stalled` (via failed health check on subsequent chunk)
288
+ - Pipeline construction with weights mismatch → `weights_mismatch`
289
+
290
+ ### 8.2 Integration (LAN)
291
+ - Two-node setup, 1.5B model split as 0-7 / 8-15; happy-path tokens/sec measured; baseline single-machine inference also measured; ratio reported (expect 0.4–0.6× local)
292
+ - Shard host kill mid-stream; failover to a third node; total call still succeeds; latency penalty bounded
293
+ - Concurrent two-pipeline test on three nodes; no deadlock; per-call latency degrades < 2×
294
+
295
+ ### 8.3 Adversarial
296
+ - Malicious shard returns garbage activations: orchestrator's NaN/Inf detector catches the call; metric `distributed_inference.shard_corruption_detected_total` increments; pipeline aborts
297
+ - Slowloris shard (returns one chunk per second): `pipeline_stalled` after timeout; failover succeeds
298
+
299
+ ### 8.4 Performance budget
300
+ - 3B model, 2-shard pipeline, RTX 5090 + RTX 4090: ≥ 8 tokens/sec sustained
301
+ - First-token latency ≤ 800ms
302
+ - Construction-to-first-byte ≤ 500ms
303
+ - Tensor-chunk overhead per hop ≤ 25ms p95
304
+
305
+ ---
306
+
307
+ ## 9. Cross-references
308
+
309
+ - Capability spec: [CAPABILITY_CONTRACT_v3 §4.1–4.3](../CAPABILITY_CONTRACT_v3.md)
310
+ - Tensor transport: [X08](../cross-cutting/X08-tensor-transport.md)
311
+ - Base LLM service: [M04](../../modules/M04-llm.md)
312
+ - Trust levels: [M01](../../modules/M01-identity.md)
313
+
314
+ ---
315
+
316
+ ## 10. Open research questions
317
+
318
+ 1. **Activation privacy.** Can we add fast-to-compute noise that preserves inference accuracy but defeats activation-inversion attacks? Cite the Geiping et al. inversion paper as the threat baseline.
319
+ 2. **Mid-token recovery.** Currently a shard failure restarts the in-flight token. Could we use micro-checkpointing (every K tokens) to recover without a restart? Latency cost?
320
+ 3. **Heterogeneous shards.** Could a 4090 host the early layers (heavier compute per layer) and a 3060 the later ones, while remaining balanced? Probably yes; automated load assignment is the research question.
321
+ 4. **Async pipeline.** Currently the orchestrator interleaves at the token level. Could it interleave at the layer level (one shard processes token N+2 while another processes N+1) for higher throughput? In theory yes; coordination protocol unclear.
322
+ 5. **Mixed local + distributed.** When the orchestrator could host some layers itself (it has a GPU), should it? When? Currently `allow_self_as_shard=false`. A heuristic that considers compute headroom would be richer.
323
+ 6. **Adversarial detection.** Beyond NaN/Inf, can we cheaply detect activation tampering by comparing to a small "shadow inference" on a tiny model? Cost vs. benefit unclear.
324
+ 7. **Pricing / incentive.** A shard host pays in GPU time. A community-internal token-economy is explicitly out of scope (00-OVERVIEW §8). But a *reputational* signal ("this anchor served 4000 shard-tokens this week") could be helpful. Should it be a metric?
325
+ 8. **Backend strategy.** `petals_like` vs `small_model_layered`: which delivers better quality / latency / robustness for our target models? An honest A/B is the answer.
docs/p2_p3/M27-moe-routing.md ADDED
@@ -0,0 +1,378 @@
1
+ # M27: MoE Expert Routing
2
+
3
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M04 LLM](../../modules/M04-llm.md), [M10 Chat](../../modules/M10-chat.md), [M25 Group Chat](../../phase-2/modules/M25-group-chat.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md)
5
+ **Depended on by:** Optional routing layer in M03 Bus dispatcher
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ "MoE" here means **Mixture of Experts** in a generalised sense: given a question, route to the best expert. The experts can be:
12
+
13
+ - a **model** running locally on some node ("Llama-3.2-3B is good at code"),
14
+ - a **service** capability ("the niederrhein-emergency RAG corpus knows this"),
15
+ - a **human** with declared expertise ("Maria in Issum has organised Sankt Martins for 20 years"),
16
+ - an **external** API or another community via federation.
17
+
18
+ The router takes a query summary + tags, asks "which expert would do this best", returns top-K candidates with scores. The caller chooses one (or the system chooses automatically with a confidence threshold) and the chosen expert serves the request.
19
+
20
+ This module is a research bet that **knowing-who-to-ask is more valuable than scaling-the-model**. A 3B model that knows when to defer to a neighbour with first-hand knowledge will outperform a 70B model that has to confabulate. Whether that bet pays off is exactly what the research is for.
21
+
22
+ ---
23
+
24
+ ## 2. Non-goals
25
+
26
+ - **Auction-style routing.** Experts do not bid; there is no money flowing. Routing is by score, not price.
27
+ - **Mandatory routing.** The router is a *recommender*. The caller can always choose to run a query against a default LLM. MoE routing is opt-in per capability or per call.
28
+ - **Replacing RAG.** RAG still does retrieval inside a corpus. The router decides *which corpus* (or which non-RAG expert) to use, which is a different layer.
29
+ - **Routing without consent.** A human expert never gets pinged unless they have explicitly registered availability for the topic.
30
+
31
+ ---
32
+
33
+ ## 3. File layout
34
+
35
+ ```
36
+ hearthnet/moe/
37
+ ├── __init__.py
38
+ ├── router.py             # MoeRouter: the capability handler
39
+ ├── scorer.py             # Learned scoring model; rule-based fallback
40
+ ├── expert_registry.py    # Tracks registered experts and their declared topics
41
+ ├── human_in_the_loop.py  # Coordinates handoff to a human; manages timeouts
42
+ └── feedback.py           # Records routing outcomes to train the scorer
43
+ ```
44
+
45
+ ---
46
+
47
+ ## 4. Public API
48
+
49
+ ### 4.1 Dataclasses
50
+
51
+ ```python
52
+ @dataclass(frozen=True)
53
+ class ExpertDescriptor:
54
+     expert_id: ExpertID           # "human:<NodeID>" | "model:<id>" | "service:<cap_name>" | "external:<url>"
55
+     kind: ExpertKind
56
+     topics: frozenset[str]
57
+     capabilities: frozenset[str]  # for kind=service: which bus capability to invoke
58
+     availability: AvailabilityWindow
59
+     consent_to_route: bool
60
+     registered_at: datetime
61
+     expires_at: datetime | None
62
+     score_bias: float = 0.0       # operator nudge; positive favours, negative disfavours (defaulted fields must come last)
63
+
64
+ @dataclass
65
+ class RouteCandidate:
66
+     expert_id: ExpertID
67
+     kind: ExpertKind
68
+     score: float
69
+     expected_latency_minutes: int  # always in minutes; humans typically take hours, models seconds
70
+     rationale: str
71
+     name: str | None               # for display
72
+
73
+ @dataclass
74
+ class RouteResult:
75
+     candidates: list[RouteCandidate]
76
+     rationale: str
77
+     routed_at: datetime
78
+
79
+ @dataclass
80
+ class Handoff:
81
+     handoff_id: str
82
+     expert_id: ExpertID
83
+     context_summary: str
84
+     initiated_at: datetime
85
+     deadline_at: datetime
86
+     state: Literal["pending", "accepted", "declined", "completed", "timed_out"]
87
+     thread_id: ThreadID | None
88
+ ```
89
+
90
+ ### 4.2 `MoeRouter`
91
+
92
+ ```python
93
+ class MoeRouter:
94
+     """Capability handler for experimental.moe.route@1.0"""
95
+
96
+     def __init__(
97
+         self,
98
+         bus: CapabilityBus,
99
+         registry: ExpertRegistry,
100
+         scorer: Scorer,
101
+         feedback: FeedbackStore,
102
+         settings: MoeSettings,
103
+         observability: Observability,
104
+     ): ...
105
+
106
+     async def start(self) -> None: ...
107
+
108
+     async def route(self, body: RouteBody) -> RouteResult: ...
109
+     async def handoff(self, body: HandoffBody) -> Handoff: ...
110
+     async def feedback_outcome(self, handoff_id: str, outcome: Outcome) -> None: ...
111
+ ```
112
+
113
+ ### 4.3 `Scorer`
114
+
115
+ ```python
116
+ class Scorer(Protocol):
117
+     name: str
118
+
119
+     def score(
120
+         self,
121
+         request_summary: str,
122
+         tags: list[str],
123
+         candidates: list[ExpertDescriptor],
124
+         context: ScoringContext,
125
+     ) -> list[float]:
126
+         """Return one score per candidate, same order."""
127
+         ...
128
+
129
+ class RuleBasedScorer:
130
+     """The default scorer. Pure rules: tag overlap, recency of expert activity, availability, bias. No ML."""
131
+     ...
132
+
133
+ class LearnedScorer:
134
+     """A small classifier trained on feedback outcomes. Off by default; activated once
135
+     MOE_ROUTER_TRAIN_MIN_EXAMPLES historical handoffs with outcomes are recorded."""
136
+     ...
137
+ ```
138
+
139
+ ### 4.4 `ExpertRegistry`
140
+
141
+ ```python
142
+ class ExpertRegistry:
143
+     """
144
+     Materialised view of `experimental.moe.expert.registered` events.
145
+     Indexed by topic for fast routing.
146
+     """
147
+
148
+     def __init__(self, event_log: EventLog): ...
149
+
150
+     def by_topic(self, tag: str) -> list[ExpertDescriptor]: ...
151
+     def by_id(self, expert_id: ExpertID) -> ExpertDescriptor | None: ...
152
+     def available_now(self) -> list[ExpertDescriptor]: ...
153
+ ```
154
+
155
+ ### 4.5 `HumanInTheLoopCoordinator`
156
+
157
+ ```python
158
+ class HumanInTheLoopCoordinator:
159
+     """
160
+     Manages handoff-to-human flows.
161
+       - Sends a chat invitation (M25 thread) to the chosen human.
162
+       - Awaits acceptance within HANDOFF_RESPONSE_DEADLINE_MINUTES.
163
+       - Falls back to next-best candidate if declined or timed out.
164
+       - Stores audit trail (who routed where, why, outcome).
165
+     """
166
+
167
+     def __init__(self, bus: CapabilityBus, thread_service: ThreadService, registry: ExpertRegistry, settings: HitlSettings): ...
168
+
169
+     async def initiate(self, handoff: Handoff) -> None: ...
170
+     async def on_response(self, handoff_id: str, accepted: bool) -> None: ...
171
+     async def on_completion(self, handoff_id: str, outcome: Outcome) -> None: ...
172
+ ```
173
+
174
+ ---
175
+
176
+ ## 5. Behaviour
177
+
178
+ ### 5.1 Expert registration
179
+
180
+ A node calls `experimental.moe.expert.register@1.0` to register an expert. For `kind="human"`:
181
+
182
+ - The caller must be the human in question (self-registration) OR be an anchor registering on behalf of a member with explicit consent token.
183
+ - `consent_to_route=true` is mandatory; humans without it are silently excluded from routing.
184
+ - Topics are free-form strings, lowercased, kebab-case. The registry will compute embeddings of topics so that `"sankt_martins"` and `"sankt martins"` match.
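
Before any embedding comparison, a cheap lexical normalisation already makes many spelling variants collide. The sketch below shows only that step: `normalise_topic` is a hypothetical helper, and it is not the embedding-based matching described above.

```python
import re


def normalise_topic(raw: str) -> str:
    """Lowercase a free-form topic and convert it to kebab-case."""
    lowered = raw.strip().lower()
    # Collapse every run of non-word characters or underscores into a single hyphen.
    return re.sub(r"[\W_]+", "-", lowered).strip("-")
```

Because `\W` is Unicode-aware for `str` patterns, letters such as `ä` or `ß` are preserved rather than replaced.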
185
+
186
+ For `kind="model"` or `kind="service"`:
187
+
188
+ - The node hosting the model/service self-registers.
189
+ - Topics describe what the model is good at, e.g. `["code","python"]` or `["niederrhein-history","local-genealogy"]`.
190
+
191
+ For `kind="external"`:
192
+
193
+ - An anchor registers a third-party endpoint (HF Inference, OpenAI, Anthropic, another HearthNet community via federation).
194
+ - External experts are not consulted unless `policy.research.moe_allow_external=true`.
195
+
196
+ ### 5.2 Routing flow
197
+
198
+ `experimental.moe.route@1.0`:
199
+
200
+ 1. Caller submits `{request_summary, tags, top_k}`.
201
+ 2. The registry returns all currently-available experts whose topics overlap `tags` or whose semantic similarity to `request_summary` is above `MOE_TOPIC_SIMILARITY_THRESHOLD` (default 0.55).
202
+ 3. Scorer scores each candidate.
203
+ 4. Apply score biases (per-expert operator nudges, per-community policy).
204
+ 5. Sort descending; return top-K with rationales.
205
+ 6. The caller decides whether to:
206
+    - Route automatically (if top score ≥ `MOE_AUTO_ROUTE_THRESHOLD`, default 0.85),
207
+    - Present the user with the candidate list to choose,
208
+    - Or fall back to default LLM.
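
The flow above can be sketched end to end with a toy tag-overlap score standing in for the real `Scorer`. `Expert`, `tag_overlap_score`, `route`, and `decide` are hypothetical names. Semantic similarity, availability filtering, and rationales are omitted.

```python
from dataclasses import dataclass

MOE_AUTO_ROUTE_THRESHOLD = 0.85


@dataclass(frozen=True)
class Expert:
    expert_id: str
    topics: frozenset
    score_bias: float = 0.0


def tag_overlap_score(tags, expert):
    """Fraction of the request's tags that the expert declares (toy scorer)."""
    wanted = set(tags)
    if not wanted:
        return 0.0
    return len(wanted & expert.topics) / len(wanted)


def route(tags, experts, top_k=3):
    """Steps 2-5: filter by overlap, score, apply bias, sort, and truncate."""
    scored = []
    for e in experts:
        base = tag_overlap_score(tags, e)
        if base == 0.0:
            continue                                  # no topic overlap: not a candidate
        score = min(1.0, max(0.0, base + e.score_bias))
        scored.append((score, e.expert_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # descending score, stable by id
    return scored[:top_k]


def decide(candidates, threshold=MOE_AUTO_ROUTE_THRESHOLD):
    """Step 6: auto-route, ask the user, or fall back to the default LLM."""
    if not candidates:
        return "default_llm"
    if candidates[0][0] >= threshold:
        return "auto:" + candidates[0][1]
    return "ask_user"
```

Clamping the biased score to [0, 1] keeps operator nudges from pushing a weak match past the auto-route threshold by an unbounded amount; that clamp is a design choice of this sketch, not something the spec states.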
209
+
210
+ ### 5.3 Handoff to a human expert
211
+
212
+ `experimental.moe.expert.handoff@1.0`:
213
+
214
+ 1. The coordinator creates a new E2E group thread (M25) with the requester + chosen human.
215
+ 2. The requester's question (or its summary) is posted to the thread.
216
+ 3. The chosen human receives a notification.
217
+ 4. If the human accepts within `HANDOFF_RESPONSE_DEADLINE_MINUTES` (default 60, configurable), the thread proceeds normally.
218
+ 5. If declined or timed out: the coordinator silently picks the next candidate (or falls back to a model). The requester is informed once total wait exceeds `HANDOFF_WAIT_BUDGET_MINUTES` (default 30).
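
The accept, decline, and timeout fallback in steps 4 and 5 might be sketched as follows. `run_handoff` and the `ask` callable are hypothetical. The deadline is passed in seconds to keep the sketch testable, and thread creation, notifications, and the wait-budget message are left out.

```python
import asyncio


async def run_handoff(candidates, ask, deadline_s):
    """Try each candidate in order; return the first expert_id that accepts, else None.

    `ask(expert_id)` is an async callable resolving to True (accepted) or
    False (declined). A candidate that does not answer within `deadline_s`
    is treated as timed out and skipped.
    """
    for expert_id in candidates:
        try:
            accepted = await asyncio.wait_for(ask(expert_id), timeout=deadline_s)
        except asyncio.TimeoutError:
            continue                  # timed out: move on to the next candidate
        if accepted:
            return expert_id
    return None                       # caller falls back to a model
```

A `None` result corresponds to the "falls back to a model" branch.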
219
+
220
+ The human's UI shows handoffs as a low-priority "questions for you" inbox; they are not interruptive notifications by default. Policy can flip this for community types where instant response matters (e.g. civdef pilot, M31).
221
+
222
+ ### 5.4 Routing model + RAG hybrid
223
+
224
+ A common pattern: a chat session begins, the user asks a question. The orchestrator:
225
+
226
+ 1. Sends `request_summary` (the question + recent thread context) to the router.
227
+ 2. If the top expert is a `service` expert pointing at `rag.query@1.0` against a specific corpus → run RAG against that corpus.
228
+ 3. If the top is a `model` expert → run `llm.chat` against that model.
229
+ 4. If the top is a `human` expert and the user explicitly opted in to human handoff → initiate handoff.
230
+ 5. Else default model.
231
+
232
+ This is opt-in; the chat UI surfaces a "ask a neighbour?" affordance when human handoff is a credible option.
233
+
234
+ ### 5.5 Feedback loop
235
+
236
+ After every routed call, the caller (or a UI signal: "did this answer your question?") records an `Outcome`:
237
+
238
+ ```python
239
+ @dataclass
240
+ class Outcome:
241
+     handoff_id: str
242
+     helpful: bool | None                    # None if no signal
243
+     user_rating: int | None                 # 1–5
244
+     completion_time_seconds: float | None
245
+     handed_off_again: bool                  # the user asked the question elsewhere
246
+ ```
247
+
248
+ Feedback is stored in `FeedbackStore` (SQLite). Once `MOE_ROUTER_TRAIN_MIN_EXAMPLES` (default 200) outcomes exist, the `LearnedScorer` becomes available and is retrained every `MOE_ROUTER_RETRAIN_EVERY_HOURS` (default 24). Communities can flip back to `RuleBasedScorer` at any time.
249
+
250
+ ### 5.6 Privacy in routing
251
+
252
+ Routing is **observable** by definition: the request summary is sent to the router, which inspects it to pick an expert. Implications:
253
+
254
+ - The router lives on the **caller's own node**; the request summary is not transmitted off-node for routing.
255
+ - When the chosen expert is on a different node, the request body is sent over the bus as usual (signed, optionally E2E if it's a chat thread).
256
+ - The router does not log full request summaries. It logs `tags`, the candidate list, and the chosen expert. The summary is held in memory for the duration of the call.
257
+ - For handoff to humans, the human sees the actual question, because they need to. The handoff event in the audit trail records "request handed off to X", not the question's content.
258
+
259
+ ### 5.7 Cross-community routing (federation)
260
+
261
+ A federation manifest can include the scope `moe.route@1.0`. In that case the router can include experts from federated communities in its candidate list. Cross-community handoff:
262
+
263
+ - Initiates a federated thread (M25 + M14): the thread's events are bridged to the federated community.
264
+ - The expert's identity (e.g. "Lukas from Geldern") is visible to the requester.
265
+ - Federation scope must include `chat.thread.send@1.0` and `chat.thread.history@1.0` for the thread to function.
266
+
267
+ ### 5.8 Operational policy
268
+
269
+ Communities can configure:
270
+
271
+ ```yaml
272
+ moe:
273
+   enabled: true
274
+   auto_route_threshold: 0.85
275
+   topic_similarity_threshold: 0.55
276
+   human_handoff:
277
+     enabled: true
278
+     response_deadline_minutes: 60
279
+     wait_budget_minutes: 30
280
+     allowed_during_quiet_hours: false   # no human pings 22:00–06:00 local
281
+   external_experts: false
282
+   cross_community: false
283
+ ```
284
+
285
+ ### 5.9 Anti-abuse
286
+
287
+ - **Rate limit per requester:** `MOE_REQUESTS_PER_USER_PER_HOUR` (default 60). Prevents one user from spamming the human-expert pool.
288
+ - **Per-expert cooldown:** an expert is not offered the same user's request within `MOE_PER_EXPERT_COOLDOWN_MINUTES` (default 30).
289
+ - **Decline penalty:** an expert who declines 3 handoffs in a row gets temporarily marked `availability=false` until they update their registration.
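
The rate limit and the decline penalty above could be tracked as in the following sketch. The `AbuseGuard` class and its method names are hypothetical, and a sliding one-hour window is an assumption (the spec does not say whether the hour is sliding or fixed).

```python
from collections import defaultdict, deque


class AbuseGuard:
    """Illustrative tracker for per-user rate limits and per-expert decline streaks."""

    def __init__(self, max_per_hour: int = 60, decline_limit: int = 3) -> None:
        self.max_per_hour = max_per_hour
        self.decline_limit = decline_limit
        self._requests = defaultdict(deque)  # user_id -> timestamps in seconds
        self._declines = defaultdict(int)    # expert_id -> consecutive declines

    def allow_request(self, user_id: str, now: float) -> bool:
        window = self._requests[user_id]
        # Drop timestamps that have left the one-hour sliding window.
        while window and now - window[0] >= 3600:
            window.popleft()
        if len(window) >= self.max_per_hour:
            return False  # caller would answer with a rate-limit error
        window.append(now)
        return True

    def record_response(self, expert_id: str, accepted: bool) -> bool:
        """Return True when the expert should now be marked unavailable."""
        if accepted:
            self._declines[expert_id] = 0  # an acceptance resets the streak
            return False
        self._declines[expert_id] += 1
        return self._declines[expert_id] >= self.decline_limit
```

Rejected requests are not added to the window, so a blocked user regains capacity as soon as the oldest accepted request ages out.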
290
+
291
+ ---
292
+
293
+ ## 6. Errors
294
+
295
+ | Code | Cause |
296
+ |------|-------|
297
+ | `experimental_disabled` | Research not enabled |
298
+ | `bad_request` | Empty `request_summary`, malformed tags |
299
+ | `not_found` | `expert_id` does not exist (in `handoff`) |
300
+ | `handoff_declined` | Chosen expert declined and no fallback was permitted |
301
+ | `handoff_timed_out` | No response within deadline |
302
+ | `policy_violation` | Cross-community handoff but federation does not allow |
303
+
304
+ ---
305
+
306
+ ## 7. Configuration
307
+
308
+ ```toml
309
+ [research.moe]
310
+ enabled = false
311
+ scorer = "rule_based" # "rule_based" | "learned"
312
+ auto_route_threshold = 0.85
313
+ topic_similarity_threshold = 0.55
314
+ top_k_default = 3
315
+ requests_per_user_per_hour = 60
316
+ per_expert_cooldown_minutes = 30
317
+ allow_external = false
318
+ allow_cross_community = false
319
+
320
+ [research.moe.human_handoff]
321
+ enabled = true
322
+ response_deadline_minutes = 60
323
+ wait_budget_minutes = 30
324
+ allowed_during_quiet_hours = false
325
+ quiet_hours_start = "22:00"
326
+ quiet_hours_end = "06:00"
327
+
328
+ [research.moe.learned_scorer]
329
+ train_min_examples = 200
330
+ retrain_every_hours = 24
331
+ model_kind = "logistic_regression" # small, interpretable
332
+ ```
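
Because the default quiet-hours window (22:00 to 06:00) wraps past midnight, a naive `start <= now < end` comparison would never match. A small sketch of a wrap-aware check follows; the helper names are hypothetical, and treating equal bounds as "no quiet window" is an assumption.

```python
from datetime import time


def in_quiet_hours(now: time, start: time = time(22, 0), end: time = time(6, 0)) -> bool:
    """True if `now` falls inside the quiet window, including windows that wrap midnight."""
    if start == end:
        return False  # assumption: equal bounds mean no quiet window at all
    if start < end:
        return start <= now < end       # same-day window, e.g. 13:00 to 15:00
    return now >= start or now < end    # wrapping window, e.g. 22:00 to 06:00


def human_ping_allowed(now: time, allowed_during_quiet_hours: bool = False) -> bool:
    # Mirrors the `allowed_during_quiet_hours` switch from the config above.
    return allowed_during_quiet_hours or not in_quiet_hours(now)
```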
333
+
334
+ ---
335
+
336
+ ## 8. Tests
337
+
338
+ ### 8.1 Unit
339
+ - RuleBasedScorer: tag-overlap dominance test (4 candidates, exact tag match scores highest)
340
+ - Availability filter: expert with `availability` window not covering "now" is excluded
341
+ - Cooldown: same user calls twice within `per_expert_cooldown_minutes` → second call excludes that expert
342
+
343
+ ### 8.2 Integration
344
+ - Three-node community: two humans + one model registered as experts for `{cooking, niederrhein-history}`. Query about Sankt-Martins-Lieder → human expert chosen; handoff flow completes.
345
+ - Handoff decline: chosen expert declines, fallback picks next candidate; user sees a single thread experience without knowing about the decline.
346
+ - Cross-community: federation manifest grants `moe.route`; query routed to an expert in the federated community; thread bridged correctly.
347
+
348
+ ### 8.3 Adversarial
349
+ - Spam: one user submits 100 routes in 10 minutes → the first 60 are accepted; every request from #61 onwards is rejected with `too_many_requests`.
350
+ - Decline-storm: an expert declines 10 in a row → after the third, that expert is auto-unavailable; not offered as candidate until they re-register.
351
+ - Score injection: a community member tries to set `score_bias=999` on their own expert record → registration rejects (caller must be anchor for `score_bias` outside `[-1, 1]`).
352
+
353
+ ### 8.4 UX
354
+ - Top-K presentation in chat UI: candidates show as a 3-button affordance under the user's question; user picks one; thread morphs accordingly.
355
+ - Outcome capture: thumbs-up/down on the answer records an `Outcome`; visible in router metrics dashboard.
356
+
357
+ ---
358
+
359
+ ## 9. Cross-references
360
+
361
+ - Capability spec: [CAPABILITY_CONTRACT_v3 §4.4–4.6](../CAPABILITY_CONTRACT_v3.md)
362
+ - Group chat (handoff substrate): [M25](../../phase-2/modules/M25-group-chat.md)
363
+ - Federation: [M14](../../phase-2/modules/M14-federation.md)
364
+
365
+ ---
366
+
367
+ ## 10. Open research questions
368
+
369
+ 1. **What signal predicts a good route?** Tag overlap is shallow. Embeddings of past handoffs vs current request might do better. The `LearnedScorer` is the placeholder; the actual feature engineering is unsolved.
370
+ 2. **Calibration.** Is a score of 0.85 actually 85% likely to be a good route? Reliability diagrams from feedback data needed.
371
+ 3. **Negative experts.** Should the router learn that "Llama-3.2 is *bad* at Niederrhein-Plattdeutsch" and avoid it? Currently only positive scores.
372
+ 4. **Cold-start.** A new community has no feedback data and no `LearnedScorer`. Bootstrapping by federation (borrowing experts from a more-mature peer)?
373
+ 5. **Human consent UX.** What is the right number of handoffs per week before it becomes a burden? Per-person ceiling, per-community ceiling, dynamic?
374
+ 6. **Privacy of the rationale.** Should the rationale ("Maria worked on Sankt Martins for 20 years") be visible to the requester? It can reveal information about Maria. Default: rationale is shown to requester only when the expert opts in to that.
375
+ 7. **Refusal protocol.** When a model expert "refuses" (e.g. "I cannot answer this"), should the router re-route, or trust the refusal? Mistaken refusals are a known LLM failure mode.
376
+ 8. **Expert overlap.** Two experts both register for `{sankt_martins}` and are equally good. What tiebreaker avoids always favouring the same person? Round-robin? Random? If both score 0.85, does the caller choose?
377
+ 9. **Network effects.** As more people register as experts, does the score signal get diluted? Empirical question.
378
+ 10. **Audit and review.** A community might want a quarterly "who was routed to most, on what topics" view, both for fairness and for spotting overworked experts. UX for surfacing this respectfully.
docs/p2_p3/M28-fedlearn.md ADDED
@@ -0,0 +1,348 @@
1
+ # M28: Federated Learning (LoRA Aggregation)
2
+
3
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M04 LLM](../../modules/M04-llm.md), [M14 Federation](../../phase-2/modules/M14-federation.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [X06 WebSocket](../../phase-2/cross-cutting/X06-websocket.md)
5
+ **Depended on by:** nothing in MVP; this is an opt-in research feature
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Federated learning of small **LoRA adapters** on top of a shared base model. Each node trains locally on its own data, sends only the **adapter weight deltas** (not raw data, not full weights) to an aggregator, and receives back an averaged adapter that subsequent nodes can use or further refine.
12
+
13
+ The bet: a 3B-parameter base model with a *community-tuned* LoRA adapter ("how people in our village actually phrase things, what jargon our Feuerwehr uses, what the local agricultural calendar looks like") is more useful for the community than a generic 3B model, and we can do this without any node ever shipping its private data off-box.
14
+
15
+ This module deliberately stays at LoRA scope only. Full fine-tunes, distillation, and continual pre-training are explicitly out of scope, both because they are bandwidth-hostile and because the privacy story for full-weight federation is significantly harder.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **Federating raw data.** Never. Training data stays on the node that owns it.
22
+ - **Full fine-tunes.** LoRA only. If a use case truly needs more, that's a different research project.
23
+ - **Cross-base-model aggregation.** All participants in a round must run the same base model at the same quantisation. Heterogeneous aggregation is open research.
24
+ - **Mandatory participation.** Every node decides per-round whether to join. There is no "you must contribute back" rule.
25
+ - **Aggregator centralisation.** Any node can host an aggregator. There is no privileged aggregator role.
26
+ - **Hiding participation.** Whether you joined a round is visible to other participants in that round; only your data and your gradients are private.
27
+
28
+ ---
29
+
30
+ ## 3. File layout
31
+
32
+ ```
33
+ hearthnet/fedlearn/
34
+ ├── __init__.py
35
+ ├── coordinator.py      # Orchestrates a round: announce, gather, aggregate, distribute
36
+ ├── participant.py      # Local-side: respond to round announcements, train, submit
37
+ ├── trainer.py          # Wraps M04 LLM in a LoRA training loop (peft + bitsandbytes)
38
+ ├── aggregator.py       # FedAvg with optional secure aggregation
39
+ ├── delta.py            # Serialise/deserialise LoRA deltas (state-dict subset)
40
+ ├── privacy.py          # Optional DP-noise injection and gradient clipping
41
+ └── manifest.py         # Round manifest: base model id, hyperparams, signature
42
+ ```
43
+
44
+ ---
45
+
46
+ ## 4. Public API
47
+
48
+ ### 4.1 Dataclasses
49
+
50
+ ```python
51
+ RoundID = NewType("RoundID", str) # ULID
52
+
53
+ @dataclass(frozen=True)
54
+ class RoundManifest:
55
+     round_id: RoundID
56
+     coordinator: NodeID
57
+     base_model_id: str                    # exact model id from M04 ("qwen2.5:3b-instruct-q4_K_M")
58
+     base_model_sha: str                   # SHA-256 of base weights; mismatch = exclusion
59
+     lora_target_modules: tuple[str, ...]  # which linear layers carry LoRA (e.g. "q_proj","v_proj")
60
+     lora_rank: int                        # 4 ≤ r ≤ FEDLEARN_MAX_LORA_RANK
61
+     lora_alpha: int
62
+     lora_dropout: float
63
+     train_steps: int                      # max local SGD steps per participant
64
+     learning_rate: float
65
+     batch_size: int
66
+     seed: int                             # for deterministic init of LoRA matrices
67
+     dp_noise_scale: float                 # 0.0 = off
68
+     clip_norm: float                      # gradient clip; must be > 0 if DP on
69
+     min_participants: int                 # round aborts if fewer participants submit
70
+     max_participants: int
71
+     deadline: datetime                    # UTC; submissions after this dropped
72
+     topic: str                            # free-form: "niederrhein-emergency", "village-chat"
73
+     consent_text: str                     # human-readable; participant must accept
74
+     coordinator_sig: bytes                # detached Ed25519 over the manifest
75
+
76
+ @dataclass
77
+ class ParticipantSubmission:
78
+     round_id: RoundID
78
+     participant: NodeID
79
+     delta_bytes: bytes                    # serialised LoRA state-dict
80
+     delta_sha: str
81
+     num_samples: int                      # for weighted FedAvg
82
+     train_loss: float                     # for telemetry only
83
+     submitted_at: datetime
84
+     signature: bytes                      # Ed25519 over (round_id, participant, delta_sha, num_samples)
86
+
87
+ @dataclass
88
+ class RoundResult:
89
+     round_id: RoundID
89
+     aggregated_delta_sha: str
90
+     n_participants: int
91
+     total_samples: int
92
+     aggregator: NodeID
93
+     completed_at: datetime
94
+     manifest_sha: str
95
+     download_url: str                     # capability bus uri for fetching the aggregated delta
97
+ ```
98
+
99
+ ### 4.2 Capabilities
100
+
101
+ ```python
102
+ async def fedlearn_round_announce(manifest: RoundManifest) -> RoundID
103
+ async def fedlearn_round_list(topic: str | None = None) -> list[RoundManifest]
104
+ async def fedlearn_round_join(round_id: RoundID, consent: bool) -> JoinReceipt
105
+ async def fedlearn_round_submit(submission: ParticipantSubmission) -> SubmitReceipt
106
+ async def fedlearn_round_status(round_id: RoundID) -> RoundStatus
107
+ async def fedlearn_round_finalize(round_id: RoundID) -> RoundResult # coordinator-only
108
+ async def fedlearn_adapter_fetch(sha: str) -> bytes
109
+ async def fedlearn_adapter_apply(sha: str, scope: Literal["session","node"]) -> ApplyReceipt
110
+ ```
111
+
112
+ All capabilities are in the `experimental.fedlearn.*` namespace and only registered on the bus when `experimental.fedlearn = true` in the node config.
113
+
114
+ ### 4.3 Coordinator class
115
+
116
+ ```python
117
+ class RoundCoordinator:
118
+     def __init__(self,
118
+                  bus: CapabilityBus,
119
+                  event_log: EventLog,
120
+                  llm: LLMService,
121
+                  fedlearn_config: FedLearnConfig): ...
122
+
123
+     async def announce_round(self, draft: RoundManifestDraft) -> RoundID: ...
124
+     async def collect_submissions(self, round_id: RoundID) -> list[ParticipantSubmission]: ...
125
+     async def aggregate(self, round_id: RoundID) -> bytes: ...
126
+     async def finalize_and_publish(self, round_id: RoundID) -> RoundResult: ...
127
+
128
+     # internal
129
+     async def _validate_submission(self, sub: ParticipantSubmission, manifest: RoundManifest) -> None: ...
130
+     async def _emit(self, evt: Event) -> None: ...
132
+ ```
133
+
134
+ ### 4.4 Participant class
135
+
136
+ ```python
137
+ class RoundParticipant:
138
+     def __init__(self,
138
+                  bus: CapabilityBus,
139
+                  event_log: EventLog,
140
+                  llm: LLMService,
141
+                  data_provider: TrainingDataProvider,
142
+                  fedlearn_config: FedLearnConfig): ...
143
+
144
+     async def consider_round(self, manifest: RoundManifest) -> Decision: ...
145
+     async def train(self, manifest: RoundManifest) -> ParticipantSubmission: ...
146
+     async def submit(self, submission: ParticipantSubmission) -> SubmitReceipt: ...
147
+     async def apply_aggregated(self, result: RoundResult, scope: Literal["session","node"]) -> ApplyReceipt: ...
149
+ ```
150
+
151
+ ### 4.5 Aggregator
152
+
153
+ ```python
154
+ class FedAvgAggregator:
155
+     def __init__(self, manifest: RoundManifest): ...
155
+
156
+     def add(self, submission: ParticipantSubmission, delta: dict[str, Tensor]) -> None: ...
157
+     def aggregate(self) -> dict[str, Tensor]: ...    # weighted by num_samples
158
+
159
+ class SecureFedAvgAggregator(FedAvgAggregator):
160
+     """Optional: pairwise masking so the aggregator sees only the sum, never individual deltas."""
161
+     def __init__(self, manifest: RoundManifest, mask_scheme: Literal["additive_pairwise"] = "additive_pairwise"): ...
163
+ ```
164
+
165
+ ### 4.6 Privacy helpers
166
+
167
+ ```python
168
+ def clip_gradient(state_dict: dict[str, Tensor], max_norm: float) -> dict[str, Tensor]
169
+ def add_dp_noise(state_dict: dict[str, Tensor], scale: float, rng: Generator) -> dict[str, Tensor]
170
+ def epsilon_estimate(scale: float, clip: float, n_steps: int, batch: int, dataset_size: int) -> float
171
+ ```
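
To make the first two helper signatures concrete, here is a rough sketch using plain Python lists in place of `Tensor` and `random.Random` in place of the `Generator` type. Both substitutions, the global L2 norm for clipping, and Gaussian noise with standard deviation `scale` are assumptions; the spec does not fix the noise distribution or how `scale` relates to `clip_norm`.

```python
import math
import random

StateDict = dict[str, list[float]]  # stand-in for dict[str, Tensor]


def clip_gradient(state_dict: StateDict, max_norm: float) -> StateDict:
    """Scale all tensors by one factor so the global L2 norm is at most max_norm."""
    total = math.sqrt(sum(v * v for values in state_dict.values() for v in values))
    factor = 1.0 if total <= max_norm else max_norm / total
    return {name: [v * factor for v in values] for name, values in state_dict.items()}


def add_dp_noise(state_dict: StateDict, scale: float, rng: random.Random) -> StateDict:
    """Add zero-mean Gaussian noise with standard deviation `scale` to every entry."""
    if scale == 0.0:
        # Noise disabled: return an unchanged copy.
        return {name: list(values) for name, values in state_dict.items()}
    return {name: [v + rng.gauss(0.0, scale) for v in values]
            for name, values in state_dict.items()}
```

Clipping before adding noise bounds each participant's contribution, which is the reason the manifest requires `clip_norm > 0` whenever DP is on.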
172
+
173
+ ---
174
+
175
+ ## 5. Behaviour
176
+
177
+ ### 5.1 Round lifecycle
178
+
179
+ ```
180
+ ANNOUNCED ──join──▶ JOINED ──train──▶ TRAINED ──submit──▶ SUBMITTED ──┐
181
+     │                                                                 │
182
+     │                             ┌──────────── aggregate ◀───────────┘
183
+     │                             ▼
184
+     └────deadline reached────▶ AGGREGATING ──finalize──▶ COMPLETED
185
+                                   │
186
+                                   └──min_participants not met──▶ ABORTED
187
+ ```
188
+
189
+ State transitions are recorded as events (`fedlearn.round.*`) on the coordinator's event log. Participants see their own state mirrored via subscription.
190
+
191
+ ### 5.2 Manifest signing
192
+
193
+ The manifest is canonicalised (JCS, like federation manifests in M14 §5.2), then signed with Ed25519 using the coordinator's node key. Participants must verify the signature before training. A manifest with an invalid signature is dropped silently and logged as a security event (`security.signature.invalid`).
194
+
195
+ ### 5.3 Consent flow
196
+
197
+ When `fedlearn.round.join` is called, the participant module must:
198
+
199
+ 1. Check `experimental.fedlearn` is enabled in node config. If not → `experimental_disabled`.
200
+ 2. Display `manifest.consent_text` to the operator via the M11 Notifications path. The operator must explicitly accept. The acceptance is stored as a signed `fedlearn.consent.granted` event.
201
+ 3. Verify coordinator signature. If invalid → `signature_invalid` (we deliberately don't say *whose* signature; bystanders learn nothing useful).
202
+ 4. Check `base_model_sha` against the locally-installed base model. If mismatch → `base_model_mismatch`. Do not download a different base on demand; this is a hard error.
203
+ 5. Check resource budget: estimate VRAM and disk for the training run from `lora_rank * len(target_modules) * hidden_size`. If insufficient → `insufficient_resources`.
204
+ 6. If all checks pass → emit `fedlearn.round.joined`, return `JoinReceipt`.
205
+
206
+ ### 5.4 Local training
207
+
208
+ The trainer wraps M04's LLM handle in a HuggingFace `peft.LoraConfig` and uses `bitsandbytes` 4-bit base + fp16 LoRA matrices. Training data is provided by an injected `TrainingDataProvider`; the module never reaches into other modules' storage. Typical providers:
209
+
210
+ - `ChatHistoryProvider` (asks M10 for redacted, consented chat turns),
211
+ - `KBProvider` (asks M07 for documents tagged for training),
212
+ - `CustomFileProvider` (operator-curated training set).
213
+
214
+ After `train_steps` steps or convergence (loss plateau over a window), the trainer extracts the LoRA state-dict, applies optional gradient clipping and DP noise (if `manifest.dp_noise_scale > 0`), serialises, signs, and returns a `ParticipantSubmission`.
215
+
216
+ ### 5.5 Aggregation
217
+
218
+ The default aggregator is weighted **FedAvg**: each adapter weight is weighted by `num_samples` and averaged across submissions. After aggregation, the coordinator emits `fedlearn.round.aggregated` and stores the aggregated delta via the capability bus (using the same content-addressed file path that M06 Files uses).
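
As a worked illustration of the weighting, the aggregate for each adapter entry is the sample-weighted mean, sum(n_i * delta_i) / sum(n_i). The sketch below uses plain lists instead of tensors, and the `fedavg` function name and input shape are assumptions for the example rather than the `FedAvgAggregator` interface.

```python
def fedavg(submissions: list[tuple[int, dict[str, list[float]]]]) -> dict[str, list[float]]:
    """Weighted FedAvg over (num_samples, delta) pairs; all deltas share the same keys."""
    total = sum(n for n, _ in submissions)
    if total <= 0:
        raise ValueError("no samples to weight by")
    first = submissions[0][1]
    out: dict[str, list[float]] = {}
    for name, values in first.items():
        acc = [0.0] * len(values)
        for n, delta in submissions:
            weight = n / total  # each participant's share of all samples
            for i, v in enumerate(delta[name]):
                acc[i] += weight * v
        out[name] = acc
    return out
```

With one participant contributing 1 sample and another 3 samples, the second delta counts three times as much: entries `0.0` and `4.0` average to `3.0`, not `2.0`.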
219
+
220
+ If the round was declared with `secure=true` in the draft, `SecureFedAvgAggregator` is used: each participant pair establishes an additive mask, masks cancel in the sum, and the aggregator never sees individual deltas. This costs an extra round-trip between participants before submission (the *mask exchange phase*) and requires `min_participants ≥ 3`.
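
The cancellation property of pairwise additive masks can be demonstrated with a toy example. This sketch makes several assumptions not stated in the spec: updates are encoded as integers modulo 2^32 (for example via fixed-point quantisation), and a single seeded `random.Random` stands in for the per-pair shared secrets that real participants would agree on during the mask exchange phase.

```python
import random

MOD = 2 ** 32  # assumed modulus for the integer encoding of updates


def pairwise_masks(n_participants: int, length: int, seed: int = 0) -> list[list[int]]:
    """For every pair (i, j) with i < j, participant i adds a mask and j subtracts it."""
    rng = random.Random(seed)  # stand-in for pairwise shared secrets
    masks = [[0] * length for _ in range(n_participants)]
    for i in range(n_participants):
        for j in range(i + 1, n_participants):
            for k in range(length):
                m = rng.randrange(MOD)
                masks[i][k] = (masks[i][k] + m) % MOD
                masks[j][k] = (masks[j][k] - m) % MOD
    return masks


def apply_mask(update: list[int], mask: list[int]) -> list[int]:
    return [(u + m) % MOD for u, m in zip(update, mask)]


def aggregate_sum(masked: list[list[int]]) -> list[int]:
    """The aggregator only ever adds up masked vectors; the masks cancel in the sum."""
    return [sum(vec[k] for vec in masked) % MOD for k in range(len(masked[0]))]
```

Usage: three participants with updates `[1, 2]`, `[10, 20]` and `[100, 200]` each submit `apply_mask(update, mask)`; `aggregate_sum` then recovers `[111, 222]` without the aggregator seeing any unmasked update. With only two participants, each could subtract its own update from the sum and learn the other's, which is one reason for the three-participant minimum.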
221
+
222
+ ### 5.6 Distribution
223
+
224
+ The aggregated adapter is published as a content-addressed file. Participants who joined the round get a `fedlearn.round.completed` event with the SHA. They can choose to:
225
+
226
+ - **Session apply**: load into a single LLM session via M04 (`llm.session.apply_adapter`),
227
+ - **Node apply**: install as the default adapter for the node (requires explicit operator action),
228
+ - **Discard**: do nothing.
229
+
230
+ Non-participants can also fetch and apply adapters they trust. There is no DRM and no whitelist: the aggregated delta is just a file with a SHA.
231
+
232
+ ### 5.7 Failure modes
233
+
234
+ - **Coordinator vanishes mid-round:** participants wait until `deadline`, then any participant can call `fedlearn.round.finalize_takeover(round_id)` which constructs the aggregated delta from received submissions and re-publishes. The takeover is signed by the takeover-node and is visible as such.
235
+ - **A participant submits garbage:** validation in `_validate_submission` checks tensor shapes, dtypes, finite-ness (no NaN/Inf), and that the delta is structurally a valid LoRA state-dict for the manifest's `lora_target_modules`. Garbage submissions are dropped and logged.
236
+ - **Sybil flooding:** all participants must be authenticated with M01 identity and the manifest can require a minimum reputation/trust score (this is open research; for v3.0 the field exists in the manifest but is not yet enforced).
237
+ - **Adversarial gradient (poisoning):** out of scope for v3.0; documented in Open Research Questions §10.
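
The structural checks described for garbage submissions might look roughly like this. It is a simplified sketch over plain lists: the `validate_delta` name, the `expected_lengths` argument and the use of `ValueError` with a `delta_invalid` prefix are assumptions, and the dtype check is omitted because plain floats carry no dtype.

```python
import math


def validate_delta(delta: dict[str, list[float]], expected_lengths: dict[str, int]) -> None:
    """Raise ValueError if the delta is not a structurally valid, finite state-dict."""
    if set(delta) != set(expected_lengths):
        # Missing or unexpected target modules.
        raise ValueError("delta_invalid: keys do not match the manifest's target modules")
    for name, values in delta.items():
        if len(values) != expected_lengths[name]:
            raise ValueError(f"delta_invalid: wrong shape for {name}")
        if not all(math.isfinite(v) for v in values):
            raise ValueError(f"delta_invalid: non-finite value in {name}")
```

These checks only reject malformed input; a well-formed but malicious delta passes them, which is why poisoning is listed separately.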
238
+
239
+ ---
240
+
241
+ ## 6. Errors
242
+
243
+ | Code | When |
244
+ |---------------------------------|---------------------------------------------------------------------|
245
+ | `experimental_disabled` | Caller invokes a fedlearn capability with the flag off |
246
+ | `signature_invalid` | Manifest or submission signature does not verify |
247
+ | `base_model_mismatch` | Local base model SHA differs from manifest |
248
+ | `insufficient_resources` | Estimated VRAM/disk exceeds budget |
249
+ | `consent_required` | join() called without an explicit consent record |
250
+ | `round_full` | `max_participants` reached |
251
+ | `round_closed` | Submission after deadline |
252
+ | `delta_invalid` | Submitted state-dict fails structural validation |
253
+ | `fedlearn_aggregation_failed` | Aggregation produced NaN/Inf or insufficient submissions |
254
+ | `fedlearn_min_participants_unmet` | Round closes with fewer than `min_participants` valid submissions |
255
+ | `fedlearn_aggregator_unreachable` | finalize() called while coordinator is offline and takeover not triggered |
256
+ | `adapter_not_found` | `fedlearn.adapter.fetch` for an unknown SHA |
257
+
258
+ ---
259
+
260
+ ## 7. Configuration
261
+
262
+ ```python
263
+ @dataclass(frozen=True)
264
+ class FedLearnConfig:
265
+     enabled: bool = False                                               # master switch; default off
266
+     max_lora_rank: int = FEDLEARN_MAX_LORA_RANK                         # 64
267
+     max_lora_target_modules: int = FEDLEARN_MAX_LORA_TARGET_MODULES     # 8
268
+     max_train_steps: int = FEDLEARN_MAX_TRAIN_STEPS                     # 1000
269
+     max_round_participants: int = FEDLEARN_MAX_PARTICIPANTS             # 32
270
+     min_round_participants: int = FEDLEARN_MIN_PARTICIPANTS             # 3
271
+     dp_noise_scale_default: float = FEDLEARN_DP_NOISE_SCALE_DEFAULT     # 0.0 (off)
272
+     clip_norm_default: float = FEDLEARN_CLIP_NORM_DEFAULT               # 1.0
273
+     submission_max_bytes: int = FEDLEARN_SUBMISSION_MAX_BYTES           # 64 MiB
274
+     require_secure_aggregation: bool = False
275
+     auto_apply_aggregated: bool = False                                 # never auto-apply by default
276
+     training_vram_budget_mb: int = 8192
277
+     training_disk_budget_mb: int = 4096
278
+ ```
279
+
280
+ All `FEDLEARN_*` constants live in `hearthnet/constants.py` so a single source of truth governs both validation and documentation generation.
281
+
282
+ ---
283
+
284
+ ## 8. Tests
285
+
286
+ ### 8.1 Unit
287
+
288
+ - `test_manifest_canonicalisation_stable`: re-encoding does not change SHA.
288
+ - `test_manifest_signature_roundtrip`.
289
+ - `test_delta_serialisation_roundtrip`: tensors preserve dtype and shape.
290
+ - `test_fedavg_weighted_arithmetic`: manually averaged deltas match aggregator output to within fp16 noise.
291
+ - `test_dp_noise_zero_is_identity`: `add_dp_noise(d, scale=0.0)` is a no-op.
292
+ - `test_clip_gradient_norm`: post-clip norm ≤ `max_norm`.
293
+ - `test_secure_aggregation_masks_cancel`: sum of masks across all pairs is zero.
295
+
296
+ ### 8.2 Property
297
+
298
+ - Across random shapes, `fedavg([d, d, d]) == d`.
299
+ - Across random submissions, `fedavg(submissions)` is finite when all inputs are finite.
300
+
301
+ ### 8.3 Integration
302
+
303
+ - Two-node loopback round on a 0.5B base model: announce → join → train (synthetic data, 10 steps) → submit → aggregate → apply. Aggregated adapter must be loadable and must not blow up perplexity by more than 2x on a held-out set (sanity, not quality).
304
+ - Coordinator-failure round: simulate coordinator going offline after submissions received; takeover by another participant produces an aggregated delta with the same SHA.
305
+ - Sybil-defence stub: round with `min_participants=3` and only 2 valid submissions aborts with `fedlearn_min_participants_unmet`.
306
+
307
+ ### 8.4 Negative
308
+
309
+ - Wrong base SHA → `base_model_mismatch`.
309
+ - Submission with NaN in one tensor → `delta_invalid`.
310
+ - Submission missing one of the target modules → `delta_invalid`.
311
+ - Manifest signed by an untrusted identity → `signature_invalid`.
312
+ - Disabled flag → `experimental_disabled` even for read-only queries.
314
+
315
+ ---
316
+
317
+ ## 9. Cross-references
318
+
319
+ - **Phase 1 M04 LLM**: provides the local model handle, exposes `llm.session.apply_adapter` and `llm.adapter.list`.
319
+ - **Phase 1 M07 Knowledge Base**: `KBProvider` reads tagged documents for training.
320
+ - **Phase 2 M14 Federation**: federated rounds across communities use the federation transport for manifest distribution and submission. Cross-community rounds require both communities' DPOs to sign the round consent.
321
+ - **Phase 2 M16 Tokens**: round participation tokens (`fedlearn-participant` scope) are issued by the coordinator and bound to a single round.
322
+ - **Phase 2 M25 Group Chat**: `village-chat` rounds typically draw training data from group chat history (consented turns only).
323
+ - **Phase 3 M30 Evidence/EBKH**: aggregated adapters can be tracked as claims in the evidence graph; "adapter X improved perplexity on held-out set Y" is a `claim.assert`.
325
+
326
+ ---
327
+
328
+ ## 10. Open research questions
329
+
330
+ 1. **Gradient poisoning defence.** Coordinated malicious participants can submit deltas that, when aggregated, degrade or backdoor the adapter. Byzantine-robust aggregation rules (Krum, trimmed mean) are a partial defence; an authenticated-data attestation (per-submission proof that gradients were computed on real, non-cherry-picked data) is the harder question. v3.0 ships FedAvg only; v3.1 may add Krum behind a flag.
331
+
332
+ 2. **Heterogeneous base models.** Today, every participant in a round must run the same base model at the same quantisation. Cross-base aggregation (e.g., projecting LoRA from Qwen-3B-Q4 to Qwen-3B-Q5 or even Qwen-3B β†’ Qwen-7B) is open. The naive approach (re-projecting via a translation matrix learnt from a calibration set) loses accuracy quickly.
333
+
334
+ 3. **Adaptive DP-noise.** Fixed `dp_noise_scale` is crude. Per-round noise calibration as a function of `min_participants` and `lora_rank` would tighten the privacy/utility tradeoff. Out of scope for v3.0.
335
+
336
+ 4. **Reputation-weighted FedAvg.** Weighting submissions by `num_samples * trust_score` instead of `num_samples` alone. Requires a credible trust signal, which the broader HearthNet design has not yet committed to.
337
+
338
+ 5. **Continual rounds.** Today each round produces a stand-alone adapter. Stacking rounds (round N tunes on top of round N-1's aggregate) raises questions about drift, fairness, and rollback. Probably belongs in a future M28b.
339
+
340
+ 6. **Cross-task adapters.** A `niederrhein-emergency` adapter and a `village-chat` adapter are trained separately. Whether they can be cleanly combined at inference time (LoRA composition) is a known-hard problem and explicitly not promised here.
341
+
342
+ 7. **Hardware-class fairness.** A round held by a participant with an RTX 5090 might exclude phone-class participants by setting `train_steps` too high. A "ranked tier" with separate aggregations per tier is one possibility. Currently the manifest is a single-tier flat artefact.
343
+
344
+ 8. **Audit of training data.** Even though raw data never leaves the node, the *fact that training happened on consented data* is currently un-auditable from the outside. A future zero-knowledge attestation of "this delta was computed on N samples each tagged training=true" would be useful. Out of scope.
345
+
346
+ ---
347
+
348
+ *Last updated: spec v3.0.*
docs/p2_p3/M29-lora-beacons.md ADDED
@@ -0,0 +1,322 @@
1
+ # M29: LoRa Hardware Beacons
2
+
3
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [M02 Transport](../../modules/M02-transport.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M11 Notifications](../../modules/M11-notifications.md), [M01 Identity](../../modules/M01-identity.md)
5
+ **Depended on by:** Civil Defense (M31) optionally consumes beacon-presence signals
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Optional **out-of-band presence and panic-button channel** over 868 MHz LoRa hardware. When the internet is out, when the cellular network is down, when the power grid is wobbly, the LoRa stack still carries a 32-byte "I exist" ping or a tiny panic message between neighbours up to a few kilometres apart.
12
+
13
+ This module is explicitly **not a data channel**. No AI traffic, no chat content, no file transfer. The bandwidth is laughably small (sub-100 bytes per minute per node in a normal duty-cycle regime), the latency is awful, and the airwave is shared. What LoRa is good at is "I'm still here, and the gateway in the next village is still reachable."
14
+
15
+ The module exposes a small set of capabilities for sending and receiving beacons, mapping LoRa device IDs to HearthNet identities, and surfacing the resulting connectivity graph to the rest of the stack as a fallback signal. Hardware is a USB-attached LoRa stick (RFM95W, sx1276, sx1262 chipsets) bridged via serial.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **General-purpose meshing of HearthNet over LoRa.** Bandwidth and duty cycle make this impossible at any useful scale.
22
+ - **Encryption of beacon contents.** Beacons carry identity hash + sequence + minimal flags only. Anything sensitive belongs on a different channel.
23
+ - **Replacing TETRA/BOS.** Emergency services have their own radio. This is a *neighbour-to-neighbour* fallback, not a replacement for professional emergency comms.
24
+ - **Hardware abstraction layer for every LoRa chipset.** v3.0 supports a small whitelist (RFM95W, sx1276, sx1262 via Meshtastic-firmware sticks). Others are open contributions.
25
+ - **Long-distance routing.** No multi-hop store-and-forward in v3.0. A beacon goes one hop or it doesn't go.
26
+ - **Legal interpretation of national radio regulations.** Each operator is responsible for complying with their local rules (BNetzA in Germany, ETSI in EU, FCC in US). The module enforces *configured* duty-cycle limits but cannot enforce the law on the operator.
27
+
28
+ ---
29
+
30
+ ## 3. File layout
31
+
32
+ ```
33
+ hearthnet/lora/
+ ├── __init__.py
+ ├── service.py          # LoraBeaconService: the capability handler
+ ├── serial_bridge.py    # USB-serial framing to the LoRa stick
+ ├── frame.py            # Encode/decode the 32-byte beacon frame
+ ├── duty_cycle.py       # Track airtime, enforce duty-cycle limits
+ ├── peer_map.py         # LoRa device ID ↔ NodeID mapping (with TOFU verification)
+ └── adapters/
+     ├── __init__.py
+     ├── meshtastic.py   # Meshtastic firmware stick
+     ├── rfm95w.py       # Bare RFM95W via serial-port gateway firmware
+     └── sx126x.py       # sx1262 module
45
+ ```
46
+
47
+ ---
48
+
49
+ ## 4. Public API
50
+
51
+ ### 4.1 Dataclasses
52
+
53
+ ```python
54
+ from dataclasses import dataclass
+ from datetime import datetime
+ from typing import Literal, NewType
+
+ # NodeID is the identity type defined by M01 Identity.
+
+ LoraBeaconID = NewType("LoraBeaconID", str)   # device-local sequence + prefix
+ LoraDeviceID = NewType("LoraDeviceID", str)   # hardware ID from the stick
+
+ @dataclass(frozen=True)
+ class LoraBeacon:
+     beacon_id: LoraBeaconID
+     sender_hash: bytes            # 4-byte truncated SHA-256 of sender NodeID
+     sequence: int                 # u16, wraps
+     flags: int                    # u8 bit-field; see §5.1
+     rssi: int | None              # dBm, on receive only
+     snr: float | None             # dB, on receive only
+     timestamp: datetime           # local clock at decode
+
+ @dataclass(frozen=True)
+ class LoraPeer:
+     node_id: NodeID
+     device_id: LoraDeviceID
+     sender_hash: bytes
+     first_seen: datetime
+     last_seen: datetime
+     rssi_recent: int | None
+     verified_tofu: bool           # True after operator confirmation
+
+ @dataclass(frozen=True)
+ class DutyCycleStatus:
+     region: Literal["EU868","US915","AS923"]
+     window_seconds: int
+     airtime_used_ms: int
+     airtime_budget_ms: int        # e.g. 36000 ms in EU868 1% window
+     next_tx_allowed_at: datetime
84
+ ```
85
+
86
+ ### 4.2 Capabilities
87
+
88
+ All under `experimental.lora.*`:
89
+
90
+ ```python
91
+ async def lora_status() -> LoraStatus
92
+ async def lora_beacon_send(flags: int = 0) -> LoraBeaconID
93
+ async def lora_panic_send() -> LoraBeaconID # sets FLAG_PANIC; bypasses normal pacing
94
+ async def lora_peer_list() -> list[LoraPeer]
95
+ async def lora_peer_verify(device_id: LoraDeviceID, node_id: NodeID) -> VerifyReceipt
96
+ async def lora_recent_beacons(since: datetime | None = None) -> list[LoraBeacon]
97
+ async def lora_duty_cycle() -> DutyCycleStatus
98
+ async def lora_subscribe_beacons() -> AsyncIterator[LoraBeacon]
99
+ ```
100
+
101
+ ### 4.3 Service class
102
+
103
+ ```python
104
+ class LoraBeaconService:
+     def __init__(self,
+                  bus: CapabilityBus,
+                  event_log: EventLog,
+                  notifications: NotificationService,
+                  identity: IdentityService,
+                  config: LoraConfig): ...
+
+     async def start(self) -> None: ...   # opens serial, begins RX loop
+     async def stop(self) -> None: ...
+     async def send_beacon(self, flags: int = 0) -> LoraBeaconID: ...
+     async def send_panic(self) -> LoraBeaconID: ...
+     async def on_frame_received(self, raw: bytes, rssi: int, snr: float) -> None: ...
+     async def _drain_rx(self) -> None: ...
+     def duty_cycle_status(self) -> DutyCycleStatus: ...
119
+ ```
120
+
121
+ ### 4.4 Serial bridge
122
+
123
+ ```python
124
+ class SerialBridge:
+     def __init__(self, port: str, baud: int = 115200, adapter: LoraAdapter = ...): ...
+     async def open(self) -> None: ...
+     async def close(self) -> None: ...
+     async def write(self, frame: bytes) -> None: ...
+     async def read(self) -> AsyncIterator[bytes]: ...
+
+ class LoraAdapter(Protocol):
+     """Per-chipset/firmware framing rules."""
+     name: str
+     def encode_tx(self, payload: bytes) -> bytes: ...
+     def decode_rx(self, raw: bytes) -> tuple[bytes, int, float]: ...   # payload, rssi, snr
+     def at_init_commands(self) -> list[bytes]: ...
137
+ ```
138
+
139
+ ---
140
+
141
+ ## 5. Behaviour
142
+
143
+ ### 5.1 Beacon frame
144
+
145
+ Strictly 32 bytes, big-endian:
146
+
147
+ ```
148
+ offset  size  field
+ 0       1     version       (currently 0x01)
+ 1       4     sender_hash   (SHA-256(NodeID)[:4])
+ 5       2     sequence      (u16, wraps)
+ 7       1     flags
+ 8       1     reserved      (0x00)
+ 9       4     unix_seconds  (sender's clock, u32; informational only)
+ 13      19    payload       (currently zero-padded; reserved for future use)
156
+ ```
157
+
158
+ No payload content is carried beyond identity-hash + flags + clock. The flags field carries:
159
+
160
+ ```python
161
+ FLAG_PANIC       = 0x01   # urgent attention requested
+ FLAG_OK          = 0x02   # explicit "I'm fine" (operator pressed an OK button)
+ FLAG_GATEWAY     = 0x04   # this node has an alternate transport currently up
+ FLAG_LOW_BATTERY = 0x08   # device-level low-battery indicator
+ # Bits 0x10..0x80 are reserved (FLAG_RESERVED_*).
166
+ ```
167
+
168
+ Frames are not encrypted, and they are not anonymous either: the 4-byte sender hash is short enough that collisions are possible, yet stable enough that a passive observer can correlate beacons from the same sender over time. This is documented and accepted in the threat model, because LoRa airwaves are observable by construction.
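
The frame layout above maps directly onto a fixed `struct` format. The following is a minimal sketch, not the module's actual `frame.py`: the function names are hypothetical, and it assumes a NodeID is a string hashed as UTF-8, which the spec does not state.

```python
import hashlib
import struct

# version, sender_hash, sequence, flags, reserved, unix_seconds, payload (big-endian)
FRAME_FORMAT = ">B4sHBBI19s"
FRAME_SIZE = struct.calcsize(FRAME_FORMAT)  # 32 bytes for this format
FRAME_VERSION = 0x01
FLAG_PANIC = 0x01


def sender_hash(node_id: str) -> bytes:
    """SHA-256(NodeID)[:4]; UTF-8 encoding of the NodeID is an assumption."""
    return hashlib.sha256(node_id.encode("utf-8")).digest()[:4]


def encode_frame(node_id: str, sequence: int, flags: int, unix_seconds: int) -> bytes:
    """Pack one beacon; sequence wraps at u16 and the payload is zero-padded."""
    return struct.pack(
        FRAME_FORMAT,
        FRAME_VERSION,
        sender_hash(node_id),
        sequence & 0xFFFF,
        flags & 0xFF,
        0x00,
        unix_seconds & 0xFFFFFFFF,
        bytes(19),
    )


def decode_frame(raw: bytes) -> dict:
    """Unpack and structurally validate a beacon frame."""
    if len(raw) != FRAME_SIZE:
        raise ValueError("lora_frame_malformed: wrong length")
    version, shash, sequence, flags, _reserved, unix_seconds, _payload = struct.unpack(
        FRAME_FORMAT, raw
    )
    if version != FRAME_VERSION:
        raise ValueError("lora_frame_malformed: unsupported version")
    return {
        "sender_hash": shash,
        "sequence": sequence,
        "flags": flags,
        "unix_seconds": unix_seconds,
    }
```

Raising `ValueError` stands in for the `lora_frame_malformed` error code; a real service would map it onto whatever error type the capability bus uses.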
169
+
170
+ ### 5.2 RX path
171
+
172
+ 1. `SerialBridge` yields raw frames as they arrive.
173
+ 2. `LoraAdapter.decode_rx` peels off the chipset framing and returns the 32-byte payload + RSSI + SNR.
174
+ 3. `service.on_frame_received` validates: length == 32, version == 0x01, sender_hash plausibly maps to a known or unknown peer.
175
+ 4. If `sender_hash` matches a verified peer in `peer_map`, the beacon is recorded against that peer.
176
+ 5. If `sender_hash` is unknown, a `lora.peer.unknown` event is emitted with a TOFU verification prompt for the operator.
177
+ 6. If `FLAG_PANIC` is set, a high-priority notification is raised via M11 regardless of peer-verification status.
178
+ 7. The beacon is published on the bus subscription `experimental.lora.beacon.received`.
179
+
180
+ ### 5.3 TX path and duty cycle
181
+
182
+ Beaconing follows a fixed cadence `LORA_BEACON_PERIOD_SECONDS` (default 600 = 10 minutes). Each transmission's airtime is computed from spreading factor, bandwidth, and coding rate (typical: SF9, BW125, CR 4/5 → roughly 250 ms per 32-byte frame with an 8-symbol preamble, explicit header, and CRC) and added to the duty-cycle window.
183
+
184
+ The duty-cycle window enforces the region's regulation:
185
+
186
+ | Region | Window | Budget |
187
+ |--------|--------|--------|
188
+ | EU868 | 3600 s | 36 s (1%) |
189
+ | US915 | 3600 s | unlimited (FHSS) but config still applies |
190
+ | AS923 | 3600 s | 36 s (1%) |
191
+
192
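+ If a normal `send_beacon` call would exceed the budget, it is **deferred** until the budget allows. `send_panic` ignores the duty-cycle limit when `duty_cycle_override_for_panic` is enabled (the default); with the override disabled it fails with `lora_duty_cycle_exhausted`. The override rests on the assumption that the regulator tolerates emergency transmissions, which operators must confirm for their own jurisdiction (see §2). Whenever the override is used, the operator is told via notification and the event log records `lora.duty_cycle.overridden`.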
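
The airtime accounting in this section can be sketched as follows. This is an illustration, not the module's `duty_cycle.py`: `lora_airtime_ms` and `DutyCycleTracker` are hypothetical names, and the formula is the time-on-air expression from the Semtech SX127x datasheet. With the defaults assumed here (8-symbol preamble, explicit header, CRC on, coding rate 4/5) it gives roughly 247 ms for a 32-byte frame at SF9/BW125; other settings shift the figure.

```python
import math
from collections import deque


def lora_airtime_ms(payload_len: int, sf: int = 9, bw_khz: int = 125,
                    cr_denom: int = 5, preamble_symbols: int = 8,
                    explicit_header: bool = True, crc: bool = True) -> float:
    """Time on air in milliseconds for one LoRa packet."""
    t_sym = (2 ** sf) / bw_khz              # symbol time in ms (bandwidth in kHz)
    ldro = 1 if t_sym > 16.0 else 0         # low-data-rate optimisation for slow symbols
    ih = 0 if explicit_header else 1
    numerator = 8 * payload_len - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    n_payload = 8 + max(math.ceil(numerator / (4 * (sf - 2 * ldro))) * cr_denom, 0)
    return (preamble_symbols + 4.25 + n_payload) * t_sym


class DutyCycleTracker:
    """Sliding-window airtime budget, e.g. 36 000 ms per 3600 s for a 1% limit."""

    def __init__(self, window_seconds: float = 3600.0, budget_ms: float = 36_000.0):
        self.window_seconds = window_seconds
        self.budget_ms = budget_ms
        self._events: deque[tuple[float, float]] = deque()  # (sent_at, airtime_ms)

    def used_ms(self, now: float) -> float:
        # Forget transmissions that have left the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()
        return sum(airtime for _, airtime in self._events)

    def can_send(self, airtime_ms: float, now: float) -> bool:
        return self.used_ms(now) + airtime_ms <= self.budget_ms

    def record(self, airtime_ms: float, now: float) -> None:
        self._events.append((now, airtime_ms))
```

A normal `send_beacon` would check `can_send` and defer when it returns `False`; a panic send would call `record` regardless when the override is enabled, so the window still reflects the airtime actually used.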
193
+
194
+ ### 5.4 Peer mapping (TOFU)
195
+
196
+ The first time a `sender_hash` is received, the module emits a notification: *"A new LoRa peer with hash 0xABCD1234 was heard. Do you recognise this device?"* The operator can:
197
+
198
+ - **Verify by NodeID**: provide a HearthNet NodeID; the module checks that `SHA-256(NodeID)[:4] == sender_hash` and stores the verified mapping.
+ - **Mark as unknown**: store the hash with no NodeID; future beacons from this hash will still be tracked but flagged unknown.
+ - **Block**: drop all beacons from this hash; never prompt again.
201
+
202
+ Hash collisions (two different NodeIDs producing the same 4-byte hash) are possible but unlikely. When two operators independently verify the same hash to different NodeIDs, the conflict is surfaced as a `lora.peer.conflict` event for manual resolution.
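
The "verify by NodeID" check and the conflict case can be sketched in a few lines. The exception classes and the plain-dict peer map below are hypothetical stand-ins for `peer_map.py`, and the UTF-8 encoding of the NodeID is an assumption.

```python
import hashlib


class PeerHashMismatch(Exception):
    """The supplied NodeID does not hash to the observed sender_hash."""


class PeerConflict(Exception):
    """The sender_hash is already verified against a different NodeID."""


def verify_peer(peer_map: dict[bytes, str], sender_hash: bytes, node_id: str) -> None:
    """Store a verified sender_hash -> NodeID mapping, or raise on mismatch/conflict."""
    expected = hashlib.sha256(node_id.encode("utf-8")).digest()[:4]
    if expected != sender_hash:
        raise PeerHashMismatch(node_id)
    existing = peer_map.get(sender_hash)
    if existing is not None and existing != node_id:
        # Corresponds to the lora_peer_conflict error / lora.peer.conflict event.
        raise PeerConflict(sender_hash.hex())
    peer_map[sender_hash] = node_id
```

Re-verifying the same pair is a no-op, so the operation is safe to repeat.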
203
+
204
+ ### 5.5 Beacon-presence signal
205
+
206
+ Other modules can subscribe to `experimental.lora.beacon.received` to incorporate "this peer is alive on LoRa even though the internet says they're offline" into their own logic. M31 Civil Defense in particular uses this to corroborate that a target node is alive during an outage incident.
207
+
208
+ The presence signal is *advisory*: a node that beacons on LoRa is alive in the radio sense, but that says nothing about whether the operator is responsive or whether higher-layer services are available there.
209
+
210
+ ### 5.6 Failure modes
211
+
212
+ - **No stick attached or USB error:** `lora.status()` reports `unavailable`. The module starts in a disabled state; no errors are raised on startup, only logged.
213
+ - **Stick attached but firmware mismatch:** `at_init_commands` fail; the adapter raises `lora_hardware_unsupported` and the service stays disabled.
214
+ - **Receive flood:** the RX queue is bounded (`LORA_RX_QUEUE_MAX` default 256). Overflow drops oldest entries and emits a `lora.rx.dropped` event.
215
+ - **Clock skew:** beacons carry the sender's clock, but the receiver never trusts it for ordering; the local arrival timestamp is authoritative.
+ - **Adversarial flooding:** an attacker on 868 MHz can spam frames; the duty-cycle limits *us* but not *them*. The service rate-limits beacons per `sender_hash` at the RX side (`LORA_PEER_RX_MAX_PER_MINUTE`, default 20) to avoid filling notifications. Excess beacons from one hash are dropped silently after the rate limit; this is a known DoS vector and documented in §10.
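
The per-hash RX rate limit can be sketched as a sliding one-minute window per `sender_hash`. The class name is hypothetical, and the spec does not say whether the real limiter uses a sliding window or fixed buckets, so treat the windowing as an assumption.

```python
from collections import defaultdict, deque


class PeerRxRateLimiter:
    """Allow at most max_per_minute beacons per sender_hash in any 60 s window."""

    def __init__(self, max_per_minute: int = 20):
        self._max = max_per_minute
        self._seen: dict[bytes, deque[float]] = defaultdict(deque)

    def allow(self, sender_hash: bytes, now: float) -> bool:
        arrivals = self._seen[sender_hash]
        # Drop arrivals that are older than the one-minute window.
        while arrivals and now - arrivals[0] >= 60.0:
            arrivals.popleft()
        if len(arrivals) >= self._max:
            return False          # excess beacon: dropped silently
        arrivals.append(now)
        return True
```

As §10 notes, this only bounds the noise from a single hash; an attacker who rotates hashes is not slowed down by it.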
217
+
218
+ ---
219
+
220
+ ## 6. Errors
221
+
222
+ | Code | When |
223
+ |-------------------------------|------------------------------------------------------------------|
224
+ | `experimental_disabled` | Capability called with the flag off |
225
+ | `lora_hardware_unavailable` | No stick present or serial port not opened |
226
+ | `lora_hardware_unsupported` | Adapter init failed; firmware not whitelisted |
227
+ | `lora_duty_cycle_exhausted` | Non-panic send requested with budget at zero and override off |
228
+ | `lora_peer_unknown` | `lora.peer.verify` for a sender_hash we've never seen |
229
+ | `lora_peer_conflict` | verify() would create a (hash β†’ two distinct NodeIDs) mapping |
230
+ | `lora_frame_malformed` | RX frame fails structural validation |
231
+
232
+ ---
233
+
234
+ ## 7. Configuration
235
+
236
+ ```python
237
+ @dataclass(frozen=True)
+ class LoraConfig:
+     enabled: bool = False
+     serial_port: str = "/dev/ttyUSB0"          # also Windows COM4, etc.
+     serial_baud: int = 115200
+     adapter: Literal["meshtastic","rfm95w","sx126x"] = "meshtastic"
+     region: Literal["EU868","US915","AS923"] = "EU868"
+     spreading_factor: int = 9                  # 7..12; higher = more range, less rate
+     bandwidth_khz: int = 125
+     coding_rate_denom: int = 5                 # 4/5
+     tx_power_dbm: int = 14                     # legal max for EU868
+     beacon_period_seconds: int = LORA_BEACON_PERIOD_SECONDS_DEFAULT   # 600
+     panic_burst_count: int = 3                 # PANIC sends this many frames rapid-fire
+     panic_burst_gap_ms: int = 800
+     rx_queue_max: int = LORA_RX_QUEUE_MAX      # 256
+     peer_rx_max_per_minute: int = LORA_PEER_RX_MAX_PER_MINUTE   # 20
+     tofu_auto_accept: bool = False             # never auto-trust new hashes by default
+     duty_cycle_override_for_panic: bool = True
255
+ ```
256
+
257
+ Constants live in `hearthnet/constants.py`.
258
+
259
+ ---
260
+
261
+ ## 8. Tests
262
+
263
+ ### 8.1 Unit
264
+
265
+ - `test_frame_encode_decode_roundtrip`: random payloads encode to exactly 32 bytes and round-trip.
+ - `test_sender_hash_matches_nodeid`: `SHA-256(NodeID)[:4]` matches the field in the encoded frame.
+ - `test_duty_cycle_tracks_airtime`: synthetic transmissions accumulate; budget drains; recovers over time.
+ - `test_panic_overrides_duty_cycle`: `send_panic` succeeds at zero budget when override is enabled.
+ - `test_panic_blocked_when_override_disabled`: `send_panic` returns `lora_duty_cycle_exhausted` when override is off.
+ - `test_peer_rx_rate_limit`: 30 frames from one hash within a minute → only 20 surface.
271
+
272
+ ### 8.2 Integration (loopback)
273
+
274
+ - Mock `SerialBridge` echoes TX as RX after a configurable delay. Verify a sent beacon shows up in `recent_beacons` and on the subscription.
275
+ - Two simulated nodes (separate SerialBridges connected via an in-memory channel) β€” A sends, B receives, B's peer_map contains A after TOFU verification, RSSI/SNR are populated.
276
+
277
+ ### 8.3 Hardware-in-the-loop (optional)
278
+
279
+ - With a real LoRa stick, send N beacons and verify duty-cycle accounting matches what the firmware reports.
280
+ - Range test: two sticks at increasing distance; record packet-loss vs distance.
281
+
282
+ ### 8.4 Negative
283
+
284
+ - Disabled flag → all capabilities return `experimental_disabled`.
+ - No serial port → `lora_hardware_unavailable` on status.
+ - Truncated frame → `lora_frame_malformed`, dropped.
+ - Conflicting verify → `lora_peer_conflict`.
288
+
289
+ ---
290
+
291
+ ## 9. Cross-references
292
+
293
+ - **Phase 1 M01 Identity**: `SHA-256(NodeID)[:4]` is the sender hash; verification uses M01's NodeID type.
+ - **Phase 1 M02 Transport**: LoRa is *not* a Transport in the M02 sense. It does not carry capability-bus traffic; it lives parallel to M02 as an alternative signalling channel. The two share no code.
+ - **Phase 1 M11 Notifications**: high-priority panic-beacon notifications and TOFU prompts route through M11.
+ - **Phase 1 X02 Event Log**: `lora.*` events.
+ - **Phase 3 M31 Civil Defense**: beacon-presence is one corroborating signal for "is the target node alive" during an incident.
+ - **Phase 3 X09 Conformance Suite**: LoRa is an optional capability; conformance tests use a mock serial bridge.
299
+
300
+ ---
301
+
302
+ ## 10. Open research questions
303
+
304
+ 1. **Mesh routing.** Multi-hop store-and-forward over LoRa is well-explored in the Meshtastic project. Whether HearthNet should adopt it (and inherit the bandwidth tradeoffs) or keep one-hop simplicity is unsettled. Probably belongs in M29b.
305
+
306
+ 2. **Authenticated beacons.** Adding even a 4-byte MAC would let receivers reject forged sender-hashes. This costs payload space we don't have today. A 64-byte frame variant (`version 0x02`) with HMAC-truncated-to-8-bytes is the obvious extension.
307
+
308
+ 3. **DoS robustness.** Per-hash rate limiting is naive; an attacker just rotates hashes. The defence on 868 MHz is mostly the regulatory duty-cycle and physical proximity, neither of which we control in software. Documented as a known limitation.
309
+
310
+ 4. **Sleep-and-wake duty cycles.** Battery-powered nodes (a panic button by the bedside) want to sleep most of the time and wake on demand. Class-A/B/C LoRaWAN-style scheduling is the standard answer. Out of scope for v3.0.
311
+
312
+ 5. **Chipset coverage.** v3.0 supports a small whitelist. Each new chipset is an adapter shaped exactly like the existing ones; contributors are encouraged.
313
+
314
+ 6. **GPS integration.** Many LoRa sticks ship with a GPS module. We deliberately did not surface location data in v3.0: location is privacy-sensitive and the use case is unclear. A future `FLAG_HAS_GPS` + paired side-channel might make sense for civil-defence scenarios.
315
+
316
+ 7. **Integration with civil-defence radio.** TETRA-BOS and BOS-Digitalfunk are professional networks we have no business interoperating with. But a *unidirectional* "did the BOS station broadcast a known alert" listener might be useful. Legally complex.
317
+
318
+ 8. **Network coding.** When multiple nearby nodes beacon, the airwave fills. Cooperative beacon scheduling (so neighbours don't transmit on top of each other) is a fun problem. Currently each node beacons independently and collisions are accepted.
319
+
320
+ ---
321
+
322
+ *Last updated: spec v3.0.*
docs/p2_p3/M30-evidence-ebkh.md ADDED
@@ -0,0 +1,384 @@
1
+ # M30: Evidence Graph & EBKH Integration
+
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M01 Identity](../../modules/M01-identity.md), [M06 Files](../../modules/M06-files.md), [M07 Knowledge Base](../../modules/M07-knowledge-base.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [M21 Tool Calls](../../phase-2/modules/M21-tool-calls.md)
5
+ **Depended on by:** M31 Civil Defense (alerts may carry an evidence chain); RAG quality reports
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ A **content-addressed claim graph** layered alongside the append-only event log, plus an integration adapter for Christof's existing **EBKH v3+** (event-sourced knowledge hub with OSINT capabilities, PostGIS, and the class-reference graph already described in his upstream work).
12
+
13
+ The event log answers "what happened, in what order". The evidence graph answers a different question: "what is asserted, by whom, on what basis, and what counterclaims exist?" This is necessary because in a community AI mesh, *information has provenance*: a claim about when the Sankt Martins parade starts, where the local emergency assembly point is, or what the Volksbank's IBAN is must be traceable to a source, must be cross-checkable, and must support dispute. The event log doesn't do that; it just records that someone said something.
14
+
15
+ The module is small in line count but conceptually load-bearing: it defines what counts as evidence, what a claim is, how claims compose, and how the rest of the stack queries provenance. The EBKH adapter wires this to Christof's already-built PostGIS + OSINT infrastructure so the two systems converge rather than duplicate.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **Replacing the event log.** The event log is still the source of truth for *what happened*. The evidence graph is a *derived* view over claims extracted from events plus claims asserted directly.
22
+ - **Adjudicating truth.** The module records claims, disputes, and attestations. It does not decide who is right.
23
+ - **General-purpose knowledge graph.** This is not a Wikidata clone. The schema is deliberately narrow: claim, source, attestation, dispute, derivation.
24
+ - **Replacing RAG.** RAG retrieves passages. The evidence graph annotates those passages with provenance and lets a downstream summariser say "according to source X (last verified Y)". Different layer.
25
+ - **OSINT collection.** EBKH already collects from external sources; this module *integrates* with that collector, it does not duplicate it.
26
+ - **Mandatory adoption.** RAG and chat can still operate without consulting the evidence graph. The graph is queried when callers explicitly want provenance.
27
+
28
+ ---
29
+
30
+ ## 3. File layout
31
+
32
+ ```
33
+ hearthnet/evidence/
+ ├── __init__.py
+ ├── service.py          # EvidenceService: capability handler
+ ├── claim.py            # Claim, ClaimSource, Attestation, Dispute, Derivation dataclasses
+ ├── store.py            # ClaimStore: append-only with content-addressed ClaimIDs
+ ├── query.py            # Provenance traversal: trace(), neighbours(), conflicts()
+ ├── extractor.py        # Pulls claims out of events (chat, KB ingest, federation)
+ ├── ebkh_adapter.py     # Bridge to Christof's EBKH v3+ via JSON-RPC
+ └── trust.py            # Per-source trust scoring (advisory, no enforcement)
42
+ ```
43
+
44
+ ---
45
+
46
+ ## 4. Public API
47
+
48
+ ### 4.1 Dataclasses
49
+
50
+ ```python
51
+ ClaimID = NewType("ClaimID", str)     # SHA-256 over canonical claim record
+ SourceID = NewType("SourceID", str)   # "url:https://...", "node:<NodeID>", "doc:<FileID>", "ebkh:<entity_uri>"
+
+ EvidenceLevel = Literal[
+     "unverified",         # claim made, no corroboration
+     "cited",              # claim has at least one source
+     "cross_referenced",   # ≥2 independent sources agree
+     "attested",           # an identified party with skin in the game has signed
+     "disputed",           # at least one counterclaim with sources
+ ]
+
+ @dataclass(frozen=True)
+ class ClaimSource:
+     source_id: SourceID
+     accessed_at: datetime
+     content_sha: str | None          # SHA of the cited content if available
+     excerpt: str | None              # short verbatim excerpt; max 280 chars
+
+ @dataclass(frozen=True)
+ class Claim:
+     claim_id: ClaimID
+     subject: str                     # canonical subject string
+     predicate: str                   # e.g. "starts_at", "located_at", "operated_by"
+     object: str                      # canonical object string
+     asserted_by: NodeID
+     asserted_at: datetime
+     sources: tuple[ClaimSource, ...]
+     confidence: float                # asserter's self-rated confidence [0,1]
+     derived_from: tuple[ClaimID, ...]   # parent claims if this is a derivation
+     signature: bytes                 # asserter's Ed25519 signature
+
+ @dataclass(frozen=True)
+ class Attestation:
+     claim_id: ClaimID
+     attester: NodeID
+     attested_at: datetime
+     rationale: str                   # human-readable "I know this because..."
+     role: str                        # "first-hand witness", "official record holder", "expert"
+     signature: bytes
+
+ @dataclass(frozen=True)
+ class Dispute:
+     claim_id: ClaimID                # the claim being disputed
+     counterclaim_id: ClaimID         # the claim made in response
+     disputer: NodeID
+     disputed_at: datetime
+     rationale: str
+     signature: bytes
+
+ @dataclass(frozen=True)
+ class ProvenanceTrace:
+     claim: Claim
+     sources: tuple[ClaimSource, ...]
+     attestations: tuple[Attestation, ...]
+     disputes: tuple[Dispute, ...]
+     derivation_tree: tuple[Claim, ...]   # walked depth-first, deduplicated
+     evidence_level: EvidenceLevel
+     trust_score: float               # advisory; from trust.py
109
+ ```
110
+
111
+ ### 4.2 Capabilities
112
+
113
+ All under `experimental.evidence.*`:
114
+
115
+ ```python
116
+ async def evidence_claim_assert(draft: ClaimDraft) -> ClaimID
+ async def evidence_claim_dispute(claim_id: ClaimID, counterclaim: ClaimDraft, rationale: str) -> ClaimID
+ async def evidence_claim_attest(claim_id: ClaimID, role: str, rationale: str) -> AttestationReceipt
+ async def evidence_claim_get(claim_id: ClaimID) -> Claim
+ async def evidence_claim_query(subject: str | None = None,
+                                predicate: str | None = None,
+                                object: str | None = None,
+                                min_evidence: EvidenceLevel = "unverified") -> list[Claim]
+ async def evidence_provenance_trace(claim_id: ClaimID, max_depth: int = 5) -> ProvenanceTrace
+ async def evidence_subject_summary(subject: str) -> SubjectSummary
+ async def evidence_ebkh_sync(direction: Literal["pull","push","bidi"] = "bidi") -> SyncReport
127
+ ```
128
+
129
+ ### 4.3 Service class
130
+
131
+ ```python
132
+ class EvidenceService:
+     def __init__(self,
+                  bus: CapabilityBus,
+                  event_log: EventLog,
+                  identity: IdentityService,
+                  store: ClaimStore,
+                  extractor: ClaimExtractor,
+                  ebkh: EbkhAdapter | None,
+                  trust: TrustScorer,
+                  config: EvidenceConfig): ...
+
+     async def assert_claim(self, draft: ClaimDraft) -> ClaimID: ...
+     async def dispute(self, claim_id: ClaimID, counterclaim: ClaimDraft, rationale: str) -> ClaimID: ...
+     async def attest(self, claim_id: ClaimID, role: str, rationale: str) -> AttestationReceipt: ...
+     async def trace(self, claim_id: ClaimID, max_depth: int) -> ProvenanceTrace: ...
+     async def summarise_subject(self, subject: str) -> SubjectSummary: ...
+     async def evidence_level(self, claim_id: ClaimID) -> EvidenceLevel: ...
149
+ ```
150
+
151
+ ### 4.4 Claim store
152
+
153
+ ```python
154
+ class ClaimStore:
+     async def put(self, claim: Claim) -> ClaimID: ...   # idempotent on ClaimID
+     async def get(self, claim_id: ClaimID) -> Claim | None: ...
+     async def by_subject(self, subject: str) -> list[Claim]: ...
+     async def by_triple(self, subject: str, predicate: str, object: str | None) -> list[Claim]: ...
+     async def disputes_of(self, claim_id: ClaimID) -> list[Dispute]: ...
+     async def attestations_of(self, claim_id: ClaimID) -> list[Attestation]: ...
+     async def derivatives_of(self, claim_id: ClaimID) -> list[Claim]: ...
162
+ ```
163
+
164
+ The store is append-only. A "retraction" is itself a claim (`predicate="retracted"`, `object="<claim_id>"`) and is treated as a special kind of dispute by the trace algorithm.
165
+
166
+ ### 4.5 Extractor
167
+
168
+ ```python
169
+ class ClaimExtractor:
+     """Watches the event log and proposes claims from candidate events.
+
+     Proposals are *suggestions*, not auto-asserted. A claim only enters
+     the store when an identified asserter signs it.
+     """
+     async def consume(self, evt: Event) -> list[ClaimDraft]: ...
+     def register_pattern(self, predicate: str, matcher: Callable[[Event], ClaimDraft | None]) -> None: ...
177
+ ```
178
+
179
+ Patterns shipped in v3.0:
180
+
181
+ - KB ingest event → `claim(doc_sha, "contains_text", text_hash)` with the doc as source.
+ - Tool-call event (M21) with HTTP fetch → `claim(url, "served", content_sha)` with the URL as source.
+ - Federation manifest event → `claim(remote_node, "advertises_capability", cap_name)` with the manifest as source.
+ - LoRa beacon (M29) reception → *not* auto-extracted; presence is logged but not claimed.
185
+
186
+ ### 4.6 EBKH adapter
187
+
188
+ ```python
189
+ class EbkhAdapter:
+     def __init__(self, endpoint: str, token: str, postgis_dsn: str | None = None): ...
+
+     async def push_claim(self, claim: Claim) -> EbkhRef: ...
+     async def pull_entity(self, entity_uri: str) -> list[Claim]: ...
+     async def query_spatial(self, bbox: Bbox, predicate: str | None = None) -> list[Claim]: ...
+     async def sync(self, direction: Literal["pull","push","bidi"]) -> SyncReport: ...
196
+ ```
197
+
198
+ The adapter speaks EBKH's JSON-RPC over HTTPS with an Ed25519-bound bearer token (issued via M16). Spatial queries piggy-back on EBKH's PostGIS layer, which is useful for civil-defence claims like "Sammelplatz is at geom(...)". For nodes without EBKH installed, the adapter is `None` and capabilities still function on the local claim store only.
199
+
200
+ ### 4.7 Trust scoring
201
+
202
+ `TrustScorer` produces an advisory `[0,1]` score for a source. The function is intentionally simple and visible:
203
+
204
+ ```python
205
+ class TrustScorer:
+     def score_source(self, source: ClaimSource, context: TrustContext) -> float: ...
+     def score_asserter(self, node_id: NodeID, context: TrustContext) -> float: ...
208
+ ```
209
+
210
+ Inputs include: how long the source has been known, how many of its prior claims were not disputed, whether it's signed by a verified identity, whether it's in an operator-curated allowlist or blocklist. The score is **always shown alongside the claim, never hidden**, and never causes a claim to be omitted from query results; claims are only re-ranked. Operators can override individual scores.
211
+
212
+ ---
213
+
214
+ ## 5. Behaviour
215
+
216
+ ### 5.1 Canonicalisation and ClaimID
217
+
218
+ A claim's identity is its `ClaimID`, defined as:
219
+
220
+ ```
221
+ ClaimID = base32-no-pad( SHA-256( JCS({
+     subject, predicate, object,
+     asserted_by, asserted_at_iso8601,
+     sources: [{source_id, accessed_at_iso, content_sha} ...],
+     confidence_5dp,
+     derived_from: [...sorted ClaimIDs...],
+ }) ) )
228
+ ```
229
+
230
+ The signature is *not* part of the ClaimID. Because `asserted_by` is part of the canonical record, a different asserter making an otherwise identical claim produces a different ClaimID, and the signature provides non-repudiation for that asserter. A claim asserted twice by the same node at the same instant with the same sources is genuinely the same claim, and the store deduplicates it.
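
The ClaimID derivation can be sketched with the standard library. Two assumptions are worth flagging: `json.dumps` with sorted keys and compact separators only approximates JCS (RFC 8785), which is adequate here because every value is a string or `None`; and the source list is sorted before hashing, which the spec implies through its canonicalisation test but does not spell out. The function name and the dict-based source records are hypothetical.

```python
import base64
import hashlib
import json


def compute_claim_id(subject: str, predicate: str, obj: str, asserted_by: str,
                     asserted_at_iso8601: str, sources: list[dict],
                     confidence: float, derived_from: list[str]) -> str:
    """base32-no-pad(SHA-256(canonical JSON)) over the fields listed in the spec."""
    record = {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "asserted_by": asserted_by,
        "asserted_at_iso8601": asserted_at_iso8601,
        # Assumption: sources are ordered canonically so input order is irrelevant.
        "sources": sorted(
            (
                {
                    "source_id": s["source_id"],
                    "accessed_at_iso": s["accessed_at_iso"],
                    "content_sha": s.get("content_sha"),
                }
                for s in sources
            ),
            key=lambda s: (s["source_id"], s["accessed_at_iso"]),
        ),
        "confidence_5dp": f"{confidence:.5f}",   # fixed to five decimal places
        "derived_from": sorted(derived_from),
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

The signature is deliberately absent from `record`, matching the rule that it is not part of the ClaimID.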
231
+
232
+ ### 5.2 Evidence level computation
233
+
234
+ ```
235
+ unverified       = no sources
+ cited            = ≥1 source
+ cross_referenced = ≥2 sources, distinct source_ids, ≥1 not from asserter's own node
+ attested         = ≥1 attestation with role in {"first-hand witness","official record holder"}
+ disputed         = ≥1 unretracted dispute by a node with trust_score ≥ EVIDENCE_DISPUTE_MIN_TRUST
240
+ ```
241
+
242
+ Levels are not mutually exclusive in nature, but the API returns the *strongest applicable level* with `disputed` taking precedence over everything else if present. This way callers default to "show that there's a dispute" rather than burying it under a stronger-sounding label.
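
The precedence rule can be sketched as a single function that checks levels from strongest to weakest, with `disputed` first. The input shapes are simplified assumptions chosen for illustration: sources are `(source_id, from_asserter_node)` pairs, attestations are reduced to their role strings, and disputes are `(disputer_trust, retracted)` pairs.

```python
STRONG_ROLES = {"first-hand witness", "official record holder"}


def evidence_level(sources: list[tuple[str, bool]],
                   attestation_roles: list[str],
                   disputes: list[tuple[float, bool]],
                   dispute_min_trust: float = 0.3) -> str:
    """Return the strongest applicable level; an active dispute overrides all others."""
    if any(not retracted and trust >= dispute_min_trust for trust, retracted in disputes):
        return "disputed"
    if any(role in STRONG_ROLES for role in attestation_roles):
        return "attested"
    distinct_ids = {source_id for source_id, _ in sources}
    has_external = any(not from_asserter for _, from_asserter in sources)
    if len(distinct_ids) >= 2 and has_external:
        return "cross_referenced"
    if sources:
        return "cited"
    return "unverified"
```

An "expert" attestation does not lift a claim to `attested` in this sketch, because the rule above lists only the two roles in `STRONG_ROLES`.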
243
+
244
+ ### 5.3 Provenance trace algorithm
245
+
246
+ `trace(claim_id, max_depth)` does a depth-first walk over `derived_from` edges, deduplicates by ClaimID, and collects every source, attestation, and dispute encountered. The walk stops at `max_depth` (default 5) or at a cycle (cycles shouldn't exist by construction, but we guard anyway).
247
+
248
+ The result is a flat tuple in topological order from root claim outward. UI is expected to render this as a tree or a list, with disputes inlined wherever they occur.
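
The walk over `derived_from` edges can be sketched as a depth-first pre-order traversal with a visited set. The function name and the plain `parents` mapping are hypothetical; a real implementation would read parents from `ClaimStore` and collect sources, attestations, and disputes along the way.

```python
def derivation_walk(root: str, parents: dict[str, tuple[str, ...]],
                    max_depth: int = 5) -> list[str]:
    """Visit root and its ancestors depth-first, each ClaimID at most once."""
    seen: set[str] = set()
    order: list[str] = []

    def visit(claim_id: str, depth: int) -> None:
        # The visited set both deduplicates diamonds and breaks cycles.
        if claim_id in seen or depth > max_depth:
            return
        seen.add(claim_id)
        order.append(claim_id)
        for parent in parents.get(claim_id, ()):
            visit(parent, depth + 1)

    visit(root, 0)
    return order
```

For a diamond where `D` derives from `B` and `C`, and both derive from `A`, the shared ancestor `A` appears once in the result.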
249
+
250
+ ### 5.4 Subject summary
251
+
252
+ `summarise_subject(subject)` is the workhorse for the rest of the stack. It returns:
253
+
254
+ - All claims with this subject, grouped by predicate.
255
+ - For each predicate, the strongest claim by evidence level and trust score.
256
+ - All disputes affecting this subject.
257
+ - A flat list of distinct sources contributing.
258
+
259
+ This is what a RAG pipeline calls to add provenance to its retrieved passages, and what civil-defence (M31) calls to verify a target before publishing an alert.
260
+
261
+ ### 5.5 EBKH sync
262
+
263
+ `evidence.ebkh.sync` runs in three modes:
264
+
265
+ - **pull**: fetch claims for subjects in our local store from EBKH, add as new claims (asserted by the EBKH node identity).
+ - **push**: send our locally-asserted claims to EBKH; EBKH stores them tagged with our node ID.
+ - **bidi**: both, in that order.
268
+
269
+ Sync is idempotent. Each side stores the other-side's claim records; nothing is overwritten. Conflicts (same triple, different sources) become co-existing claims, and disputes can be raised normally.
270
+
271
+ EBKH's existing PostGIS schema is reused for spatial predicates. The adapter does *not* try to model the full EBKH schema in our claim graph; it surfaces what is asked for and lets EBKH remain the authoritative store for OSINT-collected material.
272
+
273
+ ### 5.6 Failure modes
274
+
275
+ - **Claim signature invalid** on receipt from another node → reject; emit `security.signature.invalid`.
+ - **Dispute on a non-existent claim** → `claim_not_found`.
+ - **Cyclic derivation** → reject the new claim; `evidence_cycle_detected`. (This can only happen via malicious crafting; honest derivation cannot cycle.)
+ - **EBKH unreachable** during sync → return a `SyncReport` with `partial=true` and the unreachable error; do not fail the calling operation.
279
+
280
+ ---
281
+
282
+ ## 6. Errors
283
+
284
+ | Code | When |
285
+ |-----------------------------------|-------------------------------------------------------------------|
286
+ | `experimental_disabled` | Capability called with the flag off |
287
+ | `claim_not_found` | Operation references a ClaimID we don't have |
288
+ | `claim_signature_invalid` | Signature doesn't verify against asserter's identity |
289
+ | `evidence_cycle_detected` | Proposed claim's derivation chain forms a cycle |
290
+ | `evidence_contradiction`          | (advisory) two claims with the same subject and predicate but conflicting objects |
291
+ | `ebkh_unavailable` | EBKH endpoint not configured or unreachable |
292
+ | `trust_below_threshold` | (advisory) attached to results; not an error condition by itself |
293
+
294
+ `evidence_contradiction` is *advisory*: it is returned in query results as a flag, not raised as an exception. The system never silently picks a winner.
295
+
296
+ ---
297
+
298
+ ## 7. Configuration
299
+
300
+ ```python
301
+ @dataclass(frozen=True)
+ class EvidenceConfig:
+     enabled: bool = False
+     auto_extract: bool = True                                    # let extractor propose drafts
+     extract_patterns: tuple[str, ...] = ("kb_ingest","tool_fetch","federation_manifest")
+     claim_ttl_days: int = EVIDENCE_CLAIM_TTL_DAYS_DEFAULT        # 365
+     trust_default: float = 0.5
+     dispute_min_trust: float = EVIDENCE_DISPUTE_MIN_TRUST        # 0.3
+     ebkh_endpoint: str | None = None
+     ebkh_token_scope: str = "evidence-sync"
+     ebkh_sync_interval_minutes: int = 60
+     max_provenance_depth: int = EVIDENCE_MAX_PROVENANCE_DEPTH    # 8
+     summary_max_predicates: int = 32
314
+ ```
315
+
316
+ Constants live in `hearthnet/constants.py`. `claim_ttl_days` does not delete claims; it only marks them as stale for query purposes. The underlying record is permanent.
317
+
318
+ ---
319
+
320
+ ## 8. Tests
321
+
322
+ ### 8.1 Unit
323
+
324
+ - `test_claim_id_canonicalisation`: re-ordering the source list or changing whitespace does not affect the ClaimID.
+ - `test_claim_signature_roundtrip`.
+ - `test_evidence_level_disputed_wins`: a claim with two sources *and* a dispute returns `disputed`.
+ - `test_provenance_trace_dedup`: a diamond derivation graph yields each ancestor once.
+ - `test_extractor_kb_ingest_pattern`: a KB ingest event produces a draft with the right predicate.
+ - `test_retraction_is_dispute`: a retraction shows up in `disputes_of`.
330
+
331
+ ### 8.2 Property
332
+
333
+ - For random claims, `evidence_level(c)` is monotonically non-decreasing as sources or attestations are added, and falls to `disputed` when a dispute is added.
+ - For random derivation DAGs, `trace(root)` yields exactly the reachable set.
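The trace property, together with the diamond case from the unit tests, can be illustrated with a breadth-first walk that visits each ancestor once. The graph layout (claim ID mapped to parent IDs) and the choice to exclude the root from the result are assumptions of this sketch; a depth limit such as `max_provenance_depth` is left out for brevity.

```python
from collections import deque


def trace(parents: dict[str, tuple[str, ...]], root: str) -> list[str]:
    """Return every ancestor of `root` exactly once, in breadth-first order."""
    seen = {root}
    order: list[str] = []
    queue = deque(parents.get(root, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already reported via another path (diamond)
        seen.add(node)
        order.append(node)
        queue.extend(parents.get(node, ()))
    return order
```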
335
+
336
+ ### 8.3 Integration
337
+
338
+ - KB ingest → claim drafted → operator asserts → query returns it.
339
+ - Dispute lifecycle: assert claim, attest it, dispute it, see `evidence_level=disputed`, retract the dispute (as a dispute-of-the-dispute), verify level returns to `attested`.
340
+ - EBKH adapter against a mock JSON-RPC endpoint: round-trip a spatial claim, verify the bbox query returns it.
341
+ - Federated extraction: a federation manifest event from M14 produces a claim about advertised capabilities, which is then visible to subject-query `summarise_subject(<remote_node_id>)`.
342
+
343
+ ### 8.4 Negative
344
+
345
+ - Cyclic derived_from input → `evidence_cycle_detected`.
+ - Claim signed by an unknown identity → `claim_signature_invalid`.
+ - EBKH endpoint configured but unreachable → `sync` returns partial; capability does not raise.
348
+
349
+ ---
350
+
351
+ ## 9. Cross-references
352
+
353
+ - **Phase 1 M01 Identity**: every claim is signed; signatures are verified against M01.
+ - **Phase 1 M06 Files**: `doc:<FileID>` sources resolve to content-addressed files.
+ - **Phase 1 M07 Knowledge Base**: KB ingest events feed the extractor.
+ - **Phase 1 X02 Event Log**: `evidence.claim.*`, `evidence.dispute.*`, `evidence.attestation.*` events.
+ - **Phase 2 M21 Tool Calls**: fetched URLs are extracted as claims for downstream provenance.
+ - **Phase 3 M31 Civil Defense**: alerts carry a top-level claim ID, allowing recipients to trace why an alert was issued.
+ - **Phase 3 X09 Conformance Suite**: provenance-trace correctness is part of the experimental suite.
+ - **External: EBKH v3+**: Christof's existing event-sourced knowledge hub; PostGIS-backed; this module is the integration point.
361
+
362
+ ---
363
+
364
+ ## 10. Open research questions
365
+
366
+ 1. **Claim semantics.** "subject/predicate/object" is intentionally loose. Whether to adopt RDF, JSON-LD, or a custom ontology with a controlled vocabulary is unsettled. v3.0 accepts free strings and ships a small recommended vocabulary in the docs.
367
+
368
+ 2. **Trust composition.** When a derived claim depends on three parent claims of varying trust, what's the derived trust? Min, product, weighted-average? Currently the trust scorer ignores derivation. A future version may compose explicitly.
369
+
370
+ 3. **Dispute escalation.** Today a dispute is a single counter-claim. In practice, communities will want threaded discussion attached to disputes. Whether this belongs here or in M10/M25 chat is a design call.
371
+
372
+ 4. **Time-bound claims.** "The Volksbank opens at 9:00" is true on weekdays but not Sundays. The schema has no first-class temporal modality. A pragmatic workaround is encoding temporal qualifiers into the predicate ("opens_at_weekday"), but a proper temporal logic layer would be cleaner. Out of scope.
373
+
374
+ 5. **Confidentiality.** Some claims are sensitive (a neighbour's medical condition, a Feuerwehr member's home address). The current model has no claim-level access control. The capability-bus tokens (M16) can scope access to the evidence service entirely, but not at a per-claim granularity. Open.
375
+
376
+ 6. **OSINT integration boundaries.** EBKH ingests from external feeds. When does an external feed become a sufficiently authoritative source to upgrade evidence level? The pragmatic stance in v3.0 is "operator decides via the trust allowlist". A future version may automate this.
377
+
378
+ 7. **Visualisation.** Provenance trees get big fast. A graph visualisation widget (probably d3 in plain HTML) would help operators. Specced but unbuilt.
379
+
380
+ 8. **Federated claim propagation.** Two communities federate, and one asserts a claim relevant to the other. Should it auto-mirror? Today, no: claims propagate only when explicitly synced via `evidence.ebkh.sync` or fetched on demand. A push model would be possible but would weaken consent.
381
+
382
+ ---
383
+
384
+ *Last updated: spec v3.0.*
docs/p2_p3/M31-civil-defense.md ADDED
@@ -0,0 +1,410 @@
1
+ # M31: Civil Defense (NRW Bevölkerungsschutz Pilot)
+
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M01 Identity](../../modules/M01-identity.md), [M11 Notifications](../../modules/M11-notifications.md), [M14 Federation](../../phase-2/modules/M14-federation.md), [M16 Tokens](../../phase-2/modules/M16-tokens.md), [M30 Evidence](./M30-evidence-ebkh.md), [M29 LoRa Beacons](./M29-lora-beacons.md), [M22 Mobile Native](../../phase-2/modules/M22-mobile-native.md)
5
+ **Depended on by:** nothing. This is a terminal module; civil defense is a downstream consumer of everything else.
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ A scoped pilot for **NRW Bevölkerungsschutz**: integrate HearthNet with the role structures that Germany's civil-defence ecosystem actually uses (THW, DRK, Feuerwehr, Katastrophenschutz) so that during an incident, role-certified members can publish authenticated alerts, coordinate locally, and produce a tamper-evident audit trail that survives legal review.
+
+ This module is deliberately **regional and regulated**. It does *not* try to be a global civil-defence platform. It encodes the role taxonomy, certificate semantics, and audit-retention rules that apply in Nordrhein-Westfalen, with hooks for other German Länder and EU regions to plug in later. The pilot lives in Issum and the Niederrhein because that's where Christof can actually walk into a Feuerwehrhaus and get this tested with humans who will use it under stress.
14
+
15
+ Where the rest of HearthNet aims for "soft consensus across neighbours", this module aims for "hard provenance, signed by an authority, retained per legal mandate". Different ergonomics. Different threat model.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **Replacing official alert systems.** NINA, KATWARN, Cell-Broadcast, and BOS radio remain the authoritative channels. M31 is *complementary*: it works when official channels are degraded, congested, or geographically miss the affected area, and it carries the *local context* that mass-broadcast systems can't.
22
+ - **Issuing legally binding evacuation orders.** Those come from the Krisenstab and are out of any AI-mediated system's authority.
23
+ - **Modelling every German Land.** v3.0 targets NRW; M31 has a region adapter so others can be added, but the module ships with NRW only.
24
+ - **Replacing TETRA-BOS.** Professional emergency-services radio is its own thing. We coexist; we don't interop.
25
+ - **Automatic identity verification of certificate holders.** A role certificate carries who issued it and who it was issued to. *Verifying* that a holder is who they claim is the issuer's responsibility, not ours. We check the signature chain; we don't re-do the background check.
26
+ - **Persistent geolocation of helpers.** We record where alerts target and where reported incidents are. We do not continuously track helpers' phones.
27
+
28
+ ---
29
+
30
+ ## 3. File layout
31
+
32
+ ```
33
+ hearthnet/civdef/
+ ├── __init__.py
+ ├── service.py       # CivilDefenseService: capability handler
+ ├── alert.py         # Alert, AlertEnvelope, AlertSeverity dataclasses
+ ├── role.py          # RoleCertificate, role schemas per region
+ ├── audit.py         # Tamper-evident audit chain + export
+ ├── regions/
+ │   ├── __init__.py
+ │   ├── nrw.py       # NRW role taxonomy & issuer trust roots
+ │   └── _stubs.py    # other Länder placeholders
+ ├── target.py        # Geographic / role / channel targeting
+ └── ack.py           # Acknowledgement collection
45
+ ```
46
+
47
+ ---
48
+
49
+ ## 4. Public API
50
+
51
+ ### 4.1 Dataclasses
52
+
53
+ ```python
54
+ from dataclasses import dataclass
+ from datetime import datetime
+ from typing import Literal, NewType
+
+ # NodeID, Bbox and ClaimID are defined in other HearthNet modules.
+
+ AlertID = NewType("AlertID", str)              # ULID
+ AlertSeverity = Literal["info","advisory","warning","emergency","extreme"]
+
+ @dataclass(frozen=True)
+ class RoleCertificate:
+     cert_id: str
+     holder: NodeID
+     role: str                           # canonical role, e.g. "DE.NRW.THW.OV.Leiter"
+     region: str                         # "DE.NRW.KreisKleve"
+     issuer: NodeID                      # issuing authority's HearthNet identity
+     issuer_chain: tuple[NodeID, ...]    # chain back to a trust root
+     issued_at: datetime
+     expires_at: datetime
+     scopes: frozenset[str]              # what this cert is allowed to do
+     signature: bytes
+     revocation_url: str | None
+
+ @dataclass(frozen=True)
+ class AlertTarget:
+     region: str                         # "DE.NRW.KreisKleve.Issum"
+     bbox: Bbox | None                   # optional precise geo target
+     roles: tuple[str, ...]              # which roles should see this; empty = public
+     channels: tuple[Literal["push","lora","federation","local"], ...]
+
+ @dataclass(frozen=True)
+ class Alert:
+     alert_id: AlertID
+     severity: AlertSeverity
+     title: str                          # ≤ 80 chars
+     body: str                           # ≤ 1000 chars
+     target: AlertTarget
+     instructions: tuple[str, ...]       # short imperative lines
+     published_at: datetime
+     expires_at: datetime
+     publisher: NodeID
+     publisher_role: str
+     publisher_cert: str                 # cert_id
+     evidence_claim: ClaimID | None      # link to M30 claim chain if relevant
+     correlation_id: str | None          # links to NINA/KATWARN ID if mirrored
+     signature: bytes                    # publisher signs the alert
+     issuer_attestation: bytes | None    # optional co-sign by a higher-tier issuer
+
+ @dataclass(frozen=True)
+ class AlertEnvelope:
+     alert: Alert
+     federation_hops: tuple[NodeID, ...] # forward path for audit
+     received_at: datetime
+     received_via: Literal["bus","federation","lora_signal","manual"]
+
+ @dataclass(frozen=True)
+ class Ack:
+     alert_id: AlertID
+     acker: NodeID
+     acked_at: datetime
+     status: Literal["received","acting","need_help","standing_down","mistaken"]
+     note: str                           # ≤ 280 chars
+     signature: bytes
+
+ @dataclass(frozen=True)
+ class AuditEntry:
+     seq: int                            # monotonic per audit chain
+     alert_id: AlertID
+     event: str                          # "published","forwarded","acked","mirrored","cancelled"
+     actor: NodeID
+     at: datetime
+     payload_sha: str
+     prev_sha: str                       # chain-link to previous audit entry
+     signature: bytes
122
+ ```
123
+
124
+ ### 4.2 Capabilities
125
+
126
+ All under `experimental.civdef.*`:
127
+
128
+ ```python
129
+ async def civdef_alert_publish(draft: AlertDraft) -> AlertID
+ async def civdef_alert_cancel(alert_id: AlertID, reason: str) -> CancelReceipt
+ async def civdef_alert_list(active_only: bool = True,
+                             severity_min: AlertSeverity = "info") -> list[Alert]
+ async def civdef_alert_get(alert_id: AlertID) -> AlertEnvelope
+ async def civdef_alert_subscribe(target_filter: AlertTarget | None = None) -> AsyncIterator[AlertEnvelope]
+ async def civdef_alert_ack(alert_id: AlertID, status: AckStatus, note: str = "") -> AckReceipt
+ async def civdef_alert_acks(alert_id: AlertID) -> list[Ack]
+ async def civdef_role_register(cert: RoleCertificate) -> RegisterReceipt
+ async def civdef_role_list() -> list[RoleCertificate]
+ async def civdef_role_revoke(cert_id: str, reason: str) -> RevokeReceipt   # issuer-only
+ async def civdef_audit_export(alert_id: AlertID | None = None,
+                               since: datetime | None = None,
+                               format: Literal["jsonl","pdf"] = "jsonl") -> bytes
143
+ ```
144
+
145
+ ### 4.3 Service class
146
+
147
+ ```python
148
+ class CivilDefenseService:
+     def __init__(self,
+                  bus: CapabilityBus,
+                  event_log: EventLog,
+                  identity: IdentityService,
+                  notifications: NotificationService,
+                  federation: FederationService,
+                  evidence: EvidenceService | None,
+                  region: RegionAdapter,
+                  audit_store: AuditChainStore,
+                  config: CivDefConfig): ...
+
+     async def publish_alert(self, draft: AlertDraft, publisher_cert: RoleCertificate) -> AlertID: ...
+     async def cancel_alert(self, alert_id: AlertID, reason: str, by_cert: RoleCertificate) -> None: ...
+     async def receive_alert(self, envelope: AlertEnvelope) -> None: ...
+     async def register_role(self, cert: RoleCertificate) -> None: ...
+     async def revoke_role(self, cert_id: str, by_cert: RoleCertificate, reason: str) -> None: ...
+     async def ack(self, alert_id: AlertID, status: AckStatus, note: str) -> AckReceipt: ...
+     async def export_audit(self, ...) -> bytes: ...
167
+ ```
168
+
169
+ ### 4.4 Region adapter
170
+
171
+ ```python
172
+ class RegionAdapter(Protocol):
+     region_code: str
+     trust_roots: tuple[NodeID, ...]                         # public keys of recognised issuers
+     role_schema: dict[str, RoleSpec]                        # role name → spec
+     audit_retention_years: int
+     mandatory_severity_minimums: dict[str, AlertSeverity]   # role → maximum severity that role may publish (a cap, despite the field name)
+
+     def validate_role(self, cert: RoleCertificate) -> None: ...
+     def validate_alert(self, draft: AlertDraft, publisher_cert: RoleCertificate) -> None: ...
181
+ ```
182
+
183
+ `regions/nrw.py` ships the NRW taxonomy with roles drawn from real-world structure: `DE.NRW.<Kreis>.<Gemeinde>.<Org>.<Role>`, e.g. `DE.NRW.Kleve.Issum.Feuerwehr.Wehrleiter`, `DE.NRW.Kleve.THW.OV.Leiter`, `DE.NRW.Kleve.DRK.Ortsverein.Bereitschaftsleiter`, `DE.NRW.Kleve.KatS.Stabsleiter`. As the THW, DRK, and KatS examples show, the `<Gemeinde>` segment is omitted for organisations addressed at Kreis level. Each role declares the maximum severity it may publish, the geographic scope it may target, and whether it may co-sign cross-org alerts.
184
+
185
+ ### 4.5 Audit chain store
186
+
187
+ ```python
188
+ class AuditChainStore:
+     """Append-only, signed, hash-chained audit log.
+
+     Retention is governed by config.audit_retention_years; default is 10 (NRW pragmatic baseline,
+     operator must confirm against current Landesarchivgesetz at deployment time).
+     """
+     async def append(self, entry: AuditEntry) -> None: ...
+     async def latest(self) -> AuditEntry | None: ...
+     async def get_range(self, start_seq: int, end_seq: int) -> list[AuditEntry]: ...
+     async def verify_chain(self, start: int = 0, end: int | None = None) -> VerifyReport: ...
+     async def export(self, ...) -> bytes: ...
199
+ ```
200
+
201
+ ---
202
+
203
+ ## 5. Behaviour
204
+
205
+ ### 5.1 Role certification
206
+
207
+ Role certificates form a chain to a regional trust root. NRW's trust roots are configured at deployment time and should match published issuer keys (Innenministerium NRW, the Kreis Kleve administration, etc.). As of v3.0 these authorities *do not* publish HearthNet-compatible keys, so the pilot uses a substitute issuance ceremony: the local Wehrleiter signs certificates after manual identity verification. A clear migration path to real institutional keys is documented.
208
+
209
+ A certificate may be:
210
+
211
+ - **Issued**: signed by an authority that itself chains to a trust root.
+ - **Active**: within its validity window and not revoked.
+ - **Revoked**: explicitly revoked by the issuer; the revocation is itself signed and appended to the audit chain.
+ - **Expired**: past `expires_at`.
215
+
216
+ Service operations that require a role check the certificate at every invocation. Revocations propagate via federation; a node receiving a revocation must, on next receipt of an alert signed by the revoked cert, refuse delivery and emit `civdef.alert.dropped.revoked`.
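The per-invocation check can be sketched as a pure function over the certificate's validity window and revocation state. The function name, the `not_yet_valid` status, and the precedence of revocation over expiry are assumptions made for illustration; signature and chain verification are left out.

```python
from datetime import datetime, timezone


def cert_status(issued_at: datetime, expires_at: datetime,
                revoked: bool, now: datetime) -> str:
    """Classify a certificate at call time; signature checks are out of scope here."""
    if revoked:
        return "revoked"          # assumption: revocation takes precedence over expiry
    if now >= expires_at:
        return "expired"
    if now < issued_at:
        return "not_yet_valid"    # hypothetical extra state, not named in the spec
    return "active"
```

Only `active` would let a role-gated operation proceed; every other result maps to a `civdef_cert_invalid` style rejection.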
217
+
218
+ ### 5.2 Alert publication
219
+
220
+ ```
221
+ publish_alert(draft, cert):
+   1. cert.holder must equal self.identity                        → else civdef_cert_not_owned
+   2. cert active, not revoked, not expired                       → else civdef_cert_invalid
+   3. region.validate_role(cert)                                  → else civdef_cert_unrecognised
+   4. region.validate_alert(draft, cert) (severity / scope match) → else civdef_cert_out_of_scope
+   5. Construct Alert with publisher_role from cert.role
+   6. Sign Alert with self.identity
+   7. (optional) collect issuer_attestation if config requires co-sign
+   8. Append to audit chain: event="published"
+   9. Emit civdef.alert.published event
+  10. Distribute:
+        - "local"      → notifications via M11 to local subscribers
+        - "push"       → mobile-native delivery via M22
+        - "federation" → M14 forwarding to federated nodes matching target.region
+        - "lora"       → if M29 enabled, set FLAG_PANIC on the next beacon as a presence-of-alert signal
+  11. Optionally mirror to evidence graph (M30) as a claim record
+  12. Return AlertID
238
+ ```
239
+
240
+ If the publisher loses connectivity mid-publish, the audit-chain `published` entry has already been appended locally, so the alert is recoverable on reconnect and re-distributes from there. Idempotent on AlertID.
241
+
242
+ ### 5.3 Targeting
243
+
244
+ `AlertTarget` is a set of orthogonal filters:
245
+
246
+ - **region**: hierarchical region code; matches by prefix (`DE.NRW.Kleve` matches `DE.NRW.Kleve.Issum`).
+ - **bbox**: optional geographic bounding box (overrides region for the precise area).
+ - **roles**: empty means public; non-empty restricts visibility to certificate holders of those roles.
+ - **channels**: which delivery mechanisms to use.
250
+
251
+ A receiving node filters on its own identity's location, registered roles, and active subscriptions. The filter is enforced **client-side at delivery** as well as **publisher-side at distribution**, so a node that mis-claims a role doesn't expose role-only content (the federation forwarder uses publisher-side filtering when forwarding `roles`-restricted alerts).
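The receive-side filter might look like the sketch below. It assumes the prefix match is done per dot-separated segment (so `DE.NRW.Kleve` does not match a hypothetical `DE.NRW.KleveX`), that a bbox is a `(min_lat, min_lon, max_lat, max_lon)` tuple, and that the bbox replaces the region test whenever the node's position is known. All three are illustrative choices, not taken from the spec.

```python
def region_matches(target: str, node_region: str) -> bool:
    """Segment-wise prefix match on hierarchical region codes."""
    target_parts = target.split(".")
    node_parts = node_region.split(".")
    return node_parts[:len(target_parts)] == target_parts


def in_bbox(bbox: tuple[float, float, float, float], lat: float, lon: float) -> bool:
    """bbox is (min_lat, min_lon, max_lat, max_lon); this layout is an assumption."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def should_deliver(target_region, target_bbox, target_roles,
                   node_region, node_pos, node_roles) -> bool:
    """Apply the geographic and role filters; channels are handled elsewhere."""
    if target_bbox is not None and node_pos is not None:
        geo_ok = in_bbox(target_bbox, *node_pos)
    else:
        geo_ok = region_matches(target_region, node_region)
    # An empty roles tuple means the alert is public.
    role_ok = not target_roles or bool(set(target_roles) & set(node_roles))
    return geo_ok and role_ok
```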
252
+
253
+ ### 5.4 Acknowledgements
254
+
255
+ When a role-targeted alert arrives, the recipient may ack with a status:
256
+
257
+ - `received`: read confirmation.
+ - `acting`: operationally taking action (e.g., Feuerwehr en route).
+ - `need_help`: recipient cannot act; help requested.
+ - `standing_down`: alert handled, recipient disengages.
+ - `mistaken`: the recipient believes this alert is in error; an attached `note` should explain.
+
+ Acks are signed, appended to the audit chain, and visible to the publisher via `civdef.alert.acks(alert_id)`. Public alerts (no `roles` filter) suppress acks unless `config.allow_public_ack=true`, which prevents ack floods on widely distributed alerts.
264
+
265
+ ### 5.5 Cancellation
266
+
267
+ Cancellation requires a certificate with cancel scope (typically the original publisher or a same-or-higher role in the same region). A cancellation:
268
+
269
+ 1. Records the cancellation in the audit chain.
270
+ 2. Emits `civdef.alert.cancelled` to all original delivery channels.
271
+ 3. Marks the alert inactive in `civdef_alert_list` queries (`active_only=true`).
272
+
273
+ The original alert is not deleted. Audit retention applies to the cancellation as well.
274
+
275
+ ### 5.6 Audit chain
276
+
277
+ The audit chain is an append-only, hash-chained, signed log specific to this module. Each entry's `prev_sha` is the SHA-256 of the previous entry's canonicalised body, creating a tamper-evident chain. `verify_chain` walks from genesis (or a checkpoint) verifying signatures and hashes; failure raises `civdef_audit_chain_broken` and is surfaced as a high-priority operator notification.
278
+
279
+ Audit entries cover: alert published, alert forwarded (with federation hop), alert acked, alert cancelled, role certificate registered, role certificate revoked, audit chain checkpointed. Export produces `jsonl` (machine-readable, default) or `pdf` (operator-readable for legal review, generated via the public `pdf` skill).
280
+
281
+ Retention is governed by `CIVDEF_AUDIT_RETENTION_YEARS` (default 10). Operators must validate this against the current NRW Landesarchivgesetz at deployment; the constant is a recommendation, not the law.
282
+
283
+ ### 5.7 Federation interaction
284
+
285
+ Alerts cross federation boundaries via M14. The federation manifest must declare `civdef` as an advertised capability; otherwise the alert is not forwarded into the neighbouring community. Forwarding nodes append themselves to `AlertEnvelope.federation_hops` for audit, but do not re-sign the alert (the publisher's signature is the source of truth). The receiving community independently audits the alert against its own role schemas; if the publisher's role is not recognised, the alert is delivered with a `civdef.alert.foreign_role` flag and is *not* surfaced as a high-severity push.
286
+
287
+ ### 5.8 LoRa interaction
288
+
289
+ LoRa beacons (M29) carry no alert content; they carry only presence. When the local node receives a `severity ∈ {emergency, extreme}` alert and LoRa is enabled, the node sets `FLAG_PANIC` on its next beacon and increases beacon cadence to the panic-burst configured in M29. This is a *signal* that something is happening, not a *content* channel. Receivers must consult bus or notifications for the actual alert content.
290
+
291
+ ### 5.9 Failure modes
292
+
293
+ - **Publisher's cert revoked after publish, before propagation completes**: federation forwarders that have received the revocation drop the in-flight alert; nodes that have not yet seen the revocation propagate normally. Eventually consistent; documented limitation.
294
+ - **Audit chain corruption** (disk failure, manual tampering): `verify_chain` detects; the module enters degraded mode where new publishes are blocked until an operator acknowledges and re-checkpoints. Reads continue.
295
+ - **Trust root key compromise**: out of scope for v3.0 to *recover* automatically; documented incident response: revoke all certs chaining to the compromised root, rotate root, reissue.
296
+ - **Mass-ack flood**: `allow_public_ack=false` default; per-alert ack rate-limit `CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE`.
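The per-alert, per-node ack limit could be enforced with a sliding 60-second window, as in this sketch. The class name and the use of plain float timestamps are assumptions; the default of 5 mirrors the `CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE` value given in the configuration section.

```python
from collections import defaultdict, deque


class AckRateLimiter:
    """Sliding-window limiter keyed by (alert_id, node_id)."""

    def __init__(self, max_per_minute: int = 5) -> None:
        self.max_per_minute = max_per_minute
        self._hits: dict[tuple[str, str], deque] = defaultdict(deque)

    def allow(self, alert_id: str, node_id: str, now: float) -> bool:
        """Return True and record the ack if the node is under its limit."""
        window = self._hits[(alert_id, node_id)]
        # Drop timestamps that have left the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.max_per_minute:
            return False  # caller would answer with civdef_ack_rate_limited
        window.append(now)
        return True
```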
297
+
298
+ ---
299
+
300
+ ## 6. Errors
301
+
302
+ | Code | When |
303
+ |-----------------------------------|-------------------------------------------------------------------|
304
+ | `experimental_disabled` | Capability called with the flag off |
305
+ | `civdef_cert_not_owned`           | Publish/ack with a cert whose holder ≠ caller's identity          |
306
+ | `civdef_cert_invalid` | Certificate expired, revoked, or signature broken |
307
+ | `civdef_cert_unrecognised` | Issuer chain doesn't terminate at a configured trust root |
308
+ | `civdef_cert_out_of_scope` | Cert's role/region doesn't authorise the requested action |
309
+ | `civdef_alert_not_found` | Operation references an unknown AlertID |
310
+ | `civdef_alert_target_invalid` | Target region/bbox malformed or outside the issuer's scope |
311
+ | `civdef_audit_chain_broken` | Hash or signature mismatch in the audit chain |
312
+ | `civdef_role_revoked` | Operation attempted with a revoked certificate |
313
+ | `civdef_region_unsupported` | No region adapter loaded for the requested region |
314
+ | `civdef_ack_rate_limited` | Ack rate exceeded for this alert from this node |
315
+
316
+ ---
317
+
318
+ ## 7. Configuration
319
+
320
+ ```python
321
+ @dataclass(frozen=True)
+ class CivDefConfig:
+     enabled: bool = False
+     region: str = "DE.NRW"
+     audit_retention_years: int = CIVDEF_AUDIT_RETENTION_YEARS               # 10
+     require_issuer_cosign: dict[AlertSeverity, bool] = field(default_factory=lambda: {
+         "info": False, "advisory": False, "warning": False,
+         "emergency": True, "extreme": True,
+     })
+     allow_public_ack: bool = False
+     ack_max_per_minute_per_node: int = CIVDEF_ACK_MAX_PER_MINUTE_PER_NODE   # 5
+     federation_forward: bool = True
+     lora_panic_signal: bool = True
+     severity_push_threshold: AlertSeverity = "warning"                      # below this, no mobile push
+     trust_roots_extra: tuple[NodeID, ...] = ()                              # operator-added roots
+     region_adapter_overrides: dict[str, str] = field(default_factory=dict)
337
+ ```
338
+
339
+ Constants centralised in `hearthnet/constants.py`.
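Two of these settings depend on comparing severities. The sketch below assumes the order in which `AlertSeverity` lists its values is the ascending severity order; the helper names are hypothetical.

```python
# Assumed ascending order, taken from the AlertSeverity literal.
SEVERITY_ORDER = ("info", "advisory", "warning", "emergency", "extreme")


def severity_at_least(severity: str, threshold: str) -> bool:
    """True if `severity` is at or above `threshold` in SEVERITY_ORDER."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold)


def needs_cosign(severity: str, require_issuer_cosign: dict[str, bool]) -> bool:
    """Look up whether this severity requires an issuer co-signature."""
    return require_issuer_cosign.get(severity, False)
```

With the defaults above, a `warning` alert triggers mobile push but needs no co-signature, while an `emergency` alert needs both.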
340
+
341
+ ---
342
+
343
+ ## 8. Tests
344
+
345
+ ### 8.1 Unit
346
+
347
+ - `test_role_cert_chain_to_root`: cert with valid chain → accepted; broken chain → rejected.
+ - `test_role_cert_expired`: past `expires_at` → `civdef_cert_invalid`.
+ - `test_alert_signature_roundtrip`.
+ - `test_target_region_prefix_match`: `DE.NRW.Kleve` matches `DE.NRW.Kleve.Issum`, not `DE.NRW.Wesel`.
+ - `test_audit_chain_link`: appending entries chains correctly; `verify_chain` returns ok.
+ - `test_audit_chain_tamper_detected`: flip a byte in the middle; `verify_chain` reports the break.
+ - `test_severity_cap_per_role`: Wehrleiter publishing `extreme` → `civdef_cert_out_of_scope` if schema caps at `emergency`.
+ - `test_revocation_propagates`: revoke cert; subsequent alerts from that cert are dropped.
355
+
356
+ ### 8.2 Integration
357
+
358
+ - Two-node alert flow: node A (Wehrleiter cert) publishes a `warning` alert targeting `DE.NRW.Kleve.Issum`; node B (resident in Issum, no cert) receives it as an M11 notification.
359
+ - Role-targeted alert: A publishes alert with `roles=("DE.NRW.Kleve.THW.OV.Leiter",)`; B (without cert) does not receive; C (with cert) does.
360
+ - Federation: A publishes in community X; X federates to Y; Y's resident D receives with `federation_hops=[X]`.
361
+ - Cancellation: A cancels; B's alert list moves it to inactive.
362
+ - Audit export: publish, ack, cancel; export `jsonl`; round-trip parses and `verify_chain` passes.
363
+
364
+ ### 8.3 Negative / adversarial
365
+
366
+ - Forged cert chain (random issuer key) → `civdef_cert_unrecognised`.
+ - Targeting `DE.BY` (outside NRW) from an NRW-only cert → `civdef_alert_target_invalid`.
+ - Ack flood beyond rate limit → `civdef_ack_rate_limited`.
+ - Tampered audit chain → publish blocked until operator re-checkpoint.
370
+
371
+ ### 8.4 Tabletop
372
+
373
+ - Manual scenarios with Issum Feuerwehr volunteers: simulated Hochwasser event, simulated grid outage, simulated industrial incident on the A57. Goals: latency from alert publication to first ack, false-positive ack rate, operator-perceived clarity of UI under stress.
374
+
375
+ ---
376
+
377
+ ## 9. Cross-references
378
+
379
+ - **Phase 1 M01 Identity**: every cert, alert, ack, and audit entry is signed against M01 identities.
+ - **Phase 1 M11 Notifications**: alerts surface via notifications with priority mapped from `severity`.
+ - **Phase 2 M14 Federation**: alerts cross community boundaries via federation.
+ - **Phase 2 M16 Tokens**: cert validation reuses M16's signature primitives; alert distribution endpoints require `civdef-receive` scoped tokens.
+ - **Phase 2 M22 Mobile Native**: mobile push for `severity ≥ severity_push_threshold`.
+ - **Phase 3 M29 LoRa Beacons**: `FLAG_PANIC` corroboration during emergencies.
+ - **Phase 3 M30 Evidence**: alerts may carry an `evidence_claim` ClaimID; recipients can call `evidence.provenance.trace` to see the reasoning chain.
+ - **Phase 3 X09 Conformance Suite**: civdef has a dedicated conformance section because of audit-chain integrity requirements.
387
+
388
+ ---
389
+
390
+ ## 10. Open research questions
391
+
392
+ 1. **Real institutional keys.** v3.0 uses substitute issuance because NRW authorities do not (yet) publish HearthNet-compatible keys. The migration path (getting the Innenministerium or Kreis Kleve to publish keys and sign initial role certs) is a political process, not a technical one. Documented; out of code scope.
393
+
394
+ 2. **NINA / KATWARN bridge.** A read-only mirror that pulls public NINA alerts and republishes them locally with a `correlation_id` is plausible and would be valuable. Whether it's M31's job or a separate bridge module is undecided.
395
+
396
+ 3. **Multi-Land schema.** The NRW role taxonomy is concrete; Bayern, Niedersachsen, Hessen each have variations (especially around KatS structures). A community-contributed `regions/` directory is the plan; v3.0 ships only NRW.
397
+
398
+ 4. **Co-signing UX.** When `require_issuer_cosign=true` for emergencies, the publisher must obtain a co-signature from a higher-tier issuer. Latency-sensitive. A pre-delegated "emergency co-sign authority" mechanism (similar to OCSP-stapling for certs) is the obvious extension. Not in v3.0.
399
+
400
+ 5. **Public-ack ergonomics.** Public alerts with `allow_public_ack=true` would let citizens self-report ("I am safe", "I need help"), but the failure modes (ack flood, false reports) are severe enough that v3.0 defaults this off. A future tier with rate limits and ack-content moderation is plausible.
401
+
402
+ 6. **Legal retention.** `CIVDEF_AUDIT_RETENTION_YEARS=10` is the operator-friendly default. Actual legal retention varies (NRW Landesarchivgesetz, federal data retention rules for civil-defence records, GDPR exceptions for vital interests). The deployment guide must explicitly walk operators through this; we cannot guess from code.
403
+
404
+ 7. **Cross-border alerts.** Issum borders the Netherlands. An alert about a Dutch industrial incident might originate from a Dutch system. Cross-border interop is interesting and outside v3.0 scope. The `region` adapter pattern doesn't preclude it.
405
+
406
+ 8. **Drills and false-alarm semantics.** A drill should look real enough to be useful and clearly different enough to not panic non-participants. A `drill=true` flag on Alert is the obvious addition; v3.0 omits it pending feedback from real drill rehearsals.
407
+
408
+ ---
409
+
410
+ *Last updated: spec v3.0.*
docs/p2_p3/M32-protocol-standard.md ADDED
@@ -0,0 +1,323 @@
1
+ # M32 - Protocol Standardisation & Conformance
2
+
3
+ **Spec version:** v3.0 (*experimental*)
4
+ **Depends on:** [M03 Capability Bus](../../modules/M03-capability-bus.md), [X02 Event Log](../../cross-cutting/X02-events.md), [M14 Federation](../../phase-2/modules/M14-federation.md), [X09 Conformance Suite](../cross-cutting/X09-conformance-suite.md)
5
+ **Depended on by:** anyone building an alternate implementation of HearthNet
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Turn the HearthNet specs from "Christof's working code with documentation" into something a **second team** could implement compatibly. This module is the bookkeeping for that: a versioned protocol document set, a conformance reporting capability, governance for how the spec changes, and a registry of known implementations and their conformance levels.
12
+
13
+ The premise is that long-term, a single implementation is fragile. If HearthNet only works as long as one person maintains it, it doesn't survive. Standardising the protocol (the wire formats, the capability contracts, the federation semantics) so other implementations can interoperate is the path to durability, and it sets up the social and legal scaffolding (versioning policy, change process, conformance claims) that other projects will need to take HearthNet seriously.
14
+
15
+ This is not "rewrite HearthNet as an RFC". It's "package what already exists in a form that can be cited, versioned, and conformance-tested". The reference implementation is HearthNet itself; the protocol document is the contract; the conformance suite (X09) is the proof.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **Submitting to IETF / W3C in v3.0.** That's a multi-year governance process and is out of scope. We make ourselves *ready* for it by structuring the documents and adopting RFC-style versioning conventions, but we don't file anything.
22
+ - **Patent licensing.** There are no patented techniques in HearthNet's core (capability bus, event log, federation, etc. all build on well-known primitives). We document this assumption but do not run a patent review.
23
+ - **Trademark of "HearthNet".** Out of scope. If the protocol survives, someone (probably ki-fusion-labs.de) eventually claims the name; v3.0 doesn't deal with that.
24
+ - **Conformance certification with a fee structure.** Conformance reports are self-published and free. No "HearthNet Inside™ certification programme".
25
+ - **Backward compatibility forever.** Protocol versions can have breaking changes between major versions, with documented migration. Within a minor version, compatibility is required.
26
+
27
+ ---
28
+
29
+ ## 3. File layout
30
+
31
+ The module is unusual in that most of its content lives *outside* the `hearthnet/` Python package β€” in a top-level `protocol/` directory at the repository root.
32
+
33
+ ```
34
+ protocol/                              # repo-root sibling of hearthnet/
35
+ ├── README.md
36
+ ├── VERSION                            # current protocol version (e.g. 3.0.0)
37
+ ├── CHANGELOG.md
38
+ ├── governance.md                      # change process, decision rights
39
+ ├── versioning.md                      # semver-with-twists rules
40
+ ├── reference-implementations.md       # registry
41
+ ├── core/
42
+ │   ├── 01-identity-and-addressing.md
43
+ │   ├── 02-transport.md
44
+ │   ├── 03-capability-bus.md
45
+ │   ├── 04-event-log.md
46
+ │   ├── 05-tokens.md
47
+ │   ├── 06-federation.md
48
+ │   └── ...
49
+ └── experimental/
50
+     ├── 30-evidence.md
51
+     └── 31-civil-defense.md
52
+
53
+ hearthnet/protocol/
54
+ ├── __init__.py
55
+ ├── service.py                         # ProtocolService: registry and report capability
56
+ ├── registry.py                        # In-memory + persisted registry of known impls
57
+ └── report.py                          # Conformance report dataclass + serialiser
58
+ ```
59
+
60
+ The `protocol/` directory is the **specification artefact** β€” versioned alongside code but conceptually independent. The Python `hearthnet/protocol/` module is the thin runtime surface that lets a HearthNet node expose its conformance information on the bus.
61
+
62
+ ---
63
+
64
+ ## 4. Public API
65
+
66
+ ### 4.1 Dataclasses
67
+
68
+ ```python
69
+ @dataclass(frozen=True)
70
+ class ProtocolVersion:
71
+     major: int
72
+     minor: int
73
+     patch: int
74
+     suffix: str = ""                 # "", "rc1", "experimental"
75
+
76
+     def __str__(self) -> str: ...   # "3.0.0" or "3.0.0-rc1"
77
+     def is_compatible_with(self, other: ProtocolVersion) -> bool: ...  # same major
78
+
79
+ @dataclass(frozen=True)
80
+ class ImplementationDescriptor:
81
+     name: str                        # e.g. "hearthnet-reference"
82
+     vendor: str                      # e.g. "ki-fusion-labs.de"
83
+     version: str                     # implementation version, e.g. "0.4.2"
84
+     protocol_versions: tuple[ProtocolVersion, ...]  # which protocol versions supported
85
+     homepage_url: str | None
86
+     contact: str | None
87
+
88
+ @dataclass(frozen=True)
89
+ class ConformanceReport:
90
+     implementation: ImplementationDescriptor
91
+     protocol_version: ProtocolVersion
92
+     suite_version: str               # X09 conformance suite version
93
+     ran_at: datetime
94
+     sections: dict[str, SectionResult]  # section name -> result
95
+     overall: Literal["pass","fail","partial","skipped"]
96
+     signature: bytes                 # implementation signs its own report
97
+
98
+ @dataclass(frozen=True)
99
+ class SectionResult:
100
+     name: str
101
+     total: int
102
+     passed: int
103
+     failed: int
104
+     skipped: int
105
+     failures: tuple[FailureDetail, ...]
106
+ ```
107
+
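A minimal sketch of how the `ProtocolVersion` stubs above could be filled in. The `parse` classmethod and the regular expression are assumptions made for illustration (the spec only implies parsing through `test_protocol_version_parse`); the compatibility rule is the "same major" rule stated in the stub comment.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: MAJOR.MINOR.PATCH with an optional "-suffix".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.]+))?$")


@dataclass(frozen=True)
class ProtocolVersion:
    major: int
    minor: int
    patch: int
    suffix: str = ""  # "", "rc1", "experimental"

    @classmethod
    def parse(cls, text: str) -> "ProtocolVersion":
        """Parse "3.0.0" or "3.0.0-rc1"; raise ValueError on anything else."""
        match = _VERSION_RE.match(text)
        if match is None:
            raise ValueError(f"not a protocol version: {text!r}")
        major, minor, patch, suffix = match.groups()
        return cls(int(major), int(minor), int(patch), suffix or "")

    def __str__(self) -> str:
        base = f"{self.major}.{self.minor}.{self.patch}"
        return f"{base}-{self.suffix}" if self.suffix else base

    def is_compatible_with(self, other: "ProtocolVersion") -> bool:
        # Rule from the spec comment: compatibility means "same major".
        return self.major == other.major
```

Under this sketch the suffix plays no part in compatibility; whether an `-experimental` build should be treated differently is left open here.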
108
+ ### 4.2 Capabilities
109
+
110
+ ```python
111
+ async def protocol_version_list() -> list[ProtocolVersion]
112
+ async def protocol_self_describe() -> ImplementationDescriptor
113
+ async def protocol_conformance_report(suite_version: str | None = None) -> ConformanceReport
114
+ async def protocol_registry_list() -> list[ImplementationDescriptor]
115
+ async def protocol_registry_announce(descriptor: ImplementationDescriptor) -> AnnounceReceipt
116
+ ```
117
+
118
+ These are stable (non-experimental) capabilities: the protocol must include its own self-description and conformance-reporting capabilities, otherwise interop is impossible.
119
+
120
+ ### 4.3 Service class
121
+
122
+ ```python
123
+ class ProtocolService:
124
+     def __init__(self,
125
+                  bus: CapabilityBus,
126
+                  event_log: EventLog,
127
+                  federation: FederationService,
128
+                  registry: ImplementationRegistry,
129
+                  conformance_runner: ConformanceRunner | None,
130
+                  config: ProtocolConfig): ...
131
+
132
+     def supported_versions(self) -> list[ProtocolVersion]: ...
133
+     def self_descriptor(self) -> ImplementationDescriptor: ...
134
+     async def run_conformance(self, suite_version: str | None = None) -> ConformanceReport: ...
135
+     async def announce(self, descriptor: ImplementationDescriptor) -> None: ...  # to local registry + federation
136
+     async def registry_list(self) -> list[ImplementationDescriptor]: ...
137
+ ```
138
+
139
+ ### 4.4 Implementation registry
140
+
141
+ ```python
142
+ class ImplementationRegistry:
143
+     """Local registry of known implementations.
144
+
145
+     Populated by:
146
+       - self (this node, on startup)
147
+       - federation peers' announcements
148
+       - operator-curated additions
149
+     """
150
+     async def upsert(self, descriptor: ImplementationDescriptor, source: NodeID) -> None: ...
151
+     async def list(self) -> list[ImplementationDescriptor]: ...
152
+     async def known_by_name(self, name: str) -> list[ImplementationDescriptor]: ...
153
+ ```
154
+
155
+ ---
156
+
157
+ ## 5. Behaviour
158
+
159
+ ### 5.1 Versioning policy
160
+
161
+ Protocol versions follow **semver with explicit stability tiers**:
162
+
163
+ - **Major** (`X.0.0`): breaking changes to wire formats or capability contracts. New major version requires explicit migration documentation. Old majors remain readable for migration purposes for at least 2 years.
164
+ - **Minor** (`X.Y.0`): additive changes only (new capabilities, new event types, new optional fields). Backward compatibility within the major is required: a `3.0` impl talking to a `3.2` impl must work, with the `3.0` side ignoring new fields/events.
165
+ - **Patch** (`X.Y.Z`): clarification, typo fixes, no functional change.
166
+ - **Suffix** `-experimental`: capabilities in the `experimental.*` namespace; can change without bumping major.
167
+
168
+ Each protocol document carries a frontmatter field `protocol-version: 3.X.Y` naming the earliest protocol version that contains it. The `protocol/VERSION` file holds the latest released version, and `CHANGELOG.md` lists every change between versions.
169
+
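The minor-version rule above (an older peer ignores fields it does not know) can be sketched as a tolerant decoder. `PeerHello` and `decode_tolerant` are hypothetical names used only for illustration; they are not messages or helpers defined by the spec.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class PeerHello:
    """Hypothetical 3.0-era message shape, for illustration only."""
    node_id: str
    protocol_version: str


def decode_tolerant(cls, raw: dict[str, Any]):
    """Build `cls` from `raw`, dropping keys this version does not know."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})


# A 3.2 peer sends an extra optional field; the 3.0 reader ignores it.
hello = decode_tolerant(
    PeerHello,
    {"node_id": "ed25519:abc", "protocol_version": "3.2.0", "relay_hint": "r1"},
)
```

A missing required field still raises `TypeError` here, which matches the intent that only additions are tolerated within a major version.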
170
+ ### 5.2 Stability tiers
171
+
172
+ Capabilities are tagged with one of:
173
+
174
+ - `stable`: frozen at the major version; any change is a breaking change.
175
+ - `provisional`: expected to become stable; minor-version breaking changes allowed with deprecation period.
176
+ - `experimental`: `experimental.*` namespace; may change or vanish.
177
+
178
+ The `protocol_self_describe` capability reports which capabilities the implementation supports and at which tier. A capability marked `stable` in the protocol but implemented at `experimental` tier in the node is a configuration error and is logged at startup.
179
+
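The startup check described above might look like the following sketch. The function name, the logger name, and the two plain dictionaries (capability name to tier) are assumptions; the spec does not say how tiers are represented at runtime.

```python
import logging

logger = logging.getLogger("hearthnet.protocol")  # hypothetical logger name


def find_tier_mismatches(spec_tiers: dict[str, str],
                         impl_tiers: dict[str, str]) -> list[str]:
    """Return capabilities the protocol marks `stable` but the node implements as `experimental`."""
    mismatches = []
    for capability, spec_tier in spec_tiers.items():
        impl_tier = impl_tiers.get(capability)
        if spec_tier == "stable" and impl_tier == "experimental":
            # The spec treats this as a configuration error logged at startup.
            logger.error("capability %s is stable in the protocol but experimental in this node",
                         capability)
            mismatches.append(capability)
    return mismatches
```

Capabilities the node does not implement at all are skipped here; whether that should also be reported is a separate question.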
180
+ ### 5.3 Conformance reporting
181
+
182
+ A `ConformanceReport` is the artefact produced by running the X09 conformance suite against a running node. The report is:
183
+
184
+ - **Self-signed** by the implementation: there is no central authority that "certifies" reports.
185
+ - **Reproducible**: the suite version is in the report; running the same suite against the same impl should produce equivalent results modulo timestamps.
186
+ - **Public**: implementations are encouraged to publish their reports openly (in their repo, on their website, federated via the registry).
187
+ - **Honest about partial conformance**: `partial` is a valid outcome and is more useful than a misleading `pass`.
188
+
189
+ Reports do not expire, but a report from suite version `1.0.0` is not equivalent to a report from `2.0.0`. The X09 suite versions independently of the protocol.
190
+
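The spec does not state how per-section counts fold into the report's `overall` value. The sketch below shows one possible rule, purely as an assumption: nothing ran means `skipped`, a clean run means `pass`, no passes at all means `fail`, and anything in between is `partial`. `SectionResult` is trimmed to the counters needed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionResult:
    name: str
    total: int
    passed: int
    failed: int
    skipped: int


def overall_outcome(sections: dict[str, SectionResult]) -> str:
    """Fold per-section counts into pass / fail / partial / skipped (assumed rule)."""
    passed = sum(s.passed for s in sections.values())
    failed = sum(s.failed for s in sections.values())
    skipped = sum(s.skipped for s in sections.values())
    if passed + failed == 0:
        return "skipped"   # nothing ran, e.g. the suite is not installed
    if failed == 0 and skipped == 0:
        return "pass"
    if passed == 0:
        return "fail"
    return "partial"       # some tests passed, but something failed or was skipped
```

Treating a run with skips but no failures as `partial` follows the "honest about partial conformance" bullet; a real suite may choose differently.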
191
+ ### 5.4 Implementation registry & federation
192
+
193
+ Each HearthNet node announces its `ImplementationDescriptor` to its federation peers on connect. Peers add it to their local registry. Operators can query their local registry for "who else is out there" via `protocol_registry_list`.
194
+
195
+ The registry is *advisory*. There is no trust beyond "this peer claimed this descriptor at this time". A peer claiming to be an implementation it isn't is a security incident, but the registry doesn't authenticate vendor names β€” only that the node signed its descriptor.
196
+
197
+ ### 5.5 Governance (documented, not enforced)
198
+
199
+ The `protocol/governance.md` document describes how protocol changes happen:
200
+
201
+ 1. **Proposal**: any contributor writes a "change note" as a PR to `protocol/`. Includes motivation, exact spec diff, migration story, and conformance impact.
202
+ 2. **Discussion**: open period (default 4 weeks) for review.
203
+ 3. **Decision**: maintainers (initially just Christof, ideally expanding to a small group) accept, reject, or request revision. Rejections are logged with rationale.
204
+ 4. **Merge**: accepted change merges to `protocol/main` with version bump per §5.1 rules.
205
+ 5. **Release**: tagged release of the protocol document set independent of code releases.
206
+
207
+ This is a process document; the module does not *enforce* governance technically. Enforcement is social.
208
+
209
+ ### 5.6 Reference implementations registry
210
+
211
+ `protocol/reference-implementations.md` is a living document listing known implementations. v3.0 entries:
212
+
213
+ - **hearthnet-reference** (Python, this codebase). Status: complete, all stable + provisional + experimental capabilities.
214
+ - (placeholder for a second impl when one exists).
215
+
216
+ Adding an implementation to this document requires a PR demonstrating at minimum a passing conformance report for `core` sections of the X09 suite at the current major version.
217
+
218
+ ### 5.7 Migration documentation
219
+
220
+ Each major version transition ships a `protocol/migration/X-to-Y.md` document. v3.0 includes:
221
+
222
+ - `protocol/migration/2-to-3.md`: placeholder for now; v3.0 introduces only experimental capabilities, so a strict v2→v3 migration is effectively a no-op for stable code paths.
223
+
224
+ ### 5.8 Failure modes
225
+
226
+ - **Mismatched protocol versions in federation**: handled by M14 federation manifest version negotiation. M32 itself doesn't intervene at runtime; it just reports.
227
+ - **Conformance suite not present**: `protocol.conformance.report` returns `skipped` overall with reason `suite_not_installed`.
228
+ - **Conflicting registry entries** (same `name` claimed by two distinct vendors): both stored; the registry list returns both; operators decide.
229
+
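The "conflicting registry entries" behaviour above can be sketched with a registry keyed by `(name, vendor)`. That key choice is an assumption, as is the synchronous, in-memory shape; the spec's `ImplementationRegistry` is async, persisted, and also records the announcing `source`. `Descriptor` is a trimmed stand-in for `ImplementationDescriptor`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Trimmed stand-in for ImplementationDescriptor."""
    name: str
    vendor: str
    version: str


class InMemoryRegistry:
    def __init__(self) -> None:
        # Assumed key: (name, vendor), so two vendors can claim the same name.
        self._entries: dict[tuple[str, str], Descriptor] = {}

    def upsert(self, descriptor: Descriptor) -> None:
        # Same key twice: last write wins, which keeps upsert idempotent.
        self._entries[(descriptor.name, descriptor.vendor)] = descriptor

    def known_by_name(self, name: str) -> list[Descriptor]:
        return [d for d in self._entries.values() if d.name == name]
```

With this shape, both claimants of a name are returned and the operator decides which to trust, as the failure-mode bullet describes.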
230
+ ---
231
+
232
+ ## 6. Errors
233
+
234
+ | Code | When |
235
+ |-------------------------------|-------------------------------------------------------------------|
236
+ | `protocol_version_unknown` | Operation references a protocol version not in our table |
237
+ | `protocol_suite_not_installed`| Conformance report requested but X09 not available |
238
+ | `protocol_descriptor_invalid` | Announcement with malformed descriptor |
239
+ | `protocol_unsupported_capability` | Federation negotiation finds no compatible major version |
240
+
241
+ ---
242
+
243
+ ## 7. Configuration
244
+
245
+ ```python
246
+ @dataclass(frozen=True)
247
+ class ProtocolConfig:
248
+     enabled: bool = True                  # this one is enabled by default
249
+     supported_versions: tuple[str, ...] = ("3.0.0",)
250
+     default_announce_version: str = "3.0.0"
251
+     descriptor: ImplementationDescriptor = field(default_factory=lambda: ImplementationDescriptor(
252
+         name="hearthnet-reference",
253
+         vendor="ki-fusion-labs.de",
254
+         version="0.4.2",
255
+         protocol_versions=(ProtocolVersion(3, 0, 0),),
256
+         homepage_url="https://ki-fusion-labs.de/hearthnet",
257
+         contact=None,
258
+     ))
259
+     announce_to_federation: bool = True
260
+     conformance_auto_run_on_startup: bool = False
261
+     registry_max_entries: int = 4096
262
+ ```
263
+
264
+ ---
265
+
266
+ ## 8. Tests
267
+
268
+ ### 8.1 Unit
269
+
270
+ - `test_protocol_version_compat`: same major compatible; different major not.
271
+ - `test_descriptor_signature_roundtrip`.
272
+ - `test_registry_upsert_idempotent`.
273
+ - `test_conformance_report_signed`: self-signed report's signature verifies against the implementation's identity.
274
+ - `test_protocol_version_parse`: `"3.0.0"`, `"3.0.0-rc1"`, `"3.0.0-experimental"` parse correctly.
275
+
276
+ ### 8.2 Integration
277
+
278
+ - Two nodes (same impl, same version): announce → registry shows both.
279
+ - Two nodes (same impl, different versions): registry shows both; federation negotiates highest compatible.
280
+ - Run conformance suite against the reference impl: must pass `core/*` sections by definition (the suite is built to match the spec).
281
+
282
+ ### 8.3 Spec-document tests
283
+
284
+ - `test_protocol_documents_present`: every protocol document referenced in `protocol/README.md` exists.
285
+ - `test_protocol_version_consistent`: `protocol/VERSION` matches the `default_announce_version` in code.
286
+ - `test_changelog_format`: `protocol/CHANGELOG.md` parses as a sequence of versioned entries with semver-valid ordering.
287
+
288
+ ### 8.4 Negative
289
+
290
+ - Malformed descriptor → `protocol_descriptor_invalid`.
291
+ - Federation peer announces protocol `99.0.0` → registry stores it, federation negotiation declines.
292
+
293
+ ---
294
+
295
+ ## 9. Cross-references
296
+
297
+ - **All Phase 1, 2, 3 modules**: they collectively *are* the protocol. M32's job is the meta-layer.
298
+ - **Phase 2 M14 Federation**: federation manifest carries protocol version; negotiation uses M32's compat rules.
299
+ - **Phase 3 X09 Conformance Suite**: produces the data that M32's report capability surfaces.
300
+
301
+ ---
302
+
303
+ ## 10. Open research questions
304
+
305
+ 1. **Independent implementation.** The protocol is real only when a second implementation exists. v3.0 ships only the reference. A small "minimal HearthNet" written in Go or Rust as a contrast implementation would prove the spec is implementable from the documents alone. Concrete next step, but not in v3.0.
306
+
307
+ 2. **Formal verification of wire formats.** TLS-style formal proofs of the federation handshake and capability-bus dispatch would be valuable. Out of v3.0 scope; documented as a research direction.
308
+
309
+ 3. **Governance bootstrapping.** "Christof decides" is fine for now and honest about the project's state. Transitioning to a multi-maintainer model needs a path: a Tech Steering Committee, a foundation, or simply a documented succession plan. Currently undefined.
310
+
311
+ 4. **Standards-body engagement.** If the protocol matures, IETF (for federation/transport) and W3C (for capability-bus semantics if they look RPC-like) are plausible homes. v3.0 deliberately avoids premature standards engagement; the bar is "second implementation exists and is interoperable".
312
+
313
+ 5. **Legal entity.** ki-fusion-labs.de is currently the operating entity. Whether a separate legal entity (e.V., foundation) is needed for a multi-vendor protocol is a real question. Out of code scope.
314
+
315
+ 6. **Trademark and naming.** "HearthNet" as a trademark is undefined; the protocol could be renamed to something more obviously generic at standardisation time. The reference impl can keep the name.
316
+
317
+ 7. **Optionality flags vs separate profiles.** A node might support `core` only, or `core + federation`, or everything. Whether to model this as per-capability optionality (current approach) or named profiles (`HearthNet-Core`, `HearthNet-Federated`, `HearthNet-Civdef`) is a design question that needs feedback from a second impl team.
318
+
319
+ 8. **Conformance suite drift.** The X09 suite is the source of truth for "what does conformance mean"; the protocol documents describe the *intent*. When the two disagree, currently the suite wins (because it's executable). This is pragmatic but not principled. A future version may flip this and use the suite to *test* the documents, with the documents as primary.
320
+
321
+ ---
322
+
323
+ *Last updated: spec v3.0.*
docs/p2_p3/X05-dht.md ADDED
@@ -0,0 +1,374 @@
1
+ # X05 - Distributed Hash Table (DHT)
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** M01 (identity), X01 (transport), X04 (config), X03 (observability)
5
+ **Depended on by:** M14 (federation discovery), M07 ext (background blob replication via content routing), M02 ext (cross-LAN peer discovery), M15 (relay bootstrap)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Provide a Kademlia-style DHT over the internet that lets:
12
+
13
+ - A node find peers of its own community across LANs (cross-LAN extension of M02)
14
+ - A node find sources of a specific CID across communities (extension of M07's local source index)
15
+ - A node bootstrap into a federation (find an anchor of community X without knowing its IP)
16
+
17
+ Out of scope:
18
+ - Permanent storage in the DHT: DHT entries are TTL'd advertisements only
19
+ - Anonymity (no onion routing; there's no anonymity goal)
20
+ - Sybil resistance: communities are the trust roots; the DHT is an unreliable hint layer
21
+
22
+ ---
23
+
24
+ ## 2. File layout
25
+
26
+ ```
27
+ hearthnet/dht/
28
+ ├── __init__.py
29
+ ├── kademlia.py             # KademliaNode, routing table
30
+ ├── routing.py              # FindNode, FindValue, Store, Ping RPCs
31
+ ├── storage.py              # local DHT k/v store with TTL
32
+ └── bootstrap.py            # bootstrap peer list, NAT-aware reachability
33
+ ```
34
+
35
+ ---
36
+
37
+ ## 3. Concepts
38
+
39
+ ### 3.1 Key space
40
+
41
+ XOR-distance over a 256-bit key space. Keys are derived as:
42
+
43
+ - For peers: `key = blake3(node_id_full)[:32]`
44
+ - For CIDs: `key = blake3(cid_string)[:32]`
45
+ - For communities: `key = blake3(community_id_full)[:32]`
46
+
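A minimal sketch of the key derivation and XOR metric described above. One loud assumption: the spec calls for `blake3`, which is not in the Python standard library, so this example substitutes `hashlib.blake2b` with a 32-byte digest to stay self-contained. Keys produced this way are not interoperable with real blake3 keys. The function names are hypothetical.

```python
import hashlib


def dht_key(identifier: str) -> bytes:
    """Derive a 32-byte DHT key.

    Stand-in hash: blake2b (stdlib) instead of the spec's blake3.
    """
    return hashlib.blake2b(identifier.encode("utf-8"), digest_size=32).digest()


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance: the two keys XORed, read as a big-endian integer."""
    if len(a) != len(b):
        raise ValueError("keys must have equal length")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest(target: bytes, candidates: list[bytes], k: int = 8) -> list[bytes]:
    """Return the k candidates nearest to target (default k mirrors DHT_REPLICATION_K)."""
    return sorted(candidates, key=lambda c: xor_distance(target, c))[:k]
```

The metric is symmetric and zero only for identical keys, which is what lets every node agree on which peers are "closest" to a given key.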
47
+ ### 3.2 Bucket structure
48
+
49
+ Standard Kademlia: 256 buckets of size `DHT_REPLICATION_K = 8` (from Phase 2 constants). Concurrent lookups: `DHT_ALPHA = 3`.
50
+
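The bucket structure can be sketched as follows. The bucket index is the position of the highest bit in which two keys differ (0 to 255 for 256-bit keys). The `KBucket` class, its "least recently seen first" ordering, and the reject-when-full behaviour follow common Kademlia practice and are assumptions here, not something the spec spells out.

```python
from collections import OrderedDict
from typing import Optional

K = 8  # DHT_REPLICATION_K


def bucket_index(own_key: bytes, other_key: bytes) -> int:
    """Index of the bucket for other_key: position of the highest differing bit."""
    distance = int.from_bytes(own_key, "big") ^ int.from_bytes(other_key, "big")
    if distance == 0:
        raise ValueError("a node does not store itself in a bucket")
    return distance.bit_length() - 1


class KBucket:
    """At most `capacity` contacts, least recently seen first."""

    def __init__(self, capacity: int = K) -> None:
        self._contacts: OrderedDict = OrderedDict()  # node_key -> endpoint
        self._capacity = capacity

    def touch(self, node_key: bytes, endpoint: str) -> bool:
        """Record activity. Returns False when the bucket is full and the contact is new."""
        if node_key in self._contacts:
            self._contacts.move_to_end(node_key)
            self._contacts[node_key] = endpoint
            return True
        if len(self._contacts) >= self._capacity:
            return False  # caller may ping the oldest contact and evict it if dead
        self._contacts[node_key] = endpoint
        return True

    def oldest(self) -> Optional[bytes]:
        return next(iter(self._contacts), None)
```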
51
+ ### 3.3 Values stored
52
+
53
+ The DHT does **not** store community state. It stores small, signed advertisements:
54
+
55
+ | Value type | Key | Value (signed) | TTL |
56
+ |------------|-----|----------------|-----|
57
+ | Peer presence | `blake3(node_id)` | `{endpoints, community_id, expires_at}` | matches manifest TTL (30s) |
58
+ | CID source | `blake3(cid)` | `{node_id, last_seen}` | 1 hour |
59
+ | Community bootstrap | `blake3(community_id)` | `{anchor_node_ids, endpoints, manifest_url}` | 24 hours |
60
+
61
+ The DHT is a **hint cache**. Authoritative state lives in community event logs.
62
+
63
+ ---
64
+
65
+ ## 4. Public API
66
+
67
+ ### 4.1 `kademlia.py`
68
+
69
+ ```python
70
+ # hearthnet/dht/kademlia.py
71
+ from dataclasses import dataclass
72
+
73
+ @dataclass(frozen=True)
74
+ class DhtContact:
75
+     node_key: bytes       # 32 bytes
76
+     node_id_full: str
77
+     endpoint: Endpoint
78
+     last_seen: float
79
+
80
+ @dataclass(frozen=True)
81
+ class DhtValue:
82
+     """A stored advertisement. The payload is a signed dict."""
83
+     key: bytes
84
+     payload: dict         # has 'signature' field
85
+     expires_at: int       # unix seconds
86
+
87
+ class KademliaNode:
88
+     """One node's view of the DHT.
89
+     Provides high-level find_node / find_value / store APIs."""
90
+
91
+     def __init__(
92
+         self,
93
+         kp: KeyPair,
94
+         endpoint: Endpoint,
95
+         transport_client: HttpClient,
96
+         bootstrap_endpoints: list[Endpoint],
97
+     ):
98
+         ...
99
+
100
+     async def start(self) -> None:
101
+         """Bootstrap: ping bootstrap_endpoints, populate routing table."""
102
+
103
+     async def stop(self) -> None: ...
104
+
105
+     # --- public lookups ---
106
+
107
+     async def find_node(self, target_key: bytes) -> list[DhtContact]:
108
+         """Return the k closest contacts to target_key."""
109
+
110
+     async def find_value(self, key: bytes) -> list[DhtValue]:
111
+         """Return values stored at this key (or empty)."""
112
+
113
+     async def store(self, value: DhtValue) -> int:
114
+         """Replicate to k closest nodes. Returns count of successful stores."""
115
+
116
+     # --- maintenance ---
117
+
118
+     async def refresh_buckets(self) -> None:
119
+         """Per DHT_REFRESH_SECONDS: ping a random key in each bucket to liveness-check it."""
120
+
121
+     async def republish_values(self) -> None:
122
+         """Per DHT_REPUBLISH_SECONDS: re-store our own advertisements so TTL doesn't expire them."""
123
+
124
+     # --- introspection ---
125
+
126
+     def routing_table_size(self) -> int: ...
127
+     def stored_values(self) -> int: ...
128
+ ```
129
+
130
+ ### 4.2 `routing.py`
131
+
132
+ Wire RPCs exposed by the bus transport (X01) as additional endpoints under `/dht/v1/`:
133
+
134
+ | Endpoint | Method | Purpose |
135
+ |----------|--------|---------|
136
+ | `/dht/v1/ping` | POST | Liveness, exchange contact info |
137
+ | `/dht/v1/find_node` | POST | Return k closest contacts to a key |
138
+ | `/dht/v1/find_value` | POST | Return values at a key OR closest contacts if absent |
139
+ | `/dht/v1/store` | POST | Accept a value into local storage (if we're among the k closest) |
140
+
141
+ ```python
142
+ # hearthnet/dht/routing.py
143
+ async def serve_ping(req: dict) -> dict: ...
144
+ async def serve_find_node(req: dict, kademlia: KademliaNode) -> dict: ...
145
+ async def serve_find_value(req: dict, kademlia: KademliaNode) -> dict: ...
146
+ async def serve_store(req: dict, kademlia: KademliaNode) -> dict: ...
147
+
148
+ # Request / response shapes documented inline in each function.
149
+ ```
150
+
151
+ ### 4.3 `storage.py`
152
+
153
+ ```python
154
+ # hearthnet/dht/storage.py
155
+ class DhtStore:
156
+     """Local key-value store with TTL eviction.
157
+     Backing: SQLite in <DATA>/dht/store.sqlite."""
158
+
159
+     def __init__(self, db_path: Path):
160
+         ...
161
+
162
+     def put(self, value: DhtValue) -> bool:
163
+         """Idempotent. Returns True if stored (we're in k closest), False if rejected."""
164
+
165
+     def get(self, key: bytes) -> list[DhtValue]:
166
+         """Return non-expired values for this key."""
167
+
168
+     def evict_expired(self) -> int: ...
169
+
170
+     def size(self) -> int: ...
171
+ ```
172
+
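A rough sketch of the TTL behaviour behind `DhtStore`, using an in-memory SQLite database so it runs standalone. The table name, column layout, and the explicit `now` parameter (added to make expiry testable) are assumptions. The "are we among the k closest" check from `put` is left out, and payloads are plain strings rather than signed dicts.

```python
import sqlite3
import time
from typing import Optional


class TtlStore:
    """TTL'd key/value table; hypothetical schema, in-memory by default."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS dht_values ("
            " key BLOB NOT NULL, payload TEXT NOT NULL, expires_at INTEGER NOT NULL,"
            " PRIMARY KEY (key, payload))"
        )

    def put(self, key: bytes, payload: str, expires_at: int) -> None:
        # Re-storing the same (key, payload) only replaces the expiry, so put is idempotent.
        self._db.execute(
            "INSERT OR REPLACE INTO dht_values (key, payload, expires_at) VALUES (?, ?, ?)",
            (key, payload, expires_at),
        )
        self._db.commit()

    def get(self, key: bytes, now: Optional[int] = None) -> list:
        """Return non-expired payloads for this key."""
        now = int(time.time()) if now is None else now
        rows = self._db.execute(
            "SELECT payload FROM dht_values WHERE key = ? AND expires_at > ? ORDER BY payload",
            (key, now),
        ).fetchall()
        return [row[0] for row in rows]

    def evict_expired(self, now: Optional[int] = None) -> int:
        """Delete expired rows and return how many were removed."""
        now = int(time.time()) if now is None else now
        cursor = self._db.execute("DELETE FROM dht_values WHERE expires_at <= ?", (now,))
        self._db.commit()
        return cursor.rowcount
```

Filtering on `expires_at` in `get` means readers never see stale values even if `evict_expired` has not run yet.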
173
+ ### 4.4 `bootstrap.py`
174
+
175
+ ```python
176
+ # hearthnet/dht/bootstrap.py
177
+
178
+ DEFAULT_BOOTSTRAP_NODES: list[Endpoint] = [
179
+     # Filled at packaging time with community-run bootstrap endpoints.
180
+     # Christof's relay.hearthnet.de will be a default.
181
+ ]
182
+
183
+ async def is_reachable(endpoint: Endpoint, timeout_seconds: float = 5) -> bool:
184
+     """Send a ping; return True if responded."""
185
+
186
+ async def discover_external_ip() -> str | None:
187
+     """Use STUN against a public STUN server to learn our external IP.
188
+     Used by relay-assisted bootstrap to advertise reachable endpoints."""
189
+ ```
190
+
191
+ ---
192
+
193
+ ## 5. Behaviour
194
+
195
+ ### 5.1 Advertisement lifecycle
196
+
197
+ ```
198
+ Node starts → KademliaNode.start()
199
+   → ping bootstrap_endpoints; build initial routing table
200
+   → store our peer presence: store(DhtValue(blake3(node_id), {...}, ttl=30s))
201
+   → store community bootstrap: store(DhtValue(blake3(community_id), {anchors, ...}, ttl=24h))
202
+   → for each pinned CID: store(DhtValue(blake3(cid), {node_id, ...}, ttl=1h))
203
+   ↓
204
+ Every MANIFEST_REPUBLISH_INTERVAL_SECONDS: re-store peer presence
205
+ Every DHT_REPUBLISH_SECONDS: re-store all our advertisements
206
+ Every DHT_REFRESH_SECONDS: refresh routing table buckets
207
+ ```
208
+
209
+ ### 5.2 Lookup integration with M02
210
+
211
+ When [M02 PeerRegistry](../../modules/M02-discovery.md) doesn't find a peer for a known community on the LAN:
212
+
213
+ ```python
214
+ # M02 extension (Phase 2)
215
+ async def find_remote_peers(community_id: str) -> list[PeerRecord]:
216
+     if dht is None:
217
+         return []
218
+     values = await dht.find_value(blake3(community_id))
219
+     candidates = [parse_community_bootstrap(v.payload) for v in values]
220
+     return await fetch_manifests_and_filter(candidates, community_id)
221
+ ```
222
+
223
+ ### 5.3 Lookup integration with M07
224
+
225
+ When [M07 TransferManager](../../modules/M07-file-blobs.md) needs sources for a CID and the local `file.cid.advertised` index is empty:
226
+
227
+ ```python
228
+ # M07 extension (Phase 2)
229
+ async def find_remote_sources(cid: str) -> list[str]: # NodeIDs
230
+     values = await dht.find_value(blake3(cid))
231
+     return [parse_source_advert(v.payload).node_id for v in values]
232
+ ```
233
+
234
+ ### 5.4 Signature requirement on stored values
235
+
236
+ Every DHT value's `payload` must contain a `signature` field signed by the advertiser. Receivers reject values whose signature does not validate against the advertiser's claimed NodeID. Cost is small; protection is essential: without it, anyone can poison the DHT.
237
+
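The acceptance rule above can be sketched with the signature check injected as a callable, so the example does not depend on a particular Ed25519 library. The canonicalisation shown (JSON with sorted keys, compact separators, `signature` field removed) is an assumption; the spec only says the signed fields are "canonicalised". `accept_advert`, `canonical_bytes`, and the `Verifier` type are hypothetical names.

```python
import json
from typing import Callable

# verify(node_id, message, signature) -> bool; assumed to come from the identity layer (M01).
Verifier = Callable[[str, bytes, str], bool]


def canonical_bytes(payload: dict) -> bytes:
    """Serialise the payload without its signature, sorted keys, no extra whitespace."""
    unsigned = {k: v for k, v in payload.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def accept_advert(payload: dict, verify: Verifier) -> bool:
    """Return True only if the payload carries a signature that verifies for its node_id."""
    signature = payload.get("signature")
    node_id = payload.get("node_id")
    if not isinstance(signature, str) or not isinstance(node_id, str):
        return False  # unsigned or anonymous adverts are rejected outright
    return verify(node_id, canonical_bytes(payload), signature)
```

Both sides must agree on the exact canonical form byte for byte, so a real implementation would pin it in the wire-format document rather than rely on a library default.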
238
+ ### 5.5 NAT traversal hooks
239
+
240
+ The DHT itself does not do hole-punching. It cooperates with [M15 Relay Tier](../M15-relay-tier.md):
241
+
242
+ - If our advertised endpoint is unreachable (NAT'd), we additionally advertise `via_relay: "<relay_url>"` in the value payload
243
+ - Peers wanting to reach us see the relay hint and route through it
244
+ - Direct peer-to-peer over NAT (STUN/TURN) is Phase 3
245
+
246
+ ### 5.6 Privacy of the DHT
247
+
248
+ The DHT is a public-internet-facing component (by definition). It leaks:
249
+ - Which NodeIDs exist
250
+ - Which communities exist
251
+ - Which CIDs are popular
252
+
253
+ It does **not** leak:
254
+ - The contents of any blob
255
+ - The contents of community event logs
256
+ - Who's actually a member of a community (membership is in the signed manifest, fetched out of band)
257
+
258
+ This is acceptable for a system whose goal is community resilience, not anonymity.
259
+
260
+ ### 5.7 Anti-spam
261
+
262
+ - Per-source rate limit on `store` calls: max 100 per minute per node
263
+ - Stored value size cap: 4 KB
264
+ - Per-bucket eviction prefers values with higher signature reputation (Phase 3)
265
+
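The first two anti-spam rules can be sketched as a sliding-window limiter plus a size check. Reading "4 KB" as 4096 bytes, the class name, and the explicit `now` parameter (for deterministic testing) are assumptions; the returned strings reuse the `DhtError` names from this spec.

```python
import time
from collections import defaultdict, deque
from typing import Optional

MAX_STORES_PER_MINUTE = 100
MAX_VALUE_BYTES = 4 * 1024  # assumes "4 KB" means 4096 bytes


class StoreGuard:
    """Sliding-window limiter for `store` calls, keyed by the sender's NodeID."""

    def __init__(self, limit: int = MAX_STORES_PER_MINUTE, window_seconds: float = 60.0) -> None:
        self._limit = limit
        self._window = window_seconds
        self._calls = defaultdict(deque)  # node_id -> timestamps of accepted calls

    def check(self, node_id: str, value_size: int, now: Optional[float] = None) -> Optional[str]:
        """Return None if the call is accepted, else the DhtError name that applies."""
        if value_size > MAX_VALUE_BYTES:
            return "value_too_large"
        now = time.monotonic() if now is None else now
        calls = self._calls[node_id]
        while calls and now - calls[0] >= self._window:
            calls.popleft()  # drop calls that have left the window
        if len(calls) >= self._limit:
            return "rate_limited"
        calls.append(now)
        return None
```

Only accepted calls are counted, so a sender that keeps hammering while limited does not extend its own penalty.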
266
+ ### 5.8 Bootstrap reachability
267
+
268
+ `bootstrap_endpoints` (from config) are tried in order. If all fail, the node logs a warning and continues with mDNS+UDP only. The DHT is best-effort.
269
+
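The "tried in order, best-effort" behaviour might be sketched like this. The `ping` callable stands in for `is_reachable`, endpoints are plain strings rather than `Endpoint` objects, and stopping at the first responsive endpoint is one reading of "tried in order"; a real node may go on to ping the rest to fill its routing table.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("hearthnet.dht")  # hypothetical logger name


async def first_reachable(
    endpoints: list,
    ping: Callable[[str], Awaitable[bool]],
    timeout_seconds: float = 5.0,
) -> Optional[str]:
    """Try endpoints in order; return the first that answers, or None if all fail."""
    for endpoint in endpoints:
        try:
            if await asyncio.wait_for(ping(endpoint), timeout_seconds):
                return endpoint
        except (asyncio.TimeoutError, OSError):
            pass  # treat timeouts and connection errors as "unreachable"
    logger.warning("no DHT bootstrap endpoint reachable; continuing with mDNS+UDP only")
    return None
```

Returning `None` instead of raising keeps the DHT optional: the caller can carry on with LAN discovery and retry bootstrap later.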
270
+ ---
271
+
272
+ ## 6. Wire format (request/response examples)
273
+
274
+ ### 6.1 `POST /dht/v1/find_value`
275
+
276
+ Request:
277
+ ```json
278
+ {
279
+ "key": "blake3:<hex of 32 bytes>",
280
+ "from": "ed25519:<our NodeID>",
281
+ "trace_id": "01HXR...",
282
+ "signature": "ed25519:<over the above three fields canonicalised>"
283
+ }
284
+ ```
285
+
286
+ Response (value found):
287
+ ```json
288
+ {
289
+ "values": [
290
+ {
291
+ "key": "blake3:...",
292
+ "payload": {"node_id":"...","endpoints":[...],"signature":"ed25519:..."},
293
+ "expires_at": 1717942800
294
+ }
295
+ ]
296
+ }
297
+ ```
298
+
299
+ Response (not found, get closer contacts):
300
+ ```json
301
+ {
302
+ "values": [],
303
+ "closer": [{"node_id_full":"ed25519:...","endpoint":{"host":"...","port":7080}}, "..."]
304
+ }
305
+ ```
306
+
307
+ ---
308
+
309
+ ## 7. Errors
310
+
311
+ `DhtError`:
312
+
313
+ - `bootstrap_failed`: no bootstrap endpoint reachable
314
+ - `lookup_timeout`: couldn't find value or contacts within DHT_LOOKUP_TIMEOUT
315
+ - `store_unauthorized`: payload signature invalid
316
+ - `value_too_large`: payload exceeds 4 KB
317
+ - `rate_limited`: per-source store rate exceeded
318
+
319
+ These don't always map to wire codes, since most DHT activity is internal to the node. When they bubble up to a caller, `dht_lookup_failed` is the wire code.
320
+
321
+ ---
322
+
323
+ ## 8. Configuration
324
+
325
+ ```python
326
+ config.dht.enabled = False # opt-in; phase 1 default off
327
+ config.dht.bootstrap_endpoints = [...]
328
+ config.dht.public_endpoint_override = None # for nodes behind NAT, manual override
329
+ config.dht.advertise_cids = True # also advertise pinned CIDs
330
+ config.dht.advertise_community = True
331
+ ```
332
+
333
+ Constants used: `DHT_REPLICATION_K=8`, `DHT_ALPHA=3`, `DHT_REFRESH_SECONDS=3600`, `DHT_REPUBLISH_SECONDS=86400`.
334
+
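To illustrate how `DHT_REPLICATION_K` is typically used in a Kademlia-style table, here is a small sketch of the XOR distance metric and a k-closest selection. The function names are hypothetical, and it assumes IDs are equal-length byte strings (for example 32-byte hashes).

```python
DHT_REPLICATION_K = 8  # value taken from the constants listed above


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance: the two IDs XORed and read as an unsigned integer."""
    if len(a) != len(b):
        raise ValueError("IDs must have equal length")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def k_closest(target: bytes, candidates: list, k: int = DHT_REPLICATION_K) -> list:
    """Return the k candidate IDs nearest to target under the XOR metric."""
    return sorted(candidates, key=lambda c: xor_distance(target, c))[:k]
```

For example, with one-byte IDs, `k_closest(b"\x05", ids, k=2)` picks `0x05` and `0x04`, because `5 ^ 5 = 0` and `5 ^ 4 = 1` are the two smallest distances.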
335
+ ---
336
+
337
+ ## 9. Tests
338
+
339
+ ### Unit
340
+ - `test_xor_distance_metric`
341
+ - `test_routing_table_insert_eviction`
342
+ - `test_signed_value_verification`
343
+ - `test_unsigned_value_rejected`
344
+ - `test_ttl_eviction`
345
+
346
+ ### Integration
347
+ - `test_three_node_dht_find_value`: three KademliaNodes in process, store, find
348
+ - `test_bootstrap_picks_up_existing_dht`
349
+ - `test_partition_then_reconnect_converges`
350
+ - `test_value_republish_keeps_alive`
351
+
352
+ ### Property-based
353
+ - `test_kademlia_eventual_consistency_under_churn` (Hypothesis-driven)
354
+
355
+ ---
356
+
357
+ ## 10. Cross-references
358
+
359
+ | What | Where |
360
+ |------|-------|
361
+ | Used by federation bootstrap | [M14 §4.3](../modules/M14-federation.md) |
362
+ | Used by background blob replication | M07 ext (see [00-OVERVIEW §1](../00-OVERVIEW.md)) |
363
+ | Wire error code | `dht_lookup_failed` in [CAP2 §9](../CAPABILITY_CONTRACT_v2.md) |
364
+ | Phase 1 alternative (mDNS/UDP) | [M02](../../modules/M02-discovery.md) |
365
+ | Phase 3 sybil resistance | TBD |
366
+
367
+ ---
368
+
369
+ ## 11. Open questions
370
+
371
+ 1. **libp2p reuse vs custom Python.** libp2p has a Python port but it's heavyweight. A focused 1000-LOC Kademlia matches our needs and stays auditable. Decision: custom for now; can swap.
372
+ 2. **NAT hole punching.** Currently relay-only. STUN/TURN integration is Phase 3.
373
+ 3. **Public DHT vs federated DHTs.** Should the DHT itself be federated (per-community DHT joined via cross-sig)? Maybe. Defer.
374
+ 4. **Onion routing.** Out of scope. HearthNet has no anonymity goal.
docs/p2_p3/X06-websocket.md ADDED
@@ -0,0 +1,316 @@
1
+ # X06 - WebSocket Upgrade
2
+
3
+ **Spec version:** v1.0 (Phase 2)
4
+ **Depends on:** X01 (transport), X03 (observability), `websockets` Python library
5
+ **Depended on by:** X01 transport server (in-place extension), M21 (tool-call loops), M25 (group chat live), M22 (mobile push delivery)
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Add bidirectional WebSocket transport alongside the existing HTTP/1.1 + SSE in [X01](../../cross-cutting/X01-transport.md). Use cases:
12
+
13
+ - Tool-call loops in `llm.chat` where the server needs to ask the client to execute a tool mid-stream
14
+ - Live pubsub topics that fan out many messages per second (group chat, federation heartbeats)
15
+ - Mobile clients on flaky cellular where reconnect is expensive
16
+
17
+ WebSockets do **not** replace the request/response model. They are an *upgrade* available on specific endpoints when both ends support the v2 contract.
18
+
19
+ ---
20
+
21
+ ## 2. File layout
22
+
23
+ ```
24
+ hearthnet/transport/
25
+ └── websocket.py # WebSocket server-side handler + client-side wrapper
26
+ ```
27
+
28
+ Single file; the protocol is small.
29
+
30
+ ---
31
+
32
+ ## 3. Endpoints supporting upgrade
33
+
34
+ | Endpoint | Behaviour |
35
+ |----------|-----------|
36
+ | `/bus/v1/call` | When `Upgrade: websocket` is present and the capability descriptor supports streaming, upgrade and use the frame protocol on the WS instead of SSE |
37
+ | `/pubsub/v1/subscribe` | When upgraded, server pushes messages on topic without long-polling |
38
+ | `/sync/v1/events` | NOT upgraded: sync is bursty and short-lived; HTTP fits |
39
+
40
+ ---
41
+
42
+ ## 4. WebSocket frame protocol
43
+
44
+ WebSocket frames carry the **same JSON event/data envelope** as SSE. This is deliberate: handlers can be written once and dispatched to either transport.
45
+
46
+ ### 4.1 Outbound (server → client)
47
+
48
+ Each WebSocket message is one JSON object:
49
+
50
+ ```json
51
+ {"event": "token", "data": {"text": "Hallo "}, "seq": 12}
52
+ ```
53
+
54
+ `seq` is monotonic per-stream from the server. Used for backpressure ACKs.
55
+
56
+ ### 4.2 Inbound (client → server)
57
+
58
+ Three kinds of messages:
59
+
60
+ #### Backpressure ACK
61
+
62
+ ```json
63
+ {"type":"ack","upto":8}
64
+ ```
65
+
66
+ #### Tool result (mid-stream)
67
+
68
+ ```json
69
+ {"type":"tool_result","tool_call_id":"tc_01HXR...","body":{...}}
70
+ ```
71
+
72
+ Used in tool-call loops (see [M21](../modules/M21-tool-calls.md)).
73
+
74
+ #### Cancel
75
+
76
+ ```json
77
+ {"type":"cancel"}
78
+ ```
79
+
80
+ Cleanly stops the current operation. Server must abort within 200 ms and emit a final `error` or `done` frame.
81
+
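The three inbound message kinds above can be validated with a small parser. This is a sketch under assumptions: `parse_client_frame` and `BadFrame` are hypothetical names (the latter standing in for the `bad_frame` error code), and `WsClientFrame` mirrors the dataclass shape given in section 5.1.

```python
import json
from dataclasses import dataclass

ALLOWED_TYPES = {"ack", "tool_result", "cancel"}


class BadFrame(ValueError):
    """Hypothetical stand-in for the `bad_frame` WebSocketError code."""


@dataclass(frozen=True)
class WsClientFrame:
    type: str
    data: dict


def parse_client_frame(raw: str) -> WsClientFrame:
    """Decode one inbound WebSocket text message into a WsClientFrame."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BadFrame("malformed JSON") from exc
    if not isinstance(msg, dict):
        raise BadFrame("frame must be a JSON object")
    kind = msg.pop("type", None)
    if not isinstance(kind, str) or kind not in ALLOWED_TYPES:
        raise BadFrame("invalid type")
    # Per-kind required fields, as shown in the examples above
    if kind == "ack" and not isinstance(msg.get("upto"), int):
        raise BadFrame("ack requires an integer 'upto'")
    if kind == "tool_result" and "tool_call_id" not in msg:
        raise BadFrame("tool_result requires 'tool_call_id'")
    return WsClientFrame(type=kind, data=msg)
```

Everything except `type` ends up in `data`, so `{"type":"ack","upto":8}` becomes `WsClientFrame(type="ack", data={"upto": 8})`.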
82
+ ### 4.3 Control frames
83
+
84
+ Standard WebSocket pings/pongs. `WEBSOCKET_PING_SECONDS = 30` between pings.
85
+
86
+ ---
87
+
88
+ ## 5. Public API
89
+
90
+ ### 5.1 Server side
91
+
92
+ ```python
93
+ # hearthnet/transport/websocket.py
94
+ class WebSocketSession:
95
+ """Wraps a WebSocket connection from the server's perspective."""
96
+
97
+ def __init__(self, ws: WebSocket, kp: KeyPair):
98
+ ...
99
+
100
+ @property
101
+ def closed(self) -> bool: ...
102
+ @property
103
+ def remote_node_id(self) -> str: ...
104
+
105
+ async def emit(self, event: str, data: dict) -> None:
106
+ """Send a frame; respect flow control."""
107
+
108
+ async def emit_token(self, token: dict) -> None: ...
109
+ async def emit_progress(self, current: int, total: int, stage: str) -> None: ...
110
+ async def emit_error(self, code: ErrorCode, **kwargs) -> None: ...
111
+ async def emit_done(self, **meta) -> None: ...
112
+
113
+ async def receive(self) -> WsClientFrame | None:
114
+ """Block until a client frame arrives, or None on close."""
115
+
116
+ async def close(self, code: int = 1000) -> None: ...
117
+
118
+ @dataclass(frozen=True)
119
+ class WsClientFrame:
120
+ type: str # "ack" | "tool_result" | "cancel"
121
+ data: dict
122
+ ```
123
+
124
+ ### 5.2 Client side
125
+
126
+ ```python
127
+ class WebSocketClient:
128
+ """Used by HttpClient (X01) when stream() is called with `prefer_ws=True`."""
129
+
130
+ def __init__(
131
+ self,
132
+ url: str,
133
+ kp: KeyPair,
134
+ community_id: str,
135
+ pinned_certs: PinnedCerts,
136
+ ):
137
+ ...
138
+
139
+ async def open(self) -> None: ...
140
+ async def close(self) -> None: ...
141
+
142
+ async def send_call(
143
+ self,
144
+ capability: str,
145
+ version: str,
146
+ body: dict,
147
+ *,
148
+ trace_id: str,
149
+ ) -> None:
150
+ """Initial call frame. Authentication via X-HearthNet-* headers
151
+ and a signed call-envelope sent as the first WS message."""
152
+
153
+ async def __aiter__(self) -> AsyncIterator[Frame]:
154
+ """Yields Frame objects (same shape as SSE Frame)."""
155
+
156
+ async def send_tool_result(self, tool_call_id: str, body: dict) -> None: ...
157
+ async def send_ack(self, upto: int) -> None: ...
158
+ async def cancel(self) -> None: ...
159
+ ```
160
+
161
+ ### 5.3 Upgrade negotiation on the server
162
+
163
+ X01's [HttpServer](../../cross-cutting/X01-transport.md) gets a small dispatch shim:
164
+
165
+ ```python
166
+ # in hearthnet/transport/server.py (Phase 2 extension)
167
+ async def dispatch_call(request: Request):
168
+     if request.headers.get("upgrade", "").lower() == "websocket" and capability_supports_stream(...):  # header value is case-insensitive
169
+         return await dispatch_via_websocket(request)
170
+     else:
171
+         return await dispatch_via_sse_or_json(request)
172
+ ```
173
+
174
+ `capability_supports_stream` checks the descriptor's `stream_schema` is not None.
175
+
176
+ ---
177
+
178
+ ## 6. Behaviour
179
+
180
+ ### 6.1 Handshake
181
+
182
+ ```
183
+ client → GET /bus/v1/call
184
+ Connection: Upgrade
185
+ Upgrade: websocket
186
+ Sec-WebSocket-Protocol: hearthnet-bus.v2
187
+ (other X-HearthNet-* headers)
188
+ ↓
189
+ server: validates capability + initial signature
190
+ responds 101 Switching Protocols if v2 capable
191
+ responds 426 Upgrade Required (with downgrade hint) if not v2
192
+ ↓
193
+ client sends first message: signed call envelope
194
+ {"type":"call","envelope":{...},"signature":"ed25519:..."}
195
+ ↓
196
+ server: validates signature, dispatches to bus
197
+ ↓
198
+ server streams response frames; client streams ACKs / tool_results / cancels
199
+ ```
200
+
201
+ ### 6.2 Flow control
202
+
203
+ Same window-based FC as SSE (`STREAM_WINDOW_FRAMES = 16`, ACK every 8). Server checks `flow_control.send()` before each emit; client sends `ack` messages every 8 received frames.
204
+
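The window arithmetic can be sketched as follows. This is an illustration only: `SendWindow` and `should_ack` are hypothetical names, the real `flow_control.send()` API is not shown in this spec, and the sketch assumes `seq` starts at 1.

```python
STREAM_WINDOW_FRAMES = 16  # window size from section 6.2
ACK_EVERY = 8              # client ACK cadence from section 6.2


class SendWindow:
    """Server side: tracks how many emitted frames are still unacknowledged."""

    def __init__(self, window: int = STREAM_WINDOW_FRAMES):
        self.window = window
        self.next_seq = 1    # assumption: first frame carries seq 1
        self.acked_upto = 0

    def can_send(self) -> bool:
        in_flight = (self.next_seq - 1) - self.acked_upto
        return in_flight < self.window

    def on_send(self) -> int:
        """Reserve and return the seq for the next frame."""
        if not self.can_send():
            raise RuntimeError("window full; wait for an ack")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_ack(self, upto: int) -> None:
        # Ignore stale ACKs and ACKs for frames that were never sent
        self.acked_upto = max(self.acked_upto, min(upto, self.next_seq - 1))


def should_ack(received_seq: int) -> bool:
    """Client side: send an ack after every ACK_EVERY received frames."""
    return received_seq % ACK_EVERY == 0
```

With these numbers the server can emit 16 frames before stalling, and a single `{"type":"ack","upto":8}` frees room for 8 more.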
205
+ ### 6.3 Idle handling
206
+
207
+ If no message in either direction for `WEBSOCKET_IDLE_CLOSE_SECONDS` (120s), server closes with code 1000. Client may reopen.
208
+
209
+ ### 6.4 Failure modes
210
+
211
+ | Symptom | Behaviour |
212
+ |---------|-----------|
213
+ | Client disconnect mid-stream | Server's task receives `CancelledError`, aborts the underlying capability within 200ms |
214
+ | Network drop | Either side's WS library raises; current stream is `error`-terminated locally |
215
+ | Server overload | Server may decline upgrade with 503 + retry hint; client falls back to SSE |
216
+ | Protocol version mismatch | Server replies 426 with `Sec-WebSocket-Protocol` listing supported versions |
217
+
218
+ ### 6.5 Pubsub via WS
219
+
220
+ Subscribing to a topic via WS:
221
+
222
+ ```
223
+ client GET /pubsub/v1/subscribe?topic=marketplace.post.created
224
+ Upgrade: websocket
225
+ ↓
226
+ server upgrades; sends backlog (if `since_seq` provided) then live messages
227
+ ↓
228
+ each message: {"event":"published","data":{...},"seq":N}
229
+ ↓
230
+ client sends ACKs to allow server to advance flow control
231
+ ```
232
+
233
+ This replaces the long-polling pattern from Phase 1 §8 for clients that hold the connection. The long-poll endpoint remains for non-WS clients.
234
+
235
+ ### 6.6 Tool-call loop (used by [M21](../modules/M21-tool-calls.md))
236
+
237
+ ```
238
+ server emits:
239
+ {"event":"token","data":{"text":"..."}}
240
+ {"event":"tool_call_delta","data":{"id":"tc_1","name":"rag.query","arguments_delta":"..."}}
241
+ ...
242
+ {"event":"tool_call","data":{"id":"tc_1","arguments":{"query":"...","corpus":"..."}}}
243
+ client must respond:
244
+ {"type":"tool_result","tool_call_id":"tc_1","body":{...result of bus.call("rag.query",...)...}}
245
+ server continues:
246
+ {"event":"token","data":{"text":"Based on the documents..."}}
247
+ ...
248
+ {"event":"done","data":{...}}
249
+ ```
250
+
251
+ Without WebSocket, the SSE-only fallback is for the *caller* (UI) to execute the tool and re-call `llm.chat` with the tool result added to messages. Both paths work; WS is more efficient.
252
+
253
+ ---
254
+
255
+ ## 7. Errors
256
+
257
+ `WebSocketError` codes (local domain):
258
+
259
+ - `upgrade_refused`: server returned 426 or 503
260
+ - `version_unsupported`: protocol mismatch
261
+ - `idle_timeout`
262
+ - `bad_frame`: malformed JSON or invalid `type`
263
+
264
+ On the wire, errors carried inside the WS as `event: error` frames map to the standard wire codes in [CAP §9](../../CAPABILITY_CONTRACT.md).
265
+
266
+ ---
267
+
268
+ ## 8. Configuration
269
+
270
+ ```python
271
+ config.transport.websocket_enabled = True
272
+ config.transport.websocket_idle_close_seconds = WEBSOCKET_IDLE_CLOSE_SECONDS
273
+ config.transport.websocket_ping_seconds = WEBSOCKET_PING_SECONDS
274
+ ```
275
+
276
+ ---
277
+
278
+ ## 9. Tests
279
+
280
+ ### Unit
281
+ - `test_ws_frame_shape_matches_sse`
282
+ - `test_signed_call_envelope_first_message`
283
+ - `test_invalid_signature_closes_connection`
284
+ - `test_idle_close_after_timeout`
285
+
286
+ ### Integration
287
+ - `test_two_node_ws_call_round_trip`
288
+ - `test_ws_stream_tokens_then_done`
289
+ - `test_ws_tool_result_inline`
290
+ - `test_ws_cancel_within_200ms`
291
+ - `test_ws_fallback_to_sse_when_426`
292
+ - `test_pubsub_via_ws_backlog_plus_live`
293
+
294
+ ### Chaos
295
+ - `test_ws_dropped_packet_recovery` (using `tc`)
296
+
297
+ ---
298
+
299
+ ## 10. Cross-references
300
+
301
+ | What | Where |
302
+ |------|-------|
303
+ | Endpoint upgrade | [CAP2 §5.1](../CAPABILITY_CONTRACT_v2.md) |
304
+ | Frame protocol shared with SSE | [X01 §6](../../cross-cutting/X01-transport.md), [CAP §5.3](../../CAPABILITY_CONTRACT.md) |
305
+ | Tool-call loop | [M21](../modules/M21-tool-calls.md) |
306
+ | Mobile client benefits | [M22 §5](../modules/M22-mobile-native.md) |
307
+ | Phase 3 considerations (WebTransport / QUIC) | TBD |
308
+
309
+ ---
310
+
311
+ ## 11. Open questions
312
+
313
+ 1. **HTTP/3 / WebTransport**: Phase 3 candidate; better on mobile, doesn't need TCP setup time on reconnect.
314
+ 2. **Binary frames**: JSON works; binary CBOR could save bytes. Defer until profiling shows it matters.
315
+ 3. **Multiplexing many capability calls on one WS**: currently one WS per call. Multiplexing is possible but adds complexity. Defer.
316
+ 4. **WSS certificate handling**: same TLS pinning as HTTPS; works because WS goes over the same TLS connection.
docs/p2_p3/X07-federated-metrics.md ADDED
@@ -0,0 +1,330 @@
1
+ # X07 - Federated Metrics
2
+
3
+ **Spec version:** v2.0
4
+ **Depends on:** [X03 Observability](../../cross-cutting/X03-observability.md), [M14 Federation](../modules/M14-federation.md), [M16 Tokens](../modules/M16-tokens.md), [X04 Config](../../cross-cutting/X04-config.md)
5
+ **Depended on by:** Operator dashboards (out of band), federation health UI
6
+
7
+ ---
8
+
9
+ ## 1. Responsibility
10
+
11
+ Take the per-node Prometheus metrics produced by [X03](../../cross-cutting/X03-observability.md) and aggregate them, with consent, into a community-level view and, where federation grants it, into a federation-level view.
12
+
13
+ X03 gives each node a private view of itself. X07 gives:
14
+
15
+ - **The community founder** a dashboard like "how healthy is the mesh today, where are the hot spots, what's the GPU saturation across all anchors?"
16
+ - **A federated peer** a much narrower view (opt-in, aggregated, no per-node identifiers) like "Geldern reports 18 active members and 4.2k events/day".
17
+
18
+ The design rule is: **less information at greater distance**. Per-node detail stays on the node. The community sees aggregates. Federated peers see anonymised aggregates. There is no global "every node and what it does" surface.
19
+
20
+ ---
21
+
22
+ ## 2. File layout
23
+
24
+ ```
25
+ hearthnet/observability/
26
+ ├── federated.py # FederatedMetricsExporter & Aggregator
27
+ ├── otlp_export.py # Optional OpenTelemetry OTLP push
28
+ ├── aggregation_views.py # SQL-like views over time-series
29
+ └── consent.py # Per-metric publish consent
30
+ ```
31
+
32
+ ---
33
+
34
+ ## 3. Public API
35
+
36
+ ### 3.1 `FederatedMetricsExporter`
37
+
38
+ ```python
39
+ class FederatedMetricsExporter:
40
+ """
41
+ Pulls metrics from the local Prometheus registry, applies consent rules,
42
+ and publishes aggregated subsets either:
43
+ - to the community's aggregator anchor (mesh-internal)
44
+ - to an external OTLP collector (optional, off by default)
45
+ - to federated peers via the bus
46
+ """
47
+
48
+ def __init__(
49
+ self,
50
+ observability: Observability,
51
+ consent: ConsentPolicy,
52
+ bus: CapabilityBus,
53
+ settings: FederatedMetricsSettings,
54
+ ): ...
55
+
56
+ async def start(self) -> None: ...
57
+ async def stop(self) -> None: ...
58
+
59
+ # Triggered by tick; publishes to internal bus topic
60
+ async def publish_community(self) -> None: ...
61
+
62
+ # Triggered when federated peer requests
63
+ async def publish_federated(self, peer_community_id: NodeID) -> AggregatedSnapshot: ...
64
+
65
+ # OTLP push, off by default
66
+ async def push_otlp(self, endpoint: str) -> None: ...
67
+ ```
68
+
69
+ ### 3.2 `MetricsAggregator`
70
+
71
+ Runs on the **aggregator anchor** (any anchor designated by community policy; default is the founder's node):
72
+
73
+ ```python
74
+ class MetricsAggregator:
75
+ """
76
+ Subscribes to `observability.metrics.tick.*` events from all members,
77
+ keeps a 7-day rolling window, exposes:
78
+ - GET /metrics/community (Prometheus format, community-wide)
79
+ - capability `observability.community_snapshot@1.0`
80
+ """
81
+
82
+ def __init__(self, bus: CapabilityBus, event_log: EventLog, store: TimeSeriesStore): ...
83
+
84
+ async def start(self) -> None: ...
85
+
86
+ async def community_snapshot(self) -> CommunityMetrics: ...
87
+ async def federated_snapshot(self, peer_id: NodeID) -> AggregatedSnapshot: ...
88
+ ```
89
+
90
+ ### 3.3 Snapshot dataclasses
91
+
92
+ ```python
93
+ @dataclass
94
+ class NodeMetricsTick:
95
+ """What each node publishes every METRICS_TICK_SECONDS (default 60)."""
96
+ node_id: NodeID
97
+ timestamp: datetime
98
+ cpu_pct: float
99
+ mem_used_mb: int
100
+ mem_total_mb: int
101
+ gpu_pct: float | None
102
+ gpu_mem_used_mb: int | None
103
+ disk_used_gb: float
104
+ disk_total_gb: float
105
+ capability_calls_per_min: dict[str, int] # by capability
106
+ error_rate_per_min: dict[str, float]
107
+ p95_latency_ms_by_cap: dict[str, float]
108
+ online_seconds: int # since last restart
109
+
110
+ @dataclass
111
+ class CommunityMetrics:
112
+ """Aggregated over the community. Has per-node detail (members see members)."""
113
+ timestamp: datetime
114
+ nodes_total: int
115
+ nodes_online: int
116
+ nodes: list[NodeMetricsTick]
117
+ capability_calls_per_min_total: dict[str, int]
118
+ events_per_min: int
119
+ storage_used_gb: float
120
+ federation_links: int
121
+
122
+ @dataclass
123
+ class AggregatedSnapshot:
124
+ """For federated peers. No per-node detail, no identifiers, banded values."""
125
+ timestamp: datetime
126
+ community_id: NodeID
127
+ nodes_online_band: str # "10-20", "20-50", etc.
128
+ daily_active_members_band: str
129
+ capability_calls_per_day_top: list[tuple[str, str]] # [(cap, band)]
130
+ error_rate_band: str
131
+ federation_links_count: int
132
+ ```
133
+
134
+ ### 3.4 `ConsentPolicy`
135
+
136
+ ```python
137
+ @dataclass
138
+ class ConsentPolicy:
139
+     """
140
+     Loaded from policy.yaml. Controls what leaves the node.
141
+     """
142
+     publish_to_community: set[str] # metric names included in NodeMetricsTick
143
+     publish_to_federated: set[str] # subset, applied to AggregatedSnapshot
144
+     publish_to_external: bool # OTLP push on/off
145
+     aggregation_min_nodes: int # don't expose a metric unless ≥ N nodes contribute
146
+     banding: dict[str, list[int]] # metric → bucket edges
147
+ ```
148
+
149
+ ---
150
+
151
+ ## 4. Behaviour
152
+
153
+ ### 4.1 Tick lifecycle
154
+
155
+ Every `METRICS_TICK_SECONDS` (default 60s) each node:
156
+
157
+ 1. Snapshots its local Prometheus registry.
158
+ 2. Filters per `ConsentPolicy.publish_to_community`.
159
+ 3. Constructs a `NodeMetricsTick`.
160
+ 4. Publishes to bus topic `observability.metrics.tick.<community_id>` over WebSocket pubsub (efficient: many small messages, low latency).
161
+ 5. Also writes a local rolling-window copy for debug.
162
+
163
+ The aggregator anchor subscribes to the topic, ingests into its time-series store, and computes `CommunityMetrics` on demand.
164
+
165
+ ### 4.2 Aggregator selection
166
+
167
+ The community policy contains:
168
+
169
+ ```yaml
170
+ observability:
170
+   aggregator_anchor: ed25519:<NodeID> # optional; if absent, any anchor self-elects
171
+   aggregator_failover_seconds: 600
173
+ ```
174
+
175
+ If the configured aggregator is offline for `aggregator_failover_seconds`, another anchor self-elects (lowest NodeID hash wins). A live community-wide view tolerates the aggregator going offline; nodes keep publishing ticks and a new aggregator picks up where the old one left off (with a brief gap).
176
+
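The "lowest NodeID hash wins" rule can be sketched in a few lines. The spec does not name the hash function, so SHA-256 over the NodeID string is an assumption here, as is the function name `elect_aggregator`.

```python
import hashlib
from typing import Optional


def elect_aggregator(online_anchor_ids: list) -> Optional[str]:
    """Pick the anchor whose NodeID hashes lowest (SHA-256 is an assumed choice)."""
    if not online_anchor_ids:
        return None
    return min(
        online_anchor_ids,
        key=lambda node_id: hashlib.sha256(node_id.encode("utf-8")).digest(),
    )
```

Because every anchor evaluates the same deterministic function over the same set of online anchors, they converge on one winner without extra coordination, as long as their views of who is online agree.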
177
+ ### 4.3 What gets exposed to whom
178
+
179
+ | Metric category | Self | Other members | Aggregator anchor | Federated peers | External OTLP |
180
+ |------------------|------|----------------|-------------------|------------------|---------------|
181
+ | CPU / mem / GPU per-node | ✅ | per policy | ✅ | ❌ | ❌ |
182
+ | Per-capability call counts | ✅ | ✅ | ✅ | banded only | optional |
183
+ | Per-capability latencies | ✅ | aggregated | ✅ | ❌ | ❌ |
184
+ | Error rates | ✅ | aggregated | ✅ | banded only | optional |
185
+ | Federation link count | ✅ | ✅ | ✅ | exact count | ❌ |
186
+ | File counts / sizes | ✅ | ❌ | aggregated | banded | ❌ |
187
+ | Identity of which node did what | ✅ | per policy | ❌ (anonymised on ingest) | ❌ | ❌ |
188
+
189
+ The aggregator does **not** store per-node identity in its long-term time series. It computes per-node views on the fly for the founder UI but persists only anonymised aggregates after `MEMBER_DETAIL_RETENTION_HOURS` (default 24).
190
+
191
+ ### 4.4 Banding
192
+
193
+ Federated snapshots use bands rather than exact numbers to prevent triangulation across multiple federations:
194
+
195
+ ```yaml
196
+ banding:
197
+   nodes_online: [0, 5, 10, 20, 50, 100, 500]
198
+   daily_active_members: [0, 3, 10, 30, 100]
199
+   capability_calls_per_day: [0, 100, 1000, 10000, 100000]
200
+   error_rate: [0, 0.01, 0.05, 0.10]
201
+ ```
202
+
203
+ Result: `"nodes_online_band": "10-20"` instead of `19`.
204
+
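A banding function consistent with the example above might look like this. It is a sketch with stated assumptions: the name `band` is hypothetical, lower edges are treated as inclusive and upper edges as exclusive, and values at or above the last edge are labelled with a trailing `+`, which the spec does not define.

```python
from bisect import bisect_right


def band(value: float, edges: list) -> str:
    """Map an exact value to a band label using ascending bucket edges."""
    if not edges or value < edges[0]:
        raise ValueError("value below the first bucket edge")
    i = bisect_right(edges, value)  # index of the first edge strictly above value
    if i == len(edges):
        return f"{edges[-1]}+"      # assumption: open-ended top band
    return f"{edges[i - 1]}-{edges[i]}"
```

With the `nodes_online` edges above, `band(19, [0, 5, 10, 20, 50, 100, 500])` gives `"10-20"`, matching the result shown in the text.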
205
+ ### 4.5 OTLP push (external)
206
+
207
+ Off by default. When `publish_to_external=true`:
208
+
209
+ - Pushes to a configured OTLP endpoint (could be Grafana Cloud, self-hosted Tempo/Mimir, or your own collector).
210
+ - Only metrics in `publish_to_external` set leave the node.
211
+ - The receiver gets aggregated, banded data, subject to the same restrictions as a federated peer.
212
+ - TLS required; OTLP headers carry an API token (set via env var, not in policy file).
213
+
214
+ This is the path for an operator (Christof) who wants a single Grafana dashboard across all his bofrost-managed communities, but the protections still apply: the external collector cannot reconstruct who did what.
215
+
216
+ ### 4.6 Trackio integration
217
+
218
+ Phase 1 already supports per-node Trackio logging. X07 adds: **the aggregator** can push a community-level summary to a Trackio space, useful for hackathon demos and HF leaderboard-style displays.
219
+
220
+ `policy.observability.trackio_community_space` (URL) is configurable. The aggregator anchor pushes `CommunityMetrics` rows hourly. Per-node detail is excluded from this path; only aggregates go.
221
+
222
+ ### 4.7 Federated peer queries
223
+
224
+ A federated peer asks for our snapshot via:
225
+
226
+ ```
227
+ POST /bus/v1/call
228
+ X-HearthNet-Community: <their community>
229
+ Capability: observability.federated_snapshot@1.0
230
+ Body: {"input": {"window_hours": 24}}
231
+ ```
232
+
233
+ Bus checks federation scope, calls the aggregator, returns `AggregatedSnapshot`. The peer's UI may display "Geldern: 10-20 nodes online, light activity today" alongside their own community.
234
+
235
+ ### 4.8 Cost & sizing
236
+
237
+ A `NodeMetricsTick` is roughly 500 bytes of JSON. At 1 tick / 60 s per node, a 50-node community publishes 50 × 500 B / 60 s ≈ 420 B/s on the metrics topic. Negligible.
238
+
239
+ The aggregator's time-series store is **DuckDB** (Phase 2 choice; SQLite would also work). Retention: 7 days at full per-node resolution, then daily roll-ups for 90 days, then weekly forever.
240
+
241
+ ---
242
+
243
+ ## 5. Errors
244
+
245
+ | Code | Cause |
246
+ |------|-------|
247
+ | `unavailable` | Aggregator anchor offline |
248
+ | `aggregation_too_few_nodes` | < `aggregation_min_nodes` nodes contributed; refusing to disclose |
249
+ | `federation_forbidden` | Peer requested a metric category not in federation scope |
250
+ | `consent_denied` | Local policy forbids this metric from leaving the node |
251
+
252
+ ---
253
+
254
+ ## 6. Configuration
255
+
256
+ ```toml
257
+ [observability.federated]
258
+ enabled = true
259
+ metrics_tick_seconds = 60
260
+ aggregator_failover_seconds = 600
261
+ member_detail_retention_hours = 24
262
+ aggregation_min_nodes = 3
263
+ publish_to_external = false
264
+ otlp_endpoint = ""
265
+ otlp_token_env = "OTLP_TOKEN"
266
+ trackio_community_space = ""
267
+
268
+ [observability.federated.consent.publish_to_community]
269
+ metrics = [
270
+ "node.cpu_pct", "node.mem_pct", "node.gpu_pct",
271
+ "node.online_seconds", "node.capability_calls_per_min",
272
+ "node.p95_latency_by_capability",
273
+ ]
274
+
275
+ [observability.federated.consent.publish_to_federated]
276
+ metrics = [
277
+ "community.nodes_online", "community.daily_active_members",
278
+ "community.capability_calls_top", "community.federation_links",
279
+ ]
280
+
281
+ [observability.federated.consent.banding]
282
+ "community.nodes_online" = [0, 5, 10, 20, 50, 100]
283
+ "community.daily_active_members" = [0, 3, 10, 30, 100]
284
+ "community.capability_calls_top" = [0, 100, 1000, 10000]
285
+ "community.error_rate" = [0, 0.01, 0.05, 0.10]
286
+ ```
287
+
288
+ ---
289
+
290
+ ## 7. Tests
291
+
292
+ ### 7.1 Unit
293
+ - Banding: value 17 with bands `[0,5,10,20,50]` returns `"10-20"`
294
+ - Aggregation refuses when contributors < min: `aggregation_too_few_nodes`
295
+ - Consent: a metric not in `publish_to_community` set is excluded from tick
296
+ - AggregatedSnapshot construction strips all NodeID fields
297
+
298
+ ### 7.2 Integration
299
+ - 5 nodes publish ticks for 5 minutes; aggregator's snapshot reflects 5 contributors with correct totals
300
+ - Aggregator kill / failover: a second anchor takes over within 10 minutes, snapshot resumes
301
+ - Federated peer requests snapshot; receives banded version; cannot infer specific node counts
302
+
303
+ ### 7.3 Adversarial
304
+ - Malicious node publishes inflated counters → outlier detection drops obvious outliers (>3σ) from the aggregate
305
+ - Federated peer requests snapshot for window the aggregator hasn't filled → `aggregation_too_few_nodes`
306
+ - OTLP endpoint compromised: leaked data contains only banded aggregates; per-node attribution impossible
307
+
308
+ ### 7.4 Privacy
309
+ - Asserts: no NodeID, IP, or device-identifying string is present in `AggregatedSnapshot`
310
+ - Asserts: after `MEMBER_DETAIL_RETENTION_HOURS` the aggregator's persisted store contains no per-node rows
311
+
312
+ ---
313
+
314
+ ## 8. Cross-references
315
+
316
+ - Capability: `observability.community_snapshot@1.0`, `observability.federated_snapshot@1.0` (introduced here, listed in [CAPABILITY_CONTRACT_v2 §3](../CAPABILITY_CONTRACT_v2.md#3-complete-new-capabilities-list))
317
+ - Bus topic: `observability.metrics.tick.<community_id>`
318
+ - Underlying primitives: [X03](../../cross-cutting/X03-observability.md)
319
+ - Federation scope: [M14 §5](../modules/M14-federation.md)
320
+ - Policy schema: [X04](../../cross-cutting/X04-config.md)
321
+
322
+ ---
323
+
324
+ ## 9. Open questions
325
+
326
+ 1. **Differential privacy**: adding Laplacian noise to federated snapshots. Worth it for stronger guarantees, or does banding already suffice given small N?
327
+ 2. **Federation gossip of snapshots**: should snapshots propagate transitively (A→B→C sees A's banded numbers), or strictly point-to-point? Phase-2 default: point-to-point.
328
+ 3. **Per-capability cost accounting**: exposing GPU-seconds per capability call would help operators reason about cost / who's consuming what. Reveals usage patterns; needs consent design.
329
+ 4. **Histogram vs banded scalars**: banded scalars are simple but lose distribution shape. Full Prometheus histograms with aggregated buckets might be a better federated unit. Trade-off: bytes on the wire vs richness.
330
+ 5. **Aggregator beyond a single anchor**: at large scale (100+ nodes) a single aggregator becomes a bottleneck. Sharded aggregation (per-capability-prefix?) is a Phase-3 problem.
docs/p2_p3/X08-tensor-transport.md ADDED
@@ -0,0 +1,302 @@
1
+ # X08 - Tensor Transport
2
+
3
+ **Spec version:** v3.0, *experimental*
4
+ **Depends on:** [X06 WebSocket](../../phase-2/cross-cutting/X06-websocket.md), [M02 Transport](../../modules/M02-transport.md), [M01 Identity](../../modules/M01-identity.md)
5
+ **Depended on by:** [M26 Distributed Inference](../modules/M26-distributed-inference.md)
6
+
7
+ ---
8
+
9
+ ## 1. Purpose
10
+
11
+ A binary, framed, flow-controlled transport for **tensor data** between HearthNet nodes, specifically the activations and gradients moved during M26 distributed inference. The text-oriented capability bus and JSON-shaped event envelopes are wrong for this traffic: tensors are large, dense, and benefit from binary representation, streaming, and explicit flow control.
12
+
13
+ X08 lives parallel to the bus, not on top of it. A tensor session is *negotiated* via the bus (M26 calls `pipeline.shard.connect` which returns an X08 endpoint URL and a session token), then the actual bytes move over a dedicated WebSocket binary channel.
14
+
15
+ Scope: bidirectional tensor streaming, fp16 by default, optional zstd compression above a threshold, 16-byte fixed-size headers, chunked payloads, ack-based flow control. Not a general-purpose RPC.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **Replacing capability-bus traffic.** Control plane stays on the bus. X08 carries data only.
22
+ - **Persistent storage of tensors.** X08 is point-to-point, in-memory, ephemeral. Storage is the caller's job.
23
+ - **Cross-version negotiation of the frame format.** v3.0 ships one frame format. A future version bumps the major.
24
+ - **End-to-end encryption beyond TLS.** WebSocket runs over TLS via M02. Per-frame application-layer crypto is out of scope (the threat model doesn't require it because session establishment is authenticated, and the WSS hop is encrypted).
25
+ - **Reliable broadcast.** Sessions are 1:1. Multi-receiver fan-out is M26's problem if it needs it.
26
+
27
+ ---
28
+
29
+ ## 3. Wire format
30
+
31
+ ### 3.1 Frame
32
+
33
+ Every frame is a single WebSocket binary message. Frame layout (big-endian):
34
+
35
+ ```
36
+ offset  size  field
37
+ 0       1     version (currently 0x01)
38
+ 1       1     frame_type
39
+ 2       2     reserved (must be 0x0000)
40
+ 4       4     session_seq (u32, monotonic per session)
41
+ 8       4     payload_length (u32, bytes of body)
42
+ 12      4     flags
43
+ 16      ...   body (payload_length bytes)
44
+ ```
45
+
46
+ The header is always 16 bytes. Body is opaque to the framing layer; its interpretation depends on `frame_type`.
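As a rough illustration of the layout in §3.1, the header can be packed and parsed with Python's `struct` module. This is a minimal sketch, not the reference implementation: the names `FrameHeader`, `pack_frame`, and `unpack_frame` are hypothetical, and rejecting a non-zero `reserved` field is an assumption drawn from the "must be 0x0000" note.

```python
import struct
from typing import NamedTuple

# Big-endian: version (u8), frame_type (u8), reserved (u16),
# session_seq (u32), payload_length (u32), flags (u32) = 16 bytes.
HEADER = struct.Struct(">BBHIII")
VERSION = 0x01


class FrameHeader(NamedTuple):  # hypothetical helper type
    version: int
    frame_type: int
    session_seq: int
    payload_length: int
    flags: int


def pack_frame(frame_type: int, session_seq: int, body: bytes = b"", flags: int = 0) -> bytes:
    """Prefix `body` with the 16-byte header; the result is one binary message."""
    return HEADER.pack(VERSION, frame_type, 0, session_seq, len(body), flags) + body


def unpack_frame(message: bytes) -> tuple[FrameHeader, bytes]:
    """Split one binary message into (header, body), validating the header."""
    if len(message) < HEADER.size:
        raise ValueError("frame shorter than the 16-byte header")
    version, frame_type, reserved, seq, length, flags = HEADER.unpack_from(message)
    if version != VERSION:
        raise ValueError(f"unsupported version {version:#04x}")
    if reserved != 0:  # assumption: a non-zero reserved field is an error
        raise ValueError("reserved field must be 0x0000")
    body = message[HEADER.size:]
    if len(body) != length:
        raise ValueError("payload_length does not match body size")
    return FrameHeader(version, frame_type, seq, length, flags), body
```

Because each frame is exactly one WebSocket binary message, the sketch can check `payload_length` against the message size instead of scanning a byte stream for frame boundaries.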
47
+
48
+ ### 3.2 Frame types
49
+
50
+ ```
51
+ 0x01  TENSOR_DATA      body = tensor chunk (see §3.4)
52
+ 0x02  TENSOR_END       body = empty; marks last chunk of a tensor
53
+ 0x03  ACK              body = empty; acknowledges receipt up to session_seq
54
+ 0x04  CONTROL_NACK     body = utf-8 error reason
55
+ 0x05  CONTROL_HELLO    body = HelloMsg (json, utf-8)
56
+ 0x06  CONTROL_BYE      body = utf-8 reason, optional
57
+ 0x07  CONTROL_FLOWCTL  body = FlowCtlMsg (json, utf-8)
58
+ 0x08  CONTROL_PING     body = 8 bytes (echo nonce)
59
+ 0x09  CONTROL_PONG     body = 8 bytes (echoed nonce)
60
+ ```
61
+
62
+ Frame types `0x10..0xFF` are reserved for future extensions; current implementations MUST close the session on any frame type they do not recognise.
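One way to model the frame-type table is an `IntEnum`, so that any value outside the defined set fails loudly. The sketch below is illustrative only: `FrameType`, `UnknownFrameType`, and `parse_frame_type` are hypothetical names, and treating every value outside `0x01..0x09` as unknown is an assumption.

```python
from enum import IntEnum


class FrameType(IntEnum):
    TENSOR_DATA = 0x01
    TENSOR_END = 0x02
    ACK = 0x03
    CONTROL_NACK = 0x04
    CONTROL_HELLO = 0x05
    CONTROL_BYE = 0x06
    CONTROL_FLOWCTL = 0x07
    CONTROL_PING = 0x08
    CONTROL_PONG = 0x09


class UnknownFrameType(Exception):
    """Signals that the session layer should close the session."""


def parse_frame_type(raw: int) -> FrameType:
    """Map the header's frame_type byte to a known type or raise."""
    try:
        return FrameType(raw)
    except ValueError:
        raise UnknownFrameType(f"unknown_frame_type: {raw:#04x}") from None
```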
63
+
64
+ ### 3.3 Flags
65
+
66
+ ```
67
+ 0x00000001  COMPRESSED  payload is zstd-compressed
68
+ 0x00000002  FINAL       last frame in this tensor (also implied by TENSOR_END)
69
+ 0x00000004  GRAD        payload is a gradient (informational; for telemetry)
70
+ 0x00000008  ENCRYPTED   reserved for future per-frame encryption
71
+ 0xFFFFFFF0              reserved
72
+ ```
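The flag bits map naturally onto an `IntFlag`. This is a sketch under one assumption that the table does not state: a frame with any reserved bit set is rejected. The names `FrameFlags`, `RESERVED_MASK`, and `parse_flags` are hypothetical.

```python
from enum import IntFlag


class FrameFlags(IntFlag):
    COMPRESSED = 0x00000001
    FINAL = 0x00000002
    GRAD = 0x00000004
    ENCRYPTED = 0x00000008  # reserved for future per-frame encryption


RESERVED_MASK = 0xFFFFFFF0


def parse_flags(raw: int) -> FrameFlags:
    """Decode the u32 flags field; rejecting reserved bits is an assumption."""
    if raw & RESERVED_MASK:
        raise ValueError(f"reserved flag bits set: {raw & RESERVED_MASK:#010x}")
    return FrameFlags(raw)
```

A receiver could then test membership directly, for example `FrameFlags.COMPRESSED in parse_flags(header.flags)`.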
73
+
74
+ ### 3.4 Tensor chunk body
75
+
76
+ A `TENSOR_DATA` body is:
77
+
78
+ ```
79
+ offset  size      field
80
+ 0       2         tensor_id (u16, scoped to this session)
81
+ 2       1         dtype (0x01=fp16, 0x02=fp32, 0x03=bf16, 0x04=int8)
82
+ 3       1         n_dims (1..8)
83
+ 4       n_dims*4  shape (u32 per dim, big-endian)
84
+ ...               data_bytes (compressed if COMPRESSED flag set)
85
+ ```
86
+
87
+ `tensor_id` lets a session carry multiple concurrent tensors (e.g., parallel pipeline stages). A given `tensor_id` may be split across multiple `TENSOR_DATA` frames and is terminated by a `TENSOR_END` with the same `tensor_id`.
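The chunk body from §3.4 can be sketched in the same style as the frame header. The helper names `pack_chunk` and `unpack_chunk` and the `DTYPE_ITEMSIZE` table are hypothetical; only the field layout and dtype codes come from the table above.

```python
import struct

# dtype code -> bytes per element (fp16, fp32, bf16, int8)
DTYPE_ITEMSIZE = {0x01: 2, 0x02: 4, 0x03: 2, 0x04: 1}


def pack_chunk(tensor_id: int, dtype: int, shape: tuple[int, ...], data: bytes) -> bytes:
    """Build a TENSOR_DATA body: tensor_id, dtype, n_dims, shape, then data bytes."""
    if dtype not in DTYPE_ITEMSIZE:
        raise ValueError("unknown dtype code")
    if not 1 <= len(shape) <= 8:
        raise ValueError("n_dims must be in 1..8")
    head = struct.pack(f">HBB{len(shape)}I", tensor_id, dtype, len(shape), *shape)
    return head + data


def unpack_chunk(body: bytes) -> tuple[int, int, tuple[int, ...], bytes]:
    """Parse a TENSOR_DATA body into (tensor_id, dtype, shape, data_bytes)."""
    tensor_id, dtype, n_dims = struct.unpack_from(">HBB", body, 0)
    if dtype not in DTYPE_ITEMSIZE or not 1 <= n_dims <= 8:
        raise ValueError("malformed tensor chunk header")
    shape = struct.unpack_from(f">{n_dims}I", body, 4)
    return tensor_id, dtype, shape, body[4 + 4 * n_dims:]
```

The chunk header is variable-length (4 bytes plus 4 per dimension), so the data offset has to be computed from `n_dims` rather than hard-coded.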
88
+
89
+ ### 3.5 HelloMsg
90
+
91
+ ```json
92
+ {
93
+ "session_id": "<ulid>",
94
+ "session_token": "<m16-token>",
95
+ "from": "<NodeID>",
96
+ "to": "<NodeID>",
97
+ "purpose": "pipeline.shard.forward",
98
+ "negotiation": {
99
+ "preferred_dtype": "fp16",
100
+ "compression": "zstd",
101
+ "max_chunk_bytes": 1048576,
102
+ "flow_window": 16
103
+ }
104
+ }
105
+ ```
106
+
107
+ Both parties exchange `CONTROL_HELLO` on connect; mismatched purposes or invalid tokens terminate the session with `CONTROL_BYE`.
108
+
109
+ ### 3.6 FlowCtlMsg
110
+
111
+ ```json
112
+ { "window": 16, "credits_added": 8 }
113
+ ```
114
+
115
+ Sent by the receiver. It tells the sender "I can accept N more in-flight chunks beyond what I've already acked". See §4.3.
116
+
117
+ ---
118
+
119
+ ## 4. Behaviour
120
+
121
+ ### 4.1 Session lifecycle
122
+
123
+ ```
124
+ CONNECT ──hello exchange──▶ READY ──tensor data──▶ STREAMING ──end/bye──▶ CLOSED
125
+    │
126
+    ├── auth fails ──▶ NACK ──▶ CLOSED
127
+    └── timeout ──▶ CLOSED
128
+ ```
129
+
130
+ A session is opened by the side that initiated the bus call (the M26 caller for forward passes; the shard server for activations sent back if reverse direction is needed). The HelloMsg `session_token` is an M16 token scoped to the bus capability that authorised this session (e.g., `pipeline-shard-forward`); the receiver validates it before accepting any `TENSOR_DATA`.
131
+
132
+ ### 4.2 Sequencing
133
+
134
+ `session_seq` is a u32 starting at 1 and incrementing per outgoing frame from the sender. After 2^32-1 it wraps back to 1; in practice a single session is expected to stay far below that. Wrap is supported by the protocol but is not exercised by tests.
135
+
136
+ The receiver tracks the highest `session_seq` it has processed and acknowledges via `ACK` frames whose `session_seq` echoes the highest contiguous received seq.
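The sequencing rules can be sketched as a small receiver-side tracker. This is an illustration, not the reference code: `next_seq`, `SeqTracker`, and `SequenceGap` are hypothetical names. Since WebSocket delivers in order and any gap closes the session, the highest accepted seq is also the highest contiguous one.

```python
SEQ_MAX = 2**32 - 1


def next_seq(seq: int) -> int:
    """Successor of `seq`: 1, 2, ..., 2^32-1, then back to 1 (0 is never sent)."""
    return 1 if seq == SEQ_MAX else seq + 1


class SequenceGap(Exception):
    """The session layer should NACK and close when this is raised."""


class SeqTracker:
    def __init__(self) -> None:
        self.highest = 0  # nothing received yet

    def accept(self, seq: int) -> int:
        """Validate an incoming seq and return the value to echo in an ACK."""
        expected = next_seq(self.highest)
        if seq != expected:
            raise SequenceGap(f"expected {expected}, got {seq}")
        self.highest = seq
        return seq
```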
137
+
138
+ ### 4.3 Flow control
139
+
140
+ The receiver advertises a *credit window* in `CONTROL_FLOWCTL`. The sender may have at most `window` un-acked frames in flight. The initial window is set in `HelloMsg.negotiation.flow_window` (default `TENSOR_FLOW_CONTROL_WINDOW=16`). The receiver replenishes credits by sending `CONTROL_FLOWCTL` with `credits_added > 0` as it processes frames.
141
+
142
+ If the sender's in-flight count reaches the window, it pauses until an `ACK` or `CONTROL_FLOWCTL` arrives. There is no timeout-based unblock; if the receiver disappears, the underlying WebSocket eventually closes and the session ends.
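The sender-side pause described above can be sketched with an `asyncio.Condition`. The class name `SendWindow` is hypothetical, and the sketch makes one simplifying assumption: the session layer calls `release(n)` whenever an incoming `ACK` or `credits_added` frees `n` slots. How the two messages combine into credits is not spelled out here.

```python
import asyncio


class SendWindow:
    """Credit counter: one credit is consumed per frame sent."""

    def __init__(self, window: int = 16) -> None:
        self._credits = window
        self._cond = asyncio.Condition()

    @property
    def credits(self) -> int:
        return self._credits

    async def acquire(self) -> None:
        """Block until a credit is available, then consume it."""
        async with self._cond:
            await self._cond.wait_for(lambda: self._credits > 0)
            self._credits -= 1

    async def release(self, n: int = 1) -> None:
        """Return `n` credits (assumed to be driven by ACK / FLOWCTL frames)."""
        async with self._cond:
            self._credits += n
            self._cond.notify_all()
```

There is deliberately no timeout in `acquire`, mirroring the rule that a vanished receiver is detected by the WebSocket closing rather than by the flow-control layer.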
143
+
144
+ ### 4.4 Compression
145
+
146
+ The `COMPRESSED` flag is set per frame, not per session. The sender chooses; the receiver MUST support zstd (level 3 default). Compression is applied to the *body* (everything after the 16-byte header). The header's `payload_length` reflects the compressed size; the uncompressed shape is recovered from the tensor chunk header after decompression.
147
+
148
+ Compression is enabled when the raw body exceeds `TENSOR_COMPRESSION_THRESHOLD_BYTES` (default 64 KiB). Below this, the framing overhead dominates and compression is skipped.
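The threshold rule can be sketched as a small decision function. Note the substitution: zstd is not available in every Python standard library, so this sketch takes the compressor as a parameter and defaults to `zlib.compress` purely as a runnable stand-in; a real implementation would pass a zstd level-3 compressor. The name `maybe_compress` is hypothetical.

```python
import zlib
from typing import Callable

COMPRESSED = 0x00000001
TENSOR_COMPRESSION_THRESHOLD_BYTES = 64 * 1024


def maybe_compress(
    raw: bytes,
    flags: int = 0,
    *,
    threshold: int = TENSOR_COMPRESSION_THRESHOLD_BYTES,
    compress: Callable[[bytes], bytes] = zlib.compress,  # stand-in for zstd
) -> tuple[bytes, int]:
    """Return (body, flags), compressing only when `raw` exceeds the threshold."""
    if len(raw) <= threshold:
        return raw, flags
    return compress(raw), flags | COMPRESSED
```

The returned body length is what would go into `payload_length`, which is why the function hands back the possibly-compressed bytes together with the updated flags.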
149
+
150
+ ### 4.5 Chunking
151
+
152
+ A tensor larger than `TENSOR_CHUNK_BYTES` (default 1 MiB) is split into multiple `TENSOR_DATA` frames sharing the same `tensor_id`. The split is on raw-byte boundaries (after compression if compressed); the receiver concatenates raw bytes per `tensor_id` and then, on `TENSOR_END`, decompresses (if needed) and reconstructs the tensor using the shape declared in the *first* chunk for that `tensor_id`. Subsequent chunks for the same `tensor_id` repeat the dtype/shape header; the receiver MUST verify consistency and otherwise close the session with a NACK.
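Splitting and reassembly might look roughly like the following. It is a sketch under stated assumptions: `chunk_bytes` is taken to bound the data bytes of each chunk (not the header), decompression is left out, and the names `split_payload` and `Reassembler` are hypothetical.

```python
TENSOR_CHUNK_BYTES = 1024 * 1024


def split_payload(data: bytes, chunk_bytes: int = TENSOR_CHUNK_BYTES) -> list[bytes]:
    """Cut `data` on raw-byte boundaries; an empty payload yields one empty chunk."""
    return [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)] or [b""]


class Reassembler:
    """Collects chunks per tensor_id and checks the repeated dtype/shape header."""

    def __init__(self) -> None:
        self._meta: dict[int, tuple[int, tuple[int, ...]]] = {}
        self._parts: dict[int, list[bytes]] = {}

    def add_chunk(self, tensor_id: int, dtype: int, shape: tuple[int, ...], data: bytes) -> None:
        meta = (dtype, tuple(shape))
        first = self._meta.setdefault(tensor_id, meta)
        if first != meta:
            # Caller should answer with a NACK and close the session.
            raise ValueError("dtype/shape inconsistent across chunks")
        self._parts.setdefault(tensor_id, []).append(data)

    def finish(self, tensor_id: int) -> tuple[int, tuple[int, ...], bytes]:
        """Called on TENSOR_END: return (dtype, shape, concatenated bytes)."""
        dtype, shape = self._meta.pop(tensor_id)
        return dtype, shape, b"".join(self._parts.pop(tensor_id))
```

Keying both dictionaries by `tensor_id` is what lets chunks of several concurrent tensors interleave on one session.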
153
+
154
+ ### 4.6 Keepalive
155
+
156
+ Either side may send `CONTROL_PING` at any time; the peer must respond with `CONTROL_PONG` echoing the nonce. A side that has seen no PING/PONG and no data for `TENSOR_KEEPALIVE_SECONDS` (default 30) sends a PING; if the peer fails to respond within 2× that interval, the session is closed.
157
+
158
+ ### 4.7 Backpressure & cancellation
159
+
160
+ A caller cancelling a pipeline operation (M26) sends `CONTROL_BYE` with a reason. The receiver may discard in-flight tensors for the cancelled session. There is no "graceful drain": cancellation is fast and lossy.
161
+
162
+ ### 4.8 Failure modes
163
+
164
+ - **Decompression fails**: NACK + close. The caller in M26 retries with the failover shard.
165
+ - **Tensor shape inconsistency across chunks**: NACK + close.
166
+ - **Auth failure on HelloMsg**: NACK + close before any data flows.
167
+ - **Unknown frame type**: close with NACK reason `unknown_frame_type`.
168
+ - **Sequence gap**: NACK + close. There is no out-of-order recovery; WebSocket delivers in order, so a gap means corruption.
169
+ - **Window overrun by sender**: NACK + close, because the sender violated flow control.
170
+
171
+ ---
172
+
173
+ ## 5. API
174
+
175
+ X08 is a library, not a capability surface. Public Python API:
176
+
177
+ ```python
178
+ class TensorSession:
179
+     @classmethod
180
+     async def connect(cls,
181
+                       url: str,
182
+                       token: AuthToken,
183
+                       *,
184
+                       purpose: str,
185
+                       remote: NodeID,
186
+                       negotiation: SessionNegotiation | None = None) -> TensorSession: ...
187
+     @classmethod
188
+     async def accept(cls,
189
+                      ws: WebSocket,
190
+                      *,
191
+                      expected_purpose: str,
192
+                      validate_token: Callable[[AuthToken], None]) -> TensorSession: ...
193
+
194
+     async def send_tensor(self, tensor_id: int, t: Tensor, *, gradient: bool = False) -> None: ...
195
+     async def recv_tensor(self) -> RecvTensor: ...
196
+     async def close(self, reason: str = "") -> None: ...
197
+
198
+     @property
199
+     def session_id(self) -> str: ...
200
+     @property
201
+     def stats(self) -> SessionStats: ...
202
+
203
+ @dataclass(frozen=True)
204
+ class RecvTensor:
205
+     tensor_id: int
206
+     tensor: Tensor
207
+     is_grad: bool
208
+
209
+ @dataclass(frozen=True)
210
+ class SessionStats:
211
+     bytes_sent: int
212
+     bytes_received: int
213
+     bytes_compressed_out: int
214
+     bytes_uncompressed_out: int
215
+     frames_sent: int
216
+     frames_received: int
217
+     rtt_estimate_ms: float
218
+ ```
219
+
220
+ Implementations: `hearthnet/transport/tensor/` houses `session.py`, `frame.py`, `flow.py`, `compress.py`.
221
+
222
+ ---
223
+
224
+ ## 6. Configuration
225
+
226
+ ```python
227
+ @dataclass(frozen=True)
228
+ class TensorTransportConfig:
229
+     default_dtype: Literal["fp16","fp32","bf16","int8"] = "fp16"
230
+     chunk_bytes: int = TENSOR_CHUNK_BYTES # 1048576
231
+     flow_control_window: int = TENSOR_FLOW_CONTROL_WINDOW # 16
232
+     compression_threshold_bytes: int = TENSOR_COMPRESSION_THRESHOLD_BYTES # 65536
233
+     compression_level: int = 3 # zstd
234
+     keepalive_seconds: int = TENSOR_KEEPALIVE_SECONDS # 30
235
+     max_session_lifetime_seconds: int = 3600 # hard cap
236
+     max_concurrent_sessions: int = 64
237
+     rx_buffer_bytes_max: int = 64 * 1024 * 1024 # 64 MiB
238
+ ```
239
+
240
+ Constants in `hearthnet/constants.py`.
241
+
242
+ ---
243
+
244
+ ## 7. Tests
245
+
246
+ ### 7.1 Unit
247
+
248
+ - `test_frame_header_layout`: pack/unpack roundtrip for all frame types.
249
+ - `test_tensor_chunk_body`: pack/unpack roundtrip for all dtypes and ranks.
250
+ - `test_compression_roundtrip`: compressed body decompresses to identity.
251
+ - `test_chunking_reassembly`: 5 MiB tensor split into 5 chunks reassembles to identical bytes.
252
+ - `test_unknown_frame_type_closes`: receiver rejects 0xFF.
253
+ - `test_flow_control_blocks_at_window`: sender pauses at window edge, resumes on ACK.
254
+ - `test_seq_gap_closes`: injecting a missing seq forces NACK + close.
255
+
256
+ ### 7.2 Property
257
+
258
+ - Random tensor shapes and dtypes: send → receive → equal modulo dtype precision.
259
+ - Random chunk sizes that always sum to the same total: reassembly identical.
260
+
261
+ ### 7.3 Integration
262
+
263
+ - Loopback session over an in-memory WebSocket pair: send 10 tensors of varying size, verify all received, stats consistent.
264
+ - Two-process loopback: same as above but over a real localhost WSS.
265
+ - Cancellation mid-stream: sender sends half a tensor, receives BYE, no further frames sent.
266
+ - Auth failure: connect with bad token → NACK on hello.
267
+
268
+ ### 7.4 Negative
269
+
270
+ - Send to a wrong purpose → hello mismatch → close.
271
+ - Send oversized tensor (exceeds rx_buffer_bytes_max) → receiver NACKs with `tensor_too_large`.
272
+ - Corrupt frame in the middle of a tensor: receiver detects via shape inconsistency or decompression failure → close.
273
+
274
+ ---
275
+
276
+ ## 8. Cross-references
277
+
278
+ - **Phase 1 M02 Transport**: provides the underlying WebSocket (WSS, TLS, certificate pinning).
279
+ - **Phase 2 X06 WebSocket**: defines the WebSocket framing and reconnection semantics that X08 layers on.
280
+ - **Phase 2 M16 Tokens**: session tokens authorise tensor transport sessions.
281
+ - **Phase 3 M26 Distributed Inference**: the primary consumer; defines purposes like `pipeline.shard.forward`, `pipeline.shard.backward`.
282
+ - **Phase 3 X09 Conformance Suite**: includes optional `tensor_transport` section, only run when M26 is enabled.
283
+
284
+ ---
285
+
286
+ ## 9. Open questions
287
+
288
+ 1. **Per-frame encryption.** The `ENCRYPTED` flag is reserved. The use case is post-quantum hardening above TLS, or end-to-end above a federation-relay path that terminates TLS at the relay. Not in v3.0.
289
+
290
+ 2. **Adaptive compression.** Fixed zstd level 3 is fine for typical activations. Per-session adaptive level (lower for hot, higher for warm tensors) is plausible. Out of scope.
291
+
292
+ 3. **GPU-direct transport.** Activations sit in GPU memory and round-tripping through CPU memory for serialisation is wasteful. Direct GPU-to-network (NVLink/RDMA) is interesting but assumes a specific hardware topology that HearthNet doesn't have. Not in v3.0.
293
+
294
+ 4. **Multipath.** Sending tensor chunks over multiple parallel WebSocket sessions to bond bandwidth is appealing but complicates ordering. v3.0 sticks to one session.
295
+
296
+ 5. **Sequence wrap.** Practically irrelevant; correctness at wrap is asserted but not battle-tested.
297
+
298
+ 6. **Flow control on the wire.** Currently we layer flow control on top of WebSocket, which already has some. The duplication is intentional (we want app-level explicit windowing for backpressure into the inference scheduler) but worth revisiting.
299
+
300
+ ---
301
+
302
+ *Last updated: spec v3.0.*
docs/p2_p3/X09-conformance-suite.md ADDED
@@ -0,0 +1,316 @@
1
+ # X09 - Conformance Suite
2
+
3
+ **Spec version:** v3.0, *experimental*
4
+ **Depends on:** Every other module (the suite tests them); no runtime dependency in production
5
+ **Depended on by:** [M32 Protocol Standardisation](../modules/M32-protocol-standard.md)
6
+
7
+ ---
8
+
9
+ ## 1. Purpose
10
+
11
+ A black-box, implementation-agnostic test suite that defines what "HearthNet-compliant" means in practice. The suite spins up an instance of an implementation, drives it through specified interactions, observes the wire format and the capability behaviour, and produces a `ConformanceReport` (defined in M32).
12
+
13
+ Where the spec documents say "the system MUST do X", the conformance suite contains a test that observes whether the system does X. If a behaviour is described in a spec but not tested by X09, the spec wins in principle but the suite wins in practice, so we treat closing that gap as a continuous effort.
14
+
15
+ The suite is designed so that an alternate implementation (a future Go or Rust HearthNet) can be tested by the same suite. This is the entire point: it makes "interoperable" a measurable property.
16
+
17
+ ---
18
+
19
+ ## 2. Non-goals
20
+
21
+ - **Replacing per-module unit tests.** Each module ships its own unit and property tests as described in its spec. X09 sits one level higher and treats the implementation as a black box.
22
+ - **Performance benchmarks.** Conformance is correctness, not speed. A future X10 may handle benchmarks.
23
+ - **Security audits.** Out of scope. The suite includes some negative-path tests but is not a pen-test.
24
+ - **Visual / UX testing.** The web UI is exercised only via its capability-bus and HTTP API surfaces.
25
+ - **Locking in implementation detail.** Tests assert on observable behaviour (wire formats, capability responses, event log entries), never on internal state.
26
+
27
+ ---
28
+
29
+ ## 3. File layout
30
+
31
+ The suite lives at repository root, sibling to `hearthnet/` and `protocol/`:
32
+
33
+ ```
34
+ conformance/
35
+ ├── README.md
36
+ ├── VERSION                 # suite version, e.g. "1.0.0"
37
+ ├── pyproject.toml          # standalone tool, runnable without hearthnet/
38
+ ├── runner.py               # entry point: `python -m conformance.runner --target=...`
39
+ ├── report.py               # builds ConformanceReport from results
40
+ ├── harness/
41
+ │   ├── target.py           # abstraction over a "system under test" (SUT)
42
+ │   ├── docker_target.py    # SUT in a docker container
43
+ │   ├── local_target.py     # SUT on the local network at a URL
44
+ │   ├── fixtures.py         # synthetic identities, tokens, files
45
+ │   └── wire_capture.py     # records bus / WS traffic for diffing
46
+ ├── suites/
47
+ │   ├── core/
48
+ │   │   ├── identity/
49
+ │   │   ├── transport/
50
+ │   │   ├── bus/
51
+ │   │   ├── events/
52
+ │   │   ├── tokens/
53
+ │   │   ├── files/
54
+ │   │   ├── kb/
55
+ │   │   └── llm/
56
+ │   ├── services/
57
+ │   │   ├── chat/
58
+ │   │   ├── group_chat/
59
+ │   │   ├── ocr/
60
+ │   │   ├── translation/
61
+ │   │   └── stt_tts/
62
+ │   ├── federation/
63
+ │   ├── experimental/
64
+ │   │   ├── distributed_inference/
65
+ │   │   ├── moe/
66
+ │   │   ├── fedlearn/
67
+ │   │   ├── evidence/
68
+ │   │   └── civdef/
69
+ │   └── operability/
70
+ │       ├── shutdown_clean/
71
+ │       ├── restart_persistence/
72
+ │       └── observability/
73
+ └── vectors/                # test vectors: canonical inputs and expected outputs
74
+     ├── identity/
75
+     ├── tokens/
76
+     ├── federation/
77
+     └── tensor_transport/
78
+ ```
79
+
80
+ The whole `conformance/` directory is published as part of every protocol release, with the `VERSION` file aligning with the protocol's release cadence (but versioned independently; see §4.1).
81
+
82
+ ---
83
+
84
+ ## 4. Architecture
85
+
86
+ ### 4.1 Suite versioning
87
+
88
+ Suite version follows semver. `suite_version` in `ConformanceReport` is the suite that produced the report. A protocol version is paired with a *minimum* suite version that is sufficient to test it. Newer suite versions test more thoroughly; older suite versions may not exercise newer protocol features.
89
+
90
+ ### 4.2 Target abstraction
91
+
92
+ ```python
93
+ class Target(Protocol):
94
+     """A system under test (SUT). The suite never touches the SUT's internals."""
95
+     base_url: str
96
+     admin_token: AuthToken
97
+
98
+     async def start(self) -> None: ... # for managed targets like docker
99
+     async def stop(self) -> None: ...
100
+     async def reset(self) -> None: ... # blank slate (for tests that need it)
101
+     async def bus_call(self, capability: str, payload: dict) -> dict: ...
102
+     async def event_subscribe(self, types: list[str]) -> AsyncIterator[Event]: ...
103
+     async def http_get(self, path: str, headers: dict | None = None) -> Response: ...
104
+     async def http_post(self, path: str, body: bytes, headers: dict | None = None) -> Response: ...
105
+     async def ws_connect(self, path: str, subprotocol: str | None = None) -> WebSocket: ...
106
+     async def capture_wire(self) -> WireCapture: ... # for federation/tensor tests
107
+ ```
108
+
109
+ Two concrete `Target` implementations ship:
110
+
111
+ - `LocalTarget`: SUT runs as a long-lived process accessible at a known URL. Simplest; used in CI against the reference implementation.
112
+ - `DockerTarget`: SUT runs in a docker container that the suite spawns. Useful for testing alternate implementations packaged as containers.
113
+
114
+ Authors of an alternate implementation supply their own `Target` implementation if needed.
115
+
116
+ ### 4.3 Test format
117
+
118
+ Tests are plain `pytest` cases under `suites/`. They use the target as an injected fixture:
119
+
120
+ ```python
121
+ # suites/core/identity/test_node_id_format.py
122
+ import re
123
+ async def test_node_id_is_base32_no_pad(target: Target) -> None:
124
+     r = await target.bus_call("identity.self.describe", {})
125
+     node_id = r["node_id"]
126
+     assert re.fullmatch(r"[A-Z2-7]+", node_id), "NodeID must be base32 with no padding"
127
+     assert len(node_id) >= 52
128
+ ```
129
+
130
+ Each test asserts at most one *spec requirement*. The test docstring names the spec section it covers; the runner uses this to produce traceability from `SectionResult.failures` back to the relevant module spec.
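The docstring-based traceability could be implemented along these lines. The docstring convention is not specified in this document, so the sketch assumes references of the form `M14 §5.2` (module or cross-cutting ID, then a section number); `SPEC_REF` and `spec_refs` are hypothetical names.

```python
import re
from typing import Callable

# Assumed reference shape: "M01 §3.1", "X08 §4.5", ...
SPEC_REF = re.compile(r"\b([MX]\d{2})\s+§\s*(\d+(?:\.\d+)*)")


def spec_refs(test_fn: Callable[..., object]) -> list[str]:
    """Return the spec sections named in a test function's docstring."""
    doc = test_fn.__doc__ or ""
    return [f"{module} §{section}" for module, section in SPEC_REF.findall(doc)]
```

A runner could call `spec_refs` on each collected test and attach the result to the failure entry, which is what makes a failing test point back at a spec section.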
131
+
132
+ ### 4.4 Wire vectors
133
+
134
+ For wire-format tests (federation manifest, tensor transport frame, token JWS envelope) the suite carries canonical byte vectors in `vectors/`. Tests assert that:
135
+
136
+ - The SUT, given a known input, produces a byte-equal output (after canonicalisation where applicable).
137
+ - The SUT, given a known byte vector, parses it without errors and produces the expected semantic content.
138
+
139
+ This catches subtle interop bugs: the kind of "we both speak JSON-with-tiny-differences" issue that has historically killed federated systems.
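A byte-equality check after canonicalisation might be sketched as follows. The actual canonicalisation rules belong to the module specs and are not reproduced here, so sorted keys with compact separators is only an assumed stand-in; `canonical_json` and `assert_matches_vector` are hypothetical names.

```python
import json


def canonical_json(obj: object) -> bytes:
    """Stand-in canonical form: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def assert_matches_vector(sut_output: bytes, vector: bytes) -> None:
    """Compare SUT output to a stored vector after canonicalising both sides."""
    got = canonical_json(json.loads(sut_output))
    want = canonical_json(json.loads(vector))
    if got != want:
        raise AssertionError(f"wire vector mismatch:\n got  {got!r}\n want {want!r}")
```

Key order and whitespace differences disappear under this comparison, while any difference in values or structure still fails.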
140
+
141
+ ### 4.5 Report aggregation
142
+
143
+ After a run, `report.py`:
144
+
145
+ 1. Collects per-test pass/fail/skip results.
146
+ 2. Groups by suite path into a `SectionResult` per section.
147
+ 3. Computes `overall`:
148
+ - `pass` if all `core/*` and all `services/*` sections passed (experimental and operability may fail without affecting `pass`).
149
+ - `partial` if `core/*` passed but anything else failed.
150
+ - `fail` if any `core/*` test failed.
151
+ - `skipped` if no sections ran.
152
+ 4. Signs the report with the SUT's identity (the SUT signs its own report; there is no external authority).
153
+ 5. Emits `report.json` and a human-readable `report.html`.
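The `overall` rules in step 3 can be sketched as a pure function over per-section outcomes. Two assumptions are baked in and should be read as such: the rules are evaluated in the order `fail`, `skipped`, `pass`, `partial`, and a failure outside `core/*` and `services/*` does not prevent `pass` (following the parenthetical in the `pass` rule). `compute_overall` and the outcome strings are hypothetical.

```python
from typing import Literal

Outcome = Literal["pass", "fail", "skip"]
Overall = Literal["pass", "partial", "fail", "skipped"]


def compute_overall(sections: dict[str, Outcome]) -> Overall:
    """Map section path (e.g. "core/identity") -> outcome to an overall verdict."""
    ran = {path: outcome for path, outcome in sections.items() if outcome != "skip"}

    def any_failed(prefix: str) -> bool:
        return any(o == "fail" for p, o in ran.items() if p.startswith(prefix))

    if any_failed("core/"):
        return "fail"
    if not ran:
        return "skipped"
    if any_failed("services/"):
        return "partial"
    # Assumption: experimental/operability failures do not affect "pass".
    return "pass"
```

Skipped sections are dropped before any rule is applied, so a skip can never turn into a failure.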
154
+
155
+ ### 4.6 Reproducibility
156
+
157
+ Every run produces a `run_manifest.json` containing:
158
+
159
+ - Suite version, suite git commit.
160
+ - Target type and configuration (without secrets).
161
+ - Random seed (suite seeds all RNGs deterministically for reproducibility).
162
+ - Test selection (which suites/tests were run vs skipped).
163
+ - Timestamps.
164
+
165
+ Replaying with the same manifest against the same SUT version must produce equivalent results modulo timestamps.
166
+
167
+ ---
168
+
169
+ ## 5. Required sections
170
+
171
+ A claim of "HearthNet-compliant at protocol version 3.0.0" requires passing **every test** under:
172
+
173
+ - `suites/core/identity/`
174
+ - `suites/core/transport/`
175
+ - `suites/core/bus/`
176
+ - `suites/core/events/`
177
+ - `suites/core/tokens/`
178
+ - `suites/core/files/`
179
+ - `suites/core/kb/` *(minimum: ingest, query)*
180
+ - `suites/core/llm/` *(minimum: chat capability, error handling)*
181
+
182
+ Plus passing the relevant *advertised-capability* sections under `suites/services/` for any service the implementation advertises. An implementation advertising `chat.thread.*` but not running `suites/services/group_chat/` is non-compliant by omission.
183
+
184
+ Federation is required for any implementation that advertises federation; otherwise it's optional. Experimental sections are *always* optional and `partial` is a valid honest outcome.
185
+
186
+ ---
187
+
188
+ ## 6. Behaviour
189
+
190
+ ### 6.1 Pre-flight
191
+
192
+ Before running tests, the runner:
193
+
194
+ 1. Confirms `target.start()` succeeded.
195
+ 2. Calls `protocol.self_describe` and `protocol.version_list` to discover what to test.
196
+ 3. Confirms `protocol_version` returned by the SUT is compatible with the suite's supported versions; if not, fails fast with `protocol_version_unsupported`.
197
+ 4. Resets the SUT (`target.reset()`).
198
+ 5. Loads vector files into memory.
199
+
200
+ ### 6.2 Test isolation
201
+
202
+ Each test must be independent: order must not matter. Tests that need a clean slate request `target.reset()` in a fixture; tests that need shared state declare it via pytest fixtures with explicit scope.
203
+
204
+ Tests use synthetic identities and tokens generated per-test, never the real operator's keys.
205
+
206
+ ### 6.3 Graceful skipping
207
+
208
+ A test that requires a capability not advertised by the SUT is *skipped*, not failed:
209
+
210
+ ```python
211
+ @requires_capability("experimental.fedlearn.round.announce")
212
+ async def test_fedlearn_round_announce_signs_manifest(target: Target) -> None:
213
+     ...
214
+ ```
215
+
216
+ `requires_capability` queries `protocol.self_describe`. Skipped tests appear in the report as `skipped` with a reason. They never flip `overall` to `fail`.
217
+
218
+ ### 6.4 Wire capture mode
219
+
220
+ For federation and tensor-transport sections, the runner may attach a `WireCapture` to record the raw bytes flowing between two SUT instances (or between an SUT and the suite's own simulator). The captured frames are checked against vectors and against the schema documented in the relevant cross-cutting spec.
221
+
222
+ Wire-capture mode requires the operator to have configured the SUT to log raw traffic to a known location (typically a Unix socket the suite reads). For SUTs that can't expose raw traffic, the suite falls back to behavioural assertions only and notes `partial` if wire vectors couldn't be verified.
223
+
224
+ ### 6.5 Operability sections
225
+
226
+ `operability/` sections test resilience properties:
227
+
228
+ - `shutdown_clean`: send SIGTERM (or container stop); verify no events are lost, the audit chain (M31) verifies, and the SUT restarts cleanly.
229
+ - `restart_persistence`: data created before restart is queryable after restart.
230
+ - `observability`: standard event types fire as expected; X03 observability conformance.
231
+
232
+ ### 6.6 Reporting failures
233
+
234
+ Each failure records:
235
+
236
+ - Spec section reference (e.g. `M14 §5.2 canonicalisation`).
237
+ - The actual observed value or behaviour.
238
+ - The expected value or behaviour.
239
+ - A reproduction recipe (capability call + payload, or wire vector identifier).
240
+
241
+ This is what makes a `partial` report useful: the failures are debuggable.
242
+
243
+ ---
244
+
245
+ ## 7. Configuration
246
+
247
+ ```python
248
+ @dataclass(frozen=True)
249
+ class ConformanceConfig:
250
+     target_kind: Literal["local","docker","custom"] = "local"
251
+     target_url: str = "http://127.0.0.1:7900"
252
+     target_admin_token: str | None = None # acquired out-of-band
253
+     docker_image: str | None = None
254
+     suite_filter: tuple[str, ...] = () # glob patterns; empty = all required + advertised
255
+     skip_experimental: bool = False
256
+     skip_operability: bool = False
257
+     wire_capture: bool = False
258
+     output_dir: str = "./conformance-report"
259
+     parallel: int = 1 # 1 by default to avoid test-isolation surprises
260
+     seed: int = 0xC0FFEE
261
+ ```
262
+
263
+ A typical CI invocation:
264
+
265
+ ```
266
+ python -m conformance.runner \
267
+ --target=docker \
268
+ --docker-image=hearthnet:latest \
269
+ --output-dir=./report
270
+ ```
271
+
272
+ ---
273
+
274
+ ## 8. Tests of the suite itself
275
+
276
+ The suite has its own tests, kept under `conformance/tests/`:
277
+
278
+ - `test_runner_smoke`: runs the suite against the reference impl, expects `overall=pass` for `core/*`.
279
+ - `test_skip_logic`: capabilities not advertised → tests skipped, not failed.
280
+ - `test_seed_deterministic`: given a seed, two consecutive runs produce identical reports modulo timestamps.
281
+ - `test_report_schema`: generated `ConformanceReport` validates against the schema in `protocol/`.
282
+ - `test_vector_integrity`: every file in `vectors/` parses with the canonical loader.
283
+ - `test_known_partial`: with `experimental.*` disabled on the reference impl, the experimental tests skip rather than fail; verify that the `overall` calculation still produces `pass`, since experimental skips don't flip the bit.
284
+
285
+ ---
286
+
287
+ ## 9. Cross-references
288
+
289
+ - **M32 Protocol Standardisation**: consumes `ConformanceReport`; the suite is the source of "what conformance means".
290
+ - **Every module spec**: the suite's tests reference the spec section they verify.
291
+ - **X02 Event Log, X03 Observability**: operability tests assert on these.
292
+ - **X06 WebSocket, X08 Tensor Transport**: wire-capture vectors live in `vectors/`.
293
+
294
+ ---
295
+
296
+ ## 10. Open questions
297
+
298
+ 1. **Adversarial tests.** v3.0 has minimal negative-path coverage in `core/`. A future suite version with a `security/` section that probes for known classes of mistakes (auth bypass, signature reuse, event-log forgery attempts) would be valuable. Out of scope for v3.0.
299
+
300
+ 2. **Conformance for partial implementations.** The current model gates `pass` on all required `core/*` passing. A future tiered model (`HearthNet-Bronze` = identity + transport + bus only; `HearthNet-Silver` adds services; `HearthNet-Gold` adds federation) is appealing for low-resource implementations. Not in v3.0.
301
+
302
+ 3. **Differential testing.** Once two implementations exist, running them side-by-side with the same input and asserting identical observable behaviour is the strongest interop test. The harness supports this in principle (two targets), but no tests in v3.0 actually use it because only one implementation exists.
303
+
304
+ 4. **Vector generation.** Today vectors are hand-curated. Tooling to *regenerate* vectors from the reference implementation and detect drift would prevent test rot. Planned, not implemented.
305
+
306
+ 5. **Reporting hub.** A public registry that collects published conformance reports from various implementations would help users assess interop status. Out of scope for the suite itself; M32's `protocol.registry.*` capabilities are the closest current analogue.
307
+
308
+ 6. **Performance regression guardrails.** Not conformance, but obviously valuable. A separate X10 (TBD) may handle this.
309
+
310
+ 7. **Long-haul tests.** Some bugs (memory leaks, slow drifts in audit chains) appear only after hours. The suite is built for short runs; a "soak mode" with `--duration=24h` would test these. Open.
311
+
312
+ 8. **Federation interop with non-HearthNet systems.** Out of scope. The suite verifies HearthNet ↔ HearthNet federation only.
313
+
314
+ ---
315
+
316
+ *Last updated: spec v3.0.*