Commit 696222f by NERDDISCO
1 Parent(s): 41bd3fd

docs(story): move parts of the lib into web workers

  1. docs/planning/009_web_worker.md +167 -0
docs/planning/009_web_worker.md ADDED
@@ -0,0 +1,167 @@
+ # User Story 009: Web Worker Architecture (Main-thread Safe Web Library)
+
+ ## Story
+
+ **As a** user building robotics UIs that also render live camera previews and interactive controls
+ **I want** `@lerobot/web` to run heavy control/recording work off the main thread
+ **So that** my UI stays smooth (no flicker/jank) even when teleoperation and recording are active
+
+ ## Background
+
+ The current browser implementation runs teleoperation control loops, dataset assembly, and export logic on the main thread. When keyboard teleoperation is activated while a camera stream is being previewed, the preview can flicker because of main-thread contention. This is a UX blocker for real-world apps that combine live video, UI interactions, and hardware control.
+
+ A worker-based architecture lets us move CPU-intensive, frequent, or bursty work off the main thread. The main thread remains responsible for the DOM, video rendering, and user interactions. The library must preserve the existing API (`calibrate()`, `teleoperate()`, `record()`) while transparently using workers when available, and falling back cleanly to the current main-thread implementation otherwise.
+
+ ## Goals
+
+ - Identical public API to today’s `@lerobot/web` (no breaking changes)
+ - Main-thread safe by default: heavy or frequent work executes in a Web Worker
+ - Graceful fallback when workers or specific APIs aren’t available
+ - Type-safe, minimal-copy message protocol using Transferables when possible
+ - Strict library/demo separation: UI and storage remain in demos
+ - Maintain Python lerobot UX parity and behavior
+
+ ## Non-Goals (for this story)
+
+ - Changing dataset formats or the camera acquisition approach
+ - Moving Web Serial API usage into a worker (browser support for Web Serial in workers is limited)
+ - Introducing new external dependencies
+
+ ## Acceptance Criteria
+
+ - Smooth UI under load:
+   - With at least one active camera preview and keyboard teleoperation at 60–120 Hz, the preview does not flicker and the UI remains responsive at ~60 FPS
+ - API compatibility:
+   - `calibrate()`, `teleoperate()`, `record()` signatures and return shapes are unchanged
+   - Feature-detect workers; automatically use the worker-backed runtime when available, otherwise use the current main-thread runtime
+ - Clear separation of responsibilities:
+   - Worker executes control loops, interpolation, dataset assembly, export packaging, and CPU-heavy transforms
+   - Main thread owns DOM/UI and browser-only APIs that are unavailable in workers (e.g., Web Serial write calls)
+ - Type-safe protocol:
+   - Strongly typed request/response messages with versioned `type` fields; Transferable payloads used for large data
+ - Reliability & fallback:
+   - If the worker crashes or becomes unavailable, operations fail gracefully with descriptive errors and suggest a retry
+   - Fallback path (main-thread) is automatically used when worker creation fails
+ - Tests & docs:
+   - Unit tests cover protocol routing and basic round-trips
+   - Planning docs updated; README notes main-thread-safe architecture
+
+ ## Architecture Overview
+
+ ### Worker Boundaries
+
+ - Execute in Worker:
+   - Control loop scheduling and target computation for teleoperation (keyboard/direct and future teleoperators)
+   - Episode/frame buffering and interpolation (regularization) for recording
+   - Dataset assembly (tables/metadata), packaging (ZIP writer), and background export streaming
+   - Lightweight telemetry aggregation for UI
+ - Execute on Main Thread:
+   - DOM, UI, and camera previews (`<video>` elements)
+   - Web Serial API read/write bridge (if the browser does not permit worker access)
+   - MediaRecorder handling (the browser's implementation already does most of its work off the main thread in many engines)
+
+ ### Threading Model
+
+ - Main thread spawns one worker per “process” instance as needed:
+   - TeleoperationProcess → TeleopWorker
+   - RecordProcess → RecordWorker (can be shared or composed with teleop worker depending on lifecycle)
+ - The public process objects returned from `teleoperate()`/`record()` are proxies. Method calls post messages to the worker and return promises where appropriate.
+ - SerialBridge (main-thread): worker requests motor write/read; main thread performs Web Serial operations and returns results. This preserves worker advantages while respecting browser API constraints.
+
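To make the SerialBridge round-trip concrete, here is a minimal TypeScript sketch of the main-thread side. The `serial/write_position` and `serial/ack` message names come from this story's protocol list; the field layout, the `ok`/`error` fields, `handleSerialRequest`, and the `WritePosition` callback are assumptions made for illustration and are not part of the existing `@lerobot/web` API.

```typescript
// Hypothetical message shapes; only the `type` strings come from the story.
interface SerialWriteRequest {
  type: "serial/write_position";
  requestId: number;
  id: number; // motor id
  position: number; // target position
}

interface SerialAck {
  type: "serial/ack";
  requestId: number;
  ok: boolean;
  error?: string;
}

// Assumption: the main thread supplies a function that performs the actual
// Web Serial write for one motor.
type WritePosition = (id: number, position: number) => Promise<void>;

// Runs on the main thread: performs the write the worker asked for and
// builds the ack that is posted back, echoing the requestId so the worker
// can match it to the original request.
async function handleSerialRequest(
  request: SerialWriteRequest,
  write: WritePosition
): Promise<SerialAck> {
  try {
    await write(request.id, request.position);
    return { type: "serial/ack", requestId: request.requestId, ok: true };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { type: "serial/ack", requestId: request.requestId, ok: false, error };
  }
}
```

In a real bridge this function would be called from the worker's `message` handler, and the returned ack would be sent back with `worker.postMessage(ack)`. Reporting failures in the ack rather than throwing keeps serial errors visible to the worker.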
+ ### Message Protocol (Typed)
+
+ All messages include a discriminant `type` and a `requestId` when a response is expected.
+
+ - Teleoperation (examples):
+   - `teleop/start`, `teleop/stop`
+   - `teleop/update_key_state` { key, pressed }
+   - `teleop/move_motor` { motorName, position }
+   - `teleop/state_update` { motorConfigs, keyStates, lastUpdate } (worker → main)
+   - `serial/write_position` { id, position } (worker → main) → `serial/ack`
+ - Recording (examples):
+   - `record/start`, `record/stop`, `record/next_episode`
+   - `record/frame_append` { payload transferable }
+   - `record/export_zip` { options } → streaming progress events
+ - Error & lifecycle:
+   - `worker/error`, `worker/ready`, `worker/teardown`
+
+ Use Transferables (ArrayBuffer/MessagePort) for large payloads to avoid copies.
+
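One way the message list above could be typed is as a discriminated union with a runtime guard. The sketch below is an illustration under assumptions: the `type` strings are taken from the list, while the field types, the `v` version field, and the helper names `isProtocolMessage` and `transferList` are hypothetical.

```typescript
// Sketch only: message names follow the list above; field types and the
// `v` version field are assumptions, not the library's actual protocol.
const PROTOCOL_VERSION = 1;

type ProtocolMessage =
  | { v: number; type: "teleop/start"; requestId: number }
  | { v: number; type: "teleop/stop"; requestId: number }
  | { v: number; type: "teleop/update_key_state"; key: string; pressed: boolean }
  | { v: number; type: "teleop/move_motor"; requestId: number; motorName: string; position: number }
  | { v: number; type: "record/frame_append"; payload: ArrayBuffer }
  | { v: number; type: "worker/ready" }
  | { v: number; type: "worker/error"; message: string };

const KNOWN_TYPES: ReadonlySet<string> = new Set([
  "teleop/start",
  "teleop/stop",
  "teleop/update_key_state",
  "teleop/move_motor",
  "record/frame_append",
  "worker/ready",
  "worker/error",
]);

// Checks only the protocol version and the discriminant. A fuller guard
// would also validate the payload fields of each variant.
function isProtocolMessage(data: unknown): data is ProtocolMessage {
  if (typeof data !== "object" || data === null) return false;
  const candidate = data as { v?: unknown; type?: unknown };
  return (
    candidate.v === PROTOCOL_VERSION &&
    typeof candidate.type === "string" &&
    KNOWN_TYPES.has(candidate.type)
  );
}

// Buffers to pass as the transfer list so large payloads are moved, not copied.
function transferList(message: ProtocolMessage): ArrayBuffer[] {
  return message.type === "record/frame_append" ? [message.payload] : [];
}
```

A sender would call `worker.postMessage(message, transferList(message))`. After a transfer the sender's `ArrayBuffer` is detached, so the frame must not be reused on the sending side.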
+ ### File Structure (web package)
+
+ ```
+ packages/web/src/
+ ├── workers/
+ │   ├── teleop.worker.ts    # Teleoperation control loop
+ │   ├── record.worker.ts    # Recording assembly/export
+ │   ├── protocol.ts         # Message types & guards
+ │   └── utils.worker.ts     # Worker-side helpers (interpolation, zip)
+ ├── bridges/
+ │   └── serial-bridge.ts    # Main-thread serial proxy for workers
+ ├── teleoperate.ts          # Spawns worker, returns proxy process
+ ├── record.ts               # Spawns worker, returns proxy process
+ └── types/
+     └── worker.ts           # Public worker-related types (narrow)
+ ```
+
+ ### Lifecycle & Fallback
+
+ - On `teleoperate()`/`record()` call:
+   - Try to instantiate corresponding worker via `new Worker(new URL(...), { type: 'module' })`
+   - If success: wire protocol channels and return proxy-backed process
+   - If fail: fall back to current main-thread implementation (no behavioral changes)
+ - On `process.stop()` or page unload: send `worker/teardown` and terminate the worker
+
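The try-then-fall-back step above can be sketched as a small selection function. `Runtime`, `selectRuntime`, and the two factory parameters are hypothetical names used only for this illustration; the worker factory stands in for the `new Worker(new URL(...), { type: 'module' })` call plus protocol wiring.

```typescript
// Hypothetical runtime handle; the real process objects carry more methods.
interface Runtime {
  kind: "worker" | "main-thread";
  fallbackReason?: string;
  stop(): void;
}

// Picks the worker-backed runtime when possible and otherwise returns the
// main-thread runtime, recording why the fallback happened.
function selectRuntime(
  createWorkerRuntime: () => Runtime,
  createMainThreadRuntime: () => Runtime,
  workersSupported: boolean
): Runtime {
  if (!workersSupported) {
    const runtime = createMainThreadRuntime();
    runtime.fallbackReason = "Web Workers are not available";
    return runtime;
  }
  try {
    return createWorkerRuntime();
  } catch (err) {
    // Worker construction can throw synchronously (for example when the
    // page's security policy blocks it); fall back without changing behavior.
    const runtime = createMainThreadRuntime();
    runtime.fallbackReason = err instanceof Error ? err.message : String(err);
    return runtime;
  }
}
```

In a browser, `workersSupported` could be computed as `typeof Worker !== "undefined"`. Script load failures usually surface later through the worker's `error` event rather than as a thrown exception, so a fuller implementation would probably also wait for `worker/ready` before committing to the worker path.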
+ ### Performance Notes
+
+ - Generate the control loop cadence inside the worker so it does not depend on main-thread timers
+ - Batch serial commands from worker to main-thread bridge to minimize postMessage overhead
+ - Use coarse-to-fine updates: high-rate calculations in the worker; lower-rate UI state updates to the main thread (e.g., 10–20 Hz) for rendering
+ - For export, stream chunks from the worker; the main thread triggers the download or HF upload
+
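The batching and rate-limiting notes above can be sketched with two small helpers. Both classes and their names are hypothetical. The batcher also assumes that only the most recent target per motor matters within one flush interval, which is a design choice of this sketch and not something the story specifies.

```typescript
interface MotorCommand {
  id: number;
  position: number;
}

// Collects motor targets between flushes. A later target for the same motor
// overwrites the earlier one, so each flush yields at most one command per motor.
class CommandBatcher {
  private latest = new Map<number, number>();

  queue(command: MotorCommand): void {
    this.latest.set(command.id, command.position);
  }

  // Returns the pending batch (to be sent in a single postMessage) and resets.
  flush(): MotorCommand[] {
    const batch = Array.from(this.latest, ([id, position]) => ({ id, position }));
    this.latest.clear();
    return batch;
  }
}

// Lets an update through only when at least one full interval has elapsed
// since the last emitted update. Time is passed in so the caller picks the clock.
class RateLimiter {
  private lastEmitMs = -Infinity;
  private intervalMs: number;

  constructor(hz: number) {
    this.intervalMs = 1000 / hz;
  }

  shouldEmit(nowMs: number): boolean {
    if (nowMs - this.lastEmitMs < this.intervalMs) return false;
    this.lastEmitMs = nowMs;
    return true;
  }
}
```

Inside the worker, the control loop could call `queue()` on every tick, post `flush()` once per tick or per few ticks, and gate `teleop/state_update` messages with `shouldEmit(performance.now())`.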
+ ### Error Handling
+
+ - All request/response messages enforce timeouts with descriptive errors
+ - Worker initialization guarded with feature detection and clear fallback
+ - Protocol version field enables future evolution without breaking older callers
+
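Request timeouts could be enforced where requests are correlated with responses. The following is a minimal sketch, not the library's implementation: `RequestChannel`, its `post` callback (a stand-in for `worker.postMessage`), the 2000 ms default, and the error wording are all assumptions.

```typescript
interface OutgoingRequest {
  type: string;
  requestId: number;
  payload?: unknown;
}

interface PendingEntry {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
  timer: ReturnType<typeof setTimeout>;
}

// Matches responses to requests by requestId and rejects any request that
// is not answered within timeoutMs.
class RequestChannel {
  private nextId = 1;
  private pending = new Map<number, PendingEntry>();
  private post: (message: OutgoingRequest) => void;
  private timeoutMs: number;

  constructor(post: (message: OutgoingRequest) => void, timeoutMs = 2000) {
    this.post = post;
    this.timeoutMs = timeoutMs;
  }

  request(type: string, payload?: unknown): Promise<unknown> {
    const requestId = this.nextId++;
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(requestId);
        reject(new Error(`${type} (request ${requestId}) timed out after ${this.timeoutMs} ms`));
      }, this.timeoutMs);
      this.pending.set(requestId, { resolve, reject, timer });
      this.post({ type, requestId, payload });
    });
  }

  // Call from the message handler when a response arrives. Returns false for
  // unknown ids, which includes responses that arrive after the timeout.
  settle(requestId: number, result: unknown): boolean {
    const entry = this.pending.get(requestId);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(requestId);
    entry.resolve(result);
    return true;
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

A proxy method such as `stop()` could then be written as `channel.request("teleop/stop")`, with callers seeing either the worker's response or a timeout error that names the message type.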
+ ## Phased Implementation Plan
+
+ ### Phase 1: Dataset & Export Offload (Low Risk)
+
+ - Move episode interpolation, dataset assembly, and ZIP packaging to `record.worker.ts`
+ - Main thread keeps MediaRecorder and camera preview as-is
+ - Public API unchanged; verify ZIP download and HF upload via streamed messages
+
+ ### Phase 2: Teleoperation Offload with SerialBridge
+
+ - Move control loop scheduling and target computation to `teleop.worker.ts`
+ - Implement SerialBridge on main thread for Web Serial commands
+ - Worker posts motor write requests; main thread executes and responds
+ - Throttle state updates to UI while maintaining high-rate control internally
+
+ ### Phase 3: Fine-Grained Optimizations
+
+ - Introduce Transferables for large buffers
+ - Optional OffscreenCanvas pipelines for future video transforms (not required for current scope)
+ - Tune batching and message cadence under hardware testing
+
+ ### Phase 4: Reliability & Observability
+
+ - Heartbeat messages and auto-restart policy for worker failures
+ - Dev diagnostics toggles; production minimal logging
+
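The Phase 4 heartbeat and auto-restart policy could look roughly like the sketch below. The story does not define the policy, so `HeartbeatMonitor`, the timeout, and the restart cap are assumptions chosen for illustration.

```typescript
type SupervisorAction = "ok" | "restart" | "give-up";

// Tracks the time of the last heartbeat and decides what the supervising
// main thread should do. Timestamps are passed in by the caller.
class HeartbeatMonitor {
  private lastBeatMs: number;
  private restarts = 0;
  private timeoutMs: number;
  private maxRestarts: number;

  constructor(startMs: number, timeoutMs = 1000, maxRestarts = 3) {
    this.lastBeatMs = startMs;
    this.timeoutMs = timeoutMs;
    this.maxRestarts = maxRestarts;
  }

  // Call whenever a heartbeat message arrives from the worker.
  beat(nowMs: number): void {
    this.lastBeatMs = nowMs;
  }

  // Call periodically. "restart" is returned at most maxRestarts times;
  // after that a missed heartbeat yields "give-up".
  check(nowMs: number): SupervisorAction {
    if (nowMs - this.lastBeatMs <= this.timeoutMs) return "ok";
    if (this.restarts >= this.maxRestarts) return "give-up";
    this.restarts += 1;
    this.lastBeatMs = nowMs; // give the replacement worker a full timeout window
    return "restart";
  }
}
```

On "restart" the supervisor would terminate and recreate the worker; on "give-up" it could switch to the main-thread fallback and surface a descriptive error.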
+ ## Risks & Mitigations
+
+ - Web Serial availability in workers: use main-thread SerialBridge (design accounts for this)
+ - Message overhead at high Hz: batch commands and reduce UI state update frequency
+ - Browser differences: feature-detect and test on Chromium, Firefox (where supported), Safari Technology Preview
+
+ ## Definition of Done
+
+ - UI remains smooth with active camera preview and keyboard teleoperation; no flicker observed in manual tests
+ - Worker-backed runtime enabled by default when available; fallback path verified
+ - `calibrate()`, `teleoperate()`, `record()` maintain identical signatures and behavior
+ - Typed protocol implemented with Transferables where applicable
+ - Unit tests for protocol routing and error timeouts
+ - Documentation updated (this user story + README note)