RemiFabre

Marionette Startup Optimization: Current State & Ideas

February 2026

What We're Measuring

We measure from the moment the daemon spawns python -u -m marionette.main to the moment the robot's head starts moving (the user's first sign that the app is alive). The sound starts 0.4s after the head moves; that 0.4s is fine and not something we want to optimize.

Current time to first head movement: ~7.3 seconds on the CM4.


The Two Blocks of Waiting

Block 1: Python Imports (3.48s)

When Python starts the Marionette app, it executes every import statement at the top of main.py before anything else runs. Each import loads a library into memory. Some libraries are big and slow to load.

| Import | Time | What it is |
| --- | --- | --- |
| reachy_mini | 1.91s | The robot SDK. The biggest one because reachy_mini.py itself imports cv2 (camera library, 0.3s), scipy (math library, 0.5s), zenoh (communication protocol, 0.2s), and more, even though Marionette doesn't use the camera or the scipy-based math. |
| fastapi | 1.22s | The web framework that serves the UI. It also pulls in Starlette (the ASGI toolkit underneath), anyio (async), etc. |
| numpy | 0.38s | Math/array library. Required by everything. |
| huggingface_hub | 0.31s | For uploading/downloading datasets. |
| Everything else | ~0.27s | pydantic, soundfile, etc. |

Nothing runs during this time. Python is just reading library code into memory. No robot communication, no audio, nothing. The app can't do anything until all imports finish because the code that does things (creating the Marionette class, connecting to the robot) is defined using types and functions from these imports.
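
The per-import numbers above can be reproduced with a small timer. This is a sketch (time_import is a hypothetical helper, not part of the codebase), and on a desktop machine the times will be much smaller than on the CM4:

```python
import importlib
import time

def time_import(module_name: str) -> float:
    """Wall-clock seconds to import a module (near 0 if already cached)."""
    t0 = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - t0

# On the robot you would time the heavy modules, e.g.:
#   for mod in ("reachy_mini", "fastapi", "numpy", "huggingface_hub"):
#       print(f"{mod}: {time_import(mod):.2f}s")
```

CPython's built-in flag gives a finer breakdown per sub-module: python -X importtime -c "import reachy_mini" is how you see that cv2 and scipy dominate the reachy_mini total.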

Block 2: Creating the ReachyMini SDK Instance (3.42s)

After imports, Marionette creates its own object (__init__, 0.37s, which is fast), and then wrapped_run() is called. This is a method from the base class ReachyMiniApp (in the reachy_mini SDK). Here's what happens inside, step by step:

Step 1: Start the web server (~0.4s). Uvicorn (the HTTP server) starts in a background thread so your browser can reach http://robot:8042. This happens in parallel with the rest, so it doesn't add to the total.

Step 2: Create the ReachyMini object (~3s). This is the SDK's main class that lets you control the robot. Creating it does several things in sequence:

ReachyMini.__init__():
│
├─ daemon_check()                    ~0.1s
│  Scans all running processes to verify the daemon is running.
│
├─ ZenohClient()                     ~0.3s
│  Opens a Zenoh session (network connection to the daemon).
│  Creates subscribers for joint positions, head pose, status, etc.
│
├─ wait_for_connection()             ~0.5–2.0s  ← biggest variable
│  BLOCKS until the daemon has sent at least one joint position
│  update AND one head pose update over Zenoh. This proves the
│  daemon is alive and the robot is responding.
│  Polls by sleeping 1 second, checking, sleeping 1 second, etc.
│
├─ get_status()                      ~0.1s
│  Reads the daemon status (is it wireless? simulation? what's the IP?).
│  Usually instant because the status arrived during wait_for_connection.
│
└─ MediaManager()                    ~0.5–1.0s
   Initializes the audio (and optionally camera) system.
   ├─ Determines which backend to use (GStreamer on wireless)
   ├─ Gst.init()           ~0.1–0.5s  Starts the GStreamer runtime
   ├─ DeviceMonitor        ~0.1–0.3s  Scans for audio hardware (mic, speaker)
   ├─ Pipeline build       ~0.05s     Creates record + playback pipelines
   └─ GLib MainLoop        Spawns a background thread for GStreamer events

Only after all of this completes does run() get called, which starts the animation (head up + sound).
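
A minimal version of the kind of instrumentation that produces per-step timings like the ones above (a sketch, not the actual [BOOT] code in main.py):

```python
import time
from contextlib import contextmanager

@contextmanager
def boot_timer(label: str):
    """Print a [BOOT] line with the wall-clock time of the wrapped step."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        print(f"[BOOT] {label}: {time.perf_counter() - t0:.2f}s")

# Usage, wrapping each step of the sequence above:
#   with boot_timer("ReachyMini()"):
#       reachy = ReachyMini()
```

Because it uses try/finally, the line is printed even if the wrapped step raises, which is handy when a startup step hangs or fails.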

Why Can't We Just Play the Sound Right Away?

Because the sound has to go through the robot's audio system, which is managed by the ReachyMini SDK. Here's the chain:

Marionette wants to play a sound
  → calls reachy_mini.media.push_audio_sample()
    → which pushes data into a GStreamer playback pipeline
      → which outputs to the robot's speaker via ALSA

You can't call push_audio_sample() until the MediaManager exists, and the MediaManager can't be created until we know whether we're on a Wireless (GStreamer) or Lite (SoundDevice) robot, and we can't know that until we've connected to the daemon via Zenoh.

Similarly, moving the head requires reachy_mini.goto_target(), which sends commands through the Zenoh client, which must be connected first.

So the answer is: yes, it's the ReachyMini SDK instance creation that blocks everything, and we need it because both audio and motor control go through it.

Could We Pre-Load It?

The daemon launches apps as subprocesses on demand. There's no "pre-warm" mechanism β€” each time you click "Start Marionette" in the dashboard, it runs python -u -m marionette.main from scratch.

To pre-load, the daemon would need to either:

  • Keep a Python process with all imports already done, waiting for a "go" signal (complex, high memory)
  • Or cache compiled bytecode (Python already does this with .pyc files, but the actual import time is dominated by executing module-level code, not reading files)
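
The first option (a warm process waiting for a "go" signal) could be sketched like this; prewarmed_main and the control-stream protocol are hypothetical, purely to illustrate why it's workable but adds complexity and standing memory cost:

```python
def prewarmed_main(run_app, control_stream):
    """Idle until a 'go' line arrives on control_stream, then start the app.

    In a real pre-warm setup, all heavy imports would already have been
    paid when this process was launched; run_app would be the app entry
    point, and control_stream would be sys.stdin fed by the daemon.
    """
    for line in control_stream:
        if line.strip() == "go":
            run_app()
            break
```

The daemon would have to launch, supervise, and restart such a process per app, which is the "complex, high memory" part.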

This is why the realistic approach is to make each step faster or run steps in parallel, rather than trying to skip them.


Optimization Ideas β€” Explained

Idea 1: Lazy Imports in reachy_mini.py

What "lazy import" means: Instead of loading a library at the top of the file (which happens immediately when the file is imported), you load it inside the function that actually uses it:

# BEFORE - imported at the top of reachy_mini.py, always loaded:
import cv2
from scipy.spatial.transform import Rotation as R

class ReachyMini:
    def look_at_image(self, u, v, ...):
        points = np.array([[[u, v]]], dtype=np.float32)
        x_n, y_n = cv2.undistortPoints(points, ...)  # uses cv2
        ...

    def wake_up(self):
        pose[:3, :3] = R.from_euler("xyz", [20, 0, 0], degrees=True).as_matrix()  # uses scipy
        ...

# AFTER - imported only when needed:
class ReachyMini:
    def look_at_image(self, u, v, ...):
        import cv2  # loaded here, only if this method is called
        points = np.array([[[u, v]]], dtype=np.float32)
        x_n, y_n = cv2.undistortPoints(points, ...)
        ...

    def wake_up(self):
        from scipy.spatial.transform import Rotation as R  # loaded here
        pose[:3, :3] = R.from_euler("xyz", [20, 0, 0], degrees=True).as_matrix()
        ...

"Isn't it annoying to make sure we didn't miss anything?"

Not really, because:

  1. Python has a simple rule: if you use a name that isn't imported, you get a NameError immediately. So if you miss a usage, it crashes on the first call; it doesn't silently break.
  2. The scope is small. In reachy_mini.py, cv2 is only used in look_at_image() (one method). scipy.spatial.transform.Rotation is used in look_at_image(), look_at_world(), wake_up(), and goto_sleep() (four methods). You grep for cv2 and R.from_, move the import into each method, done.
  3. Python caches imports. The first call to import cv2 inside a function takes 0.3s. Every subsequent call in the same process takes ~0, because Python just returns the cached module. So there's no performance penalty for having the import line in multiple methods.

Why it helps Marionette: Marionette never calls look_at_image() or look_at_world(). It does call goto_sleep() indirectly (via _goto_sleep_and_release), but only after the app is already running. So the 0.3s (cv2) + 0.5s (scipy) would not be paid at startup at all.

Estimated savings: 0.8s from the 1.91s reachy_mini import, bringing it down to ~1.1s.


Idea 2: Reduce Zenoh Poll Interval

What Zenoh is: Zenoh is a communication protocol (like MQTT or ROS topics). The daemon publishes robot state (joint positions, head pose) on Zenoh topics. The app subscribes to those topics to receive updates.

What happens during wait_for_connection(): After opening a Zenoh session and subscribing, the app needs to verify the daemon is alive and sending data. It does this by waiting for two events:

  • "I received at least one joint position update"
  • "I received at least one head pose update"

The current code checks these events in a loop:

while time.time() - start < timeout:          # timeout = 5 seconds
    if joint_received.is_set() and head_received.is_set():
        break
    time.sleep(1.0)   # ← sleeps for 1 FULL SECOND between checks

The problem: The daemon might respond in 0.1 seconds, but the app won't notice until it wakes up from its 1-second sleep. In the worst case, this adds almost 1 full second of pure waiting-for-nothing.

The fix: Change time.sleep(1.0) to time.sleep(0.1), checking 10 times per second instead of once. This way, as soon as the daemon responds, the app notices within 0.1s instead of up to 1.0s.

Why is the current code sleeping 1 second? Probably just a conservative default: it's not doing anything useful during that sleep, and checking more frequently costs essentially nothing (it's just reading a boolean flag).

Estimated savings: 0–0.9s (depends on how quickly the daemon responds relative to the sleep cycle; ~0.45s on average).
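
Assuming joint_received and head_received are threading.Event objects (as the loop suggests), an even cleaner fix is to block on the events directly: Event.wait(timeout) returns the moment the flag is set, with no polling at all. A sketch under that assumption:

```python
import threading
import time

def wait_for_connection(joint_received: threading.Event,
                        head_received: threading.Event,
                        timeout: float = 5.0) -> bool:
    """Return True once both events are set, False if the deadline passes."""
    deadline = time.monotonic() + timeout
    for event in (joint_received, head_received):
        # wait() blocks only for the time remaining on the shared deadline
        remaining = max(deadline - time.monotonic(), 0.0)
        if not event.wait(remaining):
            return False
    return True
```

Waiting on the events one after the other is fine here: both must be set before returning, so the total wait is still bounded by the single shared deadline.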


Idea 3: Parallelize Zenoh Connection + MediaManager Init

The problem: Currently, ReachyMini.__init__() does everything one step at a time:

Step 1: Connect to daemon via Zenoh        (0.5–2.0s)  ──── sequential ────
Step 2: Wait for daemon status             (0.1s)      ──── sequential ────
Step 3: Initialize MediaManager/GStreamer  (0.5–1.0s)  ──── sequential ────
                                                    Total: 1.1–3.1s

But steps 1 and 3 don't fully depend on each other:

  • The Zenoh connection is about talking to the daemon to control motors and read sensors.
  • The GStreamer init is about setting up the local audio hardware (microphone, speaker, pipelines).

The only dependency is that MediaManager needs to know which backend to use (GStreamer vs SoundDevice vs WebRTC), which comes from the daemon status. But the actual heavy work (calling Gst.init(), scanning for audio devices, building pipelines) is purely local.

The idea: Split MediaManager init into two parts:

  1. Backend selection (needs daemon status, so must wait for Zenoh): "We're on a Wireless robot, use GStreamer"
  2. Actual initialization (local work, can run in parallel): "Initialize GStreamer, find devices, build pipelines"

Start part 2 in a background thread as soon as part 1 is decided, while the Zenoh connection continues. The MediaManager becomes usable once both the Zenoh connection and the background init are done, i.e. whichever finishes last.

Step 1: Connect to daemon via Zenoh        (0.5–2.0s)  ─┐
Step 2: Daemon status arrives → decide backend          ├── parallel
Step 3: Initialize GStreamer (background)  (0.5–1.0s)  ─┘
                                                    Total: max(step1, step3) ≈ 0.5–2.0s
                                                    Savings: 0.5–1.0s

Estimated savings: 0.5–1.0s, because the GStreamer init would happen during the Zenoh wait instead of after it.
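
The split could look roughly like this. It's a sketch: ParallelMediaInit and the injected init_media_backend callable are hypothetical stand-ins for the SDK's MediaManager internals:

```python
import threading

class ParallelMediaInit:
    """Run the local media init in the background once the backend is known."""

    def __init__(self, init_media_backend):
        self._init = init_media_backend   # the heavy, purely local work
        self._ready = threading.Event()

    def on_status(self, backend: str) -> None:
        """Call as soon as the daemon status arrives (mid-connection)."""
        threading.Thread(
            target=self._run, args=(backend,), daemon=True
        ).start()

    def _run(self, backend: str) -> None:
        self._init(backend)               # Gst.init, device scan, pipelines
        self._ready.set()

    def wait_until_ready(self, timeout: float) -> bool:
        """Block until the background init finishes (or the timeout expires)."""
        return self._ready.wait(timeout)
```

In this shape, ReachyMini.__init__ would call on_status() from its status callback, keep waiting on Zenoh, then call wait_until_ready() just before returning, so the total cost is whichever of the two paths is slower.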


Idea 4: Lazy-Import FastAPI

Same concept as idea 1 but for FastAPI (1.22s import time). Currently imported at the top of main.py:

from fastapi import HTTPException, UploadFile, File  # 1.22s

This can't be trivially lazy-imported in Marionette's main.py because the BaseModel classes, route decorators, etc. reference FastAPI types at class definition time. However, it could be lazy-imported in the reachy_mini SDK's app.py, where FastAPI is first used:

# Currently at top of app.py:
from fastapi import FastAPI

# Could become:
def __init__(self):
    if self.custom_app_url and not self.dont_start_webserver:
        from fastapi import FastAPI  # only imported here
        self.settings_app = FastAPI()

This doesn't save wall-clock time for Marionette specifically (because Marionette's own main.py also imports from FastAPI), but it would help other apps that don't use FastAPI and it shows the general pattern.

Estimated savings: 0s for Marionette specifically, but up to 1.2s for simpler apps.


Summary: What We Can Actually Do

| What | Where | Savings | Effort |
| --- | --- | --- | --- |
| Lazy cv2/scipy in reachy_mini.py | reachy_mini repo | ~0.8s | Small (grep + move imports into 4 methods) |
| Faster Zenoh poll (1.0s → 0.1s) | reachy_mini repo | ~0.5s avg | Trivial (one-line change) |
| Parallelize Zenoh + GStreamer | reachy_mini repo | ~0.5–1.0s | Medium (threading in SDK init) |
| Total realistic savings | | ~1.8–2.3s | |
| New time to first movement | | ~5.0–5.5s | |

The remaining ~5s floor is: Python startup + irreducible imports (numpy, zenoh, fastapi, huggingface_hub) + actual Zenoh network round-trip + bare minimum GStreamer init. These can't be optimized without fundamental architecture changes (pre-launched processes, compiled extensions, etc.).


How to Verify

After any change, deploy to the CM4 and check the [BOOT] lines:

./deploy_wireless.sh
ssh reachy journalctl -u reachy-mini-daemon -f | grep BOOT

The [BOOT] instrumentation is already in main.py and will show timing for each phase.