# FlowMo Experiment Protocol

This document is the single paper-facing record for the public FlowMo experiments. Regenerated artifacts should overwrite the files at their existing paths rather than introduce version-suffixed copies.

## Scope

The project has two formal comparison groups.

### A. Learned World Models

Purpose: compare image-input world-model architectures under the same simulator data, optimizer budget, rollout target, parameter scale, and planning interface.

| Directory | Report Name | Architecture | Comparison Purpose |
|---|---|---|---|
| `flowmo` | FlowMo | Shared image encoder; short object-motion state encoder; long strided ambient-drift context encoder; base transition plus zero-context residual | Proposed flow-momentum factorization. Tests whether separating endogenous motion from exogenous drift improves prediction and planning. |
| `leworldmodel` | LeWorldModel | JEPA-style image-latent predictor with action-conditioned residual transition | Tests whether simple current-image latent prediction is sufficient. |
| `planet` | RSSM | Recurrent state-space latent model with deterministic memory and stochastic latent state | Tests whether generic recurrent memory can absorb momentum and drift without a separate context factor. |
| `tdmpc2` | TD-MPC2 Dynamics | Compact action-conditioned latent dynamics with shared image encoder and rollout heads | Tests task-oriented latent dynamics under equal supervision. |

All learned methods receive clean top-down RGB boat images and action history. They do not receive flow labels, flow arrows, velocity vectors, trajectory overlays, or goal markers in the image.

### B. Traditional Non-WM Controllers

Purpose: compare downstream behavior against non-neural controllers that do not train a world model.

| Directory | Report Name | Input | Comparison Purpose |
|---|---|---|---|
| `pid_los_controller` | PID/LOS | Clean image pose estimate | Simple hand-designed waypoint tracking baseline. |
| `physics_mpc_no_flow` | Physics MPC No-Flow | Clean image pose estimate | Measures the cost of ignoring ambient current. |
| `current_estimator_mpc` | Current-Estimator MPC | Clean image pose estimate and recent drift | Strong classical current-compensation baseline. |
| `oracle_flow_mpc` | Oracle-Flow MPC | Clean image pose estimate and simulator local flow | Reference bound for control when true local flow is available. |

## Data

All methods use the same splits:

```text
train: data/paper/train.npz
unseen_flow_test: data/paper/test_unseen_flow.npz
unseen_boat_dynamics_test: data/paper/test_unseen_boat_params.npz
seen_flow_diagnostic: data/paper/diagnostic_seen_flow.npz
dataset_card: data/paper/dataset_card.md
generation_config: data/paper/generation_config.json
```

Observation protocol:

```text
image_size: 160 x 160
visual_scale: 2.5
rendering: online clean top-down RGB images
forbidden cues: flow arrows, velocity vectors, trajectory overlays, goal markers
```

Training budget:

```text
train_episodes: 2400
test_episodes: 480
train_windows: 393216
test_windows: 24576
batch_size: 256
steps: 20000
checkpoint_interval: 2000
num_workers: 4
render_mode: device
```
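The budget above implies a few derived quantities that are useful when comparing runs. The arithmetic below is a minimal sketch using only the listed values; it assumes one window per batch element and a checkpoint at every multiple of `checkpoint_interval`, neither of which is stated explicitly in this protocol.

```python
# Values copied from the training budget above.
train_windows = 393_216
batch_size = 256
steps = 20_000
checkpoint_interval = 2_000

# Assumption: each optimizer step consumes batch_size training windows.
samples_seen = steps * batch_size
passes_over_train = samples_seen / train_windows  # approximate number of epochs
steps_per_pass = train_windows // batch_size      # optimizer steps per full pass

# Assumption: a checkpoint is written at every multiple of checkpoint_interval.
intermediate_checkpoints = steps // checkpoint_interval

print(samples_seen, round(passes_over_train, 2), steps_per_pass, intermediate_checkpoints)
```

Under these assumptions each method sees roughly 13 passes over the training windows and writes 10 step checkpoints.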

Precision policy:

```text
training: bf16 model autocast, fp32 losses and metrics
prediction_eval: bf16 model autocast, fp32 metrics
planning_eval: fp32
```

The precision split is intentional: BF16 speeds up image encoding and latent rollout on the RTX 5090 without measurable short-run loss drift, while CEM planning is dominated by small control tensors and did not improve under BF16.

## Prediction Evaluation

Datasets:

```text
test_unseen_flow
test_unseen_boat_params
diagnostic_seen_flow
```

Metrics:

```text
pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
heading@20, heading@60
zero-action drift prediction error
no-flow momentum decay prediction error
same-action different-flow prediction error
```
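As an illustration of how the horizon-indexed metrics could be computed, the sketch below assumes that `pos@k` is the Euclidean distance between predicted and true position at rollout step `k`, and that `heading@k` is the absolute heading difference at step `k`, wrapped so that angles near `+pi` and `-pi` count as close. The function names and the use of radians are assumptions, not the project's actual implementation.

```python
import math


def pos_error_at(pred_xy, true_xy, k):
    """Euclidean position error at rollout step k (k is 1-indexed)."""
    (px, py), (tx, ty) = pred_xy[k - 1], true_xy[k - 1]
    return math.hypot(px - tx, py - ty)


def heading_error_at(pred_heading, true_heading, k):
    """Absolute heading error at step k in radians, wrapped to [0, pi]."""
    diff = pred_heading[k - 1] - true_heading[k - 1]
    # atan2(sin, cos) maps any angle difference into (-pi, pi].
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


def mean_over_windows(values):
    """Average a per-window metric over all evaluation windows."""
    return sum(values) / len(values)


# Toy rollout of two steps: the prediction is off by (3, 4) at step 2.
pred = [(0.0, 0.0), (3.0, 4.0)]
true = [(0.0, 0.0), (0.0, 0.0)]
print(pos_error_at(pred, true, 2))
```

Wrapping the heading difference avoids reporting an error near `2*pi` when the predicted and true headings sit on opposite sides of the `+pi`/`-pi` boundary.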

FlowMo-only context diagnostics:

| Diagnostic | Operation | Evidence Sought |
|---|---|---|
| Inferred context | Normal rollout with inferred `c_t` | Best prediction under flow. |
| Zero context | Set `c_t=0` | Degraded prediction under flow; little change under no-flow. |
| Shuffled context | Use context from another episode | Worse rollout when hidden flow differs. |
| Same-flow transfer | Use context from another episode with the same hidden flow | Better transfer than wrong-flow context. |
| Context norm | Compare no-flow and flow `||c_t||` | Flow context should be larger than no-flow context. |
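The context substitutions in the table can be expressed as donor-index maps: for each episode, which episode's context is fed into the rollout. The sketch below is illustrative only; the helper names and the per-episode `flow_ids` labels are hypothetical, and the shuffle shown does not check whether the donor's hidden flow actually differs.

```python
import math
import random


def shuffled_donors(n, rng):
    """Donor index per episode such that no episode keeps its own context."""
    if n < 2:
        raise ValueError("need at least two episodes to shuffle contexts")
    offset = rng.randrange(1, n)  # non-zero cyclic shift, so donor != recipient
    return [(i + offset) % n for i in range(n)]


def same_flow_donors(flow_ids):
    """Donor index per episode drawn from another episode with the same flow id.

    Episodes whose flow id is unique get None (no valid donor).
    """
    groups = {}
    for i, flow in enumerate(flow_ids):
        groups.setdefault(flow, []).append(i)
    donors = [None] * len(flow_ids)
    for members in groups.values():
        if len(members) < 2:
            continue
        for pos, i in enumerate(members):
            donors[i] = members[(pos + 1) % len(members)]
    return donors


def context_norm(c):
    """L2 norm of one context vector, used for the context-norm diagnostic."""
    return math.sqrt(sum(x * x for x in c))


print(shuffled_donors(5, random.Random(0)))
print(same_flow_donors(["a", "b", "a", "c", "b"]))
```

The zero-context diagnostic needs no donor map: the context vector is simply replaced by zeros of the same shape.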

FlowMo latent probes:

```text
Train frozen linear probes from z_t, c_t, and [z_t,c_t].
Targets: object momentum (vx, vy, omega), local flow vector, episode drift vector.
Purpose: verify which latent carries object-motion information and which latent carries ambient-drift information.
```

## Planning Evaluation

Learned WM planners:

```text
flowmo
leworldmodel
planet
tdmpc2
```

All learned world models use the same route-aware CEM planner over their latent rollouts.
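For readers unfamiliar with CEM, the following is a generic cross-entropy-method sketch, not the project's route-aware planner. It assumes a scalar action per step clipped to `[-1, 1]`, and a toy 1-D drift model stands in for a learned latent rollout; `cem_plan`, `toy_cost`, and all hyperparameters are hypothetical.

```python
import random


def cem_plan(rollout_cost, horizon, rng, population=64, elites=8, iterations=5):
    """Return the lowest-cost action sequence sampled by the cross-entropy method."""
    mean = [0.0] * horizon
    std = [1.0] * horizon
    best_cost, best_plan = float("inf"), list(mean)
    for _ in range(iterations):
        # Sample candidate action sequences, clipped to the action range [-1, 1].
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(population)
        ]
        elite = sorted(candidates, key=rollout_cost)[:elites]
        top_cost = rollout_cost(elite[0])
        if top_cost < best_cost:
            best_cost, best_plan = top_cost, elite[0]
        # Refit the per-step sampling distribution to the elite set.
        for t in range(horizon):
            column = [seq[t] for seq in elite]
            mean[t] = sum(column) / elites
            var = sum((a - mean[t]) ** 2 for a in column) / elites
            std[t] = max(var ** 0.5, 1e-3)  # keep a small floor on exploration
    return best_plan


def toy_cost(actions, start=0.0, goal=3.0, drift=-0.2):
    """Toy stand-in for a model rollout: 1-D position with constant drift."""
    x = start
    for a in actions:
        x += a + drift
    return abs(x - goal)


plan = cem_plan(toy_cost, horizon=5, rng=random.Random(0))
print(toy_cost(plan), toy_cost([0.0] * 5))
```

Because every learned model is scored through the same `rollout_cost` interface in this sketch, swapping the model changes only the cost function, which mirrors the shared-planner comparison described above.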

Traditional non-WM controllers:

```text
pid_los_controller
physics_mpc_no_flow
current_estimator_mpc
oracle_flow_mpc
```

Tasks:

```text
reach_uniform
counterflow
station_keeping
passive_to_active
waypoint_square
waypoint_zigzag
```

Boats:

```text
twin
triangle
```

Metrics:

```text
success rate
final distance
trajectory length over successful episodes
energy / thrust work over successful episodes
time to goal over successful episodes
```
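The distinction between metrics averaged over all episodes and metrics averaged over successful episodes only can be made concrete with a small aggregation sketch. The per-episode record layout and field names below are hypothetical; only the averaging rule follows the metric list above.

```python
def summarize(episodes):
    """Aggregate per-episode planning results into the reported metrics."""
    if not episodes:
        raise ValueError("no episodes to summarize")
    successes = [e for e in episodes if e["success"]]

    def mean(key, rows):
        # None signals that no episode qualified (e.g. zero successes).
        return sum(r[key] for r in rows) / len(rows) if rows else None

    return {
        "success_rate": len(successes) / len(episodes),
        "final_distance": mean("final_distance", episodes),          # all episodes
        "trajectory_length": mean("trajectory_length", successes),   # successes only
        "energy": mean("energy", successes),                         # successes only
        "time_to_goal": mean("time_to_goal", successes),             # successes only
    }


episodes = [
    {"success": True, "final_distance": 0.5, "trajectory_length": 10.0,
     "energy": 4.0, "time_to_goal": 20.0},
    {"success": False, "final_distance": 2.5, "trajectory_length": 30.0,
     "energy": 9.0, "time_to_goal": 60.0},
]
summary = summarize(episodes)
print(summary)
```

Restricting length, energy, and time to successful episodes keeps failed runs from distorting efficiency numbers, so these three metrics should always be read alongside the success rate.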

## Required Outputs

Training outputs:

```text
experiments/<method>/checkpoint/paper.pt
experiments/<method>/checkpoint/paper_step_*.pt
experiments/<method>/result/parameter_count.json
experiments/<method>/result/paper_training.json
experiments/<method>/result/paper_training_trace.jsonl
```

Evaluation outputs:

```text
experiments/reports/paper_prediction_unseen_flow.json
experiments/reports/paper_prediction_unseen_boat_params.json
experiments/reports/paper_prediction_seen_flow_diagnostic.json
experiments/reports/paper_flowmo_latent_probes.json
experiments/reports/paper_planning/*.json
experiments/reports/paper_planning/gifs/*.gif
experiments/reports/paper_report.md
```
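A quick way to confirm that a run produced everything listed above is to glob for each expected path. This is a minimal sketch, not part of the repository; it assumes `<method>` ranges over the four learned world-model directories and that the check runs from the repository root.

```python
from pathlib import Path

# Assumption: training outputs exist only for the learned world models.
METHODS = ["flowmo", "leworldmodel", "planet", "tdmpc2"]

TRAINING_PATTERNS = [
    "experiments/{method}/checkpoint/paper.pt",
    "experiments/{method}/checkpoint/paper_step_*.pt",
    "experiments/{method}/result/parameter_count.json",
    "experiments/{method}/result/paper_training.json",
    "experiments/{method}/result/paper_training_trace.jsonl",
]

EVALUATION_PATTERNS = [
    "experiments/reports/paper_prediction_unseen_flow.json",
    "experiments/reports/paper_prediction_unseen_boat_params.json",
    "experiments/reports/paper_prediction_seen_flow_diagnostic.json",
    "experiments/reports/paper_flowmo_latent_probes.json",
    "experiments/reports/paper_planning/*.json",
    "experiments/reports/paper_planning/gifs/*.gif",
    "experiments/reports/paper_report.md",
]


def missing_outputs(root="."):
    """Return every expected pattern that matches no file under root."""
    root = Path(root)
    patterns = [p.format(method=m) for m in METHODS for p in TRAINING_PATTERNS]
    patterns += EVALUATION_PATTERNS
    return [p for p in patterns if not any(root.glob(p))]


for pattern in missing_outputs():
    print("missing:", pattern)
```

An empty result means each required path or wildcard matched at least one file; it does not validate the file contents.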

## Commands

Run the complete paper pipeline:

```bash
python -m experiments.run_paper_image_pipeline
```

Run stages separately:

```bash
python -m experiments.run_paper_image_pipeline --stages train
python -m experiments.run_paper_image_pipeline --stages prediction
python -m experiments.run_paper_image_pipeline --stages probe
python -m experiments.run_paper_image_pipeline --stages planning
python -m experiments.run_paper_image_pipeline --stages report
```

Run tests:

```bash
python -m pytest -q
```