Paper Task Plan

This is the execution plan for the public FlowMo experiments. The plan has two parts: Part A evaluates learned world models (WMs) directly, and Part B evaluates downstream control behavior against traditional non-WM reference controllers.

A. Learned World Models

Purpose: test whether the FlowMo world-model architecture improves image-based prediction under hidden flow, boat momentum, actuator delay, and drag.

Shared setup:

Input: clean top-down boat images plus action history
No image cues: no flow arrows, no velocity vector, no goal marker
Training data: data/paper/train.npz
Evaluation data: data/paper/test.npz
Flow families: noflow, uniform, vortex_center, double_gyre, source_sink, source_sink_pair, gradient, shear, turbulent_patch, random_fourier
All flow fields are static. Localized flow structures are sampled near common task routes so the boat encounters non-uniform current during rollout.
Training budget: identical optimizer, batch size, rollout horizon, step count, and checkpoint schedule across all methods
Training precision: BF16 model autocast, FP32 losses and metrics
Prediction precision: BF16 model autocast, FP32 metrics
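
One way to keep the shared setup identical across methods is to pin it in a single frozen configuration object. The sketch below is only an illustration: the class name `SharedSetup` and all field names are hypothetical, and the optimizer, batch size, rollout horizon, and step count are omitted because this plan does not state their values.

```python
from dataclasses import dataclass

# Flow families listed in the plan (all fields are static).
FLOW_FAMILIES = (
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
)


@dataclass(frozen=True)
class SharedSetup:
    """Hypothetical container for settings shared by all part A methods."""
    train_path: str = "data/paper/train.npz"
    eval_path: str = "data/paper/test.npz"
    flow_families: tuple = FLOW_FAMILIES
    model_autocast_dtype: str = "bfloat16"  # BF16 autocast for the model
    metric_dtype: str = "float32"           # FP32 for losses and metrics
    # Image cues that must not appear in the input frames.
    forbidden_cues: tuple = ("flow_arrows", "velocity_vector", "goal_marker")


setup = SharedSetup()
print(len(setup.flow_families), setup.model_autocast_dtype, setup.metric_dtype)
```

Freezing the dataclass means a method-specific script cannot silently change a shared setting after construction.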

Compared methods:

Method	Purpose
`flowmo`	Proposed flow-momentum WM. Tests explicit separation of short object-motion state and long ambient-drift context.
`leworldmodel`	JEPA-style latent predictor. Tests whether a simple current-image latent transition is sufficient.
`planet`	RSSM recurrent state-space WM. Tests whether generic recurrent latent memory can absorb momentum and flow effects without FlowMo's explicit context.
`tdmpc2`	Compact latent-dynamics WM. Tests whether a task-oriented latent transition architecture matches FlowMo under the same rollout supervision.

Primary A metrics:

pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
heading@20, heading@60
zero-action drift prediction error
no-flow momentum decay prediction error
same-action different-flow prediction error
FlowMo context ablation: prediction error with inferred context vs. zeroed context (c=0) vs. shuffled context
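
The horizon-indexed metrics above could be computed along the following lines. This is a minimal sketch under stated assumptions: the plan does not define `pos@k` or `heading@k` precisely, so the code assumes `pos@k` is the Euclidean position error at rollout step k and `heading@k` is the absolute heading error in radians, wrapped to [0, pi], each averaged over episodes. The function names are hypothetical.

```python
import math


def pos_error_at(pred_xy, true_xy, k):
    """Euclidean position error at rollout step k (1-indexed)."""
    (px, py), (tx, ty) = pred_xy[k - 1], true_xy[k - 1]
    return math.hypot(px - tx, py - ty)


def heading_error_at(pred_heading, true_heading, k):
    """Absolute heading error at step k in radians, wrapped to [0, pi]."""
    diff = pred_heading[k - 1] - true_heading[k - 1]
    # atan2(sin, cos) maps any angle difference into (-pi, pi].
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


def mean_metric(metric, preds, trues, k):
    """Average a per-episode metric over a list of episodes."""
    values = [metric(p, t, k) for p, t in zip(preds, trues)]
    return sum(values) / len(values)


# Toy two-step rollout: prediction ends 3 units east and 4 units north of truth.
print(pos_error_at([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)], 2))  # 5.0
# Headings of +3.0 rad and -3.0 rad are close on the circle, not 6 rad apart.
print(heading_error_at([0.0, 3.0], [0.0, -3.0], 2))
```

Wrapping the heading difference matters near the +/-pi boundary, where a naive subtraction would report a large error for two nearly identical headings.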

Required A outputs:

experiments/<method>/checkpoint/paper.pt
experiments/<method>/checkpoint/paper_step_*.pt
experiments/<method>/result/parameter_count.json
experiments/<method>/result/paper_training.json
experiments/reports/paper_prediction.json
experiments/reports/paper_flowmo_latent_probes.json
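
A small script can confirm that every required output exists before results are reported. The sketch below uses only the paths listed above and the four method names from the comparison table; the function name `missing_outputs` is hypothetical, and the script checks presence only, not file contents.

```python
from pathlib import Path

METHODS = ("flowmo", "leworldmodel", "planet", "tdmpc2")

# Paths relative to experiments/<method>/; the step pattern is a wildcard.
PER_METHOD_PATTERNS = (
    "checkpoint/paper.pt",
    "checkpoint/paper_step_*.pt",
    "result/parameter_count.json",
    "result/paper_training.json",
)
# Paths relative to experiments/.
REPORT_FILES = (
    "reports/paper_prediction.json",
    "reports/paper_flowmo_latent_probes.json",
)


def missing_outputs(root="experiments"):
    """Return the required part A outputs that are absent under root."""
    root = Path(root)
    missing = []
    for method in METHODS:
        for pattern in PER_METHOD_PATTERNS:
            # glob covers both the literal paths and the paper_step_* wildcard.
            if not any((root / method).glob(pattern)):
                missing.append(str(root / method / pattern))
    for rel in REPORT_FILES:
        if not (root / rel).is_file():
            missing.append(str(root / rel))
    return missing


for path in missing_outputs():
    print("missing:", path)
```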

Core A conclusions:

1. Whether FlowMo has lower long-horizon rollout error than the baseline WMs.
2. Whether the gain holds across the full paper flow-family set.
3. Whether explicit drift context helps beyond ordinary recurrent history.
4. Whether the same architecture works for both twin and triangle boats.
5. Whether frozen linear probes recover object momentum from `z_t` and ambient drift from `c_t`.
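
Conclusion 3 rests on the context ablation listed under the primary A metrics: the same rollouts are scored with the inferred context, a zeroed context, and contexts permuted across episodes. The sketch below shows one possible shape for that comparison. It is an assumption-laden illustration, not the project's evaluation code: `error_fn` is a hypothetical callable standing in for a FlowMo rollout, and the toy usage replaces the model with a one-dimensional drift value.

```python
import random


def context_ablation(error_fn, contexts, seed=0):
    """Mean error under inferred, zeroed, and shuffled contexts.

    error_fn(i, c) is a hypothetical callable returning the rollout error of
    episode i when the model is conditioned on context vector c.
    """
    n = len(contexts)
    zero = [0.0] * len(contexts[0])
    # Random permutation across episodes; some episodes may keep their own context.
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    variants = {
        "inferred": list(contexts),
        "zero": [zero] * n,
        "shuffled": [contexts[j] for j in perm],
    }
    return {
        name: sum(error_fn(i, c) for i, c in enumerate(cs)) / n
        for name, cs in variants.items()
    }


# Toy stand-in: each episode has a scalar drift, and the "model error" is the
# gap between the true drift and the drift implied by the supplied context.
toy_contexts = [[1.0], [-2.0], [3.0]]
result = context_ablation(lambda i, c: abs(toy_contexts[i][0] - c[0]), toy_contexts)
print(result)
```

If the inferred context carries flow information, the inferred error should sit below both ablations; similar errors across all three variants would suggest the model ignores `c_t`.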

B. Traditional Non-WM Controllers

Purpose: evaluate downstream control behavior and provide reference points from classical, non-neural controllers. These methods do not train a world model.

Shared setup:

Input: clean top-down images, converted to boat pose for the classical controllers
Tasks: same simulator, same boats, same goals, same flow settings
Metrics: success, final distance, successful-episode trajectory length, successful-episode thrust energy, successful-episode time-to-goal
Planning precision: FP32
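
Three of the metrics above are conditioned on success, which affects how episodes are aggregated. The sketch below shows one way to do this; the `Episode` record and its field names are hypothetical, and returning NaN when no episode succeeds is an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-episode record for part B."""
    success: bool
    final_distance: float
    path_length: float
    thrust_energy: float
    time_to_goal: float


def summarize(episodes):
    """Aggregate part B metrics; length, energy, and time use successes only."""
    wins = [e for e in episodes if e.success]

    def mean(values):
        values = list(values)
        return sum(values) / len(values) if values else float("nan")

    return {
        "success": len(wins) / len(episodes),
        "final_distance": mean(e.final_distance for e in episodes),
        "traj_length_success": mean(e.path_length for e in wins),
        "thrust_energy_success": mean(e.thrust_energy for e in wins),
        "time_to_goal_success": mean(e.time_to_goal for e in wins),
    }


summary = summarize([
    Episode(True, 0.1, 10.0, 5.0, 20.0),
    Episode(False, 3.0, 50.0, 40.0, 60.0),  # failure: excluded from the last three
    Episode(True, 0.2, 12.0, 7.0, 24.0),
])
print(summary)
```

Restricting length, energy, and time to successful episodes keeps failed runs from inflating those averages, so they should always be read alongside the success rate.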

Compared methods:

Method	Purpose
`pid_los_controller`	Classical line-of-sight (LOS) waypoint tracking. Tests a simple hand-designed controller.
`no_flow_los_controller`	No-flow LOS controller without external-current compensation. Tests how much hidden flow hurts a nominal dynamics controller.
`current_estimator_los_controller`	LOS controller with recent-drift current estimation. Tests a strong classical current-compensation baseline.
`oracle_flow_los_controller`	LOS controller with simulator true local flow feed-forward. Tests whether local flow feed-forward helps a simple geometric controller; it is not a full dynamics-MPC upper bound.
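
The difference between the LOS variants in the table can be illustrated with a standard lookahead-based LOS guidance law plus a crab-angle correction for current. This is a sketch under strong assumptions, not the controllers used in the experiments: all function names are hypothetical, the boat is treated as moving at constant water-relative speed along its heading (momentum and actuator delay are ignored), and the drift estimator is a plain average over recent steps.

```python
import math


def los_course(pos, wp_from, wp_to, lookahead):
    """Lookahead-based LOS: desired course toward a point ahead on the path."""
    path_angle = math.atan2(wp_to[1] - wp_from[1], wp_to[0] - wp_from[0])
    dx, dy = pos[0] - wp_from[0], pos[1] - wp_from[1]
    # Signed cross-track error (positive when the boat is left of the path).
    cross_track = -dx * math.sin(path_angle) + dy * math.cos(path_angle)
    return path_angle + math.atan2(-cross_track, lookahead)


def compensate_current(course, speed, current):
    """Heading whose water-relative velocity plus current points along course."""
    # Current component perpendicular to the desired course.
    c_perp = -current[0] * math.sin(course) + current[1] * math.cos(course)
    ratio = max(-1.0, min(1.0, c_perp / speed))  # clamp if current exceeds speed
    return course - math.asin(ratio)


def estimate_current(positions, headings, speed, dt):
    """Mean of (observed ground velocity - assumed water-relative velocity)."""
    n = len(positions) - 1
    cx = cy = 0.0
    for (x0, y0), (x1, y1), psi in zip(positions, positions[1:], headings):
        cx += (x1 - x0) / dt - speed * math.cos(psi)
        cy += (y1 - y0) / dt - speed * math.sin(psi)
    return cx / n, cy / n


# Boat commanded east at speed 1 while drifting north by 0.5 per step.
current_hat = estimate_current([(0, 0), (1, 0.5), (2, 1.0)], [0.0, 0.0], 1.0, 1.0)
course = los_course((2, 0.0), (0, 0), (10, 0), lookahead=2.0)
heading = compensate_current(course, 1.0, current_hat)
print(current_hat, course, heading)
```

In these terms, a no-flow controller would steer along `course` directly, a current-estimator controller would pass an estimate such as `current_hat` to the compensation step, and an oracle-flow controller would pass the simulator's true local flow instead.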

Planning tasks:

reach_target
station_keeping
waypoint_square
waypoint_zigzag

Boats:

twin
triangle

Flow families:

noflow
uniform
vortex_center
double_gyre
source_sink
source_sink_pair
gradient
shear
turbulent_patch
random_fourier
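
The controllers, tasks, boats, and flow families listed above define the part B evaluation grid. The sketch below enumerates it with `itertools.product`; the variable names are hypothetical, and the number of episodes or seeds per cell is not specified in this plan, so it is left out.

```python
from itertools import product

CONTROLLERS = (
    "pid_los_controller", "no_flow_los_controller",
    "current_estimator_los_controller", "oracle_flow_los_controller",
)
TASKS = ("reach_target", "station_keeping", "waypoint_square", "waypoint_zigzag")
BOATS = ("twin", "triangle")
FLOWS = (
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
)

# One cell per (controller, task, boat, flow family) combination.
grid = list(product(CONTROLLERS, TASKS, BOATS, FLOWS))
print(len(grid))  # 4 * 4 * 2 * 10 = 320
```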

Required B outputs:

experiments/reports/paper_planning/*.json
experiments/reports/paper_planning/gifs/*.gif

Core B conclusions:

1. Whether WM-based planning is competitive with classical non-WM control.
2. Whether FlowMo improves success, final distance, energy, and path length versus other learned WMs.
3. How far FlowMo remains from the oracle-flow classical reference.