FlowMo Experiment Protocol

This document is the single paper-facing record for the public FlowMo experiments. Regenerated artifacts should overwrite the existing paths rather than introduce version-suffixed copies.

Scope

The project has two formal comparison groups.

A. Learned World Models

Purpose: compare image-input world-model architectures under the same simulator data, optimizer budget, rollout target, parameter scale, and planning interface.

| Directory | Report Name | Architecture | Comparison Purpose |
| --- | --- | --- | --- |
| `flowmo` | FlowMo | Shared image encoder; short object-motion state encoder; long strided ambient-drift context encoder; base transition plus zero-context residual | Proposed flow-momentum factorization. Tests whether separating endogenous motion from exogenous drift improves prediction and planning. |
| `leworldmodel` | LeWorldModel | JEPA-style image-latent predictor with action-conditioned residual transition | Tests whether simple current-image latent prediction is sufficient. |
| `planet` | RSSM | Recurrent state-space latent model with deterministic memory and stochastic latent state | Tests whether generic recurrent memory can absorb momentum and drift without a separate context factor. |
| `tdmpc2` | TD-MPC2 Dynamics | Compact action-conditioned latent dynamics with shared image encoder and rollout heads | Tests task-oriented latent dynamics under equal supervision. |
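
One way to read "base transition plus zero-context residual" in the FlowMo row is a transition whose context-dependent term vanishes exactly when the context is zero. The sketch below is an illustrative construction under that assumption, not the repository's model: `make_transition`, the random weights, and the `tanh` layers are all hypothetical.

```python
import numpy as np

def make_transition(dim_z, dim_a, dim_c, seed=0):
    """Return (step, base) for a toy latent transition (hypothetical).

    base(z, a) ignores the context; step(z, a, c) adds a residual that is
    exactly zero when c is the zero vector.
    """
    rng = np.random.default_rng(seed)
    w_base = rng.normal(scale=0.1, size=(dim_z + dim_a, dim_z))
    w_res = rng.normal(scale=0.1, size=(dim_z + dim_a + dim_c, dim_z))

    def base(z, a):
        return z + np.tanh(np.concatenate([z, a]) @ w_base)

    def residual(z, a, c):
        return np.tanh(np.concatenate([z, a, c]) @ w_res)

    def step(z, a, c):
        # Subtracting the c = 0 response removes any context-free part of
        # the residual, so step(z, a, 0) == base(z, a).
        return base(z, a) + residual(z, a, c) - residual(z, a, np.zeros_like(c))

    return step, base
```

With this construction, zeroing the context recovers the base transition, which is the behavior the zero-context diagnostic later in this document looks for.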

All learned methods receive clean top-down RGB boat images and action history. They do not receive flow labels, flow arrows, velocity vectors, trajectory overlays, or goal markers in the image.

B. Traditional Non-WM Controllers

Purpose: compare downstream behavior against non-neural controllers that do not train a world model.

| Directory | Report Name | Input | Comparison Purpose |
| --- | --- | --- | --- |
| `pid_los_controller` | PID/LOS | Clean image pose estimate | Simple hand-designed waypoint tracking baseline. |
| `no_flow_los_controller` | No-Flow LOS Controller | Clean image pose estimate | Measures the cost of ignoring ambient current. |
| `current_estimator_los_controller` | Current-Estimator LOS Controller | Clean image pose estimate and recent drift | Strong classical current-compensation baseline. |
| `oracle_flow_los_controller` | Oracle-Flow LOS Controller | Clean image pose estimate and simulator local flow | True-local-flow feed-forward reference for a simple geometric controller, not a full dynamics-MPC upper bound. |
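
The difference between ignoring and compensating for the current can be illustrated with a crab-angle correction on top of line-of-sight (LOS) guidance. This is a generic textbook-style sketch, not the repository's controllers; `los_heading`, `boat_speed`, and the `current` argument are hypothetical names.

```python
import math

def los_heading(pos, waypoint, current=(0.0, 0.0), boat_speed=1.0):
    """Return a heading (radians) that steers from pos toward waypoint.

    With current=(0, 0) this is plain LOS guidance. With a nonzero current
    estimate, the heading is offset by a crab angle so that the through-water
    velocity plus the current points along the line of sight.
    """
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    course = math.atan2(dy, dx)  # desired course over ground
    # Current component perpendicular to the desired course (left positive).
    cross = -current[0] * math.sin(course) + current[1] * math.cos(course)
    # Clamp: a cross-current stronger than the boat speed cannot be cancelled.
    ratio = max(-1.0, min(1.0, -cross / boat_speed))
    return course + math.asin(ratio)
```

In this picture, a no-flow controller corresponds to leaving `current` at zero, a current-estimator controller passes an estimate derived from recent drift, and an oracle controller passes the simulator's local flow.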

Data

All methods use the same splits:

train: data/paper/train.npz
test: data/paper/test.npz
dataset_card: data/paper/dataset_card.md
generation_config: data/paper/generation_config.json
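
The array names stored in the splits are not listed here, so a quick inspection helper can be useful before writing a loader. A minimal sketch, assuming the `.npz` files are standard NumPy archives; `describe_npz` is a hypothetical helper, not part of the repository.

```python
import numpy as np

def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

For example, `describe_npz("data/paper/train.npz")` would list the arrays in the training split without assuming their names in advance.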

The train split, test split, and final planning evaluation use the same paper flow-family set: noflow, uniform, vortex_center, double_gyre, source_sink, source_sink_pair, gradient, shear, turbulent_patch, and random_fourier.

All paper flow fields are static. Localized structures are sampled near common task routes and waypoint corridors so that non-uniform flow is encountered by the boat during both training trajectories and final planning tasks.

Observation protocol:

image_size: 160 x 160
visual_scale: 2.5
rendering: online clean top-down RGB images
forbidden cues: flow arrows, velocity vectors, trajectory overlays, goal markers

Training budget:

train_episodes: 2400
test_episodes: 480
train_windows: 393216
test_windows: 24576
batch_size: 256
steps: 20000
checkpoint_interval: 2000
num_workers: 4
render_mode: device
training_parallel_jobs: 2
planning_parallel_jobs: 3
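
The budget values above imply a few derived quantities that are handy when comparing runs. The arithmetic below uses only the listed numbers; the checkpoint steps assume a save at every multiple of `checkpoint_interval` up to the final step, which is not stated explicitly.

```python
train_windows = 393_216
test_windows = 24_576
batch_size = 256
steps = 20_000
checkpoint_interval = 2_000

samples_seen = batch_size * steps                 # 5,120,000 windows drawn
passes_over_train = samples_seen / train_windows  # about 13.02 passes
batches_per_pass = train_windows // batch_size    # 1,536 (divides exactly)
test_batches = test_windows // batch_size         # 96 (divides exactly)

# Assumed schedule: one checkpoint at each multiple of the interval.
checkpoint_steps = list(range(checkpoint_interval, steps + 1, checkpoint_interval))
```

Under that assumption each method produces ten `paper_step_*.pt` files, and every method sees roughly thirteen passes over the training windows.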

Precision policy:

training: bf16 model autocast, fp32 losses and metrics
prediction_eval: bf16 model autocast, fp32 metrics
planning_eval: fp32

The precision split is intentional: BF16 speeds up image encoding and latent rollout on the RTX 5090 without measurable short-run loss drift, while CEM planning is dominated by small control tensors and gained no speedup under BF16.
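
The training-time policy (BF16 model autocast, FP32 losses) can be sketched as follows, assuming the models are PyTorch modules. `training_step` and the MSE loss are placeholders for illustration, not the repository's training loop.

```python
import torch
from torch import nn

def training_step(model, images, targets, device_type="cuda"):
    """Run the forward pass in bf16 autocast and the loss in fp32."""
    # Inside autocast, eligible ops such as linear layers run in bfloat16;
    # the model parameters themselves stay in fp32.
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        pred = model(images)
    # The loss is computed outside autocast on explicitly upcast tensors.
    loss = nn.functional.mse_loss(pred.float(), targets.float())
    return pred, loss
```

Planning evaluation would skip the `torch.autocast` context entirely and run in FP32.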

Prediction Evaluation

Dataset:

test

Metrics:

pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
heading@20, heading@60
zero-action drift prediction error
no-flow momentum decay prediction error
same-action different-flow prediction error
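
The horizon metrics can be read as errors measured a fixed number of steps into an open-loop rollout. The sketch below assumes that `pos@k` is the Euclidean position error at rollout step `k` and that `heading@k` is the wrapped absolute heading error at step `k`; the function names and the 1-indexed horizon are assumptions.

```python
import math

def pos_error_at(pred_xy, true_xy, k):
    """Euclidean position error at rollout step k (1-indexed horizon)."""
    (px, py), (tx, ty) = pred_xy[k - 1], true_xy[k - 1]
    return math.hypot(px - tx, py - ty)

def heading_error_at(pred_heading, true_heading, k):
    """Absolute heading error at step k, wrapped into [0, pi]."""
    diff = pred_heading[k - 1] - true_heading[k - 1]
    # atan2(sin, cos) maps any angle difference into (-pi, pi].
    return abs(math.atan2(math.sin(diff), math.cos(diff)))
```

Wrapping matters for headings near the branch cut: a prediction of 3.1 rad against a target of -3.1 rad is a small error, not an error of 6.2 rad.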

FlowMo-only context diagnostics:

| Diagnostic | Operation | Evidence Sought |
| --- | --- | --- |
| Inferred context | Normal rollout with inferred `c_t` | Best prediction under flow. |
| Zero context | Set `c_t=0` | Degraded flow prediction and limited change in no-flow. |
| Shuffled context | Use context from another episode | Worse rollout when hidden flow differs. |
| Same-flow transfer | Use context from another episode with the same hidden flow | Better transfer than wrong-flow context. |
| Context norm | Compare no-flow and flow | |
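
The context interventions in the table above amount to swapping the context vector before rollout. A minimal sketch of how the variants could be assembled for one episode; `context_variants`, the list-of-floats representation, and the flow identifiers are hypothetical.

```python
import random

def context_variants(contexts, flow_ids, index, rng=None):
    """Build the intervention contexts for the episode at `index`.

    contexts: one context vector per episode (lists of floats).
    flow_ids: hidden-flow identifier per episode, same length as contexts.
    An entry is None when no suitable donor episode exists.
    """
    rng = rng or random.Random(0)
    others = [i for i in range(len(contexts)) if i != index]
    same = [i for i in others if flow_ids[i] == flow_ids[index]]

    def pick(pool):
        return contexts[rng.choice(pool)] if pool else None

    return {
        "inferred": contexts[index],
        "zero": [0.0] * len(contexts[index]),
        "shuffled": pick(others),    # any other episode
        "same_flow": pick(same),     # another episode with the same flow
    }
```

Each variant would then be fed to the same rollout and scored with the same prediction metrics.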

FlowMo latent probes:

Train linear probes on frozen z_t, c_t, and [z_t, c_t].
Targets: object momentum (vx, vy, omega), local flow vector, episode drift vector.
Purpose: verify which latent carries object-motion information and which latent carries ambient-drift information.
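
A linear probe of this kind can be as simple as a least-squares fit from latent to target, scored on held-out data. The sketch below is one possible setup, not the repository's probe code; `linear_probe_r2` and the choice of R² as the score are assumptions.

```python
import numpy as np

def linear_probe_r2(train_x, train_y, test_x, test_y):
    """Fit a least-squares linear probe with bias; return per-target test R^2."""
    def with_bias(x):
        return np.hstack([x, np.ones((len(x), 1))])

    weights, *_ = np.linalg.lstsq(with_bias(train_x), train_y, rcond=None)
    pred = with_bias(test_x) @ weights
    ss_res = ((test_y - pred) ** 2).sum(axis=0)
    ss_tot = ((test_y - test_y.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot
```

Running the same probe on `z_t`, on `c_t`, and on their concatenation (`np.hstack([z, c])`) for each target shows which latent linearly encodes which quantity.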

Planning Evaluation

Learned WM planners:

flowmo
leworldmodel
planet
tdmpc2

All learned world models use the same route-aware CEM planner over their latent rollouts.
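
For readers unfamiliar with CEM, the core loop is short. The sketch below is a generic cross-entropy-method planner over action sequences, not the repository's route-aware implementation; `cem_plan`, its hyperparameters, and the `[-1, 1]` action bounds are assumptions, and `rollout_cost` stands in for a latent rollout scored against the route.

```python
import numpy as np

def cem_plan(rollout_cost, horizon, action_dim, iters=5, pop=64, elite=8, seed=0):
    """Cross-entropy method over action sequences.

    rollout_cost(actions) takes an array of shape (pop, horizon, action_dim)
    and returns one cost per candidate. Returns the mean action sequence of
    the final elite set, shape (horizon, action_dim).
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        noise = rng.standard_normal((pop, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)  # assumed bounds
        costs = rollout_cost(samples)
        elites = samples[np.argsort(costs)[:elite]]       # lowest-cost candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```

Because only `rollout_cost` depends on the world model, the same planner can be shared across all four learned methods, which is what keeps the planning comparison about the models rather than the optimizer.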

Traditional non-WM controllers:

pid_los_controller
no_flow_los_controller
current_estimator_los_controller
oracle_flow_los_controller

Tasks:

reach_target
station_keeping
waypoint_square
waypoint_zigzag

Boats:

twin
triangle

Flow families:

noflow
uniform
vortex_center
double_gyre
source_sink
source_sink_pair
gradient
shear
turbulent_patch
random_fourier
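
Assuming every method is evaluated on the full task × boat × flow grid (the number of episodes per cell is not stated here), the evaluation conditions can be enumerated directly from the lists above.

```python
from itertools import product

METHODS = [
    "flowmo", "leworldmodel", "planet", "tdmpc2",
    "pid_los_controller", "no_flow_los_controller",
    "current_estimator_los_controller", "oracle_flow_los_controller",
]
TASKS = ["reach_target", "station_keeping", "waypoint_square", "waypoint_zigzag"]
BOATS = ["twin", "triangle"]
FLOWS = [
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
]

conditions = list(product(TASKS, BOATS, FLOWS))  # 4 * 2 * 10 = 80 conditions
cells = list(product(METHODS, conditions))       # 8 * 80 = 640 method-condition cells
```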

Metrics:

success rate
final distance
trajectory length over successful episodes
control effort (`sum_t ||a_t||_2^2`) over successful episodes
time to goal over successful episodes
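
The metrics above mix two denominators: success rate and final distance cover all episodes, while the remaining three are averaged over successful episodes only. A minimal aggregation sketch; `summarize` and the per-episode dictionary keys are hypothetical.

```python
from statistics import mean

def summarize(episodes):
    """Aggregate planning metrics from a list of per-episode dicts.

    Assumed keys: success (bool), final_distance, trajectory_length,
    actions (list of action vectors), steps (time to goal in steps).
    """
    wins = [e for e in episodes if e["success"]]

    def effort(e):
        # sum_t ||a_t||_2^2 over the episode's actions
        return sum(sum(x * x for x in a) for a in e["actions"])

    def over_wins(fn):
        return mean(fn(e) for e in wins) if wins else None

    return {
        "success_rate": len(wins) / len(episodes),
        "final_distance": mean(e["final_distance"] for e in episodes),
        "trajectory_length": over_wins(lambda e: e["trajectory_length"]),
        "control_effort": over_wins(effort),
        "time_to_goal": over_wins(lambda e: e["steps"]),
    }
```

Because the success-conditioned metrics are computed on different episode subsets for each method, they are best read alongside the success rate rather than on their own.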

Required Outputs

Training outputs:

experiments/<method>/checkpoint/paper.pt
experiments/<method>/checkpoint/paper_step_*.pt
experiments/<method>/result/parameter_count.json
experiments/<method>/result/paper_training.json
experiments/<method>/result/paper_training_trace.jsonl

Evaluation outputs:

experiments/reports/paper_prediction.json
experiments/reports/paper_flowmo_latent_probes.json
experiments/reports/paper_planning/*.json
experiments/reports/paper_planning/gifs/*.gif
experiments/reports/paper_report.md
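
A small check can confirm that a run produced every required output. The sketch below assumes `<method>` ranges over the four learned world models (the controllers are not trained) and that paths are relative to the repository root; `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

LEARNED_METHODS = ["flowmo", "leworldmodel", "planet", "tdmpc2"]
PER_METHOD = [
    "checkpoint/paper.pt",
    "checkpoint/paper_step_*.pt",
    "result/parameter_count.json",
    "result/paper_training.json",
    "result/paper_training_trace.jsonl",
]
REPORTS = [
    "paper_prediction.json",
    "paper_flowmo_latent_probes.json",
    "paper_planning/*.json",
    "paper_planning/gifs/*.gif",
    "paper_report.md",
]

def missing_outputs(root="."):
    """Return the required path patterns that match no file under root."""
    root = Path(root)
    patterns = [f"experiments/{m}/{p}" for m in LEARNED_METHODS for p in PER_METHOD]
    patterns += [f"experiments/reports/{p}" for p in REPORTS]
    return [p for p in patterns if not any(root.glob(p))]
```

An empty list from `missing_outputs()` means every pattern matched at least one file; it does not validate the file contents.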

Commands

Run the complete paper pipeline:

python -m experiments.run_paper_image_pipeline

Run stages separately:

python -m experiments.run_paper_image_pipeline --stages train
python -m experiments.run_paper_image_pipeline --stages prediction
python -m experiments.run_paper_image_pipeline --stages probe
python -m experiments.run_paper_image_pipeline --stages planning
python -m experiments.run_paper_image_pipeline --stages report
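
When stages are run separately, a thin wrapper can stop at the first failure. This sketch only reuses the module name and `--stages` flag shown above, one stage per invocation; the stage order is taken from the listing, and `run_stages` and `stage_command` are hypothetical helpers.

```python
import subprocess
import sys

STAGES = ["train", "prediction", "probe", "planning", "report"]

def stage_command(stage):
    """Build the command line for a single pipeline stage."""
    return [sys.executable, "-m", "experiments.run_paper_image_pipeline",
            "--stages", stage]

def run_stages(stages=STAGES, runner=subprocess.run):
    """Run stages in order; check=True raises on the first non-zero exit."""
    for stage in stages:
        runner(stage_command(stage), check=True)
```

Calling `run_stages(["prediction", "probe"])` would rerun only the two evaluation stages against existing checkpoints.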

Run tests:

python -m pytest -q