
Paper Task Plan

This is the execution plan for the public FlowMo experiments. The plan has two parts: Part A evaluates learned world models directly on prediction, and Part B evaluates downstream control behavior against traditional non-WM reference controllers.

A. Learned World Models

Purpose: test whether the FlowMo world-model architecture improves image-based prediction under hidden flow, boat momentum, actuator delay, and drag.

Shared setup:

Input: clean top-down boat images plus action history
No image cues: no flow arrows, no velocity vector, no goal marker
Training data: data/paper/train.npz
Evaluation data: data/paper/test.npz
Flow families: noflow, uniform, vortex_center, double_gyre, source_sink, source_sink_pair, gradient, shear, turbulent_patch, random_fourier
All flow fields are static. Localized flow structures are sampled near common task routes so the boat encounters non-uniform current during rollout.
Training budget: shared optimizer, batch size, rollout horizon, step count, and checkpoint schedule
Training precision: BF16 model autocast, FP32 losses and metrics
Prediction precision: BF16 model autocast, FP32 metrics
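
To make "static flow field" concrete, the sketch below shows what three of the listed families could look like as time-independent functions of position. The formulas, parameter names, and default values are illustrative assumptions, not the simulator's actual definitions.

```python
def uniform_flow(x, y, u=0.3, v=0.0):
    """Same current everywhere (u, v are hypothetical defaults)."""
    return (u, v)


def shear_flow(x, y, rate=0.5):
    """x-directed current whose speed grows linearly with y."""
    return (rate * y, 0.0)


def vortex_flow(x, y, cx=0.0, cy=0.0, strength=1.0, core=0.5):
    """Counter-clockwise swirl around (cx, cy), softened near the center."""
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy + core * core  # softening avoids division by zero
    return (-strength * dy / r2, strength * dx / r2)
```

None of the functions takes a time argument, which is what makes the fields static: the current at a given position is the same at every rollout step.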

Compared methods:

| Method | Purpose |
| --- | --- |
| flowmo | Proposed flow-momentum WM. Tests explicit separation of short object-motion state and long ambient-drift context. |
| leworldmodel | JEPA-style latent predictor. Tests whether a simple current-image latent transition is sufficient. |
| planet | RSSM recurrent state-space WM. Tests whether generic recurrent latent memory can absorb momentum and flow effects without FlowMo's explicit context. |
| tdmpc2 | Compact latent-dynamics WM. Tests whether a task-oriented latent transition architecture matches FlowMo under the same rollout supervision. |

Primary A metrics:

pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
heading@20, heading@60
zero-action drift prediction error
no-flow momentum decay prediction error
same-action different-flow prediction error
FlowMo inferred-context vs c=0 vs shuffled-context error
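
As a rough illustration of the horizon metrics above, the following sketch computes a position error and a wrapped heading error at rollout step k. It assumes that `pos@k` means Euclidean distance at the k-th predicted step, that headings are in radians, and that step k is stored at index k - 1; the function names are hypothetical and the actual evaluation code may differ.

```python
import math


def pos_error_at(pred_xy, true_xy, k):
    """Euclidean position error at rollout step k (assumed 1-indexed)."""
    (px, py), (tx, ty) = pred_xy[k - 1], true_xy[k - 1]
    return math.hypot(px - tx, py - ty)


def heading_error_at(pred_heading, true_heading, k):
    """Absolute heading error in radians at step k, wrapped to [0, pi]."""
    diff = pred_heading[k - 1] - true_heading[k - 1]
    # atan2(sin, cos) maps any angle difference into (-pi, pi].
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


def mean_metric(metric, preds, trues, k):
    """Average a per-episode metric over a list of episodes."""
    return sum(metric(p, t, k) for p, t in zip(preds, trues)) / len(preds)
```

Wrapping matters for the heading metric: a prediction of 3.1 rad against a true heading of -3.1 rad is an error of roughly 0.08 rad, not 6.2 rad.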

Required A outputs:

experiments/<method>/checkpoint/paper.pt
experiments/<method>/checkpoint/paper_step_*.pt
experiments/<method>/result/parameter_count.json
experiments/<method>/result/paper_training.json
experiments/reports/paper_prediction.json
experiments/reports/paper_flowmo_latent_probes.json
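
A small completeness check can confirm that the outputs listed above exist before results are reported. The sketch below assumes that each `<method>` directory is named after the method identifiers in the comparison table; `missing_outputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Assumption: <method> directories match the method identifiers in the table.
METHODS = ["flowmo", "leworldmodel", "planet", "tdmpc2"]

PER_METHOD = [
    "experiments/{m}/checkpoint/paper.pt",
    "experiments/{m}/result/parameter_count.json",
    "experiments/{m}/result/paper_training.json",
]
SHARED = [
    "experiments/reports/paper_prediction.json",
    "experiments/reports/paper_flowmo_latent_probes.json",
]


def missing_outputs(root="."):
    """Return the required Part A outputs that are not present under root."""
    root = Path(root)
    expected = [p.format(m=m) for m in METHODS for p in PER_METHOD] + SHARED
    missing = [p for p in expected if not (root / p).is_file()]
    for m in METHODS:
        ckpt_dir = root / "experiments" / m / "checkpoint"
        # At least one intermediate checkpoint should match the step pattern.
        if not any(ckpt_dir.glob("paper_step_*.pt")):
            missing.append(f"experiments/{m}/checkpoint/paper_step_*.pt")
    return missing
```

Calling `missing_outputs()` from the repository root returns an empty list once every required file is in place.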

Core A conclusions:

1. Whether FlowMo has lower long-horizon rollout error.
2. Whether the gain holds across the full paper flow-family set.
3. Whether explicit drift context helps beyond ordinary recurrent history.
4. Whether the same architecture works for both twin and triangle boats.
5. Whether frozen linear probes recover object momentum from `z_t` and ambient drift from `c_t`.
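
Conclusion 3 rests on the inferred-context vs c=0 vs shuffled-context comparison. One way to build the three context settings is sketched below, assuming the context is available as one vector per episode and that "shuffled" means permuting contexts across episodes; both the representation and the helper names are assumptions.

```python
import random


def context_variants(contexts, seed=0):
    """Return the three context settings for the ablation.

    contexts: list of per-episode context vectors (lists of floats).
    """
    rng = random.Random(seed)
    order = list(range(len(contexts)))
    rng.shuffle(order)  # assign each episode another episode's context
    return {
        "inferred": [list(c) for c in contexts],
        "zero": [[0.0] * len(c) for c in contexts],
        "shuffled": [list(contexts[i]) for i in order],
    }


def ablation_gaps(errors):
    """Error increase of each ablation relative to the inferred context."""
    base = errors["inferred"]
    return {name: err - base for name, err in errors.items() if name != "inferred"}
```

If the explicit context carries real flow information, both gaps should be positive: zeroing removes the information, and shuffling supplies information from the wrong flow field.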

B. Traditional Non-WM Controllers

Purpose: evaluate downstream control behavior and provide classical, non-neural reference points against which WM-based planning is compared. These controllers do not train a world model.

Shared setup:

Input: clean top-down images converted to pose for classical control
Tasks: same simulator, same boats, same goals, same flow settings
Metrics: success, final distance, successful-episode trajectory length, successful-episode thrust energy, successful-episode time-to-goal
Planning precision: FP32
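
The metric list distinguishes quantities computed over all episodes from those computed over successful episodes only. A minimal aggregation sketch follows, assuming each episode has already been reduced to the five scalar values named above; the `Episode` record and `summarize` helper are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    success: bool
    final_distance: float
    trajectory_length: float
    thrust_energy: float
    time_to_goal: float


def summarize(episodes):
    """Aggregate control metrics; path, energy, and time use successes only."""
    wins = [e for e in episodes if e.success]

    def over_wins(field):
        # None signals "undefined" when no episode succeeded.
        return mean(getattr(e, field) for e in wins) if wins else None

    return {
        "success": len(wins) / len(episodes),
        "final_distance": mean(e.final_distance for e in episodes),
        "trajectory_length": over_wins("trajectory_length"),
        "thrust_energy": over_wins("thrust_energy"),
        "time_to_goal": over_wins("time_to_goal"),
    }
```

Restricting path length, energy, and time to successful episodes keeps a controller that fails quickly from looking efficient.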

Compared methods:

| Method | Purpose |
| --- | --- |
| pid_los_controller | Classical line-of-sight waypoint tracking. Tests a simple hand-designed controller. |
| no_flow_los_controller | No-flow LOS controller without external-current compensation. Tests how much hidden flow hurts a nominal dynamics controller. |
| current_estimator_los_controller | LOS controller with recent-drift current estimation. Tests a strong classical current-compensation baseline. |
| oracle_flow_los_controller | LOS controller with feed-forward of the simulator's true local flow. Tests whether local flow feed-forward helps a simple geometric controller; it is not a full dynamics-MPC upper bound. |
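
The controllers above differ mainly in what they assume about the current. The sketch below illustrates the general idea of line-of-sight guidance with current feed-forward; it is a simplified kinematic model with hypothetical function names, not the repository's controller code.

```python
import math


def los_heading(pos, waypoint):
    """Line-of-sight heading (radians) from the boat to the waypoint."""
    return math.atan2(waypoint[1] - pos[1], waypoint[0] - pos[0])


def estimate_current(prev_pos, pos, commanded_vel, dt):
    """Recent-drift estimate: observed velocity minus commanded velocity."""
    vx = (pos[0] - prev_pos[0]) / dt
    vy = (pos[1] - prev_pos[1]) / dt
    return (vx - commanded_vel[0], vy - commanded_vel[1])


def compensated_heading(pos, waypoint, current, speed):
    """Heading whose water-relative velocity plus current points at the waypoint.

    Falls back to plain LOS when the cross-track current is too strong to
    cancel at the given speed.
    """
    los = los_heading(pos, waypoint)
    # Current component perpendicular to the LOS line (positive = to the left).
    cross = -current[0] * math.sin(los) + current[1] * math.cos(los)
    ratio = -cross / speed
    if abs(ratio) >= 1.0:
        return los
    return los + math.asin(ratio)
```

Under this simplification, a no-flow controller would pass `current=(0.0, 0.0)`, a current-estimator controller would pass the output of `estimate_current`, and an oracle-flow controller would pass the simulator's true local flow.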

Planning tasks:

reach_target
station_keeping
waypoint_square
waypoint_zigzag

Boats:

twin
triangle

Flow families:

noflow
uniform
vortex_center
double_gyre
source_sink
source_sink_pair
gradient
shear
turbulent_patch
random_fourier
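
The controllers, tasks, boats, and flow families listed above define a full evaluation grid. The sketch below enumerates it, assuming every combination is run; the plan does not state whether some combinations are skipped, and `planning_grid` is a hypothetical helper.

```python
from itertools import product

CONTROLLERS = [
    "pid_los_controller",
    "no_flow_los_controller",
    "current_estimator_los_controller",
    "oracle_flow_los_controller",
]
TASKS = ["reach_target", "station_keeping", "waypoint_square", "waypoint_zigzag"]
BOATS = ["twin", "triangle"]
FLOWS = [
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
]


def planning_grid():
    """Enumerate every controller/task/boat/flow combination."""
    return [
        {"controller": c, "task": t, "boat": b, "flow": f}
        for c, t, b, f in product(CONTROLLERS, TASKS, BOATS, FLOWS)
    ]
```

With 4 controllers, 4 tasks, 2 boats, and 10 flow families, the full grid has 320 configurations.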

Required B outputs:

experiments/reports/paper_planning/*.json
experiments/reports/paper_planning/gifs/*.gif

Core B conclusions:

1. Whether WM-based planning is competitive with classical non-WM control.
2. Whether FlowMo improves success, final distance, energy, and path length versus other learned WMs.
3. How far FlowMo remains from the oracle-flow classical reference.